Why This Stack
The defining constraint was: fast iteration on prompts and model behavior, not fast infrastructure setup. Modal handles compute spikes without requiring a Kubernetes cluster — you write Python and it runs wherever it needs to run. Fly.io for the long-running service layer. Postgres for everything that needs to persist. The Claude API was chosen over OpenAI at the start of this project after a two-day eval; the structured output quality was the deciding factor for our extraction use case.
The Stack
| Tool | Role | Cost / Tier |
|---|---|---|
| Cursor | AI-assisted code editor (VS Code fork) | $20/mo Pro |
| Claude API (claude-sonnet-4-20250514) | LLM inference, primary model | Usage-based, ~$3-15/mo at our scale |
| Python 3.12 | Primary language for AI workloads | Free |
| Modal | Serverless GPU/CPU compute for ML workloads | Usage-based, ~$0.10/GPU-hour |
| Postgres (Fly.io managed) | Primary database | $0/mo free tier |
| Fly.io | Long-running service deployment | $5-20/mo |
The Prompt Iteration Loop
The actual day-to-day was: edit prompt in a Jupyter notebook, run against 20 test cases from a fixture file, evaluate outputs, iterate. The fixture-based evaluation setup was built on day two and paid off on every subsequent day. Build this first.
Modal made the "run this on 500 documents in parallel" problem trivial. We were processing batches that would have timed out on a single machine; Modal fanned them out transparently. The billing for occasional heavy compute was acceptable at our scale.
I'd add structured output validation earlier with Pydantic. We spent time debugging malformed model outputs that a schema would have caught immediately. Also: Modal's cold start time on GPU instances is real. For latency-sensitive inference, you need the keep-warm option, which costs money even at idle.