Latency Results
| Model | Median Tokens/s | P95 Latency (ms) | Time to First Token (ms) | Avg Response Length |
|---|---|---|---|---|
| claude-sonnet-4 | 82 | 3,200 | 310 | 187 tokens |
| gpt-4o | 74 | 3,800 | 380 | 201 tokens |
| gemini-2.0-flash | 71 | 2,900 | 420 | 174 tokens |
| deepseek-v3 | 68 | 4,100 | 510 | 211 tokens |

Internal testing · May 2026 · These are our measurements. Run your own.

Quality Results
| Model | Correctness (1-5) | Style Adherence (1-5) | Completeness (1-5) | Overall Avg |
|---|---|---|---|---|
| claude-sonnet-4 | 4.4 | 4.2 | 4.3 | 4.3 |
| gpt-4o | 4.2 | 4.0 | 4.1 | 4.1 |
| gemini-2.0-flash | 3.8 | 3.7 | 3.9 | 3.8 |
| deepseek-v3 | 3.9 | 3.6 | 4.0 | 3.8 |

Evaluators were blind to model identity. Scores are on a 1-5 scale, averaged across two evaluators.

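
The Overall Avg column can be reproduced from the three criterion columns. The sketch below assumes it is an unweighted mean rounded to one decimal place; the post does not state the formula, so that is an assumption, and the function name is illustrative.

```python
# Per-criterion scores (correctness, style adherence, completeness),
# copied from the Quality Results table.
scores = {
    "claude-sonnet-4": (4.4, 4.2, 4.3),
    "gpt-4o": (4.2, 4.0, 4.1),
    "gemini-2.0-flash": (3.8, 3.7, 3.9),
    "deepseek-v3": (3.9, 3.6, 4.0),
}


def overall_average(criteria):
    """Unweighted mean of the criterion scores, rounded to one decimal.

    Assumption: equal weighting of the three criteria.
    """
    return round(sum(criteria) / len(criteria), 1)


overall = {model: overall_average(c) for model, c in scores.items()}

for model, avg in overall.items():
    print(f"{model}: {avg}")
```

Under that assumption the computed values match the published column for all four models. If your team cares more about one criterion, swapping in a weighted mean is a one-line change.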
What We Found
The correlation between tokens/second and quality score was near zero across our test set. The fastest models were not the worst ones; the slowest was not the best. Speed and quality appear to be roughly independent in the current generation of frontier models, which means optimizing for speed doesn't necessarily mean accepting quality loss.
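
The near-zero figure comes from the test set itself, which is not reproduced in this post. As a sketch of how one might run the same kind of check, the snippet below computes a Pearson coefficient over the four model-level rows of the two tables. Four aggregate points make for a fragile coefficient that can differ sharply from a per-prompt one. On these averages it comes out strongly positive, because the model with the highest tokens/s also has the highest overall score. Treat this as a template to feed with your own per-prompt measurements, not as a re-derivation of the result above.

```python
from math import sqrt

# Model-level aggregates copied from the tables, in row order:
# claude-sonnet-4, gpt-4o, gemini-2.0-flash, deepseek-v3.
tokens_per_s = [82, 74, 71, 68]
overall_quality = [4.3, 4.1, 3.8, 3.8]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(tokens_per_s, overall_quality)
print(f"model-level r = {r:.2f}")
```

To test independence properly, replace the two lists with one (tokens/s, score) pair per prompt and model.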
Gemini's latency advantage (the best P95, at 2,900 ms) is real, but so is the quality gap (3.8 overall, against 4.3 for the top scorer). Whether that trade-off makes sense depends entirely on the use case. For background completions where latency matters most, Gemini may be the right choice. For complex refactoring where quality matters more, the numbers suggest otherwise.
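
One way to make that trade-off explicit is to blend latency and quality with a weight that reflects the use case. The sketch below is illustrative only: the min-max normalization, the `latency_weight` parameter, and the `rank` function are assumptions introduced here, not part of the original evaluation.

```python
# (P95 latency in ms, overall quality score) copied from the tables.
models = {
    "claude-sonnet-4": (3200, 4.3),
    "gpt-4o": (3800, 4.1),
    "gemini-2.0-flash": (2900, 3.8),
    "deepseek-v3": (4100, 3.8),
}


def rank(latency_weight):
    """Rank models by a weighted blend of normalized latency and quality.

    latency_weight is in [0, 1]: 1.0 means only P95 latency matters,
    0.0 means only the quality score matters (hypothetical knob).
    """
    p95s = [p for p, _ in models.values()]
    quals = [q for _, q in models.values()]
    lo_p, hi_p = min(p95s), max(p95s)
    lo_q, hi_q = min(quals), max(quals)

    scored = {}
    for name, (p95, quality) in models.items():
        speed_score = (hi_p - p95) / (hi_p - lo_p)        # 1.0 = lowest P95
        quality_score = (quality - lo_q) / (hi_q - lo_q)  # 1.0 = highest score
        scored[name] = (
            latency_weight * speed_score
            + (1 - latency_weight) * quality_score
        )
    return sorted(scored, key=scored.get, reverse=True)


print(rank(0.9))  # latency-dominated, e.g. background completions
print(rank(0.2))  # quality-dominated, e.g. complex refactoring
```

With this particular normalization, a latency weight of 0.9 puts gemini-2.0-flash first and a weight of 0.2 puts claude-sonnet-4 first. A different normalization or a different latency metric (time to first token, say) would move the crossover point.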
DeepSeek v3's quality scores were competitive with GPT-4o on correctness (3.9 vs 4.2) and completeness (4.0 vs 4.1) but weaker on adherence to the style of existing code (3.6 vs 4.0): it tended to introduce its own idioms. For greenfield work this may not matter; for completions that need to match an existing codebase's style, the gap was noticeable.
Caveats
120 prompts is a small sample. TypeScript only. One evaluator pair. The "quality" scores are subjective and reflect our team's preferences. Models are updated continuously; these results may not hold at the time you're reading this. These are our measurements. Run your own.
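
For anyone taking "run your own" literally, here is a minimal sketch of a timing harness. It assumes you can wrap your provider's streaming response in a plain Python iterator that yields one token (or chunk) at a time. `measure_stream`, `percentile`, and `fake_stream` are hypothetical helpers written for this example, and the post does not say exactly how its tokens/s and P95 figures were computed, so the definitions here (throughput over the whole request, nearest-rank percentile) are assumptions.

```python
import time


def measure_stream(stream):
    """Time one streamed response.

    Returns token count, time to first token (ms), total time (ms),
    and tokens per second measured over the whole request.
    """
    start = time.perf_counter()
    first_token_at = None
    tokens = 0
    for _ in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        tokens += 1
    end = time.perf_counter()

    total_s = end - start
    return {
        "tokens": tokens,
        "ttft_ms": None if first_token_at is None
        else (first_token_at - start) * 1000,
        "total_ms": total_s * 1000,
        "tokens_per_s": tokens / total_s if total_s > 0 else 0.0,
    }


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def fake_stream(n_tokens, delay_s=0.001):
    """Stand-in for a real streaming API call (hypothetical)."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"


runs = [measure_stream(fake_stream(5)) for _ in range(20)]
p95_ms = percentile([r["total_ms"] for r in runs], 95)
print(f"P95 total latency over {len(runs)} runs: {p95_ms:.1f} ms")
```

Replace `fake_stream` with a wrapper around your own client, and run it against prompts drawn from your own codebase in place of a generic set.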