LLM Code Completion: Speed vs Quality — Benchmark Lab

Four models, 120 prompts, real latency numbers. How does token generation speed correlate with output quality for developer use cases? Not much — which is the interesting finding.

Latency Results

Model	Median Tokens/s	P95 Latency (ms)	Time to First Token (ms)	Avg Response Length (tokens)
claude-sonnet-4	82	3,200	310	187
gpt-4o	74	3,800	380	201
gemini-2.0-flash	71	2,900	420	174
deepseek-v3	68	4,100	510	211
Internal testing · May 2026 · These are our measurements. Run your own.
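
For anyone running their own numbers, here is a minimal sketch of how the summary columns above could be derived from raw per-request timings. The field names (`total_ms`, `ttft_ms`, `tokens`), the sample values, and the choice to define tokens/s as tokens divided by total response time are all assumptions made for illustration; they are not our harness or our data.

```python
import math
from statistics import median


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]


def summarize(samples):
    """Reduce per-request timings to the three latency columns.

    Assumption: tokens/s = tokens / total response time in seconds.
    """
    tokens_per_s = [s["tokens"] / (s["total_ms"] / 1000) for s in samples]
    return {
        "median_tokens_per_s": median(tokens_per_s),
        "p95_latency_ms": percentile_nearest_rank(
            [s["total_ms"] for s in samples], 95
        ),
        "median_ttft_ms": median(s["ttft_ms"] for s in samples),
    }


# Hypothetical timings for a single model, for illustration only.
samples = [
    {"total_ms": 2000, "ttft_ms": 300, "tokens": 160},
    {"total_ms": 2500, "ttft_ms": 350, "tokens": 200},
    {"total_ms": 3000, "ttft_ms": 400, "tokens": 210},
    {"total_ms": 4000, "ttft_ms": 320, "tokens": 240},
]
summary = summarize(samples)
print(summary)
```

With only a handful of samples the nearest-rank P95 is simply the slowest request, which is one reason tail latency figures need a reasonably large run to mean much.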

Quality Results

Model	Correctness (1-5)	Style Adherence (1-5)	Completeness (1-5)	Overall Avg
claude-sonnet-4	4.4	4.2	4.3	4.3
gpt-4o	4.2	4.0	4.1	4.1
gemini-2.0-flash	3.8	3.7	3.9	3.8
deepseek-v3	3.9	3.6	4.0	3.8
Evaluators were blind to model identity. Scores are on a 1-5 scale, averaged across two evaluators.
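
The Overall Avg column can be reproduced from the three criterion scores in the table. The sketch below assumes it is the unweighted mean of the three, rounded to one decimal, which matches the published values. The `combine_evaluators` helper is hypothetical and only illustrates the "two evaluators, averaged" step; per-evaluator scores are not listed here.

```python
from statistics import mean

# Criterion scores copied from the table: (correctness, style, completeness).
criterion_scores = {
    "claude-sonnet-4": (4.4, 4.2, 4.3),
    "gpt-4o": (4.2, 4.0, 4.1),
    "gemini-2.0-flash": (3.8, 3.7, 3.9),
    "deepseek-v3": (3.9, 3.6, 4.0),
}


def combine_evaluators(score_a, score_b):
    """Hypothetical helper: average two evaluators' 1-5 scores for one response."""
    for score in (score_a, score_b):
        if not 1 <= score <= 5:
            raise ValueError("scores must be on the 1-5 scale")
    return (score_a + score_b) / 2


# Assumption: overall = unweighted mean of the three criteria, one decimal.
overall = {
    model: round(mean(scores), 1) for model, scores in criterion_scores.items()
}
print(overall)
```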

What We Found

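Across our test set, the correlation between tokens/second and quality score was near zero. Faster did not mean worse: claude-sonnet-4 had both the highest median throughput (82 tokens/s) and the highest overall score (4.3), and the slowest model, deepseek-v3 (68 tokens/s), was not the best. With only four models, the per-model averages cannot establish a trend in either direction. Latency and quality appear to be roughly independent in the current generation of frontier models, which means optimizing for speed doesn't necessarily mean accepting quality loss.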
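
For readers who want to repeat the correlation check on their own runs, here is a minimal Pearson correlation sketch. The per-response (tokens/s, score) pairs behind the near-zero figure are not listed in this post, so the example input is the four per-model aggregates from the tables, which is a different and much coarser quantity. The function name `pearson` is illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Per-model aggregates from the tables (NOT the per-response data).
tokens_per_s = [82, 74, 71, 68]
overall_avg = [4.3, 4.1, 3.8, 3.8]

r_models = pearson(tokens_per_s, overall_avg)
print(f"model-level r over 4 points = {r_models:.2f}")
```

On these four aggregate points r comes out at roughly 0.95, driven largely by one model leading both columns. Four points say very little on their own; to reproduce the test-set figure, feed the function one (tokens/s, score) pair per response instead.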

Gemini's P95 latency advantage is real (2,900 ms, the best of the four), but so is the quality gap (3.8 overall versus 4.3 for claude-sonnet-4). Whether that trade-off makes sense depends entirely on the use case. For background completions where latency matters most, Gemini may be the right choice. For complex refactoring where quality matters more, the numbers favor the higher-scoring models.
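
One way to make "depends on the use case" concrete is to put an explicit weight on latency versus quality and see which model comes out ahead. The sketch below is a toy: the min-max normalization, the single `latency_weight` knob, and the function names are all assumptions for illustration, not a method we used in the benchmark. Only the P95 and overall scores come from the tables.

```python
# P95 latency and overall quality copied from the tables above.
MODELS = {
    "claude-sonnet-4": {"p95_ms": 3200, "quality": 4.3},
    "gpt-4o": {"p95_ms": 3800, "quality": 4.1},
    "gemini-2.0-flash": {"p95_ms": 2900, "quality": 3.8},
    "deepseek-v3": {"p95_ms": 4100, "quality": 3.8},
}


def min_max(value, lo, hi):
    """Scale value into [0, 1] relative to the observed range."""
    return (value - lo) / (hi - lo)


def pick_model(latency_weight):
    """Return the model with the best weighted score.

    latency_weight is a hypothetical knob in [0, 1]: 1.0 means only
    P95 latency matters, 0.0 means only quality matters.
    """
    if not 0 <= latency_weight <= 1:
        raise ValueError("latency_weight must be between 0 and 1")
    p95s = [m["p95_ms"] for m in MODELS.values()]
    quals = [m["quality"] for m in MODELS.values()]

    def score(m):
        # Lower latency is better, so invert the normalized value.
        latency_score = 1 - min_max(m["p95_ms"], min(p95s), max(p95s))
        quality_score = min_max(m["quality"], min(quals), max(quals))
        return latency_weight * latency_score + (1 - latency_weight) * quality_score

    return max(MODELS, key=lambda name: score(MODELS[name]))


background_choice = pick_model(0.9)   # latency-dominated workload
refactoring_choice = pick_model(0.2)  # quality-dominated workload
print(background_choice, refactoring_choice)
```

In this toy scoring the crossover sits at a latency weight of about 0.8, but that threshold is an artifact of the normalization chosen here and should not be read as a result.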

DeepSeek v3 was close to GPT-4o on correctness (3.9 vs 4.2) and completeness (4.0 vs 4.1) but weaker on style adherence (3.6 vs 4.0): it tended to introduce its own idioms rather than follow the existing code. For greenfield work this may not matter; for completions that need to match an existing codebase's style, the difference was noticeable.

Caveats

120 prompts is a small sample. TypeScript only. One evaluator pair. The "quality" scores are subjective and reflect our team's preferences. Models are updated continuously; these results may not hold at the time you're reading this. These are our measurements. Run your own.