On Local LLMs for Actual Work

We ran Ollama with four models for six weeks on two machines. The gap between 'it works' and 'it's useful' is still significant — but it's closing.

I have reasons to use local LLMs that go beyond privacy. I have client contracts with data handling clauses. I sometimes work offline. And honestly, after years of watching inference costs collapse, I'm curious whether the per-request economics of hosted models are still the obvious choice for developer tooling.

So we ran an experiment. Six weeks, four models, two machines, real workloads.

The Setup

Hardware: an M3 Pro MacBook Pro with 36GB unified memory and an AMD Ryzen 7 workstation with 64GB RAM and an RTX 4070. Ollama as the inference server — it's the least-friction option and the default choice for most people approaching this.

Models tested: llama3.2:8b, qwen2.5-coder:14b, mistral:7b, and deepseek-coder-v2:16b. Each was run at Q4_K_M quantization where a Q4_K_M build was available.

Workloads: code completion in Neovim via llm.nvim, ad hoc text processing from the terminal, and structured data extraction from documents.

What Actually Worked

Code completion was the biggest surprise. qwen2.5-coder:14b on the M3 Pro is genuinely useful. Not GPT-4o useful — the context window fills faster, multi-file awareness requires manual feeding, and it occasionally hallucinates library APIs. But for single-file editing, it's competent.

Speed is acceptable: roughly 25-40 tokens/second on the M3 Pro for the 14b model. That's below what feels instantaneous, but not slow enough to break flow if you treat it as an assistant rather than expecting real-time completion.
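For anyone who wants to reproduce a tokens/second figure on their own hardware, here is a minimal sketch. It assumes a non-streaming Ollama /api/generate response, which reports eval_count (tokens generated) and eval_duration (generation time in nanoseconds). The sample values are hypothetical, not numbers from our runs.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama /api/generate response body.

    Assumes the response carries `eval_count` (tokens generated) and
    `eval_duration` (nanoseconds spent generating).
    """
    eval_count = response["eval_count"]
    eval_duration_ns = response["eval_duration"]
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    # Convert nanoseconds to seconds before dividing.
    return eval_count / (eval_duration_ns / 1e9)


# Hypothetical response fragment for illustration only.
sample = {"eval_count": 320, "eval_duration": 10_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

This only measures generation speed. Prompt processing time is reported separately and matters more as the context fills up.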

Terminal text processing (summarizing logs, rewriting commit messages, extracting structured data from pasted text) was the most friction-free use case. Pipe text in, get text out. The models are well-suited to this and the task has no real latency sensitivity.
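The "pipe text in, get text out" pattern can be sketched in a few lines of standard-library Python. This is an illustration, not the exact tooling we used: the script name ask.py and the helper names are hypothetical, and it assumes Ollama is listening on its default local port.

```python
import json
import sys
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, instruction: str, text: str) -> urllib.request.Request:
    """Build a non-streaming generate request: instruction first, piped text after."""
    payload = {
        "model": model,
        "prompt": f"{instruction}\n\n{text}",
        "stream": False,  # ask for one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run(model: str, instruction: str, text: str) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(model, instruction, text)) as resp:
        return json.loads(resp.read())["response"]


# Only runs when invoked as: python ask.py MODEL INSTRUCTION < input.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    print(run(sys.argv[1], sys.argv[2], sys.stdin.read()))
```

Usage would look something like `git diff --staged | python ask.py mistral:7b "Write a one-line commit message for this diff"`. Keeping the request builder separate from the network call makes the prompt format easy to check without a server running.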

What Didn't Work

Complex multi-file reasoning. We tried it. The models lost context or made contradictory edits across files. Two things cause this: the limited context windows of these models, and the fact that we weren't indexing the codebase for retrieval (RAG). The second is a solvable problem, but one that requires infrastructure the average developer isn't running locally.

Anything requiring up-to-date knowledge. The quantized models we ran have training cutoffs that predate several libraries we use actively. They hallucinate recent APIs with complete confidence.

The Ryzen workstation with the 4070 was actually worse than the M3 Pro for the 14b and 16b models we tested, because the models didn't fully fit in the card's 12GB of VRAM and the CPU fallback was slow. The unified memory architecture of Apple Silicon is genuinely advantageous here, not just marketing.
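One way to check whether a loaded model actually fits in VRAM is to compare the sizes Ollama reports for it. The sketch below assumes an entry shaped like those returned by Ollama's /api/ps endpoint, with size (total bytes) and size_vram (bytes resident on the GPU). The helper names and the sample entry are hypothetical, and the byte counts are made up.

```python
def gpu_fraction(model_entry: dict) -> float:
    """Share of a loaded model's memory that is resident in VRAM (0.0 to 1.0)."""
    size = model_entry["size"]
    size_vram = model_entry.get("size_vram", 0)
    if size <= 0:
        raise ValueError("size must be positive")
    return size_vram / size


def describe(model_entry: dict) -> str:
    """One-line summary of where a loaded model is running."""
    frac = gpu_fraction(model_entry)
    placement = "fully on GPU" if frac >= 1.0 else f"{frac:.0%} on GPU, rest on CPU"
    return f"{model_entry['name']}: {placement}"


# Hypothetical entry for illustration; not a measurement from our machines.
entry = {"name": "example-model:14b", "size": 10_000_000_000, "size_vram": 7_000_000_000}
print(describe(entry))
```

Anything below 100% means part of the model is running on the CPU, which is the slow path described above.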

The Economics Argument

This is where it gets interesting. Hosted API costs for developer tooling have fallen significantly. At current rates, moderate daily use of a hosted model costs roughly $8-15/month depending on the model and usage pattern. Local inference costs: electricity (measurable but low) and the hardware (already owned).

The honest answer is that for most developers, hosted inference is cheaper than local inference if you factor in the setup time and the quality gap. The privacy argument is real. The economics argument is weaker than it looks.
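To make the setup-time point concrete, here is a rough break-even sketch. Only the $15/month hosted figure comes from the range above. The electricity cost, setup hours, and hourly rate are illustrative assumptions, so substitute your own numbers.

```python
from typing import Optional


def months_to_break_even(
    hosted_per_month: float,
    electricity_per_month: float,
    setup_hours: float,
    hourly_rate: float,
    hardware_cost: float = 0.0,  # 0 when the hardware is already owned
) -> Optional[float]:
    """Months until local inference pays back its upfront cost.

    Returns None when local inference never breaks even, i.e. when running
    it costs at least as much per month as the hosted alternative.
    """
    monthly_saving = hosted_per_month - electricity_per_month
    if monthly_saving <= 0:
        return None
    upfront = setup_hours * hourly_rate + hardware_cost
    return upfront / monthly_saving


# Assumed inputs: $15/month hosted (top of the range above), $3/month
# electricity, 8 hours of setup valued at $75/hour. All but the first are guesses.
print(months_to_break_even(15.0, 3.0, 8.0, 75.0))
```

Under those assumptions the payback period runs to several years, before accounting for the quality gap at all.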

Where We Land

Local LLMs are genuinely useful for: air-gapped work, document processing pipelines where you own the data, and offline coding assistance where the quality bar is "better than nothing." They are not a drop-in replacement for hosted frontier models on tasks that require broad knowledge or multi-file context.

This is a gap that will close. The trajectory is clear. But "closing" and "closed" are different, and as of May 2026, there's still a meaningful quality difference on the workloads we tested.