Running Local LLMs for Development: A Practical Guide
A deep dive into setting up local language models for code generation, review, and testing.
Cloud-based LLMs are powerful, but there are real reasons to run models locally: privacy, latency, cost, and the ability to work offline. I have been running local models for development tasks for a few months now, and I want to share what actually works.
Setting Up Ollama
Ollama is the easiest way to get started. Install it, pull a model, and you are running inference locally in under five minutes.
```shell
# Install on macOS
brew install ollama
# Start the server (it stays in the foreground, so run the pulls in a second terminal)
ollama serve
# Pull the models used in this guide
ollama pull codellama:13b
ollama pull deepseek-coder-v2:16b
ollama pull mistral:7b
```

The API is dead simple. The server listens on localhost:11434 and exposes a straightforward REST interface.
```shell
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-coder-v2:16b",
  "prompt": "Write a TypeScript function that debounces an input handler",
  "stream": false
}'
```

Model Comparison: What I Actually Use
I have tested several models for different tasks. Here is where I landed.
DeepSeek Coder V2 (16B) is my default for code generation. It handles TypeScript well, understands React patterns, and produces clean output. The 16B parameter version runs comfortably on an M-series Mac with 32GB RAM.
Mistral 7B is fast. Really fast. I use it for quick tasks like generating commit messages, writing JSDoc comments, or reformatting data. The quality is lower than larger models, but for simple transformations it does not matter.
CodeLlama 13B was my first pick, but I have mostly moved away from it. DeepSeek Coder V2 handles the same tasks with better output quality.
Phi-3 (3.8B) is surprisingly capable for its size. I use it on my laptop when I am traveling and want something that does not demolish battery life. It handles simple completions and explanations well enough.
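To make the "quick tasks" use case concrete, here is a minimal Python sketch that asks a local model for a commit message through the same `/api/generate` endpoint used in the curl example. The helper names (`build_payload`, `generate`, `suggest_commit_message`) and the prompt wording are my own illustrations, not part of Ollama, and the sketch assumes `ollama serve` is running on the default port.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_payload(model: str, prompt: str) -> bytes:
    """Encode a non-streaming /api/generate request body as JSON bytes."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(body).encode("utf-8")


def generate(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send one prompt to the local Ollama server and return the generated text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),  # passing data makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


def commit_prompt(diff: str) -> str:
    """Hypothetical prompt wording; adjust it to your own conventions."""
    return "Write a one-line commit message for this diff:\n\n" + diff


def suggest_commit_message(diff: str, model: str = "mistral:7b") -> str:
    """Ask a small, fast local model for a commit message."""
    return generate(model, commit_prompt(diff)).strip()
```

Feeding it the output of `git diff --staged` is enough for a first version. Swapping the `model` argument lets you compare how Mistral 7B and Phi-3 handle the same diff.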
Latency vs Quality: The Real Tradeoff
One thing that rarely gets mentioned: local models are often faster for small requests. There is no network round trip, no queue, no rate limit. For a quick "generate a type from this JSON" task, a local 7B model that is already loaded in memory responds in under a second on my machine. In my experience, a cloud API takes 2-3 seconds for the same request once you factor in network latency.
But for complex tasks (multi-file refactors, architecture suggestions, debugging subtle issues), cloud models still win. The quality gap is real. I tried using local models for PR review and the feedback was too shallow to be useful.
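If you want to check latency claims like these on your own hardware, a small measurement helper is enough. The sketch below relies on the timing fields that Ollama includes in a non-streaming `/api/generate` response (`total_duration`, `load_duration`, `eval_count`, `eval_duration`, with durations in nanoseconds, as I understand the API docs). The function names are hypothetical.

```python
import time

NANOSECONDS = 1e9


def summarize_timing(response: dict) -> dict:
    """Convert Ollama's nanosecond timing fields into seconds and tokens/second."""
    eval_seconds = response["eval_duration"] / NANOSECONDS
    return {
        "total_seconds": response["total_duration"] / NANOSECONDS,
        # load_duration is large only when the model was not yet in memory
        "load_seconds": response.get("load_duration", 0) / NANOSECONDS,
        "tokens_per_second": (
            response["eval_count"] / eval_seconds if eval_seconds else 0.0
        ),
    }


def timed(call):
    """Run any zero-argument callable and return (result, wall-clock seconds)."""
    start = time.perf_counter()
    result = call()
    return result, time.perf_counter() - start
```

`timed` works for both sides of the comparison: wrap a local request and a cloud request in it and compare the wall-clock numbers, since only the local response carries the Ollama timing fields. Run each measurement more than once, because the first local request also pays the model load time.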
My workflow splits like this:
| Task | Model | Why |
|---|---|---|
| Quick completions | Mistral 7B (local) | Speed matters more than depth |
| Code generation | DeepSeek Coder 16B (local) | Good quality, no API costs |
| Complex refactoring | Cloud API | Quality gap is too large |
| Sensitive code | Any local model | Data stays on my machine |
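
The table above can be expressed as a small routing function if you want to script the split. This is only a sketch: the task names, the `Route` type, and the `"cloud-model"` placeholder are invented for illustration, and choosing DeepSeek Coder as the fallback for sensitive code is one reasonable reading of "any local model".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    backend: str  # "ollama" for local, "cloud" for a hosted API
    model: str


# Mirrors the workflow table; the task keys are hypothetical labels.
ROUTES = {
    "quick_completion": Route("ollama", "mistral:7b"),
    "code_generation": Route("ollama", "deepseek-coder-v2:16b"),
    "complex_refactoring": Route("cloud", "cloud-model"),  # placeholder name
}


def pick_route(task: str, sensitive: bool = False) -> Route:
    """Choose a backend for a task; sensitive code always stays local."""
    route = ROUTES[task]
    if sensitive and route.backend != "ollama":
        # Trade quality for privacy: fall back to the strongest local model.
        return ROUTES["code_generation"]
    return route
```

For example, `pick_route("complex_refactoring", sensitive=True)` returns the local DeepSeek route instead of the cloud one.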
Integrating With Your Editor
I use Continue.dev as a VS Code extension to route requests to my local Ollama instance. The configuration is straightforward.
```json
{
  "models": [
    {
      "title": "DeepSeek Local",
      "provider": "ollama",
      "model": "deepseek-coder-v2:16b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Mistral Fast",
    "provider": "ollama",
    "model": "mistral:7b"
  }
}
```

Tab autocomplete with a local 7B model feels instant. That alone is worth the setup.
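One easy mistake with this kind of setup is pointing the editor at a model tag that was never pulled. The sketch below compares the models named in a config of the shape shown above with what the local server reports from `GET /api/tags`. The function names are hypothetical, and it assumes each entry in the `/api/tags` response carries a `name` field such as `mistral:7b`.

```python
import json
import urllib.request


def configured_models(config: dict) -> set:
    """Collect every Ollama model tag referenced in the editor config."""
    names = {
        entry["model"]
        for entry in config.get("models", [])
        if entry.get("provider") == "ollama"
    }
    autocomplete = config.get("tabAutocompleteModel")
    if autocomplete and autocomplete.get("provider") == "ollama":
        names.add(autocomplete["model"])
    return names


def installed_models(base_url: str = "http://localhost:11434") -> set:
    """Ask the local Ollama server which model tags it has pulled."""
    with urllib.request.urlopen(base_url + "/api/tags", timeout=5) as response:
        return {model["name"] for model in json.loads(response.read())["models"]}


def missing_models(config: dict, installed: set) -> set:
    """Return configured tags that are not present locally."""
    return configured_models(config) - installed
```

Load the config with `json.load`, pass it together with `installed_models()` to `missing_models`, and run `ollama pull` for anything it returns.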
When Local Beats Cloud
Run local when: you are on a plane, you are working with proprietary code, you want zero-cost completions during a long coding session, or you need sub-second responses for simple tasks.
Stick with cloud when: you need the best possible output quality, you are doing complex multi-step reasoning, or you need a massive context window.
The best setup is both. Local for the fast, frequent, simple stuff. Cloud for the hard problems. That split has cut my API costs significantly while keeping my velocity high.