Running Local LLMs for Development: A Practical Guide

A deep dive into setting up local language models for code generation, review, and testing.

Cloud-based LLMs are powerful, but there are real reasons to run models locally: privacy, latency, cost, and the ability to work offline. I have been running local models for development tasks for a few months now, and I want to share what actually works.

Setting Up Ollama

Ollama is the easiest way to get started. Install it, pull a model, and you are running inference locally in under five minutes.

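# Install on macOS
brew install ollama

# Start the server (it runs in the foreground, so use a second terminal for the rest)
ollama serve

# Pull two code-focused models and a small general-purpose one
ollama pull codellama:13b
ollama pull deepseek-coder-v2:16b
ollama pull mistral:7b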

The API is dead simple. The server listens on localhost:11434 and exposes a plain REST interface.

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-coder-v2:16b",
  "prompt": "Write a TypeScript function that debounces an input handler",
  "stream": false
}'
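
The same call from a script might look like the sketch below. It uses only Python's standard library and assumes the Ollama server is listening on its default port. The helper names `build_generate_request` and `generate` are made up for this example and are not part of any Ollama client.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama address


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build the same POST request the curl example sends (hypothetical helper)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text.

    With "stream": false, Ollama replies with a single JSON object whose
    "response" field holds the full completion.
    """
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read())["response"]
```

Calling `generate("deepseek-coder-v2:16b", "Write a TypeScript function that debounces an input handler")` should return the completion as a string, provided the server is running and the model has been pulled. The generous timeout is a guess to cover the first request, when the model still has to load into memory.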

Model Comparison: What I Actually Use

I have tested several models for different tasks. Here is where I landed.

DeepSeek Coder V2 (16B) is my default for code generation. It handles TypeScript well, understands React patterns, and produces clean output. The 16B-parameter version runs comfortably on an M-series Mac with 32 GB of RAM.

Mistral 7B is fast. Really fast. I use it for quick tasks like generating commit messages, writing JSDoc comments, or reformatting data. The quality is lower than larger models, but for simple transformations it does not matter.

CodeLlama 13B was my first pick, but I have mostly moved away from it. DeepSeek Coder handles the same tasks with better output quality.

Phi-3 (3.8B) is surprisingly capable for its size. I use it on my laptop when I am traveling and want something that does not demolish battery life. It handles simple completions and explanations well enough.

Latency vs Quality: The Real Tradeoff

Here is the part that gets overlooked: local models are often faster for small requests. There is no network round trip, no queue, no rate limit. For a quick "generate a type from this JSON" task, a local 7B model responds in under a second. In my experience, a cloud API takes 2-3 seconds once you factor in network latency.
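
Latency figures like these depend on hardware, model size, and network, so it is worth measuring on your own setup. A minimal sketch of a timing helper follows. `time_call` is a hypothetical name, and it times any zero-argument callable, so a local request and a cloud request can be wrapped the same way.

```python
import statistics
import time
from typing import Callable


def time_call(fn: Callable[[], object], runs: int = 5) -> dict:
    """Call fn `runs` times and report wall-clock latency in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }
```

The median is usually the number to compare, since the first call to a local model can be much slower while the weights load.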

But for complex tasks (multi-file refactors, architecture suggestions, debugging subtle issues), cloud models still win. The quality gap is real. I tried using local models for PR review, and the feedback was too shallow to be useful.

My workflow splits like this:

| Task | Model | Why |
| --- | --- | --- |
| Quick completions | Mistral 7B (local) | Speed matters more than depth |
| Code generation | DeepSeek Coder 16B (local) | Good quality, no API costs |
| Complex refactoring | Cloud API | Quality gap is too large |
| Sensitive code | Any local model | Data stays on my machine |
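
The split above can be written down as a small routing function. This is only a sketch: the task labels, the `pick_model` name, and the `"cloud"` placeholder are assumptions invented for illustration, and falling back to DeepSeek Coder for sensitive code is one arbitrary choice of "any local model".

```python
from dataclasses import dataclass

LOCAL_FAST = "mistral:7b"
LOCAL_CODE = "deepseek-coder-v2:16b"
CLOUD = "cloud"  # placeholder for whichever hosted API you use


@dataclass(frozen=True)
class Route:
    model: str
    local: bool


def pick_model(task: str, sensitive: bool = False) -> Route:
    """Choose a model for a task, keeping sensitive code on the local machine."""
    # Hypothetical task labels mirroring the workflow split.
    routes = {
        "quick_completion": Route(LOCAL_FAST, local=True),
        "code_generation": Route(LOCAL_CODE, local=True),
        "complex_refactoring": Route(CLOUD, local=False),
    }
    if task not in routes:
        raise ValueError(f"unknown task: {task!r}")
    route = routes[task]
    # Sensitive code never leaves the machine, even for tasks that
    # would otherwise go to the cloud.
    if sensitive and not route.local:
        return Route(LOCAL_CODE, local=True)
    return route
```

With this in place, `pick_model("complex_refactoring", sensitive=True)` resolves to the local DeepSeek model instead of the cloud placeholder, trading output quality for privacy.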

Integrating With Your Editor

I use Continue (continue.dev), a VS Code extension, to route requests to my local Ollama instance. The configuration is straightforward.

{
  "models": [
    {
      "title": "DeepSeek Local",
      "provider": "ollama",
      "model": "deepseek-coder-v2:16b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Mistral Fast",
    "provider": "ollama",
    "model": "mistral:7b"
  }
}

Tab autocomplete with a local 7B model feels instant. That alone is worth the setup.

When Local Beats Cloud

Run local when: you are on a plane, you are working with proprietary code, you want zero-cost completions during a long coding session, or you need sub-second responses for simple tasks.

Stick with cloud when: you need the best possible output quality, you are doing complex multi-step reasoning, or you need a massive context window.

The best setup is both. Local for the fast, frequent, simple stuff. Cloud for the hard problems. That split has cut my API costs significantly while keeping my velocity high.