Running LLMs Locally: Ollama, LM Studio, llama.cpp

The ultimate power move in AI is cutting the cord. Running models locally means:

  1. Total Privacy: No data leaves your machine.
  2. Zero Marginal Cost: No subscriptions or per-token fees; you pay once for hardware and electricity.
  3. Offline Access: AI on a plane, in a submarine, or in a bunker.

The Hardware Requirements

You don’t need an H100 cluster.

  • Mac: Apple Silicon (M1/M2/M3) is the king of local inference because Unified Memory lets the GPU address all system RAM. If you have 16GB+ RAM, you are golden.
  • PC: An NVIDIA GPU is the smoothest path. 8GB VRAM (RTX 3060/4060) is the minimum for decent performance; 24GB (RTX 3090/4090) is the sweet spot. AMD cards work too, with more setup friction.
  • CPU Only: Possible, but expect single-digit tokens per second on all but the smallest models.
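A useful rule of thumb for sizing hardware: a model needs roughly (parameter count × bytes per weight) of memory, plus some overhead for the KV cache and runtime. A quick sketch in Python (the 20% overhead figure is a rough assumption, not a spec):

```python
def model_size_gb(params_billions: float, bits_per_weight: float,
                  overhead: float = 0.2) -> float:
    """Rough memory footprint: parameters times bytes per weight, plus overhead."""
    bytes_total = params_billions * 1e9 * (bits_per_weight / 8)
    return bytes_total * (1 + overhead) / 1e9

# An 8B model at 4-bit quantization fits in about 5GB:
print(round(model_size_gb(8, 4), 1))    # 4.8
# The same model at full 16-bit precision needs about 19GB:
print(round(model_size_gb(8, 16), 1))   # 19.2
```

This is why an 8GB GPU handles quantized 7B/8B models comfortably but chokes on anything much bigger.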

The Tools

1. Ollama (The Easiest)

Ollama is the “Docker for LLMs.” It makes running a model one command.

Install: Download the installer from ollama.com.

Run:

ollama run llama3

That’s it. It downloads the weights, sets up the server, and drops you into a chat.

2. LM Studio (The UI)

LM Studio is for anyone who wants a polished graphical interface (think ChatGPT) that runs entirely on their own machine.

  • Drag and drop models.
  • Chat history.
  • “Local Server” mode to mimic OpenAI’s API.

3. llama.cpp (The Engine)

This is the low-level C++ library that powers almost everything else, including Ollama and LM Studio. It introduced GGUF, the single-file format that packages (usually quantized) weights so giant models can run on consumer hardware.

Quantization: The Secret Sauce

How do we fit a 70GB model onto a 16GB laptop? Quantization.

We reduce the precision of the weights, the numbers that make up the model’s brain.

  • FP16 (16-bit): Original size.
  • Q4_K_M (4-bit): roughly 30% of the size, with quality that is nearly indistinguishable on most tasks.

A “Q4” quantized model is hard to tell apart from the full model in everyday use, needs roughly a quarter to a third of the RAM, and generates tokens noticeably faster, because inference is mostly limited by memory bandwidth: smaller weights mean fewer bytes to read per token.
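The core trick can be sketched in a few lines of Python: divide each weight by a shared scale, round to a small integer, and multiply back at inference time. This is a toy symmetric 4-bit scheme, much simpler than the K-quant formats llama.cpp actually uses, but the principle is the same:

```python
def quantize_4bit(weights):
    """Toy symmetric 4-bit quantization: one scale shared by a block of weights."""
    scale = max(abs(w) for w in weights) / 7  # signed 4-bit range is -8..7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.97, -0.08]
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)
# Each weight now costs 4 bits instead of 16, at the price of a small rounding error.
```

Real formats like Q4_K_M refine this with per-block scales and higher precision for the most sensitive tensors, which is how they keep quality so close to the original.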

The “Local API” Pattern

Most local tools expose an OpenAI-compatible API. This means you can use your existing code!

Original Code:

from openai import OpenAI

client = OpenAI(api_key="sk-...")  # talks to OpenAI's cloud

Local Code:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama"  # the client requires a key; Ollama ignores its value
)

Now your app runs on your own hardware, with no per-token bill.
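Under the hood, “OpenAI-compatible” just means the same JSON shapes over HTTP, so you can even skip the client library. A sketch using only the standard library (it builds the request by hand; with Ollama running, `urllib.request.urlopen(req)` would return a real completion):

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request by hand."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer ollama",  # any non-empty token; Ollama ignores it
        },
    )

req = chat_request("http://localhost:11434/v1", "llama3", "Why run models locally?")
# With Ollama running: json.load(urllib.request.urlopen(req)) gives the reply.
```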

Conclusion

We are entering the era of Personal AI. Not a giant brain in the cloud owned by a corporation, but a smart, private brain on your desk that works for you.

Start with Ollama. Download Llama 3. Welcome to the resistance.


Next: Go build something.