Ollama Quickstart
Description
Run powerful AI models on your own hardware. No API keys, no cloud costs, complete privacy. Ollama makes local LLM deployment trivial.
Installation
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download from https://ollama.com/download/windows
# Docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Quick Start
# Pull and run a model (one command)
ollama run llama3.2
# You're now in an interactive chat!
# Type your message and press Enter
>>> What is the capital of France?
The capital of France is Paris...
>>> /bye # Exit
Essential Models
# ════════════════════════════════════════════════════════════════════════════
# Small & Fast (4-8GB RAM)
# ════════════════════════════════════════════════════════════════════════════
ollama pull llama3.2 # 3B, great all-rounder
ollama pull phi3:mini # 3.8B, Microsoft's compact model
ollama pull gemma2:2b # 2B, Google's lightweight
# ════════════════════════════════════════════════════════════════════════════
# Medium (16GB RAM)
# ════════════════════════════════════════════════════════════════════════════
ollama pull llama3.1:8b # 8B, excellent quality
ollama pull mistral # 7B, fast and capable
ollama pull gemma2:9b # 9B, strong reasoning
# ════════════════════════════════════════════════════════════════════════════
# Large (32GB+ RAM)
# ════════════════════════════════════════════════════════════════════════════
ollama pull llama3.3:70b # 70B, near GPT-4 quality
ollama pull qwen2.5:72b # 72B, multilingual beast
ollama pull deepseek-r1:70b # 70B, reasoning model
# ════════════════════════════════════════════════════════════════════════════
# Specialized
# ════════════════════════════════════════════════════════════════════════════
ollama pull codellama # Code generation
ollama pull llava # Vision + text (describe images)
ollama pull nomic-embed-text # Embeddings for RAG
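Embeddings from nomic-embed-text are just vectors; for RAG you rank documents by cosine similarity to the query vector. A minimal pure-Python sketch (the toy vectors below are placeholders, real ones come from the embeddings endpoint):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
doc = [0.1, 0.9, 0.2]
query = [0.2, 0.8, 0.1]
print(round(cosine_similarity(doc, query), 3))  # → 0.987
```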
Model Management
# List downloaded models
ollama list
# Show model details
ollama show llama3.2
# Remove a model
ollama rm llama3.2
# Copy/rename a model
ollama cp llama3.2 my-llama
# Check running models
ollama ps
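The same inventory `ollama list` shows is served over HTTP at /api/tags. This sketch parses a response shaped like the real one (the two entries and their sizes are illustrative, not fetched from a server):

```python
import json

# Example payload shaped like an /api/tags response (values are illustrative)
sample = json.loads('''{"models": [
  {"name": "llama3.2:latest", "size": 2019393189},
  {"name": "mistral:latest", "size": 4113301824}
]}''')

# Print each model with its size in GB
for m in sample["models"]:
    print(f'{m["name"]}: {m["size"] / 1e9:.1f} GB')
```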
API Usage
Ollama listens on port 11434, serving its native REST API under /api plus an OpenAI-compatible API under /v1.
curl
# Generate completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
# Chat format
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'
# Embeddings (newer releases also expose /api/embed, which takes an "input" field)
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Hello world"
}'
Python
# Using requests
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Explain quantum computing in simple terms",
        "stream": False,
    },
)
print(response.json()["response"])

# Using OpenAI SDK (compatible!)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # any non-empty string works
)
response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
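With "stream": True (the default), /api/generate returns newline-delimited JSON: one fragment per line, ending with an object where "done" is true. A sketch that reassembles the fragments (the sample lines below are illustrative):

```python
import json

def assemble_stream(ndjson_lines) -> str:
    """Concatenate the "response" fragments of a streamed /api/generate reply."""
    text = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(text)

# Illustrative chunks in the shape Ollama streams them
lines = [
    '{"response": "The sky ", "done": false}',
    '{"response": "is blue.", "done": false}',
    '{"response": "", "done": true}',
]
print(assemble_stream(lines))  # → The sky is blue.
```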
JavaScript
// Using fetch
const response = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama3.2",
    prompt: "Write a haiku about coding",
    stream: false,
  }),
});
const data = await response.json();
console.log(data.response);
Custom Models (Modelfile)
Create specialized models with custom system prompts:
# Modelfile
FROM llama3.2
# Set the temperature
PARAMETER temperature 0.7
# Set the system prompt
SYSTEM """
You are a helpful coding assistant. You write clean, well-documented code.
Always explain your reasoning before providing code.
Use type hints in Python and TypeScript.
"""
# Create the custom model
ollama create code-helper -f Modelfile
# Use it
ollama run code-helper
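If you maintain several variants, note that a Modelfile is plain text and easy to template. A sketch of a hypothetical helper (make_modelfile is not part of Ollama, just string formatting):

```python
def make_modelfile(base: str, system_prompt: str, temperature: float = 0.7) -> str:
    """Render a minimal Modelfile for use with `ollama create -f`."""
    return (
        f"FROM {base}\n"
        f"PARAMETER temperature {temperature}\n"
        f'SYSTEM """\n{system_prompt}\n"""\n'
    )

print(make_modelfile("llama3.2", "You are a helpful coding assistant."))
```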
Docker Compose Integration
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped
    # GPU access (requires the NVIDIA Container Toolkit; remove for CPU-only)
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  # Example: Open WebUI for chat interface
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    volumes:
      - open_webui_data:/app/backend/data

volumes:
  ollama_data:
  open_webui_data:
Performance Tips
# ════════════════════════════════════════════════════════════════════════════
# GPU Acceleration
# ════════════════════════════════════════════════════════════════════════════
# Check whether a loaded model is on GPU (see the PROCESSOR column)
ollama ps
# Server logs also report detected GPUs at startup
# Force CPU only (if GPU issues)
CUDA_VISIBLE_DEVICES="" ollama serve
# ════════════════════════════════════════════════════════════════════════════
# Memory Management
# ════════════════════════════════════════════════════════════════════════════
# Keep a model loaded (faster subsequent requests)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "keep_alive": "1h"
}'
# Unload a model immediately
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "keep_alive": 0
}'
# ════════════════════════════════════════════════════════════════════════════
# Context Length
# ════════════════════════════════════════════════════════════════════════════
# Increase context window (uses more VRAM)
# There is no CLI flag; inside an `ollama run` session use:
#   /set parameter num_ctx 8192
# or per API request, add "options": {"num_ctx": 8192} to the JSON body
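Before raising num_ctx, it helps to estimate whether a prompt even needs it. A crude chars/4 heuristic sketch (real token counts vary by model tokenizer; fits_context and the 512-token reply budget are assumptions, not Ollama API):

```python
def rough_token_count(text: str) -> int:
    """Very rough estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_context(prompt: str, num_ctx: int, reply_budget: int = 512) -> bool:
    """Check the prompt fits the window, leaving room for the model's reply."""
    return rough_token_count(prompt) + reply_budget <= num_ctx

print(fits_context("word " * 100, num_ctx=2048))  # → True
```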
Troubleshooting
# Check if Ollama is running
curl http://localhost:11434/api/tags
# View logs
journalctl -u ollama -f # Linux systemd
cat ~/.ollama/logs/server.log # macOS
# Restart service
sudo systemctl restart ollama # Linux
brew services restart ollama # macOS (if installed via Homebrew)
# Reset everything (deletes all downloaded models)
rm -rf ~/.ollama
ollama serve
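The first check above is easy to script. A health-check sketch using only the standard library (it returns False on any connection problem rather than raising):

```python
import json
import urllib.error
import urllib.request

def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """True if the Ollama server answers /api/tags with valid JSON."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.load(resp)
            return True
    except (urllib.error.URLError, TimeoutError, json.JSONDecodeError):
        return False

print(ollama_is_up())  # False unless a local Ollama is running
```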
Model Comparison Cheat Sheet
| Model | Size | VRAM | Speed | Quality | Best For |
|---|---|---|---|---|---|
| llama3.2 | 3B | 4GB | ⚡⚡⚡ | ★★★ | Quick tasks |
| llama3.1:8b | 8B | 8GB | ⚡⚡ | ★★★★ | General use |
| mistral | 7B | 8GB | ⚡⚡⚡ | ★★★★ | Fast + good |
| codellama | 7B | 8GB | ⚡⚡ | ★★★★ | Coding |
| llama3.3:70b | 70B | 48GB | ⚡ | ★★★★★ | Best quality |
Next Steps
- Add a web UI: Open WebUI
- Build a RAG system: See our RAG Starter Kit
- Fine-tune: Export and use with our LoRA guide
- Benchmark: Compare models with our LLM Playground