Local AI · February 8, 2026

Ollama Quickstart

Run LLMs locally with Ollama. Complete setup guide with model downloads, API usage, and integration examples. Privacy-first AI in minutes.

ollama · local-llm · privacy · self-hosted · inference

Description

Run powerful AI models on your own hardware. No API keys, no cloud costs, complete privacy. Ollama makes local LLM deployment trivial.

Installation

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download/windows

# Docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Quick Start

# Pull and run a model (one command)
ollama run llama3.2

# You're now in an interactive chat!
# Type your message and press Enter

>>> What is the capital of France?
The capital of France is Paris...

>>> /bye  # Exit

Essential Models

# ════════════════════════════════════════════════════════════════════════════
# Small & Fast (4-8GB RAM)
# ════════════════════════════════════════════════════════════════════════════

ollama pull llama3.2          # 3B, great all-rounder
ollama pull phi3:mini         # 3.8B, Microsoft's compact model
ollama pull gemma2:2b         # 2B, Google's lightweight

# ════════════════════════════════════════════════════════════════════════════
# Medium (16GB RAM)
# ════════════════════════════════════════════════════════════════════════════

ollama pull llama3.1:8b       # 8B, excellent quality (llama3.2 only ships 1B/3B)
ollama pull mistral           # 7B, fast and capable
ollama pull gemma2:9b         # 9B, strong reasoning

# ════════════════════════════════════════════════════════════════════════════
# Large (32GB+ RAM)
# ════════════════════════════════════════════════════════════════════════════

ollama pull llama3.3:70b      # 70B, near GPT-4 quality
ollama pull qwen2.5:72b       # 72B, multilingual beast
ollama pull deepseek-r1:70b   # 70B, reasoning model

# ════════════════════════════════════════════════════════════════════════════
# Specialized
# ════════════════════════════════════════════════════════════════════════════

ollama pull codellama         # Code generation
ollama pull llava             # Vision + text (describe images)
ollama pull nomic-embed-text  # Embeddings for RAG
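Once nomic-embed-text is pulled, its vectors (768 dimensions) can be compared with cosine similarity, the usual building block for RAG retrieval. A minimal sketch with short stand-in vectors in place of real embeddings from the API:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Stand-in vectors; in practice these come from the /api/embeddings endpoint.
doc_vec = [0.1, 0.8, 0.3]
query_vec = [0.2, 0.7, 0.4]
print(round(cosine_similarity(doc_vec, query_vec), 3))
```

Rank documents by this score against the query vector and feed the top hits into the prompt.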

Model Management

# List downloaded models
ollama list

# Show model details
ollama show llama3.2

# Remove a model
ollama rm llama3.2

# Copy/rename a model
ollama cp llama3.2 my-llama

# Check running models
ollama ps
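`ollama list` has a JSON counterpart: GET /api/tags returns a `{"models": [...]}` payload. A small sketch that extracts model names from a response of that shape (the sample dict stands in for a live response):

```python
def model_names(tags_response: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in tags_response.get("models", [])]

# Stand-in for requests.get("http://localhost:11434/api/tags").json()
sample = {"models": [{"name": "llama3.2:latest"}, {"name": "mistral:latest"}]}
print(model_names(sample))  # ['llama3.2:latest', 'mistral:latest']
```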

API Usage

Ollama serves its native REST API on port 11434, plus an OpenAI-compatible API under /v1.

curl

# Generate completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

# Chat format
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'

# Embeddings (newer Ollama versions also offer /api/embed, which takes "input")
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Hello world"
}'

Python

# Using requests
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Explain quantum computing in simple terms",
        "stream": False
    }
)
print(response.json()["response"])
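With "stream": True (the API default), /api/generate returns newline-delimited JSON chunks, each carrying a piece of the text in its "response" field. A sketch of reassembling them, using hard-coded chunks in place of response.iter_lines():

```python
import json

def collect_stream(lines: list[bytes]) -> str:
    """Concatenate the "response" fields of NDJSON streaming chunks."""
    parts = []
    for line in lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Stand-ins for response.iter_lines() on a streaming request
chunks = [
    b'{"response": "Hello", "done": false}',
    b'{"response": " world", "done": true}',
]
print(collect_stream(chunks))  # Hello world
```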

# Using OpenAI SDK (compatible!)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Any string works
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
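The /api/chat endpoint is stateless: the client must resend the full message history on every turn. A minimal history-tracking sketch (the actual requests.post call to /api/chat is left as a comment):

```python
def build_chat_payload(model: str, history: list[dict], user_msg: str) -> dict:
    """Append the user turn and build the /api/chat request body."""
    history.append({"role": "user", "content": user_msg})
    return {"model": model, "messages": history, "stream": False}

history: list[dict] = []
payload = build_chat_payload("llama3.2", history, "Hello!")
# requests.post("http://localhost:11434/api/chat", json=payload) would go here;
# append the returned {"role": "assistant", ...} message to history to keep context.
print(payload["messages"])
```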

JavaScript

// Using fetch
const response = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama3.2",
    prompt: "Write a haiku about coding",
    stream: false
  })
});

const data = await response.json();
console.log(data.response);

Custom Models (Modelfile)

Create specialized models with custom system prompts:

# Modelfile
FROM llama3.2

# Set the temperature
PARAMETER temperature 0.7

# Set the system prompt
SYSTEM """
You are a helpful coding assistant. You write clean, well-documented code.
Always explain your reasoning before providing code.
Use type hints in Python and TypeScript.
"""

# Create the custom model (shell, from the directory containing the Modelfile)
ollama create code-helper -f Modelfile

# Use it
ollama run code-helper

Docker Compose Integration

# docker-compose.yml
services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  # Example: Open WebUI for chat interface
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    volumes:
      - open_webui_data:/app/backend/data

volumes:
  ollama_data:
  open_webui_data:
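Note that depends_on alone only waits for the container to start, not for the API to be ready. A healthcheck sketch (assumes the ollama CLI inside the image, which ships with ollama/ollama):

```yaml
  ollama:
    healthcheck:
      test: ["CMD", "ollama", "list"]   # succeeds once the server answers
      interval: 10s
      timeout: 5s
      retries: 5

  open-webui:
    depends_on:
      ollama:
        condition: service_healthy
```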

Performance Tips

# ════════════════════════════════════════════════════════════════════════════
# GPU Acceleration
# ════════════════════════════════════════════════════════════════════════════

# Check whether a loaded model runs on GPU or CPU
ollama ps   # the PROCESSOR column shows e.g. "100% GPU"

# Force CPU only (if GPU issues)
CUDA_VISIBLE_DEVICES="" ollama serve

# ════════════════════════════════════════════════════════════════════════════
# Memory Management
# ════════════════════════════════════════════════════════════════════════════

# Preload a model and keep it in memory for 1h (faster subsequent requests)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "keep_alive": "1h"
}'

# Unload model immediately after response
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "keep_alive": 0
}'

# ════════════════════════════════════════════════════════════════════════════
# Context Length
# ════════════════════════════════════════════════════════════════════════════

# Increase context window (uses more VRAM). There is no --num-ctx flag;
# set it inside an interactive session:
#   ollama run llama3.2
#   >>> /set parameter num_ctx 8192
# or per request via the API "options" field:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Summarize this document...",
  "options": {"num_ctx": 8192}
}'

Troubleshooting

# Check if Ollama is running
curl http://localhost:11434/api/tags

# View logs
journalctl -u ollama -f  # Linux systemd
cat ~/.ollama/logs/server.log  # macOS (there is no `ollama logs` command)

# Restart service
sudo systemctl restart ollama  # Linux
brew services restart ollama   # macOS

# Reset everything (warning: deletes all downloaded models and settings)
rm -rf ~/.ollama
ollama serve
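A script can guard against a missing server before sending prompts. A sketch using only the standard library, hitting the same /api/tags endpoint used above:

```python
import urllib.request
import urllib.error

def ollama_is_up(base_url: str = "http://localhost:11434",
                 timeout: float = 2.0) -> bool:
    """Return True if the Ollama server answers on /api/tags."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print(ollama_is_up())
```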

Model Comparison Cheat Sheet

Model        | Size | VRAM | Speed | Quality | Best For
-------------|------|------|-------|---------|--------------
llama3.2     | 3B   | 4GB  | ⚡⚡⚡   | ★★★     | Quick tasks
llama3.1:8b  | 8B   | 8GB  | ⚡⚡    | ★★★★    | General use
mistral      | 7B   | 8GB  | ⚡⚡⚡   | ★★★★    | Fast + good
codellama    | 7B   | 8GB  | ⚡⚡    | ★★★★    | Coding
llama3.3:70b | 70B  | 48GB | ⚡     | ★★★★★   | Best quality

Next Steps

  • Add a web UI: Open WebUI
  • Build a RAG system: See our RAG Starter Kit
  • Fine-tune: Export and use with our LoRA guide
  • Benchmark: Compare models with our LLM Playground