Retrieval-Augmented Generation (RAG) Explained

Large language models are impressive — until they hallucinate a citation, confidently tell you about a 2022 event that never happened, or fail to answer a question about your internal codebase. The solution? Give the model access to real, relevant information at query time. That’s RAG.

The Core Problem

LLMs are trained on static snapshots of data. Once training is done, the model’s knowledge is frozen. Ask it about yesterday’s news, your private documents, or anything proprietary — and you’ll get either a hallucinated answer or a polite “I don’t know.”

Fine-tuning helps, but it’s expensive, slow, and still doesn’t handle real-time data. You’d need to retrain every time something changes.

How RAG Works

RAG splits the problem into two stages:

1. Retrieval

Before the LLM generates an answer, a retrieval system fetches the most relevant documents or chunks from an external knowledge base. This base can be anything: PDFs, wikis, databases, code repositories, emails.

The retrieval step typically uses semantic search via embeddings:

  • Documents are chunked and embedded into vectors
  • The user’s query is also embedded
  • Cosine similarity (or approximate nearest neighbor search) finds the closest chunks
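The similarity computation behind that last step is simple enough to show directly. A minimal NumPy sketch with toy 3-dimensional vectors (real embeddings have hundreds of dimensions, but the math is identical):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.1, 0.9, 0.2])
chunk_vecs = np.array([
    [0.1, 0.8, 0.3],   # points in nearly the same direction as the query
    [0.9, 0.1, 0.0],   # points in a very different direction
])

scores = [cosine_similarity(query_vec, c) for c in chunk_vecs]
best = int(np.argmax(scores))  # index of the closest chunk
```

Approximate nearest neighbor (ANN) indexes exist precisely to avoid running this comparison against every stored vector.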

2. Augmented Generation

The retrieved chunks are injected into the LLM’s context window alongside the original query. The model now “sees” the relevant information and generates an answer grounded in actual data — not just its training weights.

User query → Embed → Vector DB → Top-K chunks
[query + top-K chunks] → LLM → Answer


Key Components

Embedding Model

Converts text to dense vectors. Popular choices: text-embedding-3-small (OpenAI), nomic-embed-text (local), bge-m3 (multilingual).

Vector Database

Stores and indexes embeddings for fast similarity search. Options: Qdrant, Weaviate, Chroma, Pinecone, pgvector (Postgres extension).
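The core operation a vector database performs can be sketched as a brute-force in-memory store. The `TinyVectorStore` class below is made up for illustration; production systems replace the exhaustive scan with ANN indexes and add persistence, filtering, and sharding:

```python
import numpy as np

class TinyVectorStore:
    """Brute-force similarity search; real vector DBs use ANN indexes instead."""

    def __init__(self):
        self.vectors = []
        self.payloads = []

    def add(self, vector, payload):
        # Normalize on insert so a dot product at query time equals cosine similarity.
        v = np.asarray(vector, dtype=float)
        self.vectors.append(v / np.linalg.norm(v))
        self.payloads.append(payload)

    def search(self, query_vector, top_k=3):
        q = np.asarray(query_vector, dtype=float)
        q = q / np.linalg.norm(q)
        scores = np.stack(self.vectors) @ q
        order = np.argsort(scores)[::-1][:top_k]
        return [(self.payloads[i], float(scores[i])) for i in order]

store = TinyVectorStore()
store.add([1.0, 0.0], "doc about retrieval")
store.add([0.0, 1.0], "doc about generation")
results = store.search([0.9, 0.1], top_k=1)
```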

Chunking Strategy

How you split documents matters enormously:

  • Fixed-size chunks (e.g., 512 tokens) — simple but can cut context mid-sentence
  • Semantic chunks — split on paragraph/section boundaries
  • Hierarchical — parent document + child chunks for better context
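Fixed-size chunking with overlap is the easiest of these to sketch. Token counts are approximated here by whitespace-split words; a real pipeline would count with the embedding model's own tokenizer:

```python
def chunk_fixed(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size word chunks. The overlap means content
    near a boundary appears in both neighboring chunks, so a sentence cut
    in one chunk survives intact in the next."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"word{i}" for i in range(1000))
chunks = chunk_fixed(doc, chunk_size=400, overlap=50)
```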

Reranking

After initial retrieval, a cross-encoder reranker scores retrieved chunks against the query more precisely. This two-stage approach (fast retrieval + precise reranking) significantly improves quality.
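The two-stage shape is easy to sketch independently of any particular model. Here the bi-encoder and cross-encoder are stubbed out with word-overlap scoring functions purely for illustration; in practice you would plug in an embedding model and a trained cross-encoder:

```python
def retrieve_then_rerank(query, docs, fast_score, precise_score,
                         k_retrieve=20, k_final=3):
    """Stage 1: cheap scoring over all docs. Stage 2: expensive scoring
    over the survivors only."""
    candidates = sorted(docs, key=lambda d: fast_score(query, d), reverse=True)
    candidates = candidates[:k_retrieve]
    reranked = sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)
    return reranked[:k_final]

# Toy stand-ins for a bi-encoder and a cross-encoder (word overlap only).
def fast_score(q, d):
    return len(set(q.lower().split()) & set(d.lower().split()))

def precise_score(q, d):
    d_words = d.lower().split()
    return sum(d_words.count(w) for w in q.lower().split())

docs = [
    "retrieval augmented generation grounds answers",
    "retrieval retrieval",
    "cooking pasta at home",
]
top = retrieve_then_rerank("what is retrieval augmented generation", docs,
                           fast_score, precise_score, k_retrieve=2, k_final=1)
```

The economics are the point: the fast scorer touches every document, the precise scorer only the top candidates.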

A Simple RAG Pipeline

from sentence_transformers import SentenceTransformer
import numpy as np

# 1. Embed documents. The nomic model loads custom code (trust_remote_code)
#    and expects task prefixes on its inputs.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
docs = ["RAG combines retrieval with generation.", "Vector search finds similar documents."]
embeddings = model.encode(["search_document: " + d for d in docs],
                          normalize_embeddings=True)

# 2. At query time
query = "How does RAG work?"
query_emb = model.encode(["search_query: " + query], normalize_embeddings=True)

# 3. Find most similar chunk (dot product of unit vectors = cosine similarity)
scores = np.dot(embeddings, query_emb.T).flatten()
best_idx = scores.argmax()
context = docs[best_idx]

# 4. Build prompt
prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
# → send to LLM

RAG vs Fine-Tuning

|                  | RAG                     | Fine-tuning           |
|------------------|-------------------------|-----------------------|
| Update cost      | Cheap (re-index docs)   | Expensive (retrain)   |
| Real-time data   | ✅ Yes                  | ❌ No                 |
| Private data     | ✅ Yes                  | ✅ Yes (but static)   |
| Factual accuracy | High (grounded)         | Variable              |
| Latency          | Higher (retrieval step) | Lower                 |

RAG wins when your knowledge changes frequently. Fine-tuning wins when you need the model to learn new behaviors or styles.

Common Pitfalls

Chunking too aggressively — 100-token chunks lose context. Aim for 300–600 tokens with overlap.

Not filtering by relevance — Always check similarity scores. If the best match is below a threshold, don’t inject garbage context.
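The fix is a one-line gate before prompt construction. A minimal sketch, assuming scores are cosine similarities in [0, 1] and a threshold you tune on your own data:

```python
def select_context(ranked, min_score=0.75):
    """Keep only chunks whose similarity clears the threshold.
    An empty list is a valid outcome: no context beats misleading context."""
    return [chunk for chunk, score in ranked if score >= min_score]

ranked = [("relevant chunk", 0.82), ("borderline chunk", 0.74), ("noise", 0.31)]
context = select_context(ranked, min_score=0.75)
```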

Ignoring metadata — Filter by date, document type, or source before retrieval. A year-old policy doc might be worse than no context at all.
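A metadata pre-filter is just a predicate applied before any vector search runs. A sketch with a made-up document schema (`type` and `updated` fields are illustrative, not from any particular vector DB):

```python
from datetime import date

docs = [
    {"text": "2021 travel policy",    "type": "policy", "updated": date(2021, 3, 1)},
    {"text": "current travel policy", "type": "policy", "updated": date(2024, 6, 1)},
    {"text": "lunch menu",            "type": "memo",   "updated": date(2024, 6, 2)},
]

def prefilter(docs, doc_type, not_before):
    """Narrow the candidate set by metadata before similarity search."""
    return [d for d in docs
            if d["type"] == doc_type and d["updated"] >= not_before]

candidates = prefilter(docs, doc_type="policy", not_before=date(2023, 1, 1))
```

Most vector databases support this natively as a filter clause on the search call, which is cheaper than filtering results after retrieval.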

Missing reranking — Vector similarity is approximate. A reranker (Cohere Rerank, BGE reranker) dramatically improves precision.

Advanced Patterns

Hypothetical Document Embedding (HyDE): Generate a fake answer first, embed it, use it to retrieve real docs. Counterintuitive but effective.
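The control flow is the whole trick, so it can be sketched with stand-ins: below, `llm` returns a canned passage, `embed` is a bag-of-words set, and `similarity` is Jaccard overlap, all for illustration only. In a real system these would be an actual LLM call, an embedding model, and cosine similarity:

```python
def hyde_retrieve(query, docs, llm, embed, similarity, top_k=3):
    """HyDE: embed a hypothetical answer instead of the query. The fake
    answer tends to sit closer, in embedding space, to real answer
    passages than the question itself does."""
    hypothetical = llm(f"Write a short passage answering: {query}")
    h_vec = embed(hypothetical)
    scored = [(d, similarity(h_vec, embed(d))) for d in docs]
    return [d for d, _ in sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]]

# Toy stand-ins so the sketch runs end to end.
def llm(prompt):
    return "RAG retrieves relevant documents and feeds them to the model"

def embed(text):
    return set(text.lower().split())

def similarity(a, b):
    return len(a & b) / len(a | b)

docs = [
    "RAG retrieves documents before the model answers",
    "Bananas are yellow",
]
top = hyde_retrieve("How does RAG work?", docs, llm, embed, similarity, top_k=1)
```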

Multi-query RAG: Generate multiple phrasings of the query, retrieve for each, then deduplicate. Reduces sensitivity to query wording.
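The merge-and-deduplicate step is where multi-query RAG differs from plain retrieval. A sketch with a toy retriever (a fixed dict of results per phrasing, for illustration); each chunk keeps its best score across phrasings:

```python
def multi_query_retrieve(queries, retrieve, top_k=3):
    """Retrieve for each phrasing, then merge and deduplicate,
    keeping each chunk's best score across all phrasings."""
    best = {}
    for q in queries:
        for chunk, score in retrieve(q):
            if chunk not in best or score > best[chunk]:
                best[chunk] = score
    merged = sorted(best.items(), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in merged[:top_k]]

# Toy retriever: canned results per phrasing.
results = {
    "How does RAG work?":
        [("chunk A", 0.9), ("chunk B", 0.6)],
    "Explain retrieval-augmented generation":
        [("chunk B", 0.8), ("chunk C", 0.7)],
}
top = multi_query_retrieve(list(results), results.get, top_k=2)
```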

Agentic RAG: Let the LLM decide when to retrieve, what to search for, and whether to retrieve again based on partial answers. More powerful, more complex.

Conclusion

RAG is now table stakes for production AI applications. It’s the bridge between frozen training data and live, accurate knowledge. Whether you’re building a customer support bot, a code assistant, or an internal search tool — RAG is almost certainly part of the stack.

Start simple: chunk your docs, embed them, store in a vector DB, and inject the top-3 results into your prompt. Then iterate.