What It Is
Retrieval-Augmented Generation (RAG) is an architecture that augments a language model’s generation with documents retrieved from an external knowledge store at inference time. Instead of encoding all knowledge in model weights, RAG separates reasoning (parametric memory in the LLM) from factual knowledge (non-parametric memory in a searchable index), combining both to answer queries.
Why It Matters
LLMs have frozen parametric knowledge: they can’t update what they know without retraining, can’t access private or real-time data, and hallucinate when asked about facts they don’t know well. RAG addresses all three: the knowledge store can be updated independently, can hold proprietary documents, and retrieved passages ground the answer in actual text rather than guessed facts. RAG has become the default architecture for knowledge-intensive applications built on LLMs.
The Core Architecture
Three components:
- Retriever: encodes the query into a vector and retrieves top-k passages from an index by vector similarity (dense retrieval) or keyword match (BM25 sparse retrieval), or both (hybrid).
- Index: a precomputed store of document embeddings. For dense retrieval, each passage is encoded with a bi-encoder (e.g., BERT-based) and stored in a vector database (FAISS, Pinecone, Weaviate, etc.).
- Generator: a sequence-to-sequence or decoder-only LLM that takes (query + retrieved passages) as input and generates the answer.
USER QUERY
|
[Retriever: encode query → nearest-neighbor search over index]
|
Retrieved passages (top-k, e.g. k=5)
|
[Prompt assembly: "Answer this question: <query>\n\nContext:\n<passage_1>\n<passage_2>..."]
|
[LLM Generator: produces answer conditioned on query + context]
|
ANSWER (with optional source citations)
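The pipeline above can be sketched end to end. This is illustrative only: `embed` is a toy bag-of-words stand-in for a trained bi-encoder, the index is a plain Python list rather than a vector database, and the final LLM call is omitted.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system uses a trained bi-encoder."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, index, k=5):
    """Nearest-neighbor search by cosine similarity over the precomputed index."""
    q = embed(query)
    ranked = sorted(index, key=lambda p: cosine(q, p["vec"]), reverse=True)
    return [p["text"] for p in ranked[:k]]

def assemble_prompt(query, passages):
    context = "\n".join(passages)
    return f"Answer this question: {query}\n\nContext:\n{context}"

# Index built offline: one embedding per passage.
corpus = [
    "Hamlet was written by William Shakespeare.",
    "Paris is the capital of France.",
]
index = [{"text": p, "vec": embed(p)} for p in corpus]

query = "Who wrote Hamlet?"
prompt = assemble_prompt(query, retrieve(query, index, k=1))
# `prompt` is what would be sent to the LLM generator.
```

In production the only parts that change are the implementations: `embed` becomes a bi-encoder, the list becomes FAISS/Pinecone/Weaviate, and the prompt goes to an actual LLM.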
Parametric vs. Non-Parametric Memory
The paper that introduced RAG (Lewis et al. 2020) draws a sharp distinction:
- Parametric memory: knowledge baked into model weights during training. Fast at inference (no lookup), but static, opaque, and capacity-limited. Rare or recent facts get compressed out.
- Non-parametric memory: stored in an explicit, updatable database. Slower (requires retrieval), but transparent (you can see which passages were retrieved), fresh (update the index without retraining), and arbitrarily large.
RAG combines both: the LLM contributes reasoning and language fluency; the index contributes factual grounding.
Dense vs. Sparse Retrieval
Sparse (BM25): keyword-based frequency scoring. Fast, interpretable, excellent for exact-match queries (“Who wrote Hamlet?”). Fails when query and document use different vocabulary (“automobile” vs “car”).
Dense (DPR, bi-encoder): encode query and passages into a shared vector space; retrieve by cosine/dot product similarity. Captures semantic similarity across vocabulary. Better for paraphrase matching and complex queries.
Hybrid: combine both scores. Many production RAG systems use hybrid retrieval to get the best coverage of both exact-match and paraphrase queries.
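One common way to combine sparse and dense results is reciprocal rank fusion (RRF), which merges ranked lists without needing to calibrate the two retrievers’ raw scores; the constant k=60 is conventional, and the rankings below are hypothetical:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids; higher fused score = better."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1/(k + rank); documents near the top
            # of multiple lists accumulate the highest fused score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a BM25 retriever and a dense retriever.
bm25_ranking = ["d3", "d1", "d2"]
dense_ranking = ["d1", "d2", "d3"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
# d1 ends up first: it sits near the top of both lists.
```

Rank-based fusion is robust precisely because BM25 scores and cosine similarities live on incompatible scales; a weighted sum of raw scores would need per-retriever normalization.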
RAG Variants and Extensions
Naive RAG (original): one retrieval step, all retrieved passages in context, single generation pass.
Advanced RAG patterns:
- Re-ranking: after initial retrieval, a cross-encoder re-ranks top-k passages for precision
- HyDE (Hypothetical Document Embeddings): generate a hypothetical answer, retrieve similar real documents to that answer (improves recall for hard queries)
- Multi-hop RAG: retrieve, generate an intermediate answer, retrieve again based on that — for questions requiring evidence from multiple sources
- FLARE: generate iteratively, triggering retrieval when the model is uncertain (low token probability)
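The multi-hop pattern can be sketched as a retrieve-generate loop. Here `toy_retrieve` and `toy_generate` are stand-ins (the stub hard-codes a two-hop chain for illustration); a real system would use a vector index and an LLM that decides whether a follow-up retrieval is needed:

```python
def multi_hop_rag(query, retrieve, generate, max_hops=3):
    """Retrieve, generate, and re-retrieve until the model stops asking follow-ups."""
    context, question, answer = [], query, None
    for _ in range(max_hops):
        context.extend(retrieve(question))
        answer, followup = generate(question, context)
        if followup is None:
            break
        question = followup
    return answer

# Toy stand-ins: a two-fact "index" and a hard-coded two-hop "LLM".
facts = {
    "Hamlet": "Hamlet was written by William Shakespeare.",
    "Shakespeare": "Shakespeare was born in Stratford-upon-Avon.",
}

def toy_retrieve(question):
    return [fact for key, fact in facts.items() if key in question]

def toy_generate(question, context):
    # Returns (answer, follow-up question); a real LLM would reason over context.
    if "author of Hamlet" in question:
        return None, "Where was Shakespeare born?"
    return "Stratford-upon-Avon", None

answer = multi_hop_rag("Where was the author of Hamlet born?", toy_retrieve, toy_generate)
```

The first hop retrieves the Hamlet passage and yields a follow-up question; the second hop retrieves the birthplace passage and answers. This is exactly the bridging step that single-shot retrieval cannot perform.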
What Doesn’t Work
- Retrieval failure cascades: if the retriever returns irrelevant passages, the LLM often hallucinates regardless — generation is only as good as retrieval
- Multi-hop reasoning: single-step (naive) RAG can’t connect information from passages about different sub-questions; multi-hop variants mitigate this at extra latency and cost
- Long-tail queries: very specific questions may have no relevant passage in the index
- Chunking sensitivity: how documents are split into retrievable chunks significantly affects performance; there is no universal optimal chunk size
Practical Implications
RAG is the default architecture for:
- Enterprise Q&A over internal documentation
- Customer support with proprietary knowledge bases
- Medical/legal Q&A with citation requirements
- Any system where LLM knowledge cutoff is a problem
Building a RAG system requires decisions about: embedding model choice, chunk size and overlap, retrieval k, re-ranking, prompt format, and whether to include source citations. Each of these significantly affects quality.
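As one illustration of the chunking decision, a minimal fixed-size splitter with overlap (the size and overlap values here are arbitrary examples, not recommendations; real pipelines often split on sentence or token boundaries and tune these empirically):

```python
def chunk(text, size=200, overlap=50):
    """Split text into fixed-size character chunks, with adjacent chunks
    sharing `overlap` characters so facts straddling a boundary survive."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "x" * 500
chunks = chunk(doc, size=200, overlap=50)
# Three chunks: [0:200], [150:350], [300:500]; neighbors share 50 characters.
```

The overlap is the interesting parameter: too small and a sentence split across a boundary is retrievable from neither chunk; too large and the index bloats with near-duplicates that crowd the top-k.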
Key Sources
- rag-retrieval-augmented-generation — the original RAG paper (Lewis et al. 2020)
Related Concepts
- in-context-learning — retrieved passages are placed in the LLM context window
- foundation-models — RAG extends foundation models with updatable external memory
- tool-use-agents — in agentic systems, retrieval can be one tool among many
- transfer-learning — the retriever is often a pretrained encoder fine-tuned for retrieval
Open Questions
- Optimal retrieval granularity (sentence, paragraph, section, document)?
- When does RAG outperform long-context models, and vice versa?
- How to train the retriever and generator jointly in open-domain settings without annotated (query, passage) pairs?
- How robust is RAG to adversarial documents inserted into the index?