Language models know a lot — but their knowledge is frozen at training time. Ask GPT-3 who won the 2024 election and it doesn’t know. Ask it about a specific company’s proprietary documents and it has nothing. Ask it a question with a wrong premise and it confidently hallucinates an answer rather than saying “I don’t have that information.” The problem is that LLMs store knowledge in their parameters — and parameters are static, opaque, and hard to update. RAG (Retrieval-Augmented Generation), proposed by Lewis et al. at Facebook AI in 2020, offers a different model: keep a separate, searchable knowledge store, and at inference time, look things up before answering.

The core idea

The analogy: You’re taking a difficult open-book exam. A closed-book approach means all your knowledge must live in your memory — you’re limited to what you absorbed during studying. An open-book approach means you can consult reference materials in the exam room. A brilliant open-book student can reason better than a brilliant closed-book student on questions requiring precise facts, because the book is more reliable than memory for specific details.

RAG makes language models open-book. Before generating an answer, the model retrieves relevant passages from an external document collection, then conditions its generation on both the original question and the retrieved passages. The model’s parameters still provide reasoning and language ability; the retrieved passages provide factual grounding.

The specific technical innovation: make the retrieval step differentiable, so the retriever and generator can be trained jointly end-to-end.

The mechanism, step by step

System components:

  1. Parametric memory: a seq2seq language model (BART in the paper) — the generator
  2. Non-parametric memory: a dense vector index of Wikipedia (~21M passages, ~100 words each)
  3. Retriever: Dense Passage Retrieval (DPR) — a bi-encoder that maps questions and passages into a shared embedding space

Retriever (DPR):

  • Encode the question: q(x) = BERT_q(x)
  • Encode every Wikipedia passage: d(z) = BERT_d(z) (precomputed and stored)
  • Retrieve the top-k passages by dot-product similarity: sim(x, z) = d(z)^T q(x)

Because passage embeddings are precomputed, retrieval from 21M passages takes milliseconds with a FAISS index (approximate nearest neighbor search).
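A minimal sketch of this lookup, using NumPy and exact search in place of a real FAISS index (the dimensions, data, and names here are invented for illustration):

```python
import numpy as np

# Passage embeddings are computed once, offline; each query is then answered
# with a single matrix-vector product plus a top-k selection.
rng = np.random.default_rng(0)
num_passages, dim = 1000, 8              # the real index: ~21M passages, 768-d
passage_embs = rng.normal(size=(num_passages, dim))  # precomputed d(z) vectors

def retrieve(query_emb: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the top-k passages by dot-product similarity."""
    scores = passage_embs @ query_emb    # sim(x, z) = d(z)^T q(x) for every z
    return np.argsort(scores)[::-1][:k]  # exact search; FAISS approximates this

query = rng.normal(size=dim)             # stand-in for the DPR query embedding
top5 = retrieve(query)
print(top5.shape)  # (5,)
```

A production system replaces the exact argsort with an approximate index (e.g. FAISS IVF or HNSW) so that search over 21M vectors stays in the millisecond range.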

Two RAG variants:

RAG-Sequence: Retrieve one set of documents, use the same set for the entire generated sequence. The probability of output y given question x is:

  p(y | x) ≈ Σ_{z ∈ top-k} p(z | x) · Π_i p(y_i | x, z, y_{1:i−1})

Sum over retrieved documents z, weighted by retrieval probability p(z | x). The generator takes (question + retrieved passage) as input and produces the answer.

RAG-Token: Can attend to different documents for each output token — more flexible, useful when different parts of the answer come from different sources. The probability of each token marginalizes over documents independently:

  p(y | x) ≈ Π_i Σ_{z ∈ top-k} p(z | x) · p(y_i | x, z, y_{1:i−1})
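A toy numeric comparison of the two marginalizations, with invented retrieval and generator probabilities (2 documents, a 3-token answer):

```python
import numpy as np

# p_ret[z]    = p(z | x): retrieval probability of document z (invented)
# p_gen[z, i] = p(y_i | x, z, y_<i): per-token generator probability (invented)
p_ret = np.array([0.7, 0.3])
p_gen = np.array([[0.9, 0.8, 0.9],
                  [0.2, 0.5, 0.4]])

# RAG-Sequence: marginalize once over whole sequences.
#   p(y | x) = Σ_z p(z | x) · Π_i p(y_i | x, z, y_<i)
p_seq = float(np.sum(p_ret * np.prod(p_gen, axis=1)))

# RAG-Token: marginalize per token, then multiply across tokens.
#   p(y | x) = Π_i Σ_z p(z | x) · p(y_i | x, z, y_<i)
p_tok = float(np.prod(p_ret @ p_gen))

print(p_seq, p_tok)  # the two variants assign different probabilities in general
```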

QUERY: "What is the capital of France?"
  |
  [DPR: encode query → retrieve top-5 Wikipedia passages]
  |
Retrieved:
  1. "Paris is the capital and largest city of France..."
  2. "France is a country in Western Europe..."
  3. "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris..."
  4. ...
  |
  [BART generator: input = "Question: <query> Context: <retrieved passages>"]
  |
Output: "Paris"

All components are trained jointly by backpropagating through the marginal likelihood.

Training:

The system is trained end-to-end on question–answer pairs by maximizing the marginal likelihood p(y | x) = Σ_z p(z | x) · p(y | x, z). You don’t need labeled (question, relevant-passage, answer) triples — just (question, answer) pairs. The model learns which passages are helpful by seeing which retrieval decisions lead to better answers. The DPR query encoder is updated so that helpful passages get higher retrieval scores; the passage encoder and index stay fixed during training, so the 21M passage embeddings never need recomputing.
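A sketch of that objective for a single (question, answer) pair, with invented scores and likelihoods standing in for the DPR query encoder and BART outputs:

```python
import numpy as np

scores = np.array([2.0, 0.5, -1.0])            # sim(x, z) for the top-k passages
p_ret = np.exp(scores) / np.exp(scores).sum()  # p(z | x): softmax over top-k docs
p_gen = np.array([0.30, 0.10, 0.02])           # p(y | x, z): answer likelihood per doc

# Negative log marginal likelihood: -log Σ_z p(z | x) · p(y | x, z)
loss = -np.log(np.sum(p_ret * p_gen))

# Because p(z | x) is a differentiable softmax over retrieval scores, this loss
# has a gradient with respect to the query encoder: passages whose p(y | x, z)
# is high get their scores pushed up, with no passage-level labels required.
print(float(loss))
```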

Find the instinct

Parametric vs. non-parametric memory:

The paper draws a sharp distinction between two ways of knowing something:

  • Parametric: baked into model weights during training (like memorizing a fact)
  • Non-parametric: stored explicitly in a database, looked up at query time (like writing a note you can reference later)

Language models are purely parametric. This creates several problems:

  1. Staleness: knowledge can’t be updated without retraining
  2. Capacity limits: you can only fit so much in weights; rare facts get compressed out
  3. No provenance: you can’t tell where the model’s knowledge came from
  4. Hallucination: when the parametric memory is uncertain, it interpolates between what it does know, producing plausible-sounding nonsense

RAG addresses all four: the knowledge store is updated by changing the index, the store can be arbitrarily large, retrieved passages provide explicit sources, and factual questions get answered from actual text rather than interpolated from compressed representations.

Why dense retrieval instead of BM25?

Classic retrieval (BM25) is keyword-based — it finds documents containing the same words as the query. Dense retrieval (DPR) encodes semantic meaning: “automobile” and “car” map to nearby embeddings. For knowledge-intensive NLP, semantic retrieval is much better because the query vocabulary often doesn’t match the document vocabulary.

The critical insight from DPR (a companion paper by the same group): you can train a BERT-based retriever to find supporting documents by starting with distant supervision — Wikipedia passages that contain the answer string are treated as positive examples. This creates a useful retriever without needing expensive annotation of which passage is “the right one” for each question.
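The distant-supervision rule itself is simple enough to sketch in a few lines (passages and answer here are invented examples, not DPR training data):

```python
def distant_positives(passages: list[str], answer: str) -> list[str]:
    """Passages containing the answer string count as positives (case-insensitive)."""
    return [p for p in passages if answer.lower() in p.lower()]

passages = [
    "Paris is the capital and largest city of France.",
    "France is a country in Western Europe.",
    "The Eiffel Tower is a lattice tower in Paris.",
]
positives = distant_positives(passages, "Paris")
print(positives)  # passages 1 and 3; passage 2 becomes a candidate negative
```

In practice this heuristic is noisy — a passage can contain the answer string without actually supporting it — which DPR mitigates with BM25 hard negatives and in-batch negatives during training.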

Results

On open-domain QA (the model must answer factual questions without being handed a gold supporting passage):

| Dataset           | RAG  | Previous SOTA          | Metric      |
|-------------------|------|------------------------|-------------|
| Natural Questions | 44.5 | 41.5 (T5-11B)          | Exact match |
| WebQuestions      | 45.5 | 41.7 (T5-11B)          | Exact match |
| TriviaQA          | 68.0 | 60.5 (GPT-3 few-shot)  | Exact match |
| Jeopardy QG       | 21.4 | 15.1 (BART-large)      | Q-BLEU-1    |

RAG (roughly 620M trainable parameters: a BART-large generator plus a BERT-based query encoder) outperforms T5-11B, a model nearly 18× larger, on open-domain QA. The knowledge doesn’t need to be in parameters if you can look it up.

On the MS-MARCO dataset (a generative QA task), human evaluators preferred RAG outputs over parametric-only models on factual accuracy and specificity.

What doesn’t work:

  • Retrieval failure propagates: if the retriever returns irrelevant passages, the generator often hallucinates anyway
  • Multi-hop reasoning is hard: if answering requires connecting information from multiple distinct passages, the system struggles (one retrieval step isn’t enough)
  • Long-tail queries may not have relevant passages in the index
  • Latency: dense retrieval + generation is slower than generation-only

Practical implications

RAG is now the default approach for:

  • Question answering over private/proprietary documents (legal, medical, enterprise)
  • Keeping LLM knowledge current without retraining (news QA, market data)
  • Any system where answer provenance (citing sources) matters
  • Reducing hallucinations on factual queries

Modern RAG systems have evolved far beyond the original paper: hybrid retrieval (dense + sparse BM25), re-ranking, HyDE (hypothetical document embeddings), multi-hop retrieval, chunking strategies, and embedding model fine-tuning on domain-specific data. But the core insight — separate the knowledge store from the reasoning engine — comes from this paper.
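As one concrete example of the hybrid-retrieval idea, reciprocal rank fusion (RRF) is a common way to merge a BM25 ranking with a dense ranking. The sketch below uses invented document IDs, and k = 60 is the conventional RRF constant, not a value from the RAG paper:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several rankings: each doc scores Σ 1 / (k + rank) across lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d7"]   # keyword-based order (invented)
dense_ranking = ["d1", "d2", "d3"]  # embedding-based order (invented)
print(rrf([bm25_ranking, dense_ranking]))  # ['d1', 'd3', 'd2', 'd7']
```

Rank-based fusion avoids having to calibrate BM25 scores against dense dot products, which live on incompatible scales.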

Connections

Citation

arXiv:2005.11401

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. https://arxiv.org/abs/2005.11401