Language models know a lot — but their knowledge is frozen at training time. Ask GPT-3 who won the 2024 election and it doesn’t know. Ask it about a specific company’s proprietary documents and it has nothing. Ask it a question with a wrong premise and it confidently hallucinates an answer rather than saying “I don’t have that information.” The problem is that LLMs store knowledge in their parameters — and parameters are static, opaque, and hard to update. RAG (Retrieval-Augmented Generation), proposed by Lewis et al. at Facebook AI in 2020, offers a different model: keep a separate, searchable knowledge store, and at inference time, look things up before answering.

The core idea

The analogy: You’re taking a difficult open-book exam. A closed-book approach means all your knowledge must live in your memory — you’re limited to what you absorbed during studying. An open-book approach means you can consult reference materials in the exam room. A brilliant open-book student can reason better than a brilliant closed-book student on questions requiring precise facts, because the book is more reliable than memory for specific details.

RAG makes language models open-book. Before generating an answer, the model retrieves relevant passages from an external document collection, then conditions its generation on both the original question and the retrieved passages. The model’s parameters still provide reasoning and language ability; the retrieved passages provide factual grounding.

The specific technical innovation: make the retrieval step differentiable, so the retriever and generator can be trained jointly end-to-end.

The mechanism, step by step

System components:

  1. Parametric memory: a seq2seq language model (BART in the paper) — the generator
  2. Non-parametric memory: a dense vector index of Wikipedia (~21M passages, ~100 words each)
  3. Retriever: Dense Passage Retrieval (DPR) — a bi-encoder that maps questions and passages into a shared embedding space

Retriever (DPR):

  • Encode the question: q(x) = BERT_q(x)
  • Encode every Wikipedia passage: d(z) = BERT_d(z) (precomputed and stored)
  • Retrieve the top-k passages by dot-product similarity: sim(x, z) = d(z)^T q(x)

Because passage embeddings are precomputed, retrieval from 21M passages takes milliseconds with a FAISS index (approximate nearest neighbor search).
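A minimal sketch of this lookup, using NumPy and exact search in place of a real FAISS index (the dimensions, data, and names here are invented for illustration):

```python
import numpy as np

# Passage embeddings are computed once, offline; each query is then answered
# with a single matrix-vector product plus a top-k selection.
rng = np.random.default_rng(0)
num_passages, dim = 1000, 8              # the real index: ~21M passages, 768-d
passage_embs = rng.normal(size=(num_passages, dim))  # precomputed d(z) vectors

def retrieve(query_emb: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the top-k passages by dot-product similarity."""
    scores = passage_embs @ query_emb    # sim(x, z) = d(z)^T q(x) for every z
    return np.argsort(scores)[::-1][:k]  # exact search; FAISS approximates this

query = rng.normal(size=dim)             # stand-in for the DPR query embedding
top5 = retrieve(query)
print(top5.shape)  # (5,)
```

A production system replaces the exact argsort with an approximate index (e.g. FAISS IVF or HNSW) so that search over 21M vectors stays in the millisecond range.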

Two RAG variants:

RAG-Sequence: Retrieve one set of documents, use the same set for the entire generated sequence. The probability of output y given question x is:

  p(y | x) ≈ Σ_{z ∈ top-k} p(z | x) · Π_i p(y_i | x, z, y_{1:i−1})

Sum over retrieved documents z, weighted by retrieval probability p(z | x). The generator takes (question + retrieved passage) as input and produces the answer.

RAG-Token: Can attend to different documents for each output token — more flexible, useful when different parts of the answer come from different sources. The probability of each token marginalizes over documents independently:

  p(y | x) ≈ Π_i Σ_{z ∈ top-k} p(z | x) · p(y_i | x, z, y_{1:i−1})
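A toy numeric comparison of the two marginalizations, with invented retrieval and generator probabilities (2 documents, a 3-token answer):

```python
import numpy as np

# p_ret[z]    = p(z | x): retrieval probability of document z (invented)
# p_gen[z, i] = p(y_i | x, z, y_<i): per-token generator probability (invented)
p_ret = np.array([0.7, 0.3])
p_gen = np.array([[0.9, 0.8, 0.9],
                  [0.2, 0.5, 0.4]])

# RAG-Sequence: marginalize once over whole sequences.
#   p(y | x) = Σ_z p(z | x) · Π_i p(y_i | x, z, y_<i)
p_seq = float(np.sum(p_ret * np.prod(p_gen, axis=1)))

# RAG-Token: marginalize per token, then multiply across tokens.
#   p(y | x) = Π_i Σ_z p(z | x) · p(y_i | x, z, y_<i)
p_tok = float(np.prod(p_ret @ p_gen))

print(p_seq, p_tok)  # the two variants assign different probabilities in general
```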

QUERY: "What is the capital of France?"
  |
  [DPR: encode query → retrieve top-5 Wikipedia passages]
  |
Retrieved:
  1. "Paris is the capital and largest city of France..."
  2. "France is a country in Western Europe..."
  3. "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris..."
  4. ...
  |
  [BART generator: input = "Question: <query> Context: <retrieved passages>"]
  |
Output: "Paris"

All components are trained jointly by backpropagating through the marginal likelihood.

Training:

The system is trained end-to-end on question–answer pairs by maximizing the marginal likelihood p(y | x) = Σ_z p(z | x) · p(y | x, z). You don’t need labeled (question, relevant-passage, answer) triples — just (question, answer) pairs. The model learns which passages are helpful by seeing which retrieval decisions lead to better answers. The DPR query encoder is updated so that helpful passages get higher retrieval scores; the passage encoder and index stay fixed during training, so the 21M passage embeddings never need recomputing.
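A sketch of that objective for a single (question, answer) pair, with invented scores and likelihoods standing in for the DPR query encoder and BART outputs:

```python
import numpy as np

scores = np.array([2.0, 0.5, -1.0])            # sim(x, z) for the top-k passages
p_ret = np.exp(scores) / np.exp(scores).sum()  # p(z | x): softmax over top-k docs
p_gen = np.array([0.30, 0.10, 0.02])           # p(y | x, z): answer likelihood per doc

# Negative log marginal likelihood: -log Σ_z p(z | x) · p(y | x, z)
loss = -np.log(np.sum(p_ret * p_gen))

# Because p(z | x) is a differentiable softmax over retrieval scores, this loss
# has a gradient with respect to the query encoder: passages whose p(y | x, z)
# is high get their scores pushed up, with no passage-level labels required.
print(float(loss))
```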

Find the instinct

Parametric vs. non-parametric memory:

The paper draws a sharp distinction between two ways of knowing something:

  • Parametric: baked into model weights during training (like memorizing a fact)
  • Non-parametric: stored explicitly in a database, looked up at query time (like writing a note you can reference later)

Language models are purely parametric. This creates several problems:

  1. Staleness: knowledge can’t be updated without retraining
  2. Capacity limits: you can only fit so much in weights; rare facts get compressed out
  3. No provenance: you can’t tell where the model’s knowledge came from
  4. Hallucination: when the parametric memory is uncertain, it interpolates between what it does know, producing plausible-sounding nonsense

RAG addresses all four: the knowledge store is updated by changing the index, the store can be arbitrarily large, retrieved passages provide explicit sources, and factual questions get answered from actual text rather than interpolated from compressed representations.

Why dense retrieval instead of BM25?

Classic retrieval (BM25) is keyword-based — it finds documents containing the same words as the query. Dense retrieval (DPR) encodes semantic meaning: “automobile” and “car” map to nearby embeddings. For knowledge-intensive NLP, semantic retrieval is much better because the query vocabulary often doesn’t match the document vocabulary.

The critical insight from DPR (a companion paper by the same group): you can train a BERT-based retriever to find supporting documents by starting with distant supervision — Wikipedia passages that contain the answer string are treated as positive examples. This creates a useful retriever without needing expensive annotation of which passage is “the right one” for each question.
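The distant-supervision rule itself is simple enough to sketch in a few lines (passages and answer here are invented examples, not DPR training data):

```python
def distant_positives(passages: list[str], answer: str) -> list[str]:
    """Passages containing the answer string count as positives (case-insensitive)."""
    return [p for p in passages if answer.lower() in p.lower()]

passages = [
    "Paris is the capital and largest city of France.",
    "France is a country in Western Europe.",
    "The Eiffel Tower is a lattice tower in Paris.",
]
positives = distant_positives(passages, "Paris")
print(positives)  # passages 1 and 3; passage 2 becomes a candidate negative
```

In practice this heuristic is noisy — a passage can contain the answer string without actually supporting it — which DPR mitigates with BM25 hard negatives and in-batch negatives during training.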

Results

On open-domain QA (the model must answer factual questions without being handed a gold supporting passage):

| Dataset           | RAG  | Previous SOTA          | Metric      |
|-------------------|------|------------------------|-------------|
| Natural Questions | 44.5 | 41.5 (T5-11B)          | Exact match |
| WebQuestions      | 45.5 | 41.7 (T5-11B)          | Exact match |
| TriviaQA          | 68.0 | 60.5 (GPT-3 few-shot)  | Exact match |
| Jeopardy QG       | 21.4 | 15.1 (BART-large)      | Q-BLEU-1    |

RAG (roughly 620M trainable parameters: a BART-large generator plus a BERT-based query encoder) outperforms T5-11B, a model nearly 18× larger, on open-domain QA. The knowledge doesn’t need to be in parameters if you can look it up.

On the MS-MARCO dataset (a generative QA task), human evaluators preferred RAG outputs over parametric-only models on factual accuracy and specificity.

What doesn’t work:

  • Retrieval failure propagates: if the retriever returns irrelevant passages, the generator often hallucinates anyway
  • Multi-hop reasoning is hard: if answering requires connecting information from multiple distinct passages, the system struggles (one retrieval step isn’t enough)
  • Long-tail queries may not have relevant passages in the index
  • Latency: dense retrieval + generation is slower than generation-only

Practical implications

RAG is now the default approach for:

  • Question answering over private/proprietary documents (legal, medical, enterprise)
  • Keeping LLM knowledge current without retraining (news QA, market data)
  • Any system where answer provenance (citing sources) matters
  • Reducing hallucinations on factual queries

Modern RAG systems have evolved far beyond the original paper: hybrid retrieval (dense + sparse BM25), re-ranking, HyDE (hypothetical document embeddings), multi-hop retrieval, chunking strategies, and embedding model fine-tuning on domain-specific data. But the core insight — separate the knowledge store from the reasoning engine — comes from this paper.
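As one concrete example of the hybrid-retrieval idea, reciprocal rank fusion (RRF) is a common way to merge a BM25 ranking with a dense ranking. The sketch below uses invented document IDs, and k = 60 is the conventional RRF constant, not a value from the RAG paper:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several rankings: each doc scores Σ 1 / (k + rank) across lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d7"]   # keyword-based order (invented)
dense_ranking = ["d1", "d2", "d3"]  # embedding-based order (invented)
print(rrf([bm25_ranking, dense_ranking]))  # ['d1', 'd3', 'd2', 'd7']
```

Rank-based fusion avoids having to calibrate BM25 scores against dense dot products, which live on incompatible scales.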

Connections

Citation

arXiv:2005.11401

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. https://arxiv.org/abs/2005.11401