What It Is
Retrieval-Augmented Generation (RAG) is an architecture that augments a language model’s generation with documents retrieved from an external knowledge store at inference time. Instead of encoding all knowledge in model weights, RAG separates reasoning (parametric memory in the LLM) from factual knowledge (non-parametric memory in a searchable index), combining both to answer queries.
Why It Matters
LLMs have frozen parametric knowledge: they can’t update what they know without retraining, can’t access private or real-time data, and hallucinate when asked about facts they don’t know well. RAG addresses all three: the knowledge store can be updated independently, can hold proprietary documents, and retrieved passages ground the answer in actual text rather than guessed facts. RAG has become the default architecture for knowledge-intensive applications built on LLMs.
The Core Architecture
Three components:
- Retriever: encodes the query into a vector and retrieves top-k passages from an index by vector similarity (dense retrieval) or keyword match (BM25 sparse retrieval), or both (hybrid).
- Index: a precomputed store of document embeddings. For dense retrieval, each passage is encoded with a bi-encoder (e.g., BERT-based) and stored in a vector database (FAISS, Pinecone, Weaviate, etc.).
- Generator: a sequence-to-sequence or decoder-only LLM that takes (query + retrieved passages) as input and generates the answer.
USER QUERY
|
[Retriever: encode query → nearest-neighbor search over index]
|
Retrieved passages (top-k, e.g. k=5)
|
[Prompt assembly: "Answer this question: <query>\n\nContext:\n<passage_1>\n<passage_2>..."]
|
[LLM Generator: produces answer conditioned on query + context]
|
ANSWER (with optional source citations)
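The pipeline above can be sketched end to end. This is illustrative only: `embed` is a toy bag-of-words stand-in for a trained bi-encoder, the index is a plain Python list rather than a vector database, and the final LLM call is omitted.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system uses a trained bi-encoder."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, index, k=5):
    """Nearest-neighbor search by cosine similarity over the precomputed index."""
    q = embed(query)
    ranked = sorted(index, key=lambda p: cosine(q, p["vec"]), reverse=True)
    return [p["text"] for p in ranked[:k]]

def assemble_prompt(query, passages):
    context = "\n".join(passages)
    return f"Answer this question: {query}\n\nContext:\n{context}"

# Index built offline: one embedding per passage.
corpus = [
    "Hamlet was written by William Shakespeare.",
    "Paris is the capital of France.",
]
index = [{"text": p, "vec": embed(p)} for p in corpus]

query = "Who wrote Hamlet?"
prompt = assemble_prompt(query, retrieve(query, index, k=1))
# `prompt` is what would be sent to the LLM generator.
```

In production the only parts that change are the implementations: `embed` becomes a bi-encoder, the list becomes FAISS/Pinecone/Weaviate, and the prompt goes to an actual LLM.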
Parametric vs. Non-Parametric Memory
The paper that introduced RAG (Lewis et al. 2020) draws a sharp distinction:
- Parametric memory: knowledge baked into model weights during training. Fast at inference (no lookup), but static, opaque, and capacity-limited. Rare or recent facts get compressed out.
- Non-parametric memory: stored in an explicit, updatable database. Slower (requires retrieval), but transparent (you can see which passages were retrieved), fresh (update the index without retraining), and arbitrarily large.
RAG combines both: the LLM contributes reasoning and language fluency; the index contributes factual grounding.
Dense vs. Sparse Retrieval
Sparse (BM25): keyword-based frequency scoring. Fast, interpretable, excellent for exact-match queries (“Who wrote Hamlet?”). Fails when query and document use different vocabulary (“automobile” vs “car”).
Dense (DPR, bi-encoder): encode query and passages into a shared vector space; retrieve by cosine/dot product similarity. Captures semantic similarity across vocabulary. Better for paraphrase matching and complex queries.
Hybrid: combine both scores. Many production RAG systems use hybrid retrieval to get the best coverage of both exact-match and paraphrase queries.
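One common way to combine sparse and dense results is reciprocal rank fusion (RRF), which merges ranked lists without needing to calibrate the two retrievers’ raw scores; the constant k=60 is conventional, and the rankings below are hypothetical:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids; higher fused score = better."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1/(k + rank); documents near the top
            # of multiple lists accumulate the highest fused score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a BM25 retriever and a dense retriever.
bm25_ranking = ["d3", "d1", "d2"]
dense_ranking = ["d1", "d2", "d3"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
# d1 ends up first: it sits near the top of both lists.
```

Rank-based fusion is robust precisely because BM25 scores and cosine similarities live on incompatible scales; a weighted sum of raw scores would need per-retriever normalization.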
RAG Variants and Extensions
Naive RAG (original): one retrieval step, all retrieved passages in context, single generation pass.
Advanced RAG patterns:
- Re-ranking: after initial retrieval, a cross-encoder re-ranks top-k passages for precision
- HyDE (Hypothetical Document Embeddings): generate a hypothetical answer, retrieve similar real documents to that answer (improves recall for hard queries)
- Multi-hop RAG: retrieve, generate an intermediate answer, retrieve again based on that — for questions requiring evidence from multiple sources
- FLARE: generate iteratively, triggering retrieval when the model is uncertain (low token probability)
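The multi-hop pattern can be sketched as a retrieve-generate loop. Here `toy_retrieve` and `toy_generate` are stand-ins (the stub hard-codes a two-hop chain for illustration); a real system would use a vector index and an LLM that decides whether a follow-up retrieval is needed:

```python
def multi_hop_rag(query, retrieve, generate, max_hops=3):
    """Retrieve, generate, and re-retrieve until the model stops asking follow-ups."""
    context, question, answer = [], query, None
    for _ in range(max_hops):
        context.extend(retrieve(question))
        answer, followup = generate(question, context)
        if followup is None:
            break
        question = followup
    return answer

# Toy stand-ins: a two-fact "index" and a hard-coded two-hop "LLM".
facts = {
    "Hamlet": "Hamlet was written by William Shakespeare.",
    "Shakespeare": "Shakespeare was born in Stratford-upon-Avon.",
}

def toy_retrieve(question):
    return [fact for key, fact in facts.items() if key in question]

def toy_generate(question, context):
    # Returns (answer, follow-up question); a real LLM would reason over context.
    if "author of Hamlet" in question:
        return None, "Where was Shakespeare born?"
    return "Stratford-upon-Avon", None

answer = multi_hop_rag("Where was the author of Hamlet born?", toy_retrieve, toy_generate)
```

The first hop retrieves the Hamlet passage and yields a follow-up question; the second hop retrieves the birthplace passage and answers. This is exactly the bridging step that single-shot retrieval cannot perform.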
What Doesn’t Work
- Retrieval failure cascades: if the retriever returns irrelevant passages, the LLM often hallucinates regardless — generation is only as good as retrieval
- Multi-hop reasoning: single-step (naive) RAG can’t connect information from passages about different sub-questions; multi-hop variants mitigate this at extra latency and cost
- Long-tail queries: very specific questions may have no relevant passage in the index
- Chunking sensitivity: how documents are split into retrievable chunks significantly affects performance; there is no universal optimal chunk size
Practical Implications
RAG is the default architecture for:
- Enterprise Q&A over internal documentation
- Customer support with proprietary knowledge bases
- Medical/legal Q&A with citation requirements
- Any system where LLM knowledge cutoff is a problem
Building a RAG system requires decisions about: embedding model choice, chunk size and overlap, retrieval k, re-ranking, prompt format, and whether to include source citations. Each of these significantly affects quality.
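As one illustration of the chunking decision, a minimal fixed-size splitter with overlap (the size and overlap values here are arbitrary examples, not recommendations; real pipelines often split on sentence or token boundaries and tune these empirically):

```python
def chunk(text, size=200, overlap=50):
    """Split text into fixed-size character chunks, with adjacent chunks
    sharing `overlap` characters so facts straddling a boundary survive."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "x" * 500
chunks = chunk(doc, size=200, overlap=50)
# Three chunks: [0:200], [150:350], [300:500]; neighbors share 50 characters.
```

The overlap is the interesting parameter: too small and a sentence split across a boundary is retrievable from neither chunk; too large and the index bloats with near-duplicates that crowd the top-k.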
Key Sources
- rag-retrieval-augmented-generation — the original RAG paper (Lewis et al. 2020)
Related Concepts
- in-context-learning — retrieved passages are placed in the LLM context window
- foundation-models — RAG extends foundation models with updatable external memory
- tool-use-agents — in agentic systems, retrieval can be one tool among many
- transfer-learning — the retriever is often a pretrained encoder fine-tuned for retrieval
Open Questions
- Optimal retrieval granularity (sentence, paragraph, section, document)?
- When does RAG outperform long-context models, and vice versa?
- How to train the retriever and generator jointly in open-domain settings without annotated (query, passage) pairs?
- How robust is RAG to adversarial documents inserted into the index?