Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Concepts: rag | in-context-learning | long-context
Builds on: rag-retrieval-augmented-generation
Leads to: ai-feedback

The problem

Standard RAG has an embarrassing flaw: it always retrieves. Ask it “what is 2+2?” and it will go look something up. Ask it to write a poem and it will find documents to prepend. Retrieval is baked in unconditionally, and once retrieved, the model never asks whether the document was actually useful or whether its own answer reflects the document at all.

The result is a system that wastes compute on unnecessary retrieval, gets distracted by irrelevant passages, and generates responses that cite sources that don’t actually support them. The original RAG paper (Lewis et al., 2020) acknowledged this but didn’t solve it. Self-RAG does.

The core idea

The analogy: Think of a researcher writing a report. A good researcher has three habits: (1) they know when to look something up — not every sentence needs a citation, but claims about specific facts do; (2) when they find a source, they scan it first to decide if it’s relevant before incorporating it; (3) after drafting a sentence, they re-read their source to check if the sentence is actually supported.

A bad researcher looks up something for every sentence, pastes in sources without reading them, and never cross-checks. Standard RAG is the bad researcher.

Self-RAG trains a single LM to be the good researcher — deciding when to retrieve, evaluating retrieved passages for relevance, generating a response, then critiquing whether that response is actually grounded in the evidence.

The mechanism:

Self-RAG introduces four types of special reflection tokens that the model generates as part of its normal output:

Token	Input	Values	Purpose
`Retrieve`	query + prior generation	yes / no / continue	Should I call the retriever right now?
`IsRel`	query + retrieved doc	relevant / irrelevant	Does this document actually help?
`IsSup`	query + doc + response	fully / partially / no support	Does my response follow from the doc?
`IsUse`	query + response	1 / 2 / 3 / 4 / 5	Is this response useful overall?

These tokens live in the model’s vocabulary — they’re not prompts, not outputs from a separate model, not post-hoc scores. The model generates them in-line, as naturally as it generates the next word.

“Generating reflection tokens makes the LM controllable during the inference phase, enabling it to tailor its behavior to diverse task requirements.”
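To make "tokens in the model's vocabulary" concrete, here is a minimal sketch of how the four token types could be represented and pulled back out of a generated string. The bracketed `[Name=Value]` spelling follows the flow diagrams in this article and is an assumption, not necessarily the exact token strings of the released model; `REFLECTION_VALUES` and `parse_reflection_tokens` are illustrative names.

```python
import re

# Allowed values per reflection token type, as listed in the table above.
# The "[Name=Value]" surface form is an assumption made for this sketch.
REFLECTION_VALUES = {
    "Retrieve": {"yes", "no", "continue"},
    "IsRel": {"relevant", "irrelevant"},
    "IsSup": {"fully supported", "partially supported", "no support"},
    "IsUse": {"1", "2", "3", "4", "5"},
}

_TOKEN_PATTERN = re.compile(r"\[(\w+)=([^\]]+)\]")


def parse_reflection_tokens(generated: str) -> tuple[str, dict[str, str]]:
    """Split generated text into plain text and its reflection tokens."""
    tokens = {}
    for name, value in _TOKEN_PATTERN.findall(generated):
        value = value.strip().lower()
        if name not in REFLECTION_VALUES or value not in REFLECTION_VALUES[name]:
            raise ValueError(f"unknown reflection token: [{name}={value}]")
        tokens[name] = value
    text = _TOKEN_PATTERN.sub("", generated)
    text = " ".join(text.split())  # collapse whitespace left by removed tokens
    return text, tokens
```

Because the tokens are part of the output stream, a downstream system can strip them for display while still using them for ranking or citation decisions.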

The ASCII flow:

INPUT: "Who won the 1969 World Series?"

Step 1: Model generates [Retrieve=Yes]
                         ↓
Step 2: Retriever fetches K=3 passages

   Passage d1: "New York Mets won 1969 World Series..."
   Passage d2: "1969 was a transformative year in baseball..."
   Passage d3: "The Miracle Mets upset the Baltimore Orioles..."

Step 3: For each passage, model generates in PARALLEL:
   d1: [IsRel=Relevant] → "The New York Mets won." → [IsSup=Fully supported] [IsUse=5]
   d2: [IsRel=Irrelevant] → (filtered out)
   d3: [IsRel=Relevant] → "The Mets won in an upset." → [IsSup=Fully supported] [IsUse=4]

Step 4: Rank by critique scores → select d1's generation

OUTPUT: "The New York Mets won the 1969 World Series."
        + self-assessed citation: d1 [fully supported]


INPUT: "Write a haiku about autumn."

Step 1: Model generates [Retrieve=No]
                         ↓
Step 2: Generate directly, no retrieval
Step 3: [IsUse=4]

OUTPUT: (haiku, no citation needed)
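The two flows above can be sketched as a single control loop. This is a simplified illustration, not the paper's decoding code: the four callables (`wants_retrieval`, `retrieve`, `generate_with_passage`, `generate_direct`) are hypothetical stand-ins for the model and retriever, and the fallback to direct generation when no passage is relevant is a choice made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    passage: str
    text: str
    is_rel: str    # "relevant" or "irrelevant"
    score: float   # critique score used for ranking


def self_rag_step(
    query: str,
    wants_retrieval: Callable[[str], bool],
    retrieve: Callable[[str, int], list[str]],
    generate_with_passage: Callable[[str, str], Candidate],
    generate_direct: Callable[[str], str],
    k: int = 3,
) -> str:
    """One segment of the flow above; every callable is a hypothetical stand-in."""
    # Step 1: the model decides whether to retrieve at all.
    if not wants_retrieval(query):
        return generate_direct(query)

    # Step 2: fetch K passages.
    passages = retrieve(query, k)

    # Step 3: one candidate per passage (run in parallel in a real system).
    candidates = [generate_with_passage(query, p) for p in passages]

    # Drop candidates whose passage the model itself marked irrelevant.
    relevant = [c for c in candidates if c.is_rel == "relevant"]
    if not relevant:
        return generate_direct(query)  # fallback chosen for this sketch

    # Step 4: keep the best-scoring candidate.
    return max(relevant, key=lambda c: c.score).text
```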

The math, translated:

At each segment step, when retrieval happens, the model generates $K$ candidate segments (one per retrieved passage). It ranks them using a critique score:

$\mathcal{S}(\text{Critique}) = \sum_{G \in \mathcal{G}} w^{G} \cdot s_{t}^{G}$

where $\mathcal{G} = \{\text{IsRel}, \text{IsSup}, \text{IsUse}\}$ and each $s_{t}^{G}$ is the normalized probability that the model assigns to the most desirable reflection token $\hat{r}$ for group $G$, out of the $N^{G}$ tokens in that group:

$s_{t}^{G} = \frac{p_{t}(\hat{r})}{\sum_{i=1}^{N^{G}} p_{t}(r_{i})}$

Translation: for IsRel, $\hat{r}$ = “relevant”, so $s_{t}^{\text{IsRel}}$ is how confident the model is that this passage is relevant. For IsSup, $\hat{r}$ = “fully supported”. These are just probabilities over the reflection token vocabulary: the model treating its own metacognition as next-token prediction.

The weights $w^{G}$ are adjustable at inference time. Want more citation precision? Increase $w^{\text{IsSup}}$. Want more completeness? Increase $w^{\text{IsUse}}$. No retraining needed; just adjust the weights.
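A small sketch of the scoring rule shows why the weights matter. The two candidates and both weight settings below are made-up numbers chosen so that the ranking flips; `desirable_prob` and `critique_score` are illustrative names, not functions from the Self-RAG codebase.

```python
def desirable_prob(probs: dict[str, float], desirable: str) -> float:
    """Normalized probability of the most desirable token within one group."""
    total = sum(probs.values())
    return probs[desirable] / total


def critique_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum over the groups IsRel, IsSup, IsUse."""
    return sum(weights[g] * scores[g] for g in scores)


# Normalization: raw token probabilities need not sum to 1.
print(desirable_prob({"relevant": 1.8, "irrelevant": 0.2}, "relevant"))

# Two hypothetical candidates: A is better supported, B is rated more useful.
cand_a = {"IsRel": 0.90, "IsSup": 0.85, "IsUse": 0.40}
cand_b = {"IsRel": 0.90, "IsSup": 0.50, "IsUse": 0.80}

precision_weights = {"IsRel": 1.0, "IsSup": 2.0, "IsUse": 0.5}
utility_weights = {"IsRel": 1.0, "IsSup": 0.5, "IsUse": 2.0}

for name, w in [("precision", precision_weights), ("utility", utility_weights)]:
    a, b = critique_score(cand_a, w), critique_score(cand_b, w)
    print(name, "prefers", "A" if a > b else "B")
# precision prefers A
# utility prefers B
```

The same model, with the same token probabilities, picks a different segment depending only on the weights supplied at inference time.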

Walkthrough with actual numbers:

Say we have 3 retrieved passages and the model assigns these reflection token probabilities:

                  IsRel          IsSup                 IsUse
                  p(rel)/p(irr)  p(full)/p(par)/p(no)  p(5)
Passage d1:       0.91 / 0.09   0.82 / 0.13 / 0.05    0.61
Passage d2:       0.23 / 0.77   0.11 / 0.44 / 0.45    0.38
Passage d3:       0.74 / 0.26   0.55 / 0.30 / 0.15    0.49

Normalize each:
d1: s_IsRel = 0.91/(0.91+0.09) = 0.91
    s_IsSup = 0.82/(0.82+0.13+0.05) = 0.82
    s_IsUse = 0.61 (normalized prob of "5", the most desirable rating)

With weights w_IsRel=0.5, w_IsSup=1.0, w_IsUse=0.5:
d1 score = 0.5×0.91 + 1.0×0.82 + 0.5×0.61 = 0.455 + 0.820 + 0.305 = 1.580
d2 score = 0.5×0.23 + 1.0×0.11 + 0.5×0.38 = 0.115 + 0.110 + 0.190 = 0.415
d3 score = 0.5×0.74 + 1.0×0.55 + 0.5×0.49 = 0.370 + 0.550 + 0.245 = 1.165

Winner: d1 — most relevant, best supported, highest utility.

d2 is effectively out of the running before the other scores matter: its IsRel probability of 0.23 already signals that the retriever returned a passage that is only tangentially related.
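The arithmetic above is easy to check in a few lines. The scores and weights are copied from the walkthrough; the `MIN_IS_REL` cutoff is a hypothetical hard filter added here to show what dropping d2 before scoring could look like, not a value from the paper.

```python
# Normalized scores copied from the walkthrough above.
passages = {
    "d1": {"IsRel": 0.91, "IsSup": 0.82, "IsUse": 0.61},
    "d2": {"IsRel": 0.23, "IsSup": 0.11, "IsUse": 0.38},
    "d3": {"IsRel": 0.74, "IsSup": 0.55, "IsUse": 0.49},
}
weights = {"IsRel": 0.5, "IsSup": 1.0, "IsUse": 0.5}

# Hypothetical hard filter: skip passages the model thinks are probably irrelevant.
MIN_IS_REL = 0.5


def rank(passages, weights, min_is_rel=0.0):
    """Score every passage that clears the relevance cutoff, best first."""
    scored = {
        name: round(sum(weights[g] * s[g] for g in weights), 3)
        for name, s in passages.items()
        if s["IsRel"] >= min_is_rel
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


print(rank(passages, weights))              # all three passages, d1 first
print(rank(passages, weights, MIN_IS_REL))  # d2 removed before scoring
```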

What’s clever — find the instinct:

The key design question was: how do you teach a model to self-critique without adding a separate critic at inference time?

The answer is a two-phase training trick. First, train a critic model (a small LM) on 4k-20k examples per reflection token type, with labels generated by GPT-4. The critic is cheap — you only run it once.

“We create supervised data by prompting GPT-4 to generate reflection tokens and then distill their knowledge into an in-house critic model.”

Then, use the critic to annotate your entire training corpus offline — inserting reflection tokens into the training data. The generator LM trains on this annotated corpus using standard next-token prediction. By the time the generator is deployed, it has internalized the critic completely.

“This eliminates the need to host a critic model during training, reducing overhead.”

At inference, no critic needed — the generator just generates. The distinction between “generating text” and “critiquing text” collapses into a single forward pass.
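The offline annotation step can be sketched as follows. Everything here is a simplified assumption: `toy_critic` is a rule-based placeholder for the trained critic LM, the `<p>...</p>` passage markers and the `[Name=Value]` spelling are illustrative, and the real pipeline annotates segment by segment rather than one whole output at a time.

```python
from typing import Callable

# A critic maps (instruction, passage, output) to reflection token values.
# In the paper this is a trained LM; here it is a hypothetical stand-in.
Critic = Callable[[str, str, str], dict[str, str]]


def annotate_example(instruction: str, passage: str, output: str, critic: Critic) -> str:
    """Build one generator training string with reflection tokens inserted."""
    labels = critic(instruction, passage, output)
    return (
        f"{instruction} [Retrieve={labels['Retrieve']}] "
        f"<p>{passage}</p> [IsRel={labels['IsRel']}] "
        f"{output} [IsSup={labels['IsSup']}] [IsUse={labels['IsUse']}]"
    )


def toy_critic(instruction: str, passage: str, output: str) -> dict[str, str]:
    """Rule-based placeholder so the sketch runs without a model."""
    supported = output.rstrip(".").lower() in passage.lower()
    return {
        "Retrieve": "yes",
        "IsRel": "relevant",
        "IsSup": "fully supported" if supported else "no support",
        "IsUse": "5" if supported else "2",
    }


corpus = [
    (
        "Who won the 1969 World Series?",
        "The New York Mets won the 1969 World Series.",
        "The New York Mets won the 1969 World Series.",
    )
]

# The critic runs once, offline. The generator then trains on these strings
# with ordinary next-token prediction and never needs the critic again.
annotated = [annotate_example(q, p, y, toy_critic) for q, p, y in corpus]
print(annotated[0])
```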

This is the Distill rule applied architecturally: a small teacher (the critic) annotates data offline, and a student (the generator) learns to reproduce that annotation as part of normal generation. The inference cost of a separate critic model drops to zero.

Does it work? What breaks?

Task	Self-RAG 13B	Llama2-chat 13B + RAG	ChatGPT + RAG
PopQA (open-domain QA)	55.8	51.8	50.8
TriviaQA	69.3	59.8	65.7
ARC-Challenge	73.1	37.9	75.3

On PopQA, a long-tail QA benchmark where parametric knowledge is systematically insufficient, Self-RAG 13B beats retrieval-augmented ChatGPT by 5 points (55.8 vs. 50.8). The questions are about entities with low Wikipedia traffic, exactly where unconditional RAG struggles because it retrieves but doesn’t verify relevance. ARC-Challenge is the exception in this table: retrieval-augmented ChatGPT stays ahead there.

Citation accuracy on long-form generation (ASQA): Self-RAG achieves significantly higher citation precision and recall than retrieval-augmented baselines — because it explicitly trains the model to produce IsSup=fully supported tokens only when warranted.

What doesn’t work:

The parallel passage processing is expensive. For each segment, you run K forward passes in parallel. At K=5-10 retrieved passages, inference cost multiplies. The paper compares to baselines that retrieve once; Self-RAG retrieves adaptively but each retrieval is more expensive.
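A back-of-envelope cost model makes the trade-off visible. It is a deliberately crude sketch: it counts segment generations only, assumes each segment triggers retrieval independently with probability `retrieve_rate`, and ignores retriever latency, batching, and beam search. The function name and parameters are hypothetical.

```python
def segment_generations(num_segments: int, k: int, retrieve_rate: float) -> float:
    """Expected number of segment generations for one response.

    A segment costs k candidate generations when retrieval fires
    (one per passage) and a single generation when it does not.
    """
    if not 0.0 <= retrieve_rate <= 1.0:
        raise ValueError("retrieve_rate must be between 0 and 1")
    per_segment = retrieve_rate * k + (1.0 - retrieve_rate)
    return num_segments * per_segment


print(segment_generations(num_segments=4, k=5, retrieve_rate=0.0))  # 4.0
print(segment_generations(num_segments=4, k=5, retrieve_rate=0.5))  # 12.0
print(segment_generations(num_segments=4, k=5, retrieve_rate=1.0))  # 20.0
```

Under these assumptions, adaptive retrieval only saves work when the model skips retrieval often enough to offset the K-fold cost of the segments where it does retrieve.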

The soft constraint weights ( $w^{G}$ ) must be tuned manually per task. The paper shows you can swap them to trade off citation precision vs. completeness, but there’s no automatic way to find good weights for a new task — it’s hyperparameter search.
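In practice that hyperparameter search can be as simple as a grid over the three weights, scored on a small labelled dev set. The sketch below assumes a dev-set format invented for illustration (each item is a list of candidate score dicts plus the index of the segment a human would pick); it is not a procedure described in the paper.

```python
from itertools import product


def pick(candidates, weights):
    """Return the index of the highest-scoring candidate."""
    def score(c):
        return sum(weights[g] * c[g] for g in weights)
    return max(range(len(candidates)), key=lambda i: score(candidates[i]))


def grid_search(dev_set, grid=(0.5, 1.0, 2.0)):
    """Try every weight combination and keep the one with the best dev accuracy.

    dev_set: list of (candidates, gold_index) pairs, where each candidate is a
    dict of IsRel / IsSup / IsUse scores. This data format is an assumption.
    """
    best_weights, best_acc = None, -1.0
    for w_rel, w_sup, w_use in product(grid, repeat=3):
        weights = {"IsRel": w_rel, "IsSup": w_sup, "IsUse": w_use}
        hits = sum(pick(cands, weights) == gold for cands, gold in dev_set)
        acc = hits / len(dev_set)
        if acc > best_acc:  # strict: the first best combination wins ties
            best_weights, best_acc = weights, acc
    return best_weights, best_acc
```

With three weights and a coarse grid this is only 27 combinations, and no model calls are repeated if the per-candidate scores are cached.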

“indiscriminately retrieving and incorporating a fixed number of retrieved passages, regardless of whether retrieval is necessary or passages are relevant, diminishes LM versatility or can lead to unhelpful response generation.”

This was the problem Self-RAG set out to fix — and it largely does, but at inference cost. Real-time RAG pipelines with latency budgets may need to trade off the self-reflection loop.

Reflection token quality also depends on the critic’s training data. The critic was trained on GPT-4 annotations across a specific distribution of tasks. Novel task formats (structured code generation, math proof verification) may get noisy reflection tokens.

So what?

If you’re building RAG systems, Self-RAG’s most practical contribution is the retrieval gating idea: don’t retrieve for every query. Plug in a fast classifier that predicts Retrieve=yes/no before calling your retriever — it’s a cheaper version of Self-RAG’s Retrieve token that eliminates wasted API calls on questions your LM can answer from parametric knowledge. The full Self-RAG pipeline is expensive to run, but the intuition (adaptive retrieval + relevance filtering) is implementable in lighter forms.
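A lightweight gate in the spirit of the Retrieve token might look like this. The `gate`, `retriever`, and `lm` callables are hypothetical stand-ins for whatever classifier, search backend, and model you already run, and the default threshold of 0.5 is an arbitrary starting point rather than a recommended value.

```python
def should_retrieve(p_yes: float, p_no: float, threshold: float = 0.5) -> bool:
    """Gate retrieval on the normalized probability of a 'retrieve' decision.

    p_yes and p_no can come from any cheap scorer (a small classifier, or an
    LM's probabilities for two answer tokens); they need not sum to 1.
    """
    total = p_yes + p_no
    if total <= 0:
        raise ValueError("probabilities must be positive")
    return p_yes / total > threshold


def answer(query, gate, retriever, lm):
    """Call the retriever only when the gate says so. All three are stand-ins."""
    p_yes, p_no = gate(query)
    if should_retrieve(p_yes, p_no):
        context = retriever(query)
        return lm(query, context)
    return lm(query, None)
```

Lowering the threshold makes the system retrieve more often, which trades extra retriever calls for a lower risk of answering from stale or missing parametric knowledge.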

For long-form generation with citation requirements — think research summarization, legal analysis, medical Q&A — the IsSup mechanism directly addresses hallucinated citations. If your use case has a “show your work” requirement, Self-RAG’s design pattern is worth implementing even at inference cost.

Remember how the original rag-retrieval-augmented-generation paper showed that models with access to a retriever beat parametric-only models on knowledge-intensive tasks? Self-RAG completes that story: not just “give the model a retriever” but “teach the model to use the retriever wisely.” The difference between retrieval-augmented and retrieval-controlled.

The reflection token idea connects to ai-feedback and chain-of-thought — both are ways to get models to externalize internal reasoning steps as part of normal generation. Self-RAG applies the same principle to metacognitive control: make the model’s decisions about how to answer as explicit and trainable as the answer itself.

A model that knows when to look something up, and checks its own work after, can beat a larger one that doesn’t.

Paper: Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection — Asai, Wu, Wang, Sil, Hajishirzi — 2023

Connections

rag — Self-RAG is adaptive RAG: retrieves on-demand rather than unconditionally
rag — introduces learned retrieval gating (Retrieve token) vs. fixed retrieval schedules
in-context-learning — reflection tokens extend ICL: the model learns to condition on its own metacognitive signals
long-context — Self-RAG is motivated by the same failure mode: models shouldn’t blindly use all available context
instruction-following — trained on diverse instruction-following datasets, evaluated zero-shot
ai-feedback — critic model trained on GPT-4 annotations; a form of AI-generated supervision
rag-retrieval-augmented-generation — the original RAG paper whose unconditional retrieval Self-RAG replaces

Citation

arXiv:2310.11511

Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv preprint. https://arxiv.org/abs/2310.11511
