By late 2019, the NLP community had two powerful but incompatible pre-training recipes. BERT’s bidirectional encoder was exceptional at understanding — it could see the full context on both sides of every token, making it ideal for classification, NER, and question answering. But BERT couldn’t generate text: its architecture had no decoder, no autoregressive mechanism. GPT’s causal decoder was exceptional at generation — it produced fluent text by predicting one token at a time — but saw only left context, making it weaker at comprehension tasks. Every practitioner had to pick a side. BART asked: what if you didn’t have to?
The core idea
The analogy: Think of a restoration archivist handed a damaged manuscript. The manuscript has passages blacked out, sentences scrambled, words missing. The archivist reads the whole damaged document carefully — using context from both before and after each damage — and then writes out the clean version from scratch. This process trains two distinct skills simultaneously: deep bidirectional reading (to understand what’s damaged and infer what’s missing) and fluent sequential writing (to produce the clean reconstruction).
That’s BART. The encoder reads a corrupted version of a document. The decoder generates the original clean document autoregressively, cross-attending to the encoder’s understanding at each generation step. Train long enough on this task and you have a model that’s simultaneously a strong bidirectional encoder (like BERT) and a strong autoregressive decoder (like GPT).
The mechanism, step by step:
- Take a training document (any text).
- Apply a noising function that corrupts the text: delete tokens, shuffle sentences, mask spans.
- The corrupted document is fed to a bidirectional Transformer encoder — it processes all corrupted tokens in parallel, seeing full context in both directions.
- A causal Transformer decoder generates the original clean document one token at a time, using causal self-attention over its own outputs so far, plus cross-attention to the encoder’s output at every layer.
- The training loss is the standard cross-entropy between the decoder’s output and the original text.
The architecture is the standard seq2seq Transformer; the only tweak is GeLU activations in place of ReLU. BART-large has 12 encoder and 12 decoder layers, about 400M parameters total, a scale comparable to RoBERTa-large's 355M. No new architectural components. The entire contribution is in what you train it to do.
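As a rough sanity check on that "about 400M" figure, here is a back-of-envelope estimate (my own approximation, ignoring biases, layer norms, and positional embeddings) using BART-large's published dimensions:

```python
# Back-of-envelope parameter count for BART-large.
# Dimensions from the released checkpoint: d_model=1024, FFN 4096,
# vocab 50,265. Biases, layer norms, and positional embeddings omitted.
d_model, d_ffn, vocab = 1024, 4096, 50_265

embedding = vocab * d_model            # shared input/output embedding matrix
self_attn = 4 * d_model * d_model      # Q, K, V, and output projections
ffn       = 2 * d_model * d_ffn        # two linear layers per FFN block

encoder_layer = self_attn + ffn                 # bidirectional self-attn + FFN
decoder_layer = 2 * self_attn + ffn             # causal self-attn + cross-attn + FFN

total = embedding + 12 * encoder_layer + 12 * decoder_layer
print(f"{total / 1e6:.0f}M parameters")         # roughly 400M
```

The estimate lands within a few percent of the commonly cited figure, which is close enough to confirm the headline number.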
Five noise types, one winner:
The paper evaluates five ways to corrupt text:
1. TOKEN MASKING (like BERT)
Original: "The cat sat on the mat"
Corrupted: "The [M] sat on the [M]"
Model knows: exactly N tokens are missing
2. TOKEN DELETION
Original: "The cat sat on the mat"
Corrupted: "The cat on the mat"
Model must find: which positions were deleted
3. TEXT INFILLING ← the novel contribution
Original: "The cat sat on the mat"
Corrupted: "The [M] mat" ← "cat sat on the" → single [M]
Model must infer: WHAT was removed AND HOW MANY tokens
4. SENTENCE PERMUTATION
Original: "Sent A. Sent B. Sent C."
Corrupted: "Sent C. Sent A. Sent B."
Model must restore: sentence order
5. DOCUMENT ROTATION
Original: "Token1 Token2 Token3..."
Corrupted: "Token3... Token1 Token2"
Model must identify: the start token
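Each of the five corruptions can be sketched in a few lines. This is a toy illustration on whitespace tokens, not the paper's implementation (which operates on BPE subwords and samples span lengths from Poisson(3)):

```python
import random

MASK = "[M]"

def token_masking(tokens, idxs):
    """BERT-style 1:1 masking: each selected token becomes [M]."""
    return [MASK if i in idxs else t for i, t in enumerate(tokens)]

def token_deletion(tokens, idxs):
    """Delete selected tokens; the model must find the missing positions."""
    return [t for i, t in enumerate(tokens) if i not in idxs]

def text_infilling(tokens, start, length):
    """Replace a whole span (possibly empty) with a single [M]."""
    return tokens[:start] + [MASK] + tokens[start + length:]

def sentence_permutation(sentences, rng):
    """Shuffle sentence order; the model must restore it."""
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    return shuffled

def document_rotation(tokens, pivot):
    """Rotate so the document starts mid-stream; model must find the true start."""
    return tokens[pivot:] + tokens[:pivot]

tokens = "The cat sat on the mat".split()
print(token_masking(tokens, {1, 5}))   # ['The', '[M]', 'sat', 'on', 'the', '[M]']
print(token_deletion(tokens, {2}))     # ['The', 'cat', 'on', 'the', 'mat']
print(text_infilling(tokens, 1, 4))    # ['The', '[M]', 'mat']
print(document_rotation(tokens, 4))    # ['the', 'mat', 'The', 'cat', 'sat', 'on']
print(sentence_permutation(["Sent A.", "Sent B.", "Sent C."], random.Random(0)))
```

Note that `text_infilling(tokens, 1, 4)` reproduces the "The [M] mat" example above: four tokens collapse into one mask, so the corrupted length no longer reveals the answer length.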
The winner: text infilling + sentence permutation combined. Text infilling alone is the strongest single strategy across both generation and comprehension tasks. Sentence permutation performs poorly in isolation but adds a further gain when combined with infilling. Together they dominate on everything.
Why text infilling is the key insight:
BERT uses 1:1 token masking — one [MASK] per missing token. The model always knows exactly how many tokens to predict. It’s a fill-in-the-blank crossword: the number of blanks tells you the answer length.
BART’s text infilling replaces contiguous spans of variable length (drawn from a Poisson distribution with mean 3) with a single [MASK] token. One [MASK] might hide zero words, one word, or twelve words. The model has to predict both the content of the missing span AND its length.
“Text Infilling: A number of text spans are sampled, with span lengths drawn from a Poisson distribution (λ = 3). Each span is replaced with a single mask token. 0-length spans correspond to the insertion of mask tokens.”
This is strictly harder than BERT masking. The model must learn to look at surrounding context and reason: “given what comes before and after this [MASK], how long was the passage that’s missing?” That richer understanding is what makes BART particularly strong at generation.
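To see how variable those span lengths are, the Poisson(λ=3) probabilities follow directly from the pmf P(k) = λ^k e^(−λ) / k!. A quick stdlib computation (my own, not from the paper):

```python
import math

lam = 3  # span-length distribution used by BART's text infilling

def pmf(k):
    """Poisson probability of drawing a span of length k."""
    return lam**k * math.exp(-lam) / math.factorial(k)

for k in range(7):
    print(f"P(span length = {k}) = {pmf(k):.3f}")
```

About 5% of spans have length zero, meaning the [MASK] hides nothing at all and the model must learn to emit an empty reconstruction for it, while the most likely lengths are 2 and 3 tokens.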
The ASCII diagram:
PRETRAINING:
Original text: "Paris is the capital of France and home to the Eiffel Tower"
Noising (text infilling, span = 4 tokens "capital of France and"):
Corrupted: "Paris is the [M] home to the Eiffel Tower"
┌──────────────────────────────────────────┐
Encoder input: ── │ Paris is the [M] home to the ... │ (bidirectional)
(corrupted) │ ↕ ↕ ↕ ↕ ↕ ↕ ↕ │
│ full context both directions │
└──────────────────┬───────────────────────┘
│ encoder hidden states
│ (cross-attention)
┌──────────────────▼───────────────────────┐
Decoder output: ── │ Paris is the capital of France ... │ (left-to-right)
(original text) └──────────────────────────────────────────┘
FINE-TUNING (summarization):
Article text ──→ [Encoder] ──→ hidden states ──→ [Decoder] ──→ Summary sentence
(reads full (generates summary
article) autoregressively)
Walkthrough with actual numbers:
Let’s trace text infilling on a concrete example. The probabilities below are illustrative, not measured from a trained model; BART’s actual vocabulary has 50,265 tokens, so each step is a softmax over that many candidates.
Original (6 tokens): "Paris is the capital of France"
Span length drawn from Poisson(λ=3): say we draw 4
Span selected: "the capital of France" (tokens 3–6)
Corrupted (3 tokens): "Paris is [MASK]"
ENCODER processes: ["Paris", "is", "[MASK]"]
Produces hidden states: h_Paris, h_is, h_MASK ← all computed bidirectionally
DECODER generates the original, token by token:
Step 1: input=<s> → cross-attend to encoder → P(Paris) = 0.89
loss = -log(0.89) = 0.117 nats
Step 2: input=<s> Paris → cross-attend to encoder → P(is) = 0.93
loss = -log(0.93) = 0.073 nats
Step 3: input=<s> Paris is → cross-attend to encoder → P(the) = 0.67
loss = -log(0.67) = 0.400 nats ← harder: the span's content and length are both unknown
Step 4: input=<s> Paris is the → cross-attend to encoder → P(capital) = 0.78
loss = -log(0.78) = 0.248 nats
Step 5: input=... capital → P(of) = 0.91 → loss = 0.094 nats
Step 6: input=... of → P(France) = 0.94 → loss = 0.062 nats
Step 7: input=... France → P(</s>) = 0.96 → loss = 0.041 nats
Total reconstruction loss = 0.117 + 0.073 + 0.400 + 0.248 + 0.094 + 0.062 + 0.041
= 1.035 nats
BERT-style masking of the same 4 tokens would give 4 separate [MASK] tokens,
telling the model: "exactly 4 tokens missing here." The task is easier —
the model never has to infer span length. BART's harder objective produces richer representations.
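The per-step arithmetic in the trace is just summed token-level cross-entropy. Reproducing it with the illustrative probabilities from above:

```python
import math

# Illustrative per-step probabilities from the walkthrough
# (hypothetical values, not measured from a real model).
step_probs = [0.89, 0.93, 0.67, 0.78, 0.91, 0.94, 0.96]

losses = [-math.log(p) for p in step_probs]
total = sum(losses)

for i, (p, loss) in enumerate(zip(step_probs, losses), start=1):
    print(f"step {i}: P = {p:.2f} -> loss = {loss:.3f} nats")
print(f"total reconstruction loss = {total:.3f} nats")  # prints 1.035
```

The exact sum is about 1.035 nats; summing per-step values after rounding to three decimals can differ in the last digit. Step 3 dominates the total, which is exactly where the model must commit to the hidden span's content and length.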
At decode step 3 (generating “the”), the cross-attention scores over encoder tokens tell the story:
- h_Paris gets low weight (~0.08) — not relevant for infilling
- h_is gets moderate weight (~0.20) — provides grammatical context
- h_[MASK] gets high weight (~0.72) — the model’s attention is on the corrupted region
The model has learned to focus on the masked region to predict what’s missing, while using surrounding tokens for grammatical and semantic constraints.
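That attention pattern is just a softmax over per-position scores. A toy recomputation, where the logits are hypothetical values chosen to reproduce roughly the weights above:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cross-attention logits at decode step 3,
# one per encoder position: [h_Paris, h_is, h_MASK].
logits = [-2.2, -1.3, 0.0]
weights = softmax(logits)

for name, w in zip(["h_Paris", "h_is", "h_[MASK]"], weights):
    print(f"{name}: {w:.2f}")   # ~0.08, ~0.20, ~0.72
```

Whatever the raw scores, softmax guarantees the weights are positive and sum to 1, so "high weight on [MASK]" always comes at the expense of the surrounding tokens.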
What’s clever — find the instinct:
The seq2seq architecture for pre-training wasn’t new (T5 was developed in parallel). The clever move was recognizing that the choice of noise function isn’t a detail — it determines what the model learns. Previous work treated the noising function as an afterthought. BART’s ablations make it the central variable.
“We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme.”
The deeper insight: because the decoder must reconstruct the entire original (not just the masked spans), BART’s training is a stronger signal than BERT’s. BERT predicts 15% of tokens. BART’s decoder predicts 100% of tokens — it sees every position as a potential training signal. This is why BART fine-tunes better for generation tasks than BERT: it has spent pre-training practicing generation.
“Unlike GPT, we see all of the original sentence during pretraining due to the bidirectional encoder. Unlike BERT, we generate token by token with an autoregressive decoder.”
Results
Summarization (the killer result):
| Task | BART | Previous SOTA | Gain |
|---|---|---|---|
| CNN/DailyMail (R-1) | 44.16 | 43.68 (ProphetNet) | +0.48 |
| CNN/DailyMail (R-2) | 21.28 | 20.64 | +0.64 |
| XSum (R-1) | 45.14 | 38.76 (BERTSumExtAbs) | +6.38 |
| XSum (R-2) | 22.27 | 16.33 | +5.94 |
| XSum (R-L) | 37.25 | 31.15 | +6.10 |
The XSum result is remarkable. XSum requires abstractive summarization — the summary uses different words than the article. BART’s generative pre-training directly targets this; extractive models (which copy spans) can’t compete. This +6 ROUGE gain is not incremental.
Comprehension (matches the best encoder-only models):
| Task | BART | RoBERTa-large |
|---|---|---|
| SQuAD 1.1 (EM / F1) | 88.8 / 94.6 | 88.9 / 94.6 |
| GLUE (avg.) | 88.4 | 88.5 |
BART matches RoBERTa-large on comprehension tasks — using the same pre-training data and compute — while also being a strong generative model. RoBERTa cannot generate. BART can do both.
Noise function ablation (which corruption matters):
| Noising scheme | XSum R-2 | CNN/DM R-2 |
|---|---|---|
| Token masking | 6.1 | 12.8 |
| Token deletion | 8.2 | 14.5 |
| Text infilling | 13.5 | 16.5 |
| Sent. permutation | 7.8 | 13.9 |
| Text infilling + sent. perm. | 14.5 | 17.1 |
Token masking (pure BERT-style) is the weakest. Text infilling is the single strongest component. The combination is the winner.
What breaks:
BART’s 400M parameters make it expensive relative to encoder-only models for comprehension-only tasks. If you only need classification, RoBERTa with 355M parameters is simpler and equally good.
Text infilling doesn’t help with factual tasks where the model needs to retrieve specific facts (named entities, dates). A model that has memorized Wikipedia will outperform BART on knowledge-intensive QA tasks — this is the gap that RAG systems later addressed.
The seq2seq architecture is also slower to fine-tune than encoder-only models due to the additional decoder. Fine-tuning BART for classification is possible (feed the same input to both encoder and decoder and classify from the final decoder token’s hidden state; BART has no [CLS] token) but adds complexity.
Practitioner notes
If you’re building an NLP system that involves generating text from text — summarization, abstractive QA, dialogue, paraphrase — BART is where you start. Fine-tune facebook/bart-large-cnn (for news-style summarization) or facebook/bart-large (for other generation tasks). The fine-tuning signal is clean: feed the source through the encoder, train the decoder to generate the target. Standard cross-entropy, standard teacher forcing.
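"Standard teacher forcing" amounts to shifting the target right by one position: at every step the decoder sees the gold prefix and is trained to predict the next gold token. A framework-free sketch of that alignment (toy tokens, my own illustration, not the transformers API):

```python
# Teacher forcing for seq2seq fine-tuning: build (decoder prefix, label)
# pairs from one target sequence. <s> and </s> mark sequence boundaries.
BOS, EOS = "<s>", "</s>"

def teacher_forcing_pairs(target_tokens):
    """Return one (decoder_input_prefix, next_token_label) pair per step."""
    sequence = [BOS] + target_tokens + [EOS]
    return [(sequence[: i + 1], sequence[i + 1]) for i in range(len(sequence) - 1)]

summary = "Paris is the capital".split()
for prefix, label in teacher_forcing_pairs(summary):
    print(f"decoder sees {prefix!r} -> predict {label!r}")
```

In practice a framework computes all steps in one forward pass with a causal mask, but the training signal is exactly this pairing: cross-entropy between each predicted next token and the gold one.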
For tasks where you need both comprehension AND generation from the same model — for example, a document QA system that generates full answers rather than extracting spans — BART is the architecture to reach for. You don’t have to pick between BERT-style and GPT-style; BART gives you both from a single pre-trained checkpoint.
For summarization specifically: XSum-fine-tuned BART remains competitive years after publication. The +6 ROUGE gap over prior SOTA reflected a genuine qualitative improvement in abstractiveness, not just benchmark overfitting.
The T5 paper (published in parallel) pushed the same idea further — unified text-to-text for everything, more compute, better scaling — but BART remains the cleaner paper for understanding why seq2seq pre-training with the right noise function works. The BART ablations are one of the better empirical studies of what pre-training objectives actually learn.
Later generation-capable pre-trained models, including mBART, PEGASUS, mT5, and FLAN-T5, build on the recipe BART (and, concurrently, T5) demonstrated: encode corrupted, decode clean, and everything else falls into place.
Connections
- encoder-decoder — the architecture BART uses: bidirectional encoder + autoregressive decoder
- pre-training — BART’s denoising objective is the pre-training task
- denoising — text infilling and other noise functions are the core training signal
- masked-language-model — BART’s text infilling generalizes BERT’s masked token prediction
- fine-tuning — BART is fine-tuned for generation (summarization, translation) and comprehension tasks
- attention-is-all-you-need — BART uses the standard seq2seq Transformer architecture from this paper
- t5-exploring-the-limits-of-transfer-learning — T5 pushes the same seq2seq pre-training idea further with unified text-to-text
- bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding — BART’s encoder generalizes BERT; text infilling supersedes masked token prediction
Citation
Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2020). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7871–7880.