By late 2019, the NLP community had two powerful but incompatible pre-training recipes. BERT’s bidirectional encoder was exceptional at understanding — it could see the full context on both sides of every token, making it ideal for classification, NER, and question answering. But BERT couldn’t generate text: its architecture had no decoder, no autoregressive mechanism. GPT’s causal decoder was exceptional at generation — it produced fluent text by predicting one token at a time — but saw only left context, making it weaker at comprehension tasks. Every practitioner had to pick a side. BART asked: what if you didn’t have to?
The core idea
The analogy: Think of a restoration archivist handed a damaged manuscript. The manuscript has passages blacked out, sentences scrambled, words missing. The archivist reads the whole damaged document carefully — using context from both before and after each damage — and then writes out the clean version from scratch. This process trains two distinct skills simultaneously: deep bidirectional reading (to understand what’s damaged and infer what’s missing) and fluent sequential writing (to produce the clean reconstruction).
That’s BART. The encoder reads a corrupted version of a document. The decoder generates the original clean document autoregressively, cross-attending to the encoder’s understanding at each generation step. Train long enough on this task and you have a model that’s simultaneously a strong bidirectional encoder (like BERT) and a strong autoregressive decoder (like GPT).
The mechanism, step by step:
- Take a training document (any text).
- Apply a noising function that corrupts the text: delete tokens, shuffle sentences, mask spans.
- The corrupted document is fed to a bidirectional Transformer encoder — it processes all corrupted tokens in parallel, seeing full context in both directions.
- A causal Transformer decoder generates the original clean document one token at a time, using causal self-attention over its own outputs so far, plus cross-attention to the encoder’s output at every layer.
- The training loss is the standard cross-entropy between the decoder’s output and the original text.
The architecture is the standard seq2seq Transformer; the only tweak is GeLU activations in place of ReLU. BART-large has 12 encoder and 12 decoder layers, about 400M parameters total, a scale comparable to RoBERTa-large's 355M. No new architectural components. The entire contribution is in what you train it to do.
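As a rough sanity check on that "about 400M" figure, here is a back-of-envelope estimate (my own approximation, ignoring biases, layer norms, and positional embeddings) using BART-large's published dimensions:

```python
# Back-of-envelope parameter count for BART-large.
# Dimensions from the released checkpoint: d_model=1024, FFN 4096,
# vocab 50,265. Biases, layer norms, and positional embeddings omitted.
d_model, d_ffn, vocab = 1024, 4096, 50_265

embedding = vocab * d_model            # shared input/output embedding matrix
self_attn = 4 * d_model * d_model      # Q, K, V, and output projections
ffn       = 2 * d_model * d_ffn        # two linear layers per FFN block

encoder_layer = self_attn + ffn                 # bidirectional self-attn + FFN
decoder_layer = 2 * self_attn + ffn             # causal self-attn + cross-attn + FFN

total = embedding + 12 * encoder_layer + 12 * decoder_layer
print(f"{total / 1e6:.0f}M parameters")         # roughly 400M
```

The estimate lands within a few percent of the commonly cited figure, which is close enough to confirm the headline number.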
Five noise types, one winner:
The paper evaluates five ways to corrupt text:
1. TOKEN MASKING (like BERT)
Original: "The cat sat on the mat"
Corrupted: "The [M] sat on the [M]"
Model knows: exactly N tokens are missing
2. TOKEN DELETION
Original: "The cat sat on the mat"
Corrupted: "The cat on the mat"
Model must find: which positions were deleted
3. TEXT INFILLING ← the novel contribution
Original: "The cat sat on the mat"
Corrupted: "The [M] mat" ← "cat sat on the" → single [M]
Model must infer: WHAT was removed AND HOW MANY tokens
4. SENTENCE PERMUTATION
Original: "Sent A. Sent B. Sent C."
Corrupted: "Sent C. Sent A. Sent B."
Model must restore: sentence order
5. DOCUMENT ROTATION
Original: "Token1 Token2 Token3..."
Corrupted: "Token3... Token1 Token2"
Model must identify: the start token
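Each of the five corruptions can be sketched in a few lines. This is a toy illustration on whitespace tokens, not the paper's implementation (which operates on BPE subwords and samples span lengths from Poisson(3)):

```python
import random

MASK = "[M]"

def token_masking(tokens, idxs):
    """BERT-style 1:1 masking: each selected token becomes [M]."""
    return [MASK if i in idxs else t for i, t in enumerate(tokens)]

def token_deletion(tokens, idxs):
    """Delete selected tokens; the model must find the missing positions."""
    return [t for i, t in enumerate(tokens) if i not in idxs]

def text_infilling(tokens, start, length):
    """Replace a whole span (possibly empty) with a single [M]."""
    return tokens[:start] + [MASK] + tokens[start + length:]

def sentence_permutation(sentences, rng):
    """Shuffle sentence order; the model must restore it."""
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    return shuffled

def document_rotation(tokens, pivot):
    """Rotate so the document starts mid-stream; model must find the true start."""
    return tokens[pivot:] + tokens[:pivot]

tokens = "The cat sat on the mat".split()
print(token_masking(tokens, {1, 5}))   # ['The', '[M]', 'sat', 'on', 'the', '[M]']
print(token_deletion(tokens, {2}))     # ['The', 'cat', 'on', 'the', 'mat']
print(text_infilling(tokens, 1, 4))    # ['The', '[M]', 'mat']
print(document_rotation(tokens, 4))    # ['the', 'mat', 'The', 'cat', 'sat', 'on']
print(sentence_permutation(["Sent A.", "Sent B.", "Sent C."], random.Random(0)))
```

Note that `text_infilling(tokens, 1, 4)` reproduces the "The [M] mat" example above: four tokens collapse into one mask, so the corrupted length no longer reveals the answer length.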
The winner: text infilling + sentence permutation combined. Text infilling alone is the strongest single strategy across both generation and comprehension tasks. Sentence permutation performs poorly in isolation but adds a further gain when combined with infilling. Together they dominate on everything.
Why text infilling is the key insight:
BERT uses 1:1 token masking — one [MASK] per missing token. The model always knows exactly how many tokens to predict. It’s a fill-in-the-blank crossword: the number of blanks tells you the answer length.
BART’s text infilling replaces contiguous spans of variable length (drawn from a Poisson distribution with mean 3) with a single [MASK] token. One [MASK] might hide zero words, one word, or twelve words. The model has to predict both the content of the missing span AND its length.
“Text Infilling: A number of text spans are sampled, with span lengths drawn from a Poisson distribution (λ = 3). Each span is replaced with a single mask token. 0-length spans correspond to the insertion of mask tokens.”
This is strictly harder than BERT masking. The model must learn to look at surrounding context and reason: “given what comes before and after this [MASK], how long was the passage that’s missing?” That richer understanding is what makes BART particularly strong at generation.
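To see how variable those span lengths are, the Poisson(λ=3) probabilities follow directly from the pmf P(k) = λ^k e^(−λ) / k!. A quick stdlib computation (my own, not from the paper):

```python
import math

lam = 3  # span-length distribution used by BART's text infilling

def pmf(k):
    """Poisson probability of drawing a span of length k."""
    return lam**k * math.exp(-lam) / math.factorial(k)

for k in range(7):
    print(f"P(span length = {k}) = {pmf(k):.3f}")
```

About 5% of spans have length zero, meaning the [MASK] hides nothing at all and the model must learn to emit an empty reconstruction for it, while the most likely lengths are 2 and 3 tokens.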
The ASCII diagram:
PRETRAINING:
Original text: "Paris is the capital of France and home to the Eiffel Tower"
Noising (text infilling, span = 4 tokens "capital of France and"):
Corrupted: "Paris is the [M] home to the Eiffel Tower"
┌──────────────────────────────────────────┐
Encoder input: ── │ Paris is the [M] home to the ... │ (bidirectional)
(corrupted) │ ↕ ↕ ↕ ↕ ↕ ↕ ↕ │
│ full context both directions │
└──────────────────┬───────────────────────┘
│ encoder hidden states
│ (cross-attention)
┌──────────────────▼───────────────────────┐
Decoder output: ── │ Paris is the capital of France ... │ (left-to-right)
(original text) └──────────────────────────────────────────┘
FINE-TUNING (summarization):
Article text ──→ [Encoder] ──→ hidden states ──→ [Decoder] ──→ Summary sentence
(reads full (generates summary
article) autoregressively)
Walkthrough with actual numbers:
Let’s trace text infilling on a concrete example. The probabilities below are illustrative, not measured from a trained model; BART’s actual vocabulary has 50,265 tokens, so each step is a softmax over that many candidates.
Original (6 tokens): "Paris is the capital of France"
Span length drawn from Poisson(λ=3): say we draw 4
Span selected: "the capital of France" (tokens 3–6)
Corrupted (3 tokens): "Paris is [MASK]"
ENCODER processes: ["Paris", "is", "[MASK]"]
Produces hidden states: h_Paris, h_is, h_MASK ← all computed bidirectionally
DECODER generates the original, token by token:
Step 1: input=<s> → cross-attend to encoder → P(Paris) = 0.89
loss = -log(0.89) = 0.117 nats
Step 2: input=<s> Paris → cross-attend to encoder → P(is) = 0.93
loss = -log(0.93) = 0.073 nats
Step 3: input=<s> Paris is → cross-attend to encoder → P(the) = 0.67
loss = -log(0.67) = 0.400 nats ← harder: the span's content and length are both unknown
Step 4: input=<s> Paris is the → cross-attend to encoder → P(capital) = 0.78
loss = -log(0.78) = 0.248 nats
Step 5: input=... capital → P(of) = 0.91 → loss = 0.094 nats
Step 6: input=... of → P(France) = 0.94 → loss = 0.062 nats
Step 7: input=... France → P(</s>) = 0.96 → loss = 0.041 nats
Total reconstruction loss = 0.117 + 0.073 + 0.400 + 0.248 + 0.094 + 0.062 + 0.041
= 1.035 nats
BERT-style masking of the same 4 tokens would give 4 separate [MASK] tokens,
telling the model: "exactly 4 tokens missing here." The task is easier —
the model never has to infer span length. BART's harder objective produces richer representations.
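The per-step arithmetic in the trace is just summed token-level cross-entropy. Reproducing it with the illustrative probabilities from above:

```python
import math

# Illustrative per-step probabilities from the walkthrough
# (hypothetical values, not measured from a real model).
step_probs = [0.89, 0.93, 0.67, 0.78, 0.91, 0.94, 0.96]

losses = [-math.log(p) for p in step_probs]
total = sum(losses)

for i, (p, loss) in enumerate(zip(step_probs, losses), start=1):
    print(f"step {i}: P = {p:.2f} -> loss = {loss:.3f} nats")
print(f"total reconstruction loss = {total:.3f} nats")  # prints 1.035
```

The exact sum is about 1.035 nats; summing per-step values after rounding to three decimals can differ in the last digit. Step 3 dominates the total, which is exactly where the model must commit to the hidden span's content and length.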
At decode step 3 (generating “the”), the cross-attention scores over encoder tokens tell the story:
- h_Paris gets low weight (~0.08) — not relevant for infilling
- h_is gets moderate weight (~0.20) — provides grammatical context
- h_[MASK] gets high weight (~0.72) — the model’s attention is on the corrupted region
The model has learned to focus on the masked region to predict what’s missing, while using surrounding tokens for grammatical and semantic constraints.
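That attention pattern is just a softmax over per-position scores. A toy recomputation, where the logits are hypothetical values chosen to reproduce roughly the weights above:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cross-attention logits at decode step 3,
# one per encoder position: [h_Paris, h_is, h_MASK].
logits = [-2.2, -1.3, 0.0]
weights = softmax(logits)

for name, w in zip(["h_Paris", "h_is", "h_[MASK]"], weights):
    print(f"{name}: {w:.2f}")   # ~0.08, ~0.20, ~0.72
```

Whatever the raw scores, softmax guarantees the weights are positive and sum to 1, so "high weight on [MASK]" always comes at the expense of the surrounding tokens.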
What’s clever — find the instinct:
The seq2seq architecture for pre-training wasn’t new (T5 was developed in parallel). The clever move was recognizing that the choice of noise function isn’t a detail — it determines what the model learns. Previous work treated the noising function as an afterthought. BART’s ablations make it the central variable.
“We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme.”
The deeper insight: because the decoder must reconstruct the entire original (not just the masked spans), BART’s training is a stronger signal than BERT’s. BERT predicts 15% of tokens. BART’s decoder predicts 100% of tokens — it sees every position as a potential training signal. This is why BART fine-tunes better for generation tasks than BERT: it has spent pre-training practicing generation.
“Unlike GPT, we see all of the original sentence during pretraining due to the bidirectional encoder. Unlike BERT, we generate token by token with an autoregressive decoder.”
Results
Summarization (the killer result):
| Task | BART | Previous SOTA | Gain |
|---|---|---|---|
| CNN/DailyMail (R-1) | 44.16 | 43.68 (ProphetNet) | +0.48 |
| CNN/DailyMail (R-2) | 21.28 | 20.64 | +0.64 |
| XSum (R-1) | 45.14 | 38.76 (BERTSumExtAbs) | +6.38 |
| XSum (R-2) | 22.27 | 16.33 | +5.94 |
| XSum (R-L) | 37.25 | 31.15 | +6.10 |
The XSum result is remarkable. XSum requires abstractive summarization — the summary uses different words than the article. BART’s generative pre-training directly targets this; extractive models (which copy spans) can’t compete. This +6 ROUGE gain is not incremental.
Comprehension (matches the best encoder-only models):
| Task | BART | RoBERTa-large |
|---|---|---|
| SQuAD 1.1 (EM / F1) | 88.8 / 94.6 | 88.9 / 94.6 |
| GLUE (avg.) | 88.4 | 88.5 |
BART matches RoBERTa-large on comprehension tasks — using the same pre-training data and compute — while also being a strong generative model. RoBERTa cannot generate. BART can do both.
Noise function ablation (which corruption matters):
| Noising scheme | XSum R-2 | CNN/DM R-2 |
|---|---|---|
| Token masking | 6.1 | 12.8 |
| Token deletion | 8.2 | 14.5 |
| Text infilling | 13.5 | 16.5 |
| Sent. permutation | 7.8 | 13.9 |
| Text infilling + sent. perm. | 14.5 | 17.1 |
Token masking (pure BERT-style) is the weakest. Text infilling is the single strongest component. The combination is the winner.
What breaks:
BART’s 400M parameters make it expensive relative to encoder-only models for comprehension-only tasks. If you only need classification, RoBERTa with 355M parameters is simpler and equally good.
Text infilling doesn’t help with factual tasks where the model needs to retrieve specific facts (named entities, dates). A model that has memorized Wikipedia will outperform BART on knowledge-intensive QA tasks — this is the gap that RAG systems later addressed.
The seq2seq architecture is also slower to fine-tune than encoder-only models due to the additional decoder. Fine-tuning BART for classification is possible (feed the same input to both encoder and decoder and classify from the final decoder token’s hidden state; BART has no [CLS] token) but adds complexity.
Practitioner notes
If you’re building an NLP system that involves generating text from text — summarization, abstractive QA, dialogue, paraphrase — BART is where you start. Fine-tune facebook/bart-large-cnn (for news-style summarization) or facebook/bart-large (for other generation tasks). The fine-tuning signal is clean: feed the source through the encoder, train the decoder to generate the target. Standard cross-entropy, standard teacher forcing.
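"Standard teacher forcing" amounts to shifting the target right by one position: at every step the decoder sees the gold prefix and is trained to predict the next gold token. A framework-free sketch of that alignment (toy tokens, my own illustration, not the transformers API):

```python
# Teacher forcing for seq2seq fine-tuning: build (decoder prefix, label)
# pairs from one target sequence. <s> and </s> mark sequence boundaries.
BOS, EOS = "<s>", "</s>"

def teacher_forcing_pairs(target_tokens):
    """Return one (decoder_input_prefix, next_token_label) pair per step."""
    sequence = [BOS] + target_tokens + [EOS]
    return [(sequence[: i + 1], sequence[i + 1]) for i in range(len(sequence) - 1)]

summary = "Paris is the capital".split()
for prefix, label in teacher_forcing_pairs(summary):
    print(f"decoder sees {prefix!r} -> predict {label!r}")
```

In practice a framework computes all steps in one forward pass with a causal mask, but the training signal is exactly this pairing: cross-entropy between each predicted next token and the gold one.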
For tasks where you need both comprehension AND generation from the same model — for example, a document QA system that generates full answers rather than extracting spans — BART is the architecture to reach for. You don’t have to pick between BERT-style and GPT-style; BART gives you both from a single pre-trained checkpoint.
For summarization specifically: XSum-fine-tuned BART remains competitive years after publication. The +6 ROUGE gap over prior SOTA reflected a genuine qualitative improvement in abstractiveness, not just benchmark overfitting.
The T5 paper (published in parallel) pushed the same idea further — unified text-to-text for everything, more compute, better scaling — but BART remains the cleaner paper for understanding why seq2seq pre-training with the right noise function works. The BART ablations are one of the better empirical studies of what pre-training objectives actually learn.
Later generation-capable pre-trained models, including mBART, PEGASUS, mT5, and FLAN-T5, build on the recipe BART (and, concurrently, T5) demonstrated: encode corrupted, decode clean, and everything else falls into place.
Connections
- encoder-decoder — the architecture BART uses: bidirectional encoder + autoregressive decoder
- pre-training — BART’s denoising objective is the pre-training task
- denoising — text infilling and other noise functions are the core training signal
- masked-language-model — BART’s text infilling generalizes BERT’s masked token prediction
- fine-tuning — BART is fine-tuned for generation (summarization, translation) and comprehension tasks
- attention-is-all-you-need — BART uses the standard seq2seq Transformer architecture from this paper
- t5-exploring-the-limits-of-transfer-learning — T5 pushes the same seq2seq pre-training idea further with unified text-to-text
- bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding — BART’s encoder generalizes BERT; text infilling supersedes masked token prediction
Citation
Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2020). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7871–7880.