Concepts: transfer-learning | pre-training | encoder-decoder | fine-tuning | scaling-laws Builds on: attention-is-all-you-need Leads to: switch-transformer-sparse-mixture-of-experts
NLP in 2019 had a problem: too many recipes. BERT for classification, GPT for generation, seq2seq models for translation, extractive models for QA. Each task had its own architecture, its own training objective, its own fine-tuning protocol. Comparing them was like comparing recipes written in different units — you couldn’t tell if BERT beat a seq2seq model on QA because of the architecture, the pre-training objective, the data, or the fine-tuning strategy.
T5 asked: what if we made everything the same?
The core idea
The analogy: A Swiss army knife vs. a drawer full of specialized tools. The specialists may win on their single task, but the Swiss army knife lets you ask: which blade shape actually matters most? T5 is the Swiss army knife — same body, same training, same objective for every task. Now you can isolate variables and actually learn what makes transfer learning work.
The mechanism is almost embarrassingly simple:
“The basic idea underlying our work is to treat every text processing problem as a ‘text-to-text’ problem, i.e. taking text as input and producing new text as output.”
Translation? Text-to-text. Classification? Text-to-text. Summarization? Text-to-text. Question answering? Text-to-text. To tell the model which task it’s doing, you prepend a text prefix:
- "translate English to German: That is good." → "Das ist gut."
- "summarize: The quick brown fox..." → "A fox jumps over a dog."
- "mnli premise: I hate pigeons. hypothesis: My feelings towards pigeons are filled with animosity." → "entailment"
- "question: Who wrote Hamlet? context: William Shakespeare wrote Hamlet in..." → "William Shakespeare"
Everything is just text in, text out. Same loss function: standard cross-entropy. Same optimizer: AdaFactor. Same decoding: greedy.
```
BEFORE T5 — TASK-SPECIFIC ARCHITECTURES:
  Classification → [CLS] token → linear head → class logits
  Summarization  → encoder-decoder → generate summary
  QA             → extract span from passage → start/end indices
  Translation    → seq2seq → generate target language
  (Different models, objectives, fine-tuning for each)

AFTER T5 — ONE FRAMEWORK:
  All tasks → "prefix: input text" → T5 → "output text"
  Classification: "mnli premise: ... hypothesis: ..." → "entailment"
  Summarization:  "summarize: ..." → "A fox jumps..."
  QA:             "question: ... context: ..." → "William Shakespeare"
  Translation:    "translate English to German: ..." → "Das ist gut."
  (Same model, same loss, same everything)
```
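The unified framing above can be sketched in a few lines. This is a minimal illustration, not any library's API: `format_example` is a hypothetical helper, and the prefixes follow the paper's examples.

```python
# Sketch of the text-to-text framing: every task becomes "prefix: input" -> "output".
# format_example is a hypothetical helper for illustration, not part of any T5 library.

def format_example(task: str, **fields) -> str:
    """Serialize a task instance into a single prefixed input string."""
    if task == "translate":
        return f"translate {fields['src']} to {fields['tgt']}: {fields['text']}"
    if task == "summarize":
        return f"summarize: {fields['text']}"
    if task == "mnli":
        return f"mnli premise: {fields['premise']} hypothesis: {fields['hypothesis']}"
    if task == "qa":
        return f"question: {fields['question']} context: {fields['context']}"
    raise ValueError(f"unknown task: {task}")

print(format_example("translate", src="English", tgt="German", text="That is good."))
# -> "translate English to German: That is good."
```

The model never sees task IDs or task-specific heads; the prefix string is the entire task specification.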
The pre-training objective: span corruption
T5 pre-trains on the C4 dataset (Colossal Clean Crawled Corpus — 750GB of cleaned web text) using a denoising objective. Specifically: randomly mask 15% of tokens, replace each consecutive span of masked tokens with a single sentinel token, then train the model to predict all the dropped spans.
Input: "Thank you <X> me to your party <Y> week."
Target: "<X> for inviting <Y> last <Z>"
“All consecutive spans of dropped-out tokens are replaced by a single sentinel token. Each sentinel token is assigned a token ID that is unique to the sequence.”
This is more efficient than BERT’s masked LM: you only predict the dropped tokens (not the full sequence), and sentinels compress multiple tokens into one. The encoder sees the gapped input; the decoder generates the missing spans.
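The sentinel mechanics can be sketched as follows. This is a simplified version assuming whitespace tokens and precomputed mask indices; the real implementation samples spans randomly over SentencePiece token IDs (T5's vocabulary reserves 100 sentinel tokens, `<extra_id_0>` through `<extra_id_99>`, playing the role of `<X>`, `<Y>`, `<Z>` above).

```python
# Minimal span-corruption sketch, assuming whitespace tokenization and
# fixed mask indices; real T5 samples the masked spans randomly.

def span_corrupt(tokens, masked_idx):
    """Replace each run of consecutive masked tokens with one sentinel;
    the target lists each sentinel followed by the tokens it replaced."""
    sentinels = [f"<extra_id_{i}>" for i in range(100)]  # T5 reserves 100 sentinels
    inp, tgt, s, i = [], [], 0, 0
    while i < len(tokens):
        if i in masked_idx:
            inp.append(sentinels[s])        # one sentinel stands in for the span
            tgt.append(sentinels[s])
            while i < len(tokens) and i in masked_idx:
                tgt.append(tokens[i])       # dropped tokens go to the target
                i += 1
            s += 1
        else:
            inp.append(tokens[i])
            i += 1
    tgt.append(sentinels[s])                # final sentinel terminates the target
    return " ".join(inp), " ".join(tgt)

tokens = "Thank you for inviting me to your party last week .".split()
inp, tgt = span_corrupt(tokens, masked_idx={2, 3, 8})
print(inp)  # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(tgt)  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```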
The systematic study
T5’s real contribution isn’t a new architecture — it’s a controlled experiment. The paper varies one factor at a time while holding everything else constant:
- Architecture: encoder-decoder vs. encoder-only vs. decoder-only
- Pre-training objective: span corruption vs. masked LM vs. prefix LM vs. causal LM
- Pre-training dataset: C4 vs. Wikipedia vs. Common Crawl (unfiltered)
- Transfer approach: full fine-tuning vs. adapter layers vs. gradual unfreezing
- Scale: 220M → 3B → 11B parameters
“We emphasize that our goal is not to propose new methods but instead to provide a comprehensive perspective on where the field stands.”
Find the instinct: why encoder-decoder subsumes encoder-only
The key architectural finding: encoder-decoder is strictly better than encoder-only for generative tasks (obviously) and matches encoder-only for classification tasks. This is surprising. BERT, an encoder-only model, dominated classification benchmarks in 2019. T5 shows that an encoder-decoder with the same total parameters achieves essentially the same classification performance — but can also translate, summarize, and generate freely.
Why? The encoder-decoder architecture can be used as an encoder-only model by just using the encoder’s representation. But it can’t be used in reverse. It’s strictly more general.
The span corruption objective also wins over simpler alternatives. Replacing spans with sentinels (rather than masking individual tokens) forces the model to model local coherence at the span level — a harder and more transferable objective.
Walkthrough with actual numbers: the span corruption objective
Take the sentence: "The quick brown fox jumps over the lazy dog." (9 tokens)
Step 1: Sample 15% of tokens for masking → ~1-2 tokens. Say tokens 3 and 5 are selected ("brown", "jumps").
Step 2: Identify consecutive spans:
- Span A: token 3 ("brown") — standalone
- Span B: token 5 ("jumps") — standalone
Step 3: Replace each span with a sentinel:
- Input: "The quick <X> fox <Y> over the lazy dog."
- Target: "<X> brown <Y> jumps <Z>"
Step 4: Compute cross-entropy loss on the target sequence only (not the full input). For a vocabulary of 32,000 tokens:

$L = -\sum_{t=1}^{T} \log p(y_t \mid y_{<t}, x), \quad T = 5$

where $T = 5$ counts only the dropped tokens plus sentinels.

For the token "brown" (vocabulary ID ~4,521), if the model assigns it probability 0.3:

$-\log(0.3) \approx 1.20$
Compare to a masked LM loss computed over all 9 tokens: T5's span corruption is nearly 2x more compute-efficient per training step, because the decoder only generates 5 tokens instead of 9.
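The arithmetic in Step 4 can be checked directly. The 0.3 for "brown" is the example value from the text; the other probabilities are made-up placeholders.

```python
import math

# Target from Step 3: the dropped spans plus sentinels -- 5 tokens, not 9.
target = ["<X>", "brown", "<Y>", "jumps", "<Z>"]
probs  = [0.9,   0.3,     0.8,   0.25,    0.95]   # hypothetical model outputs

# Per-token cross-entropy: -log p(y_t | y_<t, x)
per_token = [-math.log(p) for p in probs]

print(round(per_token[1], 2))  # loss on "brown": -log(0.3) ≈ 1.2
print(len(target))             # 5 decoder steps instead of 9
```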
C4: the dataset that matters as much as the architecture
The paper discovers that data quality has enormous impact. Raw Common Crawl is mostly garbage — menus, error messages, boilerplate, duplicate text. After filtering:
- Keep only lines ending in terminal punctuation
- Discard pages with fewer than 3 sentences; keep only lines with at least 5 words
- Remove any page containing words from an offensive-words blocklist
- Remove lines containing the word "javascript" and pages containing "lorem ipsum" or curly braces (a proxy for code)
- Deduplicate: any 3-sentence span appearing more than once is kept only once
Result: 750GB of clean English text — the C4 dataset.
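A rough sketch of such filtering heuristics, assuming naive sentence counting via periods; the real C4 pipeline uses more careful rules and a large blocklist of offensive words.

```python
# Toy C4-style filters: keep_line / keep_page are illustrative approximations,
# not the actual C4 implementation.

BAD_SUBSTRINGS = ("lorem ipsum", "{")  # pages containing these are dropped

def keep_line(line: str) -> bool:
    line = line.strip()
    return (
        line.endswith((".", "!", "?", '"'))   # terminal punctuation only
        and len(line.split()) >= 5            # at least 5 words
        and "javascript" not in line.lower()  # drop JS boilerplate lines
    )

def keep_page(text: str) -> bool:
    if any(s in text.lower() for s in BAD_SUBSTRINGS):
        return False
    kept = [l for l in text.splitlines() if keep_line(l)]
    return sum(l.count(".") for l in kept) >= 3  # crude "≥ 3 sentences" proxy

print(keep_line("Home | About | Contact"))        # False: menu boilerplate
print(keep_line("The fox jumped over the dog."))  # True
```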
Ablations show that using raw (unfiltered) Common Crawl produces measurably worse results. Data quality matters at scale.
Does it work? What breaks?
Baseline model (T5-Base, 220M parameters) vs. no pre-training:
| Task | With Pre-training | Without Pre-training | Gain |
|---|---|---|---|
| GLUE avg | 83.28 | 66.22 | +17.1 |
| SQuAD | 80.88 | 50.31 | +30.6 |
| SuperGLUE avg | 71.36 | 53.04 | +18.3 |
| EnDe BLEU | 26.98 | 25.86 | +1.1 |
| CNN/DM ROUGE-2 | 19.24 | 17.60 | +1.6 |
Translation (EnDe) shows minimal gain from pre-training — the training set is large enough that a good seq2seq model can learn it from scratch. Every other task shows massive gains, confirming that pre-training primarily helps in data-scarce settings.
T5-11B (the final model): Combined insights from the study + scale + more training steps → SOTA on GLUE, SuperGLUE, SQuAD, CNN/DM, and translation benchmarks at the time of publication. The 11B model trained on ~1T tokens outperforms all previous models on SuperGLUE, including ensemble models.
What doesn’t work:
Text-to-text has one awkward case: regression. STS-B asks the model to predict a similarity score between 1 and 5. The fix is to round to the nearest 0.2 and predict the string “2.6” — a classification hack. It works, but it’s inelegant. Tasks that require structured output (e.g. parse trees, molecule structures) are similarly awkward.
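The STS-B workaround amounts to quantizing the score and serializing it; a one-line sketch (the function name is ours, not the paper's):

```python
# Quantize a continuous similarity score to the nearest 0.2 and render it
# as a string target, turning regression into 25-way classification.

def score_to_text(score: float) -> str:
    return f"{round(score * 5) / 5:.1f}"

print(score_to_text(2.62))  # "2.6"
print(score_to_text(4.04))  # "4.0"
```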
The text prefix approach also requires that the model learn to distinguish tasks from prefix strings alone — no architectural separation. This means if the prefix is ambiguous or unusual, behavior is undefined. The paper notes that changing the exact wording of the prefix had limited impact, but that’s for the tasks they tested.
Multi-task learning without fine-tuning consistently underperforms task-specific fine-tuning. The unified framework works best as a pre-training approach, not a zero-shot approach.
Practitioner notes
If you’re building NLP pipelines, T5’s framework is now the default template. The text-to-text format means you can fine-tune a single model checkpoint on multiple tasks sequentially or jointly, and you can add a new task without any architectural changes — just a new prefix.
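Joint fine-tuning under this framework reduces to mixing prefixed examples into one stream. A minimal sketch with round-robin mixing (the paper also studies proportional and temperature-scaled mixing; the helper names here are ours):

```python
from itertools import cycle, islice

# One entry per task: adding a task is just adding a prefix and its examples.
tasks = {
    "summarize: ": [("The quick brown fox...", "A fox jumps...")],
    "translate English to German: ": [("That is good.", "Das ist gut.")],
    "question: ": [("Who wrote Hamlet? context: ...", "William Shakespeare")],
}

def prefixed(prefix, pairs):
    """Endlessly yield (input, target) pairs with the task prefix attached."""
    for x, y in cycle(pairs):
        yield prefix + x, y

def mixed_stream(tasks):
    """Round-robin over tasks, yielding one example from each in turn."""
    streams = [prefixed(p, pairs) for p, pairs in tasks.items()]
    for gen in cycle(streams):
        yield next(gen)

batch = list(islice(mixed_stream(tasks), 3))
print(batch[0][0])  # "summarize: The quick brown fox..."
```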
The span corruption pre-training objective is implemented in most modern seq2seq pre-training runs (BART uses a similar but more aggressive corruption). If you’re pre-training a seq2seq model from scratch, span corruption at 15% is a safe default.
The C4 dataset is publicly available via TensorFlow Datasets and is the standard pre-training corpus for many T5 variants (mT5, ByT5, Flan-T5). If you’re pre-training on web text, C4’s filtering heuristics are a reasonable starting point.
Scale helps monotonically in T5's study: every increase in model size improves results, and the gains compound when you scale model size and training data simultaneously. T5-11B pre-trained on ~1T tokens beats T5-3B pre-trained on the same data — consistent with scaling-laws that emerge more clearly in later work.
Flan-T5 (2022) instruction-tunes T5 on 1,836 tasks and dramatically improves zero-shot performance — the text-to-text format turns out to be particularly well-suited to instruction following. If you’re using T5 variants today, reach for Flan-T5 rather than the original.
Connections
- transfer-learning — T5 is the definitive study of transfer learning for NLP
- pre-training — span corruption objective on C4 is the core pre-training method
- encoder-decoder — T5’s finding that encoder-decoder beats encoder-only for the general case
- fine-tuning — systematic comparison of full fine-tuning vs. adapters vs. gradual unfreezing
- scaling-laws — T5’s scaling experiments foreshadow the formal scaling laws of Chinchilla
- attention-is-all-you-need — T5 uses the standard Transformer architecture from this paper
- switch-transformer-sparse-mixture-of-experts — Switch Transformer uses T5 as its baseline and builds on C4/T5 infrastructure
Citation
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140), 1–67.