What It Is
Fine-tuning is the process of taking a pre-trained model and continuing to train it on a smaller, task-specific dataset. All or most of the pre-trained weights are updated at a low learning rate, adapting the general representations to the specific task distribution.
Why It Matters
Fine-tuning is what makes pre-training economically viable at scale. Train once at enormous cost, fine-tune cheaply for every application. BERT demonstrated that fine-tuning with a single additional linear layer achieves state-of-the-art results on 11 NLP tasks — replacing custom task-specific architectures that took years of engineering.
How It Works
The standard BERT fine-tuning procedure:
- Initialize with pre-trained weights
- Add a small task-specific output layer (e.g., a linear classifier W ∈ ℝ^(K×H), where K is the number of labels and H is the hidden size)
- Feed task inputs in the same format as pre-training (token sequences with [CLS] and [SEP])
- Train all parameters end-to-end at low learning rate (2e-5 to 5e-5) for 2-4 epochs
- Select checkpoint on dev set
For classification tasks, the [CLS] token’s final hidden state is the input to the output layer. For token-level tasks (NER, QA span extraction), per-token final hidden states are used. Fine-tuning a BERT-Large model on a single task takes roughly 1 hour on a Cloud TPU — a small fraction of the multi-day pre-training cost.
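The procedure above can be sketched in a few lines of PyTorch. This is a minimal illustration, not BERT itself: a small randomly initialized `nn.TransformerEncoder` stands in for the pre-trained encoder, and inputs are already-embedded dummy tensors, so only the structure (linear head over the [CLS] state, all parameters trained end-to-end at a low learning rate) mirrors the real recipe.

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Task-specific linear layer W in R^(K x H) applied to the [CLS] state."""
    def __init__(self, hidden: int, num_labels: int):
        super().__init__()
        self.linear = nn.Linear(hidden, num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        cls = hidden_states[:, 0, :]   # [CLS] is the first token position
        return self.linear(cls)       # logits, shape (batch, K)

# Stand-in for a pre-trained encoder (random weights, illustration only).
H, K = 768, 2
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=H, nhead=12, batch_first=True),
    num_layers=2,
)
head = ClassifierHead(H, K)

# Train ALL parameters end-to-end at a low learning rate (e.g. 2e-5).
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(head.parameters()), lr=2e-5
)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on dummy embedded inputs (batch=4, seq=16).
x = torch.randn(4, 16, H)
labels = torch.randint(0, K, (4,))
logits = head(encoder(x))
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
```

In practice the encoder would be loaded from a pre-trained checkpoint rather than initialized randomly, and training would run for 2-4 epochs with checkpoint selection on the dev set.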
Full fine-tuning updates all pre-trained parameters, which produces the best task performance but requires storing a full model copy per task. Parameter-efficient methods like LoRA address this by keeping pre-trained weights frozen and inserting small trainable adapters.
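The LoRA idea can be shown with a small numpy sketch. Assumptions for illustration: a single hypothetical 768×768 projection matrix stands in for a frozen attention weight, and the adapter is shown at initialization only (no training loop). Because B starts at zero, the adapted layer is exactly the original layer until training begins, and per task only the small A and B matrices (2·d·r values instead of d²) need to be stored.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pre-trained weight (hypothetical attention projection), rank r << d.
d, r = 768, 8
W = rng.standard_normal((d, d))     # frozen, shared across all tasks

# Trainable low-rank adapter: B initialized to zero so W + B @ A == W at start.
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))

def adapted_forward(x: np.ndarray) -> np.ndarray:
    # Original frozen path plus low-rank update; only A and B would be trained.
    return x @ W.T + x @ (B @ A).T

x = rng.standard_normal((4, d))
out = adapted_forward(x)
```

Here the adapter adds 2 × 768 × 8 = 12,288 trainable parameters per layer versus 768² = 589,824 for full fine-tuning of the same matrix.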
Key Sources
- bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding
- lora-low-rank-adaptation — parameter-efficient alternative to full fine-tuning
- training-language-models-to-follow-instructions-with-human-feedback