What It Is

Fine-tuning is the process of taking a pre-trained model and continuing to train it on a smaller, task-specific dataset. All or most of the pre-trained weights are updated at a low learning rate, adapting the general representations to the specific task distribution.

Why It Matters

Fine-tuning is what makes pre-training economically viable at scale. Train once at enormous cost, fine-tune cheaply for every application. BERT demonstrated that fine-tuning with a single additional linear layer achieves state-of-the-art results on 11 NLP tasks — replacing custom task-specific architectures that took years of engineering.

How It Works

The standard BERT fine-tuning procedure:

  1. Initialize with pre-trained weights
  2. Add a small task-specific output layer (e.g., a linear classifier W in R^(K x H), where K is the number of labels and H the hidden size)
  3. Feed task inputs in the same format as pre-training (token sequences with [CLS] and [SEP])
  4. Train all parameters end-to-end at low learning rate (2e-5 to 5e-5) for 2-4 epochs
  5. Select the checkpoint with the best dev-set performance
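The steps above can be sketched as a toy training loop. This is a hedged sketch, not BERT itself: the "encoder", data, dimensions, and learning rate below are invented stand-ins, and only the new head is trained for brevity.

```python
import math
import random

random.seed(0)

H, K = 4, 2            # hidden size and number of labels (toy values)
LR, EPOCHS = 2e-2, 4   # BERT uses lr around 2e-5 to 5e-5 for 2-4 epochs

# Stand-in for the pre-trained encoder: a fixed map from an input to an
# H-dimensional "[CLS]"-like vector (the weights here are made up).
def encode(x):
    return [math.tanh(sum(x) * w) for w in (0.5, -0.3, 0.8, 0.1)]

# Step 2: new task-specific linear head W in R^(K x H), randomly initialized.
W = [[random.uniform(-0.1, 0.1) for _ in range(H)] for _ in range(K)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(x):
    h = encode(x)
    logits = [sum(W[k][j] * h[j] for j in range(H)) for k in range(K)]
    return softmax(logits), h

# Toy labelled pairs; step 3 would feed real token sequences with [CLS]/[SEP].
data = [([1.0, 2.0], 0), ([-1.0, -2.0], 1), ([0.5, 1.5], 0), ([-0.5, -1.0], 1)]

# Step 4: gradient descent at a low learning rate for a few epochs.
# (Only the head moves here; full fine-tuning would update the encoder too.)
losses = []
for epoch in range(EPOCHS):
    total = 0.0
    for x, y in data:
        p, h = forward(x)
        total -= math.log(p[y])
        for k in range(K):  # cross-entropy gradient: (p_k - 1[k == y]) * h_j
            g = p[k] - (1.0 if k == y else 0.0)
            for j in range(H):
                W[k][j] -= LR * g * h[j]
    losses.append(total / len(data))
```

In practice the encoder is a Transformer fine-tuned with a framework such as PyTorch; what carries over is the shape of the procedure — pre-trained weights, a small new head, a low learning rate, and a few epochs.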

For classification tasks, the [CLS] token’s final hidden state is the input to the output layer. For token-level tasks (NER, QA span extraction), per-token final hidden states are used. Fine-tuning a BERT-Large model on a single task takes roughly 1 hour on a Cloud TPU — a small fraction of the multi-day pre-training cost.
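As a minimal illustration of the two readout patterns, the snippet below applies a linear head either to the [CLS] position alone or to every token position. The hidden states and head weights are invented placeholders, not real BERT values.

```python
# Toy per-token final hidden states for a 3-token sequence with H = 4.
hidden = [
    [0.2, -0.1, 0.4, 0.3],   # [CLS] -> read out for sequence-level tasks
    [0.9,  0.1, -0.2, 0.5],  # "Paris"
    [0.0,  0.3, 0.1, -0.4],  # [SEP]
]

def head(h, W):
    """Linear layer: logits[k] = sum_j W[k][j] * h[j]."""
    return [sum(wk[j] * h[j] for j in range(len(h))) for wk in W]

W_cls = [[0.1, 0.2, -0.1, 0.0], [-0.2, 0.1, 0.3, 0.1]]  # K=2 sequence labels
W_tok = [[0.3, -0.1, 0.2, 0.1], [0.0, 0.2, -0.3, 0.4]]  # K=2 token labels

seq_logits = head(hidden[0], W_cls)            # classification: [CLS] only
tok_logits = [head(h, W_tok) for h in hidden]  # NER-style: every token
```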

Full fine-tuning updates all pre-trained parameters, which typically gives the best task performance but requires storing a full model copy per task. Parameter-efficient methods address this by keeping the pre-trained weights frozen and training only small inserted modules; LoRA, for example, learns a low-rank update that is added to selected frozen weight matrices.
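A minimal sketch of the LoRA idea, assuming the standard W' = W + (alpha/r) * B @ A parameterization; all sizes and values below are toy stand-ins.

```python
import random

random.seed(1)

H, r = 8, 2   # toy hidden size and LoRA rank (real models: H in the hundreds+)

# Frozen pre-trained weight matrix (values made up for the sketch).
W = [[random.gauss(0, 0.02) for _ in range(H)] for _ in range(H)]

# Trainable low-rank factors: B (H x r) starts at zero so the adapted weight
# equals W at initialization; A (r x H) is randomly initialized.
B = [[0.0] * r for _ in range(H)]
A = [[random.gauss(0, 0.02) for _ in range(H)] for _ in range(r)]
alpha = 16  # scaling: the effective update is (alpha / r) * B @ A

def adapted(W, B, A, alpha, r):
    """Effective weight W' = W + (alpha / r) * B @ A."""
    s = alpha / r
    return [[W[i][j] + s * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(len(W[0]))] for i in range(len(W))]

full_params = H * H      # parameters in a fully fine-tuned copy of this matrix
lora_params = 2 * H * r  # trainable parameters under LoRA
```

At deployment time only B and A need to be stored per task; here that is 2*H*r = 32 values against H*H = 64 for a full copy, and the gap grows quadratically with H.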

Key Sources