What It Is
Fine-tuning is the process of taking a pre-trained model and continuing to train it on a smaller, task-specific dataset. All or most of the pre-trained weights are updated at a low learning rate, adapting the general representations to the specific task distribution.
Why It Matters
Fine-tuning is what makes pre-training economically viable at scale. Train once at enormous cost, fine-tune cheaply for every application. BERT demonstrated that fine-tuning with a single additional linear layer achieves state-of-the-art results on 11 NLP tasks — replacing custom task-specific architectures that took years of engineering.
How It Works
The standard BERT fine-tuning procedure:
- Initialize with pre-trained weights
- Add a small task-specific output layer (e.g., a linear classifier W ∈ ℝ^(K×H), where K is the number of labels and H is the hidden size)
- Feed task inputs in the same format as pre-training (token sequences with [CLS] and [SEP])
- Train all parameters end-to-end at low learning rate (2e-5 to 5e-5) for 2-4 epochs
- Select checkpoint on dev set
For classification tasks, the [CLS] token’s final hidden state is the input to the output layer. For token-level tasks (NER, QA span extraction), per-token final hidden states are used. Fine-tuning a BERT-Large model on a single task takes roughly 1 hour on a Cloud TPU — a small fraction of the multi-day pre-training cost.
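The procedure above can be sketched in a few lines of PyTorch. This is a minimal illustration, not BERT itself: a small randomly initialized `nn.TransformerEncoder` stands in for the pre-trained encoder, and inputs are already-embedded dummy tensors, so only the structure (linear head over the [CLS] state, all parameters trained end-to-end at a low learning rate) mirrors the real recipe.

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Task-specific linear layer W in R^(K x H) applied to the [CLS] state."""
    def __init__(self, hidden: int, num_labels: int):
        super().__init__()
        self.linear = nn.Linear(hidden, num_labels)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        cls = hidden_states[:, 0, :]   # [CLS] is the first token position
        return self.linear(cls)       # logits, shape (batch, K)

# Stand-in for a pre-trained encoder (random weights, illustration only).
H, K = 768, 2
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=H, nhead=12, batch_first=True),
    num_layers=2,
)
head = ClassifierHead(H, K)

# Train ALL parameters end-to-end at a low learning rate (e.g. 2e-5).
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(head.parameters()), lr=2e-5
)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on dummy embedded inputs (batch=4, seq=16).
x = torch.randn(4, 16, H)
labels = torch.randint(0, K, (4,))
logits = head(encoder(x))
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
```

In practice the encoder would be loaded from a pre-trained checkpoint rather than initialized randomly, and training would run for 2-4 epochs with checkpoint selection on the dev set.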
Full fine-tuning updates all pre-trained parameters, which produces the best task performance but requires storing a full model copy per task. Parameter-efficient methods like LoRA address this by keeping pre-trained weights frozen and inserting small trainable adapters.
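The LoRA idea can be shown with a small numpy sketch. Assumptions for illustration: a single hypothetical 768×768 projection matrix stands in for a frozen attention weight, and the adapter is shown at initialization only (no training loop). Because B starts at zero, the adapted layer is exactly the original layer until training begins, and per task only the small A and B matrices (2·d·r values instead of d²) need to be stored.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pre-trained weight (hypothetical attention projection), rank r << d.
d, r = 768, 8
W = rng.standard_normal((d, d))     # frozen, shared across all tasks

# Trainable low-rank adapter: B initialized to zero so W + B @ A == W at start.
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))

def adapted_forward(x: np.ndarray) -> np.ndarray:
    # Original frozen path plus low-rank update; only A and B would be trained.
    return x @ W.T + x @ (B @ A).T

x = rng.standard_normal((4, d))
out = adapted_forward(x)
```

Here the adapter adds 2 × 768 × 8 = 12,288 trainable parameters per layer versus 768² = 589,824 for full fine-tuning of the same matrix.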
Key Sources
- bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding
- lora-low-rank-adaptation — parameter-efficient alternative to full fine-tuning
- training-language-models-to-follow-instructions-with-human-feedback