What It Is

Pre-training is the process of training a model on a large, general dataset using a self-supervised objective before adapting it to specific tasks. The resulting model encodes broad language or visual knowledge that can be transferred to downstream tasks with minimal additional training.

Why It Matters

Pre-training concentrates expensive compute into a single large training run whose cost can be amortized across hundreds of downstream tasks. Without it, every NLP application would need a large model trained from scratch on task-specific data, which is infeasible for the many tasks that have only thousands of labeled examples. The pre-train-then-fine-tune paradigm is the dominant approach in modern NLP and computer vision.

How It Works

A model is trained on a self-supervised task that does not require human labels. Common pre-training objectives include:

  • Masked Language Modeling (BERT): predict randomly hidden tokens using bidirectional context
  • Next Token Prediction (GPT): predict the next token left-to-right
  • Contrastive objectives (CLIP): align paired modalities via similarity
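The first two objectives above can be sketched as data-preparation steps: masked language modeling hides random tokens and scores the model only on reconstructing them, while next-token prediction turns one sequence into many prefix-to-token examples. A minimal sketch (the function names, the 15% masking rate, and the `[MASK]` placeholder follow BERT's convention; the rest is illustrative):

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Build a masked-LM example: hide ~15% of tokens and record
    their original values as the prediction targets."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)  # model must reconstruct this token
            labels.append(tok)         # target, visible only to the loss
        else:
            inputs.append(tok)
            labels.append(None)        # position not scored
    return inputs, labels

def next_token_pairs(tokens):
    """Build next-token-prediction examples: each prefix predicts the
    token that follows it (GPT-style, left-to-right)."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "the cat sat on the mat".split()
inputs, labels = mask_tokens(tokens, seed=3)
pairs = next_token_pairs(["a", "b", "c"])
```

Note that neither objective needs human labels: the targets are derived from the raw text itself, which is what makes the objectives self-supervised.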

Pre-training uses large, diverse corpora; BERT, for example, was trained on BooksCorpus plus English Wikipedia, about 3.3 billion words in total. The pre-trained weights capture syntax, semantics, and world knowledge as a side effect of the self-supervised task. Fine-tuning then adapts these weights to a specific downstream task using a small labeled dataset.
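The fine-tuning step can be sketched in miniature: a frozen encoder stands in for the pre-trained weights, and only a small task head is trained on the labeled data. Everything here is illustrative, not BERT's actual procedure: the encoder is a fixed random projection (real pre-trained weights come from the self-supervised run), the dataset is synthetic, and the head is a plain logistic regression trained by gradient descent.

```python
import math
import random

rng = random.Random(0)

# Stand-in for pre-trained weights: a fixed random 8-to-4 projection.
# (Assumption for a runnable sketch; in practice these are learned.)
ENCODER = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]

def encode(x):
    # Frozen "pre-trained" encoder: never updated during fine-tuning.
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in ENCODER]

# Tiny synthetic labeled dataset for the downstream task.
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(32)]
labels = [1.0 if x[0] > 0 else 0.0 for x in data]

# The only trainable parameters: a small logistic-regression head.
w = [0.0] * 4
b = 0.0

def predict(x):
    z = sum(wi * fi for wi, fi in zip(w, encode(x))) + b
    return 1 / (1 + math.exp(-z))

def mean_loss():
    eps = 1e-9
    return -sum(y * math.log(predict(x) + eps) +
                (1 - y) * math.log(1 - predict(x) + eps)
                for x, y in zip(data, labels)) / len(data)

loss_before = mean_loss()
for _ in range(200):  # gradient descent on the head only
    gw = [0.0] * 4
    gb = 0.0
    for x, y in zip(data, labels):
        err = predict(x) - y
        for i, fi in enumerate(encode(x)):
            gw[i] += err * fi
        gb += err
    w = [wi - 0.5 * gi / len(data) for wi, gi in zip(w, gw)]
    b -= 0.5 * gb / len(data)
loss_after = mean_loss()
```

Because only the head's handful of parameters are updated, the labeled dataset can be small; this is the amortization the section describes, with the expensive encoder shared across tasks.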

Key Sources