What It Is

Pre-training is the process of training a model on a large, general dataset using a self-supervised objective before adapting it to specific tasks. The resulting model encodes broad language or visual knowledge that can be transferred to downstream tasks with minimal additional training.

Why It Matters

Pre-training concentrates expensive compute into a single large training run whose cost can be amortized across hundreds of downstream tasks. Without it, every NLP application would need a large model trained from scratch on task-specific data, which is infeasible for the many tasks that have only thousands of labeled examples. The pre-train-then-fine-tune paradigm is the dominant approach in modern NLP and computer vision.

How It Works

A model is trained on a self-supervised task that does not require human labels. Common pre-training objectives include:

  • Masked Language Modeling (BERT): predict randomly hidden tokens using bidirectional context
  • Next Token Prediction (GPT): predict the next token left-to-right
  • Contrastive objectives (CLIP): align paired modalities via similarity
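The first two objectives above can be sketched as data-preparation steps: masked language modeling hides random tokens and scores the model only on reconstructing them, while next-token prediction turns one sequence into many prefix-to-token examples. A minimal sketch (the function names, the 15% masking rate, and the `[MASK]` placeholder follow BERT's convention; the rest is illustrative):

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Build a masked-LM example: hide ~15% of tokens and record
    their original values as the prediction targets."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)  # model must reconstruct this token
            labels.append(tok)         # target, visible only to the loss
        else:
            inputs.append(tok)
            labels.append(None)        # position not scored
    return inputs, labels

def next_token_pairs(tokens):
    """Build next-token-prediction examples: each prefix predicts the
    token that follows it (GPT-style, left-to-right)."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "the cat sat on the mat".split()
inputs, labels = mask_tokens(tokens, seed=3)
pairs = next_token_pairs(["a", "b", "c"])
```

Note that neither objective needs human labels: the targets are derived from the raw text itself, which is what makes the objectives self-supervised.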

Pre-training uses large, diverse corpora; BERT, for example, was trained on BooksCorpus plus English Wikipedia, about 3.3 billion words in total. The pre-trained weights capture syntax, semantics, and world knowledge as a side effect of the self-supervised task. Fine-tuning then adapts these weights to a specific downstream task using a small labeled dataset.
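The fine-tuning step can be sketched in miniature: a frozen encoder stands in for the pre-trained weights, and only a small task head is trained on the labeled data. Everything here is illustrative, not BERT's actual procedure: the encoder is a fixed random projection (real pre-trained weights come from the self-supervised run), the dataset is synthetic, and the head is a plain logistic regression trained by gradient descent.

```python
import math
import random

rng = random.Random(0)

# Stand-in for pre-trained weights: a fixed random 8-to-4 projection.
# (Assumption for a runnable sketch; in practice these are learned.)
ENCODER = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]

def encode(x):
    # Frozen "pre-trained" encoder: never updated during fine-tuning.
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in ENCODER]

# Tiny synthetic labeled dataset for the downstream task.
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(32)]
labels = [1.0 if x[0] > 0 else 0.0 for x in data]

# The only trainable parameters: a small logistic-regression head.
w = [0.0] * 4
b = 0.0

def predict(x):
    z = sum(wi * fi for wi, fi in zip(w, encode(x))) + b
    return 1 / (1 + math.exp(-z))

def mean_loss():
    eps = 1e-9
    return -sum(y * math.log(predict(x) + eps) +
                (1 - y) * math.log(1 - predict(x) + eps)
                for x, y in zip(data, labels)) / len(data)

loss_before = mean_loss()
for _ in range(200):  # gradient descent on the head only
    gw = [0.0] * 4
    gb = 0.0
    for x, y in zip(data, labels):
        err = predict(x) - y
        for i, fi in enumerate(encode(x)):
            gw[i] += err * fi
        gb += err
    w = [wi - 0.5 * gi / len(data) for wi, gi in zip(w, gw)]
    b -= 0.5 * gb / len(data)
loss_after = mean_loss()
```

Because only the head's handful of parameters are updated, the labeled dataset can be small; this is the amortization the section describes, with the expensive encoder shared across tasks.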

Key Sources