How LLMs Are Trained — From Scratch to RLHF

This path traces the full lifecycle of a large language model: how we know how big to build it, how the core mechanism works, how it’s trained efficiently, how it’s aligned to human intent, and how the field has matured through successive generations of models.

Step 1 — Scaling Laws

scaling-laws

Before building anything, understand the map. Scaling laws tell you that model loss follows a power law over parameters, data, and compute — and that you can measure this cheaply at small scale and extrapolate reliably. This is the foundation that makes the rest of the field legible: every architectural and training choice is ultimately validated against these curves.
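The "measure small, extrapolate large" idea can be sketched as a straight-line fit in log-log space. This is a minimal illustration, assuming a pure power law with no irreducible-loss offset; the function names and the synthetic numbers are made up for the example and do not come from any paper.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b) in loss = a * size ** (-b)

def predict_loss(a, b, size):
    """Evaluate the fitted power law at a new model size."""
    return a * size ** (-b)

# Synthetic "small runs" generated from a known power law (illustrative only).
small_sizes = [1e6, 1e7, 1e8]
small_losses = [10.0 * n ** -0.07 for n in small_sizes]

a, b = fit_power_law(small_sizes, small_losses)
extrapolated = predict_loss(a, b, 1e10)  # predicted loss two orders of magnitude out
```

On real training runs the points are noisy and the curve eventually flattens, so a fit like this is a starting point rather than a guarantee.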

Step 2 — Attention

attention

Scaling laws tell you how much to build. Attention tells you what to build. The transformer’s attention mechanism is the architectural primitive that replaced RNNs and made large-scale parallel training possible. Understanding Q/K/V, the scaling factor, and multi-head structure is required before anything downstream makes sense.
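As a rough sketch of the single-head case, scaled dot-product attention computes softmax(QK^T / sqrt(d_k)) V. The version below uses plain Python lists so the arithmetic stays visible; it omits masking, batching, and the multi-head projections, and the function names are illustrative.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Dot product of the query with every key, divided by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Two identical keys receive equal weight, so the output is the mean of the values.
result = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]])
```

Multi-head attention runs several copies of this in parallel on different learned projections of the input and concatenates the results.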

Step 3 — Flash Attention

flash-attention

The naive attention implementation doesn’t scale. Materializing the full attention matrix costs memory that grows quadratically with sequence length, which makes training on long sequences impractical on real hardware. Flash Attention rewrites the algorithm as a tiled computation that keeps each block in fast on-chip SRAM: same math, far fewer reads and writes to GPU main memory, and memory use that grows linearly rather than quadratically with sequence length. This is what made training on much longer contexts practical.
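The "same math" claim rests on an online softmax: keys and values can be processed block by block while keeping only a running maximum, a running normalizer, and a running weighted sum. The sketch below shows that idea for a single query in plain Python. It is not the actual GPU kernel, and the names and block size are assumptions for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_attention_row(q, K, V):
    """Reference version: materializes every score before normalizing."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * dot(q, k) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / z for j in range(len(V[0]))]

def tiled_attention_row(q, K, V, block_size=2):
    """Processes keys/values one block at a time with an online softmax."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), block_size):
        k_blk = K[start:start + block_size]
        v_blk = V[start:start + block_size]
        scores = [scale * dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            p = math.exp(s - new_max)
            running_sum += p
            acc = [a + p * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.5, -1.0]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.3, 0.7]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
naive_out = naive_attention_row(q, K, V)
tiled_out = tiled_attention_row(q, K, V, block_size=2)
```

The tiled version never holds more than one block of scores at a time, which is the property that lets the real kernel keep its working set in SRAM.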

Step 4 — GPT-2 / Language Models are Unsupervised Multitask Learners

language-models-are-unsupervised-multitask-learners

Before scaling to hundreds of billions of parameters, this is the paper that proved the idea. GPT-2 showed that a language model trained purely on next-token prediction — no task labels, no fine-tuning — could perform translation, summarization, and reading comprehension in a zero-shot setting. The central claim: language modeling at scale is implicitly multitask learning. This established the pretraining paradigm that everything else builds on.
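To make "trained purely on next-token prediction" concrete, here is a toy sketch in which a count-based bigram model stands in for the transformer. The only training signal is the next token in raw text; there are no task labels. The helper names and the smoothing choice are assumptions for the example.

```python
import math
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_loss(counts, tokens, vocab_size, smoothing=1.0):
    """Average negative log-likelihood of each next token (add-k smoothed)."""
    total = 0.0
    n = 0
    for prev, nxt in zip(tokens, tokens[1:]):
        row = counts[prev]
        p = (row[nxt] + smoothing) / (sum(row.values()) + smoothing * vocab_size)
        total += -math.log(p)
        n += 1
    return total / n

text = "the cat sat on the mat".split()
vocab_size = len(set(text))
model = train_bigram(text)
loss = next_token_loss(model, text, vocab_size)
uniform_loss = math.log(vocab_size)  # loss of a model that guesses uniformly
```

A real language model replaces the count table with a neural network and minimizes the same quantity by gradient descent.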

Step 5 — GPT-3 / Language Models are Few-Shot Learners

language-models-are-few-shot-learners

With the mechanism and efficiency in place, GPT-3 demonstrated what happens when you scale GPT-2’s approach to 175B parameters on 300B tokens. The central result: in-context learning. The model wasn’t fine-tuned for tasks — it learned to do them from examples in the prompt. This paper established the modern paradigm and showed the qualitative shift that emerges from scale.
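In-context learning means the "training examples" live in the prompt itself. A minimal sketch of assembling such a prompt follows; the `Input:`/`Output:` layout and the function name are illustrative choices, not a format prescribed by the paper.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Concatenate an instruction, worked examples, and an unanswered query."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the model is expected to continue from here
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
)
```

No weights change between tasks; swapping the examples in the prompt is the only adaptation step.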

Step 6 — SFT (Supervised Fine-Tuning)

sft

Pretraining produces a capable but unruly model that will complete any text, including harmful completions. SFT takes a pretrained model and fine-tunes it on curated (prompt, ideal response) pairs. This is the first alignment step: teaching the model to be helpful rather than just fluent. LIMA showed that 1,000 carefully curated examples can match models trained on orders of magnitude more data — quality of demonstrations matters far more than quantity.
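One way to picture the SFT objective: the model still predicts next tokens, but the loss is usually averaged only over the response, not the prompt. The sketch below assumes per-token log-probabilities are already available from some model. Masking the prompt is a common practice rather than something stated above, and the function name is hypothetical.

```python
def sft_loss(token_logprobs, prompt_length):
    """Mean negative log-likelihood over response tokens only.

    token_logprobs: log-probability the model assigned to each actual token
                    of the concatenated (prompt + response) sequence.
    prompt_length:  number of leading tokens that belong to the prompt.
    """
    response_logprobs = token_logprobs[prompt_length:]
    if not response_logprobs:
        raise ValueError("the example has no response tokens")
    return -sum(response_logprobs) / len(response_logprobs)

# Two prompt tokens (ignored) followed by two response tokens.
loss = sft_loss([-5.0, -5.0, -1.0, -3.0], prompt_length=2)
```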

Step 7 — LIMA: Less Is More for Alignment

lima-less-is-more-for-alignment

This paper stress-tests SFT: what happens if you fine-tune on only 1,000 human-selected examples, chosen for diversity and quality? In human preference evaluations, the resulting model beats a model trained on the 52K-example Alpaca data, and its responses are judged equivalent or preferable to GPT-4’s in 43% of cases. The implication is that alignment is primarily about selecting style and format from a capable base, not about injecting knowledge, and it reshaped how practitioners think about SFT data collection. Read this alongside the SFT concept.

Step 8 — Learning to Summarize from Human Feedback

learning-to-summarize-human-feedback

This is the paper that established RLHF for language models at scale, building on earlier preference-learning work (Christiano et al., 2017; Ziegler et al., 2019). Before InstructGPT, before ChatGPT, this paper from OpenAI showed that training a reward model on human preference labels and then using it to fine-tune a summarization model produces dramatically better results than supervised fine-tuning on reference summaries. The dataset, reward modeling setup, and PPO training loop described here are the direct ancestors of InstructGPT’s methodology. Read this before studying the RLHF concept.
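Training a reward model on preference labels is commonly written as a pairwise logistic loss: the model is penalized when the human-preferred response does not score higher than the rejected one. A minimal sketch, assuming scalar reward scores are already computed (function names are illustrative):

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    return -log_sigmoid(reward_chosen - reward_rejected)

tie = preference_loss(0.0, 0.0)         # no preference expressed: log(2)
agrees = preference_loss(2.0, 0.0)      # reward model agrees with the label
disagrees = preference_loss(0.0, 2.0)   # reward model contradicts the label
```

Only the difference between the two scores matters, so the reward scale is arbitrary up to an additive constant.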

Step 9 — PPO / Proximal Policy Optimization

proximal-policy-optimization

PPO is the reinforcement learning algorithm that makes RLHF tractable. Naive policy gradient updates can move the policy too far in one step, causing instability. PPO constrains updates with a clipped objective that prevents large policy swings while still making progress. This stability property is why it became the default for RLHF: language model fine-tuning with RL is already unstable enough without the algorithm adding instability of its own.
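The clipped objective can be sketched for a single action as follows. The probability ratio between the new and old policy is clipped to [1 - eps, 1 + eps], and the objective takes the more pessimistic of the clipped and unclipped terms. The default `clip_eps=0.2` is a commonly used value, chosen here as an assumption.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample PPO surrogate objective (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to push the ratio past the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)

# Ratio 1.5 with a positive advantage: the gain is capped at 1.2 * advantage.
capped = ppo_clipped_objective(math.log(1.5), 0.0, advantage=1.0)
# Ratio 0.5 with a negative advantage: the objective is capped at 0.8 * advantage.
floored = ppo_clipped_objective(math.log(0.5), 0.0, advantage=-1.0)
```

A full PPO implementation averages this over a batch and adds value-function and entropy terms, which are left out here.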

Step 10 — RLHF

rlhf

With the history in place, the RLHF concept page synthesizes how reward modeling and PPO compose into the full alignment pipeline. A reward model is trained on preference comparisons, then PPO fine-tunes the language model against the reward model’s scores, with a KL penalty to prevent the policy from drifting too far from the SFT model. This is how InstructGPT and ChatGPT were trained.
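The KL penalty can be sketched as a correction to the reward model's score: the policy is charged for every nat by which its log-probabilities on the sampled response exceed those of the SFT model. This is a simplified per-sequence version with assumed names and an assumed `beta`; real implementations often apply the penalty per token.

```python
def kl_penalized_reward(reward_score, logp_policy, logp_sft, beta=0.1):
    """Reward used for the RL update: score minus beta * estimated KL.

    logp_policy / logp_sft: per-token log-probabilities of the sampled
    response under the current policy and under the frozen SFT model.
    """
    # Sum of log-ratios on the sampled tokens is a single-sample KL estimate.
    kl_estimate = sum(p - s for p, s in zip(logp_policy, logp_sft))
    return reward_score - beta * kl_estimate

shaped = kl_penalized_reward(1.0, [-1.0, -1.0], [-2.0, -1.5], beta=0.1)
```

A larger `beta` keeps the policy closer to the SFT model at the cost of slower reward improvement.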

Step 11 — DPO / Direct Preference Optimization

dpo

RLHF works, but the PPO loop is unstable and typically keeps four models in memory at once (policy, reference, reward model, and value model). DPO shows that the same KL-constrained objective can be optimized without a separate reward model: the reward is implicit in the log-probability ratios between the policy and the reference model. This reduces alignment training to a simple binary cross-entropy loss on preference pairs. It is increasingly the default approach for fine-tuning aligned models.
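For one preference pair, the DPO loss can be sketched from four sequence log-probabilities: the chosen and rejected responses under the policy and under the frozen reference model. The value of `beta` and the function names are assumptions for the example.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single (chosen, rejected) pair."""
    # Implicit rewards are beta-scaled log-ratios against the reference model.
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))

at_reference = dpo_loss(-1.0, -1.0, -1.0, -1.0)   # policy equals reference: log(2)
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)       # chosen up, rejected down
```

The structure mirrors the pairwise reward-model loss, with log-ratios taking the place of reward scores.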

Step 12 — KTO / Model Alignment as Prospect Theoretic Optimization

kto-model-alignment-prospect-theoretic-optimization

DPO requires paired preference data (A is better than B). KTO relaxes this: it only needs binary signals — was this response good or bad? — without needing to compare two responses to the same prompt. Grounded in Kahneman-Tversky prospect theory, KTO’s loss function reflects how humans actually weigh gains versus losses asymmetrically. In practice this means existing datasets of labeled good/bad completions can be used directly for alignment.
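A simplified per-example sketch of the KTO loss is shown below. It follows the form given in the paper as best I can reconstruct it: a log-ratio against the reference model is compared to a reference point and passed through a sigmoid, with separate weights for desirable and undesirable examples. Here the reference point `kl_baseline` is passed in as a plain number (the paper estimates it from the batch), and the default `beta`, `lambda_d`, and `lambda_u` values are assumptions.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def kto_loss(logp_policy, logp_ref, desirable, kl_baseline=0.0,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """KTO-style loss for one unpaired example labeled good or bad."""
    log_ratio = logp_policy - logp_ref
    if desirable:
        # Gains: loss shrinks as the policy upweights a good response.
        return lambda_d * (1.0 - sigmoid(beta * (log_ratio - kl_baseline)))
    # Losses: loss shrinks as the policy downweights a bad response.
    return lambda_u * (1.0 - sigmoid(beta * (kl_baseline - log_ratio)))

neutral = kto_loss(-1.0, -1.0, desirable=True)
good_upweighted = kto_loss(-0.5, -1.0, desirable=True)
bad_upweighted = kto_loss(-0.5, -1.0, desirable=False)
```

Setting `lambda_d` and `lambda_u` to different values is how the asymmetry between gains and losses enters the objective.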

Step 13 — LLaMA 2

llama-2-open-foundation-fine-tuned-chat-models

LLaMA 2 is the reference point for open-weight aligned models. The paper covers the full training pipeline: pretraining on 2T tokens, SFT on high-quality demonstrations, iterative RLHF with rejection sampling, and safety-focused fine-tuning. Unlike GPT-3 and ChatGPT, the weights are public, so the models can be inspected, evaluated, and fine-tuned directly, even though the pretraining data itself was not released. After studying the alignment techniques, this is how they look assembled into a production system.
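Rejection sampling in this setting can be sketched as best-of-N selection: draw several candidate responses, score each with the reward model, and keep the top one as a fine-tuning target. The `generate` and `score` callables below are hypothetical stand-ins for a language model and a learned reward model.

```python
def rejection_sample(prompt, generate, score, num_samples=4):
    """Return the highest-scoring of num_samples candidate responses."""
    candidates = [generate(prompt, seed) for seed in range(num_samples)]
    return max(candidates, key=lambda response: score(prompt, response))

def toy_generate(prompt, seed):
    # Stand-in for sampling from a language model with a given seed.
    drafts = ["ok", "a longer, more detailed answer", "short"]
    return drafts[seed % len(drafts)]

def toy_score(prompt, response):
    # Stand-in for a learned reward model; here it simply favors longer text.
    return len(response)

best = rejection_sample("Explain attention.", toy_generate, toy_score, num_samples=3)
```

The selected responses then serve as new supervised targets, which is what makes the procedure iterative.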

Step 14 — Constitutional AI

constitutional-ai-harmlessness-from-ai-feedback

Constitutional AI (Anthropic, 2022) addresses a scaling bottleneck in RLHF: human feedback is expensive. Instead, a set of principles (the “constitution”) guides a model to critique and revise its own outputs, and the revised outputs are used for supervised fine-tuning. The model then compares pairs of responses against the same principles, and those AI-generated comparisons train a preference model. This takes humans out of the loop for a large fraction of the preference data. RLAIF (RL from AI Feedback) is now a standard variant of the alignment pipeline.

Step 15 — DeepSeek-R1

deepseek-r1-reasoning-via-reinforcement-learning

DeepSeek-R1 shows what happens when you apply RL not to human preference but to verifiable correctness: math problems, code, logical puzzles. The model learns to generate long chain-of-thought traces before answering, with the RL signal coming from whether the final answer is right. The paper reports reasoning performance comparable to OpenAI’s o1 on math and coding benchmarks. In the R1-Zero variant this behavior emerges from RL alone, without supervised chain-of-thought data; the released R1 model adds a small amount of cold-start data before RL.
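A verifiable-correctness reward can be as simple as a rule that extracts the final answer and compares it to a known reference. The sketch below assumes a hypothetical `Answer:` line at the end of the completion; that format and the exact-match rule are illustrative choices, not the ones used in the paper.

```python
import re

def extract_final_answer(completion):
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None

def correctness_reward(completion, reference):
    """1.0 if the extracted final answer matches the reference exactly, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is not None and answer == reference.strip():
        return 1.0
    return 0.0

trace = "6 times 7 is 6 * 7.\nThat gives 42.\nAnswer: 42"
reward = correctness_reward(trace, "42")
```

Because the reward depends only on the final answer, the reasoning text before it is free to vary, which is what allows long chains of thought to develop under RL.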

Step 16 — GPT-4 Technical Report

gpt-4-technical-report

The endpoint of the path. GPT-4 draws together the themes of the preceding steps: scaling laws, efficient training, and alignment pipelines built on RLHF. The technical report is sparse on training details but covers the predictability of capabilities from smaller runs, the model’s image-and-text inputs, and its RLHF-based post-training. Reading it after the preceding 15 steps makes visible what it omits, and why those omissions are themselves informative about the state of the field.
