This path traces the full lifecycle of a large language model: how we know how big to build it, how the core mechanism works, how it’s trained efficiently, how it’s aligned to human intent, and how generation is made fast enough to serve.
Step 1 — Scaling Laws
Before building anything, understand the map. Scaling laws tell you that model loss follows a power law over parameters, data, and compute — and that you can measure this cheaply at small scale and extrapolate reliably. This is the foundation that makes the rest of the field legible: every architectural and training choice is ultimately validated against these curves.
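The power-law form can be captured in a few lines. Below is a minimal sketch of a Chinchilla-style parametric loss, L(N, D) = E + A/N^α + B/D^β; the constants are illustrative placeholders, not fitted values from any particular paper, and the function name is our own:

```python
# Chinchilla-style parametric loss: L(N, D) = E + A / N^alpha + B / D^beta.
# E is the irreducible loss; the other constants are illustrative, not fitted.
def scaling_loss(n_params, n_tokens, E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28):
    return E + A / n_params ** alpha + B / n_tokens ** beta

# The workflow scaling laws enable: measure at small scale, extrapolate.
pilot = scaling_loss(1e8, 2e9)        # cheap small-scale run
target = scaling_loss(7e10, 1.4e12)   # predicted loss at a much larger budget
```

Because loss is a smooth function of N and D, a handful of small pilot runs pins down the constants, and the curve then predicts the loss of runs that are orders of magnitude more expensive.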
Step 2 — Attention
Scaling laws tell you how much to build. Attention tells you what to build. The transformer’s attention mechanism is the architectural primitive that replaced RNNs and made large-scale parallel training possible. Understanding Q/K/V, the scaling factor, and multi-head structure is required before anything downstream makes sense.
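The core computation is compact enough to write out directly. A minimal single-head sketch in NumPy (batch and multi-head dimensions omitted for clarity):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq_q, seq_k); scaling keeps logits well-conditioned
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(Q, K, V)   # (4, 8): each row is a weighted mix of V rows
```

Multi-head attention runs this same computation in parallel over several learned projections of Q, K, and V, then concatenates the results; every position attends to every other in one matrix multiply, which is what makes the whole thing parallelizable.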
Step 3 — Flash Attention
The naive attention implementation doesn’t scale. The quadratic memory cost of materializing the full attention matrix makes training on long sequences impossible on real hardware. Flash Attention rewrites the algorithm to use tiled computation that stays in fast SRAM — same math, orders of magnitude less memory traffic. This is what made training on long contexts (sequence lengths in the tens of thousands of tokens) practical.
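The key trick is the online softmax: process K and V in blocks while keeping a running row-max and running denominator, so the full (n × n) score matrix never exists. A minimal NumPy sketch of the idea (real Flash Attention is a fused GPU kernel; this only illustrates the tiling math):

```python
import numpy as np

def flash_attention(Q, K, V, block=2):
    """Tiled attention with an online softmax. Processes K/V in blocks,
    tracking a running max m and running denominator l per query row,
    so the full (n x n) score matrix is never materialized.
    Numerically identical to naive attention."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((n, d))        # unnormalized output accumulator
    m = np.full(n, -np.inf)     # running row max of scores
    l = np.zeros(n)             # running softmax denominator
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale                 # (n, block) tile of scores
        m_new = np.maximum(m, S.max(axis=1))
        P = np.exp(S - m_new[:, None])
        alpha = np.exp(m - m_new)              # rescale previous accumulators
        l = alpha * l + P.sum(axis=1)
        O = alpha[:, None] * O + P @ Vj
        m = m_new
    return O / l[:, None]
```

On a GPU the tile loop runs inside one kernel with Kj, Vj, and the accumulators resident in SRAM, which is where the memory-traffic savings come from.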
Step 4 — GPT-3 / Language Models are Few-Shot Learners
With the mechanism and efficiency in place, GPT-3 demonstrated what happens when you scale attention-based pretraining to 175B parameters on 300B tokens. The central result: in-context learning. The model wasn’t fine-tuned for tasks — it learned to do them from examples in the prompt. This paper established the modern paradigm.
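In-context learning needs no training code at all, only a prompt that demonstrates the task. The translation pairs below follow the format used in the GPT-3 paper:

```python
# Few-shot prompting in the GPT-3 style: demonstrations go directly in the
# context, and the model is asked to continue the pattern.
examples = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
]
prompt = "Translate English to French:\n"
prompt += "".join(f"{en} => {fr}\n" for en, fr in examples)
prompt += "mint => "   # the model completes with the translation
```

No gradient update happens; the model infers the task from the pattern and completes accordingly. That this works, and works better with scale, is the paper's central result.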
Step 5 — SFT (Supervised Fine-Tuning)
Pretraining produces a capable but unruly model that will complete any text, including harmful completions. SFT takes a pretrained model and fine-tunes it on curated (prompt, ideal response) pairs. This is the first alignment step: teaching the model to be helpful rather than just fluent. It’s also what makes the model follow instructions at all.
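Mechanically, SFT is ordinary next-token cross-entropy with one twist: prompt tokens are masked out of the loss, so the model is trained only to produce the response, not to reproduce the prompt. A minimal NumPy sketch of that masked loss (the function name and toy shapes are our own):

```python
import numpy as np

def sft_loss(logits, targets, loss_mask):
    """Next-token cross-entropy where only response tokens (mask=1) count.
    Prompt tokens (mask=0) contribute nothing to the gradient."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # stable log-softmax
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    tok_logp = logp[np.arange(len(targets)), targets]      # log-prob of each target token
    return -(tok_logp * loss_mask).sum() / loss_mask.sum()

# Toy sequence: first two positions are prompt (masked), last two are response.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))        # (seq_len, vocab_size)
targets = np.array([3, 1, 7, 2])
mask = np.array([0, 0, 1, 1])
loss = sft_loss(logits, targets, mask)
```

Everything else is a standard fine-tuning loop; the curation of the (prompt, response) pairs is where the real work lies.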
Step 6 — RLHF
SFT on curated examples has limits: human-written ideal responses are expensive and don’t cover the full distribution of user requests. RLHF moves beyond this by training on preference comparisons (which response is better?) rather than absolute labels. A reward model is trained on these comparisons, then PPO fine-tunes the language model against the reward model’s scores, with a KL penalty keeping the policy close to the SFT model. This is how InstructGPT and ChatGPT were trained.
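The reward model's training objective is a Bradley-Terry preference loss: maximize the probability that the chosen response scores higher than the rejected one. A minimal sketch (function name is our own):

```python
import numpy as np

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).
    Pushes the reward model to score the preferred response higher."""
    margin = r_chosen - r_rejected
    # log1p(exp(-x)) is a numerically stable form of -log(sigmoid(x))
    return float(np.mean(np.log1p(np.exp(-margin))))
```

The scalar rewards this model produces are then the optimization target for PPO; the comparisons themselves are the only human signal in the loop.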
Step 7 — DPO
RLHF works, but the PPO loop is unstable and requires four models in memory simultaneously (policy, reference, reward model, and value function). DPO derives a closed-form equivalent: the reward model is implicit in the policy’s log-probability ratios. This reduces alignment training to a simple binary cross-entropy loss on preference pairs. It is increasingly the default approach for fine-tuning aligned models.
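The DPO loss needs only per-response log-probabilities from the policy and the frozen reference model. A minimal sketch (function and argument names are our own; β controls how far the policy may drift from the reference):

```python
import numpy as np

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO: the implicit reward of a response y is beta * log(pi(y)/pi_ref(y)).
    The loss is binary cross-entropy on the implicit-reward margin between
    the chosen and rejected responses."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # log1p(exp(-x)) is a numerically stable form of -log(sigmoid(x))
    return float(np.mean(np.log1p(np.exp(-margin))))
```

Only two models are needed (policy and frozen reference), there is no sampling loop, and the gradient is that of an ordinary classification loss, which is where the stability gain comes from.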
Step 8 — Speculative Decoding
The model is trained. Now you need to serve it fast. Autoregressive generation is inherently serial — one token per forward pass. Speculative decoding breaks this by using a small draft model to propose K tokens at once, then verifying all K in a single target model pass. The rejection-sampling scheme guarantees the output distribution is identical to sampling from the target model alone. This is the primary latency optimization in production LLM serving.
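The verification step is the interesting part. Each drafted token x is accepted with probability min(1, p(x)/q(x)), where q is the draft distribution and p the target's; on rejection, a replacement is drawn from the residual max(p − q, 0). A minimal sketch of that rule (function name is our own; the real algorithm then appends one bonus token from the target after the accepted prefix):

```python
import numpy as np

def speculative_verify(draft_tokens, q_probs, p_probs, rng):
    """Accept each draft token x with probability min(1, p(x)/q(x)); on the
    first rejection, resample from the residual max(p - q, 0) and stop.
    This makes the output distribution exactly the target model's p."""
    accepted = []
    for t, q_row, p_row in zip(draft_tokens, q_probs, p_probs):
        if rng.random() < min(1.0, p_row[t] / q_row[t]):
            accepted.append(int(t))          # draft token survives verification
        else:
            residual = np.maximum(p_row - q_row, 0.0)
            residual /= residual.sum()       # renormalize (nonzero whenever rejection is possible)
            accepted.append(int(rng.choice(len(p_row), p=residual)))
            break                            # later draft tokens are now off-distribution
    return accepted

rng = np.random.default_rng(0)
q = np.array([[0.5, 0.5]])   # draft model distribution for one position
p = np.array([[1.0, 0.0]])   # target model distribution for that position
```

When the draft model agrees with the target often, most of the K tokens are accepted and each target forward pass yields several tokens instead of one, which is the entire speedup.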