Concepts: data-quality | scaling-laws | compute-optimal-training | pre-training
Builds on: training-compute-optimal-large-language-models | scaling-laws-for-neural-language-models
Chinchilla said: train on more tokens than you thought, because each compute budget has a compute-optimal split between parameters and tokens. Phi-3 says: that scaling law was the floor, not the ceiling. With sufficiently curated data (heavy filtering of the web plus synthetic data generated by stronger models), a 3.8B-parameter model can match the quality of Mixtral 8x7B (12.9B active) and come within striking distance of GPT-3.5. The implication is that “data quality” is a separate axis from “data quantity,” and the field had been treating them as if they were the same.
The core idea
The analogy: Two students prepare for the same exam. Student A reads every textbook, every blog post, and every Reddit thread on the topic: 10,000 pages of mixed-quality material. Student B has a tutor (a top student) who curates the 1,000 most pedagogically useful pages and writes 500 pages of fresh worked examples specifically calibrated to the exam. On test day, Student B wins, despite reading roughly 7x less (1,500 pages against 10,000).
Phi-3-mini is Student B in curation, not in volume. It is trained on roughly 3.3T tokens, the same order of magnitude as LLaMA-2 7B’s 2T, but the composition of those tokens is radically different:
- Heavily filtered web data. Where most prior models filtered at the document level (keep / drop), Phi-3 filters at finer granularity, using classifier-based scoring trained on human and model-generated quality labels.
- Synthetic data generated by GPT-4-class teachers. Worked examples, textbook-style explanations, problem-solution pairs. Generated to fill specific capability gaps the team identified.
“Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data.”
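The finer-than-document filtering described above can be illustrated with a toy sketch. This is not Phi-3's pipeline: `toy_quality_score` is a hypothetical heuristic standing in for a learned quality classifier, and the 0.6 threshold is an arbitrary assumption. The point it shows is only the shape of the idea: score each paragraph and keep the good ones, rather than keeping or dropping the whole document.

```python
def toy_quality_score(text: str) -> float:
    """Hypothetical stand-in for a learned quality classifier; returns a score in [0, 1]."""
    words = text.split()
    if not words:
        return 0.0
    # Share of purely alphabetic words: penalizes link dumps, markup, and symbol noise.
    alpha = sum(w.strip(".,;:!?()\"'").isalpha() for w in words) / len(words)
    # Length term: very short fragments rarely carry teachable content.
    length = min(len(words) / 20.0, 1.0)
    return alpha * length


def filter_document(doc: str, threshold: float = 0.6) -> str:
    """Keep only the paragraphs that score at or above the threshold."""
    paragraphs = [p.strip() for p in doc.split("\n\n") if p.strip()]
    kept = [p for p in paragraphs if toy_quality_score(p) >= threshold]
    return "\n\n".join(kept)


good = ("A binary search halves the search interval at every step, so finding "
        "an item in a sorted list of one million entries takes about twenty comparisons.")
junk = "Click here >>> http://spam.example | 404 | $$$"
print(filter_document(good + "\n\n" + junk))
```

A document-level filter would have had to keep or drop both paragraphs together; here the explanatory paragraph survives and the junk does not.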
The architecture is unremarkable: a standard decoder-only transformer with 32 layers, 32 attention heads, and a hidden dimension of 3072. (Grouped-query attention appears in phi-3-small, not in phi-3-mini.) The recipe is the data.
What’s clever — find the instinct
The non-obvious move is rejecting the Chinchilla-optimal regime on purpose. At the usual rule of thumb of about 20 tokens per parameter, Chinchilla says a compute-optimal 3.8B model should train on ~80B tokens. Phi-3-mini trains on 3.3T, roughly 40x what compute-optimal would suggest. This is “over-training” by Chinchilla’s framework, and yet it produces a much stronger model than a compute-optimal 3.8B would.
The reason: Chinchilla optimizes for a specific objective (validation loss on a fixed text distribution) under a specific assumption (training data quality is held fixed). When you can change the data distribution itself, the optimum shifts. A small model trained on extremely high-quality data can outperform a larger model trained on noisy data, and the small model’s inference cost is permanently lower.
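The over-training ratio can be checked with a few lines of arithmetic. This is a rough sketch: the 20 tokens-per-parameter constant is the common rule-of-thumb reading of Chinchilla (an assumption, not a figure from the Phi-3 report), and the FLOP estimate uses the standard C ≈ 6·N·D approximation.

```python
TOKENS_PER_PARAM = 20  # assumption: rule-of-thumb Chinchilla ratio


def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal token count for a model of n_params parameters."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard approximation of training compute: C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens


n_params = 3.8e9        # phi-3-mini
tokens_actual = 3.3e12  # tokens phi-3-mini is trained on

tokens_optimal = chinchilla_optimal_tokens(n_params)
overtrain_ratio = tokens_actual / tokens_optimal

print(f"compute-optimal tokens: {tokens_optimal / 1e9:.0f}B")
print(f"over-training ratio:    {overtrain_ratio:.1f}x")
print(f"training compute:       {training_flops(n_params, tokens_actual):.2e} FLOPs")
```

With these inputs the optimum comes out at 76B tokens and the ratio at a little over 43x, consistent with the rounded "~80B" and "roughly 40x" figures in the text.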
“Phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench, despite being small enough to be deployed on a phone.”
The broader Phi philosophy (“textbooks are all you need,” from Phi-1/Phi-2): the bottleneck for small models is not parameter count but density of useful information per training token. If you can generate textbook-quality data at scale, parameter efficiency goes up.
The second clever move is the data-curation feedback loop. The team uses internal evals to identify capability gaps, then generates synthetic data targeted at those gaps, then re-trains. This is data-centric AI: the recipe is “look at where the model fails, generate examples that fix it.” Loosely, it applies the feedback-driven spirit of RLHF to the training data itself, before the alignment stage.
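One way to picture the loop is as a budget-allocation step between evaluation and generation. The sketch below is purely illustrative: the capability names, scores, target, and the proportional-allocation rule are all hypothetical, and nothing here is taken from the Phi-3 team's actual tooling.

```python
def find_gaps(eval_scores: dict[str, float], target: float) -> dict[str, float]:
    """Return each capability scoring below target, mapped to its shortfall."""
    return {name: target - score for name, score in eval_scores.items() if score < target}


def allocate_budget(gaps: dict[str, float], total_examples: int) -> dict[str, int]:
    """Split a synthetic-example budget across gaps in proportion to shortfall."""
    total_shortfall = sum(gaps.values())
    if total_shortfall == 0:
        return {}
    return {name: round(total_examples * gap / total_shortfall)
            for name, gap in gaps.items()}


# Hypothetical internal eval results (fraction correct per capability).
scores = {"arithmetic": 0.55, "code": 0.80, "reading": 0.70}

gaps = find_gaps(scores, target=0.75)
plan = allocate_budget(gaps, total_examples=1000)
print(plan)  # the weakest capability receives the largest share
```

In a real loop the plan would feed a teacher model that generates the examples, followed by re-training and re-evaluation; those steps are omitted here.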
“We… target the inference efficiency of the model… ensure that the model is robust against adversarial inputs.”
The third clever move is the family. Phi-3-mini (3.8B) ships with phi-3-small (7B) and phi-3-medium (14B), followed by phi-3.5-mini, phi-3.5-MoE (16 experts of 3.8B each, 6.6B active per token), and phi-3.5-Vision. Each is the same data-quality recipe at a different scale. The MoE version, in particular, runs only 6.6B active parameters per token while drawing on a far larger expert pool (nominally 16 x 3.8B ≈ 60B, although layers shared across experts make the true total smaller).
Walkthrough: phi-3-mini sits on a phone
Setup: phi-3-mini, 3.8B parameters, 4-bit quantized.
- Memory footprint (weights only): 3.8B * 0.5 bytes = 1.9 GB.
- iPhone 14: 6 GB RAM. Plenty of room.
Run on-device:
- Prompt: "Translate this email to formal German: ..."
- Latency to first token: ~300 ms on iPhone Neural Engine.
- Throughput: ~12 tokens/sec sustained.
- Battery cost: ~5% per 1000 tokens (rough).
Compare to GPT-3.5 quality (cloud-only):
- On MMLU: GPT-3.5 ≈ 70%, phi-3-mini = 69%. Nearly tied.
- On MT-bench: GPT-3.5 ≈ 8.4, phi-3-mini = 8.38. Tied.
- Cloud round-trip: 200-1500 ms latency, $0.001-0.01 per call.
- On-device: no network round-trip, no per-call cost, full privacy.
Compare to LLaMA-2 7B (the size class people previously aimed for on edge):
- LLaMA-2 7B on MMLU: ~46%.
- phi-3-mini on MMLU: 69%.
- Phi-3-mini is roughly half the size and dramatically more capable.
- Difference: training data quality.
The deployment story is the punch line. A model good enough to draft emails, answer factual questions, and do basic reasoning, sitting on a phone with no network. This was implausible a year before this paper.
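The arithmetic behind the walkthrough is simple enough to reproduce. The sketch below counts weights only (it ignores the KV cache, activations, and runtime overhead, which is an explicit simplification), and it reuses the walkthrough's rough latency and throughput figures as inputs rather than measuring anything.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def generation_seconds(n_tokens: int, first_token_s: float, tokens_per_s: float) -> float:
    """Rough wall-clock time: time to first token plus steady-state decoding."""
    return first_token_s + n_tokens / tokens_per_s


N_PARAMS = 3.8e9  # phi-3-mini

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(N_PARAMS, bits):.1f} GB")

# Walkthrough figures: ~300 ms to first token, ~12 tokens/sec sustained.
reply_s = generation_seconds(n_tokens=200, first_token_s=0.3, tokens_per_s=12.0)
print(f"200-token reply: about {reply_s:.0f} s on-device")
```

At 16 bits the weights alone would exceed a 6 GB phone's memory, which is why the 4-bit setting matters for this deployment story.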
Does it work? What breaks?
| Model | Params | MMLU | MT-bench | HumanEval |
|---|---|---|---|---|
| phi-3-mini | 3.8B | 69.0 | 8.38 | 58.5 |
| Mistral 7B | 7.0B | 60.3 | 7.6 | 28.0 |
| LLaMA-2 7B | 7.0B | 46.0 | 6.95 | 12.8 |
| Mixtral 8x7B | 12.9B active | 70.0 | 8.30 | 37.8 |
| phi-3-small | 7B | 75.7 | — | 61.0 |
| phi-3-medium | 14B | 78.0 | — | 62.2 |
| GPT-3.5 (estimated) | ~175B | ~70 | 8.4 | ~67 |
Phi-3-mini at 3.8B effectively matches Mixtral 8x7B (12.9B active) on MMLU and exceeds it on MT-bench. Phi-3-medium at 14B beats Mixtral on MMLU.
What breaks:
- Factual knowledge ceiling. Small models simply cannot store as many facts as large ones. Phi-3-mini fails on long-tail factual questions where Mixtral or GPT-4 succeed. The mitigation is RAG, not parameter scaling.
- Multilingual is uneven. The data composition is English-heavy; non-English performance lags.
- Synthetic-data risks. A model trained on GPT-4-generated text inherits GPT-4’s biases and failure modes; it also risks “model collapse” if models were repeatedly trained on the outputs of earlier models. Phi-3 mixes synthetic data with filtered web data to mitigate this.
- Benchmark contamination concerns. Critics note that phi-3 numbers on benchmarks like MMLU may be partly explained by training data containing similar question-answer formats. The team disputes this; the actual answer is somewhere in between.
- Reasoning depth. On Big-Bench Hard or MATH, phi-3 does well but not at GPT-4 level. Compositional reasoning and multi-step proofs remain harder for small models.
So what?
For a practitioner deciding what model to deploy:
- Re-evaluate the small-model assumption. Three years ago, “small model” meant “useless for production.” Phi-3 changed that. A 3.8B-7B model is genuinely capable for many tasks (summarization, simple QA, code completion, classification).
- Edge / on-device inference is now viable. With 4-bit quantization, phi-3-mini fits in 1.9 GB. iPhone, mid-range Android, embedded SoCs can all run it. Privacy-sensitive applications (health, finance, on-device assistants) can stay local.
- Data quality > parameter count for many use cases. If you can curate or generate high-quality data for your task, a smaller model trained on that data may outperform a larger model trained on web crawl.
- Generation pipelines beat scrape pipelines. Phi-3’s textbook-style synthetic data is what closed the gap with larger models. A startup doing a vertical-specific LLM should consider generating more data with a stronger teacher rather than scaling parameters.
- For RAG systems, smaller models are increasingly fine. With retrieval handling factual recall, the LLM only needs to reason and synthesize. A 3.8B model can do that, and at 10-100x lower inference cost than a 70B+ model.
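The division of labor in the RAG point (retrieval supplies the facts, the small model only reasons over them) can be sketched without any model at all. The example below is a toy: the three-sentence corpus is made up, bag-of-words cosine similarity stands in for a real embedding index, and the prompt template is an assumption. The resulting prompt is what would be handed to a small model such as phi-3-mini.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: list[str], k: int = 1) -> str:
    """Assemble a grounded prompt; the template is an illustrative assumption."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."


# Made-up mini corpus standing in for a document store.
DOCS = [
    "The Eiffel Tower is 330 metres tall and stands in Paris.",
    "Mount Kilimanjaro rises 5895 metres above sea level in Tanzania.",
    "The Amazon river flows through Brazil and Peru.",
]

question = "What is the height of Mount Kilimanjaro in Tanzania?"
print(build_prompt(question, DOCS))
```

The long-tail fact (5895 metres) arrives in the context, so the model does not need to have memorized it.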
For Saikat’s L5 interview prep: Phi-3 is the modern counter-argument to “scaling laws say you must go big.” Knowing both sides — Chinchilla’s compute-optimal framework and Phi-3’s data-quality counter-argument — is what distinguishes a senior systems engineer from a model-trainer-by-numbers. The synthesis: scaling laws still hold given a fixed data distribution; the way to beat them is to change the distribution.
Connections
- data-quality — primary lever Phi-3 pulls
- scaling-laws — the framework Phi-3 partly subverts via data-quality
- compute-optimal-training — Phi-3 explicitly trains past the Chinchilla optimum
- pre-training — recipe: filter + synthesize
- training-compute-optimal-large-language-models — Chinchilla; the framework being challenged
- scaling-laws-for-neural-language-models — Kaplan; original scaling laws
- microsoft-research — author lab
Citation
Abdin, M., Aneja, J., Awadalla, H., et al. (2024). Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv preprint. https://arxiv.org/abs/2404.14219