Concepts: data-quality | scaling-laws | compute-optimal-training | pre-training
Builds on: training-compute-optimal-large-language-models | scaling-laws-for-neural-language-models

Chinchilla said: for a given compute budget, train on more tokens than you thought. Phi-3 says: that scaling law was the floor, not the ceiling. With sufficiently curated data (heavy filtering of the web plus synthetic data generated by stronger models), a 3.8B parameter model can match the quality of Mixtral 8x7B (12.9B active) and come within striking distance of GPT-3.5. The implication is that “data quality” is a separate axis from “data quantity,” and the field had been treating the two as one.

The core idea

The analogy: Two students prepare for the same exam. Student A reads every textbook, every blog post, every Reddit thread on the topic — 10,000 pages of mixed-quality material. Student B has a tutor (a top student) who curates the 1,000 most pedagogically useful pages and writes 500 pages of fresh worked examples specifically calibrated to the exam. On test day, Student B wins, despite reading 10x less.

Phi-3-mini is Student B. It is trained on roughly 3.3T tokens — comparable to LLaMA-2 7B’s 2T — but the composition of those tokens is radically different:

  1. Heavily filtered web data. Where most prior models filtered at the document level (keep / drop), Phi-3 filters at finer granularity, using classifier-based scoring trained on human and model-generated quality labels.
  2. Synthetic data generated by GPT-4-class teachers. Worked examples, textbook-style explanations, problem-solution pairs. Generated to fill specific capability gaps the team identified.
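
A minimal sketch of what filtering below the document level could look like, assuming paragraphs as the unit and a scoring function that returns a value in [0, 1]. The heuristic scorer, the threshold, and all function names here are hypothetical stand-ins for illustration; the paper does not publish its classifier.

```python
from typing import Callable, List

def heuristic_quality_score(chunk: str) -> float:
    """Hypothetical stand-in for a learned quality classifier; returns a score in [0, 1]."""
    words = chunk.split()
    if not words:
        return 0.0
    # Purely illustrative signals: average word length and explanatory connectives.
    avg_len = sum(len(w) for w in words) / len(words)
    connectives = sum(
        w.lower().strip(".,") in {"because", "therefore", "thus", "hence"} for w in words
    )
    return min(1.0, 0.1 * avg_len + 0.2 * connectives)

def filter_document(
    doc: str, score: Callable[[str], float], threshold: float = 0.6
) -> List[str]:
    """Score each paragraph separately and keep those at or above the threshold."""
    chunks = [p.strip() for p in doc.split("\n\n") if p.strip()]
    return [c for c in chunks if score(c) >= threshold]

doc = (
    "click here buy now lol\n\n"
    "The derivative is zero at the maximum because the slope changes sign there."
)
kept = filter_document(doc, heuristic_quality_score)
print(kept)
```

The point of the design is that a document-level keep/drop decision would have to accept or reject both paragraphs together, while chunk-level scoring can keep the useful one and drop the noise.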

“Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data.”

The architecture is unremarkable: standard decoder-only transformer, 32 layers, 32 attention heads, 3072 hidden dim. (Grouped-query attention appears in phi-3-small, not in phi-3-mini.) The recipe is the data.

What’s clever — find the instinct

The non-obvious move is rejecting the Chinchilla-optimal regime on purpose. Chinchilla’s rule of thumb of roughly 20 tokens per parameter puts a compute-optimal 3.8B model at about 76B tokens. Phi-3-mini trains on 3.3T, roughly 40x that. This is “over-training” by Chinchilla’s framework, and yet it produces a much stronger model than a compute-optimal 3.8B would.
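
The arithmetic behind that comparison can be checked in a few lines. This sketch assumes the commonly quoted approximation of about 20 training tokens per parameter as the Chinchilla-optimal ratio; the constant and the function name are illustrative, not taken from the Phi-3 report.

```python
TOKENS_PER_PARAM = 20  # assumed rule of thumb drawn from Chinchilla; an approximation

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal token count for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params

n_params = 3.8e9        # phi-3-mini parameter count
actual_tokens = 3.3e12  # phi-3-mini training tokens

optimal_tokens = chinchilla_optimal_tokens(n_params)
overtrain_ratio = actual_tokens / optimal_tokens

print(f"compute-optimal tokens: {optimal_tokens / 1e9:.0f}B")
print(f"over-training factor: {overtrain_ratio:.0f}x")
```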

The reason: Chinchilla optimizes for a specific objective (validation loss on a fixed text distribution) under a specific assumption (training data quality is held fixed). When you can change the data distribution itself, the optimum shifts. A small model trained on extremely high-quality data outperforms a larger model trained on noisy data, and the small model’s inference cost is permanently lower.
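
To make the "fixed data distribution" assumption concrete, here is a sketch of the parametric loss form from the Chinchilla paper, L(N, D) = E + A/N^alpha + B/D^beta, evaluated for a 3.8B model at two token budgets. The constants are the fits reported in that paper for its own data mix and are used here only as an illustration; nothing in this sketch models data quality, which is exactly the axis Phi-3 changes.

```python
# Constants fitted in the Chinchilla paper for its own data mix; illustrative only.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def parametric_loss(n_params: float, n_tokens: float) -> float:
    """L(N, D) = E + A / N^alpha + B / D^beta, valid for one fixed data distribution."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

n_params = 3.8e9
loss_compute_optimal = parametric_loss(n_params, 76e9)  # about 20 tokens per parameter
loss_overtrained = parametric_loss(n_params, 3.3e12)    # phi-3-mini's token budget

print(f"loss at 76B tokens:  {loss_compute_optimal:.3f}")
print(f"loss at 3.3T tokens: {loss_overtrained:.3f}")
```

Even inside this framework, more tokens lower the predicted loss of a fixed-size model; what the formula cannot express is a change in the constants when the data itself improves.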

“Phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench, despite being small enough to be deployed on a phone.”

The broader Phi philosophy (“textbooks are all you need,” from Phi-1/Phi-2): the bottleneck for small models is not parameter count but density of useful information per training token. If you can generate textbook-quality data at scale, parameter efficiency goes up.

The second clever move is the data-curation feedback loop. The team uses internal evals to identify capability gaps, then generates synthetic data targeted at those gaps, then re-trains. This is data-centric AI: the recipe is “look at where the model fails, generate examples that fix it.” It loosely resembles the feedback loop of RLHF, applied to data generation before the alignment stage.
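
One step of such a loop could be sketched as follows, assuming per-capability eval scores in [0, 1] and a fixed budget of synthetic examples to generate. The capability names, the target score, and the proportional allocation rule are all hypothetical; the report does not describe its procedure at this level of detail.

```python
from typing import Dict

def plan_synthetic_budget(
    eval_scores: Dict[str, float], target: float, total_examples: int
) -> Dict[str, int]:
    """Split a synthetic-example budget across capabilities in proportion to their gap below target."""
    gaps = {cap: target - s for cap, s in eval_scores.items() if s < target}
    total_gap = sum(gaps.values())
    if total_gap == 0:
        return {}  # nothing is below target, so nothing to generate
    return {cap: round(total_examples * gap / total_gap) for cap, gap in gaps.items()}

# Hypothetical eval results for illustration only.
scores = {"arithmetic": 0.5, "code": 0.7, "summarization": 0.9}
plan = plan_synthetic_budget(scores, target=0.8, total_examples=1000)
print(plan)
```

The plan would then drive prompts to the teacher model, and the next training run and eval pass close the loop.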

“We… target the inference efficiency of the model… ensure that the model is robust against adversarial inputs.”

The third clever move is the family. Phi-3-mini (3.8B) is joined by phi-3-small (7B) and phi-3-medium (14B), and later by phi-3.5-mini, phi-3.5-MoE (16 experts of 3.8B each, 6.6B active parameters), and phi-3.5-Vision. Each applies the same data-quality recipe at a different scale. The MoE version in particular activates only 6.6B parameters per token while drawing on roughly 42B total parameters.

Walkthrough: phi-3-mini sits on a phone

Setup: phi-3-mini, 3.8B parameters, 4-bit quantized.
       Memory footprint: 3.8B * 0.5 bytes = 1.9 GB.
       iPhone 14: 6 GB RAM. Plenty of room.

  Run on-device:
    Prompt: "Translate this email to formal German: ..."
    Latency to first token: ~300 ms on iPhone Neural Engine.
    Throughput: ~12 tokens/sec sustained.
    Battery cost: ~5% per 1000 tokens (rough).

Compare to GPT-3.5 quality (cloud-only):
    On MMLU: GPT-3.5 ≈ 70%, phi-3-mini = 69%. Nearly tied.
    On MT-bench: GPT-3.5 ≈ 8.4, phi-3-mini = 8.38. Tied.
    Cloud round-trip: 200-1500 ms latency, $0.001-0.01 per call.
    On-device: no network round-trip, no per-call cost, full privacy.

Compare to LLaMA-2 7B (the size class people previously aimed for on edge):
    LLaMA-2 7B on MMLU: ~46%.
    phi-3-mini on MMLU: 69%.
    Phi is half the size and dramatically more capable.
    Difference: training data quality.
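
The footprint and latency figures in the walkthrough follow from simple arithmetic, sketched here. The estimate covers weights only and ignores the KV cache, activations, and runtime overhead, so real memory use would be somewhat higher; the function names are illustrative.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

def generation_seconds(
    n_tokens: int, tokens_per_sec: float, first_token_latency_s: float
) -> float:
    """Rough wall-clock time to produce n_tokens at a sustained decode rate."""
    return first_token_latency_s + n_tokens / tokens_per_sec

n_params = 3.8e9
int4_gb = weight_memory_gb(n_params, 4)
fp16_gb = weight_memory_gb(n_params, 16)
print(f"4-bit weights: {int4_gb:.1f} GB")
print(f"fp16 weights:  {fp16_gb:.1f} GB")

# Using the walkthrough's rough figures: ~300 ms to first token, ~12 tokens/sec.
reply_time = generation_seconds(120, 12, 0.3)
print(f"120-token reply: {reply_time:.1f} s")
```

The fp16 figure shows why quantization matters here: unquantized weights alone would exceed the 6 GB of RAM assumed in the walkthrough.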

The deployment story is the punch line. A model good enough to draft emails, answer factual questions, and do basic reasoning, sitting on a phone with no network. This was implausible a year before this paper.

Does it work? What breaks?

Model                Params         MMLU   MT-bench   HumanEval
phi-3-mini           3.8B           69.0   8.38       58.5
Mistral 7B           7.0B           60.3   7.6        28.0
LLaMA-2 7B           7.0B           46.0   6.95       12.8
Mixtral 8x7B         12.9B active   70.0   8.30       37.8
phi-3-small          7B             75.7   -          61.0
phi-3-medium         14B            78.0   -          62.2
GPT-3.5 (estimated)  ~175B          ~70    8.4        ~67

Phi-3-mini at 3.8B effectively matches Mixtral 8x7B (12.9B active) on MMLU and exceeds it on MT-bench. Phi-3-medium at 14B beats Mixtral on MMLU.

What breaks:

  • Factual knowledge ceiling. Small models simply cannot store as many facts as large ones. Phi-3-mini fails on long-tail factual questions where Mixtral or GPT-4 succeed. The mitigation is RAG, not parameter scaling.
  • Multilingual is uneven. The data composition is English-heavy; non-English performance lags.
  • Synthetic-data risks. A model trained on GPT-4-generated text inherits GPT-4’s biases and failure modes; it also risks “model collapse” if the chain were repeated indefinitely. Phi-3 mixes synthetic data with filtered web data to mitigate this.
  • Benchmark contamination concerns. Critics note that phi-3 numbers on benchmarks like MMLU may be partly explained by training data containing similar question-answer formats. The team disputes this; the actual answer is probably somewhere in between.
  • Reasoning depth. On Big-Bench Hard or MATH, phi-3 does well but not at GPT-4 level. Compositional reasoning and multi-step proofs remain harder for small models.
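
For the contamination concern above, one common style of check is n-gram overlap between benchmark items and training text. The sketch below is a generic illustration of that idea, not the method used by the Phi-3 team or its critics; the whitespace tokenization and the choice of n are assumptions.

```python
from typing import Set, Tuple

def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """All word-level n-grams of the lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the training document."""
    item = ngrams(benchmark_item, n)
    if not item:
        return 0.0  # item shorter than n words
    return len(item & ngrams(training_doc, n)) / len(item)

question = "what is the capital of france"
leaked = "quiz answers: what is the capital of france paris"
clean = "a short essay about rivers and mountains in europe"

print(overlap_fraction(question, leaked, n=3))
print(overlap_fraction(question, clean, n=3))
```

Exact overlap only catches verbatim leakage; synthetic data that paraphrases benchmark-style questions would pass this check, which is part of why the debate is hard to settle.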

So what?

For a practitioner deciding what model to deploy:

  1. Re-evaluate the small-model assumption. Three years ago, “small model” meant “useless for production.” Phi-3 changed that. A 3.8B-7B model is genuinely capable for many tasks (summarization, simple QA, code completion, classification).
  2. Edge / on-device inference is now viable. With 4-bit quantization, phi-3-mini fits in 1.9 GB. iPhone, mid-range Android, embedded SoCs can all run it. Privacy-sensitive applications (health, finance, on-device assistants) can stay local.
  3. Data quality > parameter count for many use cases. If you can curate or generate high-quality data for your task, a smaller model trained on that data may outperform a larger model trained on web crawl.
  4. Generation pipelines beat scrape pipelines. Phi-3’s textbook-style synthetic data is what closed the gap with larger models. A startup doing a vertical-specific LLM should consider generating more data with a stronger teacher rather than scaling parameters.
  5. For RAG systems, smaller models are increasingly fine. With retrieval handling factual recall, the LLM only needs to reason and synthesize. A 3.8B model can do that, and at 10-100x lower inference cost than a 70B+ model.
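
The division of labor in the last point can be sketched as a toy retrieval step that builds a grounded prompt for a small model. The word-overlap ranking and the prompt template are hypothetical simplifications; a real system would use an embedding index and the model's own chat format.

```python
from typing import List

def retrieve(query: str, passages: List[str], k: int = 2) -> List[str]:
    """Rank passages by word overlap with the query and return the top k."""
    q = set(query.lower().split())
    ranked = sorted(
        passages, key=lambda p: len(q & set(p.lower().split())), reverse=True
    )
    return ranked[:k]

def build_prompt(query: str, passages: List[str]) -> str:
    """Assemble a prompt that asks the model to answer from the retrieved context only."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

passages = [
    "Python is a programming language.",
    "The Eiffel Tower was completed in 1889.",
    "The Louvre is a museum in Paris.",
]
query = "When was the Eiffel Tower completed?"
top = retrieve(query, passages, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

With the fact supplied in the prompt, the model only has to read and synthesize, which is the part a 3.8B model handles well.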

For Saikat’s L5 interview prep: Phi-3 is the modern counter-argument to “scaling laws say you must go big.” Knowing both sides — Chinchilla’s compute-optimal framework and Phi-3’s data-quality counter-argument — is what distinguishes a senior systems engineer from a model-trainer-by-numbers. The synthesis: scaling laws still hold given a fixed data distribution; the way to beat them is to change the distribution.

Citation

arXiv:2404.14219

Abdin, M., Aneja, J., Awadalla, H., et al. (2024). Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv preprint. https://arxiv.org/abs/2404.14219