What It Is

Model compression reduces a trained neural network’s size, latency, or memory footprint while preserving as much accuracy as possible. The main approaches are knowledge distillation (train a smaller model on a larger model’s outputs), quantization (reduce numerical precision), pruning (remove low-importance weights), and low-rank factorization (approximate weight matrices as products of smaller matrices).

Why It Matters

Frontier models are too large to run on consumer hardware. Compression is the primary path from research-scale models to deployed products, bridging the gap between GPT-4 and a 7B model running on a laptop.

How It Works

Knowledge distillation trains a smaller “student” on soft targets (temperature-scaled output probabilities) from a larger “teacher.” Quantization reduces float32 weights to int8 or int4 (a 4× or 8× memory reduction, respectively). Pruning zeros out weights below a magnitude threshold. These techniques are often combined (e.g., distill, then quantize).
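The three techniques above can each be sketched in a few lines of NumPy. This is an illustrative toy, not any library’s API: the weight matrix, the 50% pruning threshold, the temperature, and the example logits are all made up for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)  # toy float32 weight matrix

# --- Quantization: symmetric per-tensor int8 ---
scale = np.abs(w).max() / 127.0                 # map [-max|w|, max|w|] onto [-127, 127]
w_int8 = np.round(w / scale).astype(np.int8)    # stored weights: 1 byte each
w_dequant = w_int8.astype(np.float32) * scale   # values actually used at inference
assert w_int8.nbytes * 4 == w.nbytes            # 4x smaller than float32 storage

# --- Pruning: zero out weights below a magnitude threshold ---
threshold = np.quantile(np.abs(w), 0.5)         # drop the smallest 50% by magnitude
w_pruned = np.where(np.abs(w) < threshold, 0.0, w)

# --- Distillation: student matches the teacher's softened output distribution ---
def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

teacher_logits = np.array([[5.0, 2.0, 0.5]])    # hypothetical teacher outputs
student_logits = np.array([[3.0, 2.5, 0.2]])    # hypothetical student outputs
T = 4.0                                         # temperature softens the targets
p = softmax(teacher_logits, T)                  # soft targets from the teacher
q = softmax(student_logits, T)
kd_loss = (p * (np.log(p) - np.log(q))).sum()   # KL(teacher || student), driven to 0 in training

print("max quantization error:", float(np.abs(w - w_dequant).max()))
print("pruned sparsity:", float((w_pruned == 0).mean()))
print("distillation loss:", float(kd_loss))
```

Note the trade-offs the sketch makes visible: quantization error is bounded by half the scale step, pruning sparsity is set directly by the threshold quantile, and the distillation loss shrinks as the student’s distribution approaches the teacher’s.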

Key Sources