What It Is
Model compression reduces a trained neural network’s size, memory footprint, or inference latency while preserving as much accuracy as possible. The main approaches are: knowledge distillation (train a smaller model on a larger model’s outputs), quantization (reduce numerical precision), pruning (remove low-importance weights), and low-rank factorization.
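Knowledge distillation, the first approach above, typically minimizes the divergence between teacher and student output distributions. A minimal sketch of that loss in numpy (the temperature value and the `distillation_loss` helper are illustrative, not from any particular library):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T produces softer distributions."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence from the teacher's soft targets to the student's predictions."""
    p = softmax(teacher_logits, T)  # teacher soft targets
    q = softmax(student_logits, T)  # student predictions
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([4.0, 1.0, -2.0])
student = np.array([1.0, 4.0, -2.0])

# A student that matches the teacher exactly incurs zero loss;
# any mismatch in the ranking of classes is penalized.
print(distillation_loss(teacher, teacher))  # ~0.0
print(distillation_loss(student, teacher))  # > 0
```

In practice this soft-target term is usually mixed with the ordinary cross-entropy on ground-truth labels; the temperature softens the teacher's distribution so the student can learn from the relative probabilities of wrong answers, not just the argmax.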
Why It Matters
Frontier models are too large to run on consumer hardware. Compression is the primary path from research-scale models to deployed products; it is what bridges the gap between GPT-4 and a 7B model running on a laptop.
How It Works
Knowledge distillation trains a smaller “student” on soft targets from a larger “teacher.” Quantization reduces float32 weights to int8 or int4 (a 4× or 8× memory reduction, respectively). Pruning zeros out weights below a magnitude threshold. These techniques are often combined (e.g., distill then quantize).
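The quantization and pruning steps above can be sketched in a few lines of numpy. This is a simplified illustration (symmetric per-tensor int8 quantization and global magnitude pruning; the helper names are hypothetical), not how any production library implements them:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: float32 -> (int8, scale)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float32 values."""
    return q.astype(np.float32) * scale

def magnitude_prune(w, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights."""
    k = int(w.size * sparsity)
    threshold = np.sort(np.abs(w.ravel()))[k]
    return np.where(np.abs(w) < threshold, 0.0, w)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)

q, s = quantize_int8(w)
print(w.nbytes // q.nbytes)                        # 4x memory reduction
err = np.abs(dequantize(q, s) - w).max()           # small round-trip error
pruned = magnitude_prune(w, sparsity=0.5)          # ~50% of weights zeroed
```

The round-trip error is bounded by half the quantization step (`scale / 2`), which is why int8 usually costs little accuracy while int4, with 16× fewer levels, often needs finer-grained (per-channel or per-group) scales.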
Key Sources
Related Concepts
- distillation — the dominant compression approach for neural networks
- quantization — reducing numerical precision
- inference-efficiency — the goal compression serves