What It Is
Model compression reduces a trained neural network’s size, memory footprint, or inference latency while preserving as much accuracy as possible. The main approaches are: knowledge distillation (train a smaller model on a larger model’s outputs), quantization (reduce numerical precision), pruning (remove low-importance weights), and low-rank factorization.
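Knowledge distillation, the first approach above, typically minimizes the divergence between teacher and student output distributions. A minimal sketch of that loss in numpy (the temperature value and the `distillation_loss` helper are illustrative, not from any particular library):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T produces softer distributions."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence from the teacher's soft targets to the student's predictions."""
    p = softmax(teacher_logits, T)  # teacher soft targets
    q = softmax(student_logits, T)  # student predictions
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([4.0, 1.0, -2.0])
student = np.array([1.0, 4.0, -2.0])

# A student that matches the teacher exactly incurs zero loss;
# any mismatch in the ranking of classes is penalized.
print(distillation_loss(teacher, teacher))  # ~0.0
print(distillation_loss(student, teacher))  # > 0
```

In practice this soft-target term is usually mixed with the ordinary cross-entropy on ground-truth labels; the temperature softens the teacher's distribution so the student can learn from the relative probabilities of wrong answers, not just the argmax.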
Why It Matters
Frontier models are too large to run on consumer hardware. Compression is the primary path from research-scale models to deployed products; it is what bridges the gap between GPT-4 and a 7B model running on a laptop.
How It Works
Knowledge distillation trains a smaller “student” on soft targets from a larger “teacher.” Quantization reduces float32 weights to int8 or int4 (a 4× or 8× memory reduction, respectively). Pruning zeros out weights below a magnitude threshold. These techniques are often combined (e.g., distill then quantize).
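The quantization and pruning steps above can be sketched in a few lines of numpy. This is a simplified illustration (symmetric per-tensor int8 quantization and global magnitude pruning; the helper names are hypothetical), not how any production library implements them:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: float32 -> (int8, scale)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float32 values."""
    return q.astype(np.float32) * scale

def magnitude_prune(w, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights."""
    k = int(w.size * sparsity)
    threshold = np.sort(np.abs(w.ravel()))[k]
    return np.where(np.abs(w) < threshold, 0.0, w)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)

q, s = quantize_int8(w)
print(w.nbytes // q.nbytes)                        # 4x memory reduction
err = np.abs(dequantize(q, s) - w).max()           # small round-trip error
pruned = magnitude_prune(w, sparsity=0.5)          # ~50% of weights zeroed
```

The round-trip error is bounded by half the quantization step (`scale / 2`), which is why int8 usually costs little accuracy while int4, with 16× fewer levels, often needs finer-grained (per-channel or per-group) scales.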
Key Sources
Related Concepts
- distillation — the dominant compression approach for neural networks
- quantization — reducing numerical precision
- inference-efficiency — the goal compression serves