Stub — full concept page pending.

Quantization reduces the numerical precision of model weights (and optionally activations and KV-cache entries) from FP32 or FP16 down to INT8 or INT4. This shrinks the memory footprint by roughly 2–4x relative to FP16, enabling larger batch sizes, lower latency, and deployment on memory-constrained hardware. The core challenge is limiting accuracy degradation: weight distributions are roughly bell-shaped with occasional outliers, so a naive rounding scheme that scales to the maximum value wastes resolution on the tails and loses precision for the bulk of the weights. Post-training quantization methods such as GPTQ and AWQ use a small calibration set to choose quantization parameters that minimize per-layer reconstruction error. QLoRA shows that a 4-bit base model remains compatible with fine-tuning when the updates are learned through higher-precision LoRA adapters.
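As a minimal sketch of the trade-off described above (function names are illustrative, not from GPTQ, AWQ, or QLoRA), the following symmetric per-tensor INT8 round trip shows how a single outlier inflates the quantization scale, and with it the reconstruction error for ordinary weights:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: scale so that max |w| maps to 127."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 codes back to approximate FP32 weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)  # typical weight spread

# Clean tensor: the scale tracks the bulk of the distribution.
q, scale = quantize_int8(w)
err_clean = float(np.abs(dequantize(q, scale) - w).mean())

# One outlier stretches the range: 127 levels must now cover [-1, 1],
# leaving only a handful of levels for weights near +/-0.02.
w_out = w.copy()
w_out[0] = 1.0
q_out, scale_out = quantize_int8(w_out)
err_outlier = float(np.abs(dequantize(q_out, scale_out) - w_out).mean())
```

The outlier grows the per-tensor scale by roughly an order of magnitude, and the mean reconstruction error grows with it; handling such outliers (e.g. by rescaling salient channels before rounding) is the kind of problem calibration-based methods target.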

Related sources: qlora-efficient-finetuning-quantized-llms