AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration

Concepts: quantization | inference-efficiency | memory-efficiency
Builds on: qlora-efficient-finetuning-quantized-llms (QLoRA quantizes LLMs for fine-tuning; AWQ solves the same quantization problem for inference deployment)
Leads to: INT4 deployment of LLaMA, Mistral, and Mixtral models in vLLM, TensorRT-LLM, and llama.cpp

Shrinking a 7B model from 14 GB to 3.5 GB sounds like a blunt trade-off: lose precision, lose accuracy. But here is the part that breaks that intuition — which bits you discard matters enormously. The difference between a 4-bit LLM that works and one that hallucinates more often is whether you knew, before quantizing, which weight channels actually needed their precision preserved.

AWQ (Activation-aware Weight Quantization, Lin et al. 2023) figures that out in about 10 minutes, using the model’s own activations on a small calibration set — and then protects those critical channels without the mixed-precision hardware complexity that makes other approaches painful to deploy.

The core idea

The analogy: Think of a cassette tape recording a 128-piece orchestra. The tape has limited dynamic range: only so many gradations between silence and maximum volume. Record everything at the same level and you get clean brass and a muffled piccolo. The piccolo is quiet relative to the group, so it gets rounded aggressively into the tape’s coarse grid.

The audio engineer’s solution: pre-emphasis. Boost the piccolo’s microphone gain before recording — now it sits in the middle of the dynamic range where the tape is most precise. When playing back, apply the inverse attenuation. The listener hears the exact same mix. But the piccolo survived quantization cleanly because it spent its time in the quantization sweet spot.

AWQ does exactly this for LLM weights. The “piccolo” is the weight channel whose corresponding activations are large — the feature the model uses heavily. Scale that weight channel up before quantizing, divide the corresponding activations down to compensate. The mathematical output is unchanged. The quantization error for the channels that matter most is reduced by the scale factor.

The mechanism, step by step:

Run a forward pass on 16–128 calibration sequences. Record the average magnitude of each input feature channel. This takes minutes.
Identify salient channels: those with the highest average activation magnitudes. These are the features the model leans on most.
For each salient weight channel: multiply the weight by a scaling factor $s > 1$. The corresponding activation (the output of the previous layer) gets divided by $s$, and that division is fused into the previous layer’s computation with no overhead.
Quantize everything uniformly to INT4 or INT3. The salient channels, boosted by $s$, now have smaller relative quantization error. Everything else is quantized as before.
Done. No backpropagation, no gradient through quantization, no retraining.
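
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the authors' implementation: the layer shapes, the three hand-picked high-activation channels, the fixed scale of 2.0, and the helper name `quantize_groups` are all assumptions made for the example.

```python
import numpy as np

def quantize_groups(w, n_bits=4, group_size=128):
    """Round-to-nearest with one step size per group of input channels.

    Follows the simplified formula delta = max|w| / 2^(N-1).
    A real kernel would also clamp to the representable integer range.
    """
    out_f, in_f = w.shape
    groups = w.reshape(out_f, in_f // group_size, group_size)
    delta = np.abs(groups).max(axis=-1, keepdims=True) / 2 ** (n_bits - 1)
    delta = np.maximum(delta, 1e-12)  # guard against all-zero groups
    return (np.round(groups / delta) * delta).reshape(out_f, in_f)

rng = np.random.default_rng(0)
in_f, out_f, n_tokens = 256, 64, 512

# Synthetic stand-ins for one linear layer and its calibration inputs.
W = rng.normal(scale=0.02, size=(out_f, in_f))
X = rng.normal(size=(in_f, n_tokens))
X[[3, 70, 200]] *= 30.0  # three input channels carry unusually large activations

# Step 1: average magnitude of each input channel over the calibration tokens.
s_x = np.abs(X).mean(axis=1)

# Step 2: treat the top 1% of channels as salient.
k = int(np.ceil(0.01 * in_f))
salient = np.argsort(s_x)[-k:]

# Step 3: scale salient weight columns up by s; activations are divided by s.
s = np.ones(in_f)
s[salient] = 2.0
X_scaled = X / s[:, None]
assert np.allclose((W * s) @ X_scaled, W @ X)  # output unchanged before quantization

# Step 4: quantize uniformly, then compare output error against plain RTN.
reference = W @ X
err_rtn = np.abs(quantize_groups(W) @ X - reference).mean()
err_awq = np.abs(quantize_groups(W * s) @ X_scaled - reference).mean()
print(f"RTN error: {err_rtn:.4f}  scaled error: {err_awq:.4f}")
```

In a real model the division of `X` by `s` would be folded into the preceding layer rather than applied at runtime; the sketch applies it explicitly only to make the equivalence visible.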

STANDARD INT3 QUANTIZATION:

Weight group (one of ~160M groups in a 7B model):
  Channels:     [w1=0.01, w2=0.01, w3=6.0, w4=0.01, w5=0.02]
  Activations:  [ x1=50,   x2=1,   x3=0.1,  x4=0.5,  x5=2  ]

  Δ = max(|w|) / 2^(N-1) = 6.0 / 4 = 1.5   (N=3 bits → 8 levels)
  w1 rounds to: Round(0.01/1.5) × 1.5 = 0    ← completely zeroed
  Expected error on w1: Δ × 0.25 × x1 = 1.5 × 0.25 × 50 = 18.75   ← large!
  (0.25 is the average rounding error, in units of Δ)

  w1 has one of the SMALLEST weights but the LARGEST activation → most important channel
  Standard quantization treats it as noise

AWQ: scale salient channel (w1, x1) by s=2:

  w1 → 0.01 × 2 = 0.02   (Δ unchanged: still 6.0/4 = 1.5)
  x1 → 50 / 2 = 25       (absorbed into previous layer's output scaling)

  Expected error on w1: 1.5 × 0.25 × 25 = 9.375   (2× smaller, same math, better precision)

  With s=4:  Error = 1.5 × 0.25 × 12.5 = 4.6875   (4× smaller)
  
  But with s=1000: w1×s = 10.0 becomes the new group max → Δ rises to 2.5 → hurts w3
  Sweet spot found by grid search over α ∈ [0,1] where s = mean_activation^α
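
The toy group can be checked numerically. The sketch below applies the step-size formula Δ = max(|w|) / 2^(N-1) and the average-case estimate Δ × 0.25 × x / s to channel 1 for a few values of s. It is an average-case estimate, not the exact rounding error of these particular weights, and the helper names are made up for the example.

```python
import numpy as np

# Weight group and matching activations from the worked example above.
w = np.array([0.01, 0.01, 6.0, 0.01, 0.02])
x = np.array([50.0, 1.0, 0.1, 0.5, 2.0])
N_BITS = 3

def step_size(weights, n_bits=N_BITS):
    """Quantization step for one group: delta = max|w| / 2^(N-1)."""
    return np.abs(weights).max() / 2 ** (n_bits - 1)

def expected_error_w1(s):
    """Average-case output error on channel 1 after scaling w1 by s and x1 by 1/s."""
    scaled = w.copy()
    scaled[0] *= s
    delta = step_size(scaled)
    return delta, delta * 0.25 * x[0] / s

base_delta, base_err = expected_error_w1(1.0)
for s in (1.0, 2.0, 4.0, 1000.0):
    delta, err = expected_error_w1(s)
    print(f"s={s:>6}: delta={delta:.3f} ({delta / base_delta:.2f}x), "
          f"expected error={err:.4f} ({base_err / err:.0f}x smaller)")
```

For moderate s the step size stays fixed and the estimate falls in proportion to s. Once the scaled w1 overtakes w3 as the group maximum, the step size grows for every weight in the group, which is the cost the grid search has to balance.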

The math (only what matters):

Quantizing weight $w$ to $N$-bit integers:

$Q(w) = \Delta \cdot \mathrm{Round}\left(\frac{w}{\Delta}\right), \quad \Delta = \frac{\max(|w|)}{2^{N-1}}$

The error on the output $Q(w) \cdot x$ is:

$\mathrm{Err}(Q(w) \cdot x) = \Delta \cdot \mathrm{RoundErr}\left(\frac{w}{\Delta}\right) \cdot x$

where $\mathrm{RoundErr}$ is the rounding error, roughly uniform on $[0, 0.5]$ and therefore about $0.25$ on average.

Now apply AWQ’s scaling: weight $\times s$, activation $/ s$:

$\mathrm{Err}\left(Q(w \cdot s) \cdot \frac{x}{s}\right) = \Delta' \cdot \mathrm{RoundErr}\left(\frac{w s}{\Delta'}\right) \cdot \frac{x}{s}$

As long as scaling one channel doesn’t shift the group maximum ($\Delta' \approx \Delta$), the expected error shrinks by a factor of $s$:

$\approx \Delta \cdot 0.25 \cdot \frac{x}{s}$

For the salient channel where $x$ is large, this is a real reduction. The paper finds $s$ via a fast grid search:

$s^{*} = s_{X}^{\alpha^{*}}, \quad \alpha^{*} = \arg\min_{\alpha \in [0, 1]} \left\| Q\left(W \cdot \mathrm{diag}(s_{X}^{\alpha})\right) \cdot \mathrm{diag}(s_{X}^{\alpha})^{-1} X - W X \right\|$

where $s_{X}$ is the per-channel mean activation magnitude. A grid of 20 values for $α$ suffices — the search completes in minutes, uses only forward passes, and needs only 16 calibration sequences.
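
A minimal sketch of that grid search, assuming synthetic weights and activations and a simplified group quantizer. The function names `quantize_groups`, `output_loss`, and `search_alpha` are hypothetical, and the reference implementation may normalize the scales or space the grid differently.

```python
import numpy as np

def quantize_groups(w, n_bits=3, group_size=128):
    """Group-wise round-to-nearest using delta = max|w| / 2^(N-1) per group."""
    out_f, in_f = w.shape
    groups = w.reshape(out_f, in_f // group_size, group_size)
    delta = np.abs(groups).max(axis=-1, keepdims=True) / 2 ** (n_bits - 1)
    delta = np.maximum(delta, 1e-12)
    return (np.round(groups / delta) * delta).reshape(out_f, in_f)

def output_loss(W, X, s, n_bits=3, group_size=128):
    """|| Q(W diag(s)) diag(s)^-1 X - W X || for one candidate scale vector s."""
    Wq = quantize_groups(W * s, n_bits, group_size)
    return np.linalg.norm(Wq @ (X / s[:, None]) - W @ X)

def search_alpha(W, X, n_grid=20, **quant_kwargs):
    """Try s = s_X ** alpha on an evenly spaced grid and keep the best alpha."""
    s_x = np.maximum(np.abs(X).mean(axis=1), 1e-4)  # per-channel mean |activation|
    best_alpha, best_loss = 0.0, np.inf
    for alpha in np.linspace(0.0, 1.0, n_grid):
        loss = output_loss(W, X, s_x ** alpha, **quant_kwargs)
        if loss < best_loss:
            best_alpha, best_loss = float(alpha), loss
    return best_alpha, best_loss

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(64, 256))
X = rng.normal(size=(256, 512))
X[[3, 70, 200]] *= 30.0  # synthetic salient channels

rtn_loss = output_loss(W, X, np.ones(256))
best_alpha, best_loss = search_alpha(W, X)
print(f"RTN loss {rtn_loss:.3f} -> best loss {best_loss:.3f} at alpha={best_alpha:.2f}")
```

Because the grid includes alpha = 0, which gives s = 1 and therefore plain round-to-nearest, the search can never do worse than RTN on the calibration data.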

Numeric walkthrough from the paper:

OPT-6.7B, INT3 with group size 128, WikiText-2 perplexity (lower is better):

FP16 (no quantization):              PPL = 10.86
Round-to-nearest INT3 (RTN):         PPL = 23.54  (2.2× worse than FP16)
Keep 1% of channels in FP16          PPL = 11.39  (nearly recovered)
  (selected by activation magnitude)
Keep 1% of channels in FP16          PPL = 22.37  (barely helps)
  (selected by weight magnitude)                     ← this is what prior work did
AWQ (activation-aware scaling):      PPL = 11.39  (matches mixed-precision, hardware-friendly)

Multiplying the 1% salient channels by a fixed s instead of keeping them in FP16:

s = 1.25:                            PPL = 12.87  (too mild)
s = 2.0:                             PPL = 11.92  (empirical sweet spot)
s = 4.0:                             PPL = 12.36  (too aggressive: Δ rises, hurts others)

The key number: selecting salient channels by weight magnitude is no better than random. Selection by activation magnitude recovers nearly all the quantization loss. The mechanism is the model architecture itself — transformer weights multiply activations, so it is the product that matters, not either factor alone.
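
The selection effect is easy to reproduce in miniature. The sketch below uses synthetic weights and activations, so its numbers have nothing to do with the perplexities above. It keeps 1% of input channels in full precision, chosen three different ways, and measures the output error of a single linear layer. The helpers `rtn` and `keep_fp16` are made up for the illustration.

```python
import numpy as np

def rtn(w, n_bits=3):
    """Round-to-nearest with one step size per output row (delta = max|w| / 2^(N-1))."""
    delta = np.abs(w).max(axis=1, keepdims=True) / 2 ** (n_bits - 1)
    return np.round(w / delta) * delta

def keep_fp16(W, Wq, channels):
    """Mixed precision: quantized weights everywhere except the chosen input channels."""
    mixed = Wq.copy()
    mixed[:, channels] = W[:, channels]
    return mixed

rng = np.random.default_rng(0)
in_f, out_f, n_tokens = 512, 64, 256
W = rng.normal(scale=0.02, size=(out_f, in_f))
X = rng.normal(size=(in_f, n_tokens))
X[rng.choice(in_f, size=5, replace=False)] *= 30.0  # synthetic high-activation channels

k = int(np.ceil(0.01 * in_f))  # keep 1% of input channels in full precision
choices = {
    "activation": np.argsort(np.abs(X).mean(axis=1))[-k:],
    "weight": np.argsort(np.abs(W).mean(axis=0))[-k:],
    "random": rng.choice(in_f, size=k, replace=False),
}

Wq, reference = rtn(W), W @ X
errors = {"rtn": np.linalg.norm(Wq @ X - reference)}
for name, channels in choices.items():
    errors[name] = np.linalg.norm(keep_fp16(W, Wq, channels) @ X - reference)

for name, err in errors.items():
    print(f"{name:>10}: output error {err:.3f}")
```

Each input channel's quantization error reaches the output multiplied by that channel's activation, so protecting the few channels with large activations removes most of the error, while protecting channels with large weights removes roughly a random 1% of it.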

What’s clever:

“the insight is that we should refer to the activation distribution instead of the weight distribution, despite we are doing weight-only quantization: weight channels corresponding to larger activation magnitudes are more salient since they process more important features.”

This seems obvious in retrospect, but GPTQ, the prior SOTA, takes a different route: it uses second-order (Hessian) information, estimated from calibration inputs, to compensate for quantization error column by column. It works, but the compensation is a layer-wise reconstruction fitted to the calibration data, so it can overfit to the calibration domain.

AWQ’s insight: you don’t need to adjust any weights. You just need to scale channels before the quantization grid is applied, and the activations take care of the inverse. No gradient. No weight update. No domain overfitting.

“AWQ does not rely on any backpropagation or reconstruction, so it generalizes to different domains and modalities without overfitting the calibration set.”

This generalization is why the paper reports AWQ as the first method to quantize multi-modal (visual) language models well. A reconstruction-based method such as GPTQ, calibrated on text, can drift when you deploy on image captions. AWQ is less exposed to that: it measures a structural property of how the model uses its features rather than fitting error on a specific domain.

Does it actually work? What breaks?

Model	Format	Method	WikiText-2 PPL	Gap from FP16
LLaMA-2-7B	FP16	—	5.47	—
LLaMA-2-7B	INT4-g128	RTN	5.96	+0.49
LLaMA-2-7B	INT4-g128	GPTQ	5.83	+0.36
LLaMA-2-7B	INT4-g128	AWQ	5.78	+0.31
LLaMA-2-70B	FP16	—	3.32	—
LLaMA-2-70B	INT4-g128	AWQ	3.41	+0.09

On GSM8K math reasoning with Llama-2-7B INT4-g128, AWQ scores 13.57% vs. FP16’s 13.87%, which is essentially lossless. On MBPP code generation with CodeLlama-7B, AWQ reaches 40.64% pass@1 vs. FP16’s 38.53%. (A quantized model edging out its FP16 baseline by this margin is more plausibly evaluation noise than a real benefit of the scaling.)

TinyChat, the inference engine the authors ship with AWQ, achieves 3.2–3.9× speedup over HuggingFace FP16 on RTX 4090 by fusing dequantization into the GEMM kernel and packing 4-bit weights to align with SIMD register widths. A 70B model fits and runs on a single Jetson Orin (64 GB). A 13B model runs at 33 tokens/second on a laptop RTX 4070 with 8 GB — the FP16 version doesn’t even load.
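
To make "packing 4-bit weights" concrete, here is a simplified sketch that stores two 4-bit codes per byte and dequantizes them again. The low-nibble-first layout, the function names, and the scale and zero-point values are assumptions for illustration only. TinyChat's actual kernels use their own hardware-specific weight ordering and perform the dequantization inside the GEMM.

```python
import numpy as np

def pack_int4(q):
    """Pack unsigned 4-bit codes (0..15) two per byte, low nibble first."""
    q = np.asarray(q, dtype=np.uint8)
    if q.size % 2 != 0 or q.max() > 15:
        raise ValueError("need an even number of values in the range 0..15")
    return (q[0::2] | (q[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    """Inverse of pack_int4: recover the 4-bit codes in their original order."""
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out

def dequantize(packed, scale, zero):
    """Turn packed codes back into floating-point weights: (q - zero) * scale."""
    return (unpack_int4(packed).astype(np.float32) - zero) * scale

codes = np.array([1, 15, 0, 7], dtype=np.uint8)
packed = pack_int4(codes)
restored = unpack_int4(packed)
weights = dequantize(packed, scale=0.1, zero=8)
print(packed.nbytes, "bytes hold", codes.size, "weights:", weights)
```

Four weights occupy two bytes here. The per-group scale and zero point are stored separately, which is why the real footprint is slightly above 4 bits per weight.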

What doesn’t work:

INT2 quantization is brutal. At 2 bits, AWQ+GPTQ combined reaches PPL = 15.71 on OPT-6.7B vs. FP16’s 10.86. The scaling trick buys roughly one bit of effective precision; it cannot conjure bits from nothing.

AWQ is also purely a weight quantization technique (W4A16). If you need to quantize activations too — for integer-arithmetic-only hardware or W8A8 — you need SmoothQuant or similar. The activation-scaling trick does not extend cleanly to activation quantization because activations are input-dependent and change with each forward pass.

Finally, AWQ assumes activation statistics are stable across deployment. In practice the paper shows this is robust — using PubMed calibration data on Enron email inference only degrades AWQ by 0.5–0.6 PPL, vs. 2.3–4.9 PPL degradation for GPTQ. But in truly out-of-distribution scenarios, the importance estimates could be off.

So what?

If you are deploying any open-weight LLM today and you care about hardware cost, AWQ INT4-g128 is your default starting point. It is natively supported in llama.cpp, HuggingFace Transformers, NVIDIA TensorRT-LLM, vLLM, LMDeploy, and AMD’s stack. The practical decision rule: use AWQ INT4 when you need 4× memory reduction with less than 1% quality loss on most tasks. Avoid round-to-nearest (RTN) quantization in production — AWQ is almost always strictly better at negligible additional computation (minutes of calibration vs. no calibration).
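
For capacity planning, the memory arithmetic behind that rule can be written down directly. The sketch below estimates weight storage only (no KV cache or activations) and assumes one 16-bit scale plus one 4-bit zero point per group of 128 weights. That overhead is an illustrative assumption, not a figure from the paper, and real checkpoint formats may differ.

```python
def weight_memory_gb(n_params, bits_per_weight, group_size=None,
                     scale_bits=16, zero_bits=4):
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    group_size=None means no per-group metadata (for example plain FP16).
    scale_bits and zero_bits are assumed per-group overheads.
    """
    bits = float(bits_per_weight)
    if group_size is not None:
        bits += (scale_bits + zero_bits) / group_size
    return n_params * bits / 8 / 1e9

for name, n_params in [("7B", 7e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    int4 = weight_memory_gb(n_params, 4, group_size=128)
    print(f"{name}: FP16 {fp16:.1f} GB -> INT4-g128 {int4:.1f} GB "
          f"({fp16 / int4:.2f}x smaller)")
```

Under these assumptions the group metadata adds about 0.16 bits per weight, so the reduction lands a little under the nominal 4×.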

The broader principle AWQ demonstrates: a model’s activations are a window into what it actually uses. Weight magnitudes tell you what the model stores. Activation magnitudes tell you what it reads. When you need to compress a model, you should ask the model’s data what it cares about — not just inspect the weights in isolation. The same principle shows up in activation-based pruning, layer-skipping heuristics, and mixture-of-depths routing. AWQ is a clean, early example of building compression decisions on this observation.

“AWQ has been widely adopted by industry and open-source community: HuggingFace Transformers, NVIDIA TensorRT-LLM, Microsoft DirectML, Google Vertex AI, Intel Neural Compressor, Amazon SageMaker, AMD, FastChat, vLLM, LMDeploy.”

AWQ’s answer to “which 1% of your model actually matters”: look at what the activations have been telling you all along.

Paper: AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — Lin et al. — 2023

Connections

quantization — AWQ is the standard PTQ method for W4A16; introduces activation-guided per-channel scaling
inference-efficiency — AWQ + TinyChat achieves 3–4× inference speedup; enables edge deployment of 70B models
memory-efficiency — INT4 quantization reduces memory 4× (140 GB → 35 GB for a 70B model)
qlora-efficient-finetuning-quantized-llms — QLoRA uses NF4 quantization for fine-tuning on the same class of models
mistral-7b — Mistral and successor models widely deployed via AWQ

Citation

arXiv:2306.00978

Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., & Han, S. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. MLSys 2024. https://arxiv.org/abs/2306.00978
