Stub — full concept page pending.

Quantization reduces the numerical precision of model weights (and optionally activations and KV-cache entries) from FP32 or FP16 down to INT8 or INT4. This shrinks the memory footprint by roughly 2–4x relative to FP16, enabling larger batch sizes, lower latency, and deployment on memory-constrained hardware. The core challenge is limiting accuracy degradation: weight distributions are roughly bell-shaped with occasional outliers, so a naive rounding scheme that scales to the maximum value wastes resolution on the tails and loses precision for the bulk of the weights. Post-training quantization methods such as GPTQ and AWQ use a small calibration set to choose quantization parameters that minimize per-layer reconstruction error. QLoRA shows that a 4-bit base model remains compatible with fine-tuning when the updates are learned through higher-precision LoRA adapters.
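As a minimal sketch of the trade-off described above (function names are illustrative, not from GPTQ, AWQ, or QLoRA), the following symmetric per-tensor INT8 round trip shows how a single outlier inflates the quantization scale, and with it the reconstruction error for ordinary weights:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: scale so that max |w| maps to 127."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 codes back to approximate FP32 weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096).astype(np.float32)  # typical weight spread

# Clean tensor: the scale tracks the bulk of the distribution.
q, scale = quantize_int8(w)
err_clean = float(np.abs(dequantize(q, scale) - w).mean())

# One outlier stretches the range: 127 levels must now cover [-1, 1],
# leaving only a handful of levels for weights near +/-0.02.
w_out = w.copy()
w_out[0] = 1.0
q_out, scale_out = quantize_int8(w_out)
err_outlier = float(np.abs(dequantize(q_out, scale_out) - w_out).mean())
```

The outlier grows the per-tensor scale by roughly an order of magnitude, and the mean reconstruction error grows with it; handling such outliers (e.g. by rescaling salient channels before rounding) is the kind of problem calibration-based methods target.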

Related sources: qlora-efficient-finetuning-quantized-llms