What It Is
Techniques that reduce the GPU memory required to train or run a neural network, enabling larger models or larger batches on fixed hardware.
Why It Matters
GPU memory is the primary bottleneck for large model training and inference. A 65B parameter model in BF16 requires 130GB of memory for weights alone — before gradients, optimizer states, or activations. Memory-efficient techniques make these models accessible on fewer, cheaper GPUs.
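The 130GB figure is simple arithmetic, and the same arithmetic shows why training is far worse than inference. A minimal sketch (the 16-bytes-per-parameter training figure assumes mixed-precision Adam: BF16 weights and gradients plus FP32 master weights and two FP32 optimizer states; activations come on top of this):

```python
# Rough memory arithmetic for a 65B-parameter model (a sketch, not exact).
params = 65e9

# Inference: 2 bytes per BF16 weight.
bf16_weights_gb = params * 2 / 1e9

# Mixed-precision Adam training: BF16 weights (2 B) + BF16 gradients (2 B)
# + FP32 master weights (4 B) + two FP32 optimizer states (8 B) per param.
training_gb = params * (2 + 2 + 4 + 8) / 1e9

print(f"weights only:   {bf16_weights_gb:.0f} GB")  # → 130 GB
print(f"training state: {training_gb:.0f} GB")      # → 1040 GB
```

The 8x gap between the two numbers is why the techniques below target optimizer states and gradients as aggressively as the weights themselves.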
How It Works
Major strategies:
- Quantization: store weights in lower precision (4-bit, 8-bit) instead of BF16/FP32
- Gradient checkpointing: recompute activations during the backward pass instead of storing them all (trades compute for memory)
- Low-rank adapters (LoRA): keep the backbone frozen; gradients and optimizer states exist only for the small adapter matrices
- Mixed precision: run forward and backward passes in BF16, halving activation memory, while keeping FP32 master weights for stable updates
- Paged optimizers: automatically page optimizer states out to CPU RAM during GPU memory spikes
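The LoRA saving is easy to quantify. A sketch with assumed sizes (hidden dimension 8192 and rank 16 are hypothetical, chosen only for illustration): each frozen d×d weight matrix gets two trainable low-rank factors, A of shape d×r and B of shape r×d.

```python
# Trainable parameters for one weight matrix: full fine-tune vs. LoRA.
# d and r are assumed values for illustration, not from any specific model.
d, r = 8192, 16

full_ft = d * d       # full fine-tuning trains the whole d×d matrix
lora = 2 * d * r      # LoRA trains only A (d×r) and B (r×d)

print(full_ft, lora, full_ft // lora)  # → 67108864 262144 256
```

Since gradients and optimizer states scale with trainable parameters, that d/(2r) = 256x reduction applies to them as well, not just to the weights.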
QLoRA combines quantization and LoRA: the frozen backbone is compressed to 4-bit NF4 (~130GB → 34GB for 65B), while LoRA adapters stay in BF16. Together, a 65B fine-tune fits on a single 48GB GPU.
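The 4-bit figure above can be sanity-checked in one line (a rough sketch: the gap between ~32.5GB raw and the cited ~34GB plausibly comes from quantization constants and layers kept in higher precision, not modeled here):

```python
# Back-of-envelope for the QLoRA backbone size (approximate).
params = 65e9
nf4_gb = params * 0.5 / 1e9   # 4-bit NF4 = 0.5 bytes per weight

print(f"{nf4_gb:.1f} GB")     # → 32.5 GB
```

With the backbone at roughly this size, the BF16 adapters, gradients, and optimizer states for the small LoRA matrices fit in the remaining headroom of a 48GB card.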