What It Is

Techniques that reduce the GPU memory required to train or run a neural network, enabling larger models or larger batches on fixed hardware.

Why It Matters

GPU memory is the primary bottleneck for large-model training and inference. A 65B-parameter model in BF16 (2 bytes per parameter) requires 130GB of memory for its weights alone, before gradients, optimizer states, or activations are counted. Memory-efficient techniques make these models accessible on fewer, cheaper GPUs.
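The weight-memory arithmetic above is just parameter count times bytes per parameter; a minimal sketch (function name is illustrative, figures match the 65B example):

```python
# Back-of-envelope GPU memory estimate for model weights at different
# precisions: parameters * bytes-per-parameter.

def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

n = 65e9                            # 65B parameters
print(weight_memory_gb(n, 4.0))     # FP32: 4 bytes/param -> 260.0 GB
print(weight_memory_gb(n, 2.0))     # BF16: 2 bytes/param -> 130.0 GB
print(weight_memory_gb(n, 0.5))     # 4-bit: 0.5 bytes/param -> 32.5 GB
```

Note this counts weights only; gradients, optimizer states, and activations each add further multiples of the parameter count.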

How It Works

Major strategies:

  • Quantization: store weights in lower precision (4-bit, 8-bit) instead of BF16/FP32
  • Gradient checkpointing: recompute activations during backward pass instead of storing them (trades compute for memory)
  • Low-rank adapters (LoRA): keep the backbone frozen; only maintain small adapter matrices in memory
  • Mixed precision: compute and store activations in BF16 (halving activation memory) while keeping FP32 master weights for stable updates
  • Paged optimizers: overflow optimizer states to CPU RAM during memory spikes
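To see why the LoRA strategy above is so memory-cheap, count the adapter parameters: a rank-r adapter on a d_out × d_in weight adds only r·(d_in + d_out) parameters. A sketch, with an illustrative layer shape (8192 × 8192 and rank 16 are assumptions, not figures from the text):

```python
# LoRA replaces the trainable update of a frozen d_out x d_in weight W with
# two small matrices: B (d_out x r) and A (r x d_in), so W + B @ A.

def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Parameter count of the low-rank pair A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r

full = 8192 * 8192                       # one full projection: ~67M params
adapter = lora_params(8192, 8192, 16)    # rank-16 adapter: ~262K params
print(adapter / full)                    # ~0.004: adapter is ~0.4% of the layer
```

Only the adapter needs gradients and optimizer states, so the dominant training-memory terms shrink by the same ratio.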

QLoRA combines quantization and LoRA: the frozen backbone is compressed to 4-bit NF4 (~130GB → 34GB for 65B), while LoRA adapters stay in BF16. Together, a 65B fine-tune fits on a single 48GB GPU.
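The 48GB claim can be checked with rough arithmetic: a 4-bit backbone plus BF16 adapters and their Adam states. The adapter parameter count below is an assumed round figure for illustration, and NF4 block constants would add a little on top of the backbone term:

```python
# Rough QLoRA memory budget for the 65B example.

n_backbone = 65e9                         # frozen 65B backbone
backbone_gb = n_backbone * 0.5 / 1e9      # 4-bit NF4 weights: 32.5 GB

n_adapter = 0.4e9                         # assumed adapter param count
adapter_gb = n_adapter * 2 / 1e9          # BF16 adapter weights
optim_gb = n_adapter * (2 + 4 + 4) / 1e9  # BF16 grads + FP32 Adam m and v

total = backbone_gb + adapter_gb + optim_gb
print(total)  # well under 48 GB, leaving headroom for activations
```

The backbone dominates, which is why quantizing it (and not the tiny adapters) is the lever that matters.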

Key Sources