What It Is
Techniques that reduce the GPU memory required to train or run a neural network, enabling larger models or larger batches on fixed hardware.
Why It Matters
GPU memory is the primary bottleneck for large model training and inference. A 65B parameter model in BF16 requires 130GB of memory for weights alone — before gradients, optimizer states, or activations. Memory-efficient techniques make these models accessible on fewer, cheaper GPUs.
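The 130GB figure is simple arithmetic, and the same arithmetic shows why training is far worse than inference. A minimal sketch (the 16-bytes-per-parameter training figure assumes mixed-precision Adam: BF16 weights and gradients plus FP32 master weights and two FP32 optimizer states; activations come on top of this):

```python
# Rough memory arithmetic for a 65B-parameter model (a sketch, not exact).
params = 65e9

# Inference: 2 bytes per BF16 weight.
bf16_weights_gb = params * 2 / 1e9

# Mixed-precision Adam training: BF16 weights (2 B) + BF16 gradients (2 B)
# + FP32 master weights (4 B) + two FP32 optimizer states (8 B) per param.
training_gb = params * (2 + 2 + 4 + 8) / 1e9

print(f"weights only:   {bf16_weights_gb:.0f} GB")  # → 130 GB
print(f"training state: {training_gb:.0f} GB")      # → 1040 GB
```

The 8x gap between the two numbers is why the techniques below target optimizer states and gradients as aggressively as the weights themselves.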
How It Works
Major strategies:
- Quantization: store weights in lower precision (4-bit, 8-bit) instead of BF16/FP32
- Gradient checkpointing: recompute activations during the backward pass instead of storing them all (trades compute for memory)
- Low-rank adapters (LoRA): keep the backbone frozen; gradients and optimizer states exist only for the small adapter matrices
- Mixed precision: run forward and backward passes in BF16, halving activation memory, while keeping FP32 master weights for stable updates
- Paged optimizers: automatically page optimizer states out to CPU RAM during GPU memory spikes
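The LoRA saving is easy to quantify. A sketch with assumed sizes (hidden dimension 8192 and rank 16 are hypothetical, chosen only for illustration): each frozen d×d weight matrix gets two trainable low-rank factors, A of shape d×r and B of shape r×d.

```python
# Trainable parameters for one weight matrix: full fine-tune vs. LoRA.
# d and r are assumed values for illustration, not from any specific model.
d, r = 8192, 16

full_ft = d * d       # full fine-tuning trains the whole d×d matrix
lora = 2 * d * r      # LoRA trains only A (d×r) and B (r×d)

print(full_ft, lora, full_ft // lora)  # → 67108864 262144 256
```

Since gradients and optimizer states scale with trainable parameters, that d/(2r) = 256x reduction applies to them as well, not just to the weights.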
QLoRA combines quantization and LoRA: the frozen backbone is compressed to 4-bit NF4 (~130GB → 34GB for 65B), while LoRA adapters stay in BF16. Together, a 65B fine-tune fits on a single 48GB GPU.
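The 4-bit figure above can be sanity-checked in one line (a rough sketch: the gap between ~32.5GB raw and the cited ~34GB plausibly comes from quantization constants and layers kept in higher precision, not modeled here):

```python
# Back-of-envelope for the QLoRA backbone size (approximate).
params = 65e9
nf4_gb = params * 0.5 / 1e9   # 4-bit NF4 = 0.5 bytes per weight

print(f"{nf4_gb:.1f} GB")     # → 32.5 GB
```

With the backbone at roughly this size, the BF16 adapters, gradients, and optimizer states for the small LoRA matrices fit in the remaining headroom of a 48GB card.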