You’ve got GPT-3. 175 billion parameters. It speaks 50 languages, writes code, explains quantum physics. Now your company needs a customer service bot for your product. The obvious move: fine-tune it on your data. One problem — fine-tuning 175 billion parameters requires roughly 1.2 terabytes of GPU memory, and you end up with a full 350GB copy of the model for every single use case you want to support. Five tasks, five 350GB checkpoints. That’s not a deployment strategy, that’s a storage disaster.
The core idea
The analogy: Imagine you’re a world-class pianist who’s spent 20 years mastering classical technique. Now you need to learn jazz. You don’t forget everything and start over — you make small adjustments. Your hand position changes slightly. You learn to tolerate some dissonance. The core skill stays; you layer a thin “jazz delta” on top. LoRA says: language model fine-tuning works the same way. The delta is thin. It lives in a low-dimensional space. So store the delta, not the whole instrument.
Here’s the key observation the paper builds on: when you fine-tune a large model, the weight updates (the “delta”) are not random. They’re structured. Specifically,
“We hypothesize that the change in weights during model adaptation also has a low ‘intrinsic rank’.”
What does “low rank” mean in practice? A weight matrix inside GPT-3 might be 12,288 × 12,288 — about 150 million numbers. But the change to that matrix during fine-tuning might be expressible as the outer product of two thin vectors, or two thin matrices. The paper’s bet: instead of storing the full delta (150M numbers), store two small matrices whose product approximates it.
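To make the storage claim concrete, here is a minimal NumPy sketch (toy 1,000×1,000 shapes of my own choosing instead of 12,288, so it runs instantly): a matrix built as the product of two rank-8 factors is fully recoverable from just 8 singular triplets, at a fraction of the storage.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 1000, 1000, 8

# Build a delta that is exactly rank r: the product of two thin matrices.
delta = rng.normal(size=(d, r)) @ rng.normal(size=(r, k))

# Truncated SVD recovers it from just r singular triplets.
U, S, Vt = np.linalg.svd(delta, full_matrices=False)
approx = (U[:, :r] * S[:r]) @ Vt[:r, :]

print(np.allclose(delta, approx))        # True: r triplets reconstruct it exactly
full_numbers = d * k                     # storing the full delta: 1,000,000 numbers
low_rank_numbers = r * (d + k)           # storing two thin factors: 16,000 numbers
print(full_numbers // low_rank_numbers)  # 62x smaller
```

The real bet, of course, is that fine-tuning deltas are *approximately* low-rank; this sketch only shows what you save when they are.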
The mechanism, step by step:
- Take any weight matrix W₀ in the model (say, the query projection in an attention layer).
- Freeze W₀ completely — no gradient flows through it, ever.
- Add a parallel “side branch”: two small matrices A (shape r × k) and B (shape d × r), where r is tiny (4, 8, or 16, vs d and k which are in the thousands).
- During the forward pass, the output is h = W₀x + BAx: the original signal, plus a small learned correction.
- Train only A and B. The original model is untouched.
- At deployment: just compute W_new = W₀ + BA and you have a single merged matrix, with no extra computation at inference time.
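The steps above, as a NumPy sketch (dimensions are toy choices of mine, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 8

W0 = rng.normal(size=(d, k))        # pretrained weights: frozen, never updated
A = 0.01 * rng.normal(size=(r, k))  # trainable, random Gaussian init
B = np.zeros((d, r))                # trainable, zero init

def lora_forward(x):
    # Original path plus the parallel low-rank correction.
    # (The paper additionally scales the BAx term by alpha/r.)
    return W0 @ x + B @ (A @ x)

x = rng.normal(size=k)
h = lora_forward(x)
# Trainable parameters: r*(d+k) = 1024, vs the d*k = 4096 frozen ones.
```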
BEFORE LoRA (full fine-tune):
Input x
|
v
[W₀ + ΔW] ← 12288×12288 = 150M params change
|
v
Output h
WITH LoRA:

Input x
   |
   +---> [W₀] ----------------->(+)   frozen, 150M params, 0 gradients
   |                             |
   +---> [A: r×k] -> [B: d×r] ->(+)   trainable, 2×12288×8 = 196k params
                                 |     (the two branches add together)
                                 v
                              Output h
(same shape, learned correction injected for free)
AFTER TRAINING (merge and go):
W_new = W₀ + BA ← single matrix, zero inference overhead
The math, translated:
The forward pass is:
h = W₀x + BAx, scaled by α/r
- W₀: the frozen pretrained weights. They know language. Don’t touch them.
- B: a d×r matrix (tall and thin). Trained from scratch, initialized to zero.
- A: an r×k matrix (short and wide). Trained from scratch, random Gaussian init.
- BA: their product. Starts at zero (since B=0 initially), so at step 0 the model is identical to the pretrained model. No shock at training start.
- α/r: a scaling factor. α is a constant (usually set equal to r and not tuned); dividing by r keeps the gradient scale stable as you change r.
Why initialize B to zero and A to random? Because BA = B × A starts at zero. The model begins as the original pretrained model, then gradually learns a correction. If you initialized both randomly, you’d inject noise from step 1.
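A two-line check of why this init matters, in NumPy (toy shapes):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 32, 32, 4
W0 = rng.normal(size=(d, k))
x = rng.normal(size=k)

# LoRA init: B = 0, A ~ Gaussian. BA = 0, so the model is unchanged at step 0.
A = rng.normal(size=(r, k))
B = np.zeros((d, r))
print(np.allclose(W0 @ x + B @ A @ x, W0 @ x))      # True

# If both were random, BA != 0: step 0 already perturbs the pretrained output.
B_bad = rng.normal(size=(d, r))
print(np.allclose(W0 @ x + B_bad @ A @ x, W0 @ x))  # False
```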
Walkthrough with actual numbers:
Say we’re adapting a weight matrix W₀ of shape 4×4 (tiny example) with rank r=2.
W₀ (frozen, 4×4):
[0.8, 0.1, -0.3, 0.5]
[0.2, 0.9, 0.4, -0.1]
[-0.5, 0.3, 0.7, 0.2]
[0.1, -0.4, 0.2, 0.6]
A (2×4, random Gaussian init):
[0.3, -0.2, 0.5, 0.1]
[-0.1, 0.4, 0.2, -0.3]
B (4×2, zero init):
[0.0, 0.0]
[0.0, 0.0]
[0.0, 0.0]
[0.0, 0.0]
Step 0: BA = zero matrix. Output = W₀x ✓ (unchanged from pretrained)
After some training steps, B has learned:
B = [0.2, 0.1]
[-0.1, 0.3]
[0.4, -0.2]
[0.0, 0.5]
For input x = [1, 0, 0, 0]ᵀ:
W₀x = [0.8, 0.2, -0.5, 0.1]ᵀ (first column of W₀)
BAx = first column of BA = [0.05, -0.06, 0.14, -0.05]ᵀ
h = [0.85, 0.14, -0.36, 0.05]ᵀ ← small but real correction
What changed? The output shifted slightly — “pushed” toward the fine-tuning target. That shift was achieved with 2×4 + 4×2 = 16 parameters; in this tiny example that matches the 16 parameters of W₀ itself, so the savings only appear at scale. In a real GPT-3 layer with d=12288 and r=8: 2×12288×8 = 196,608 parameters vs 12288² = 150,994,944. That’s a 768× compression per layer.
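You can check the walkthrough yourself in NumPy, with the exact matrices above:

```python
import numpy as np

W0 = np.array([[ 0.8,  0.1, -0.3,  0.5],
               [ 0.2,  0.9,  0.4, -0.1],
               [-0.5,  0.3,  0.7,  0.2],
               [ 0.1, -0.4,  0.2,  0.6]])
A = np.array([[ 0.3, -0.2,  0.5,  0.1],
              [-0.1,  0.4,  0.2, -0.3]])
B = np.array([[ 0.2,  0.1],
              [-0.1,  0.3],
              [ 0.4, -0.2],
              [ 0.0,  0.5]])
x = np.array([1.0, 0.0, 0.0, 0.0])

print(W0 @ x)               # [ 0.8   0.2  -0.5   0.1 ] (first column of W0)
print(B @ A @ x)            # [ 0.05 -0.06  0.14 -0.05] (the learned correction)
print(W0 @ x + B @ A @ x)   # [ 0.85  0.14 -0.36  0.05] (corrected output h)
```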
What’s clever here — find the instinct:
The obvious alternative to LoRA was adapter layers: insert small bottleneck modules between transformer layers. The community had been doing this. The problem? Adapters add depth, and deep models are parallelized across GPUs using careful pipeline scheduling. Extra depth means extra synchronization barriers. As the paper shows, on a single GPU with batch size 1, adapters add 20-30% latency.
The insight LoRA had: don’t add depth, add width — in parallel, not in series. A parallel low-rank branch can be merged into the existing weight matrix at inference time. The merged matrix is the same shape as the original. The extra computation literally disappears after training.
“When deployed in production, we can explicitly compute and store W = W₀ + BA and perform inference as usual. Note that both W₀ and BA are in ℝ^(d×k). When we need to switch to another downstream task, we can recover W₀ by subtracting BA and then adding a different B’A’.”
Translation: merge for one task, unmerge and swap for another. The base model is a shared hub; LoRA adapters are hot-swappable personality chips.
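A NumPy sketch of that hot-swapping; the two adapter names and their values are invented stand-ins for trained adapters:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, r = 16, 16, 2
W0 = rng.normal(size=(d, k))  # shared base weights

# Two "trained" adapters for two tasks (random stand-ins here).
adapters = {name: (rng.normal(size=(d, r)), rng.normal(size=(r, k)))
            for name in ("support_bot", "sql_gen")}

# Merge the first adapter: inference is now one plain matmul.
W = W0.copy()
B, A = adapters["support_bot"]
W += B @ A

# Task switch: subtract the old delta, add the new one.
W -= B @ A
B2, A2 = adapters["sql_gen"]
W += B2 @ A2
print(np.allclose(W, W0 + B2 @ A2))  # True: base recovered, new task live
```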
Another clever bit: the paper empirically checks whether the low-rank hypothesis is actually true:
“We find that ΔW has a stronger correlation with W compared to a random matrix, suggesting that ΔW amplifies some features that are already in W.”
The fine-tuning isn’t learning something the model never knew — it’s amplifying existing directions. The pretrained model already encoded most of what’s needed; fine-tuning just turns up the volume on specific features. That’s why rank 4 is often enough even when the matrix has rank 12,288.
“Using GPT-3 175B as an example, we show that a very low rank (i.e., r in Figure 1 can be one or two) suffices even when the full rank (i.e., d) is as high as 12,288.”
Translation: for the biggest model in existence at the time, you only need rank 1 or 2. The delta — the “jazz correction” — is almost one-dimensional.
“LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times.”
Translation: from 175 billion trainable parameters to 17.5 million. From 1.2TB VRAM to ~350GB. Still needs a big machine to hold the base model, but the training overhead collapses.
Does it work? What breaks?
| Model | Method | Trainable Params | GLUE Avg / Task Score |
|---|---|---|---|
| RoBERTa-large (355M) | Full fine-tune | 355M | 88.9 |
| RoBERTa-large (355M) | Adapters (AdaptH) | 0.8M | 86.4 |
| RoBERTa-large (355M) | LoRA | 0.8M | 89.0 |
| GPT-3 175B | Full fine-tune | 175B | WikiSQL 73.8% |
| GPT-3 175B | Adapters (40M params) | 40M | WikiSQL 73.2% |
| GPT-3 175B | LoRA | 4.7M | WikiSQL 73.4% |
LoRA with 4.7M parameters matches GPT-3 full fine-tuning (175B parameters) on WikiSQL accuracy (73.4% vs 73.8%). On SAMSum summarization, LoRA actually beats full fine-tuning: ROUGE-1 of 53.8 vs 52.0.
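One way to arrive at roughly the table’s 4.7M count, assuming rank r=1 applied to Wq and Wv in each of GPT-3 175B’s 96 layers (the rank assignment is my inference from the counts; the hidden size and depth are GPT-3’s published values):

```python
d = 12288     # GPT-3 175B hidden size
layers = 96   # GPT-3 175B depth
r = 1

# Each adapted d x d matrix needs A (r x d) and B (d x r): 2*d*r parameters.
per_matrix = 2 * d * r
trainable = layers * 2 * per_matrix  # two adapted matrices (Wq, Wv) per layer
print(trainable)  # 4718592, i.e. ~4.7M
```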
What doesn’t work:
Batching is awkward. If you merge A and B into W for zero latency, you’re locked to one adapter per forward pass. Serving multiple fine-tunes simultaneously on the same hardware requires either keeping A and B separate (adds inference cost) or routing requests to separate model instances.
Rank selection is not automatic. Rank 4 works well on average, but no one knows the right rank for a new task without experimenting. Results don’t improve monotonically with rank — sometimes higher rank hurts.
The method was only validated on attention weight matrices (Wq, Wv). The paper explicitly says:
“We leave the empirical investigation of adapting the MLP layers, LayerNorm layers, and biases to a future work.”
So what?
If you’re building ML systems today, LoRA is your default fine-tuning method for any model over ~7B parameters. The workflow is: grab the base model, pick r=8 or r=16, apply LoRA to Wq and Wv (or all four attention matrices), train, merge. You get full fine-tune quality at 1/100th the checkpoint size. The Hugging Face peft library makes this three lines of Python.
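The whole workflow (freeze, train only A and B, merge) in a self-contained NumPy sketch that fits a low-rank target delta by plain gradient descent; shapes, learning rate, and step count are toy choices of mine, not from the paper or any library:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, r = 16, 16, 4
W0 = rng.normal(size=(d, k))  # frozen base weights
# Target behavior whose delta from W0 is genuinely low-rank (the LoRA bet).
W_target = W0 + 0.1 * rng.normal(size=(d, r)) @ rng.normal(size=(r, k))

A = 0.1 * rng.normal(size=(r, k))  # trainable, random init
B = np.zeros((d, r))               # trainable, zero init
lr = 0.1

before = np.abs(W0 + B @ A - W_target).mean()
for _ in range(1000):
    x = rng.normal(size=(k, 8))                # batch of 8 random inputs
    err = (W0 + B @ A) @ x - W_target @ x      # prediction error
    # Mean-squared-error gradients flow only into B and A; W0 never changes.
    gB = err @ (A @ x).T / x.shape[1]
    gA = B.T @ err @ x.T / x.shape[1]
    B -= lr * gB
    A -= lr * gA

W_merged = W0 + B @ A                          # merge once, deploy one matrix
after = np.abs(W_merged - W_target).mean()
print(before, after)                           # error shrinks as B and A train
```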
When not to use it: if you need the fine-tuned model to learn something radically different from what the base model knows — a new language from scratch, a totally alien task structure — the low-rank assumption may fail. Also, LoRA on a bad base model doesn’t fix the base model.
LoRA didn’t just save GPU memory — it democratized fine-tuning, shifting it from a lab-only operation to something a developer can run on a gaming PC.
Connections
- lora — the PEFT technique introduced here
- transformer — architecture being adapted
- sft — LoRA is often used as an efficient SFT method
- attention — LoRA is most commonly applied to Q and V projections
Citation
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. https://arxiv.org/abs/2106.09685