The Problem
Fine-tuning a 7B parameter model on a new task means training all 7 billion numbers. Each training step requires storing gradients and optimizer state for every one of them — typically 3-4x the model size in memory. Full fine-tuning LLaMA-3 70B requires on the order of 500GB of GPU memory for weights, gradients, and optimizer state alone. That's more than six 80GB A100s just to start.
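A back-of-envelope sketch of that memory footprint (assuming weights, gradients, and two Adam moment buffers, all stored in fp16; fp32 master weights and moments roughly double this, and activations are excluded, so treat it as a floor):

```python
def full_finetune_memory_gb(n_params: float, bytes_per_param: int = 8) -> float:
    """Weights (2B) + gradients (2B) + two Adam moments (2B each) = 8 bytes/param."""
    return n_params * bytes_per_param / 1e9

print(full_finetune_memory_gb(70e9))  # 560.0 — GB for a 70B model, before activations
```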
The practical question: do you actually need to change all 7 billion parameters to teach the model a new task?
Empirically, no. The weight updates during fine-tuning are low-rank: they live in a much smaller subspace than the full weight matrix would suggest.
Analogy
Imagine tuning a grand piano. You don’t replace every string and key. You make small targeted adjustments — maybe just the sustain pedal and a few keys in one octave. The underlying instrument stays the same; your adjustments are sparse and targeted.
LoRA does something mathematically analogous. The pretrained model is the piano. Fine-tuning adjusts only a thin “correction” that lives in a low-dimensional subspace. At deployment, you just add the correction to the original — same sound, lighter to store and swap.
Mechanism in Plain English
A weight matrix W₀ in a Transformer layer has shape d × k — say 4096 × 4096 for GPT-3's Q projection. That's 16.7M parameters in one matrix.
The full fine-tuning update ΔW would also be 4096 × 4096. LoRA instead says: constrain ΔW to be the product of two small matrices, ΔW = BA.
- Freeze W₀ completely. No gradients flow through it.
- Add a bypass: the input x also flows through BA, where:
- A has shape r × k (e.g., 8 × 4096 = 32K parameters)
- B has shape d × r (e.g., 4096 × 8 = 32K parameters)
- r is the rank — typically 4, 8, or 16
- The adapted output is: h = W₀x + BAx
- Initialize A with a random Gaussian, B with zeros → ΔW = BA = 0 at training start
- Only A and B are trained
- At inference: merge W' = W₀ + BA, zero extra latency
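The steps above can be sketched in a few lines of NumPy (toy sizes; the numbers are illustrative, not from any trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 8
W0 = rng.normal(size=(d, k))        # pretrained weight: frozen
A = 0.01 * rng.normal(size=(r, k))  # Gaussian init
B = np.zeros((d, r))                # zero init, so BA = 0 at the start
x = rng.normal(size=k)

h = W0 @ x + B @ (A @ x)            # at step 0 this equals the pretrained output
assert np.allclose(h, W0 @ x)

B = 0.01 * rng.normal(size=(d, r))  # pretend training has updated B
W_merged = W0 + B @ A               # merge for inference: one d×k matrix
assert np.allclose(W_merged @ x, W0 @ x + B @ (A @ x))
```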
ASCII Diagram
Standard fine-tuning:
x ──→ [W₀ + ΔW] ──→ h
↑ full 4096×4096 trainable
─────────────────────────────────────────────
LoRA:
    ┌──→ [W₀ (frozen)] ────────────────┐
x ──┤                                 (+)──→ h
    └──→ [A (r×k)] ──→ [B (d×r)] ──────┘
         ↑ trainable      ↑ trainable
A: 8×4096 = 32,768 parameters
B: 4096×8 = 32,768 parameters
vs. full ΔW: 4096×4096 = 16,777,216 parameters
256x fewer parameters to train in this layer
Math with Translation
The update is ΔW = BA, where:
- B ∈ ℝ^(d×r) — maps from the low-rank space back to the model dimension
- A ∈ ℝ^(r×k) — maps from the input to the low-rank space
- r ≪ min(d, k) — the rank constraint
The forward pass adds a scaling term: h = W₀x + (α/r)·BAx
- α is a scalar hyperparameter (often set to the same value as r, making α/r = 1)
- The α/r scaling makes the effective learning rate of the LoRA weights roughly independent of r
If d = k = 4096 and r = 8:
- Full ΔW: 16,777,216 parameters
- LoRA (r = 8): 65,536 parameters
- Reduction: 256x
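The counts above, checked in three lines:

```python
d = k = 4096
r = 8
full, lora = d * k, r * k + d * r  # dense ΔW vs A (r×k) + B (d×r)
print(full, lora, full // lora)    # 16777216 65536 256
```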
Concrete Walkthrough
Take a 4D toy model: d=4, k=4, rank r=2.
W₀ (frozen) = [[1,0,0,1], [0,1,0,0], [0,0,1,0], [1,0,0,1]]
After 100 training steps, LoRA learns:
A (2×4) = [[0.1, -0.2, 0.0, 0.3], [-0.1, 0.0, 0.2, 0.1]]
B (4×2) = [[0.5, 0.0], [0.0, 0.3], [0.2, 0.1], [0.0, 0.4]]
Then ΔW = B·A (4×4), e.g. entry [0,0] = 0.5·0.1 + 0·(-0.1) = 0.05
The merged weight W’ = W₀ + ΔW. At inference, this is just a 4×4 matmul — identical cost to the original.
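The walkthrough can be verified numerically (same W₀, A, B as above):

```python
import numpy as np

W0 = np.array([[1., 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 1]])
A = np.array([[0.1, -0.2, 0.0, 0.3], [-0.1, 0.0, 0.2, 0.1]])    # 2×4
B = np.array([[0.5, 0.0], [0.0, 0.3], [0.2, 0.1], [0.0, 0.4]])  # 4×2

dW = B @ A                         # 4×4 low-rank update
assert np.isclose(dW[0, 0], 0.05)  # 0.5·0.1 + 0.0·(−0.1)
W_merged = W0 + dW                 # single 4×4 matmul at inference
assert W_merged.shape == (4, 4)
```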
For a 7B model with r=8 applied to all Q, K, V, and output projections in every layer (32 layers at d_model = 4096), that's roughly 8.4M trainable parameters vs 7B — about 0.1% of the original.
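That whole-model count can be reproduced assuming a LLaMA-7B-like shape (32 layers, d_model = 4096, full-size Q/K/V/O projections — shape assumptions, not figures from the text):

```python
layers, d_model, r = 32, 4096, 8
per_matrix = r * d_model + d_model * r  # A (r×k) + B (d×r)
total = layers * 4 * per_matrix         # Q, K, V, O in every layer
print(total, f"{total / 7e9:.2%}")      # 8388608 0.12%
```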
What’s Clever
The key insight is empirical before it’s theoretical: the intrinsic dimensionality of task-specific fine-tuning is small.
Li et al. (2018) showed that neural networks can be trained in surprisingly low-dimensional random subspaces, and Aghajanyan et al. (2020) showed the same holds for fine-tuning pretrained language models — most tasks don't need the full expressive power of the weight space. LoRA makes this concrete and practical.
The second clever thing: no inference cost. Adapter-based approaches (Houlsby et al. 2019) also freeze the base model but insert new layers — that changes the architecture and adds latency. LoRA’s BA is merged at deployment, vanishing completely. You get the savings during training, pay nothing at inference.
The third clever thing: swappable specializations from one backbone. Because LoRA adapters are tiny (tens of megabytes vs ~14GB for a 7B base model in fp16), you can serve one base model and hot-swap adapters per request. This is how many production serving systems run tens of fine-tuned variants on a single GPU cluster.
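Hot-swapping can be sketched as merging different (A, B) pairs against one shared base weight (hypothetical helper and adapter names; a minimal sketch, not a production serving loop):

```python
import numpy as np

def merge(W0, A, B, scale=1.0):
    """Fold a low-rank adapter into the base weight: W0 + scale·BA."""
    return W0 + scale * (B @ A)

rng = np.random.default_rng(0)
W0 = rng.normal(size=(16, 16))  # one shared base weight
adapters = {                    # tiny per-task (A, B) pairs
    name: (rng.normal(size=(4, 16)), rng.normal(size=(16, 4)))
    for name in ("summarize", "translate")
}

x = rng.normal(size=16)
for name, (A, B) in adapters.items():            # per-request adapter choice
    W = merge(W0, A, B)                          # swap in this task's correction
    y = W @ x                                    # same cost as the base matmul
    assert np.allclose(y, W0 @ x + B @ (A @ x))  # merged == base + bypass
```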
Code
LoRA adapter as a drop-in replacement for nn.Linear:
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=4, alpha=1.0):
        super().__init__()
        self.W0 = nn.Linear(in_features, out_features, bias=False)  # pretrained weights
        self.W0.requires_grad_(False)                               # freeze — no gradients
        self.A = nn.Linear(in_features, rank, bias=False)           # down-project to rank
        self.B = nn.Linear(rank, out_features, bias=False)          # up-project back
        nn.init.kaiming_uniform_(self.A.weight)                     # A: random init (standard)
        nn.init.zeros_(self.B.weight)                               # B: zero init → ΔW = BA = 0 at start
        self.scale = alpha / rank                                   # α/r scaling: makes LR independent of rank

    def forward(self, x):
        return self.W0(x) + self.scale * self.B(self.A(x))          # frozen path + low-rank update

# Parameter count comparison (in_features=4096, out_features=4096, rank=8):
# Full ΔW:  4096×4096 = 16,777,216 params
# LoRA A+B: 4096×8 + 8×4096 = 65,536 params (256x reduction)

Key Sources
Related Concepts
Open Questions
- Optimal rank and which layers to apply LoRA to for different tasks
- Whether merged LoRA weights can be composed (LoRA merging/arithmetic)