The Problem

Fine-tuning a 7B parameter model on a new task means training 7 billion numbers. Each training step requires storing gradients and optimizer state for all of them — typically 3-4x the model size in memory. Fine-tuning LLaMA-3 70B at full precision requires roughly 500GB of GPU memory. That’s about seven 80GB A100s just to start.
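The memory claim is back-of-envelope arithmetic — a simplified sketch assuming uniform precision and Adam (real trainers mix precisions and shard state, so exact totals vary):

```python
def training_memory_gb(n_params, bytes_per_value=2):
    """Rough fine-tuning footprint: weights + gradients + Adam's two moments."""
    weights = n_params * bytes_per_value
    state = 3 * n_params * bytes_per_value  # grads + first moment + second moment
    return (weights + state) / 1e9

print(training_memory_gb(7e9))   # 56.0 GB — vs ~14 GB to merely store the model
print(training_memory_gb(70e9))  # 560.0 GB
```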

The practical question: do you actually need to change all 7 billion parameters to teach the model a new task?

Empirically, no. The weight updates during fine-tuning are low-rank: they live in a much smaller subspace than the full weight matrix would suggest.

Analogy

Imagine tuning a grand piano. You don’t replace every string and key. You make small targeted adjustments — maybe just the sustain pedal and a few keys in one octave. The underlying instrument stays the same; your adjustments are sparse and targeted.

LoRA does something mathematically analogous. The pretrained model is the piano. Fine-tuning adjusts only a thin “correction” that lives in a low-dimensional subspace. At deployment, you just add the correction to the original — same sound, lighter to store and swap.

Mechanism in Plain English

A weight matrix W₀ in a Transformer layer has shape d × k — say 4096 × 4096 for the query projection in a 7B-scale model. That’s 16.7M parameters in one matrix.

The full fine-tuning update ΔW would also be 4096 × 4096. LoRA instead constrains ΔW to be the product of two small matrices: ΔW = BA.

  1. Freeze W₀ completely. No gradients flow through it.
  2. Add a bypass: the input x also flows through BA, where:
    • A has shape r × k (e.g., 8 × 4096 = 32K parameters)
    • B has shape d × r (e.g., 4096 × 8 = 32K parameters)
    • r is the rank — typically 4, 8, or 16
  3. The adapted output is: h = W₀x + BAx
  4. Initialize A with random Gaussian, B with zeros → ΔW = BA = 0 at training start
  5. Only A and B are trained
  6. At inference: merge W′ = W₀ + BA, zero extra latency
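Steps 2, 3, and 6 can be checked numerically: the bypass form used during training and the merged form used at inference compute the same output. A minimal numpy sketch with toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = k = 16
r = 2

W0 = rng.standard_normal((d, k))        # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01  # trained down-projection
B = rng.standard_normal((d, r)) * 0.01  # trained up-projection (zero-init at step 0)
x = rng.standard_normal(k)

h_bypass = W0 @ x + B @ (A @ x)  # training-time: frozen path + low-rank bypass
h_merged = (W0 + B @ A) @ x      # inference-time: one matmul on merged weights

print(np.allclose(h_bypass, h_merged))  # True
```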

ASCII Diagram

  Standard fine-tuning:
  
  x ──→ [W₀ + ΔW] ──→ h
         ↑ full 4096×4096 trainable
  
  ─────────────────────────────────────────────

  LoRA:
  
         ┌──→ [W₀ (frozen)] ──→─────────────┐
  x ──→  │                                   + ──→ h
         └──→ [A (r×k)] ──→ [B (d×r)] ──→──┘
              ↑ trainable ↑   ↑ trainable ↑
              
  A: 8×4096 = 32,768 parameters
  B: 4096×8 = 32,768 parameters
  
  vs. full ΔW: 4096×4096 = 16,777,216 parameters
  
  256x fewer parameters to train in this layer

Math with Translation

The update is ΔW = BA, where:

  • B (shape d × r) — maps from low-rank space back to model dimension
  • A (shape r × k) — maps from input to low-rank space
  • r ≪ min(d, k) — the rank constraint

The forward pass adds a scaling term: h = W₀x + (α/r)·BAx

  • α is a scalar hyperparameter (often set to the same value as r, making α/r = 1)
  • The α/r scaling makes the effective learning rate of the LoRA weights independent of r

If d = k = 4096 and r = 8:

  • Full ΔW: 16,777,216 parameters
  • LoRA (A + B): 65,536 parameters
  • Reduction: 256x
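The arithmetic above, written out so other (d, k, r) choices can be checked:

```python
def lora_param_count(d, k, r):
    """Trainable parameters in one LoRA-adapted d×k matrix: B (d×r) plus A (r×k)."""
    return d * r + r * k

full = 4096 * 4096
lora = lora_param_count(4096, 4096, 8)
print(full)          # 16777216
print(lora)          # 65536
print(full // lora)  # 256
```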

Concrete Walkthrough

Take a 4D toy model: d=4, k=4, rank r=2.

W₀ (frozen) = [[1,0,0,1], [0,1,0,0], [0,0,1,0], [1,0,0,1]]

After 100 training steps, LoRA learns:

A (2×4) = [[0.1, -0.2, 0.0, 0.3], [-0.1, 0.0, 0.2, 0.1]]

B (4×2) = [[0.5, 0.0], [0.0, 0.3], [0.2, 0.1], [0.0, 0.4]]

Then ΔW = B·A (4×4), e.g. entry [0,0] = 0.5·0.1 + 0·(-0.1) = 0.05

The merged weight W’ = W₀ + ΔW. At inference, this is just a 4×4 matmul — identical cost to the original.
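The walkthrough reproduced in numpy, checking both the [0,0] entry and the training/inference equivalence:

```python
import numpy as np

W0 = np.array([[1., 0., 0., 1.],
               [0., 1., 0., 0.],
               [0., 0., 1., 0.],
               [1., 0., 0., 1.]])     # frozen 4×4 base weight
A = np.array([[0.1, -0.2, 0.0, 0.3],
              [-0.1, 0.0, 0.2, 0.1]])  # learned, 2×4
B = np.array([[0.5, 0.0],
              [0.0, 0.3],
              [0.2, 0.1],
              [0.0, 0.4]])             # learned, 4×2

dW = B @ A                 # (4×2)·(2×4) → 4×4
print(round(dW[0, 0], 6))  # 0.05 = 0.5·0.1 + 0.0·(-0.1)

W_merged = W0 + dW         # inference-time weight: one 4×4 matmul, no overhead
x = np.ones(4)
print(np.allclose(W_merged @ x, W0 @ x + B @ (A @ x)))  # True
```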

For a 7B-scale model (32 layers, 4096-dim hidden) with r=8 applied to all Q, K, V, and output projections in every layer: roughly 8.4M trainable parameters vs 7B — about 0.12% of the original.
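That count, under one concrete set of assumptions — 32 layers, 4096-dim projections, adapters on four matrices per layer (exact numbers vary by architecture):

```python
layers, d, r, n_proj = 32, 4096, 8, 4  # Q, K, V, output projections per layer
per_matrix = 2 * d * r                 # A (r×d) plus B (d×r) per adapted matrix
lora_total = layers * n_proj * per_matrix
print(lora_total)                      # 8388608, i.e. ~8.4M
print(f"{lora_total / 7e9:.4%}")       # ~0.12% of 7B
```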

What’s Clever

The key insight is empirical before it’s theoretical: the intrinsic dimensionality of task-specific fine-tuning is small.

Li et al. (2018) showed that neural networks can be trained in surprisingly low-dimensional subspaces, and Aghajanyan et al. (2020) found the same holds for fine-tuning pretrained language models — most tasks don’t need the full expressive power of the weight space. LoRA makes this concrete and practical.

The second clever thing: no inference cost. Adapter-based approaches (Houlsby et al. 2019) also freeze the base model but insert new layers — that changes the architecture and adds latency. LoRA’s BA is merged at deployment, vanishing completely. You get the savings during training, pay nothing at inference.

The third clever thing: swappable specializations from one backbone. Because LoRA adapters are tiny (tens of megabytes vs ~14GB for a 7B base model in fp16), you can serve one base model and hot-swap adapters per request. This is how many production serving systems run tens of fine-tuned variants on a single GPU cluster.
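A sketch of the swap mechanics via merge/unmerge (the task names are hypothetical; real serving systems often keep adapters unmerged and batch across them, but this shows the idea):

```python
import numpy as np

rng = np.random.default_rng(0)
d = k = 8
r = 2
W = rng.standard_normal((d, k))  # the one shared base weight on the GPU
W_base = W.copy()

# Hypothetical per-task adapters — each is just a tiny (B, A) pair.
adapters = {
    "summarize": (0.01 * rng.standard_normal((d, r)), 0.01 * rng.standard_normal((r, k))),
    "translate": (0.01 * rng.standard_normal((d, r)), 0.01 * rng.standard_normal((r, k))),
}

def swap_in(W, task):
    B, A = adapters[task]
    W += B @ A  # merge in place: no extra inference cost while serving

def swap_out(W, task):
    B, A = adapters[task]
    W -= B @ A  # unmerge: base weights restored for the next task

swap_in(W, "summarize")
# ... serve "summarize" requests against W ...
swap_out(W, "summarize")
print(np.allclose(W, W_base))  # True
```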

Code

LoRA adapter as a drop-in replacement for nn.Linear:

import torch
import torch.nn as nn
 
class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=4, alpha=1.0):
        super().__init__()
        self.W0 = nn.Linear(in_features, out_features, bias=False)  # pretrained weights
        self.W0.requires_grad_(False)                                # freeze — no gradients
        self.A = nn.Linear(in_features, rank, bias=False)           # down-project to rank
        self.B = nn.Linear(rank, out_features, bias=False)          # up-project back
        nn.init.kaiming_uniform_(self.A.weight)   # A: random init (standard)
        nn.init.zeros_(self.B.weight)             # B: zero init → ΔW=BA=0 at start
        self.scale = alpha / rank                 # scaling: makes LR independent of rank
 
    def forward(self, x):
        return self.W0(x) + self.scale * self.B(self.A(x))  # frozen + low-rank update
 
# Parameter count comparison (in_features=4096, out_features=4096, rank=8):
# Full ΔW: 4096×4096 = 16,777,216 params
# LoRA A+B: 4096×8 + 8×4096 = 65,536 params  (256x reduction)
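A usage sketch, checking the trainable-parameter count and the zero-init guarantee (the class above is restated so the snippet runs standalone):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):  # same as above, restated for a runnable example
    def __init__(self, in_features, out_features, rank=4, alpha=1.0):
        super().__init__()
        self.W0 = nn.Linear(in_features, out_features, bias=False)
        self.W0.requires_grad_(False)
        self.A = nn.Linear(in_features, rank, bias=False)
        self.B = nn.Linear(rank, out_features, bias=False)
        nn.init.kaiming_uniform_(self.A.weight)
        nn.init.zeros_(self.B.weight)
        self.scale = alpha / rank

    def forward(self, x):
        return self.W0(x) + self.scale * self.B(self.A(x))

layer = LoRALinear(4096, 4096, rank=8, alpha=8.0)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(trainable)  # 65536
print(total)      # 16842752

# Zero-init B means the adapted layer matches the frozen path exactly at step 0:
x = torch.randn(2, 4096)
print(torch.allclose(layer(x), layer.W0(x)))  # True
```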

Key Sources

  • Hu et al. (2021), “LoRA: Low-Rank Adaptation of Large Language Models”
  • Li et al. (2018), “Measuring the Intrinsic Dimension of Objective Landscapes”
  • Aghajanyan et al. (2020), “Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning”
  • Houlsby et al. (2019), “Parameter-Efficient Transfer Learning for NLP”

Open Questions

  • Optimal rank and which layers to apply LoRA to for different tasks
  • Whether merged LoRA weights can be composed (LoRA merging/arithmetic)