The Problem

Fine-tuning a 7B parameter model on a new task means training 7 billion numbers. Each training step requires storing gradients and optimizer state for all of them — typically 3-4x the model size in memory. Fine-tuning LLaMA-3 70B at full precision requires roughly 500GB of GPU memory. That’s about seven 80GB A100s just to start.
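The memory claim is back-of-envelope arithmetic — a simplified sketch assuming uniform precision and Adam (real trainers mix precisions and shard state, so exact totals vary):

```python
def training_memory_gb(n_params, bytes_per_value=2):
    """Rough fine-tuning footprint: weights + gradients + Adam's two moments."""
    weights = n_params * bytes_per_value
    state = 3 * n_params * bytes_per_value  # grads + first moment + second moment
    return (weights + state) / 1e9

print(training_memory_gb(7e9))   # 56.0 GB — vs ~14 GB to merely store the model
print(training_memory_gb(70e9))  # 560.0 GB
```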

The practical question: do you actually need to change all 7 billion parameters to teach the model a new task?

Empirically, no. The weight updates during fine-tuning are low-rank: they live in a much smaller subspace than the full weight matrix would suggest.

Analogy

Imagine tuning a grand piano. You don’t replace every string and key. You make small targeted adjustments — maybe just the sustain pedal and a few keys in one octave. The underlying instrument stays the same; your adjustments are sparse and targeted.

LoRA does something mathematically analogous. The pretrained model is the piano. Fine-tuning adjusts only a thin “correction” that lives in a low-dimensional subspace. At deployment, you just add the correction to the original — same sound, lighter to store and swap.

Mechanism in Plain English

A weight matrix W₀ in a Transformer layer has shape d × k — say 4096 × 4096 for the query projection in a 7B-scale model. That’s 16.7M parameters in one matrix.

The full fine-tuning update ΔW would also be 4096 × 4096. LoRA instead constrains ΔW to be the product of two small matrices: ΔW = BA.

  1. Freeze W₀ completely. No gradients flow through it.
  2. Add a bypass: the input x also flows through BA, where:
    • A has shape r × k (e.g., 8 × 4096 = 32K parameters)
    • B has shape d × r (e.g., 4096 × 8 = 32K parameters)
    • r is the rank — typically 4, 8, or 16
  3. The adapted output is: h = W₀x + BAx
  4. Initialize A with random Gaussian, B with zeros → ΔW = BA = 0 at training start
  5. Only A and B are trained
  6. At inference: merge W′ = W₀ + BA, zero extra latency
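Steps 2, 3, and 6 can be checked numerically: the bypass form used during training and the merged form used at inference compute the same output. A minimal numpy sketch with toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = k = 16
r = 2

W0 = rng.standard_normal((d, k))        # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01  # trained down-projection
B = rng.standard_normal((d, r)) * 0.01  # trained up-projection (zero-init at step 0)
x = rng.standard_normal(k)

h_bypass = W0 @ x + B @ (A @ x)  # training-time: frozen path + low-rank bypass
h_merged = (W0 + B @ A) @ x      # inference-time: one matmul on merged weights

print(np.allclose(h_bypass, h_merged))  # True
```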

ASCII Diagram

  Standard fine-tuning:
  
  x ──→ [W₀ + ΔW] ──→ h
         ↑ full 4096×4096 trainable
  
  ─────────────────────────────────────────────

  LoRA:
  
         ┌──→ [W₀ (frozen)] ──→─────────────┐
  x ──→  │                                   + ──→ h
         └──→ [A (r×k)] ──→ [B (d×r)] ──→──┘
              ↑ trainable ↑   ↑ trainable ↑
              
  A: 8×4096 = 32,768 parameters
  B: 4096×8 = 32,768 parameters
  
  vs. full ΔW: 4096×4096 = 16,777,216 parameters
  
  256x fewer parameters to train in this layer

Math with Translation

The update is ΔW = BA, where:

  • B (shape d × r) — maps from low-rank space back to model dimension
  • A (shape r × k) — maps from input to low-rank space
  • r ≪ min(d, k) — the rank constraint

The forward pass adds a scaling term: h = W₀x + (α/r)·BAx

  • α is a scalar hyperparameter (often set to the same value as r, making α/r = 1)
  • The α/r scaling makes the effective learning rate of the LoRA weights independent of r

If d = k = 4096 and r = 8:

  • Full ΔW: 16,777,216 parameters
  • LoRA (A + B): 65,536 parameters
  • Reduction: 256x
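The arithmetic above, written out so other (d, k, r) choices can be checked:

```python
def lora_param_count(d, k, r):
    """Trainable parameters in one LoRA-adapted d×k matrix: B (d×r) plus A (r×k)."""
    return d * r + r * k

full = 4096 * 4096
lora = lora_param_count(4096, 4096, 8)
print(full)          # 16777216
print(lora)          # 65536
print(full // lora)  # 256
```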

Concrete Walkthrough

Take a 4D toy model: d=4, k=4, rank r=2.

W₀ (frozen) = [[1,0,0,1], [0,1,0,0], [0,0,1,0], [1,0,0,1]]

After 100 training steps, LoRA learns:

A (2×4) = [[0.1, -0.2, 0.0, 0.3], [-0.1, 0.0, 0.2, 0.1]]

B (4×2) = [[0.5, 0.0], [0.0, 0.3], [0.2, 0.1], [0.0, 0.4]]

Then ΔW = B·A (4×4), e.g. entry [0,0] = 0.5·0.1 + 0·(-0.1) = 0.05

The merged weight W’ = W₀ + ΔW. At inference, this is just a 4×4 matmul — identical cost to the original.
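The walkthrough reproduced in numpy, checking both the [0,0] entry and the training/inference equivalence:

```python
import numpy as np

W0 = np.array([[1., 0., 0., 1.],
               [0., 1., 0., 0.],
               [0., 0., 1., 0.],
               [1., 0., 0., 1.]])     # frozen 4×4 base weight
A = np.array([[0.1, -0.2, 0.0, 0.3],
              [-0.1, 0.0, 0.2, 0.1]])  # learned, 2×4
B = np.array([[0.5, 0.0],
              [0.0, 0.3],
              [0.2, 0.1],
              [0.0, 0.4]])             # learned, 4×2

dW = B @ A                 # (4×2)·(2×4) → 4×4
print(round(dW[0, 0], 6))  # 0.05 = 0.5·0.1 + 0.0·(-0.1)

W_merged = W0 + dW         # inference-time weight: one 4×4 matmul, no overhead
x = np.ones(4)
print(np.allclose(W_merged @ x, W0 @ x + B @ (A @ x)))  # True
```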

For a 7B-scale model (32 layers, 4096-dim hidden) with r=8 applied to all Q, K, V, and output projections in every layer: roughly 8.4M trainable parameters vs 7B — about 0.12% of the original.
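That count, under one concrete set of assumptions — 32 layers, 4096-dim projections, adapters on four matrices per layer (exact numbers vary by architecture):

```python
layers, d, r, n_proj = 32, 4096, 8, 4  # Q, K, V, output projections per layer
per_matrix = 2 * d * r                 # A (r×d) plus B (d×r) per adapted matrix
lora_total = layers * n_proj * per_matrix
print(lora_total)                      # 8388608, i.e. ~8.4M
print(f"{lora_total / 7e9:.4%}")       # ~0.12% of 7B
```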

What’s Clever

The key insight is empirical before it’s theoretical: the intrinsic dimensionality of task-specific fine-tuning is small.

Li et al. (2018) showed that neural networks can be trained in surprisingly low-dimensional subspaces, and Aghajanyan et al. (2020) found the same holds for fine-tuning pretrained language models — most tasks don’t need the full expressive power of the weight space. LoRA makes this concrete and practical.

The second clever thing: no inference cost. Adapter-based approaches (Houlsby et al. 2019) also freeze the base model but insert new layers — that changes the architecture and adds latency. LoRA’s BA is merged at deployment, vanishing completely. You get the savings during training, pay nothing at inference.

The third clever thing: swappable specializations from one backbone. Because LoRA adapters are tiny (tens of megabytes vs ~14GB for a 7B base model in fp16), you can serve one base model and hot-swap adapters per request. This is how many production serving systems run tens of fine-tuned variants on a single GPU cluster.
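A sketch of the swap mechanics via merge/unmerge (the task names are hypothetical; real serving systems often keep adapters unmerged and batch across them, but this shows the idea):

```python
import numpy as np

rng = np.random.default_rng(0)
d = k = 8
r = 2
W = rng.standard_normal((d, k))  # the one shared base weight on the GPU
W_base = W.copy()

# Hypothetical per-task adapters — each is just a tiny (B, A) pair.
adapters = {
    "summarize": (0.01 * rng.standard_normal((d, r)), 0.01 * rng.standard_normal((r, k))),
    "translate": (0.01 * rng.standard_normal((d, r)), 0.01 * rng.standard_normal((r, k))),
}

def swap_in(W, task):
    B, A = adapters[task]
    W += B @ A  # merge in place: no extra inference cost while serving

def swap_out(W, task):
    B, A = adapters[task]
    W -= B @ A  # unmerge: base weights restored for the next task

swap_in(W, "summarize")
# ... serve "summarize" requests against W ...
swap_out(W, "summarize")
print(np.allclose(W, W_base))  # True
```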

Code

LoRA adapter as a drop-in replacement for nn.Linear:

import torch
import torch.nn as nn
 
class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=4, alpha=1.0):
        super().__init__()
        self.W0 = nn.Linear(in_features, out_features, bias=False)  # pretrained weights
        self.W0.requires_grad_(False)                                # freeze — no gradients
        self.A = nn.Linear(in_features, rank, bias=False)           # down-project to rank
        self.B = nn.Linear(rank, out_features, bias=False)          # up-project back
        nn.init.kaiming_uniform_(self.A.weight)   # A: random init (standard)
        nn.init.zeros_(self.B.weight)             # B: zero init → ΔW=BA=0 at start
        self.scale = alpha / rank                 # scaling: makes LR independent of rank
 
    def forward(self, x):
        return self.W0(x) + self.scale * self.B(self.A(x))  # frozen + low-rank update
 
# Parameter count comparison (in_features=4096, out_features=4096, rank=8):
# Full ΔW: 4096×4096 = 16,777,216 params
# LoRA A+B: 4096×8 + 8×4096 = 65,536 params  (256x reduction)
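A usage sketch, checking the trainable-parameter count and the zero-init guarantee (the class above is restated so the snippet runs standalone):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):  # same as above, restated for a runnable example
    def __init__(self, in_features, out_features, rank=4, alpha=1.0):
        super().__init__()
        self.W0 = nn.Linear(in_features, out_features, bias=False)
        self.W0.requires_grad_(False)
        self.A = nn.Linear(in_features, rank, bias=False)
        self.B = nn.Linear(rank, out_features, bias=False)
        nn.init.kaiming_uniform_(self.A.weight)
        nn.init.zeros_(self.B.weight)
        self.scale = alpha / rank

    def forward(self, x):
        return self.W0(x) + self.scale * self.B(self.A(x))

layer = LoRALinear(4096, 4096, rank=8, alpha=8.0)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(trainable)  # 65536
print(total)      # 16842752

# Zero-init B means the adapted layer matches the frozen path exactly at step 0:
x = torch.randn(2, 4096)
print(torch.allclose(layer(x), layer.W0(x)))  # True
```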

Key Sources

  • Hu et al. (2021), “LoRA: Low-Rank Adaptation of Large Language Models”
  • Li et al. (2018), “Measuring the Intrinsic Dimension of Objective Landscapes”
  • Aghajanyan et al. (2020), “Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning”
  • Houlsby et al. (2019), “Parameter-Efficient Transfer Learning for NLP”

Open Questions

  • Optimal rank and which layers to apply LoRA to for different tasks
  • Whether merged LoRA weights can be composed (LoRA merging/arithmetic)