The Problem
Transformers are quadratic in sequence length. Doubling the context quadruples the attention computation and memory. At 100K tokens, the attention matrix alone is 100K × 100K = 10 billion entries — 40GB in float32 before you’ve touched the model weights. At a million tokens, it’s simply infeasible.
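A quick sanity check on that arithmetic (a sketch; it counts only the dense float32 score matrix itself, ignoring tricks like FlashAttention that avoid materializing it):

```python
def attn_matrix_gb(seq_len: int, bytes_per_entry: int = 4) -> float:
    """Memory for the dense seq_len x seq_len attention score matrix, in GB."""
    return seq_len * seq_len * bytes_per_entry / 1e9

print(attn_matrix_gb(100_000))    # 40.0 GB at 100K tokens
print(attn_matrix_gb(1_000_000))  # 4000.0 GB at 1M tokens
```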
RNNs are linear — they compress everything into a fixed-size hidden state and process sequences one step at a time. But that compression is lossy. LSTMs and GRUs forget things. And sequential processing means you can’t parallelize over sequence length during training.
Is there something that’s linear like an RNN but trainable like a Transformer?
Analogy
Imagine tracking public sentiment about a brand. You don’t store every tweet ever written — you maintain a running summary: sentiment score, trending topics, velocity of change. When a new tweet arrives, you update the summary. Old tweets fade as their relevance decays.
That’s an SSM. The hidden state is the summary. The update rule is the recurrence. The key question is: what should the summary track, and how fast should old information fade?
A vanilla SSM has fixed decay rates — every topic fades at the same speed regardless of what’s happening. Mamba makes the decay rates input-dependent: breaking news should update the summary quickly, while stable long-running sentiment can fade slowly. The model learns when to remember and when to forget, based on what it’s actually reading.
Mechanism in Plain English
A State Space Model defines:
h_t = A·h_{t-1} + B·x_t (state update)
y_t = C·h_t (output)
- h_t is the hidden state (compressed memory of everything so far)
- A is the transition matrix (how much of the old state to keep)
- B maps the input to a state update
- C extracts the relevant information from the state to produce the output
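Those two equations are just a loop. A minimal scalar sketch (A, B, C here are illustrative constants, not learned values):

```python
def ssm_step(h, x, A=0.9, B=1.0, C=1.0):
    """One SSM step: keep A of the old state, write B*x in, read out with C."""
    h = A * h + B * x   # state update
    return h, C * h     # new state, output

h, ys = 0.0, []
for x in [1.0, 0.0, 0.0, 0.0]:  # a single impulse, then silence
    h, y = ssm_step(h, x)
    ys.append(round(y, 3))

print(ys)  # [1.0, 0.9, 0.81, 0.729]: the impulse fades geometrically at rate A
```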
The problem with vanilla SSMs: A, B, C are fixed for all positions. The same decay rate applies to “the cat sat on the” as to “3 months later, the treaty was…” The model can’t decide that this particular input should flush old state or that this token should be carefully remembered.
Mamba’s selective SSM (S6): make B, C, and the time step Δ functions of the input x_t. Now the model can say: “this input is important, so use a large Δ: flush the old state quickly and write this input in strongly. That input is noise, so use a small Δ: preserve the old state and let the input barely register.”
This is called selectivity — input-dependent control over information flow.
ASCII Diagram
Vanilla SSM (fixed parameters):
x1 → [A,B,C fixed] → h1 → y1
x2 → [A,B,C fixed] → h2 → y2 A,B,C same regardless of content
x3 → [A,B,C fixed] → h3 → y3
...
──────────────────────────────────────────────────────
Mamba selective SSM:
x1 → Δ1=f(x1), B1=g(x1), C1=h(x1) → h1 → y1
x2 → Δ2=f(x2), B2=g(x2), C2=h(x2) → h2 → y2 ← parameters change per token
x3 → Δ3=f(x3), B3=g(x3), C3=h(x3) → h3 → y3
Example: x3 = "IMPORTANT:" → large Δ3, B3 opens gate wide → strong state update
x4 = "the" → small Δ4, B4 nearly closed → state barely updates
──────────────────────────────────────────────────────
Training vs. inference:
Training (parallel scan):
x1 x2 x3 x4 ... → compute all h_t simultaneously → fast
Inference (recurrent):
x1 → h1 → x2 → h2 → x3 → h3 sequential, but O(1) memory per step
Mamba’s cache: fixed size (just h_t, the state vector)
Transformer KV cache: grows with every token
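The reason training can be parallel: each step of the recurrence h_t = a_t·h_{t-1} + b_t is the map h ↦ a·h + b, and composing such maps is associative, so all prefixes can be computed with a scan. A sketch (`accumulate` runs sequentially here, but the same associative operator is what the hardware-aware parallel scan evaluates in logarithmic depth):

```python
from itertools import accumulate

def combine(e1, e2):
    """Compose two updates h -> a*h + b. Associativity is what lets a
    Blelloch-style parallel prefix scan compute every h_t at once."""
    a1, b1 = e1
    a2, b2 = e2
    return (a1 * a2, a2 * b1 + b2)

# Per-token coefficients (a_t, b_t), with b_t already folded to B_t * x_t
elems = [(0.9, 1.0), (0.9, 0.5), (0.9, 3.0)]

# Scan route: every prefix at once; with h_0 = 0, h_t is the b-component
hs_scan = [b for _, b in accumulate(elems, combine)]

# Sequential route, for comparison
h, hs_seq = 0.0, []
for a, b in elems:
    h = a * h + b
    hs_seq.append(h)

print(hs_scan == hs_seq)  # same values, but the scan parallelizes
```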
Math with Translation
Discrete SSM (after discretizing the continuous-time system):
h_t = Ā·h_{t-1} + B̄·x_t
y_t = C·h_t
where Ā = exp(ΔA), B̄ = (ΔA)⁻¹(exp(ΔA) - I)·ΔB
- Δ (delta): time step size; controls how much the continuous system “moves” per discrete step. Large Δ = state updates aggressively; small Δ = mostly preserves old state.
- A: state transition matrix; in structured SSMs (S4), A is diagonal or otherwise structured for efficiency.
- Ā: the effective state transition after discretization; its eigenvalues determine the decay rates.
- B̄·x_t: the “write” operation; how much of input x_t to inject into the state.
In Mamba, Δ, B, C are all linear projections of x_t:
Δ = softplus(Linear(x_t))
B = Linear(x_t)
C = Linear(x_t)
This breaks time-invariance: the model can no longer be trained as a convolution (the standard SSM training trick), but the hardware-aware parallel scan replaces it.
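A sketch of one selective step under these formulas, with a diagonal A and the common B̄ ≈ Δ·B simplification. All weights below are made-up stand-ins for learned projections, and the input is a single scalar channel for readability:

```python
import numpy as np

N = 4                                    # state size
w_delta = 0.5                            # stand-in for Linear(x) -> Delta
W_B = np.array([1.0, -0.5, 0.3, 0.8])    # stand-in for Linear(x) -> B
W_C = np.array([0.2, 1.0, -0.3, 0.5])    # stand-in for Linear(x) -> C
A = -np.array([0.5, 1.0, 2.0, 4.0])      # diagonal A, negative => stable decay

def softplus(z):
    return np.log1p(np.exp(z))

def selective_step(h, x):
    """One selective-SSM step: Delta, B, C all depend on the current input."""
    delta = softplus(w_delta * x)        # per-token step size
    B, C = W_B * x, W_C * x              # input-dependent write/read vectors
    A_bar = np.exp(delta * A)            # ZOH: A_bar = exp(Delta * A)
    h = A_bar * h + delta * B * x        # B_bar approximated as Delta * B
    return h, float(C @ h), delta

h = np.zeros(N)
for x in [0.1, 3.0]:                     # a weak token, then a strong one
    h, y, delta = selective_step(h, x)
    print(f"x={x}: delta={delta:.3f}")   # small Delta, then large Delta
```

The stronger input gets a larger Δ, so its write into the state is more aggressive and the old state decays faster — exactly the selectivity described above.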
Concrete Walkthrough
State size N=2, input dimension D=1. Sequence: x = [1.0, 0.5, 3.0].
Fixed parameters (simplified): A = [[0.9,0],[0,0.5]], B = [1,1]ᵀ, C = [1,0]
Initial state h_0 = [0, 0].
Step 1 (x=1.0):
h_1 = A·h_0 + B·x_1 = [[0.9,0],[0,0.5]]·[0,0] + [1,1]·1.0 = [1.0, 1.0]
y_1 = C·h_1 = [1,0]·[1.0,1.0] = 1.0
Step 2 (x=0.5):
h_2 = A·h_1 + B·x_2 = [0.9·1.0, 0.5·1.0] + [0.5, 0.5] = [1.4, 1.0]
y_2 = [1,0]·[1.4,1.0] = 1.4
Step 3 (x=3.0):
h_3 = A·h_2 + B·x_3 = [0.9·1.4, 0.5·1.0] + [3.0, 3.0] = [4.26, 3.5]
y_3 = [1,0]·[4.26,3.5] = 4.26
Notice: step 1’s contribution is still visible in h_3 through the first component, decayed by 0.9 twice (1.0 → 0.9 → 0.81), while in the second component (0.5 decay) it has already shrunk to 0.5·0.5 = 0.25. Different components of the state decay at different rates.
In Mamba, Δ would be large at step 3 (x=3.0 is a strong signal) and small at step 2 (x=0.5 is weak), controlling how aggressively the state updates.
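The walkthrough is small enough to check mechanically; this reproduces the three steps with the same A, B, C, and inputs (rounding only for display):

```python
import numpy as np

A = np.array([[0.9, 0.0], [0.0, 0.5]])
B = np.array([1.0, 1.0])
C = np.array([1.0, 0.0])

h, ys = np.zeros(2), []
for x in [1.0, 0.5, 3.0]:
    h = A @ h + B * x            # h_t = A h_{t-1} + B x_t
    ys.append(round(float(C @ h), 2))

print(ys)  # [1.0, 1.4, 4.26], matching y_1, y_2, y_3 above
```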
What’s Clever
The core insight is separating the expressiveness question from the efficiency question.
Prior SSMs (S4, H3) were efficient but time-invariant — they couldn’t do content-based reasoning. Transformers are expressive but quadratic. The assumption everyone made: you need the quadratic attention matrix to do content-based routing.
Mamba showed this isn’t true. Selectivity (input-dependent parameters) gives you content-based routing inside a recurrent computation that never builds the N×N matrix.
The hardware-aware parallel scan is the engineering breakthrough that makes this trainable. Even though Mamba runs as a recurrence at inference, during training you can compute all hidden states simultaneously using a scan algorithm that’s as parallelizable as convolution. This is how Mamba gets the best of both worlds: parallel training, recurrent inference.
The practical consequence: at inference, Mamba’s memory footprint is O(1) per token — just the fixed-size state vector. Transformers require O(N) for the KV cache. For very long contexts (>100K tokens), this difference is enormous.
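To put rough numbers on that, a back-of-envelope comparison for a hypothetical 32-layer, d_model=4096 model in float16 (the config is made up for illustration; real cache layouts vary):

```python
def kv_cache_gb(seq_len, n_layers=32, d_model=4096, bytes_per=2):
    """Transformer KV cache: a K and a V vector per layer, per token."""
    return 2 * n_layers * d_model * bytes_per * seq_len / 1e9

def ssm_state_gb(n_layers=32, d_model=4096, state_n=16, bytes_per=2):
    """SSM state: N numbers per channel per layer, independent of seq_len."""
    return n_layers * d_model * state_n * bytes_per / 1e9

print(kv_cache_gb(100_000))   # tens of GB at 100K tokens, growing linearly
print(ssm_state_gb())         # a few MB, constant at any context length
```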
Key Sources
Related Concepts
Open Questions
- Can pure SSMs match Transformers on in-context learning and retrieval tasks?
- What is the optimal mixing ratio for hybrid Mamba-Transformer models?
- Can selective SSMs be made as hardware-efficient as FlashAttention?