Every time your language model reads a new word, it re-reads every previous word to decide if it matters. A 1,000-word context means 1 million comparisons. A 10,000-word context means 100 million. That’s not a performance quirk — it’s quadratic scaling baked into attention’s DNA. Researchers had faster alternatives (recurrent models, state space models), but they came with a brutal catch: fixed memory. They couldn’t decide what was worth remembering. They treated “the” the same as “Einstein.” Mamba fixes the catch.

The core mechanism

Imagine a very attentive librarian whose job is to maintain a one-page summary of every book read so far. Old-school recurrent models gave this librarian a rigid form to fill out — same fields, same weights — regardless of what they were reading. The librarian dutifully recorded plot summary, character count, page length, even for books about nothing. Every token, same treatment.

Mamba gives the librarian the ability to change the form based on what they’re reading. Hit a character’s name? Open a slot and write it in prominently. Hit a filler word? Let it pass without touching the summary. The form itself adapts. That’s selectivity.

Here’s what happens mechanically, step by step:

  1. The model maintains a hidden state — a compact summary of everything seen so far, sized N × D (e.g. N=16 state dimensions for each of D=2048 channels, 32,768 total “slots”)
  2. For each new token, the model runs that token through three tiny projections to compute three input-dependent quantities: Δ (a per-channel scalar — how much attention to pay), B (an N-dimensional vector — how to write this token into memory), and C (an N-dimensional vector — how to read from memory for the output)
  3. Δ (delta) is the master control. Large Δ means “this token matters — reset some state and absorb it.” Small Δ means “noise — barely touch the state, let history persist.”
  4. The new hidden state is: h_new = A_bar × h_old + B_bar × x_current — where A_bar and B_bar are computed FROM x, not fixed constants
  5. The output for this token is: y = C × h_new — C also comes from the current token, so the model controls what it reads out too
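The five steps above can be sketched in a few lines of plain Python. This is a toy, single-channel version with illustrative projection weights (the names `selective_step`, `w_delta`, `w_B`, `w_C` and all numeric values are made up for the demo, not taken from the paper's code):

```python
import math

def softplus(z):
    return math.log1p(math.exp(z))

def selective_step(h, x, A, w_delta, w_B, w_C):
    """One selective-SSM step: Delta, B, C are all computed FROM the token x."""
    delta = softplus(w_delta * x)               # steps 2-3: input-dependent step size
    B = [w * x for w in w_B]                    # step 2: how to write x into memory
    C = [w * x for w in w_C]                    # step 2: how to read memory out
    A_bar = [math.exp(delta * a) for a in A]    # step 4: per-token decay of old state
    h = [ab * hi + delta * b * x for ab, hi, b in zip(A_bar, h, B)]   # step 4: update
    y = sum(c * hi for c, hi in zip(C, h))      # step 5: output read-out
    return h, y

# A meaningful token (large |x|) moves the state far more than a filler token:
A = [-1.0, -2.0]
h_small, _ = selective_step([0.0, 0.0], 0.1, A, 0.5, [1.2, 0.8], [1.0, 1.0])
h_large, _ = selective_step([0.0, 0.0], 0.9, A, 0.5, [1.2, 0.8], [1.0, 1.0])
```

Same function, same weights: the only thing that differs between the two calls is the input, yet the resulting state updates differ by two orders of magnitude.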

The magic is steps 2–4: the parameters that govern the memory update are computed on the fly from the token being processed. Prior SSMs had fixed A, B, C for all tokens. Making them input-dependent is the paper’s central conceptual contribution: one small change, enormous consequences.

TRADITIONAL SSM (fixed rules, fast but dumb):

  Token 1 ──► [A, B, C fixed] ──► state_1
  Token 2 ──► [A, B, C fixed] ──► state_2    ← same rules for "the" and "Einstein"
  Token 3 ──► [A, B, C fixed] ──► state_3

MAMBA (rules computed from each token):

  "the"      ──► [Δ≈0.1, B(x), C(x)] ──► state barely changes
  "Einstein" ──► [Δ≈2.1, B(x), C(x)] ──► state strongly updates
  "said"     ──► [Δ≈0.3, B(x), C(x)] ──► state barely changes

  Big Δ → absorb token strongly, partial reset of old history
  Small Δ → ignore token, preserve old history intact

MAMBA BLOCK (what replaces the Transformer block):

  Input
    │
    ├──────────────────────┐
    ▼                      ▼
  Linear (expand)        Linear (expand)
    │                      │
    ▼                      │
  Conv1d (local context)  │
    │                      │
    ▼                      │
  SiLU activation         │
    │                      │
    ▼                      │
  Selective SSM (S6)      │
  [computes Δ,B,C          │
   from input]             │
    │                      │
    └────── × (gate) ◄─────┘
             │
             ▼
           Linear (project back)
             │
             ▼
           Output

  No attention. No MLP. Just this, stacked 48 times.
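Assuming toy sizes and random weights, the dataflow of the diagram can be traced end to end in NumPy. This is a shape-level sketch only (the expansion factor, kernel size, and all weights are illustrative); the real model replaces the Python loop with the fused hardware-aware scan:

```python
import numpy as np

# Shape-level toy of the Mamba block: L = seq len, D = model dim, E = expansion.
rng = np.random.default_rng(0)
L, D, E, N = 8, 4, 2, 3
ED = E * D
x = rng.normal(size=(L, D))

def silu(z):
    return z / (1.0 + np.exp(-z))

W_in = rng.normal(size=(D, ED))        # left branch:  Linear (expand)
W_gate = rng.normal(size=(D, ED))      # right branch: Linear (expand)
W_out = rng.normal(size=(ED, D))       # final:        Linear (project back)

u, g = x @ W_in, x @ W_gate

# Conv1d (local context): depthwise causal conv, kernel size 3
k = rng.normal(size=(3, ED)) * 0.1
u_pad = np.vstack([np.zeros((2, ED)), u])
u = silu(np.stack([(u_pad[t:t + 3] * k).sum(axis=0) for t in range(L)]))

# Selective SSM (S6): Delta, B, C projected from u, run as a plain recurrence
A = -np.exp(rng.normal(size=(ED, N)))              # negative => decaying memory
W_delta = rng.normal(size=(ED, ED)) * 0.1
W_B, W_C = rng.normal(size=(ED, N)), rng.normal(size=(ED, N))
delta = np.log1p(np.exp(u @ W_delta))              # softplus, per token & channel
h, ys = np.zeros((ED, N)), []
for t in range(L):
    A_bar = np.exp(delta[t][:, None] * A)                          # decay old state
    h = A_bar * h + (delta[t] * u[t])[:, None] * (u[t] @ W_B)      # write token in
    ys.append(h @ (u[t] @ W_C))                                    # read state out
y = np.stack(ys)

out = (y * silu(g)) @ W_out            # gate with the right branch, project back
```

Note what is absent: no attention matrix, no softmax over positions, no MLP sublayer. The gating branch plus the selective recurrence is the whole block.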

The key equation is:

h_t = Ā × h_{t-1} + B̄ × x_t

where Ā = exp(Δ·A) and B̄ ≈ Δ·B

  • h_t — the hidden state after reading token t: your running summary of everything
  • Ā — how much of the old summary to keep (0=forget all, 1=keep all); derived from Δ, so it varies per token
  • h_{t-1} — the previous summary
  • B̄ — how strongly to write the new token into memory; also derived from Δ, varies per token
  • x_t — the current token embedding

What Δ does: Ā = exp(Δ·A). When Δ is large (say 3.0) and A is negative (say -1), Ā = exp(-3.0) = 0.05. The old state is almost erased and the new token dominates. When Δ is small (say 0.1), Ā = exp(-0.1) = 0.90. Old history persists strongly.
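The paragraph's arithmetic, runnable:

```python
import math

# A = -1 as in the text; A_bar = exp(delta * A) is the keep-fraction of old state.
A = -1.0
for delta in (3.0, 0.1):
    print(f"delta={delta}: A_bar={math.exp(delta * A):.2f}")
# delta=3.0: A_bar=0.05   -> old state nearly erased, new token dominates
# delta=0.1: A_bar=0.90   -> old history persists strongly
```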

Walkthrough with actual numbers:

Setup: hidden state dimension N=2, single channel. We process three tokens: “um”, “Paris”, “is”.

State starts empty: h = [0.0, 0.0]. Fixed matrix A = [-1.0, -2.0] (learned, constant).

Token 1: “um” → embedding x = 0.1 (small, filler word)

Δ = 0.5 × 0.1 = 0.050 (the real model applies a softplus to a learned projection; omitted here for readable arithmetic)
Ā = exp(0.050 × [-1, -2]) = [0.951, 0.905]
B̄ = 0.050 × [0.12, 0.08] = [0.006, 0.004]

New state: h = [0.951×0 + 0.006×0.1, 0.905×0 + 0.004×0.1]
h = [0.0006, 0.0004] ← barely anything written in

Output: y = C·h ≈ 0.000070 (tiny, "um" contributed almost nothing)

Token 2: “Paris” → embedding x = 0.9 (large, meaningful noun)

Δ = 0.5 × 0.9 = 0.450 (softplus again omitted for readable arithmetic)
Ā = exp(0.450 × [-1, -2]) = [0.638, 0.407]
B̄ = 0.450 × [1.08, 0.72] = [0.486, 0.324]

New state: h = [0.638×0.0006 + 0.486×0.9, 0.407×0.0004 + 0.324×0.9]
h = [0.438, 0.292] ← Paris strongly written into state

Output: y = C·h ≈ 0.460 (strong signal)

Compare: “um” contributed 0.000070 to output. “Paris” contributed 0.460. A ratio of ~6,500:1 for the same model, same parameters — just different Δ values driven by different inputs.
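The whole walkthrough fits in a short loop. Here Δ is simplified to 0.5·x with the softplus omitted, and B(x) = x·[1.2, 0.8] is back-solved from the worked numbers; both are illustrative choices, not values from the paper:

```python
import math

A = [-1.0, -2.0]                 # fixed, learned decay matrix
h = [0.0, 0.0]                   # state starts empty
for x in (0.1, 0.9):             # "um", then "Paris"
    delta = 0.5 * x                              # toy stand-in for softplus(proj(x))
    A_bar = [math.exp(delta * a) for a in A]     # per-token keep-fractions
    B = [1.2 * x, 0.8 * x]                       # input-dependent write vector
    h = [ab * hi + delta * b * x for ab, hi, b in zip(A_bar, h, B)]
print([round(v, 3) for v in h])  # prints [0.438, 0.292]: "Paris" dominates the state
```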

“The main difference is simply making several parameters Δ, B, C functions of the input, along with the associated changes to tensor shapes throughout.”

Translation: the authors are almost understating it. The code diff is small — swap three constant tensors for three tiny linear projections. But the consequences are enormous: the model gains content-awareness, the core capability that had kept SSMs below Transformers on language tasks.

“A fundamental problem of sequence modeling is compressing context into a smaller state. Efficient models must have a small state, while effective models must have a state that contains all necessary information from the context.”

Translation: there’s a fundamental tension. Attention refuses to compress (keeps everything, pays O(n²)). Fixed recurrences compress without discretion (pays O(n), but dumb). Mamba’s answer: compress, but do it intelligently.

“Selectivity allows filtering out irrelevant noise tokens that may occur between inputs of interest. This is exemplified by the Selective Copying task, but occurs ubiquitously in common data modalities, particularly for discrete data — for example the presence of language fillers such as ‘um’.”

What’s clever — the instinct:

Why did this work when so many smarter-seeming approaches failed? The non-obvious insight is that input-dependent gating already existed (LSTMs have had forget gates since 1997). So why hadn’t this been done to SSMs before?

The blocker was computational, not conceptual. The reason all prior SSMs used fixed (Linear Time Invariant) parameters is that LTI lets you compute the whole sequence as one big parallel convolution. Once the parameters depend on the input, the computation no longer collapses into a convolution; you’re back to sequential recurrence. At sequence length 100K, sequential recurrence on a GPU is painful.

The authors’ solution was hardware-aware: don’t materialize the expanded state (shape B×L×D×N) in slow GPU memory at all. Instead, load the compact parameters (Δ, A, B, C) into fast SRAM, run the recurrence there with a parallel prefix scan (the GPU-friendly cumulative-sum operation), and write only the final outputs back to slow memory. The state dimension N never has to live in HBM.
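The trick that makes the recurrence parallelizable: each step h_t = a_t·h_{t-1} + b_t is an affine map, and affine maps compose associatively, so a parallel prefix scan applies. A minimal illustration of that algorithmic idea (Hillis–Steele schedule; the real kernel fuses this with the discretization inside SRAM, per state slot):

```python
def combine(e1, e2):
    """Compose affine steps (apply e1, then e2): h -> a2*(a1*h + b1) + b2."""
    (a1, b1), (a2, b2) = e1, e2
    return (a1 * a2, a2 * b1 + b2)

def prefix_scan(elems):
    """Inclusive scan in O(log L) parallel rounds (Hillis-Steele schedule)."""
    elems, step = list(elems), 1
    while step < len(elems):
        elems = [combine(elems[i - step], e) if i >= step else e
                 for i, e in enumerate(elems)]
        step *= 2
    return elems

# Scanned element t encodes h_t as a function of h_0; with h_0 = 0, h_t is the b part.
steps = [(0.9, 0.1), (0.5, 1.0), (0.99, 0.0), (0.2, 2.0)]
h = 0.0
for (a, b), (_, h_scan) in zip(steps, prefix_scan(steps)):
    h = a * h + b
    assert abs(h - h_scan) < 1e-12   # scan reproduces the sequential recurrence
```

Each round combines pairs `step` apart, so L steps collapse in log₂(L) parallel rounds; that is what lets the GPU treat an "inherently sequential" recurrence as one big parallel operation.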

“We load the SSM parameters (Δ, A, B, C) directly from slow HBM to fast SRAM, perform the discretization and recurrence in SRAM, and then write the final outputs of size (B, L, D) back to HBM.”

Translation: same IO-minimization philosophy as FlashAttention — never let the big intermediate tensors touch slow GPU memory. The insight that this was possible for recurrences (not just attention) is what made Mamba practical, not just theoretically appealing.

The combination is the real contribution: selectivity (conceptual) + hardware-aware scan (engineering). Either alone is insufficient. Together, they break the efficiency-effectiveness tradeoff.

Does it actually work?

  Model                  Params   Perplexity (lower = better)      Notes
  Pythia (Transformer)   2.8B     baseline                         Strong modern Transformer
  Mamba                  1.4B     matches Pythia 2.8B              Half the parameters, same quality
  Mamba                  2.8B     beats Pythia 2.8B by 4 pts avg   On common-sense reasoning downstream

Beyond language:

  • Inference throughput: 5× higher than a same-size Transformer at length 2048; the gap widens with longer sequences
  • Audio generation (SC09 speech): FID 0.29 vs. prior best 2.42 — more than 8× improvement
  • DNA modeling (GenomicsBenchmarks): New SOTA; the million-base-pair sequences are a natural use case for linear-scaling models

What doesn’t work:

Fixed-size state means perfect recall is impossible. Tasks that require exact lookup of an arbitrary earlier token (“what was the 847th word?”) exceed what Mamba’s compressed state can retain. Attention can do this trivially because it keeps everything. Mamba can’t — it’s a fundamental information-theoretic constraint, not an implementation bug.

The paper is also honest about scale: “all of our experiments use models up to 1.3B parameters… scaling behavior at larger model sizes is left for future work.” Whether Mamba’s edge holds at 70B+ remained an open question at publication.

Custom CUDA kernels are required. The hardware-aware scan can’t be expressed efficiently in standard PyTorch ops; a naive implementation gives the speed advantage back. Ecosystem maturity (tooling, quantization libraries, serving infrastructure) lags Transformers by years.

So what?

If you’re building ML systems, the decision tree is roughly: under 128K context with standard hardware? Transformer + FlashAttention is battle-tested and tooling is mature. Need 1M+ context, ultra-low-latency generation, or predictable memory footprint? Mamba is worth a serious look. The fixed state size means memory per inference step doesn’t grow with context length — a chatbot handling 100K-token conversations uses the same VRAM per step as one handling 100-token ones. That economics matters at scale.
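A back-of-envelope version of that economics, under assumed sizes (fp16, a hypothetical 48-layer model; the head counts and expansion factor are illustrative, not figures from the paper):

```python
LAYERS, D_MODEL, N_STATE, FP16 = 48, 2048, 16, 2   # assumed model sizes, fp16 bytes

def kv_cache_bytes(seq_len, n_heads=16, d_head=128):
    # Transformer: K and V tensors per layer, each seq_len x (n_heads * d_head)
    return 2 * LAYERS * seq_len * n_heads * d_head * FP16

def mamba_state_bytes(expand=2):
    # Mamba: one fixed (expand * d_model) x n_state hidden state per layer
    return LAYERS * expand * D_MODEL * N_STATE * FP16

print(f"KV cache @ 100 tokens:   {kv_cache_bytes(100) / 2**20:.1f} MiB")
print(f"KV cache @ 100K tokens:  {kv_cache_bytes(100_000) / 2**20:.1f} MiB")
print(f"Mamba state, any length: {mamba_state_bytes() / 2**20:.1f} MiB")
# The KV cache grows from ~37 MiB to ~37 GiB; the Mamba state stays at 6 MiB.
```

The exact numbers depend on the architecture, but the shape of the comparison doesn't: one line grows with context length, the other is a constant.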

Mamba proves you don’t need attention to build a great language model — just a recurrence smart enough to decide what to remember.

Connections

  • transformer — the architecture Mamba challenges
  • attention — what Mamba replaces with selective SSM
  • flash-attention — shares the hardware-aware SRAM optimization philosophy
  • kv-cache — Mamba avoids the KV cache problem with fixed-size state

Citation

arXiv:2312.00752

Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. https://arxiv.org/abs/2312.00752