Concepts: mixture-of-experts | long-context | in-context-learning | multimodal-embeddings
Builds on: attention-is-all-you-need | gqa-grouped-query-attention
Leads to:

There’s a standard stress test for language models called “needle in a haystack.” Hide a single sentence, say “The best coffee shop in Chicago is the Blue Whale Cafe,” somewhere in the middle of a 100,000-word document. Then ask: what’s the best coffee shop in Chicago?

Many models advertised with a “128k context” have failed this badly. Not because the answer isn’t in the document (it is, exactly as written) but because their recall collapses in the middle of the window. The model learned during training that answers tend to come near the start or end of the context, so the middle 60% or so becomes a dead zone. A 128k context window was nominally long. In practice it could behave more like 32k of reliable recall with 96k of wishful thinking.
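Here is a rough sketch of how such a test can be set up. The function names (`build_haystack`, `score_answer`) and the filler sentences are made up for illustration, and the call to the model under test is deliberately left out; the sketch only shows how the needle gets placed at a controlled depth and how a reply could be graded.

```python
import random


def build_haystack(filler_sentences, needle, depth, n_sentences, seed=0):
    """Return (document, position): filler text with `needle` inserted.

    depth is a fraction in [0, 1]: 0.0 puts the needle first, 1.0 puts it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    rng = random.Random(seed)
    body = [rng.choice(filler_sentences) for _ in range(n_sentences)]
    position = round(depth * n_sentences)
    body.insert(position, needle)
    return " ".join(body), position


def score_answer(answer, expected_phrase):
    """Crude pass/fail grading: does the reply contain the expected phrase?"""
    return expected_phrase.lower() in answer.lower()


NEEDLE = "The best coffee shop in Chicago is the Blue Whale Cafe."
QUESTION = "What is the best coffee shop in Chicago?"
FILLER = [
    "The committee met on Tuesday to review the budget.",
    "Rainfall was slightly above average that spring.",
    "The library extended its opening hours in the autumn.",
]

# Sweep the needle through the document. Each (doc, QUESTION) pair would be
# sent to the model under test and the reply graded with score_answer.
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    doc, pos = build_haystack(FILLER, NEEDLE, depth, n_sentences=1000)
    print(f"depth={depth:.2f}: needle is sentence {pos}, doc has {len(doc.split())} words")
```

Plotting pass rate against depth and document length is what produces the familiar needle-in-a-haystack heatmaps.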

Gemini 1.5 Pro, announced in February 2024 and documented in a technical report that March, largely solved this. It reports near-perfect needle retrieval at 1 million tokens, across text, images, video, and audio. And then the team ran it at 10 million tokens and it still held.

The core idea

The analogy first.

Imagine a large hospital handling a complex case. No single doctor can hold every patient record, lab result, and research paper in their head. So the hospital divides the work: cardiologists review cardiac data, radiologists interpret scans, pharmacists flag drug interactions. When a question arrives — “is this medication safe given this patient’s history?” — it gets routed to the relevant specialists. Each specialist is fast and focused. Together they can reason across an enormous case file that no individual could manage alone.

That’s the architecture. Gemini 1.5 combines two ideas:

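Sparse Mixture of Experts (MoE): instead of one large feedforward network in each transformer layer, you have many smaller expert networks. A learned router looks at each incoming token and picks a few experts to handle it. Top-2 is a common choice in published MoE designs and is what this walkthrough assumes; the Gemini report does not disclose its routing details. All other experts sit idle for that token. This gives you far more model capacity without a matching increase in per-token compute: with, say, 20 experts and top-2 routing, the layer holds about 10 times the feedforward parameters that any single token actually uses. The hospital has more specialists, but each patient only sees two of them.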

Efficient long-sequence attention: vanilla self-attention is $O(n^2)$ in sequence length $n$. At 1 million tokens, that’s on the order of $10^{12}$ query-key scores per layer. Infeasible on a single device. The report does not detail its attention implementation; the approach assumed here is ring attention, a published memory-efficient technique that distributes the computation across many devices. Each chip holds a slice of the sequence, passes key-value blocks around the ring, and computes its piece of the full attention without ever materializing the full matrix on a single device.
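The ring idea is easier to trust once you see that it gives the same answer as ordinary attention. Below is a minimal single-process simulation, assuming non-causal, single-head attention and plain Python lists; `full_attention` and `ring_attention` are illustrative names, not anything from the report. Each simulated device keeps its own query block and a running (max, normalizer, weighted sum) per query, which is the online-softmax trick that lets it fold in one key-value block at a time.

```python
import math
import random


def full_attention(Q, K, V):
    """Reference version: scores every query against every key at once."""
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in K]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi * v[d] for wi, v in zip(w, V)) / z for d in range(len(V[0]))])
    return out


def ring_attention(Q_blocks, K_blocks, V_blocks):
    """Simulated ring: device i keeps Q_blocks[i]; KV blocks rotate one hop per step."""
    n_dev = len(Q_blocks)
    dim_v = len(V_blocks[0][0])
    # Per-query running state: (max score so far, normalizer, weighted value sum).
    state = [[(-math.inf, 0.0, [0.0] * dim_v) for _ in Qb] for Qb in Q_blocks]
    for step in range(n_dev):
        for dev in range(n_dev):  # on real hardware these run in parallel
            src = (dev + step) % n_dev  # KV block this device holds at this step
            for qi, q in enumerate(Q_blocks[dev]):
                m, z, acc = state[dev][qi]
                for k, v in zip(K_blocks[src], V_blocks[src]):
                    s = sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
                    m_new = max(m, s)
                    scale = math.exp(m - m_new)  # rescale old sums to the new max
                    w = math.exp(s - m_new)
                    z = z * scale + w
                    acc = [a * scale + w * vd for a, vd in zip(acc, v)]
                    m = m_new
                state[dev][qi] = (m, z, acc)
    return [[a / z for a in acc] for dev_state in state for (_, z, acc) in dev_state]


rng = random.Random(0)
n_tokens, dim, n_dev = 8, 4, 4
Q, K, V = ([[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)] for _ in range(3))
per = n_tokens // n_dev
split = lambda X: [X[i * per:(i + 1) * per] for i in range(n_dev)]

ref = full_attention(Q, K, V)
ring = ring_attention(split(Q), split(K), split(V))
max_err = max(abs(a - b) for r1, r2 in zip(ref, ring) for a, b in zip(r1, r2))
print(f"max abs difference vs full attention: {max_err:.2e}")
```

No device in the simulation ever holds more than one key-value block at a time, which is the whole point: memory per device scales with the block size, not the full sequence.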

As the paper describes it, the model is “capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio.”

STANDARD TRANSFORMER (dense, bounded):
  Input: 128k tokens
  Each layer: full attention matrix (128k × 128k) → barely feasible
  Context quality: degrades in middle (model never trained on truly long dependencies)

GEMINI 1.5 (MoE + ring attention):
  Input: 1M tokens (10M tested)
  Attention: distributed across chips via ring — no full matrix materialized
  Each transformer layer: FFN replaced by sparse MoE routing

MoE ROUTING (per token, per layer):
  token embedding [d_model]
        │
   [Router: linear → softmax over K experts]
        │
  Scores: [0.12, 0.31, 0.48, 0.09, ...]  (K experts)
        │
   Top-2 selected: Expert 3, Expert 2
        │
  [Expert 3 FFN]──┐
                   ├─ weighted sum → new token representation
  [Expert 2 FFN]──┘

  Experts 1, 4, 5 ... K : IDLE. Zero compute spent.
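To put numbers on “idle experts cost no compute,” here is a back-of-the-envelope sketch. The layer sizes are hypothetical, chosen only to make the arithmetic visible; the report does not publish Gemini’s dimensions or expert count.

```python
def ffn_params(d_model, d_ff):
    """Weights in one two-matrix feedforward block (biases ignored)."""
    return 2 * d_model * d_ff


def moe_layer_stats(d_model, d_ff, n_experts, top_k):
    """Return (total, active) parameter counts for one MoE feedforward layer."""
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * n_experts  # the linear gate
    total = n_experts * per_expert + router
    active = top_k * per_expert + router  # what a single token actually touches
    return total, active


# Hypothetical sizes, NOT Gemini's.
dense = ffn_params(d_model=4096, d_ff=16384)
total, active = moe_layer_stats(d_model=4096, d_ff=16384, n_experts=16, top_k=2)

print(f"dense FFN params:     {dense:,}")
print(f"MoE total params:     {total:,} ({total / dense:.1f}x dense)")
print(f"MoE active per token: {active:,} ({active / dense:.1f}x dense)")
```

Stored capacity grows with the number of experts, while per-token compute grows only with the number of experts selected.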

The math.

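The router assigns a probability to each expert via a simple softmax over a learned linear projection:

$$G(x) = \mathrm{softmax}(W_g x + \epsilon)$$

where $x$ is the token embedding and $W_g$ is the gate matrix. The noise term $\epsilon$ (typically Gaussian) helps exploration during training and prevents collapse to a few dominant experts.

The layer output is a weighted sum over only the top-$k$ selected experts (here $k = 2$ out of $K$ experts in total):

$$y = \sum_{i \in \mathrm{TopK}(G(x),\, k)} \tilde{g}_i \, E_i(x), \qquad \tilde{g}_i = \frac{G(x)_i}{\sum_{j \in \mathrm{TopK}(G(x),\, k)} G(x)_j}$$

where $E_i(x)$ is expert $i$’s feedforward computation and $\tilde{g}_i$ is its re-normalized routing weight.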
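One consequence of the re-normalization is worth spelling out. The notation here is assumed for illustration: $h = W_g x$ for the gate logits (noise omitted), $G(x)_i$ for the softmax probability of expert $i$ among $K$ experts, and $\mathcal{T}$ for the set of selected experts.

```latex
\tilde{g}_i
  = \frac{G(x)_i}{\sum_{j \in \mathcal{T}} G(x)_j}
  = \frac{e^{h_i} \big/ \sum_{m=1}^{K} e^{h_m}}
         {\sum_{j \in \mathcal{T}} e^{h_j} \big/ \sum_{m=1}^{K} e^{h_m}}
  = \frac{e^{h_i}}{\sum_{j \in \mathcal{T}} e^{h_j}},
  \qquad i \in \mathcal{T}.
```

The full-softmax denominator cancels, so re-normalizing is the same as taking a softmax over only the selected logits. With top-2 routing between experts $a$ and $b$, this reduces to $\tilde{g}_a = 1 / (1 + e^{-(h_a - h_b)})$: the split depends only on the gap between the two winning logits, not on how the losing experts scored.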

Walkthrough with real numbers.

Suppose K=4 experts, top-2 routing, token embedding dimension 4:

Token: "enzyme"
Embedding x = [0.4, 0.7, 0.2, 0.5]

Gate matrix W_g (4×4, simplified):
  Expert 1: [0.6, 0.1, 0.3, 0.2]
  Expert 2: [0.1, 0.5, 0.2, 0.6]
  Expert 3: [0.3, 0.4, 0.6, 0.1]
  Expert 4: [0.2, 0.3, 0.1, 0.5]

Step 1: Raw gate scores = x · W_g (dot each row with x)
  Expert 1: 0.4×0.6 + 0.7×0.1 + 0.2×0.3 + 0.5×0.2 = 0.24+0.07+0.06+0.10 = 0.47
  Expert 2: 0.4×0.1 + 0.7×0.5 + 0.2×0.2 + 0.5×0.6 = 0.04+0.35+0.04+0.30 = 0.73
  Expert 3: 0.4×0.3 + 0.7×0.4 + 0.2×0.6 + 0.5×0.1 = 0.12+0.28+0.12+0.05 = 0.57
  Expert 4: 0.4×0.2 + 0.7×0.3 + 0.2×0.1 + 0.5×0.5 = 0.08+0.21+0.02+0.25 = 0.56

Step 2: Softmax → [0.22, 0.29, 0.25, 0.24]

Step 3: Top-2 → Expert 2 (0.29), Expert 3 (0.25)
  Expert 1 and Expert 4: idle.

Step 4: Re-normalize selected (using unrounded probabilities) → Expert 2: 0.540, Expert 3: 0.460

Step 5: Output = 0.540 × E2("enzyme") + 0.460 × E3("enzyme")
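The same five steps can be recomputed in a few lines of Python, which is a handy way to check the arithmetic. This is a sketch under stated assumptions: `route_top_k` is an illustrative name, the noise term is left out, and the experts are toy stand-ins (expert i just scales its input) rather than real feedforward networks.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(x, gate_rows, k=2):
    """Return (logits, probs, [(expert_index, renormalized_weight), ...])."""
    logits = [sum(xi * wi for xi, wi in zip(x, row)) for row in gate_rows]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return logits, probs, [(i, probs[i] / mass) for i in top]


x = [0.4, 0.7, 0.2, 0.5]  # embedding for "enzyme"
W_g = [
    [0.6, 0.1, 0.3, 0.2],  # Expert 1 (index 0)
    [0.1, 0.5, 0.2, 0.6],  # Expert 2 (index 1)
    [0.3, 0.4, 0.6, 0.1],  # Expert 3 (index 2)
    [0.2, 0.3, 0.1, 0.5],  # Expert 4 (index 3)
]

logits, probs, routing = route_top_k(x, W_g, k=2)

# Toy stand-in experts (hypothetical): expert i multiplies its input by i + 1.
experts = [lambda v, s=i + 1: [s * t for t in v] for i in range(4)]

output = [0.0] * len(x)
for i, weight in routing:  # only the selected experts are ever called
    output = [o + weight * e for o, e in zip(output, experts[i](x))]

print("gate scores:", [round(s, 2) for s in logits])
print("softmax:    ", [round(p, 2) for p in probs])
print("routing:    ", [(f"Expert {i + 1}", round(w, 3)) for i, w in routing])
print("output:     ", [round(o, 3) for o in output])
```

Note that the indices in the code are zero-based, so index 1 is “Expert 2” in the walkthrough.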

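In this idealized picture, “enzyme” would come to route consistently to experts that handle biology and chemistry over training, and a legal term would route elsewhere. (In practice, learned experts often specialize along less human-readable lines than clean subject areas.) The model learns which specialists to call without being explicitly told: the routing is learned end-to-end from next-token prediction.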

What’s clever — find the instinct.

Why did long context actually work here when it hadn’t before?

The naive answer is “better architecture.” That’s not wrong, but it misses the key insight. Researchers had built models with technically long context windows for years. The problem wasn’t that the attention mechanism couldn’t reach distant tokens — it was that the model had no reason to learn to use those distant tokens. Training sequences were short. Long-distance dependencies were rare. The model found it easier to ignore anything far away.

The report notes “continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k).” That 10M-token number is the tell. It’s not a hard limit they hit — it’s as far as they tested, and performance was still improving.

The Kalamang demonstration makes this concrete:

“when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content”

Kalamang has almost no digital presence, so there’s nothing to memorize from pretraining. What the model is doing is loading a grammar book of roughly 500 pages plus a vocabulary reference (around 250k tokens in total) into its context, forming an internal model of the grammar rules and morphology, and applying it to novel sentences. That’s not retrieval. That’s comprehension. The model is reading the reference book and actually learning from it, in context, in real time.

The MoE architecture plausibly matters here too, in a non-obvious way. Dense models store all knowledge in a single shared set of weights. MoE models can, in principle, specialize different experts for different kinds of input: some for scientific vocabulary, some for code, some for multilingual patterns. When you give the model a new grammar book in context, the multilingually specialized experts can activate and handle it without disrupting the model’s other capabilities.

Does it actually work? What breaks?

| Task | Gemini 1.5 Pro | Comparison |
|---|---|---|
| Needle-in-haystack retrieval (1M tokens) | >99% | GPT-4 Turbo: ~70% at 128k, Claude 3: ~70% at 200k |
| Long-document QA | SOTA (March 2024) | Previous best required chunked RAG pipelines |
| 3-hour video QA | New capability | No comparable model existed |

As the paper states directly, Gemini 1.5 models “match or surpass Gemini 1.0 Ultra’s state-of-the-art performance across a broad set of benchmarks” while being significantly more efficient to deploy. The later addition of Gemini 1.5 Flash, which the report describes as a more lightweight variant designed for efficiency, brought “minimal regression in quality” at much lower serving cost.

What doesn’t work.

Retrieval and reasoning are different problems. Gemini 1.5 can find the exact sentence in a 1M-token document with near-perfect recall. But synthesizing conclusions across 500 dispersed passages — drawing inferences that require connecting far-apart facts — is a harder problem and less reliably solved. The paper evaluates recall tasks more thoroughly than complex multi-hop reasoning.

MoE models have a load-balancing problem. If the router learns to always send tokens to the same two popular experts, the others never get trained and the model wastes capacity. Gemini 1.5 almost certainly uses auxiliary load-balancing losses during training, but the paper doesn’t expose these implementation details.
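For a sense of what such a loss looks like, here is a sketch of one published form, the auxiliary loss from the Switch Transformer paper. Whether Gemini uses this or something else is not stated in the report, so treat it purely as an example of the technique. For simplicity it counts only each token’s top choice, and the two toy batches are made up.

```python
def load_balance_loss(router_probs):
    """Switch-Transformer-style auxiliary loss for one batch of tokens.

    router_probs: one probability list (over experts) per token.
    Returns n_experts * sum_i f_i * P_i, where f_i is the fraction of tokens
    whose top choice is expert i and P_i is the mean router probability of i.
    """
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    f = [0.0] * n_experts
    P = [0.0] * n_experts
    for probs in router_probs:
        top = max(range(n_experts), key=lambda i: probs[i])
        f[top] += 1.0 / n_tokens
        for i, p in enumerate(probs):
            P[i] += p / n_tokens
    return n_experts * sum(fi * pi for fi, pi in zip(f, P))


# Four tokens, four experts. In the first batch every expert gets one token;
# in the second the router sends everything to expert 0.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
collapsed = [[0.7, 0.1, 0.1, 0.1] for _ in range(4)]

loss_balanced = load_balance_loss(balanced)
loss_collapsed = load_balance_loss(collapsed)
print(f"balanced routing:  {loss_balanced:.2f}")
print(f"collapsed routing: {loss_collapsed:.2f}")
```

The collapsed batch scores higher, so adding a small multiple of this term to the training loss nudges the router toward spreading tokens across experts.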

Context cost at inference time is real. A 1M-token prompt means a massive KV cache. Even with ring attention, serving a million-token query requires substantial memory bandwidth across many chips. This is manageable at the Pro tier but is why Flash exists for the common case.
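A quick estimate shows why. The model shape below is hypothetical (layer count, number of key-value heads, head size, and 16-bit storage are all assumptions, not Gemini’s published configuration); the point is only how the cache scales with context length.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence across all layers."""
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_value


# Hypothetical shape: 48 layers, 8 KV heads (grouped-query attention),
# 128-dim heads, 2 bytes per value. NOT Gemini's actual configuration.
for n_tokens in (128_000, 1_000_000, 10_000_000):
    size = kv_cache_bytes(n_tokens, n_layers=48, n_kv_heads=8, head_dim=128)
    print(f"{n_tokens:>10,} tokens -> {size / 2**30:,.1f} GiB of KV cache")
```

The cache grows linearly with context length, so under these assumptions a million-token prompt already exceeds the memory of any single accelerator, and the cache has to be sharded across chips.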

So what?

If you’re building systems that process large documents — legal discovery, code repositories, scientific literature, long conversations — Gemini 1.5 changes the design decision. Before it, long-context retrieval meant RAG: chunk the input, embed it, retrieve the relevant pieces, and pass those pieces to a short-context model. That worked, but it required: a chunking strategy, an embedding model, a vector store, a retrieval step, and careful prompt construction to avoid context gaps.

Now there’s a genuine alternative: load the whole thing. Whether RAG or long-context is the right choice is now an actual engineering tradeoff — latency, cost per query, recall precision, whether you need to synthesize across sources or just retrieve them — rather than a forced architectural decision.
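The two designs can be put side by side in a toy sketch. Everything here is a stand-in: word overlap substitutes for a real embedding model and vector store, the function names are made up, and neither prompt is actually sent to a model. It only shows what each approach hands to the model.

```python
def chunk(text, size):
    """Split text into consecutive chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def overlap_score(query, passage):
    """Crude stand-in for embedding similarity: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def rag_prompt(document, question, chunk_size=50, top_k=2):
    """RAG-style: keep only the top_k chunks that best match the question."""
    chunks = chunk(document, chunk_size)
    ranked = sorted(chunks, key=lambda c: overlap_score(question, c), reverse=True)
    return "\n\n".join(ranked[:top_k]) + "\n\nQuestion: " + question


def long_context_prompt(document, question):
    """Long-context style: hand over the whole document."""
    return document + "\n\nQuestion: " + question


filler = "The committee met on Tuesday to review the annual budget."
needle = "The best coffee shop in Chicago is the Blue Whale Cafe."
document = " ".join([filler] * 100 + [needle] + [filler] * 100)
question = "What is the best coffee shop in Chicago?"

rag = rag_prompt(document, question)
full = long_context_prompt(document, question)
print(f"RAG prompt:          {len(rag.split()):>5} words")
print(f"long-context prompt: {len(full.split()):>5} words")
```

The RAG prompt is far smaller and cheaper, but it is only as good as the retrieval step: if the scorer ranks the wrong chunks, the answer never reaches the model. The long-context prompt cannot miss the passage, at the price of processing every token.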

The bigger signal is what this says about in-context learning. GPT-3 showed that a few examples in a prompt could teach a model a new task. The Kalamang result suggests this extends far further: a complete learning resource in context can teach a genuinely new skill. The boundary between “what the model knows from training” and “what it can learn from context” is much further out than the field assumed. Combining that with a mixture-of-experts architecture (more specialized capacity at the same compute cost) is what made it practical.

Long-context MoE is the architecture that makes a language model behave like a collaborator who can actually read the whole file.

Citation

arXiv:2403.05530

Gemini Team, Google. (2024). Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint. https://arxiv.org/abs/2403.05530