What It Is
Long context refers to a language model’s ability to process and reliably reason over very long input sequences — typically hundreds of thousands to millions of tokens. A model has genuine long-context capability when its recall quality stays high across the full window, not just near the start and end.
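One way to make "recall across the full window" concrete is a needle-in-a-haystack sweep: place a known fact at different relative depths of a long prompt and check whether the model can still retrieve it. The sketch below is only illustrative. The names `build_haystack`, `recall_by_depth`, and `edge_only_model` are hypothetical, and the "model" is a stand-in that reads only the first and last 200 characters, mimicking a model that is reliable only near the edges of its window.

```python
from typing import Callable, Dict, List


def build_haystack(filler: List[str], needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    position = round(depth * len(filler))
    return " ".join(filler[:position] + [needle] + filler[position:])


def recall_by_depth(
    answer_fn: Callable[[str], str],
    filler: List[str],
    needle: str,
    expected: str,
    depths: List[float],
) -> Dict[float, bool]:
    """For each depth, record whether the answer contains the expected string."""
    return {
        d: expected in answer_fn(build_haystack(filler, needle, d))
        for d in depths
    }


def edge_only_model(prompt: str) -> str:
    """Hypothetical stand-in model: sees only the first and last 200 characters."""
    return prompt[:200] + prompt[-200:]


filler = [f"Filler sentence number {i}." for i in range(200)]
needle = "The secret code is 7421."
results = recall_by_depth(edge_only_model, filler, needle, "7421", [0.0, 0.5, 1.0])
print(results)
```

With a real model, `answer_fn` would wrap an actual inference call, and the sweep would cover more depths and several context lengths. A model with genuine long-context capability should succeed at every depth, not only at 0.0 and 1.0.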
Why It Matters
Many real-world tasks require more context than a 4k or 8k window allows: full codebases, legal documents, research papers, multi-hour conversations, entire books. Long-context models can take the whole input at once instead of depending on an external retrieval pipeline to select relevant chunks.
How It Works
Vanilla self-attention costs O(n²) time and memory in sequence length n, which makes million-token contexts impractical on a single device. Approaches to extending context include: (1) efficient exact-attention implementations, such as FlashAttention, which tiles the computation to avoid materializing the full n × n score matrix, and ring attention, which shards the sequence across devices; (2) sparse attention patterns that avoid full pairwise comparison; (3) architectural changes like mixture-of-experts (MoE) that reduce per-token compute so more tokens can be processed in the same FLOP budget. Training regime matters as much as architecture: models must be trained on genuinely long sequences with dependencies that span the full window; otherwise they learn to ignore distant context even when technically able to attend to it.
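The quadratic cost is easiest to appreciate with back-of-the-envelope arithmetic. The following sketch estimates the memory needed to materialize a single n × n score matrix and compares the number of query-key pairs scored by full causal attention against a causal sliding-window pattern, one simple example of sparse attention. The function names and the 2-bytes-per-score default (fp16/bf16) are assumptions chosen for illustration, and the figures are per head, per layer.

```python
def attention_score_bytes(n_tokens: int, bytes_per_score: int = 2) -> int:
    """Bytes to materialize one n x n score matrix (one head, one layer)."""
    return n_tokens * n_tokens * bytes_per_score


def causal_pairs(n_tokens: int) -> int:
    """(query, key) pairs scored by full causal attention: n(n+1)/2."""
    return n_tokens * (n_tokens + 1) // 2


def sliding_window_pairs(n_tokens: int, window: int) -> int:
    """(query, key) pairs scored when each query sees at most `window` keys.

    Query i attends to keys max(0, i - window + 1) .. i.
    """
    if n_tokens <= window:
        return causal_pairs(n_tokens)
    # The first `window` queries see a growing prefix; the rest see `window` keys.
    return causal_pairs(window) + (n_tokens - window) * window


if __name__ == "__main__":
    for n in (8_192, 131_072, 1_048_576):
        gib = attention_score_bytes(n) / 2**30
        ratio = causal_pairs(n) / sliding_window_pairs(n, window=4_096)
        print(f"{n:>9} tokens: {gib:>8,.3f} GiB of scores, "
              f"full/windowed pair ratio {ratio:,.1f}x")
```

At 1,048,576 tokens the score matrix alone would take 2 TiB per head per layer under these assumptions, which is why practical systems either never materialize it (tiled kernels), split it across devices, or skip most of the pairs.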
Key Sources
- attention-is-all-you-need — introduces the O(n²) attention cost that makes long context expensive
- gemini-1-5-multimodal-long-context — sparse-MoE model with a 1M-token context window; reports near-perfect needle-in-a-haystack recall up to 10M tokens
- alibi-train-short-test-long — ALiBi’s recency bias enables inference at 2–3× training length; train on 1K tokens, deploy at 2K
- self-rag-learning-to-retrieve-generate-critique — adaptive retrieval as an alternative to long context: retrieves only when needed, filtering irrelevant passages before they enter the window
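The recency bias mentioned in the ALiBi entry can be sketched in a few lines. ALiBi adds a head-specific linear penalty to each attention score, proportional to the distance between query and key, instead of using learned or sinusoidal position embeddings. The helper names below are hypothetical, the slope schedule assumes the number of heads is a power of two, and the causal mask is folded into the bias for brevity.

```python
from typing import List


def alibi_slopes(n_heads: int) -> List[float]:
    """Geometric slope schedule: 2^(-8/n), 2^(-16/n), ..., 2^(-8).

    Assumes n_heads is a power of two.
    """
    start = 2 ** (-8 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]


def alibi_bias(slope: float, seq_len: int) -> List[List[float]]:
    """Bias added to attention scores for one head.

    Entry [i][j] is -slope * (i - j) for keys at or before the query,
    and -inf for future keys (causal mask).
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
print(slopes[0], slopes[-1])      # steepest and shallowest head
print(alibi_bias(slopes[0], 4))   # penalty grows with distance
```

Because the penalty depends only on relative distance, the same formula applies unchanged to positions beyond the training length, which is what allows inference on longer sequences than the model was trained on.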