What It Is

Long context refers to a language model’s ability to process and reliably reason over very long input sequences — typically hundreds of thousands to millions of tokens. A model has genuine long-context capability when its recall quality stays high across the full window, not just near the start and end.
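One way to make "recall quality across the full window" concrete is a needle-in-a-haystack probe: plant a known fact at different relative depths in filler text and check whether the model can retrieve it. The sketch below is a minimal illustration under stated assumptions, not a standard benchmark. The names `build_haystack`, `recall_by_depth`, and the `ask_model` callable are hypothetical, and substring matching is a deliberately crude scoring rule.

```python
from typing import Callable


def build_haystack(filler: list[str], needle: str, depth: float) -> str:
    """Place `needle` among `filler` sentences at a relative depth in [0, 1]."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    idx = round(depth * len(filler))
    return " ".join(filler[:idx] + [needle] + filler[idx:])


def recall_by_depth(
    ask_model: Callable[[str], str],  # hypothetical: maps a prompt to a reply
    filler: list[str],
    needle: str,
    question: str,
    answer: str,
    depths: list[float],
) -> dict[float, bool]:
    """Record, for each depth, whether the reply contains the expected answer."""
    results: dict[float, bool] = {}
    for depth in depths:
        prompt = build_haystack(filler, needle, depth) + "\n\n" + question
        reply = ask_model(prompt)
        # Crude scoring (assumption): case-insensitive substring match.
        results[depth] = answer.lower() in reply.lower()
    return results
```

A model with uniform recall would return `True` at every depth; failures clustered at middle depths indicate the start-and-end bias described above.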

Why It Matters

Many real-world tasks need more context than a 4k or 8k window allows: full codebases, legal documents, research papers, multi-hour conversations, entire books. Long-context models can take the whole input at once rather than relying on an external retrieval pipeline to select relevant chunks.

How It Works

Vanilla self-attention costs O(n²) compute in sequence length n, and a naive implementation also materializes an n × n score matrix, which makes million-token contexts impractical on a single device. Approaches to extending context include: (1) efficient exact-attention implementations, such as FlashAttention, which tiles the computation so the full score matrix is never stored, and ring attention, which shards the sequence across devices and passes key/value blocks between them; (2) sparse attention patterns that avoid full pairwise comparison; (3) architectural changes like mixture-of-experts (MoE) that reduce per-token compute relative to a dense model of the same parameter count, so more tokens can be processed in the same FLOP budget, although MoE does not change the cost of attention itself. Training regime matters as much as architecture: models must be trained on genuinely long sequences with dependencies that span the full window, otherwise they learn to ignore distant context even when technically able to attend to it.
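The quadratic cost can be made concrete with some back-of-the-envelope arithmetic. The sketch below counts the query-key pairs scored under causal full attention and under a sliding window, which is used here only as one illustrative sparse pattern. The function names are hypothetical, and the 2-byte element size assumes fp16 scores for a single head.

```python
def full_attention_pairs(n: int) -> int:
    """Query-key pairs scored by causal full attention over n tokens."""
    return n * (n + 1) // 2


def sliding_window_pairs(n: int, window: int) -> int:
    """Pairs scored when each token attends to itself and up to window - 1 earlier tokens."""
    w = min(window, n)
    # The first w tokens see 1, 2, ..., w keys; the remaining n - w tokens see w keys each.
    return w * (w + 1) // 2 + (n - w) * w


def score_matrix_bytes(n: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to materialize one n x n score matrix (one head, fp16 assumed)."""
    return n * n * bytes_per_element


n = 1_000_000
print(f"full attention pairs:    {full_attention_pairs(n):,}")
print(f"window=4096 pairs:       {sliding_window_pairs(n, 4096):,}")
print(f"naive score matrix size: {score_matrix_bytes(n) / 1e12:.1f} TB per head")
```

The window size of 4096 is an arbitrary choice for illustration. The point is the scaling: full attention grows with n², while a fixed window grows linearly in n, at the price of losing direct access to tokens outside the window.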

Key Sources