What It Is

Bidirectional context means a model’s representation of any token incorporates information from both the tokens to its left and the tokens to its right. In a bidirectional model, every token can attend to every other token in the sequence simultaneously, at every layer.

Why It Matters

Natural language understanding is inherently bidirectional. The meaning of a word depends on what comes before and after it. “I made her duck” is ambiguous — cooked a duck for her, or caused her to lower her head — until you see the full sentence context in both directions. Unidirectional models (like GPT) read left-to-right: when processing “duck”, they only have “I made her” as context. A bidirectional model also has access to what follows, enabling much richer disambiguation. BERT showed that deep bidirectionality — full bidirectional attention at every layer of a deep network — produces substantially better representations than shallow bidirectionality (as in ELMo, which trains separate left-to-right and right-to-left models and only concatenates their representations at the output layer).

How It Works

Bidirectional context is achieved in Transformer encoders by removing the causal attention mask. Standard Transformer decoders (used in autoregressive language models) apply a mask that prevents each token from attending to future positions. Removing this mask allows unrestricted self-attention over the full sequence: the token at position i attends to all positions j, regardless of whether j > i or j < i.
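The difference can be made concrete in a few lines. Below is a minimal NumPy sketch of scaled dot-product attention where the only change between the two regimes is whether the causal mask is applied (toy dimensions, and Q = K = V for brevity — these are illustrative simplifications, not BERT's actual implementation):

```python
import numpy as np

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention; causal=True blocks future positions."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (T, T) attention logits
    if causal:
        T = scores.shape[0]
        # Position i may only attend to positions j <= i.
        allowed = np.tril(np.ones((T, T), dtype=bool))
        scores = np.where(allowed, scores, -np.inf)
    # Softmax over each row; exp(-inf) = 0 zeroes out masked positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                        # 4 tokens, dim 8
_, w_bi = attention(X, X, X, causal=False)         # every entry nonzero
_, w_causal = attention(X, X, X, causal=True)      # upper triangle is zero
```

In the bidirectional case every row of the attention matrix is dense — each token mixes information from the whole sequence — while the causal case forces a lower-triangular pattern.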

The cost of full bidirectionality is that standard next-token prediction becomes trivial (the model can see the answer). BERT resolves this by switching to masked language modeling: replace a random subset of tokens with a placeholder, then predict the originals. A masked position carries no information about its own token — the model sees only the surrounding context — so the task remains non-trivial despite bidirectional attention.
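The masking step can be sketched as follows. This shows only the basic mask-and-predict idea: BERT's actual recipe additionally replaces selected tokens with random tokens or leaves them unchanged some of the time, which is omitted here. The token ids and mask_id are made-up toy values, and -100 is used as the conventional "ignore this position in the loss" label:

```python
import numpy as np

def mlm_mask(token_ids, mask_id, rng, p=0.15):
    """Prepare inputs and labels for masked language modeling.

    Inputs have roughly a fraction p of positions replaced by mask_id;
    labels keep the original id at masked positions and -100 (ignore)
    everywhere else, so the loss is computed only where the model must
    reconstruct a hidden token from bidirectional context.
    """
    token_ids = np.asarray(token_ids)
    mask = rng.random(token_ids.shape) < p
    if not mask.any():                    # guarantee at least one target
        mask[rng.integers(token_ids.size)] = True
    inputs = np.where(mask, mask_id, token_ids)
    labels = np.where(mask, token_ids, -100)
    return inputs, labels

rng = np.random.default_rng(0)
ids = np.array([7, 42, 13, 99, 5, 21])    # toy token ids
inputs, labels = mlm_mask(ids, mask_id=103, rng=rng)
```

The model then runs full bidirectional attention over `inputs` and is trained to predict the original ids at exactly the positions where `labels != -100`.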

Key Sources