What It Is

A special learnable vector prepended to the input sequence in transformer models — used as a dedicated slot to accumulate a global representation of the entire sequence. It has no semantic content at initialization; its purpose is to collect information from all other tokens through attention across every Transformer layer. At the final layer, only the CLS token’s state is passed to the classification head. Originally introduced in BERT (Devlin et al., 2018) for sentence-level tasks; adopted by ViT for image classification.

Why It Matters

Without the CLS token, getting a fixed-size sequence-level representation from a Transformer requires aggregating across all output token states — typically by mean pooling or max pooling. The CLS token gives the model a learned alternative: it can use attention to weight which tokens contribute to the summary, rather than treating all tokens equally. Empirically, the CLS token approach often outperforms mean pooling for classification, while mean pooling is more robust for retrieval and similarity tasks where all tokens should contribute equally.

How It Works

Architecture

Input sequence:    [word₁] [word₂] [word₃] ... [wordₙ]
After prepend:     [CLS]   [word₁] [word₂] ... [wordₙ]
After embedding:   [v_cls] [v₁]    [v₂]    ... [vₙ]    → into Transformer

Layer 1:
  CLS attends to word₁, word₂, ..., wordₙ → updates v_cls
  word₁ attends to CLS, word₂, ... → updates v₁
  (all positions attend to all others including CLS)

Layer 2:
  CLS now contains info from Layer 1 representations → attends again
  Iterative refinement: each layer gives CLS another pass over the sequence

Final layer:
  v_cls = final CLS state → classification head (linear layer)
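
The flow above can be sketched with a minimal single-head self-attention pass in NumPy (dimensions, the 3-class head, and the random initialization are illustrative assumptions, not from any specific model):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # embedding dimension (illustrative)
words = rng.normal(size=(5, d))         # 5 word embeddings
cls = rng.normal(size=(1, d))           # learnable CLS vector (randomly initialized here)

x = np.concatenate([cls, words])        # prepend: sequence length 6, CLS at row 0

# one self-attention pass: every position attends to every other, CLS included
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v
scores = q @ k.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)       # softmax over the sequence
out = attn @ v

v_cls = out[0]                          # updated CLS state: a weighted mix of all tokens
logits = v_cls @ rng.normal(size=(d, 3))       # classification head (3 classes)
print(logits.shape)                     # (3,)
```

Real Transformer layers add multi-head attention, MLPs, residuals, and layer norm, but the CLS mechanics are exactly this: row 0 of the attention output is a learned weighted combination of every token.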

Why Prepend, Not Append

The CLS token is placed at position 0 (prepended) rather than at the end. In BERT (bidirectional encoder), position doesn’t affect information access — every token attends to every other — so prepending vs. appending is equivalent. But the prepend convention persists because:

  1. It was the original BERT design choice.
  2. In causal (decoder-only) models, a token only sees tokens before it, so a summary token prepended at position 0 would be useless: it could not attend to any of the tokens that follow it. This is why decoder-only models don’t use CLS tokens; they typically read a sequence summary from the last token’s state instead.
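
The causal-mask argument in point 2 can be checked directly: under a standard lower-triangular mask, position 0 may attend only to itself, while the last position sees the whole sequence (toy example, not tied to any framework):

```python
import numpy as np

n = 5
# causal mask: position i may attend to positions j <= i
mask = np.tril(np.ones((n, n), dtype=bool))

print(mask[0])    # position 0 sees only itself
print(mask[-1])   # the last position sees the entire sequence
```

So in a causal model, a summary slot would have to sit at the end of the sequence, not the beginning.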

In ViT

224×224 image → 196 patch embeddings [p₁, p₂, ..., p₁₉₆]
Prepend CLS:   [CLS, p₁, p₂, ..., p₁₉₆]  → 197 tokens

After 12 Transformer layers:
  CLS state = global image summary
  → linear head → 1000-class softmax
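
The 196/197 token counts follow from ViT-Base’s 16×16 patch size; the sketch below is just that arithmetic:

```python
image_size, patch_size = 224, 16
patches_per_side = image_size // patch_size   # 224 / 16 = 14
num_patches = patches_per_side ** 2           # 14 * 14 = 196 patch tokens
seq_len = num_patches + 1                     # +1 for the CLS token -> 197
print(num_patches, seq_len)                   # 196 197
```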

The patch tokens collectively represent local image regions; the CLS token learns to aggregate whatever the classification task needs. For ImageNet classification, it learns to focus on discriminative regions. For segmentation, it’s less useful — dense prediction tasks need all 196 patch token outputs, not just the CLS token.

CLS vs. Global Average Pooling (GAP)

Method                  What it computes                             Task fit
CLS token               Learned weighted combination via attention   Classification (discriminative)
Global Average Pooling  Unweighted average of all token outputs      Retrieval, similarity
Max pooling             Max activation per dimension across tokens   Feature detection
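
Given a matrix of final-layer token outputs with the CLS state in row 0, the three methods in the table each reduce to one line (illustrative NumPy; the shape and the choice to exclude CLS from the pooled variants are assumptions, and conventions differ on the latter):

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.normal(size=(197, 8))    # token outputs: CLS at row 0, 196 patch tokens after

cls_repr  = H[0]                 # CLS token: already an attention-weighted summary
mean_repr = H[1:].mean(axis=0)   # global average pooling over the non-CLS tokens
max_repr  = H[1:].max(axis=0)    # max pooling: strongest activation per dimension

print(cls_repr.shape, mean_repr.shape, max_repr.shape)   # (8,) (8,) (8,)
```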

For sentence similarity (SBERT), mean pooling of BERT’s token outputs outperforms the CLS token despite BERT being trained with CLS for classification. The CLS token is optimized for classification loss — it may ignore information that’s irrelevant for the training task but relevant for similarity.

Attention Pattern of the CLS Token

Interpretability research finds that CLS attention patterns are diffuse at early layers (the CLS token attends broadly to gather information) and become concentrated at later layers (focused on the most task-relevant tokens). This is consistent with the CLS token serving as a “query” that progressively refines what to attend to. However, attention weights are a noisy proxy for information flow — the CLS token may get information through indirect paths (CLS → token A → token B) that don’t show up in direct attention weights.

What’s Clever

The CLS token is an elegant solution to the “sequence pooling” problem. Global average pooling aggregates all tokens equally, treating “the” (high frequency, low information) the same as “revolutionary” (low frequency, high information). Max pooling selects each dimension’s maximum from potentially different tokens, which makes the pooled vector hard to interpret. The CLS token lets the model learn which tokens matter for the task through attention, trained end-to-end with the classification objective.

The non-obvious subtlety: the CLS token enters the network with no input-dependent content; it is the same learned vector for every input. Only after the first layer’s attention pass does it begin to carry information about the sequence, and by layer N it has iteratively gathered and refined a global summary. This means deeper models give the CLS token more refinement passes, which is part of why deeper Transformers tend to produce better classification representations from the CLS token.
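
That per-layer refinement can be simulated by applying a toy attention update repeatedly (a NumPy sketch with made-up dimensions; real layers also have multi-head attention, MLPs, and layer norm):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_layers = 8, 4
x = rng.normal(size=(6, d))              # row 0 = CLS, rows 1-5 = content tokens

for _ in range(n_layers):                # each layer gives CLS one more pass over the sequence
    W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    s = q @ k.T / np.sqrt(d)
    a = np.exp(s - s.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)   # softmax attention weights
    x = x + a @ v                        # residual update: CLS accumulates across layers

v_cls = x[0]                             # after n_layers passes of gathering and refining
```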

Key Sources

  • vision-transformer — uses CLS token for image classification
  • attention — the mechanism through which CLS token gathers information
  • patch-embeddings — the patch tokens that the CLS token attends over in ViT
  • transfer-learning — CLS token representations transfer to downstream tasks; mean pooling sometimes transfers better

Open Questions

  • When does CLS outperform mean pooling, and when does it underperform? (Task type, training objective, and model depth all seem to matter)
  • Can learnable pooling strategies (cross-attention over a small set of learned query vectors) outperform the single CLS token?
  • In models with many CLS tokens (multiple [CLS] for multi-label classification), do different CLS tokens specialize?