Concepts: masked-language-model | vision-transformer | self-supervised-learning | patch-embeddings | pre-training
Builds on: attention-is-all-you-need | simclr-contrastive-learning-visual-representations
Leads to: dino-self-supervised-vision-transformers | dinov2-learning-robust-visual-features

Training a vision model with labels is expensive. ImageNet has 1.2 million labeled images — each one hand-annotated by a human. But the internet has billions of images with no labels at all. The question has always been: can we train on the unlabeled ones?

BERT answered this for text in 2018 by masking random words and asking the model to fill them in. Nobody needed to label anything — the text itself is the supervision. MAE (He et al., 2022) does the same for images, and the result is a ViT-Huge model that hits 87.8% top-1 on ImageNet — trained entirely without labels during pre-training, using only the task of predicting masked patches.

The core idea

The analogy: Imagine someone removes 75% of a jigsaw puzzle’s pieces and asks you to reconstruct the missing sections. If only 15% of pieces were removed, you could guess the missing ones just by looking at the neighboring edges — it’s too easy. But at 75% gone, you actually have to understand what the image depicts. You can’t fake it with local interpolation. You have to know that the partial face you’re seeing is a face, and reason about where the eyes and nose should be.

MAE works exactly like this. It masks 75% of an image’s patches at random and trains a model to reconstruct the missing pixels. The high masking ratio is the key design choice — it prevents the model from cheating by interpolating from nearby patches. The model has to build real semantic understanding to fill in 147 out of 196 patches.

But there’s a second clever insight that makes this practical: the encoder never sees the masked patches. It only processes the visible 25%. That means instead of running an expensive ViT over 196 tokens, the encoder runs over ~49 tokens. That alone gives a 3x training speedup.

Step by step:

  1. Divide the input image into non-overlapping 16×16 patches (a 224×224 image → 196 patches total)
  2. Sample a random 25% of patches (≈49 patches) to keep as visible
  3. Pass only the visible patches through the encoder — a standard ViT with full self-attention
  4. The encoder outputs a latent vector for each visible patch
  5. Pass those latent vectors, together with learnable mask tokens (each given a positional embedding), through a lightweight decoder
  6. The decoder predicts the raw pixel values for every masked patch
  7. Compute mean squared error loss only on the masked patches, normalized per patch
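Steps 1–3 can be sketched in a few lines of NumPy. This is a shapes-only illustration with no learned weights; the per-sample shuffle via argsort of uniform noise is one common way to implement the random masking:

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, patch=16):
    """Split an HxWx3 image into non-overlapping (patch*patch*3)-dim vectors."""
    h, w, c = img.shape
    gh, gw = h // patch, w // patch               # 14 x 14 for a 224x224 image
    x = img.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)                # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)  # (196, 768)

def random_masking(patches, mask_ratio=0.75):
    """Keep a random 25% of patches; return kept patches plus index bookkeeping."""
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))            # 49 of 196
    shuffle = np.argsort(rng.random(n))           # random permutation of patch indices
    keep_idx = shuffle[:n_keep]                   # visible patches (encoder input)
    mask_idx = shuffle[n_keep:]                   # masked patches (147, to reconstruct)
    return patches[keep_idx], keep_idx, mask_idx

img = rng.random((224, 224, 3))
patches = patchify(img)                           # (196, 768)
visible, keep_idx, mask_idx = random_masking(patches)
print(patches.shape, visible.shape, mask_idx.shape)  # (196, 768) (49, 768) (147,)
```

Only `visible` ever reaches the encoder; `mask_idx` is kept so the loss can later be restricted to the masked positions.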
INPUT IMAGE (224×224)
  ↓ split into 14×14 grid of 16×16 patches = 196 patches total
  ↓ randomly mask 75% = 147 patches removed, 49 patches visible

VISIBLE PATCHES (49)            MASK TOKENS (147 learnable vectors)
      ↓                                     ↓
  ┌──────────────────────┐                  │ (skipped by encoder)
  │    ENCODER (ViT-L)   │                  │
  │  self-attention over │                  │
  │  only 49 tokens      │                  │
  └──────────┬───────────┘                  │
             │ latent vectors               │
             ↓                              ↓
  ┌────────────────────────────────────────────────┐
  │       DECODER (lightweight, 8 blocks)          │
  │  input: 49 encoder outputs + 147 mask tokens   │
  │  all 196 positions, positional-embedded         │
  │  output: predicted pixel values for all 196     │
  └───────────────────────┬────────────────────────┘
                          ↓
             MSE loss on masked patches only
             (per-patch normalized pixel targets)

The math:

The reconstruction loss is:

    L = (1/|M|) Σ_{i ∈ M} ‖ŷ_i − y_i‖²

where:

  • M — the set of masked patch indices
  • ŷ_i — the decoder’s predicted pixel values for patch i (768-dim vector for a 16×16 RGB patch)
  • y_i — the normalized ground-truth pixel values for patch i

The key detail: pixels are normalized per patch — subtract each patch’s mean, divide by its std — before computing the loss. This removes the low-frequency color signal and forces the model to learn structure.
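The loss and the per-patch normalization together fit in one small function. A minimal NumPy sketch, assuming `pred` holds decoder outputs for all patch positions (in a real model these come from the decoder; random values stand in here):

```python
import numpy as np

def masked_mse_loss(pred, target_patches, mask_idx, eps=1e-6):
    """MSE against per-patch-normalized pixels, computed on masked patches only.

    pred:           (N, 768) decoder outputs for all patch positions
    target_patches: (N, 768) raw pixel patches
    mask_idx:       indices of the masked patches
    """
    mean = target_patches.mean(axis=1, keepdims=True)  # each patch's own mean
    std = target_patches.std(axis=1, keepdims=True)    # each patch's own std
    target = (target_patches - mean) / (std + eps)     # normalize per patch
    err = (pred - target) ** 2
    return err[mask_idx].mean()                        # average over masked patches only

rng = np.random.default_rng(0)
patches = rng.random((196, 768))
pred = rng.standard_normal((196, 768))
mask_idx = rng.permutation(196)[:147]
print(masked_mse_loss(pred, patches, mask_idx))
```

Note that visible patches contribute nothing to the loss, mirroring BERT's practice of computing loss only on masked tokens.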

Walkthrough with real numbers:

Setup: 224×224 image, 16×16 patches

Patches total:     14 × 14 = 196
Masking ratio:     75% → 147 masked, 49 visible

Encoder sequence length:  49 tokens  (not 196)
Token embedding dim:      1024 (ViT-Large)

Attention cost per layer ∝ n²:
  Full sequence:  196² = 38,416
  MAE encoder:    49²  =  2,401
  Ratio:          ~16x fewer attention ops per layer
  Wall-clock:     ~3x faster training overall

Decoder:
  Input:   49 encoder latents + 147 mask-position tokens = 196 total
  Width:   512 dims (vs encoder's 1024)
  Depth:   8 blocks (vs encoder's 24)
  Params:  ~10% of encoder size

Output per masked patch:  16 × 16 × 3 = 768 values
Loss computed on:         147 masked patches only

Example patch normalization (16×16 patch, simplified to 4 pixels):
  Raw pixels:      [120, 130, 125, 118]
  Mean:            123.25
  Std:             4.66
  Normalized:      [-0.70, +1.45, +0.38, -1.13]
  Decoder targets: the normalized values, not raw pixels
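The counts and the normalization example above can be checked directly (note `np.std` is the population std, i.e. it divides by n rather than n−1):

```python
import numpy as np

# Patch grid and masking counts for a 224x224 image with 16x16 patches
n_patches = (224 // 16) ** 2
n_masked = int(n_patches * 0.75)
n_visible = n_patches - n_masked
print(n_patches, n_masked, n_visible)    # 196 147 49

# Quadratic attention cost: full sequence vs. visible-only encoder
print(196**2 / 49**2)                    # 16.0

# Per-patch normalization on the simplified 4-pixel example
px = np.array([120.0, 130.0, 125.0, 118.0])
mean, std = px.mean(), px.std()
print(round(mean, 2), round(std, 2))     # patch mean and population std
print(np.round((px - mean) / std, 2))    # normalized decoder targets
```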

What’s clever — find the instinct:

The instinct starts with a simple observation: images are far more spatially redundant than text. In a sentence, masking “The cat __ on the mat” genuinely requires semantic reasoning to guess “sat.” In an image, if you mask 15% of pixels, a model can look left, right, up, and down to interpolate what’s missing. No semantic understanding needed.

“In language, the information density is high and a mask can be used to capture it efficiently. In vision, spatial redundancy is high. Masking a small portion of an image will not create a meaningful challenge.”

So the high masking ratio is not arbitrary — it’s a direct response to the structure of images. 75% removes enough context that local interpolation fails. The model has to understand what’s being depicted.

The second non-obvious move: strip mask tokens from the encoder entirely. Earlier masked image modeling approaches (BEiT, iGPT) ran the full sequence through the encoder — including placeholder tokens for the masked positions. That means the expensive self-attention computation spans all 196 positions, even the 147 the model is supposed to predict.

MAE’s insight: the mask tokens contribute nothing to the encoder. They’re placeholders. Running them through a 24-layer ViT-L is pure waste. So remove them, let the encoder see only the 49 visible patches, then hand everything to a cheap decoder. The encoder’s input sequence is now 4x shorter. The decoder handles the position-aware reconstruction.

“This design greatly reduces computation. In our setting with a high masking ratio, the encoder processes only a small fraction of patches… allowing us to train large models efficiently.”
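The mechanics of that hand-off can be sketched as follows: mask tokens are appended only after encoding, and the sequence is restored to image order via the inverse permutation. Random vectors stand in for real encoder outputs, and a zero vector stands in for the learned mask-token embedding:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 196, 512          # all patch positions, decoder width
n_visible = 49

# Masking bookkeeping: shuffle patch indices, keep the first 49 as visible
shuffle = rng.permutation(n)
restore = np.argsort(shuffle)          # inverse permutation: back to image order

# The encoder only ever runs on the 49 visible tokens...
encoder_out = rng.standard_normal((n_visible, dim))

# ...and the decoder receives the full sequence: latents + a shared mask token
mask_token = np.zeros((1, dim))        # stand-in for a learned embedding
full = np.concatenate([encoder_out, np.repeat(mask_token, n - n_visible, axis=0)])
full = full[restore]                   # unshuffle so position i is patch i again
print(full.shape)                      # (196, 512)
```

After the unshuffle, each visible position carries its encoder latent and each masked position carries the shared mask token; positional embeddings added at this point tell the decoder which patch each token must reconstruct.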

Does it actually work? What breaks?

ImageNet classification:

Model       Pre-training             Fine-tuning top-1   Linear probe
ViT-Base    MAE                      83.1%               67.8%
ViT-Large   MAE                      85.9%               73.5%
ViT-Huge    MAE                      87.8%               77.2%
ViT-Large   MoCo v3 (contrastive)    84.1%               76.7%
ViT-Large   BEiT                     85.2%               —

87.8% with ViT-Huge is best-in-class for methods that use only ImageNet-1K data.

The honest nuance: at linear probe, MAE ViT-L (73.5%) loses to MoCo v3 ViT-L (76.7%). Contrastive methods build better frozen features. MAE builds better features for fine-tuning. In practice, fine-tuning is how most people use pre-trained backbones, so MAE wins in the scenarios that matter most.

Transfer learning (downstream tasks, ViT-Large fine-tuned):

Task                           MAE pre-training   Supervised pre-training   Gain
COCO detection (AP^box)        53.3               51.2                      +2.1
COCO segmentation (AP^mask)    47.2               44.8                      +2.4
ADE20K segmentation (mIoU)     48.1               47.4                      +0.7

Self-supervised pre-training now beats supervised pre-training for dense vision tasks. That’s the headline result beyond the classification number.

Masking ratio ablation:

Ratio   Fine-tuning accuracy
40%     85.3%
60%     85.7%
75%     85.9%
85%     85.7%
90%     85.1%

75% is the sweet spot. Too low: the task is trivially solvable by interpolation. Too high: not enough signal left for meaningful encoding.

What doesn’t work:

Linear probe accuracy lags contrastive methods. If you need a frozen backbone (e.g., for a quick downstream classifier without any fine-tuning), contrastive pre-training is likely better.

Reconstruction of raw pixels is noisier than reconstructing discrete tokens (BEiT’s approach). MAE compensates with per-patch normalization, but the target is inherently pixel-level, not semantic-level. For tasks requiring very coarse semantic features, the gap vs. contrastive methods can be larger.

Practitioner notes

If you’re building a vision system and need to pre-train on unlabeled data: MAE is the default recipe. High masking ratio (75%), asymmetric encoder-decoder, reconstruct normalized pixels. The official checkpoints (facebook/vit-mae-{base,large,huge} on Hugging Face) are production-ready starting points.

If your task is dense prediction (detection, segmentation, depth): MAE pre-training beats supervised ImageNet pre-training on every dense task in the paper. Use it.

If you’re label-constrained: MAE’s fine-tuned features are highly label-efficient. With 1% of ImageNet labels for fine-tuning, MAE outperforms training from scratch with 100% of labels.

If you need frozen features (linear probe, k-NN): consider DINO or MoCo v3 instead. MAE is optimized for the fine-tuning path, not frozen feature quality.

The encoder-only-on-visible-patches design pattern has generalized widely: VideoMAE applies it to video (masking 90% of tubes), Audio-MAE to spectrograms, Point-MAE to point clouds. Wherever you have structured, spatially redundant inputs and abundant unlabeled data, this recipe applies directly.


Citation

arXiv:2111.06377

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR 2022. https://arxiv.org/abs/2111.06377