Concepts: masked-language-model | vision-transformer | self-supervised-learning | patch-embeddings | pre-training
Builds on: attention-is-all-you-need | simclr-contrastive-learning-visual-representations
Leads to: dino-self-supervised-vision-transformers | dinov2-learning-robust-visual-features
Training a vision model with labels is expensive. ImageNet has 1.2 million labeled images — each one hand-annotated by a human. But the internet has billions of images with no labels at all. The question has always been: can we train on the unlabeled ones?
BERT answered this for text in 2018 by masking random words and asking the model to fill them in. Nobody needed to label anything — the text itself is the supervision. MAE (He et al., 2022) does the same for images, and the result is a ViT-Huge model that hits 87.8% top-1 on ImageNet — trained entirely without labels during pre-training, using only the task of predicting masked patches.
The core idea
The analogy: Imagine someone removes 75% of a jigsaw puzzle’s pieces and asks you to reconstruct the missing sections. If only 15% of pieces were removed, you could guess the missing ones just by looking at the neighboring edges — it’s too easy. But at 75% gone, you actually have to understand what the image depicts. You can’t fake it with local interpolation. You have to know that the partial face you’re seeing is a face, and reason about where the eyes and nose should be.
MAE works exactly like this. It masks 75% of an image’s patches at random and trains a model to reconstruct the missing pixels. The high masking ratio is the key design choice — it prevents the model from cheating by interpolating from nearby patches. The model has to build real semantic understanding to fill in 147 out of 196 patches.
But there’s a second clever insight that makes this practical: the encoder never sees the masked patches. It only processes the visible 25%. That means instead of running an expensive ViT over 196 tokens, the encoder runs over just 49. That alone gives a roughly 3x training speedup.
Step by step:
- Divide the input image into non-overlapping 16×16 patches (a 224×224 image → 196 patches total)
- Sample a random 25% of patches (49 patches) to keep as visible
- Pass only the visible patches through the encoder — a standard ViT with full self-attention
- The encoder outputs a latent vector for each visible patch
- Pass those latent vectors, plus a shared learnable mask token placed at each masked position (all with positional embeddings added), through a lightweight decoder
- The decoder predicts the raw pixel values for every masked patch
- Compute mean squared error loss only on the masked patches, normalized per patch
INPUT IMAGE (224×224)
↓ split into 14×14 grid of 16×16 patches = 196 patches total
↓ randomly mask 75% = 147 patches removed, 49 patches visible
VISIBLE PATCHES (49) MASK TOKENS (147 learnable vectors)
↓ ↓
┌──────────────────────┐ │ (skipped by encoder)
│ ENCODER (ViT-L) │ │
│ self-attention over │ │
│ only 49 tokens │ │
└──────────┬───────────┘ │
│ latent vectors │
↓ ↓
┌────────────────────────────────────────────────┐
│ DECODER (lightweight, 8 blocks) │
│ input: 49 encoder outputs + 147 mask tokens │
│ all 196 positions, positional-embedded │
│ output: predicted pixel values for all 196 │
└───────────────────────┬────────────────────────┘
↓
MSE loss on masked patches only
(per-patch normalized pixel targets)
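The whole pipeline above can be sketched in PyTorch. This is a minimal illustration, not the paper’s code: single `nn.TransformerEncoderLayer`s stand in for the full 24-block encoder and 8-block decoder, and positional embeddings and the class token are omitted for brevity.

```python
import torch
import torch.nn as nn

B, N, P = 2, 196, 16 * 16 * 3         # batch, patches per image, pixels per patch
enc_dim, dec_dim = 1024, 512          # ViT-L encoder width, decoder width
n_keep = N // 4                       # 49 visible patches at a 75% mask ratio

patches = torch.randn(B, N, P)        # stand-in for patchified images

# 1) per-sample random shuffle; keep the first 25% as visible
ids_shuffle = torch.rand(B, N).argsort(dim=1)
ids_restore = ids_shuffle.argsort(dim=1)          # inverse permutation
ids_keep = ids_shuffle[:, :n_keep]
visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, P))

# 2) the encoder sees only the 49 visible tokens
patch_embed = nn.Linear(P, enc_dim)
encoder = nn.TransformerEncoderLayer(enc_dim, nhead=16, batch_first=True)
latents = encoder(patch_embed(visible))           # (B, 49, 1024)

# 3) decoder: project latents, append mask tokens, restore original order
to_dec = nn.Linear(enc_dim, dec_dim)
mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
dec_in = torch.cat([to_dec(latents), mask_token.expand(B, N - n_keep, -1)], dim=1)
dec_in = torch.gather(dec_in, 1, ids_restore.unsqueeze(-1).expand(-1, -1, dec_dim))
decoder = nn.TransformerEncoderLayer(dec_dim, nhead=8, batch_first=True)
to_pixels = nn.Linear(dec_dim, P)
pred = to_pixels(decoder(dec_in))                 # (B, 196, 768)

# 4) MSE on masked patches only, against per-patch normalized targets
mean = patches.mean(dim=-1, keepdim=True)
var = patches.var(dim=-1, keepdim=True)
target = (patches - mean) / (var + 1e-6).sqrt()
mask = torch.ones(B, N)
mask[:, :n_keep] = 0                              # 0 = visible (shuffled order)
mask = torch.gather(mask, 1, ids_restore)         # back to original order
loss = (((pred - target) ** 2).mean(dim=-1) * mask).sum() / mask.sum()
```

Note how the expensive encoder call only ever touches a `(B, 49, 1024)` tensor; the masked positions don’t exist until the cheap decoder stage.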
The math:
The reconstruction loss is:

$$\mathcal{L} = \frac{1}{|M|} \sum_{i \in M} \lVert \hat{x}_i - x_i \rVert_2^2$$

where:
- $M$ — the set of masked patch indices
- $\hat{x}_i$ — the decoder’s predicted pixel values for patch $i$ (768-dim vector for a 16×16 RGB patch)
- $x_i$ — the normalized ground-truth pixel values for patch $i$
The key detail: pixels are normalized per patch — subtract each patch’s mean, divide by its std — before computing the loss. This removes the low-frequency color signal and forces the model to learn structure.
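The loss in isolation looks like this in NumPy (a sketch with random stand-ins for predictions and targets; shapes follow the walkthrough below):

```python
import numpy as np

rng = np.random.default_rng(0)
pred = rng.normal(size=(196, 768))       # decoder outputs, one row per patch
patches = rng.normal(size=(196, 768))    # ground-truth pixels per patch
masked = np.zeros(196, dtype=bool)
masked[rng.choice(196, size=147, replace=False)] = True

# per-patch normalization of the targets (mean/std over each patch's 768 pixels)
target = (patches - patches.mean(1, keepdims=True)) / patches.std(1, keepdims=True)

# mean squared error over the 147 masked patches only
loss = ((pred[masked] - target[masked]) ** 2).mean()
```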
Walkthrough with real numbers:
Setup: 224×224 image, 16×16 patches
Patches total: 14 × 14 = 196
Masking ratio: 75% → 147 masked, 49 visible
Encoder sequence length: 49 tokens (not 196)
Token embedding dim: 1024 (ViT-Large)
Attention cost per layer ∝ n²:
Full sequence: 196² = 38,416
MAE encoder: 49² = 2,401
Ratio: ~16x fewer attention ops per layer
Wall-clock: ~3x faster training overall
Decoder:
Input: 49 encoder latents + 147 mask-position tokens = 196 total
Width: 512 dims (vs encoder's 1024)
Depth: 8 blocks (vs encoder's 24)
Params: ~10% of encoder size
Output per masked patch: 16 × 16 × 3 = 768 values
Loss computed on: 147 masked patches only
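The arithmetic above is easy to verify in plain Python:

```python
n_patches = (224 // 16) ** 2           # 14 x 14 = 196
n_visible = n_patches // 4             # 49 visible at a 75% mask ratio
n_masked = n_patches - n_visible       # 147

full_attn = n_patches ** 2             # 38,416 pairwise scores per layer
mae_attn = n_visible ** 2              # 2,401
ratio = full_attn / mae_attn           # 16.0 -> ~16x fewer attention ops

pixels_per_patch = 16 * 16 * 3         # 768 values predicted per masked patch
```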
Example patch normalization (16×16 patch, simplified to 4 pixels):
Raw pixels: [120, 130, 125, 118]
Mean: 123.25
Std: 4.66
Normalized: [-0.70, +1.45, +0.38, -1.13]
Decoder targets: the normalized values, not raw pixels
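Reproducing the 4-pixel example with NumPy (using the population standard deviation, NumPy’s default):

```python
import numpy as np

patch = np.array([120.0, 130.0, 125.0, 118.0])
mean = patch.mean()                    # 123.25
std = patch.std()                      # population std, ~4.66
normalized = (patch - mean) / std      # ~[-0.70, +1.45, +0.38, -1.13]
```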
What’s clever — find the instinct:
The instinct starts with a simple observation: images are far more spatially redundant than text. In a sentence, masking “The cat __ on the mat” genuinely requires semantic reasoning to guess “sat.” In an image, if you mask 15% of pixels, a model can look left, right, up, and down to interpolate what’s missing. No semantic understanding needed.
“In language, the information density is high and a mask can be used to capture it efficiently. In vision, spatial redundancy is high. Masking a small portion of an image will not create a meaningful challenge.”
So the high masking ratio is not arbitrary — it’s a direct response to the structure of images. 75% removes enough context that local interpolation fails. The model has to understand what’s being depicted.
The second non-obvious move: strip mask tokens from the encoder entirely. Earlier masked image modeling approaches (BEiT, iGPT) ran the full sequence through the encoder — including placeholder tokens for the masked positions. That means the expensive self-attention computation spans all 196 positions, even the 147 the model is supposed to predict.
MAE’s insight: the mask tokens contribute nothing to the encoder. They’re placeholders. Running them through a 24-layer ViT-L is pure waste. So remove them, let the encoder see only the 49 visible patches, then hand everything to a cheap decoder. The encoder’s input sequence is now 4x shorter, and the decoder handles the position-aware reconstruction.
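The shuffle-and-slice trick that implements this is tiny. A NumPy sketch for a single unbatched image (the real implementation works on batched tensors):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 1024))    # all patch embeddings, ViT-L width
ids_shuffle = rng.permutation(196)
visible = tokens[ids_shuffle[:49]]       # encoder input: 49 tokens, not 196
ids_restore = np.argsort(ids_shuffle)    # inverse permutation; used later to
                                         # put decoder tokens back in image order
```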
“This design greatly reduces computation. In our setting with a high masking ratio, the encoder processes only a small fraction of patches… allowing us to train large models efficiently.”
Does it actually work? What breaks?
ImageNet classification:
| Model | Pre-training | Fine-tuning top-1 | Linear probe |
|---|---|---|---|
| ViT-Base | MAE | 83.1% | 67.8% |
| ViT-Large | MAE | 85.9% | 73.5% |
| ViT-Huge | MAE | 87.8% | 77.2% |
| ViT-Large | MoCo v3 (contrastive) | 84.1% | 76.7% |
| ViT-Large | BEiT | 85.2% | — |
87.8% with ViT-Huge is best-in-class for methods that use only ImageNet-1K data.
The honest nuance: at linear probe, MAE ViT-L (73.5%) loses to MoCo v3 ViT-L (76.7%). Contrastive methods build better frozen features. MAE builds better features for fine-tuning. In practice, fine-tuning is how most people use pre-trained backbones, so MAE wins in the scenarios that matter most.
Transfer learning (downstream tasks, ViT-Large fine-tuned):
| Task | MAE pre-training | Supervised pre-training | Gain |
|---|---|---|---|
| COCO detection (APbox) | 53.3 | 51.2 | +2.1 |
| COCO segmentation (APmask) | 47.2 | 44.8 | +2.4 |
| ADE20K segmentation (mIoU) | 48.1 | 47.4 | +0.7 |
Self-supervised pre-training now beats supervised pre-training for dense vision tasks. That’s the headline result beyond the classification number.
Masking ratio ablation:
| Ratio | Fine-tuning accuracy |
|---|---|
| 40% | 85.3% |
| 60% | 85.7% |
| 75% | 85.9% |
| 85% | 85.7% |
| 90% | 85.1% |
75% is the sweet spot. Too low: the task is trivially solvable by interpolation. Too high: not enough signal left for meaningful encoding.
What doesn’t work:
Linear probe accuracy lags contrastive methods. If you need a frozen backbone (e.g., for a quick downstream classifier without any fine-tuning), contrastive pre-training is likely better.
Reconstruction of raw pixels is noisier than reconstructing discrete tokens (BEiT’s approach). MAE compensates with per-patch normalization, but the target is inherently pixel-level, not semantic-level. For tasks requiring very coarse semantic features, the gap vs. contrastive methods can be larger.
Practitioner notes
If you’re building a vision system and need to pre-train on unlabeled data: MAE is the default recipe. High masking ratio (75%), asymmetric encoder-decoder, reconstruct normalized pixels. The official checkpoints (facebook/vit-mae-{base,large,huge} on Hugging Face) are production-ready starting points.
If your task is dense prediction (detection, segmentation, depth): MAE pre-training beats supervised ImageNet pre-training on every dense task in the paper. Use it.
If you’re label-constrained: MAE’s fine-tuned features are highly label-efficient. With 1% of ImageNet labels for fine-tuning, MAE outperforms training from scratch with 100% of labels.
If you need frozen features (linear probe, k-NN): consider DINO or MoCo v3 instead. MAE is optimized for the fine-tuning path, not frozen feature quality.
The encoder-only-on-visible-patches design pattern has generalized widely: VideoMAE applies it to video (masking 90% of tubes), Audio-MAE to spectrograms, Point-MAE to point clouds. Wherever you have structured, spatially redundant inputs and abundant unlabeled data, this recipe applies directly.
Connections
- masked-language-model — MAE is the direct vision analog of BERT’s masked token prediction
- vision-transformer — MAE pre-trains ViT encoders; both encoder and decoder are transformer-based
- self-supervised-learning — MAE is self-supervised: supervision comes from the image itself, no labels needed
- patch-embeddings — MAE operates on ViT-style patch embeddings; masking and reconstruction happen at the patch level
- pre-training — the encoder becomes a transferable backbone; the decoder is discarded after pre-training
- attention-is-all-you-need — the Transformer architecture underlying both MAE’s encoder and decoder
- simclr-contrastive-learning-visual-representations — the contrastive pre-training alternative; MAE beats it on fine-tuning, loses on linear probe
- dino-self-supervised-vision-transformers — another self-supervised ViT approach using self-distillation instead of reconstruction
- dinov2-learning-robust-visual-features — DINOv2 scales self-supervised ViT pre-training with a patch-level reconstruction term alongside self-distillation
Citation
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR 2022. https://arxiv.org/abs/2111.06377