Concepts: self-supervised-learning | vision-transformer | patch-embeddings | geospatial-foundation-models | masked-image-modeling
Builds on: mae-masked-autoencoders-scalable-vision-learners (directly extends MAE for satellite-specific structure)
Builds on: an-image-is-worth-16x16-words (ViT is the backbone)
Leads to: Prithvi (NASA-IBM 2023), DOFA, and the broader “geospatial foundation model” line
When MAE landed in 2021, it gave self-supervised vision a clean recipe: mask 75% of image patches, encode only the visible ones with a ViT, ask a small decoder to reconstruct the rest, then transfer the encoder. The recipe worked beautifully on natural images. But satellite imagery has structure that natural images don’t: multiple spectral bands beyond RGB (Sentinel-2 has 13), and dense time series of the same location (a Sentinel-2 revisit roughly every 5 days). Pretraining a ViT on satellite imagery as if it were ImageNet wastes that structure. SatMAE (Cong et al., NeurIPS 2022) is a small set of changes to MAE that exploits both axes, and the paper reports gains of up to 7 points on supervised benchmarks and up to 14 points on downstream transfer.
The core idea
Two structural modifications to MAE:
- Temporal MAE. Treat a satellite scene not as a single image but as a stack of images over time. Tokenize each timestamp’s patches normally, but add a temporal positional embedding that tells the encoder which timestamp a patch came from. Mask patches independently across time, so the encoder learns to fill in “what does this patch probably look like in summer if I only see it in winter and spring.”
- Multi-spectral MAE. Don’t stack 13 spectral bands as a 13-channel image. Group them by physical similarity (visible bands, near-IR, short-wave IR), patch each group separately, and add a per-group spectral positional embedding. The model learns which bands belong together, but also that visible-light tokens and IR tokens are different objects.
The pretraining objective stays the same: mask 75% of patches, reconstruct them, use mean-squared error loss. Only the patch tokenization and positional encoding scheme changes.
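The unchanged objective can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors’ code: the helper names `random_mask` and `masked_mse`, the 196-token grid, and the stand-in “decoder output” are all assumptions made for the example.

```python
import numpy as np

def random_mask(num_tokens, mask_ratio, rng):
    """Return (keep_idx, mask_idx) for a uniformly random MAE-style mask."""
    num_keep = int(round(num_tokens * (1.0 - mask_ratio)))
    perm = rng.permutation(num_tokens)
    return np.sort(perm[:num_keep]), np.sort(perm[num_keep:])

def masked_mse(pred, target, mask_idx):
    """Mean-squared error computed only over the masked tokens."""
    diff = pred[mask_idx] - target[mask_idx]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
# Hypothetical sizes: a 14x14 grid of 16x16-pixel RGB patches.
num_tokens, patch_dim = 196, 16 * 16 * 3
target = rng.normal(size=(num_tokens, patch_dim))
keep_idx, mask_idx = random_mask(num_tokens, mask_ratio=0.75, rng=rng)

# Stand-in for a decoder output: exact on visible tokens, zeros on masked ones.
pred = np.zeros_like(target)
pred[keep_idx] = target[keep_idx]

# Only the masked tokens contribute, so the exact visible tokens do not lower the loss.
loss = masked_mse(pred, target, mask_idx)
print(len(keep_idx), len(mask_idx), loss)
```

Restricting the loss to masked tokens is what keeps the task from collapsing into copying the visible input.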
Walkthrough
Temporal MAE pretraining flow on a 3-timestamp Sentinel-2 scene:
INPUT: 3 images (winter, spring, summer), each 224x224 pixels with 10 bands.
  Use only RGB+NIR groups for this example.
PATCHIFY: 16x16-pixel patches -> 14x14 grid -> 196 patches per timestamp.
  Total: 3 timestamps x 196 = 588 patch tokens.
POSITIONAL EMBEDS:
  spatial: 196 unique positions (shared across timestamps)
  temporal: 3 unique positions (one per timestamp)
  per-token embedding: spatial + temporal
MASK: keep 25% (147 tokens), drop 75% (441 tokens).
  Crucially, masking is INDEPENDENT across timestamps, so the
  encoder may see patch-43 in winter and summer but have it
  masked in spring -> must learn to "interpolate" seasons.
ENCODE: ViT-Large processes the 147 visible tokens.
DECODE: Small ViT decoder takes the 147 encoded tokens + 441 mask
  tokens, predicts the pixel values of the 441 masked patches.
LOSS: MSE on the 441 masked patches.
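The token bookkeeping in the temporal flow above can be checked with a short NumPy sketch. It follows the walkthrough’s “spatial + temporal” sum; other ways of combining the two encodings (concatenation, for instance) are possible. The embedding width of 64 and the random tables standing in for positional encodings are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
T, grid, dim = 3, 14, 64          # timestamps, patches per side, hypothetical width
num_spatial = grid * grid         # 196 positions per timestamp
num_tokens = T * num_spatial      # 588 tokens in total

# Random stand-ins for the positional tables (an assumption, not learned values).
spatial_pos = rng.normal(size=(num_spatial, dim))   # shared across timestamps
temporal_pos = rng.normal(size=(T, dim))            # one row per timestamp

# Token (t, p) gets spatial_pos[p] + temporal_pos[t]; flatten to (588, dim).
pos = (temporal_pos[:, None, :] + spatial_pos[None, :, :]).reshape(num_tokens, dim)

# Independent masking: shuffle all 588 tokens together and keep 25%.
num_keep = num_tokens // 4        # 147
perm = rng.permutation(num_tokens)
keep_idx, mask_idx = perm[:num_keep], perm[num_keep:]

# Recover which timestamp and spatial position each visible token came from.
keep_t, keep_p = np.divmod(keep_idx, num_spatial)
visible_per_timestamp = np.bincount(keep_t, minlength=T)
print(num_tokens, len(keep_idx), len(mask_idx), visible_per_timestamp)
```

Because the shuffle runs over all tokens at once, the number of visible tokens per timestamp varies from sample to sample; only the total of 147 is fixed.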
Multi-spectral MAE pretraining on a single Sentinel-2 scene with 10 bands:
GROUP THE BANDS:
  Group A (visible): R, G, B
  Group B (red edge): RE1, RE2, RE3
  Group C (NIR + SWIR): NIR, NIR-narrow, SWIR1, SWIR2
PATCHIFY EACH GROUP SEPARATELY:
  Group A: 14x14 grid of spatial patches -> 196 tokens
  Group B: 14x14 -> 196 tokens
  Group C: 14x14 -> 196 tokens
  Total: 588 patch tokens.
POSITIONAL EMBEDS:
  spatial: 196 positions (shared across groups)
  spectral: 3 positions (A, B, C)
MASK 75% INDEPENDENTLY across groups.
ENCODE / DECODE / LOSS: as in MAE.
The key consequence: the encoder learns that the same spatial position viewed in different spectral groups is the same place but a different feature — exactly the inductive bias remote sensing analysts use.
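A rough NumPy sketch of the per-group tokenization, assuming the grouping from the walkthrough. The channel order in `GROUPS`, the embedding width, and the random projection standing in for a per-group patch embedding are hypothetical choices, not taken from the paper.

```python
import numpy as np

# Hypothetical channel order for a 10-band array, following the walkthrough's groups.
GROUPS = {
    "A_visible": [0, 1, 2],
    "B_red_edge": [3, 4, 5],
    "C_nir_swir": [6, 7, 8, 9],
}

def patchify(x, patch):
    """Split an (H, W, C) array into (num_patches, patch*patch*C) rows."""
    h, w, c = x.shape
    gh, gw = h // patch, w // patch
    x = x.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 10))
patch, dim = 16, 64  # dim is an illustrative embedding width

tokens, group_ids = [], []
for g, (name, channels) in enumerate(GROUPS.items()):
    flat = patchify(image[:, :, channels], patch)         # (196, 16*16*len(channels))
    proj = rng.normal(size=(flat.shape[1], dim)) * 0.01   # stand-in per-group linear embed
    tokens.append(flat @ proj)
    group_ids.append(np.full(flat.shape[0], g))

tokens = np.concatenate(tokens)                       # (588, dim)
group_ids = np.concatenate(group_ids)                 # spectral index of each token
spatial_ids = np.tile(np.arange(196), len(GROUPS))    # same 196 positions reused per group
print(tokens.shape, np.bincount(group_ids))
```

Each token carries a spatial index and a spectral index, so two tokens at the same location in different groups share the first and differ in the second.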
What’s clever — find the instinct
The clever recognition: the right inductive bias for satellite imagery isn’t “treat it like a photograph.” It’s “treat each pixel as a vector through time and through wavelength.” Natural images have RGB because human eyes have three cone types. Sentinel-2 images have 13 bands because that’s what the sensor measures, and the bands are physically meaningful: vegetation reflects strongly in NIR, water absorbs SWIR, ice reflects visible light. Cramming all this into a 3-channel ViT loses information.
“The inherent temporal and multi-spectral structure provides avenues to further improve existing pre-training strategies.”
The second clever move: learn alignment across time and wavelength as a by-product of reconstruction. A model that has to reconstruct a masked patch from the same location seen in other seasons or other band groups has implicitly learned the relationship between season and spectral signature. This is exactly the relationship a downstream classifier wants to use.
“Encoding multi-spectral data as groups of bands with distinct spectral positional encodings is beneficial.”
The third clever move: modality-aware masking. Independent masking across time and across spectral groups forces the encoder to learn cross-modal generalization. If you masked all of one timestamp at once, the model would just copy from the other timestamps — easy and useless. By masking independently, the model has to learn the correlation structure.
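One way to see what independent masking exposes is to count how many masked tokens have their own spatial position visible at some other timestamp. The sketch below compares independent masking against a shared spatial mask reused for every timestamp. That shared-mask variant is introduced here purely as a contrast; the function names and the 3 x 196 layout are assumptions for illustration.

```python
import numpy as np

def visible_grid(T, num_spatial, keep_ratio, rng, independent=True):
    """Boolean (T, num_spatial) array marking which tokens stay visible."""
    vis = np.zeros((T, num_spatial), dtype=bool)
    num_keep = int(T * num_spatial * keep_ratio)
    if independent:
        # Every token is shuffled on its own, regardless of timestamp.
        idx = rng.permutation(T * num_spatial)[:num_keep]
        vis.flat[idx] = True
    else:
        # Shared mask: choose spatial positions once, reuse them for every timestamp.
        pos = rng.permutation(num_spatial)[: num_keep // T]
        vis[:, pos] = True
    return vis

def frac_masked_with_temporal_context(vis):
    """Fraction of masked tokens whose location is visible at another timestamp."""
    any_visible = vis.any(axis=0)          # per spatial position
    masked = ~vis
    has_context = masked & any_visible[None, :]
    return has_context.sum() / masked.sum()

rng = np.random.default_rng(0)
vis_ind = visible_grid(3, 196, 0.25, rng, independent=True)
vis_shared = visible_grid(3, 196, 0.25, rng, independent=False)
frac_ind = frac_masked_with_temporal_context(vis_ind)
frac_shared = frac_masked_with_temporal_context(vis_shared)
print(frac_ind, frac_shared)
```

Under the shared mask, a masked location is hidden at every timestamp, so the fraction is exactly zero. Under independent masking, a sizeable share of masked tokens can draw on the same location seen at a different time, which is the cross-time signal the paragraph above describes.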
Does it work? What breaks?
fMoW-Sentinel benchmark (land-use classification; deltas are relative to vanilla MAE on satellite):
| Pretraining | Top-1 Acc |
|---|---|
| ImageNet supervised | 56.3% |
| ImageNet MoCo-v3 | 57.0% |
| MAE on satellite (vanilla) | 57.4% |
| SatMAE Temporal | 63.9% (+6.5) |
| SatMAE Spectral | 63.0% (+5.6) |
EuroSAT (transfer to a different Sentinel-2 dataset):
| Pretraining | Top-1 Acc |
|---|---|
| ImageNet supervised | 80.5% |
| SatMAE | 94.3% (+13.8) |
SpaceNet semantic segmentation (mIoU):
| Pretraining | mIoU |
|---|---|
| ImageNet | 49.1 |
| SatMAE | 55.7 (+6.6) |
The transfer-learning gains are the load-bearing result. Pretraining on satellite imagery with the right inductive biases produces an encoder that beats ImageNet pretraining on every downstream remote sensing task tested.
What breaks:
- Domain transfer is sensor-specific. SatMAE pretrained on Sentinel-2 transfers well to other Sentinel-2 tasks. Transfer to commercial satellites with different band centers (for example Maxar’s WorldView series) needs adapter layers or re-pretraining.
- No high-resolution support. The paper uses 96-224px images. Modern remote sensing tasks (building footprints, road extraction) often need 512+ px and dynamic resolution.
- No alignment with ground-level imagery. For a system like Saikat’s pipeline (ground-level 360-degree street imagery + satellite context), SatMAE only gives the satellite half. CLIP-style cross-domain alignment is a separate problem.
- Missing data is a bigger problem in practice. Real satellite time series have clouds, sensor failures, gaps. The paper assumes clean stacks.
So what?
SatMAE is the standard “pretrained encoder” for remote sensing in 2024-2025. The Hugging Face SatMAE checkpoints are the default starting point for any geospatial classification or segmentation task. Prithvi (NASA-IBM 2023) applies a similar masked-autoencoder recipe in a 100M-parameter model pretrained on NASA’s Harmonized Landsat Sentinel-2 (HLS) data. The “geospatial foundation model” research direction is essentially “SatMAE plus more compute, more data, more bands.”
For practitioners working on geospatial ML (snap-to-road systems, land cover classification, urban-change detection), the practical rule is: don’t pretrain on ImageNet. Use SatMAE or a sensor-matched descendant. Gains of roughly 6 to 14 points on small downstream datasets are precisely what you want when labeled satellite data is scarce.
For Saikat specifically: when the wiki adds a “satellite + street imagery POI extraction” pipeline, SatMAE is the right encoder for the satellite branch. The street-imagery branch is still a CLIP-style or DINOv2 ViT. Aligning them is the open problem (cross-modal contrastive pretraining, à la GeoCLIP).
“Our approach yields strong improvements over previous state-of-the-art techniques, both in terms of supervised learning performance on benchmark datasets (up to 7%), and transfer learning performance on downstream remote sensing tasks, including land cover classification (up to 14%).”
The deeper principle SatMAE demonstrates: self-supervised methods are most effective when the masking strategy reflects the data’s actual structure. MAE masks random spatial patches because natural images have local spatial coherence. SatMAE also masks across time and across spectral groups because satellite imagery has additional temporal and spectral coherence. The same idea applies anywhere: masked language models can mask whole words or spans rather than isolated subword pieces; masked models for protein sequences mask amino-acid residues; masked models for code can mask whole AST nodes. Match the mask to the structure.
Connections
- mae-masked-autoencoders-scalable-vision-learners — direct predecessor; SatMAE is MAE with two domain-specific extensions
- an-image-is-worth-16x16-words — ViT is the encoder backbone
- self-supervised-learning — SatMAE establishes the SSL recipe for remote sensing
- vision-transformer — uses ViT-Large and ViT-Huge as backbones
- patch-embeddings — modified to handle multi-spectral and temporal grouping
- geospatial-foundation-models — SatMAE is the canonical first generation
- masked-image-modeling — SatMAE generalizes the masking strategy beyond spatial-only
Citation
Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D. B., & Ermon, S. (2022). SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery. NeurIPS 2022. https://arxiv.org/abs/2207.08051