SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery

Concepts: self-supervised-learning | vision-transformer | patch-embeddings | geospatial-foundation-models | masked-image-modeling
Builds on: mae-masked-autoencoders-scalable-vision-learners (directly extends MAE for satellite-specific structure)
Builds on: an-image-is-worth-16x16-words (ViT is the backbone)
Leads to: Prithvi (NASA-IBM 2023), DOFA, and the broader “geospatial foundation model” line

When MAE landed in 2021, it gave self-supervised vision a clean recipe: mask 75% of image patches, encode only the visible ones with a ViT, ask a small ViT decoder to reconstruct the rest, then transfer the encoder. The recipe worked beautifully on natural images. But satellite imagery has structure that natural images don’t: multiple spectral bands beyond RGB (Sentinel-2 has 13), and dense time series of the same location (Sentinel-2 revisits a site roughly every 5 days). Pretraining a ViT on satellite imagery as if it were ImageNet wastes that structure. SatMAE (Cong et al., NeurIPS 2022) is a small set of changes to MAE that exploits both axes, and the authors report gains of up to 7% on supervised benchmarks and up to 14% on downstream transfer tasks.

The core idea

Two structural modifications to MAE:

Temporal MAE. Treat a satellite scene not as a single image but as a stack of images over time. Tokenize each timestamp’s patches normally, but add a temporal positional embedding that tells the encoder which timestamp a patch came from. Mask patches independently across time — so the encoder learns to fill in “what does this patch probably look like in summer if I only see it in winter and spring.”
Multi-spectral MAE. Don’t stack 13 spectral bands as a 13-channel image. Group them by physical similarity (visible bands, near-IR, short-wave IR), patch each group separately, and add a per-group spectral positional embedding. The model learns which bands belong together, but also that visible-light tokens and IR tokens are different objects.

The pretraining objective stays the same: mask 75% of patches, reconstruct them, and score the reconstruction with mean-squared error. Only the patch tokenization and the positional encoding scheme change.
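The unchanged objective can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `masked_mse`, the array shapes, and the constant 0.5 "decoder error" are all assumptions chosen to make the arithmetic easy to follow.

```python
import numpy as np

def masked_mse(pred, target, mask):
    """Mean-squared error averaged over masked (hidden) patches only.

    pred, target: (num_patches, patch_dim) arrays of pixel values.
    mask: (num_patches,) boolean array, True where the patch was hidden.
    """
    per_patch = ((pred - target) ** 2).mean(axis=1)  # one MSE value per patch
    return per_patch[mask].mean()                    # visible patches are ignored

rng = np.random.default_rng(0)
num_patches, patch_dim = 196, 16 * 16 * 3            # hypothetical sizes
target = rng.normal(size=(num_patches, patch_dim))

# Hide 75% of the 196 patches (147 of them).
mask = np.zeros(num_patches, dtype=bool)
mask[rng.permutation(num_patches)[:147]] = True

# Stand-in for a decoder output: perfect on visible patches,
# off by a constant 0.5 on every hidden pixel.
pred = target.copy()
pred[mask] += 0.5

loss = masked_mse(pred, target, mask)  # approximately 0.5 ** 2 = 0.25
```

Because the loss is taken only over hidden patches, the encoder gets no credit for copying pixels it was already shown.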

Walkthrough

Temporal MAE pretraining flow on a 3-timestamp Sentinel-2 scene:

INPUT: 3 images (winter, spring, summer), each 224x224 pixels x 10 bands.
       Use only the RGB and NIR bands for this example.

PATCHIFY: 16x16-pixel patches -> 14x14 grid -> 196 patches per timestamp.
          Total: 3 timestamps x 196 = 588 patch tokens.

POSITIONAL EMBEDS:
  spatial: 196 unique positions (shared across timestamps)
  temporal: 3 unique positions (one per timestamp)
  per-token embedding: spatial + temporal

MASK: keep 25% (147 tokens), drop 75% (441 tokens).
      Crucially, masking is INDEPENDENT across timestamps, so the
      encoder may see patch-43 in winter and summer but have it
      masked in spring -> must learn to "interpolate" seasons.

ENCODE: ViT-Large processes the 147 visible tokens.

DECODE: Small ViT decoder takes the 147 encoded tokens + 441 mask
        tokens, predicts the pixel values of the 441 masked patches.

LOSS: MSE on the 441 masked patches.
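The token bookkeeping in this flow can be checked with a short NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the embedding width `D = 64`, the random stand-ins for learned projections and embeddings, and the helper name `patchify` are hypothetical, and the spatial and temporal embeddings are summed as the walkthrough describes (other ways of combining them are possible).

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W, C = 3, 224, 224, 4   # 3 timestamps, RGB + NIR
P, D = 16, 64                 # patch size; D is a hypothetical small embedding width

def patchify(img, p):
    """Split an (H, W, C) image into (num_patches, p*p*C) flattened patches."""
    h, w, c = img.shape
    gh, gw = h // p, w // p
    x = img.reshape(gh, p, gw, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, p * p * c)

scene = rng.normal(size=(T, H, W, C))                     # synthetic stand-in for real imagery
patches = np.stack([patchify(img, P) for img in scene])   # (3, 196, 1024)
N = patches.shape[1]

# Random stand-ins for parameters that would be learned in practice.
proj = rng.normal(size=(P * P * C, D)) * 0.02
spatial = rng.normal(size=(N, D))    # shared across timestamps
temporal = rng.normal(size=(T, D))   # one per timestamp

# Each token = projected patch + spatial embedding + temporal embedding.
tokens = patches @ proj + spatial[None, :, :] + temporal[:, None, :]
tokens = tokens.reshape(T * N, D)    # 588 tokens

# Independent masking: one random draw over all 588 (timestamp, position) pairs.
keep = int(0.25 * T * N)             # 147
perm = rng.permutation(T * N)
visible_idx, masked_idx = perm[:keep], perm[keep:]
visible_tokens = tokens[visible_idx]  # what the encoder would see
```

Drawing one permutation over all 588 tokens is what makes the mask independent across timestamps: nothing forces a given spatial position to be hidden or visible in all three seasons at once.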

Multi-spectral MAE pretraining on a single Sentinel-2 scene with 10 bands:

GROUP THE BANDS:
  Group A (visible): R, G, B
  Group B (red edge): RE1, RE2, RE3
  Group C (NIR + SWIR): NIR, NIR-narrow, SWIR1, SWIR2

PATCHIFY EACH GROUP SEPARATELY:
  Group A: 14x14 grid of spatial patches -> 196 tokens
  Group B: 14x14 grid -> 196 tokens
  Group C: 14x14 grid -> 196 tokens
  Total: 588 patch tokens.

POSITIONAL EMBEDS:
  spatial: 196 positions (shared across groups)
  spectral: 3 positions (A, B, C)

MASK 75% INDEPENDENTLY across groups.
ENCODE / DECODE / LOSS: as in MAE.

The key consequence: the encoder learns that the same spatial position viewed in different spectral groups is the same place but a different feature — exactly the inductive bias remote sensing analysts use.
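A minimal NumPy sketch of group-wise tokenization follows. The channel indices in `GROUPS` mirror the grouping in the walkthrough above but are assumptions about band order, as are the 224x224 input size, the embedding width `D = 64`, and the random arrays standing in for learned parameters.

```python
import numpy as np

# Channel indices per group; the band order is an assumption for illustration.
GROUPS = {
    "A_visible": [0, 1, 2],        # R, G, B
    "B_red_edge": [3, 4, 5],       # RE1, RE2, RE3
    "C_nir_swir": [6, 7, 8, 9],    # NIR, NIR-narrow, SWIR1, SWIR2
}

def patchify(img, p):
    """Split an (H, W, C) image into (num_patches, p*p*C) flattened patches."""
    h, w, c = img.shape
    gh, gw = h // p, w // p
    x = img.reshape(gh, p, gw, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, p * p * c)

rng = np.random.default_rng(0)
H = W = 224
P, D = 16, 64                                   # D is a hypothetical embedding width
scene = rng.normal(size=(H, W, 10))             # synthetic 10-band scene

num_positions = (H // P) * (W // P)             # 196
spatial = rng.normal(size=(num_positions, D))   # shared across groups
spectral = rng.normal(size=(len(GROUPS), D))    # one embedding per group

group_tokens = []
for g, bands in enumerate(GROUPS.values()):
    patches = patchify(scene[:, :, bands], P)   # (196, 16*16*len(bands))
    # Groups differ in channel count, so each gets its own projection.
    proj = rng.normal(size=(patches.shape[1], D)) * 0.02
    group_tokens.append(patches @ proj + spatial + spectral[g])

tokens = np.concatenate(group_tokens)           # (588, 64)
```

Tokens from different groups at the same grid cell share a spatial embedding and differ only in their spectral embedding, which is how "same place, different feature" is encoded.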

What’s clever — find the instinct

The clever recognition: the right inductive bias for satellite imagery isn’t “treat it like a photograph.” It’s “treat each pixel as a vector through time and through wavelength.” Natural images have RGB because human eyes have three cone types. Sentinel-2 images have 13 bands because that’s what the sensor measures, and the bands are physically meaningful: vegetation reflects strongly in NIR, water absorbs SWIR, ice reflects visible light. Cramming all this into a 3-channel ViT input throws information away.

“The inherent temporal and multi-spectral structure provides avenues to further improve existing pre-training strategies.”

The second clever move: learn alignment across time and wavelength jointly with reconstruction. A model that has to reconstruct a winter NIR patch from summer visible bands has implicitly learned the relationship between season and spectral signature. This is exactly the relationship a downstream classifier wants to use.

“Encoding multi-spectral data as groups of bands with distinct spectral positional encodings is beneficial.”

The third clever move: modality-aware masking. Independent masking across time and across spectral groups forces the encoder to learn cross-modal generalization. If you masked all of one timestamp at once, the model would just copy from the other timestamps — easy and useless. By masking independently, the model has to learn the correlation structure.
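One way to see what independent masking buys is to compare it with a mask that hides the same spatial positions in every timestamp (called "consistent" here, a name introduced only for this comparison). The sketch below is illustrative: both function names and the `partially_observed` metric are hypothetical, and the sizes come from the walkthrough.

```python
import numpy as np

def independent_mask(num_groups, num_positions, mask_ratio, rng):
    """Hide a random subset of all (group, position) tokens."""
    total = num_groups * num_positions
    mask = np.zeros(total, dtype=bool)
    mask[rng.permutation(total)[: int(round(mask_ratio * total))]] = True
    return mask.reshape(num_groups, num_positions)

def consistent_mask(num_groups, num_positions, mask_ratio, rng):
    """Hide the same spatial positions in every group."""
    mask = np.zeros(num_positions, dtype=bool)
    mask[rng.permutation(num_positions)[: int(round(mask_ratio * num_positions))]] = True
    return np.tile(mask, (num_groups, 1))

def partially_observed(mask):
    """Count positions hidden in some groups but visible in others."""
    hidden_counts = mask.sum(axis=0)
    return int(((hidden_counts > 0) & (hidden_counts < mask.shape[0])).sum())

rng = np.random.default_rng(0)
ind = independent_mask(3, 196, 0.75, rng)   # 3 timestamps (or band groups), 196 positions
con = consistent_mask(3, 196, 0.75, rng)

partial_ind = partially_observed(ind)   # many positions are seen in one season, hidden in another
partial_con = partially_observed(con)   # always 0: a position is hidden everywhere or nowhere
```

Both masks hide 441 of 588 tokens, but only the independent one produces positions that are visible in one timestamp and hidden in another, which is the situation that forces the encoder to relate views of the same place.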

Does it work? What breaks?

fMoW-Sentinel benchmark (scene classification by functional land use):

Pretraining	Top-1 Acc
ImageNet supervised	56.3%
ImageNet MoCo-v3	57.0%
MAE on satellite (vanilla)	57.4%
SatMAE Temporal	63.9% (+6.5)
SatMAE Spectral	63.0% (+5.6)

EuroSAT (transfer to a different dataset; EuroSAT is also built from Sentinel-2 imagery):

Pretraining	Top-1 Acc
ImageNet supervised	80.5%
SatMAE	94.3% (+13.8)

SpaceNet semantic segmentation (mIoU):

Pretraining	mIoU
ImageNet	49.1
SatMAE	55.7 (+6.6)

The transfer-learning gains are the load-bearing result. Pretraining on satellite imagery with the right inductive biases produces an encoder that beats ImageNet pretraining on every downstream remote sensing task tested.

What breaks:

Domain transfer is sensor-specific. SatMAE pretrained on Sentinel-2 transfers well to other Sentinel-2 tasks. Transfer to commercial satellites with different band centers (for example, Maxar’s WorldView series) needs adapter layers or re-pretraining.
No high-resolution support. The paper uses 96-224px images. Modern remote sensing tasks (building footprints, road extraction) often need 512+ px and dynamic resolution.
No alignment with ground-level imagery. For a system like Saikat’s pipeline (ground-level 360-degree street imagery + satellite context), SatMAE only gives the satellite half. CLIP-style cross-domain alignment is a separate problem.
Missing data is a bigger problem in practice. Real satellite time series have clouds, sensor failures, gaps. The paper assumes clean stacks.

So what?

SatMAE is a widely used pretrained encoder for remote sensing as of 2024-2025, and its released checkpoints are a common starting point for geospatial classification and segmentation tasks. Prithvi (NASA-IBM 2023) carries the masked-autoencoder recipe to a 100M-parameter model pretrained on Harmonized Landsat Sentinel-2 data. The “geospatial foundation model” research direction is essentially “SatMAE plus more compute, more data, more bands.”

For practitioners working on geospatial ML (snap-to-road systems, land cover classification, urban-change detection), the practical rule is: don’t start from ImageNet pretraining. Use SatMAE or a sensor-matched descendant. Gains of roughly 6 to 14 points, as in the tables above, are precisely what you want when labeled satellite data is scarce.

For Saikat specifically: when the wiki adds a “satellite + street imagery POI extraction” pipeline, SatMAE is the right encoder for the satellite branch. The street-imagery branch is still a CLIP-style or DINOv2 ViT. Aligning them is the open problem (cross-modal contrastive pretraining, à la GeoCLIP).

“Our approach yields strong improvements over previous state-of-the-art techniques, both in terms of supervised learning performance on benchmark datasets (up to 7%), and transfer learning performance on downstream remote sensing tasks, including land cover classification (up to 14%).”

The deeper principle SatMAE demonstrates: self-supervised methods are most effective when the masking strategy reflects the data’s actual structure. MAE masks random spatial patches because natural images have local spatial coherence. SatMAE also masks across time and across spectral groups because satellite imagery has temporal and spectral coherence on top of that. The same idea applies anywhere the data has natural units: whole-word masking in language models hides all subword pieces of a word together, protein sequence models can mask amino-acid residues, and code models can mask whole AST nodes. Match the mask to the structure.

Connections

mae-masked-autoencoders-scalable-vision-learners — direct predecessor; SatMAE is MAE with two domain-specific extensions
an-image-is-worth-16x16-words — ViT is the encoder backbone
self-supervised-learning — SatMAE establishes the SSL recipe for remote sensing
vision-transformer — uses ViT-Large and ViT-Huge as backbones
patch-embeddings — modified to handle multi-spectral and temporal grouping
geospatial-foundation-models — SatMAE is the canonical first generation
masked-image-modeling — SatMAE generalizes the masking strategy beyond spatial-only

Citation

arXiv:2207.08051

Cong, Y., Khanna, S., Meng, C., Liu, P., Rozi, E., He, Y., Burke, M., Lobell, D. B., & Ermon, S. (2022). SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery. NeurIPS 2022. https://arxiv.org/abs/2207.08051
