The Problem
Remote sensing models for land cover classification, segmentation, and change detection were traditionally trained per task on small labeled datasets, either from scratch or, at best, from ImageNet-pretrained weights. ImageNet pretraining is a poor match: photographs of cats and cars transfer poorly to overhead views of forests and farms. Meanwhile, the wider ML literature had moved to the pretrain-then-fine-tune paradigm (BERT, MAE), with large gains. The question: can we build a single pretrained model for satellite imagery that downstream tasks adapt, the way NLP tasks adapt BERT?
The Key Insight
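Satellite imagery has structure that natural images lack: multiple spectral bands (Sentinel-2 has 13) and repeat overpasses (Sentinel-2 revisits a given location roughly every 5 days). It also has scale: the whole Earth is the training set. Self-supervised pretraining at planet scale, with a masking objective tuned to satellite structure (mask across time, mask across spectral bands), produces an encoder that transfers better than ImageNet-pretrained weights. The reported margins are in the range of roughly 7 to 14 points on the benchmarks tested (SatMAE, for example, reports gains of up to 7% on supervised benchmarks and up to 14% on land cover transfer, measured against earlier pretraining methods), so they should be read as best-case figures rather than a uniform margin on every task. This is a “foundation model” in the BERT/CLIP sense: pretrain once on a massive unlabeled corpus, fine-tune everywhere.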
Mechanism in Plain English
- Pretraining corpus: tens of millions of satellite scenes from Sentinel-2 (multispectral, frequent revisits), Landsat (longer time series), or planet-scale providers. No labels, just imagery.
- Pretraining objective: a self-supervised task that exploits satellite structure. Most common is MAE-style masking (mask 75% of patches, reconstruct), with extensions for temporal masking (mask patches across time at the same location) and spectral masking (mask whole spectral groups).
- Architecture: ViT-Large or ViT-Huge backbone with patch embeddings extended to handle multi-spectral input (per-band-group embeddings) and temporal input (timestamp embeddings).
- Downstream: take the pretrained encoder, add a task-specific head (linear classifier, segmentation decoder, regression head), fine-tune on the small labeled dataset for the specific task (land cover, building footprints, crop yield, etc.).
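The masking objective in the list above can be sketched in a few lines of NumPy. This is a minimal illustration, not the procedure of any specific paper. It assumes tokens sit on a (timestamps, spectral groups, spatial patches) grid, and the function names `independent_mask`, `location_consistent_mask`, and `masked_mse` are hypothetical. The first function masks tokens independently at random. The second hides the same spatial locations at every timestamp and spectral group, which is one possible reading of "mask across time at the same location".

```python
import numpy as np

def independent_mask(t, g, n, mask_ratio=0.75, rng=None):
    """Mask tokens independently over the full (t, g, n) grid.

    Returns a boolean array of shape (t, g, n); True means "masked".
    """
    rng = np.random.default_rng() if rng is None else rng
    total = t * g * n
    num_masked = int(round(mask_ratio * total))
    flat = np.zeros(total, dtype=bool)
    flat[rng.choice(total, size=num_masked, replace=False)] = True
    return flat.reshape(t, g, n)

def location_consistent_mask(t, g, n, mask_ratio=0.75, rng=None):
    """Hide the same spatial patches at every timestamp and spectral group."""
    rng = np.random.default_rng() if rng is None else rng
    num_masked = int(round(mask_ratio * n))
    hidden = rng.choice(n, size=num_masked, replace=False)
    mask = np.zeros((t, g, n), dtype=bool)
    mask[:, :, hidden] = True
    return mask

def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked tokens only.

    pred, target: (t, g, n, d) arrays of per-token pixel values.
    mask: (t, g, n) boolean array, True where the token was hidden.
    """
    per_token = ((pred - target) ** 2).mean(axis=-1)  # shape (t, g, n)
    return float(per_token[mask].mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t, g, n, d = 3, 3, 16, 8  # assumed toy sizes
    target = rng.normal(size=(t, g, n, d))
    pred = np.zeros_like(target)  # stand-in for a decoder's output
    mask = independent_mask(t, g, n, rng=rng)
    print("masked fraction:", mask.mean())
    print("loss on masked tokens:", masked_mse(pred, target, mask))
```

Computing the loss on masked tokens only, as in the MAE recipe, means the encoder is never rewarded for copying tokens it could already see.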
ASCII Diagram
PRETRAIN (large unlabeled satellite corpus):
Multi-spectral, multi-temporal input
|
[patches with multi-band + temporal embeddings]
|
[MASK 75% across time and spectral groups]
|
[ViT-Huge encoder]
|
[Decoder reconstructs masked patches]
|
Loss: MSE on masked tokens
FINE-TUNE (small labeled corpus, task-specific):
Single-task input (e.g., RGB+NIR, single timestamp)
|
[pretrained encoder, frozen or LoRA-adapted]
|
[task head] <- always trained (LoRA adapters too, if used)
|
Output: land cover label / segmentation mask / crop yield
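The "frozen or LoRA-adapted" branch of the diagram can be illustrated with a single linear layer. The sketch below is a minimal NumPy version of a low-rank adapter under stated assumptions: the class name `LoRALinear`, the rank, and the `alpha` scaling are illustrative choices, and no training loop is shown. The pretrained weight stays fixed while two small matrices, `A` and `B`, carry the trainable update.

```python
import numpy as np

class LoRALinear:
    """Linear layer with a frozen weight plus a trainable low-rank update.

    Effective weight: W + (alpha / rank) * B @ A
    """

    def __init__(self, weight, rank=4, alpha=8.0, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        d_out, d_in = weight.shape
        self.weight = weight  # pretrained, kept frozen
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable
        self.B = np.zeros((d_out, rank))  # trainable, zero at start
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (batch, d_in) -> (batch, d_out)
        frozen = x @ self.weight.T
        update = self.scale * (x @ self.A.T) @ self.B.T
        return frozen + update

    def num_trainable(self):
        return self.A.size + self.B.size

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(1024, 1024))  # stand-in for one encoder weight
    layer = LoRALinear(W, rank=4, rng=rng)
    print("frozen parameters:", W.size)
    print("trainable parameters:", layer.num_trainable())
```

With a 1024x1024 weight and rank 4, the adapter trains 8,192 parameters against 1,048,576 frozen ones. Because `B` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, so fine-tuning starts from the pretrained behavior.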
What’s Clever
The first clever recognition: the right inductive bias for satellite imagery is “treat each pixel as a vector through time and through wavelength,” not “treat it like a 3-channel photograph.” Bands are physically meaningful (NIR vs SWIR vs visible); time is informative (winter vs summer). Pretraining tasks that respect this structure produce vastly better representations than pretraining tasks that ignore it.
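To make the "vector through time and through wavelength" view concrete, the sketch below reshapes an image stack into per-pixel sequences and computes NDVI, a standard vegetation index built from the red and near-infrared bands. It is illustrative only: the (T, C, H, W) array layout, the channel positions, and the function names are assumptions, not taken from any particular dataset loader.

```python
import numpy as np

def pixel_vectors(cube):
    """Reshape a (T, C, H, W) stack into (H*W, T, C).

    Each row is one pixel seen as a sequence over time, with one
    spectral vector per timestamp.
    """
    t, c, h, w = cube.shape
    return cube.transpose(2, 3, 0, 1).reshape(h * w, t, c)

def ndvi(red, nir, eps=1e-6):
    """Normalized difference vegetation index: (NIR - red) / (NIR + red)."""
    return (nir - red) / (nir + red + eps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy reflectance stack: 4 timestamps, 13 bands, 8x8 pixels.
    cube = rng.uniform(0.0, 1.0, size=(4, 13, 8, 8))
    # Assumed channel positions: Sentinel-2 B4 (red) and B8 (NIR)
    # when the 13 bands are stacked in their usual order.
    RED, NIR = 3, 7
    series = pixel_vectors(cube)
    ndvi_series = ndvi(series[:, :, RED], series[:, :, NIR])
    print(series.shape, ndvi_series.shape)
```

A seasonal NDVI curve per pixel is exactly the kind of signal a 3-channel, single-date view discards, which is why masking across both time and bands gives the encoder something meaningful to predict.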
The second clever recognition: the Earth is the training set. Unlike NLP corpora that need careful curation, satellite imagery is generated continuously by sensors at planet scale. Sentinel-2 alone produces ~25TB of imagery per day. The pretraining bottleneck is compute, not data.
The third recognition: self-supervised pretraining decouples downstream task development from labeling cost. Labeled satellite data is expensive — every land cover label needs an expert to look at the imagery. With a strong pretrained encoder, downstream tasks need 10-100x fewer labels than from-scratch training.
The fourth recognition: the foundation model paradigm transfers. The same recipe that worked for BERT (pretrain on raw text, fine-tune for everything) and CLIP (contrastive pretraining on image-text pairs, zero-shot transfer) works for remote sensing. SatMAE, Prithvi, DOFA, and Scale-MAE all instantiate this recipe with different scaling and masking strategies.
Key Sources
- satmae-pretraining-transformers-temporal-multispectral-satellite-imagery: foundational paper for the modern geospatial FM line
- mae-masked-autoencoders-scalable-vision-learners: predecessor; SatMAE adapts this to satellite imagery
- an-image-is-worth-16x16-words: ViT is the backbone
Related Concepts
- foundation-models — geospatial FMs are an instance of the broader paradigm
- self-supervised-learning — pretraining without labels is the core enabler
- masked-image-modeling — the dominant SSL task for vision-style FMs
- vision-transformer — the standard backbone
- transfer-learning — pretrain-then-fine-tune is a transfer pattern
- multimodal-embeddings — modern geospatial work extends to image+text+location
Open Questions
- How big is enough? SatMAE used ~1M images, Prithvi ~250M. Where does the scaling curve flatten for satellite imagery? Likely later than for natural images, due to lower per-image semantic density.
- Sensor heterogeneity: how to pretrain a model that works across Sentinel-2, Landsat, MODIS, Maxar — each with different bands and resolutions? DOFA introduces dynamic-band embeddings; this is an active area.
- Cross-domain alignment: how to align satellite imagery embeddings with ground-level photos (street view), text (location names), or other geospatial signals (mobility, climate)? GeoCLIP and successor work attempt this.
- Temporal scale: how many timestamps to use? 4-12 in current work; longer time series (climate-scale, 30+ year history) may unlock different downstream applications.
- Compute cost: pretraining a planet-scale FM costs $1M+. Distillation and adaptation strategies for downstream users are critical.