The Problem

Remote sensing models for land cover classification, segmentation, and change detection were traditionally trained from scratch for each task, on small labeled datasets, with ImageNet pretraining at best. ImageNet is a poor match: photographs of cats and cars don’t transfer well to overhead views of forests and farms. Meanwhile, the broader ML literature had moved to the pretrain-then-fine-tune paradigm (BERT, MAE), with enormous gains. The question: can we build a single pretrained model for satellite imagery that downstream tasks adapt, the way BERT serves NLP?

The Key Insight

Satellite imagery has structure that natural images lack: multiple spectral bands (Sentinel-2 has 13), repeat overpasses (Sentinel-2 revisits any given place about every 5 days), and a training set that is the entire Earth. Self-supervised pretraining at planet scale, with a masking objective tuned to that structure (mask across time, mask across spectral bands), produces an encoder that beats supervised ImageNet pretraining by 7-14 points on every downstream task tested. This is a “foundation model” in the BERT/CLIP sense: pretrain once on a massive unlabeled corpus, fine-tune everywhere.

Mechanism in Plain English

  1. Pretraining corpus: tens of millions of satellite scenes from Sentinel-2 (Earth-observation), Landsat (longer time series), or planet-scale providers. No labels — just imagery.
  2. Pretraining objective: a self-supervised task that exploits satellite structure. Most common is MAE-style masking (mask 75% of patches, reconstruct), with extensions for temporal masking (mask patches across time at the same location) and spectral masking (mask whole spectral groups).
  3. Architecture: ViT-Large or ViT-Huge backbone with patch embeddings extended to handle multi-spectral input (per-band-group embeddings) and temporal input (timestamp embeddings).
  4. Downstream: take the pretrained encoder, add a task-specific head (linear classifier, segmentation decoder, regression head), fine-tune on the small labeled dataset for the specific task (land cover, building footprints, crop yield, etc.).
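
The masking variants in step 2 can be sketched in a few lines. This is a minimal illustration in plain Python, not the masking code of any particular model: the function names, the (time, band group, patch) token layout, and the grid sizes are assumptions made for the example. It selects which tokens to hide under each strategy and computes the reconstruction loss on hidden tokens only.

```python
import random
from itertools import product


def make_mask(n_times, n_groups, n_patches, strategy="random",
              mask_ratio=0.75, seed=0):
    """Return the set of (time, band_group, patch) tokens hidden from the encoder."""
    rng = random.Random(seed)
    tokens = list(product(range(n_times), range(n_groups), range(n_patches)))

    if strategy == "random":
        # Plain MAE: hide a fixed fraction of all tokens, chosen independently.
        return set(rng.sample(tokens, round(mask_ratio * len(tokens))))

    if strategy == "temporal":
        # Hide the same spatial positions at every timestamp, so the model
        # cannot simply copy a location from another date.
        hidden = set(rng.sample(range(n_patches), round(mask_ratio * n_patches)))
        return {tok for tok in tokens if tok[2] in hidden}

    if strategy == "spectral":
        # Hide whole band groups, always leaving at least one group visible.
        n_hidden = min(n_groups - 1, max(1, round(mask_ratio * n_groups)))
        hidden = set(rng.sample(range(n_groups), n_hidden))
        return {tok for tok in tokens if tok[1] in hidden}

    raise ValueError(f"unknown strategy: {strategy}")


def masked_mse(pred, target, mask):
    """Mean squared error over masked tokens only; visible tokens are ignored.

    pred and target map each token to a list of pixel values.
    Assumes the mask is non-empty.
    """
    errors = [(p - t) ** 2
              for tok in mask
              for p, t in zip(pred[tok], target[tok])]
    return sum(errors) / len(errors)


# Hypothetical grid: 4 timestamps, 3 band groups, 16 patches per image.
for strategy in ("random", "temporal", "spectral"):
    mask = make_mask(4, 3, 16, strategy=strategy)
    print(strategy, len(mask), "of", 4 * 3 * 16, "tokens hidden")
```

Note that with only three band groups, spectral masking cannot hit exactly 75%; the sketch rounds to two hidden groups out of three.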

ASCII Diagram

PRETRAIN (large unlabeled satellite corpus):
                                       
  Multi-spectral, multi-temporal input
            |
  [patches with multi-band + temporal embeddings]
            |
  [MASK 75% across time and spectral groups]
            |
  [ViT-Huge encoder]
            |
  [Decoder reconstructs masked patches]
            |
  Loss: MSE on masked tokens

FINE-TUNE (small labeled corpus, task-specific):

  Single-task input (e.g., RGB+NIR, single timestamp)
            |
  [pretrained encoder, frozen or LoRA-adapted]
            |
  [task head]   <- trained (plus the LoRA adapters, if used)
            |
  Output: land cover label / segmentation mask / crop yield
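
To give a rough sense of how little is trained in the fine-tuning path above, the sketch below counts trainable parameters for a linear head and for LoRA adapters. The ViT-Large and ViT-Huge depths and widths are the standard ViT configurations; the choice of rank 8, of adapting two attention projections per block, and of a 10-class head are assumptions for illustration, and a given satellite foundation model may use different settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViTConfig:
    name: str
    depth: int   # number of transformer blocks
    width: int   # hidden (embedding) dimension


VIT_LARGE = ViTConfig("ViT-Large", depth=24, width=1024)
VIT_HUGE = ViTConfig("ViT-Huge", depth=32, width=1280)


def linear_head_params(cfg, n_classes):
    """Weights plus biases of a linear classifier on the encoder output."""
    return cfg.width * n_classes + n_classes


def lora_params(cfg, rank, matrices_per_block=2):
    """Trainable parameters added by LoRA.

    Each adapted (width x width) matrix gains two low-rank factors,
    A (rank x width) and B (width x rank), i.e. 2 * width * rank parameters.
    matrices_per_block=2 assumes the query and value projections are adapted.
    """
    return cfg.depth * matrices_per_block * 2 * cfg.width * rank


for cfg in (VIT_LARGE, VIT_HUGE):
    head = linear_head_params(cfg, n_classes=10)
    lora = lora_params(cfg, rank=8)
    print(f"{cfg.name}: head = {head:,} params, LoRA (r=8) = {lora:,} params")
```

Under these assumptions both counts stay far below the several hundred million parameters of the encoder itself, which is why the frozen and LoRA-adapted paths are cheap compared with full fine-tuning.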

What’s Clever

The first clever recognition: the right inductive bias for satellite imagery is “treat each pixel as a vector through time and through wavelength,” not “treat it like a 3-channel photograph.” Bands are physically meaningful (NIR vs SWIR vs visible); time is informative (winter vs summer). Pretraining tasks that respect this structure produce vastly better representations than pretraining tasks that ignore it.
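
A small example of why bands and dates carry meaning: the standard NDVI index combines red and near-infrared reflectance, and its value at a single pixel changes with the season. The sketch below treats one pixel as a (time x band) table; the reflectance values are invented for illustration and do not come from any real scene.

```python
def ndvi(red, nir):
    """Normalized difference vegetation index from red and NIR reflectance."""
    return (nir - red) / (nir + red)


# One pixel as a (time x band) table. Values are made up for illustration.
series = {
    "January": {"red": 0.12, "nir": 0.18},
    "April":   {"red": 0.08, "nir": 0.30},
    "July":    {"red": 0.04, "nir": 0.50},
    "October": {"red": 0.10, "nir": 0.25},
}

ndvi_by_month = {month: ndvi(b["red"], b["nir"]) for month, b in series.items()}

for month, value in ndvi_by_month.items():
    print(f"{month:8s} NDVI = {value:.2f}")
```

A 3-channel RGB view of the same pixel drops the near-infrared band entirely, and a single-date view drops the seasonal curve.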

The second clever recognition: the Earth is the training set. Unlike NLP corpora that need careful curation, satellite imagery is generated continuously by sensors at planet scale. Sentinel-2 alone produces ~25TB of imagery per day. The pretraining bottleneck is compute, not data.

The third recognition: self-supervised pretraining decouples downstream task development from labeling cost. Labeled satellite data is expensive — every land cover label needs an expert to look at the imagery. With a strong pretrained encoder, downstream tasks need 10-100x fewer labels than from-scratch training.

The fourth recognition: the foundation model paradigm transfers. The same recipe that worked for BERT (pretrain on raw text, fine-tune for everything) and CLIP (contrastive pretraining on image-text pairs, zero-shot transfer) works for remote sensing. SatMAE, Prithvi, DOFA, and Scale-MAE all instantiate this recipe with different scaling and masking strategies.

Open Questions

  • How big is enough? SatMAE used ~1M images, Prithvi ~250M. Where does the scaling curve flatten for satellite imagery? Likely later than for natural images, due to lower per-image semantic density.
  • Sensor heterogeneity: how to pretrain a model that works across Sentinel-2, Landsat, MODIS, Maxar — each with different bands and resolutions? DOFA introduces dynamic-band embeddings; this is an active area.
  • Cross-domain alignment: how to align satellite imagery embeddings with ground-level photos (street view), text (location names), or other geospatial signals (mobility, climate)? GeoCLIP and successor work attempts this.
  • Temporal scale: how many timestamps to use? 4-12 in current work; longer time series (climate-scale, 30+ year history) may unlock different downstream applications.
  • Compute cost: pretraining a planet-scale foundation model costs $1M+. Distillation and adaptation strategies for downstream users are critical.