Geospatial Foundation Models

The Problem

Remote sensing models for land cover classification, segmentation, and change detection were traditionally trained from scratch for each task, on small labeled datasets and with ImageNet pretraining at best. ImageNet pretraining is a poor match: photographs of cats and cars transfer poorly to overhead views of forests and farms. Meanwhile, the broader ML literature had moved to pretrain-then-fine-tune (BERT, MAE), with large gains. The question: can we build a single pretrained model for satellite imagery that downstream tasks adapt, the way BERT serves NLP?

The Key Insight

Satellite imagery has structure that natural images lack: multiple spectral bands (Sentinel-2 has 13), repeat overpasses (Sentinel-2 revisits a given location roughly every 5 days), and an effectively unlimited unlabeled corpus, since the whole Earth is the training set. Self-supervised pretraining at planet scale, with a masking objective tuned to satellite structure (mask across time, mask across spectral bands), produces an encoder that transfers better than earlier pretraining baselines. SatMAE, for example, reports improvements over prior self-supervised methods of up to 7% on benchmark datasets and up to 14% on downstream land cover classification. This is a “foundation model” in the BERT/CLIP sense: pretrain once on a massive unlabeled corpus, fine-tune everywhere.

Mechanism in Plain English

Pretraining corpus: tens of millions of satellite scenes from Sentinel-2 (multispectral, frequent revisits), Landsat (longer time series), or planet-scale providers. No labels, just imagery.
Pretraining objective: a self-supervised task that exploits satellite structure. Most common is MAE-style masking (mask 75% of patches, reconstruct), with extensions for temporal masking (mask patches across time at the same location) and spectral masking (mask whole spectral groups).
Architecture: ViT-Large or ViT-Huge backbone with patch embeddings extended to handle multi-spectral input (per-band-group embeddings) and temporal input (timestamp embeddings).
Downstream: take the pretrained encoder, add a task-specific head (linear classifier, segmentation decoder, regression head), fine-tune on the small labeled dataset for the specific task (land cover, building footprints, crop yield, etc.).
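
The masking variants described above can be sketched on a toy token grid. This is a minimal illustration, not any specific model's implementation: the shapes, the helper names (`random_mask`, `temporal_mask`, `spectral_mask`, `masked_mse`), and the all-zeros "decoder output" are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy grid: T timestamps, G spectral band groups,
# N spatial patches, D values per token.
T, G, N, D = 3, 3, 16, 8
tokens = rng.normal(size=(T, G, N, D))  # one token per (time, group, patch)

def random_mask(shape, ratio, rng):
    """Boolean mask with round(ratio * size) True entries (True = masked)."""
    size = int(np.prod(shape))
    n_masked = int(round(ratio * size))
    flat = np.zeros(size, dtype=bool)
    flat[rng.choice(size, size=n_masked, replace=False)] = True
    return flat.reshape(shape)

def temporal_mask(T, G, N, ratio, rng):
    """Mask the same spatial patches at every timestamp and band group."""
    patch_mask = random_mask((N,), ratio, rng)
    return np.broadcast_to(patch_mask, (T, G, N)).copy()

def spectral_mask(T, G, N, n_groups, rng):
    """Mask n_groups whole spectral groups at every timestamp and patch."""
    group_mask = np.zeros(G, dtype=bool)
    group_mask[rng.choice(G, size=n_groups, replace=False)] = True
    return np.broadcast_to(group_mask[None, :, None], (T, G, N)).copy()

def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked tokens only."""
    per_token = ((pred - target) ** 2).mean(axis=-1)
    return per_token[mask].mean()

mae_mask = random_mask((T, G, N), 0.75, rng)    # plain MAE-style masking
time_mask = temporal_mask(T, G, N, 0.75, rng)   # same patches hidden at all times
band_mask = spectral_mask(T, G, N, 1, rng)      # one whole band group hidden

# Stand-in for a decoder output; a real model would predict the masked
# tokens from the visible ones.
pred = np.zeros_like(tokens)
loss = masked_mse(pred, tokens, mae_mask)
```

Only the masked positions contribute to the loss, so the encoder gets no credit for copying visible tokens through.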

ASCII Diagram

PRETRAIN (large unlabeled satellite corpus):
                                       
  Multi-spectral, multi-temporal input
            |
  [patches with multi-band + temporal embeddings]
            |
  [MASK 75% across time and spectral groups]
            |
  [ViT-Large or ViT-Huge encoder]
            |
  [Decoder reconstructs masked patches]
            |
  Loss: MSE on masked tokens

FINE-TUNE (small labeled corpus, task-specific):

  Single-task input (e.g., RGB+NIR, single timestamp)
            |
  [pretrained encoder, frozen or LoRA-adapted]
            |
  [task head]   <- trained (plus LoRA adapter weights, if used)
            |
  Output: land cover label / segmentation mask / crop yield
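
The frozen-encoder path in the diagram can be sketched as a linear probe. This is a toy under stated assumptions: `W_enc` is a fixed random projection standing in for a pretrained encoder, the data and labels are synthetic, and the head is a plain logistic regression trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pretrained encoder: a fixed projection
# from 4 input bands (e.g. RGB+NIR) to 16 features. It is never updated.
W_enc = rng.normal(size=(4, 16)) / 2.0

def encode(x):
    return np.tanh(x @ W_enc)

# Synthetic labeled set: 200 "pixels" with 4 band values each and a
# made-up binary label (not real land cover).
X = rng.normal(size=(200, 4))
y = (X[:, 3] > X[:, 0]).astype(float)

feats = encode(X)            # computed once, because the encoder is frozen
w, b = np.zeros(16), 0.0     # task head: the only trainable parameters

def bce(p, y, eps=1e-9):
    """Binary cross-entropy."""
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

W_enc_before = W_enc.copy()
lr = 0.2
losses = []
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))   # head forward pass
    losses.append(bce(p, y))
    grad = p - y                                  # dLoss/dlogit for BCE
    w -= lr * feats.T @ grad / len(y)
    b -= lr * grad.mean()
```

Because the encoder output never changes, features can be cached once and reused across epochs, which is part of why frozen-encoder adaptation is cheap. With LoRA, the adapter weights would join `w` and `b` as trainable parameters.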

What’s Clever

The first recognition: the right inductive bias for satellite imagery is “treat each pixel as a vector through time and through wavelength,” not “treat it like a 3-channel photograph.” Bands are physically meaningful (NIR vs SWIR vs visible), and time is informative (winter vs summer). Pretraining tasks that respect this structure produce better representations than pretraining tasks that ignore it.
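
To make the “pixel as a vector through time and wavelength” view concrete, here is a minimal sketch using NDVI, a standard vegetation index computed as (NIR - red) / (NIR + red). The cube layout, band order, and reflectance values are made up for illustration.

```python
import numpy as np

# Hypothetical data cube: 4 timestamps, 2 bands (red, NIR), 2x2 pixels,
# with reflectance values in [0, 1].
RED, NIR = 0, 1
cube = np.zeros((4, 2, 2, 2))
cube[:, RED] = np.array([0.30, 0.10, 0.05, 0.25])[:, None, None]
cube[:, NIR] = np.array([0.30, 0.40, 0.45, 0.25])[:, None, None]

def ndvi(cube, eps=1e-6):
    """NDVI per timestamp and pixel; eps guards against division by zero."""
    red, nir = cube[:, RED], cube[:, NIR]
    return (nir - red) / (nir + red + eps)

pixel = cube[:, :, 0, 0]        # one pixel as a (time, band) matrix
series = ndvi(cube)[:, 0, 0]    # the same pixel's NDVI traced through time
```

In this synthetic series the index rises and falls across the four timestamps, which is the kind of seasonal signal a 3-channel, single-date view cannot express.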

The second recognition: the Earth is the training set. Unlike NLP corpora that need careful curation, satellite imagery is generated continuously by sensors at planet scale. Sentinel-2 alone produces on the order of terabytes of imagery per day. The pretraining bottleneck is compute, not data.

The third recognition: self-supervised pretraining decouples downstream task development from labeling cost. Labeled satellite data is expensive, because every land cover label needs an expert to look at the imagery. With a strong pretrained encoder, downstream tasks can need 10-100x fewer labels than from-scratch training.

The fourth recognition: the foundation model paradigm transfers. The same recipe that worked for BERT (pretrain on raw text, fine-tune for everything) and CLIP (contrastive pretraining on image-text pairs, zero-shot transfer) works for remote sensing. SatMAE, Prithvi, DOFA, and Scale-MAE all instantiate this recipe with different scaling and masking strategies.

Key Sources

satmae-pretraining-transformers-temporal-multispectral-satellite-imagery — foundational paper for the modern geospatial FM line
mae-masked-autoencoders-scalable-vision-learners — predecessor; SatMAE adapts this to satellite imagery
an-image-is-worth-16x16-words — ViT is the backbone
hidden-markov-map-matching-noise-sparseness
llama-open-efficient-foundation-language-models
sam-2-segment-anything-in-images-and-videos
segment-anything
t2vec-deep-representation-learning-trajectory-similarity

Related Concepts

foundation-models — geospatial FMs are an instance of the broader paradigm
self-supervised-learning — pretraining without labels is the core enabler
masked-image-modeling — the dominant SSL task for vision-style FMs
vision-transformer — the standard backbone
transfer-learning — pretrain-then-fine-tune is a transfer pattern
multimodal-embeddings — modern geospatial work extends to image+text+location

Open Questions

How big is enough? SatMAE used ~1M images, Prithvi ~250M. Where does the scaling curve flatten for satellite imagery? Likely later than for natural images, due to lower per-image semantic density.
Sensor heterogeneity: how to pretrain a model that works across Sentinel-2, Landsat, MODIS, Maxar — each with different bands and resolutions? DOFA introduces dynamic-band embeddings; this is an active area.
Cross-domain alignment: how to align satellite imagery embeddings with ground-level photos (street view), text (location names), or other geospatial signals (mobility, climate)? GeoCLIP, which aligns ground-level images with GPS coordinates, and successor work attempt parts of this.
Temporal scale: how many timestamps to use? 4-12 in current work; longer time series (climate-scale, 30+ year history) may unlock different downstream applications.
Compute cost: pretraining a planet-scale FM can cost $1M or more, so distillation and adaptation strategies for downstream users are critical.
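
One way to picture the sensor-heterogeneity question is a band embedding conditioned on wavelength, so that sensors with different band sets map into the same feature width. The sketch below is a simplified illustration of that idea and is not DOFA's implementation: `W_gen` is random here (it would be learned in practice), and the function names and approximate central wavelengths are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32   # shared embedding width
K = 8    # number of sinusoidal frequencies

# Maps a wavelength code to per-band projection weights.
# Random here for illustration; it would be learned in a real model.
W_gen = rng.normal(size=(2 * K, D)) / np.sqrt(2 * K)

def wavelength_code(wavelengths_um):
    """Sinusoidal code of each band's central wavelength, shape (bands, 2K)."""
    freqs = 2.0 ** np.arange(K)
    angles = np.outer(wavelengths_um, freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def embed_pixels(pixels, wavelengths_um):
    """pixels: (n, bands) -> (n, D), for any number of bands."""
    band_weights = wavelength_code(wavelengths_um) @ W_gen   # (bands, D)
    return pixels @ band_weights / len(wavelengths_um)

# Approximate central wavelengths in micrometers (illustrative values).
s2_bands = np.array([0.49, 0.56, 0.665, 0.842])               # 4-band subset
l8_bands = np.array([0.48, 0.56, 0.655, 0.865, 1.61, 2.2])    # 6-band subset

s2 = embed_pixels(rng.random((5, 4)), s2_bands)
l8 = embed_pixels(rng.random((5, 6)), l8_bands)
```

Both inputs land in the same 32-dimensional space despite having different band counts, which is the property a cross-sensor encoder needs at its input layer.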
