Geospatial AI · 10 December 2024 · 12 min read · Sheece Gardezi

Clay: One Foundation Model for All Earth Observation

Clay replaces dozens of task-specific remote sensing models with a single pre-trained system. Fine-tune for land cover, change detection, or crop classification with 10x less labeled data.

Machine Learning · Remote Sensing · Foundation Models · Clay
[Image: Satellite view of Earth showing cloud patterns and landmasses. Photo: NASA on Unsplash]

Training a building detection model from scratch requires 10,000+ labeled examples and weeks of GPU time. Fine-tuning Clay -- an open-source geospatial foundation model -- requires a few dozen examples and hours. Pre-trained on massive unlabeled satellite archives using masked autoencoding, Clay produces 768-dimensional embeddings that encode spectral, spatial, and temporal patterns across the Earth's surface. Five competing GeoFMs shipped in 2024. Clay's open-source design and deployment focus make it the most practical for production workflows.

The Labeling Bottleneck That GeoFMs Eliminate

Traditional remote sensing ML follows a brittle pattern: collect thousands of labeled examples for a specific task (building detection, crop mapping, land cover), train from scratch, deploy, and discover the model fails in new geographies. A classifier trained on European agricultural fields rarely transfers to African landscapes.

Foundation models invert this. Self-supervised pre-training on massive unlabeled satellite archives produces rich internal representations of spectral, spatial, and seasonal patterns. Fine-tuning for a downstream task requires far less labeled data -- sometimes a few dozen examples -- because the model already understands what Earth looks like.
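To make this concrete, here is a minimal sketch of what "fine-tuning" often amounts to in practice: a linear probe trained on a few dozen precomputed 768-dimensional embeddings. The embeddings and labels below are simulated stand-ins; in a real workflow they would come from the frozen foundation model and a small labeling pass.

```python
# Sketch: a linear probe on precomputed Clay-style embeddings.
# All data is simulated; real embeddings come from the frozen backbone.
import numpy as np

rng = np.random.default_rng(0)
dim = 768
# "A few dozen" labeled tiles: class 0 vs class 1, separated by an offset.
X0 = rng.normal(size=(30, dim))
X1 = rng.normal(size=(30, dim)) + 0.5
X = np.vstack([X0, X1])
y = np.array([0] * 30 + [1] * 30)

# Closed-form ridge-regression probe: the frozen backbone already did the
# hard representational work, so a linear layer on top is often enough.
A = np.hstack([X, np.ones((60, 1))])                      # add bias column
w = np.linalg.solve(A.T @ A + 1e-3 * np.eye(dim + 1), A.T @ y)

preds = (A @ w > 0.5).astype(int)
print(f"training accuracy: {(preds == y).mean():.2f}")
```

The point is the shape of the workflow, not the classifier choice: no gradient updates to the backbone, sixty labels instead of ten thousand.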

GeoFMs offer immediate value without training. They represent an emerging research field and are a type of pre-trained vision transformer specifically adapted to geospatial data sources.
ACM SIGSPATIAL GeoAI 2024

Clay: Open-Source ViT with Masked Autoencoder Pretraining

Clay emerged from the team behind Microsoft's Planetary Computer and operates under Radiant Earth's fiscal sponsorship. The model is fully open-source -- weights, training code, and inference pipeline are all inspectable and modifiable.

The architecture is a Vision Transformer adapted for geospatial and temporal relationships. Training uses a Masked Autoencoder approach: the model predicts masked portions of satellite images, developing robust spectral-spatial feature representations without any labeled data.
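The pretraining objective can be sketched in a few lines: hide most of the image patches, show the encoder only the rest, and ask the decoder to reconstruct what was hidden. The patch count and mask ratio below are illustrative, not Clay's exact configuration.

```python
# Sketch of masked-autoencoder pretraining input: randomly mask 75% of the
# embedded image patches; the model must reconstruct the hidden ones.
import numpy as np

rng = np.random.default_rng(42)
patches = rng.normal(size=(196, 768))          # e.g. a 14x14 patch grid
mask_ratio = 0.75
n_keep = int(patches.shape[0] * (1 - mask_ratio))

perm = rng.permutation(patches.shape[0])
visible_idx = np.sort(perm[:n_keep])           # patches the encoder sees
masked_idx = np.sort(perm[n_keep:])            # patches the decoder predicts

visible = patches[visible_idx]
print(visible.shape)                           # (49, 768)
```

Because the reconstruction target is the imagery itself, no labels are needed at any point in pretraining.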

Clay's Technical Capabilities

  • Multi-spectral input — Works with all Sentinel-2 bands, though commonly uses RGB and NIR
  • Location-aware — Incorporates geographic coordinates as input features
  • Temporal understanding — Processes time series data to understand seasonal patterns
  • 768-dimensional embeddings — Rich representations for downstream tasks
  • Flexible inference — Accepts varying image sizes, resolutions, and band combinations

Five Competing GeoFMs Shipped in 2024

Clay operates in a crowded field. Each model takes a different approach to sensor support, pretraining data, and architecture:

The Competitive Landscape

  • Prithvi-100M (IBM/NASA) — Trained on Harmonized Landsat-Sentinel data, strong on climate applications
  • SatMAE — Pioneering work on masked autoencoders for satellite imagery
  • SpectralGPT — Focuses on hyperspectral data with spectral-aware pretraining
  • DOFA — Dynamic One-For-All architecture for multi-sensor fusion
  • SatVision-Base — Microsoft's contribution optimized for high-resolution imagery

Clay's differentiator is practical deployment and similarity search. Clay-powered systems detect emerging deforestation patterns before they expand -- essentially "reverse image search" for the planet.
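The "reverse image search" pattern reduces to cosine similarity over an archive of per-tile embeddings. The sketch below simulates that archive; in production the vectors would be Clay embeddings stored in a vector index.

```python
# Sketch: similarity search over tile embeddings with cosine similarity.
# The archive is simulated; in practice it holds one Clay embedding per tile.
import numpy as np

rng = np.random.default_rng(1)
archive = rng.normal(size=(10_000, 768))           # one 768-d vector per tile
archive /= np.linalg.norm(archive, axis=1, keepdims=True)

query = rng.normal(size=768)                       # embedding of a known
query /= np.linalg.norm(query)                     # deforestation example

scores = archive @ query                           # cosine similarity
top10 = np.argsort(scores)[::-1][:10]              # most similar tiles
print(top10)
```

At archive scale, the brute-force matrix product would be replaced by an approximate nearest-neighbor index, but the query semantics are the same.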

Limitations: 4-5x Resolution Loss and Multimodal Gaps

GeoFMs are not universally superior. ACM SIGSPATIAL research shows that on multimodal tasks -- fusing satellite imagery with POI data, street-level photos, or tabular attributes -- existing GeoFMs still underperform task-specific models.

Pixel-level precision is the main weakness. Transformer architectures reduce feature resolution 4-5x, sacrificing fine-grained spatial detail needed for precise segmentation or sub-meter change detection. The practical solution: combine GeoFMs with specialized segmentation heads like SAM2 for boundary delineation.
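The resolution loss follows from how transformers consume imagery: patch embedding collapses each pixel patch into a single token, so spatial detail below the patch size is gone before the first attention layer. The tile size, patch size, and band count below are illustrative, not Clay's exact configuration.

```python
# Sketch of where spatial detail goes: patch embedding turns a 256x256 tile
# into a 16x16 grid of tokens, one per 16x16-pixel patch.
import numpy as np

tile = np.zeros((256, 256, 4))              # H x W x bands (e.g. RGB + NIR)
patch = 16
grid = 256 // patch                         # 16 patches per axis

tokens = tile.reshape(grid, patch, grid, patch, 4)
tokens = tokens.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 4)
print(tokens.shape)                         # (256, 1024): 16x16 flat patches
```

A decoder or segmentation head attached downstream can learn to recover sharp boundaries from these coarse features, which is why pairing a GeoFM with a model like SAM2 works in practice.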

LLM + GeoFM: Natural Language Queries Over Satellite Archives

AWS's geospatial FM service already combines Prithvi with Claude for natural language interaction with satellite archives. Development Seed's semantic search using Clay embeddings enables queries like "locations where solar panel installations increased >20% between 2020 and 2024" -- returning coordinates and explanatory analysis grounded in the imagery.

This convergence makes geospatial analysis accessible to domain experts without ML expertise. The GeoFM produces embeddings; the LLM translates between human intent and vector similarity queries.
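One way such a change query can be grounded in embeddings is as arithmetic on per-year vectors: score each tile by how far it moved along a concept direction between two dates, then hand the top candidates to the LLM for explanation. Everything below is simulated; the concept vector is hypothetical and would in practice be derived from a few labeled example tiles.

```python
# Sketch: "where did X increase between 2020 and 2024?" as embedding
# arithmetic. All vectors are simulated stand-ins for Clay embeddings.
import numpy as np

rng = np.random.default_rng(3)
n_tiles, dim = 1_000, 768
emb_2020 = rng.normal(size=(n_tiles, dim))
emb_2024 = emb_2020 + rng.normal(scale=0.1, size=(n_tiles, dim))

# Hypothetical direction in embedding space for the target concept,
# e.g. averaged from a handful of labeled "solar farm" tiles.
concept = rng.normal(size=dim)
concept /= np.linalg.norm(concept)

# Score each tile by how far it moved along the concept direction.
change = (emb_2024 - emb_2020) @ concept
candidates = np.argsort(change)[::-1][:20]   # tiles to pass to the LLM
print(candidates[:5])
```

The division of labor is clean: the GeoFM supplies the vectors, the retrieval layer supplies the ranking, and the LLM supplies the language on either side.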

GeoFMs Augment Domain Expertise; They Don't Replace It

GeoFMs are the most significant shift in satellite analytics since cloud-native formats. Extracting meaningful features from imagery without manual labeling campaigns changes the economics of the entire field.

The teams that succeed will build on Clay while maintaining rigorous validation against ground truth. GeoFMs eliminate the labeling bottleneck, not the need to understand specific geographies, sensor characteristics, and application requirements.
