Clay: One Foundation Model for All Earth Observation
Clay replaces dozens of task-specific remote sensing models with a single pre-trained system. Fine-tune for land cover, change detection, or crop classification with 10x less labeled data.
Training a building detection model from scratch requires 10,000+ labeled examples and weeks of GPU time. Fine-tuning Clay -- an open-source geospatial foundation model -- requires a few dozen examples and hours. Pre-trained on massive unlabeled satellite archives using masked autoencoding, Clay produces 768-dimensional embeddings that encode spectral, spatial, and temporal patterns across the Earth's surface. Five competing GeoFMs shipped in 2024. Clay's open-source design and deployment focus make it the most practical for production workflows.
The Labeling Bottleneck That GeoFMs Eliminate
Traditional remote sensing ML follows a brittle pattern: collect thousands of labeled examples for a specific task (building detection, crop mapping, land cover), train from scratch, deploy, and discover the model fails in new geographies. A classifier trained on European agricultural fields rarely transfers to African landscapes.
Foundation models invert this. Self-supervised pre-training on massive unlabeled satellite archives produces rich internal representations of spectral, spatial, and seasonal patterns. Fine-tuning for a downstream task requires far less labeled data -- sometimes a few dozen examples -- because the model already understands what Earth looks like.
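As a sketch of that workflow: suppose you have already run the Clay encoder over a few dozen labeled image chips and stored one 768-dimensional embedding per chip. A simple linear probe is then often sufficient, because the foundation model has already done the heavy feature extraction. The arrays below are synthetic stand-ins, not real Clay outputs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for Clay embeddings: in practice you would run the Clay
# encoder over image chips to get one 768-dim vector per chip.
# Two loosely separable synthetic classes keep the sketch runnable.
n_per_class, dim = 30, 768
class_a = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, dim))
class_b = rng.normal(loc=0.5, scale=1.0, size=(n_per_class, dim))
X = np.vstack([class_a, class_b])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Linear probe: fit only a classification head on frozen embeddings.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(f"training accuracy: {clf.score(X, y):.2f}")
```

A frozen-encoder linear probe is the cheapest fine-tuning strategy; unfreezing encoder layers buys accuracy at the cost of more labels and compute.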
GeoFMs are pre-trained vision transformers adapted to geospatial data sources, and they remain an active research area. Even without any fine-tuning, their embeddings offer immediate value for tasks like similarity search and clustering.

Clay: Open-Source ViT with Masked Autoencoder Pretraining
Clay emerged from the team behind Microsoft's Planetary Computer and operates under Radiant Earth's fiscal sponsorship. The model is fully open-source -- weights, training code, and inference pipeline are all inspectable and modifiable.
The architecture is a Vision Transformer adapted for geospatial and temporal relationships. Training uses a Masked Autoencoder approach: the model predicts masked portions of satellite images, developing robust spectral-spatial feature representations without any labeled data.
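The masking step at the heart of MAE pretraining can be sketched in NumPy on a toy multi-band chip. The patch size, band count, and mask ratio below are illustrative defaults, not Clay's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in for one satellite chip: 4 spectral bands, 64x64 pixels,
# tokenized into 8x8-pixel patches (the ViT patchification step).
bands, size, patch = 4, 64, 8
chip = rng.normal(size=(bands, size, size))
n_patches = (size // patch) ** 2  # 64 patches per chip

# MAE hides most patches (commonly ~75%) and trains the model to
# reconstruct them from the visible remainder -- no labels required.
mask_ratio = 0.75
n_masked = int(n_patches * mask_ratio)
perm = rng.permutation(n_patches)
masked_idx, visible_idx = perm[:n_masked], perm[n_masked:]

# Flatten each patch into a token of length bands * patch * patch.
tokens = (
    chip.reshape(bands, size // patch, patch, size // patch, patch)
    .transpose(1, 3, 0, 2, 4)
    .reshape(n_patches, -1)
)
visible_tokens = tokens[visible_idx]  # only these reach the encoder
print(visible_tokens.shape)  # (16, 256)
```

Only the visible tokens pass through the encoder, which is what makes MAE pretraining cheap relative to processing every patch.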
Clay's Technical Capabilities
- Multi-spectral input: works with all Sentinel-2 bands, though it is commonly used with RGB and NIR
- Location-aware: incorporates geographic coordinates as input features
- Temporal understanding: processes time series data to capture seasonal patterns
- 768-dimensional embeddings: rich representations for downstream tasks
- Flexible inference: accepts varying image sizes, resolutions, and band combinations
Five Competing GeoFMs Shipped in 2024
Clay operates in a crowded field. Each model takes a different approach to sensor support, pretraining data, and architecture:
The Competitive Landscape
- Prithvi-100M (IBM/NASA) — Trained on Harmonized Landsat-Sentinel data, strong on climate applications
- SatMAE — Pioneering work on masked autoencoders for satellite imagery
- SpectralGPT — Focuses on hyperspectral data with spectral-aware pretraining
- DOFA — Dynamic One-For-All architecture for multi-sensor fusion
- SatVision-Base — Microsoft's contribution optimized for high-resolution imagery
Clay's differentiators are practical deployment and embedding-based similarity search. Because similar scenes land near each other in embedding space, Clay-powered systems can detect emerging deforestation patterns before they expand -- essentially "reverse image search" for the planet.
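That similarity search can be sketched with plain NumPy: normalize each tile embedding to unit length, then rank tiles by dot product with a query embedding. The index below is random stand-in data, not real Clay embeddings; in production the vectors would come from running Clay over an archive and live in a vector database:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical embedding index: one 768-dim vector per satellite tile.
n_tiles, dim = 1000, 768
index = rng.normal(size=(n_tiles, dim))
index /= np.linalg.norm(index, axis=1, keepdims=True)

def top_k_similar(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k tiles most similar to the query."""
    q = query / np.linalg.norm(query)
    scores = index @ q  # cosine similarity on unit vectors
    return np.argsort(scores)[::-1][:k]

# "Reverse image search": embed a known deforestation tile, then
# retrieve the tiles whose embeddings sit closest to it.
query = index[123] + 0.05 * rng.normal(size=dim)
print(top_k_similar(query))  # tile 123 should rank first
```

At archive scale the brute-force dot product gives way to an approximate nearest-neighbor index, but the ranking logic is the same.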
Limitations: 4-5x Resolution Loss and Multimodal Gaps
GeoFMs are not universally superior. ACM SIGSPATIAL research shows that on multimodal tasks -- fusing satellite imagery with POI data, street-level photos, or tabular attributes -- existing GeoFMs still underperform task-specific models.
Pixel-level precision is the main weakness. Transformer architectures reduce feature resolution 4-5x, sacrificing fine-grained spatial detail needed for precise segmentation or sub-meter change detection. The practical solution: combine GeoFMs with specialized segmentation heads like SAM2 for boundary delineation.
LLM + GeoFM: Natural Language Queries Over Satellite Archives
AWS's geospatial FM service already combines Prithvi with Claude for natural language interaction with satellite archives. Development Seed's semantic search using Clay embeddings enables queries like "locations where solar panel installations increased >20% between 2020 and 2024" -- returning coordinates and explanatory analysis grounded in the imagery.
This convergence makes geospatial analysis accessible to domain experts without ML expertise. The GeoFM produces embeddings; the LLM translates between human intent and vector similarity queries.
GeoFMs Augment Domain Expertise, They Don't Replace It
GeoFMs are the most significant shift in satellite analytics since cloud-native formats. Extracting meaningful features from imagery without manual labeling campaigns changes the economics of the entire field.
The teams that succeed will build on Clay while maintaining rigorous validation against ground truth. GeoFMs eliminate the labeling bottleneck, not the need to understand specific geographies, sensor characteristics, and application requirements.
References & Further Reading
- Clay Foundation Model Documentation: official model documentation and API reference. https://clay-foundation.github.io/model/index.html
- Using Foundation Models for Earth Observation: Development Seed's guide to GeoFM applications. https://developmentseed.org/blog/2024-11-01-geofm/
- On the Opportunities and Challenges of Foundation Models for GeoAI: comprehensive academic review of GeoFM capabilities and limitations. https://arxiv.org/abs/2304.06798
- Revolutionizing Earth Observation with Geospatial Foundation Models on AWS: AWS implementation guide for production GeoFM deployment. https://aws.amazon.com/blogs/machine-learning/revolutionizing-earth-observation-with-geospatial-foundation-models-on-aws/
- GeoAI Unpacked: EO Foundation Models: practical overview of the GeoFM ecosystem. https://geoaiunpacked.substack.com/p/geoai-unpacked-1-eo-foundation-models