UncoverML Geoscience Pipeline
Deployed an ML pipeline on Australia's Gadi supercomputer that cut continental-scale geological analysis from 3 weeks to 11 hours by distributing workloads over 480 cores on 20 compute nodes.
Client
Geoscience Australia
Key Results
Cores Utilized: 480
Analysis Time: 11 hours (down from 3 weeks)
Research Teams Using Nationally: 6
Lines Migrated to Py3: 2,800
The Challenge
Geoscience Australia's researchers had continental-scale raster datasets (100+ GB per analysis) but no way to run ML models on them at scale. Single-machine runs took 3 weeks and frequently crashed, blocking mineral exploration and geological mapping projects.
Key challenges included:
- Single-machine runs on 100+ GB raster datasets crashing after days of processing
- NCI Gadi environment requiring PBS job scheduling and MPI configuration
- No existing tooling to distribute geoscience ML across 20+ compute nodes
- Researchers needing results integrated with existing GDAL/Rasterio workflows
- Legacy Python 2 codebase (2,800 lines) blocking compatibility with modern ML libraries
National Computational Infrastructure
NCI's Gadi supercomputer delivers 9 petaflops of computing power to Australian researchers. Making geoscience ML workflows run efficiently on Gadi enabled analyses that could not complete on desktop hardware.
Gadi Supercomputer
Multi-petaflop computing power
Massive Datasets
Petabytes of geoscience data
Our Solution
A continental-scale analysis that used to take 3 weeks now completes in 11 hours across 20 nodes. Six research teams use the pipeline nationally, and the Python 3 migration unlocked access to modern scikit-learn, PyTorch, and XGBoost models.
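As a quick check on the headline figures, the arithmetic below converts the 3-week baseline to hours and derives the end-to-end speedup and the cores per node. It uses only numbers quoted on this page. The baseline machine's core count is not stated, so the speedup should not be read as a parallel-efficiency figure.

```python
# Figures quoted in this case study.
BASELINE_WEEKS = 3
PIPELINE_HOURS = 11
TOTAL_CORES = 480
NODES = 20

baseline_hours = BASELINE_WEEKS * 7 * 24      # 3 weeks expressed in hours
speedup = baseline_hours / PIPELINE_HOURS     # end-to-end wall-clock speedup
cores_per_node = TOTAL_CORES // NODES         # cores used on each node

print(f"baseline: {baseline_hours} h")
print(f"speedup: {speedup:.1f}x")
print(f"cores per node: {cores_per_node}")
```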
MPI Distribution
Workloads distributed across 20 nodes and 480 cores via MPI, achieving near-linear scaling for feature extraction.
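One common way to spread independent work items over MPI ranks is a contiguous block partition, sketched below. This is an illustration under stated assumptions, not the pipeline's actual scheme: the function name `block_range` and the tile count are hypothetical, and in a real MPI program the rank and worker count would come from the communicator (for example `MPI.COMM_WORLD.Get_rank()` and `Get_size()` in mpi4py).

```python
def block_range(n_items, n_workers, rank):
    """Return the half-open [start, stop) range of items owned by `rank`.

    The first `n_items % n_workers` ranks receive one extra item, so block
    sizes differ by at most one.
    """
    if not 0 <= rank < n_workers:
        raise ValueError("rank must be in [0, n_workers)")
    base, extra = divmod(n_items, n_workers)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Hypothetical workload: 10,000 raster tiles shared among 480 ranks.
N_TILES, N_RANKS = 10_000, 480
ranges = [block_range(N_TILES, N_RANKS, r) for r in range(N_RANKS)]
sizes = [stop - start for start, stop in ranges]
```

Because every rank can compute its own range from three integers, no communication is needed to decide who processes what, which is one reason embarrassingly parallel stages such as feature extraction tend to scale well.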
Feature Extraction
Scalable raster pipeline processing 100+ GB datasets in parallel, extracting 47 geophysical features per grid cell.
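Rasters of this size are usually processed in windows rather than loaded whole. The sketch below generates non-overlapping windows that cover a raster and computes the shape of the per-window feature matrix. The tile size, raster dimensions, and the (cells, features) layout are assumptions chosen for illustration; only the figure of 47 features per cell comes from this page.

```python
N_FEATURES = 47  # features per grid cell, as quoted above


def iter_windows(height, width, tile):
    """Yield (row_off, col_off, n_rows, n_cols) windows covering the raster."""
    if tile <= 0:
        raise ValueError("tile must be positive")
    for row_off in range(0, height, tile):
        n_rows = min(tile, height - row_off)      # last row of tiles may be short
        for col_off in range(0, width, tile):
            n_cols = min(tile, width - col_off)   # last column of tiles may be narrow
            yield row_off, col_off, n_rows, n_cols


def feature_matrix_shape(window, n_features=N_FEATURES):
    """Shape of a (cells, features) matrix for one window (assumed layout)."""
    _, _, n_rows, n_cols = window
    return n_rows * n_cols, n_features


# Hypothetical raster of 2500 x 4000 cells, read in 1024-cell tiles.
windows = list(iter_windows(height=2500, width=4000, tile=1024))
covered_cells = sum(w[2] * w[3] for w in windows)
```

With Rasterio, each tuple would map onto a `rasterio.windows.Window(col_off, row_off, width, height)` passed to a windowed read, so only one tile needs to be held in memory at a time.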
Hyperparameter Optimization
Grid search across 480 cores evaluated 1,200+ parameter combinations in hours instead of weeks.
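A grid search of this kind can be sketched as expanding a parameter grid into individual combinations and dealing them out to workers. The parameter names and values below are hypothetical (the real search space is not listed here) and are sized to yield 1,200 combinations; the round-robin assignment is one simple option, not necessarily the one the pipeline uses.

```python
from itertools import product

# Hypothetical search space: 10 * 8 * 5 * 3 = 1,200 combinations.
PARAM_GRID = {
    "n_estimators": [50, 100, 200, 300, 400, 500, 600, 700, 800, 1000],
    "max_depth": [4, 6, 8, 10, 12, 16, 20, 24],
    "min_samples_leaf": [1, 2, 5, 10, 20],
    "max_features": [0.3, 0.5, 1.0],
}


def expand_grid(grid):
    """Return every combination in the grid as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def share_for(combos, rank, n_workers):
    """Round-robin assignment: rank r takes combos r, r + n_workers, ..."""
    return combos[rank::n_workers]


N_WORKERS = 480
combos = expand_grid(PARAM_GRID)
per_rank = [share_for(combos, r, N_WORKERS) for r in range(N_WORKERS)]
```

With 1,200 combinations over 480 workers, each worker fits only two or three models, which is how a search that is impractical serially can finish in hours.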
Prediction Mapping
Continental-scale prediction maps generated in 11 hours, producing GeoTIFF outputs compatible with standard GIS tools.
Python 3 Migration
Full port of 2,800 lines from Python 2 to 3, unlocking compatibility with modern ML libraries and long-term support.
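For readers unfamiliar with what such a port involves, the sketch below shows three categories of change that commonly appear in Python 2 to 3 migrations: integer division, dictionary iteration, and the bytes/text split. The functions are hypothetical illustrations and are not taken from the UncoverML codebase.

```python
def tiles_per_worker(n_tiles, n_workers):
    # Python 2: `n_tiles / n_workers` floored for ints.
    # Python 3: `/` is true division, so floor division must be explicit.
    return n_tiles // n_workers


def format_summary(scores):
    # Python 2: `scores.iteritems()` plus the `print` statement.
    # Python 3: `.items()` returns a view and print is a function.
    return ", ".join(f"{name}={value:.2f}" for name, value in sorted(scores.items()))


def read_header(raw):
    # Python 2 mixed bytes and text freely; Python 3 needs an explicit decode.
    return raw.decode("utf-8").strip()


print(tiles_per_worker(10, 4))
print(format_summary({"r2": 0.912, "mse": 0.1}))
print(read_header(b" band_1\n"))
```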
scikit-learn Integration
Pluggable model interface supporting Random Forest, Gradient Boosting, and SVR, selectable via config file.
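A pluggable interface like this is often built as a registry that maps a config name to a model class. The sketch below is one minimal way to do it and rests on several assumptions: the config is JSON with `model` and `params` keys, and the names `MODEL_REGISTRY`, `resolve`, and `build_model` are hypothetical. The scikit-learn classes named in the registry are real, and they are imported lazily so that only the selected model's module is loaded.

```python
import importlib
import json

# Hypothetical registry: config name -> "module:ClassName".
MODEL_REGISTRY = {
    "random_forest": "sklearn.ensemble:RandomForestRegressor",
    "gradient_boosting": "sklearn.ensemble:GradientBoostingRegressor",
    "svr": "sklearn.svm:SVR",
}


def resolve(path):
    """Import and return the object named by a 'module:attribute' string."""
    module_name, _, attr = path.partition(":")
    return getattr(importlib.import_module(module_name), attr)


def build_model(config_text, registry=MODEL_REGISTRY):
    """Instantiate the model selected by a JSON config (assumed format)."""
    config = json.loads(config_text)
    name = config["model"]
    if name not in registry:
        raise KeyError(f"unknown model {name!r}; choose from {sorted(registry)}")
    model_class = resolve(registry[name])
    return model_class(**config.get("params", {}))
```

With scikit-learn installed, `build_model('{"model": "svr", "params": {"C": 10.0}}')` would return an `SVR` instance. Adding a new model type then means adding one registry entry rather than editing the training code.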
Project Impact
Research Acceleration
- Continental-scale analysis reduced from 3 weeks to 11 hours
- Enabled 100+ GB analyses that previously crashed on desktop hardware
- Adopted by 6 research teams across Geoscience Australia and universities
- Directly supports mineral prospectivity mapping and geological surveys
Open Source Contribution
- All improvements contributed upstream to the open-source UncoverML project
- Python 3 migration adopted by the broader research community
- Full documentation and PBS job templates enabling reproducible analyses
- Pipeline now used as the foundation for 3 published geoscience papers