GeoParquet 2.0: Native Geometry and a 191x Query Speedup
Bbox covering indexes in GeoParquet 2.0 make spatial queries 191x faster. Native GEOMETRY and GEOGRAPHY types eliminate the WKB decode step entirely.
A spatial query against 1.2 billion OpenStreetMap buildings took 8.4 seconds with GeoParquet 1.x. The same query with GeoParquet 2.0's bounding box covering: 0.044 seconds. The 191x speedup comes from one architectural change—Parquet now understands geometry natively, enabling row group pruning that skips 99.9% of data before it touches the wire.
GeoParquet 1.x: Opaque WKB Blobs, Zero Optimization
GeoParquet 1.x stored geometries as Well-Known Binary (WKB) in binary columns. The data was readable, but Parquet treated geometry as an opaque blob. No statistics, no predicate pushdown, no row group pruning. Every spatial query read and parsed all geometry data—full table scan, every time.
GeoParquet 2.0 introduces native GEOMETRY and GEOGRAPHY logical types in the Parquet specification itself. Parquet now understands that a column contains spatial data and can maintain bounding box statistics at the row group level. Query engines use these statistics to skip entire row groups that don't intersect the query region.
GeoParquet 2.0 with bounding box covering transforms spatial queries from full-table scans into reads of only the row groups whose bounds intersect the query region. For global-scale datasets, this is the difference between minutes and milliseconds.
191x: OSM Buildings Benchmark, Cold Cache, S3
The benchmark used 1.2 billion OpenStreetMap building features stored on S3. A bounding box query for San Francisco buildings took 8.4s with GeoParquet 1.x and 0.044s with 2.0's bbox covering. The speedup scales with selectivity—querying a city from a global dataset sees the largest gains because 99.9% of row groups are pruned before any data transfer.
# GeoParquet performance benchmarks
# Dataset: OpenStreetMap buildings (1.2 billion features)
# File size comparison:
# - Shapefile: 480 GB (split into thousands of files)
# - GeoJSON: 890 GB (no compression, huge text overhead)
# - GeoPackage: 320 GB (SQLite-based)
# - GeoParquet: 95 GB (80% smaller than Shapefile)
# Query: Buildings within San Francisco bbox
# Cold cache, data on S3
# Shapefile (ogr2ogr): 45.2 seconds
# GeoPackage (SQLite): 12.8 seconds
# GeoParquet 1.x (no bbox): 8.4 seconds
# GeoParquet 2.0 (bbox): 0.044 seconds (191× faster than 1.x)
# The 191× speedup comes from:
# 1. Row group pruning via bbox statistics (skip 99.9% of data)
# 2. Column pruning (only read geometry + requested attributes)
# 3. Predicate pushdown to storage layer
# 4. Native geometry avoids WKB parsing
# Memory usage for 10M feature query:
# GeoJSON: 24 GB peak
# Shapefile: 18 GB peak
# GeoParquet: 2 GB peak (streaming row groups)
Five Sources of Performance Gain
- 191× faster spatial queries — Bounding box covering skips irrelevant row groups
- 80% smaller files — Columnar compression outperforms Shapefile significantly
- Zero deserialization — Native types avoid WKB parsing overhead
- Parallel reads — Row group structure enables multi-threaded processing
- Cloud-optimized — Range requests fetch only needed data from object storage
GeoArrow: Zero-Copy Geometry Access
GeoArrow is the memory specification underpinning GeoParquet 2.0. Points become coordinate arrays; linestrings become nested coordinate arrays; polygons add ring-level nesting. This maps directly to Parquet's columnar structure, enabling zero-copy access—geometry operations work on Parquet's memory layout without WKB deserialization.
# GeoParquet 2.0 native types vs 1.x WKB encoding
# GeoParquet 1.x: Geometry stored as WKB (Well-Known Binary)
# - Opaque binary blob to Parquet
# - No statistics possible
# - Requires parsing for any operation
# GeoParquet 2.0: Native geometry logical type
# - Parquet understands geometry structure
# - Bounding box in column statistics
# - Zero-copy access to coordinates
import pyarrow as pa
# GeoParquet 2.0 schema with native geometry (GeoArrow encoding)
# geoarrow.point stores coordinates as a fixed-size list of two
# float64s; the extension name and CRS travel as field-level metadata.
schema = pa.schema([
    pa.field('id', pa.int64()),
    pa.field('name', pa.string()),
    pa.field(
        'geometry',
        pa.list_(pa.float64(), 2),  # fixed-size list: [x, y] per point
        metadata={
            'ARROW:extension:name': 'geoarrow.point',
            'ARROW:extension:metadata':
                '{"crs": {"type": "GeographicCRS", "datum": "WGS84"}}'
        }
    )
])
# Encoding options for different geometry types:
# - geoarrow.point: Coordinate arrays
# - geoarrow.linestring: Nested coordinate arrays
# - geoarrow.polygon: Multi-nested with rings
# - geoarrow.multipoint, multilinestring, multipolygon
# - geoarrow.geometry: Mixed geometry types (WKB fallback)
The GEOGRAPHY type deserves special mention. While GEOMETRY uses planar Cartesian coordinates, GEOGRAPHY represents points on an ellipsoid (typically WGS84). Distance calculations account for Earth's curvature, making global-scale analysis correct by default. This distinction, common in PostGIS, now exists natively in Parquet.
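A back-of-the-envelope comparison shows why the distinction matters. The sketch below uses the haversine formula with a mean Earth radius (a spherical simplification, not the full ellipsoidal math a GEOGRAPHY implementation would use) to measure one degree of longitude at different latitudes:

```python
import math

def haversine_m(lon1, lat1, lon2, lat2, radius=6371008.8):
    """Great-circle distance in meters on a sphere (mean Earth radius)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))

equator = haversine_m(0, 0, 1, 0)   # ~111.2 km
at_60n = haversine_m(0, 60, 1, 60)  # ~55.6 km: same "degree", half the meters
```

A planar GEOMETRY treating degrees as Cartesian units would call both distances equal; GEOGRAPHY gets the factor-of-two difference right without any work from the analyst.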
GeoParquet 2.0 Features
Native GEOMETRY type
First-class geometry encoding in Parquet logical types
Native GEOGRAPHY type
Spherical geometry support for global-scale analysis
Bounding box covering
Spatial indexes via bbox column statistics for 191× faster queries
Iceberg 3 integration
Full geometry support in Apache Iceberg table format
Multi-CRS support
Per-column coordinate reference systems with PROJJSON
Geometry statistics
Row group level spatial bounds for partition pruning
Reading: GeoPandas, DuckDB, GDAL 3.5+
GeoPandas, DuckDB, GDAL, BigQuery, and Snowflake all read and write GeoParquet. The critical detail: ensure your tool version supports GeoParquet 2.0's bbox covering. Without it, queries still work but fall back to full scans.
import geopandas as gpd
import duckdb
# GeoPandas: Read GeoParquet with spatial filtering
gdf = gpd.read_parquet(
"s3://open-data/buildings.parquet",
bbox=(-122.5, 37.7, -122.3, 37.9), # San Francisco bbox
columns=["geometry", "height", "type"]
)
# DuckDB: Native GeoParquet with SQL
conn = duckdb.connect()
conn.execute("INSTALL spatial; LOAD spatial;")
# Bounding box filter uses covering index (191× speedup)
result = conn.execute("""
SELECT *
FROM read_parquet('buildings.parquet')
WHERE ST_Intersects(
geometry,
ST_GeomFromText('POLYGON((-122.5 37.7, -122.3 37.7,
-122.3 37.9, -122.5 37.9,
-122.5 37.7))')
)
""").fetchdf()
# GeoParquet 2.0: Native geometry means no WKB parsing
# Statistics in footer enable row group pruning
DuckDB's spatial extension is particularly impressive. The query planner understands GeoParquet's spatial statistics and automatically applies predicate pushdown. A spatial intersects query against a 100GB file on S3 may only fetch a few megabytes of data—the bbox statistics identify which row groups are relevant.
Writing: Enable write_covering_bbox or Lose the 191x
GeoParquet 2.0 requires explicit write_covering_bbox=True to store bounding boxes in row group metadata. Without this flag, files are valid GeoParquet but fall back to full scans—no spatial indexing, no row group pruning.
import geopandas as gpd
from shapely.geometry import Point
import pyarrow as pa
import pyarrow.parquet as pq
# Create GeoDataFrame
gdf = gpd.GeoDataFrame({
'name': ['Location A', 'Location B', 'Location C'],
'value': [100, 200, 300],
'geometry': [Point(-122.4, 37.8), Point(-122.3, 37.7), Point(-122.5, 37.9)]
}, crs="EPSG:4326")
# Write GeoParquet 2.0 with bbox covering
gdf.to_parquet(
"output.parquet",
engine="pyarrow",
compression="zstd",
# GeoParquet 2.0 options
schema_version="2.0.0",
write_covering_bbox=True, # Enable spatial index
geometry_encoding="geoarrow" # Native geometry encoding
)
# Verify spatial metadata
pq_file = pq.ParquetFile("output.parquet")
geo_metadata = pq_file.schema_arrow.metadata[b'geo']
print(geo_metadata)  # Shows bbox, CRS, geometry type
Compression choice matters for GeoParquet. Zstandard (zstd) typically achieves the best compression ratios for coordinate data while maintaining fast decompression. For highly repetitive data (many similar geometries), dictionary encoding provides additional benefits.
Iceberg 3: ACID Transactions on GeoParquet
Apache Iceberg 3 adds native geometry and geography types, using GeoParquet 2.0 as the underlying storage format. Iceberg layers ACID transactions, schema evolution, and time travel on top of GeoParquet's storage efficiency.
-- Apache Iceberg 3 with GeoParquet geometry support
-- Create Iceberg table with geometry column
CREATE TABLE buildings (
id BIGINT,
name STRING,
height DOUBLE,
footprint GEOMETRY, -- Native geometry type
location GEOGRAPHY -- Spherical geography type
)
USING iceberg
PARTITIONED BY (truncate(location, 6)) -- Illustrative spatial partitioning
TBLPROPERTIES (
'write.parquet.geometry.covering.bbox' = 'true'
);
-- Insert with geometry literals
INSERT INTO buildings VALUES (
1,
'Empire State Building',
443.2,
ST_GeomFromText('POLYGON((...))'),
ST_GeogFromText('POINT(-73.9857 40.7484)')
);
-- Spatial query with partition pruning
SELECT name, height
FROM buildings
WHERE ST_DWithin(
location,
ST_GeogFromText('POINT(-73.98 40.75)'),
1000 -- 1km radius
);
-- Query planner uses bbox statistics to skip irrelevant files
The combination is powerful for data engineering teams. Iceberg handles the complexity of data lake management—file compaction, partition evolution, concurrent writes—while GeoParquet provides optimal storage and query performance for spatial data. This is the architecture replacing enterprise geodatabases in modern stacks.
Ecosystem: 6 Major Platforms Ship GeoParquet Support
- DuckDB — Full read/write with spatial functions via extension
- GeoPandas — Native GeoParquet I/O with pyarrow backend
- GDAL 3.5+ — GeoParquet driver for interoperability
- Apache Sedona — Spark-based distributed spatial processing
- BigQuery — Native GeoParquet loading and export
- Snowflake — GeoParquet support in Geospatial features
Cloud data warehouses have moved quickly to support GeoParquet. BigQuery and Snowflake both load GeoParquet natively, making it the preferred format for spatial data ingestion. The days of converting to proprietary formats or using legacy Shapefiles for data interchange are ending.
Migration: Backward Compatible, Prioritize by Dataset Size
Existing GeoParquet 1.x files continue to work—readers are fully backward compatible. The question is which files to regenerate with 2.0 features first.
Prioritize for: Large datasets (100GB+), frequently queried with spatial filters, stored on object storage where read amplification is costly. The bbox covering optimization provides the most benefit here.
Lower priority: Small datasets, full-table scans, local disk storage. GeoParquet 1.x is already fast for these cases; the upgrade provides marginal benefit.
Watch for: Tool version requirements. GeoPandas 0.14+, DuckDB 0.9+, and GDAL 3.8+ support GeoParquet 2.0 features. Older versions may read files but won't leverage bbox statistics.
GeoParquet 2.0 Completes the Cloud-Native Geospatial Stack
GeoParquet 2.0 joins Zarr (multi-dimensional arrays), COG (raster imagery), and PMTiles (vector tiles) to form a complete, open, cloud-optimized format set for every geospatial data type. All four formats support HTTP range requests, work with common tools, and store on commodity object storage.
The 191x query speedup matters, but the architectural impact is larger. Spatial analysis that required PostGIS or Enterprise Geodatabase now runs on commodity data lake infrastructure. DuckDB queries GeoParquet on S3 as fast as PostGIS queries local tables—at pennies per gigabyte of storage.
Adopt GeoParquet 2.0 as your default vector data format. Convert legacy archives opportunistically—largest and most-queried files first—and ensure new pipelines output GeoParquet 2.0 with write_covering_bbox=True. The tooling is mature, the ecosystem is aligned, and the performance gap is too large to ignore.
For teams building spatial data platforms, GeoParquet + Iceberg provides enterprise geodatabase capabilities—ACID transactions, schema evolution, time travel—with cloud-native performance. This combination is replacing ArcGIS Enterprise and Oracle Spatial in modern geospatial architectures.
References & Further Reading
GeoParquet Specification 2.0
Official GeoParquet 2.0 specification
https://github.com/opengeospatial/geoparquet/blob/d727b4cd568651911860fec013982a06c353b9a0/format-specs/geoparquet.md
Native Geometry Types in Parquet
Apache Arrow geometry encoding specification
https://arrow.apache.org/docs/format/Columnar.html#geometry-types
Cloud-Native Geospatial Format Guide: GeoParquet
Comprehensive GeoParquet guide from Cloud-Native Geo Foundation
https://guide.cloudnativegeo.org/geoparquet/
Apache Iceberg 3 Geometry Support
Iceberg table format specification with geometry types
https://iceberg.apache.org/spec/
DuckDB Spatial Extension
DuckDB's GeoParquet support documentation
https://duckdb.org/docs/extensions/spatial.html
GeoParquet Performance Benchmarks
Performance comparisons vs Shapefile and GeoJSON
https://cloudnativegeo.org/blog/geoparquet-benchmarks/