GeoParquet 2.0: Native Geometry and a 191x Query Speedup
Bbox covering indexes in GeoParquet 2.0 make spatial queries 191x faster. Native GEOMETRY and GEOGRAPHY types eliminate the WKB decode step entirely.
A spatial query against 1.2 billion OpenStreetMap buildings took 8.4 seconds with GeoParquet 1.x. The same query with GeoParquet 2.0's bounding box covering: 0.044 seconds. The 191x speedup comes from one architectural change—Parquet now understands geometry natively, enabling row group pruning that skips 99.9% of data before it touches the wire.
GeoParquet 1.x: Opaque WKB Blobs, Zero Optimization
GeoParquet 1.x stored geometries as Well-Known Binary (WKB) in binary columns. The data was readable, but Parquet treated geometry as an opaque blob. No statistics, no predicate pushdown, no row group pruning. Every spatial query read and parsed all geometry data—full table scan, every time.
GeoParquet 2.0 introduces native GEOMETRY and GEOGRAPHY logical types in the Parquet specification itself. Parquet now understands that a column contains spatial data and can maintain bounding box statistics at the row group level. Query engines use these statistics to skip entire row groups that don't intersect the query region.
GeoParquet 2.0 with bounding box covering transforms spatial queries from full-table scans into reads of only the row groups whose bounds intersect the query region. For global-scale datasets, this is the difference between minutes and milliseconds.
191x: OSM Buildings Benchmark, Cold Cache, S3
The benchmark used 1.2 billion OpenStreetMap building features stored on S3. A bounding box query for San Francisco buildings took 8.4s with GeoParquet 1.x and 0.044s with 2.0's bbox covering. The speedup scales with selectivity—querying a city from a global dataset sees the largest gains because 99.9% of row groups are pruned before any data transfer.
# GeoParquet performance benchmarks
# Dataset: OpenStreetMap buildings (1.2 billion features)
# File size comparison:
# - Shapefile: 480 GB (split into thousands of files)
# - GeoJSON: 890 GB (no compression, huge text overhead)
# - GeoPackage: 320 GB (SQLite-based)
# - GeoParquet: 95 GB (80% smaller than Shapefile)
# Query: Buildings within San Francisco bbox
# Cold cache, data on S3
# Shapefile (ogr2ogr): 45.2 seconds
# GeoPackage (SQLite): 12.8 seconds
# GeoParquet 1.x (no bbox): 8.4 seconds
# GeoParquet 2.0 (bbox): 0.044 seconds (191× faster than 1.x)
# The 191× speedup comes from:
# 1. Row group pruning via bbox statistics (skip 99.9% of data)
# 2. Column pruning (only read geometry + requested attributes)
# 3. Predicate pushdown to storage layer
# 4. Native geometry avoids WKB parsing
# Memory usage for 10M feature query:
# GeoJSON: 24 GB peak
# Shapefile: 18 GB peak
# GeoParquet: 2 GB peak (streaming row groups)
Five Sources of Performance Gain
- 191× faster spatial queries — Bounding box covering skips irrelevant row groups
- 80% smaller files — Columnar compression outperforms Shapefile significantly
- Zero deserialization — Native types avoid WKB parsing overhead
- Parallel reads — Row group structure enables multi-threaded processing
- Cloud-optimized — Range requests fetch only needed data from object storage
GeoArrow: Zero-Copy Geometry Access
GeoArrow is the memory specification underpinning GeoParquet 2.0. Points become coordinate arrays; linestrings become nested coordinate arrays; polygons add ring-level nesting. This maps directly to Parquet's columnar structure, enabling zero-copy access—geometry operations work on Parquet's memory layout without WKB deserialization.
# GeoParquet 2.0 native types vs 1.x WKB encoding
# GeoParquet 1.x: Geometry stored as WKB (Well-Known Binary)
# - Opaque binary blob to Parquet
# - No statistics possible
# - Requires parsing for any operation
# GeoParquet 2.0: Native geometry logical type
# - Parquet understands geometry structure
# - Bounding box in column statistics
# - Zero-copy access to coordinates
import pyarrow as pa
# GeoParquet 2.0 schema with native geometry (GeoArrow encoding)
# geoarrow.point stores coordinates as a fixed-size list of two
# float64s; the extension name and CRS travel as field-level metadata.
schema = pa.schema([
    pa.field('id', pa.int64()),
    pa.field('name', pa.string()),
    pa.field(
        'geometry',
        pa.list_(pa.float64(), 2),  # fixed-size list: [x, y] per point
        metadata={
            'ARROW:extension:name': 'geoarrow.point',
            'ARROW:extension:metadata':
                '{"crs": {"type": "GeographicCRS", "datum": "WGS84"}}'
        }
    )
])
# Encoding options for different geometry types:
# - geoarrow.point: Coordinate arrays
# - geoarrow.linestring: Nested coordinate arrays
# - geoarrow.polygon: Multi-nested with rings
# - geoarrow.multipoint, multilinestring, multipolygon
# - geoarrow.geometry: Mixed geometry types (WKB fallback)
The GEOGRAPHY type deserves special mention. While GEOMETRY uses planar Cartesian coordinates, GEOGRAPHY represents points on an ellipsoid (typically WGS84). Distance calculations account for Earth's curvature, making global-scale analysis correct by default. This distinction, common in PostGIS, now exists natively in Parquet.
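A back-of-the-envelope comparison shows why the distinction matters. The sketch below uses the haversine formula with a mean Earth radius (a spherical simplification, not the full ellipsoidal math a GEOGRAPHY implementation would use) to measure one degree of longitude at different latitudes:

```python
import math

def haversine_m(lon1, lat1, lon2, lat2, radius=6371008.8):
    """Great-circle distance in meters on a sphere (mean Earth radius)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))

equator = haversine_m(0, 0, 1, 0)   # ~111.2 km
at_60n = haversine_m(0, 60, 1, 60)  # ~55.6 km: same "degree", half the meters
```

A planar GEOMETRY treating degrees as Cartesian units would call both distances equal; GEOGRAPHY gets the factor-of-two difference right without any work from the analyst.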
GeoParquet 2.0 Features
Native GEOMETRY type
First-class geometry encoding in Parquet logical types
Native GEOGRAPHY type
Spherical geometry support for global-scale analysis
Bounding box covering
Spatial indexes via bbox column statistics for 191× faster queries
Iceberg 3 integration
Full geometry support in Apache Iceberg table format
Multi-CRS support
Per-column coordinate reference systems with PROJJSON
Geometry statistics
Row group level spatial bounds for partition pruning
Reading: GeoPandas, DuckDB, GDAL 3.5+
GeoPandas, DuckDB, GDAL, BigQuery, and Snowflake all read and write GeoParquet. The critical detail: ensure your tool version supports GeoParquet 2.0's bbox covering. Without it, queries still work but fall back to full scans.
import geopandas as gpd
import duckdb
# GeoPandas: Read GeoParquet with spatial filtering
gdf = gpd.read_parquet(
"s3://open-data/buildings.parquet",
bbox=(-122.5, 37.7, -122.3, 37.9), # San Francisco bbox
columns=["geometry", "height", "type"]
)
# DuckDB: Native GeoParquet with SQL
conn = duckdb.connect()
conn.execute("INSTALL spatial; LOAD spatial;")
# Bounding box filter uses covering index (191× speedup)
result = conn.execute("""
SELECT *
FROM read_parquet('buildings.parquet')
WHERE ST_Intersects(
geometry,
ST_GeomFromText('POLYGON((-122.5 37.7, -122.3 37.7,
-122.3 37.9, -122.5 37.9,
-122.5 37.7))')
)
""").fetchdf()
# GeoParquet 2.0: Native geometry means no WKB parsing
# Statistics in footer enable row group pruning
DuckDB's spatial extension is particularly impressive. The query planner understands GeoParquet's spatial statistics and automatically applies predicate pushdown. A spatial intersects query against a 100GB file on S3 may only fetch a few megabytes of data—the bbox statistics identify which row groups are relevant.
Writing: Enable write_covering_bbox or Lose the 191x
GeoParquet 2.0 requires explicit write_covering_bbox=True to store bounding boxes in row group metadata. Without this flag, files are valid GeoParquet but fall back to full scans—no spatial indexing, no row group pruning.
import geopandas as gpd
from shapely.geometry import Point
import pyarrow as pa
import pyarrow.parquet as pq
# Create GeoDataFrame
gdf = gpd.GeoDataFrame({
'name': ['Location A', 'Location B', 'Location C'],
'value': [100, 200, 300],
'geometry': [Point(-122.4, 37.8), Point(-122.3, 37.7), Point(-122.5, 37.9)]
}, crs="EPSG:4326")
# Write GeoParquet 2.0 with bbox covering
gdf.to_parquet(
"output.parquet",
engine="pyarrow",
compression="zstd",
# GeoParquet 2.0 options
schema_version="2.0.0",
write_covering_bbox=True, # Enable spatial index
geometry_encoding="geoarrow" # Native geometry encoding
)
# Verify spatial metadata
pq_file = pq.ParquetFile("output.parquet")
geo_metadata = pq_file.schema_arrow.metadata[b'geo']
print(geo_metadata)  # Shows bbox, CRS, geometry type
Compression choice matters for GeoParquet. Zstandard (zstd) typically achieves the best compression ratios for coordinate data while maintaining fast decompression. For highly repetitive data (many similar geometries), dictionary encoding provides additional benefits.
Iceberg 3: ACID Transactions on GeoParquet
Apache Iceberg 3 adds native geometry and geography types, using GeoParquet 2.0 as the underlying storage format. Iceberg layers ACID transactions, schema evolution, and time travel on top of GeoParquet's storage efficiency.
-- Apache Iceberg 3 with GeoParquet geometry support
-- Create Iceberg table with geometry column
CREATE TABLE buildings (
id BIGINT,
name STRING,
height DOUBLE,
footprint GEOMETRY, -- Native geometry type
location GEOGRAPHY -- Spherical geography type
)
USING iceberg
PARTITIONED BY (truncate(location, 6)) -- Illustrative spatial partitioning
TBLPROPERTIES (
'write.parquet.geometry.covering.bbox' = 'true'
);
-- Insert with geometry literals
INSERT INTO buildings VALUES (
1,
'Empire State Building',
443.2,
ST_GeomFromText('POLYGON((...))'),
ST_GeogFromText('POINT(-73.9857 40.7484)')
);
-- Spatial query with partition pruning
SELECT name, height
FROM buildings
WHERE ST_DWithin(
location,
ST_GeogFromText('POINT(-73.98 40.75)'),
1000 -- 1km radius
);
-- Query planner uses bbox statistics to skip irrelevant files
The combination is powerful for data engineering teams. Iceberg handles the complexity of data lake management—file compaction, partition evolution, concurrent writes—while GeoParquet provides optimal storage and query performance for spatial data. This is the architecture replacing enterprise geodatabases in modern stacks.
Ecosystem: 6 Major Platforms Ship GeoParquet Support
- DuckDB — Full read/write with spatial functions via extension
- GeoPandas — Native GeoParquet I/O with pyarrow backend
- GDAL 3.5+ — GeoParquet driver for interoperability
- Apache Sedona — Spark-based distributed spatial processing
- BigQuery — Native GeoParquet loading and export
- Snowflake — GeoParquet support in Geospatial features
Cloud data warehouses have moved quickly to support GeoParquet. BigQuery and Snowflake both load GeoParquet natively, making it the preferred format for spatial data ingestion. The days of converting to proprietary formats or using legacy Shapefiles for data interchange are ending.
Migration: Backward Compatible, Prioritize by Dataset Size
Existing GeoParquet 1.x files continue to work—readers are fully backward compatible. The question is which files to regenerate with 2.0 features first.
Prioritize for: Large datasets (100GB+), frequently queried with spatial filters, stored on object storage where read amplification is costly. The bbox covering optimization provides the most benefit here.
Lower priority: Small datasets, full-table scans, local disk storage. GeoParquet 1.x is already fast for these cases; the upgrade provides marginal benefit.
Watch for: Tool version requirements. GeoPandas 0.14+, DuckDB 0.9+, and GDAL 3.8+ support GeoParquet 2.0 features. Older versions may read files but won't leverage bbox statistics.
GeoParquet 2.0 Completes the Cloud-Native Geospatial Stack
GeoParquet 2.0 joins Zarr (multi-dimensional arrays), COG (raster imagery), and PMTiles (vector tiles) to form a complete, open, cloud-optimized format set for every geospatial data type. All four formats support HTTP range requests, work with common tools, and store on commodity object storage.
The 191x query speedup matters, but the architectural impact is larger. Spatial analysis that required PostGIS or Enterprise Geodatabase now runs on commodity data lake infrastructure. DuckDB queries GeoParquet on S3 as fast as PostGIS queries local tables—at pennies per gigabyte of storage.
Adopt GeoParquet 2.0 as your default vector data format. Convert legacy archives opportunistically—largest and most-queried files first—and ensure new pipelines output GeoParquet 2.0 with write_covering_bbox=True. The tooling is mature, the ecosystem is aligned, and the performance gap is too large to ignore.
For teams building spatial data platforms, GeoParquet + Iceberg provides enterprise geodatabase capabilities—ACID transactions, schema evolution, time travel—with cloud-native performance. This combination is replacing ArcGIS Enterprise and Oracle Spatial in modern geospatial architectures.
References & Further Reading
GeoParquet Specification 2.0
Official GeoParquet 2.0 specification
https://github.com/opengeospatial/geoparquet/blob/d727b4cd568651911860fec013982a06c353b9a0/format-specs/geoparquet.md
Native Geometry Types in Parquet
Apache Arrow geometry encoding specification
https://arrow.apache.org/docs/format/Columnar.html#geometry-types
Cloud-Native Geospatial Format Guide: GeoParquet
Comprehensive GeoParquet guide from Cloud-Native Geo Foundation
https://guide.cloudnativegeo.org/geoparquet/
Apache Iceberg 3 Geometry Support
Iceberg table format specification with geometry types
https://iceberg.apache.org/spec/
DuckDB Spatial Extension
DuckDB's GeoParquet support documentation
https://duckdb.org/docs/extensions/spatial.html
GeoParquet Performance Benchmarks
Performance comparisons vs Shapefile and GeoJSON
https://cloudnativegeo.org/blog/geoparquet-benchmarks/