Database 25 November 2024 14 min read Sheece Gardezi

PostGIS + pgvector: 3 Billion Vectors in One Database

Earth Index runs PostGIS and pgvector in a single PostgreSQL instance to serve 3B embeddings for planetary environmental monitoring. One database for spatial queries and vector search.

PostGISpgvectorVector DatabaseRAGPostgreSQL

Abstract database visualization with connected nodes — Alina Grubnyak on Unsplash

Earth Index monitors deforestation and environmental change across the entire planet using 3 billion vector embeddings — all stored in PostgreSQL. Their monthly cost: roughly $12,000, versus an estimated $237,000 on a managed cloud vector service. The secret isn't exotic infrastructure. It's PostGIS and pgvector running side-by-side, executing spatial and semantic queries in a single round-trip.

Why Add a Third Database When PostgreSQL Already Runs Two?

The AI boom created demand for vector databases — Pinecone, Weaviate, Milvus, and others emerged as specialized solutions. But for teams already running PostgreSQL with PostGIS, adding another database means operational complexity, data synchronization headaches, and application-level joins between systems that should share a query planner.

pgvector closes this gap. It's an open-source PostgreSQL extension that adds vector data types and approximate nearest neighbor (ANN) search. Install it alongside PostGIS, and your spatial database speaks the language of embeddings — no new infrastructure required.

PostgreSQL and pgvector are now faster than Pinecone, 75% cheaper, and 100% open source. For datasets under 100 million vectors, general-purpose databases with vector extensions outperform specialized alternatives.

Timescale Benchmarks, 2024

Spatial + Semantic in a Single Query

The real power emerges when ST_ spatial functions and vector similarity operators execute together in one SQL statement:

Hybrid Query Examples

Find properties similar to this listing AND within 5km of a specific school
Retrieve satellite imagery embeddings for areas that match this deforestation pattern
Search for restaurants semantically similar to "cozy Italian bistro" within a polygon
Find the nearest 10 sites with environmental conditions matching this reference
Cluster geographic features by both location and semantic embedding

Each query executes in a single round-trip. No application-level joins between separate databases, no data consistency issues, no operational overhead of multiple systems.

Earth Index: 3 Billion Vectors at $12K/mo

Earth Index provides the most impressive validation at scale. Their planetary environmental monitoring system manages 3 billion vectors in PostgreSQL, combining VectorChord (a high-performance pgvector fork) with PostGIS.

The cost comparison: approximately $12,000 per month for their PostgreSQL setup versus an estimated $237,000 monthly for a comparable managed cloud vector service. They achieve this by co-locating vector embeddings with geospatial metadata — tile locations, administrative boundaries, proximity to features like rivers — eliminating cross-database data movement entirely.

spatial_vector_query.sql

-- Find similar satellite embeddings within a geographic region
SELECT tile_id, similarity
FROM satellite_embeddings
WHERE ST_Intersects(
    tile_geom,
    ST_MakeEnvelope(-122.5, 37.5, -122.0, 38.0, 4326)
)
AND embedding <-> query_embedding < 0.5
ORDER BY embedding <-> query_embedding
LIMIT 100;

pgvector 0.8.0: 9x Faster Queries, 32x Less Memory

The 0.8.0 release in late 2024 closed the performance gap with specialized vector databases:

Key Improvements

9× faster query processing

With iterative index scans

100× more relevant results

Through improved recall

3-5× query throughput

Over previous versions

Binary quantization

32× memory reduction with 95% accuracy

40-80% cost reduction

Compared to specialized vector databases

Iterative index scans are particularly important for combined spatial-vector queries. Previously, the index might return candidates that pass the vector similarity threshold but fail the spatial filter. Now, the index automatically scans deeper when needed, ensuring result quality.

HNSW vs. IVFFlat: Choosing the Right Index

pgvector offers two indexing approaches, each with distinct tradeoffs:

Index Comparison

HNSW (Hierarchical Navigable Small World) — Better query performance, slower builds, more memory. Best for read-heavy workloads.
IVFFlat (Inverted File Index) — Faster builds, less memory, requires periodic re-indexing. Better for frequently updated datasets.

For geospatial applications, HNSW is typically preferred. Satellite embeddings, location descriptions, and spatial feature vectors don't change frequently once computed. The build time investment pays off in query performance.

pgvectorscale: Billion-Vector Deployments on Disk

Timescale's pgvectorscale extension builds on pgvector with StreamingDiskANN, which stores part of the index on disk rather than requiring everything in memory. This makes billion-vector deployments economically viable on standard hardware.

PostgreSQL Is Enough for 95% of Geospatial AI

For geospatial AI applications, PostGIS + pgvector is now the default recommendation. The operational simplicity of a single database, combined with the ability to express complex spatial-semantic queries in SQL, outweighs the marginal performance benefits of specialized vector databases.

The exception is truly massive scale — hundreds of billions of vectors — where purpose-built systems still hold an edge. But for the vast majority of geospatial applications, PostgreSQL is enough, and enough is better than over-engineered.

References & Further Reading

pgvector GitHub Repository

Official pgvector extension for PostgreSQL

https://github.com/pgvector/pgvector

3 Billion Vectors in PostgreSQL to Protect the Earth

Earth Index case study on planetary-scale vector search

https://blog.vectorchord.ai/3-billion-vectors-in-postgresql-to-protect-the-earth

PostgreSQL as a Vector Database: Complete Guide

Timescale's comprehensive pgvector tutorial

https://www.timescale.com/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector

pgvector and PostGIS: Unlocking Advanced PostgreSQL Use Cases

Practical guide to combining pgvector with PostGIS

https://blogs.vultr.com/PG-Vector-PostGIS

pgvector with PostGIS Example Repository

Code examples for hybrid spatial-vector queries

https://github.com/scitus-ca/pgvector_with_postgis

Have a project in mind?

Get a Free Consultation

Location

Canberra
ACT, Australia