Using Uber H3 Indexing Library in Postgres for Geospatial Data Analytics

Introduction

H3 is a hierarchical grid system designed to efficiently index and organize geospatial data. It divides the Earth’s surface into hexagonal cells of various sizes, creating a hierarchical structure that allows for spatial indexing and querying.

In context to postgres, uber h3 can be used to represented your spatial data as h3 indexes which are 64bit integers which can be used to turn your spatial operations into relational operations.

Spoiler: Query time went from 543 seconds to 1 second

TLDR;

Represent your spatial h3 index
Use relational DB for what’s it’s good at

Settting up

Get the environment up and running.

Creating docker image

You can skip this section and use the docker image in the last step. I only included these steps as a reference if you need to build the h3 extension in your own environment.

# Pull any postgis docker image
docker pull postgis/postgis:15-master
# Let's run a demo container where we can install h3 extension
docker run -it --name postgis-h3 -e POSTGRES_PASSWORD=postgres postgis/postgis:15-master

# shell access into the docker
docker exec -it -u root postgis-h3 bash

# Install the dependencies
apt update
apt install -y pip libpq-dev postgresql-server-dev-15
pip install pgxnclient cmake
# Install the extension
pgxn install h3

# Remove the dev dependencies
pip uninstall pgxnclient cmake
apt purge -y libpq-dev postgresql-server-dev-15 pip
exit

docker commit postgis-h3 postgis-h3
docker stop postgis-h3
docker rm postgis-h3
docker run -it --name postgis-h3 -d  -p 6432:5432 -e POSTGRES_PASSWORD=postgres postgis-h3

Create docker container from the pre-built image

docker run -it --name postgis-h3 -d -p 6432:5432 -e POSTGRES_PASSWORD=postgres  jsonsingh/postgis-h3

Test the extension

PGPASSWORD=postgres psql -p 6432 -h localhost

-- Create a temporary database
Create database temp;
\c temp;
create extension h3;
create extension h3_postgis CASCADE;
SELECT h3_lat_lng_to_cell(POINT('37.3615593,-122.0553238'), 5);

-- Clean up
\c postgres
Drop database temp

Let’s add some data

We’ll be using the data from OSM but any spatial can be used for this exercise.

Get the OSM data from https://download.geofabrik.de/ Choose whichever ones you want. I’ll be using Australia and Oceania.

# Just to make our life easy
export PGPASSWORD=postgres
export PGPORT=6432
export PGHOST=localhost

# Needed to load osm data to postgres
apt install osm2pgsql

# our database and extension
psql -c "CREATE database osm_data;"
psql -d osm_data -c "CREATE extension postgis;"

# This will take some time
osm2pgsql -c -d osm_data -x -E 4326 <osm_file>
# Mulitple OSM files
osm2pgsql -a -d osm_data -x -E 4326 <osm_file_2>

Defining the Problem statement (Point in Polygon)

For this sample dataset. We’ll take admin boundary level 6 (polygons) and points (trees) and see which tree lies in which admin boundary. It’s a simple point in polygon problem.

Creating separate tables for polygons and points

Let’s get a subset of data

PGPASSWORD=postgres psql -p 6432 -h localhost -d osm_data

CREATE TABLE planet_osm_polygon_admin_6 AS SELECT DISTINCT ON (osm_id) * FROM planet_osm_polygon WHERE admin_level = '6';
CREATE TABLE planet_osm_trees AS SELECT * FROM planet_osm_point WHERE "natural" = 'tree';

SELECT COUNT(*) FROM planet_osm_polygon_admin_6;
-- 5518

SELECT COUNT(*) FROM planet_osm_trees;
-- 799130

Point in Polygon (PostGIS)

In postGIS we can simply use ST_Within

-- Not let's try to find all the trees in these countries
SELECT
    b.osm_id,
    COUNT(a.osm_id)
FROM
    planet_osm_trees a
    RIGHT JOIN planet_osm_polygon_admin_6 b ON ST_Within (a.way, b.way)
GROUP BY
    b.osm_id
ORDER BY
    COUNT;
-- took 343 seconds

Point in Polygon (Uber h3)

We’ll use h3 index of level 7 to represent our points and polygons and use a join using = which you’ll see is considerably faster

CREATE extension h3;
CREATE extension h3_postgis CASCADE;
-- Let's try the h3 grid. Calculate all the h3 index for all trees
ALTER TABLE planet_osm_trees ADD COLUMN h3_index h3index GENERATED ALWAYS AS (h3_lat_lng_to_cell(way::POINT, 7)) STORED;

-- For polygons we need to do a bit of extra work
-- This function takes a polygon and gives all the h3 index inside that polygon
CREATE OR REPLACE FUNCTION get_h3_indexes(shape geometry, index integer)
  RETURNS h3index[] AS $$
DECLARE
  h3_indexes h3index[];
BEGIN
  SELECT ARRAY(
    SELECT h3_polygon_to_cells(shape, index)
  ) INTO h3_indexes;

  RETURN h3_indexes;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- We calculate all the h3 index for all the shapes. This will take a lot of time but it's a one time process
ALTER TABLE planet_osm_polygon_admin_6  ADD COLUMN h3_indexes h3index[] GENERATED ALWAYS AS (get_h3_indexes(way,7)) STORED;
-- 30 seconds

-- Only for visualization purpose
ALTER TABLE planet_osm_trees ADD COLUMN h3_shape geometry GENERATED ALWAYS AS (h3_cell_to_boundary_geometry(h3_lat_lng_to_cell(way::POINT, 7))) STORED;

-- Create a flat table for faster joins and visualize this table in QGIS
CREATE TABLE
    planet_osm_polygon_admin_6_flat AS
SELECT
    h3_index,
    osm_id,
    -- This is just for visualization and not necessary
    h3_cell_to_boundary_geometry (h3_index)
FROM
    (
        SELECT
            UNNEST (h3_indexes) AS h3_index,
            osm_id
        FROM
            planet_osm_polygon_admin_6
    ) AS a;
-- 8 seconds

-- We compare the h3 index and find all the shape ids
SELECT
    b.osm_id,
    COUNT(a.osm_id)
FROM
    planet_osm_trees a
    JOIN planet_osm_polygon_admin_6_flat b ON a.h3_index = b.h3_index
GROUP BY
    b.osm_id
ORDER BY
    COUNT;
-- takes 1 sec

The above query took 1 seconds compared to 543 seconds which is a significant improvement and currently there’s no indexing involved.

Trees represented as h3 index
Trees and admin shapes as h3 indexes

Compact Polygon

Polygon can be represented by different levels of h3 indexes. This can be useful when you need to partition the data.

-- create a function which returns compacted h3
CREATE OR REPLACE FUNCTION get_h3_indexes_compact(shape geometry, index integer) RETURNS h3index[] AS $$ DECLARE h3_indexes h3index[];
BEGIN WITH cells AS (
  SELECT
    h3_polygon_to_cells(shape, index) AS cell_array
)
SELECT
  Array(
    SELECT
      h3_compact_cells(
        array_agg(cell_array)
      )
    FROM
      cells
  ) into h3_indexes;
RETURN h3_indexes;
END;
$$ LANGUAGE plpgsql IMMUTABLE;


-- Let's create the compacted array indice
ALTER TABLE
  planet_osm_polygon_admin_6
ADD
  COLUMN h3_indexes_compacted h3index[] GENERATED ALWAYS AS (
    get_h3_indexes_compact(way, 7)
  ) STORED;


-- Visualize the below table in QGIS
CREATE TABLE
    planet_osm_polygon_admin_6_flat_compact AS
SELECT
    h3_index,
    osm_id,
    h3_cell_to_boundary_geometry (h3_index)
FROM
    (
        SELECT
            UNNEST (h3_indexes_compacted) AS h3_index,
            osm_id
        FROM
            planet_osm_polygon_admin_6
    ) AS a;

Polygons represented as compact h3 index

More things to try

Using h3 raster functions to work with rasters in postgis
Nearest neighbors (kRing), helpful especially in ML analysis
Edge function, moving from one cell to the next creating a path with h3index
Optimizing queries using postgres techniques such as indexing and partitioning

Sources:

Introduction#

Settting up#

Creating docker image#

Test the extension#

Let’s add some data#

Defining the Problem statement (Point in Polygon)#

Creating separate tables for polygons and points#

Point in Polygon (PostGIS)#

Point in Polygon (Uber h3)#

Compact Polygon#

More things to try#