vector data

1. Import geometry data

[1]:

# !pip install h3-toolkit, geopandas, polars, psycopg2-binary, sqlalchemy, geoalchemy2, connectorx

[2]:

import os

import geopandas as gpd
import polars as pl
from dotenv import load_dotenv
from sqlalchemy import create_engine

load_dotenv()
URI = os.getenv("POSTGRES")

1.1. Use `geopandas` to fetch vector data from Postgis (slowly)

CSL researcher only

[3]:

sql = """
with boundary as (
    select 代碼, geometry
    from geometry.boundary_smallest
    where 縣市代碼 = %(city_code)s
)
select aps.*, boundary.geometry as geometry
from geometry.af_ppl_stats as aps
join boundary
on aps.codebase = boundary.代碼
"""

engine = create_engine(URI)
with engine.connect() as conn:
    gdf = gpd.read_postgis(
        sql,
        con=conn.connection,
        geom_col='geometry',
        params={'city_code': '63000'} # 台北市
    )

/Users/syuanbo/Documents/GitHub/H3-ToolKits/.venv/lib/python3.12/site-packages/geopandas/io/sql.py:185: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  df = pd.read_sql(
/Users/syuanbo/Documents/GitHub/H3-ToolKits/.venv/lib/python3.12/site-packages/geopandas/io/sql.py:473: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  return pd.read_sql(spatial_ref_sys_sql, con)

[4]:

gdf.head()

[4]:

	codebase	city	h_cnt	p_cnt	m_cnt	f_cnt	info_time	geometry
0	A6301-0097-00	臺北市	40	121	53	68	2023-12-01 00:00:00+00:00	MULTIPOLYGON (((121.54425 25.06221, 121.54466 ...
1	A6303-0422-00	臺北市	146	376	165	211	2023-12-01 00:00:00+00:00	MULTIPOLYGON (((121.55701 25.03439, 121.5575 2...
2	A6303-1146-00	臺北市	93	229	110	119	2023-12-01 00:00:00+00:00	MULTIPOLYGON (((121.55202 25.02088, 121.55205 ...
3	A6311-0203-00	臺北市	58	159	80	79	2023-12-01 00:00:00+00:00	MULTIPOLYGON (((121.5243 25.11859, 121.52431 2...
4	A6308-0588-00	臺北市	112	299	142	157	2023-12-01 00:00:00+00:00	MULTIPOLYGON (((121.57742 24.99215, 121.57746 ...

1.2. Use `polars+connectorx` to fetch vector data from Postgis (faster, recommended)

CSL researcher only

[5]:

city_code = 63000

sql = f"""
with boundary as (
    select 代碼, geometry
    from geometry.boundary_smallest
    where 縣市代碼 = {city_code}
)
select aps.*, ST_AsBinary(boundary.geometry) as geometry_wkb
from geometry.af_ppl_stats as aps
join boundary
on aps.codebase = boundary.代碼
"""

pdf = pl.read_database_uri(sql, URI, engine="connectorx")
pdf.head()

[5]:

shape: (5, 8)

codebase	city	h_cnt	p_cnt	m_cnt	f_cnt	info_time	geometry_wkb
str	str	i64	i64	i64	i64	datetime[ns, UTC]	binary
"A6303-0422-00"	"臺北市"	146	376	165	211	2023-12-01 00:00:00 UTC	b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00\x09\x00\x00\x00\x14\x9c\xfc\x15\xa6c^@\xa5\xa9\xb8\x9a\xcd\x089@v\xd7\xe3&\xaec^@\x09\xbfMB\xcd\x089@\xef\xe7\x99\x1c\xaec"...
"A6303-1146-00"	"臺北市"	93	229	110	119	2023-12-01 00:00:00 UTC	b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00\x0b\x00\x00\x006W\xca>Tc^@\x9a\xdb\x15GX\x059@\x0d\x06\x09\xdeTc^@&\x09\x07pV\x059@x\xcd\x94\x9cVc"...
"A6311-0203-00"	"臺北市"	58	159	80	79	2023-12-01 00:00:00 UTC	b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00\x0b\x00\x00\x00\:83\x8ea^@\xee\x9a\x1a\xc5[\x1e9@\xe3'-:\x8ea^@\xc0n2\xc0[\x1e9@Jz\xd9\xf1\x92a"...
"A6308-0588-00"	"臺北市"	112	299	142	157	2023-12-01 00:00:00 UTC	b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00\x1d\x00\x00\x00\V,o\xf4d^@H\xa8\x11\xcc\xfd\xfd8@qV\x9b\x09\xf5d^@\xc3\x93j\x8d\xf2\xfd8@O\x94G\xf6\xf4d"...
"A6310-1111-00"	"臺北市"	0	0	0	0	2023-12-01 00:00:00 UTC	b"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00"\x00\x00\x00Q\x91;\xf2\x9ce^@\x99\xc8\x19\xd8\xec\x109@-\xcf\xdd\xb4\xa3e^@!\xcfp\x09\xe1\x109@m%\xfa3\xa5e"...

1.3. Use `geopandas` to read geometry file

[6]:

data = gpd.read_file('data/test_geom.geojson')
data.plot(column='p_cnt',
               cmap='OrRd',         # 使用顏色映射
               scheme='naturalbreaks',  # 自然斷點分類
               k=5,
               legend=False
)

[6]:

<Axes: >

../_images/usage_01_vector_data_10_1.png

2. Geometry to h3

The H3Toolkit is a chainable class that can convert geometry data to h3 index.

Step1: use set_aggregation_strategy() to set the target column and the aggregation method in the dict { column_name : aggregation_function() }.
Step2: use process_from_vector() to set the data and the h3 resolution, this function will convert the geometry to h3 index in the designated resolution and process the column(s) with the aggregation method you set.
Step3: use get_result() to get the result. default is a polars dataframe with the h3 index and the aggregated column(s), if you want the actual geometry, you can set return_geometry=True to get the geometry column, and the return type will be the geopandas dataframe.

NOTE: this function always use the centroid of the h3 hexagon to extract the data from actual geometry object, so the result may loss some information if you choose the inproper resolution. If missing data exists, the warnging will be raised. you should choose the proper resolution to avoid this situation, ie from resolution 7 to 10. if you want the lower resolution, you can chain the result from process_from_vector() to process_from_h3() to get the lower resolution h3 index without lossing any information.

must to use get_result() to get the result, otherwise return will be the H3Toolkit object.

[7]:

from h3_toolkit import H3Toolkit
from h3_toolkit.aggregation import EqualSplit

toolkit = H3Toolkit()

result = (
    toolkit
    .set_aggregation_strategy(
        {
            'p_cnt' : EqualSplit(agg_col='codebase'),
        }
    )
    .process_from_vector(data, resolution=12, geometry_col='geometry')
)

output_df = result.get_result()

output_df.head()

[7]:

shape: (5, 2)

hex_id	p_cnt
str	f64
"8c4ba0a412a01ff"	14.208333
"8c4ba0a412a05ff"	14.208333
"8c4ba0a412a07ff"	14.208333
"8c4ba0a412a09ff"	14.208333
"8c4ba0a412a0bff"	14.208333

set return_geometry=True to get the geometry column in the result.

[8]:

output_gdf = result.get_result(return_geometry=True, fill_null_value=0)
output_gdf.head()

[8]:

	hex_id	p_cnt	geometry
0	8c4ba0a412a01ff	14.208333	POLYGON ((121.54706 25.04487, 121.54705 25.044...
1	8c4ba0a412a05ff	14.208333	POLYGON ((121.54695 25.04474, 121.54694 25.044...
2	8c4ba0a412a07ff	14.208333	POLYGON ((121.54713 25.04473, 121.54711 25.044...
3	8c4ba0a412a09ff	14.208333	POLYGON ((121.547 25.04501, 121.54698 25.04492...
4	8c4ba0a412a0bff	14.208333	POLYGON ((121.54717 25.045, 121.54716 25.04491...

3. Scale current h3 resolution to the target resolution

scale up the h3 index to the target resolution, the process_from_h3() function will do this job.

NOTE: The source_resolution must be greater than the target_resolution, otherwise the result will be wrong.

[9]:

from h3_toolkit import H3Toolkit
from h3_toolkit.aggregation import Sum

toolkit = H3Toolkit()

result = (
    toolkit
    .set_aggregation_strategy(
        {
            'p_cnt' : Sum(),
        }
    )
    .process_from_h3(
        output_df,
        source_resolution=12,
        target_resolution=10,
        h3_col='hex_id'
    )
)

output_df_10 = result.get_result(fill_null_value=0)
output_df_10.head()

[9]:

shape: (5, 2)

hex_id	p_cnt
str	f64
"8a4ba0a5d18ffff"	23.007257
"8a4ba0a43c4ffff"	603.692149
"8a4ba0a4e99ffff"	199.093091
"8a4ba0a5d19ffff"	11.574834
"8a4ba0a5db57fff"	485.178707

Now, you are successfully scale up the data from resolution 12 to 10.

The benefit of H3Toolkit is that you can chain all the functions together, it’s much simple and maintainable. and make the process logic more clear!

[10]:

toolkit = H3Toolkit()

result = (
    toolkit
    .set_aggregation_strategy(
        {
            ('p_cnt', 'h_cnt') : EqualSplit(agg_col='codebase'),
        }
    )
    .process_from_vector(data, resolution=12, geometry_col='geometry')
    .set_aggregation_strategy(
        {
            ('p_cnt', 'h_cnt') : Sum(),
        }
    )
    .process_from_h3(
        source_resolution=12,
        target_resolution=10,
        h3_col='hex_id'
    )
    .get_result()
)

result.head()

[10]:

shape: (5, 3)

hex_id	p_cnt	h_cnt
str	f64	f64
"8a4ba0a4e847fff"	595.238028	248.036192
"8a4ba0a5d30ffff"	2.853159	1.233115
"8a4ba0a4301ffff"	868.195243	351.344298
"8a4ba0a4e557fff"	587.340822	231.649744
"8a4ba0a5d8f7fff"	9.990734	3.819245

4. Visualization

Use the show_h3() function from h3_toolkit.visualization module to visualize the h3 data interactively with pydeck.

Supported classifiers (from mapclassify): NaturalBreaks, Quantiles, EqualInterval, Jenks, and more. Supported colormaps (from matplotlib): Oranges, viridis, RdYlGn, Blues, Greens, and many others.

[ ]:

from h3_toolkit.visualization import show_h3

# Visualize with default classifier (NaturalBreaks) and colormap (Oranges)
show_h3(output_df, 'p_cnt')

[ ]:

# Visualize resolution 12 data with default NaturalBreaks classifier
toolkit.show(target_col='p_cnt')

Due to performance limitations when rendering large H3 datasets on the web, here’s a screenshot example of the visualization. For an interactive map, you can clone the notebook and run it locally in your environment

res12

[ ]:

# Visualize resolution 10 data with Quantiles classifier
show_h3(output_df_10, 'p_cnt', classifier='Quantiles', k=5)

Due to performance limitations when rendering large H3 datasets on the web, here’s a screenshot example of the visualization. For an interactive map, you can clone the notebook and run it locally in your environment

res10

It’s not recommended using pydeck or kepler.gl to visualize the h3 data if the data is too large, because the h3 data is too dense to visualize, it will be very slow and may crash.

The better way to visual large dataset is using lonboard, but this package is not support the h3 data yet, so you should convert the h3 data to the geometry data before visualizing it.

[14]:

## TODO: Use case of lonborad

5. Customize the aggregation function

We provided some popular aggregation methods for doing the aggregation, read the h3_toolkit.aggregation class to get more information.

But you can also custom your specific aggregation function by yourself, just define a function that inherit from the AggregationStrategy class and implement the apply() function.

[15]:

from h3_toolkit import H3Toolkit
from h3_toolkit.aggregation import AggregationStrategy


class NewAggregationStrategy(AggregationStrategy):
    """
    Make sure to inherit from AggregationStrategy and implement the apply method
    with data and target_cols attributes.
    """
    def __init__(self, new_var):
        """
        If you want to pass some parameters to the strategy, you can do it here.
        """
        self.new_var = new_var


    def apply(data:pl.DataFrame, target_cols:list[str]) -> pl.DataFrame:
        """
        If you are not familiar with polars, you can use pandas instead.
        just make sure to convert the type to polars dataframe before returning it.
        >>> data = data.to_pandas()
        >>> ...
        >>> ...
        >>> ...
        >>> return pl.from_pandas(data)
        """
        pass


toolkit = H3Toolkit()
toolkit.set_aggregation_strategy(
    {
        ('col_name', '...') : NewAggregationStrategy(new_var='...'),
    }
)

[15]:

<h3_toolkit.core.H3Toolkit at 0x14396b2c0>