vector data

1. Import geometry data

[1]:
# !pip install h3-toolkit, geopandas, polars, psycopg2-binary, sqlalchemy, geoalchemy2, connectorx
[2]:
import os

import geopandas as gpd
import polars as pl
from dotenv import load_dotenv
from sqlalchemy import create_engine

load_dotenv()
URI = os.getenv("POSTGRES")

1.1. Use geopandas to fetch vector data from Postgis (slowly)

CSL researcher only

[3]:
sql = """
with boundary as (
    select 代碼, geometry
    from geometry.boundary_smallest
    where 縣市代碼 = %(city_code)s
)
select aps.*, boundary.geometry as geometry
from geometry.af_ppl_stats as aps
join boundary
on aps.codebase = boundary.代碼
"""

engine = create_engine(URI)
with engine.connect() as conn:
    gdf = gpd.read_postgis(
        sql,
        con=conn.connection,
        geom_col='geometry',
        params={'city_code': '63000'} # 台北市
    )
/Users/syuanbo/Documents/GitHub/H3-ToolKits/.venv/lib/python3.12/site-packages/geopandas/io/sql.py:185: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  df = pd.read_sql(
/Users/syuanbo/Documents/GitHub/H3-ToolKits/.venv/lib/python3.12/site-packages/geopandas/io/sql.py:473: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  return pd.read_sql(spatial_ref_sys_sql, con)
[4]:
gdf.head()
[4]:
codebase city h_cnt p_cnt m_cnt f_cnt info_time geometry
0 A6301-0097-00 臺北市 40 121 53 68 2023-12-01 00:00:00+00:00 MULTIPOLYGON (((121.54425 25.06221, 121.54466 ...
1 A6303-0422-00 臺北市 146 376 165 211 2023-12-01 00:00:00+00:00 MULTIPOLYGON (((121.55701 25.03439, 121.5575 2...
2 A6303-1146-00 臺北市 93 229 110 119 2023-12-01 00:00:00+00:00 MULTIPOLYGON (((121.55202 25.02088, 121.55205 ...
3 A6311-0203-00 臺北市 58 159 80 79 2023-12-01 00:00:00+00:00 MULTIPOLYGON (((121.5243 25.11859, 121.52431 2...
4 A6308-0588-00 臺北市 112 299 142 157 2023-12-01 00:00:00+00:00 MULTIPOLYGON (((121.57742 24.99215, 121.57746 ...

1.2. Use polars+connectorx to fetch vector data from Postgis (faster, recommended)

CSL researcher only

[5]:
city_code = 63000

sql = f"""
with boundary as (
    select 代碼, geometry
    from geometry.boundary_smallest
    where 縣市代碼 = {city_code}
)
select aps.*, ST_AsBinary(boundary.geometry) as geometry_wkb
from geometry.af_ppl_stats as aps
join boundary
on aps.codebase = boundary.代碼
"""

pdf = pl.read_database_uri(sql, URI, engine="connectorx")
pdf.head()

[5]:
shape: (5, 8)
codebasecityh_cntp_cntm_cntf_cntinfo_timegeometry_wkb
strstri64i64i64i64datetime[ns, UTC]binary
"A6303-0422-00""臺北市"1463761652112023-12-01 00:00:00 UTCb"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00\x09\x00\x00\x00\x14\x9c\xfc\x15\xa6c^@\xa5\xa9\xb8\x9a\xcd\x089@v\xd7\xe3&\xaec^@\x09\xbfMB\xcd\x089@\xef\xe7\x99\x1c\xaec"...
"A6303-1146-00""臺北市"932291101192023-12-01 00:00:00 UTCb"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00\x0b\x00\x00\x006W\xca>Tc^@\x9a\xdb\x15GX\x059@\x0d\x06\x09\xdeTc^@&\x09\x07pV\x059@x\xcd\x94\x9cVc"...
"A6311-0203-00""臺北市"5815980792023-12-01 00:00:00 UTCb"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00\x0b\x00\x00\x00\:83\x8ea^@\xee\x9a\x1a\xc5[\x1e9@\xe3'-:\x8ea^@\xc0n2\xc0[\x1e9@Jz\xd9\xf1\x92a"...
"A6308-0588-00""臺北市"1122991421572023-12-01 00:00:00 UTCb"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00\x1d\x00\x00\x00\V,o\xf4d^@H\xa8\x11\xcc\xfd\xfd8@qV\x9b\x09\xf5d^@\xc3\x93j\x8d\xf2\xfd8@O\x94G\xf6\xf4d"...
"A6310-1111-00""臺北市"00002023-12-01 00:00:00 UTCb"\x01\x06\x00\x00\x00\x01\x00\x00\x00\x01\x03\x00\x00\x00\x01\x00\x00\x00"\x00\x00\x00Q\x91;\xf2\x9ce^@\x99\xc8\x19\xd8\xec\x109@-\xcf\xdd\xb4\xa3e^@!\xcfp\x09\xe1\x109@m%\xfa3\xa5e"...

1.3. Use geopandas to read geometry file

[6]:
data = gpd.read_file('data/test_geom.geojson')
data.plot(column='p_cnt',
               cmap='OrRd',         # 使用顏色映射
               scheme='naturalbreaks',  # 自然斷點分類
               k=5,
               legend=False
)
[6]:
<Axes: >
../_images/usage_01_vector_data_10_1.png

2. Geometry to h3

The H3Toolkit is a chainable class that can convert geometry data to h3 index.

  • Step1: use set_aggregation_strategy() to set the target column and the aggregation method in the dict { column_name : aggregation_function() }.

  • Step2: use process_from_vector() to set the data and the h3 resolution, this function will convert the geometry to h3 index in the designated resolution and process the column(s) with the aggregation method you set.

  • Step3: use get_result() to get the result. default is a polars dataframe with the h3 index and the aggregated column(s), if you want the actual geometry, you can set return_geometry=True to get the geometry column, and the return type will be the geopandas dataframe.

NOTE: this function always use the centroid of the h3 hexagon to extract the data from actual geometry object, so the result may loss some information if you choose the inproper resolution. If missing data exists, the warnging will be raised. you should choose the proper resolution to avoid this situation, ie from resolution 7 to 10. if you want the lower resolution, you can chain the result from process_from_vector() to process_from_h3() to get the lower resolution h3 index without lossing any information.

must to use get_result() to get the result, otherwise return will be the H3Toolkit object.

[7]:
from h3_toolkit import H3Toolkit
from h3_toolkit.aggregation import EqualSplit

toolkit = H3Toolkit()

result = (
    toolkit
    .set_aggregation_strategy(
        {
            'p_cnt' : EqualSplit(agg_col='codebase'),
        }
    )
    .process_from_vector(data, resolution=12, geometry_col='geometry')
)

output_df = result.get_result()

output_df.head()
[7]:
shape: (5, 2)
hex_idp_cnt
strf64
"8c4ba0a412a01ff"14.208333
"8c4ba0a412a05ff"14.208333
"8c4ba0a412a07ff"14.208333
"8c4ba0a412a09ff"14.208333
"8c4ba0a412a0bff"14.208333

set return_geometry=True to get the geometry column in the result.

[8]:
output_gdf = result.get_result(return_geometry=True, fill_null_value=0)
output_gdf.head()
[8]:
hex_id p_cnt geometry
0 8c4ba0a412a01ff 14.208333 POLYGON ((121.54706 25.04487, 121.54705 25.044...
1 8c4ba0a412a05ff 14.208333 POLYGON ((121.54695 25.04474, 121.54694 25.044...
2 8c4ba0a412a07ff 14.208333 POLYGON ((121.54713 25.04473, 121.54711 25.044...
3 8c4ba0a412a09ff 14.208333 POLYGON ((121.547 25.04501, 121.54698 25.04492...
4 8c4ba0a412a0bff 14.208333 POLYGON ((121.54717 25.045, 121.54716 25.04491...

3. Scale current h3 resolution to the target resolution

scale up the h3 index to the target resolution, the process_from_h3() function will do this job.

NOTE: The source_resolution must be greater than the target_resolution, otherwise the result will be wrong.

[9]:
from h3_toolkit import H3Toolkit
from h3_toolkit.aggregation import Sum

toolkit = H3Toolkit()

result = (
    toolkit
    .set_aggregation_strategy(
        {
            'p_cnt' : Sum(),
        }
    )
    .process_from_h3(
        output_df,
        source_resolution=12,
        target_resolution=10,
        h3_col='hex_id'
    )
)

output_df_10 = result.get_result(fill_null_value=0)
output_df_10.head()
[9]:
shape: (5, 2)
hex_idp_cnt
strf64
"8a4ba0a5d18ffff"23.007257
"8a4ba0a43c4ffff"603.692149
"8a4ba0a4e99ffff"199.093091
"8a4ba0a5d19ffff"11.574834
"8a4ba0a5db57fff"485.178707

Now, you are successfully scale up the data from resolution 12 to 10.

The benefit of H3Toolkit is that you can chain all the functions together, it’s much simple and maintainable. and make the process logic more clear!

[10]:
toolkit = H3Toolkit()

result = (
    toolkit
    .set_aggregation_strategy(
        {
            ('p_cnt', 'h_cnt') : EqualSplit(agg_col='codebase'),
        }
    )
    .process_from_vector(data, resolution=12, geometry_col='geometry')
    .set_aggregation_strategy(
        {
            ('p_cnt', 'h_cnt') : Sum(),
        }
    )
    .process_from_h3(
        source_resolution=12,
        target_resolution=10,
        h3_col='hex_id'
    )
    .get_result()
)

result.head()
[10]:
shape: (5, 3)
hex_idp_cnth_cnt
strf64f64
"8a4ba0a4e847fff"595.238028248.036192
"8a4ba0a5d30ffff"2.8531591.233115
"8a4ba0a4301ffff"868.195243351.344298
"8a4ba0a4e557fff"587.340822231.649744
"8a4ba0a5d8f7fff"9.9907343.819245

4. Visualization

Use the show_h3() function from h3_toolkit.visualization module to visualize the h3 data interactively with pydeck.

Supported classifiers (from mapclassify): NaturalBreaks, Quantiles, EqualInterval, Jenks, and more. Supported colormaps (from matplotlib): Oranges, viridis, RdYlGn, Blues, Greens, and many others.

[ ]:
from h3_toolkit.visualization import show_h3

# Visualize with default classifier (NaturalBreaks) and colormap (Oranges)
show_h3(output_df, 'p_cnt')
[ ]:
# Visualize resolution 12 data with default NaturalBreaks classifier
toolkit.show(target_col='p_cnt')

Due to performance limitations when rendering large H3 datasets on the web, here’s a screenshot example of the visualization. For an interactive map, you can clone the notebook and run it locally in your environment

res12

[ ]:
# Visualize resolution 10 data with Quantiles classifier
show_h3(output_df_10, 'p_cnt', classifier='Quantiles', k=5)

Due to performance limitations when rendering large H3 datasets on the web, here’s a screenshot example of the visualization. For an interactive map, you can clone the notebook and run it locally in your environment

res10

It’s not recommended using pydeck or kepler.gl to visualize the h3 data if the data is too large, because the h3 data is too dense to visualize, it will be very slow and may crash.

The better way to visual large dataset is using lonboard, but this package is not support the h3 data yet, so you should convert the h3 data to the geometry data before visualizing it.

[14]:
## TODO: Use case of lonborad

5. Customize the aggregation function

We provided some popular aggregation methods for doing the aggregation, read the h3_toolkit.aggregation class to get more information.

But you can also custom your specific aggregation function by yourself, just define a function that inherit from the AggregationStrategy class and implement the apply() function.

[15]:
from h3_toolkit import H3Toolkit
from h3_toolkit.aggregation import AggregationStrategy


class NewAggregationStrategy(AggregationStrategy):
    """
    Make sure to inherit from AggregationStrategy and implement the apply method
    with data and target_cols attributes.
    """
    def __init__(self, new_var):
        """
        If you want to pass some parameters to the strategy, you can do it here.
        """
        self.new_var = new_var


    def apply(data:pl.DataFrame, target_cols:list[str]) -> pl.DataFrame:
        """
        If you are not familiar with polars, you can use pandas instead.
        just make sure to convert the type to polars dataframe before returning it.
        >>> data = data.to_pandas()
        >>> ...
        >>> ...
        >>> ...
        >>> return pl.from_pandas(data)
        """
        pass


toolkit = H3Toolkit()
toolkit.set_aggregation_strategy(
    {
        ('col_name', '...') : NewAggregationStrategy(new_var='...'),
    }
)
[15]:
<h3_toolkit.core.H3Toolkit at 0x14396b2c0>