Preparation: Raw → HLL Conversion

Alexander Dunkel, Leibniz Institute of Ecological Urban and Regional Development,
Transformative Capacities & Research Data Centre (IÖR-FDZ)

Publication:
Dunkel, A., Burghardt, D. (2024). Assessing perceived landscape change from opportunistic spatio-temporal occurrence data. Land 2024

Out[4]:

Last updated: Aug-18-2023, Carto-Lab Docker Version 0.14.0

Data transformation for Milvus milvus quantitative analysis (Flickr, iNaturalist).

This notebook transforms collected data into quantitative HyperLogLog data for analysis and archiving. The original raw data can be removed afterwards. See publication:

Dunkel, A., Löchner, M., & Burghardt, D. (2020). Privacy-Aware Visualization of Volunteered Geographic Information (VGI) to Analyze Spatial Activity: A Benchmark Implementation. ISPRS International Journal of Geo-Information, 9(10), 607. https://doi.org/10.3390/ijgi9100607

Prepare environment

To run this notebook, you have two options as a starting point:

1. Create an environment with the packages and versions shown in the following cell.

As a starting point, you may use the latest conda environment_default.yml from our Carto-Lab Docker container, as sketched below.
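A minimal sketch of this route, assuming conda is installed and environment_default.yml has been downloaded to the working directory (the environment name is defined inside the yml; the placeholder below is not from the original notebook):

conda env create -f environment_default.yml
conda activate <env-name-from-yml>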

2. If Docker is available to you, we suggest using the Carto-Lab Docker Container.

Clone the repository and edit your .env value to point to the repository where this notebook can be found, e.g.:

git clone https://gitlab.vgiscience.de/lbsn/tools/jupyterlab.git
cd jupyterlab
cp .env.example .env
nano .env
## Enter:
# JUPYTER_NOTEBOOKS=~/notebooks/ephemeral_events
# TAG=v0.12.3
docker network create lbsn-network
docker-compose pull && docker-compose up -d
List of package versions used in this notebook
package   python  Fiona   Shapely  bokeh  colorcet  geopandas  geoviews  holoviews  hvplot  ipywidgets
version   3.9.15  1.8.20  1.7.1    2.4.3  3.0.1     0.13.2     1.9.5     1.14.8     0.8.4   8.0.7

package   mapclassify  matplotlib  matplotlib-venn  numpy   pandas  python-dotenv  xarray
version   2.5.0        3.7.1       0.11.9           1.22.4  2.0.3   1.0.0          2023.6.0

Load dependencies:

In [2]:
import os, sys
from pathlib import Path
import psycopg2
import geopandas as gp
import pandas as pd
import matplotlib.pyplot as plt
from typing import List, Tuple, Dict, Optional
from IPython.display import clear_output, display, HTML

To reduce the code shown in this notebook, some helper methods are made available in a separate file.

Load helper module from ../py/modules/base/tools.py.

In [3]:
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules.base import tools
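The tools.DbConn wrapper used below is defined in the helper module and not reproduced in this notebook. Conceptually, it is a thin convenience layer around a psycopg2 connection that returns query results as DataFrames; a minimal stand-in could look like this (an illustration, not the actual implementation):

import pandas as pd

class DbConnSketch:
    """Illustrative stand-in for modules.base.tools.DbConn."""
    def __init__(self, db_connection):
        self.db_connection = db_connection

    def query(self, sql: str) -> pd.DataFrame:
        """Run a SQL query and return the result as a DataFrame."""
        return pd.read_sql(sql, con=self.db_connection)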

Activate autoreload of changed python files:

In [4]:
%load_ext autoreload
%autoreload 2

Parameters

Define initial parameters that affect processing

In [5]:
WORK_DIR = Path.cwd().parents[0] / "tmp"     # Working directory                     
OUTPUT = Path.cwd().parents[0] / "out"       # Define path to output directory (figures etc.)
In [6]:
for folder in [WORK_DIR, OUTPUT]:
    folder.mkdir(exist_ok=True)

Load environment variables from .env (dotenv):

In [7]:
from dotenv import load_dotenv
load_dotenv(
    Path.cwd().parents[0] / '.env', override=True)
Out[7]:
True
In [8]:
DB_NAME_RAWDB = os.getenv("DB_NAME_RAWDB")    # lbsn-rawdb name
DB_HOST_RAWDB = os.getenv("DB_HOST_RAWDB")    # lbsn-rawdb host
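For reference, a minimal .env sketch with the variables read above and the password used further below (all values are placeholders, not the actual configuration):

# .env (placeholder values)
DB_NAME_RAWDB=rawdb
DB_HOST_RAWDB=lbsn-rawdb
READONLY_USER_PASSWORD=<your-password>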

Raw to HLL conversion

For calculating with HLL data, we are using an empty Postgres database with the Citus HLL extension installed. Specifically, we are using pg-hll-empty here, a Docker Postgres container that is prepared for HLL calculation. Any Postgres instance will do, hosted anywhere, as long as the Citus HLL extension is installed.

If you haven't already, start up the container locally, next to Carto-Lab Docker:

cd pg-hll-empty
docker compose up -d
In [9]:
DB_USER = "hlluser"
DB_PASS = os.getenv('READONLY_USER_PASSWORD')
# set connection variables
DB_HOST = "127.0.0.1"
DB_PORT = "5452"
DB_NAME = "hllworkerdb"
In [10]:
db_connection_hll = psycopg2.connect(
        host=DB_HOST,
        port=DB_PORT,
        dbname=DB_NAME,
        user=DB_USER,
        password=DB_PASS
)
db_conn_hll = tools.DbConn(db_connection_hll)
cur_hll = db_connection_hll.cursor()
cur_hll.execute("SELECT 1;")
print(cur_hll.statusmessage)
SELECT 1
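Optionally, verify that the HLL extension is actually installed in the target database (a quick sanity check, not part of the original workflow):

cur_hll.execute(
    "SELECT extname, extversion FROM pg_extension WHERE extname = 'hll';")
print(cur_hll.fetchall())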

Simplify query access:

In [11]:
db_conn = tools.DbConn(db_connection_hll)
db_conn.query("SELECT 1;")
Out[11]:
?column?
0 1
If any SQL results in an error, the cursor cannot be used again. In this case, run db_connection_hll.rollback() once to reset the cursor.
In [41]:
db_connection_hll.rollback()
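A small helper can automate this pattern (a sketch; the notebook itself calls rollback() manually where needed):

def run_sql_safe(cur, sql: str):
    """Execute SQL; on error, roll back so the connection stays usable."""
    try:
        cur.execute(sql)
        print(cur.statusmessage)
    except psycopg2.Error as e:
        cur.connection.rollback()
        print(f"Query failed, transaction rolled back: {e}")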

Create Query Schema

Create a new schema called mviews and update Postgres search_path, to include new schema:

In [12]:
sql_query = """
CREATE SCHEMA IF NOT EXISTS mviews;
ALTER DATABASE hlldb
SET search_path = "$user",
                  social,
                  spatial,
                  temporal,
                  topical,
                  interlinkage,
                  extensions,
                  mviews;"""

Since the above query will not return any result, we'll use the psycopg2 cursor object directly:

In [13]:
cur = db_connection_hll.cursor()
cur.execute(sql_query)
print(cur.statusmessage)
ALTER DATABASE

Using a Foreign Table, this step establishes the connection between hlldb and rawdb.

On hlldb, install postgres_fdw extension:

In [14]:
sql_query = """
CREATE EXTENSION IF NOT EXISTS postgres_fdw SCHEMA extensions;
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
CREATE EXTENSION

iNaturalist Aves dataset

Check if foreign table has been imported already:

In [21]:
raw_table_name = 'inaturalist_birds_reduced'
In [22]:
result = tools.check_table_exists(db_conn_hll, raw_table_name)
print(result)
True

Conditionally load the password. This only needs to be done once, if the server hasn't been added before.

In [23]:
if not result:
    import getpass
    USER_KEY = getpass.getpass()

Create Foreign Server connection to rawdb, on hlldb:

In [25]:
if not result:
    sql_query = f"""
    CREATE SERVER IF NOT EXISTS lbsnraw 
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (
        host '{DB_HOST_RAWDB}',
        dbname '{DB_NAME_RAWDB}',
        port '5432',
        keepalives '1',
        keepalives_idle '30',
        keepalives_interval '10',
        keepalives_count '5',
        fetch_size '500000');
    CREATE USER MAPPING IF NOT EXISTS for postgres
        SERVER lbsnraw 
        OPTIONS (user 'lbsn_reader', password '{USER_KEY}');
    """
    cur_hll.execute(sql_query)
    print(cur_hll.statusmessage)

Import the foreign table definition on hlldb.

In [26]:
sql_query = f"""
IMPORT FOREIGN SCHEMA mviews
    LIMIT TO (
        {raw_table_name})
    FROM SERVER lbsnraw 
    INTO mviews;
"""
# only import table 
# if it hasn't been imported already
if not result:
    cur_hll.execute(sql_query)
    print(cur_hll.statusmessage)

Test:

In [27]:
db_conn.query(f"SELECT * FROM mviews.{raw_table_name} LIMIT 10;")
Out[27]:
post_guid user_guid post_create_date post_latlng emoji
0 152813902 45f2c247710cc880f54ac4f9ea511d61 2023-03-28 12:56:00 0101000020E6100000EE3D5C72DC1B4C40054F2157EAFD... [🐦]
1 6255134 3788ba855a73128265c739b89b087ea7 2017-01-29 13:09:00 0101000020E610000015E126A3CA2E4C40FBEAAA402D06... [🐦]
2 142298717 eff17bae33eb23518c139c2df4865e71 2022-11-18 15:07:00 0101000020E61000000DFB3DB14EF74B40DD99098673A5... [🐦]
3 131162197 45f2c247710cc880f54ac4f9ea511d61 2022-08-17 08:22:00 0101000020E61000002F6EA301BC274C40817C09151C56... [🐦]
4 145574835 508051355f706c5ebc1b1bb882e16fa6 2023-01-01 00:17:37 0101000020E6100000F96A47718E3A4B40234DBC033C71... [🐦]
5 82046712 ef6162c046df2b642ff770a9d2f0dd5a 2021-05-25 19:15:00 0101000020E61000001E1B81785DB94B40E6797077D636... [🐦]
6 20964717 b21dfb148d20b1febdd8d86417f925c1 2019-02-07 19:13:00 0101000020E61000006AC18BBE822C4B40AC3C81B0537C... [🐦]
7 105140579 aedc59674ef56669a136181004078c17 2022-01-18 08:38:00 0101000020E6100000CAFD0E4581904B40F75B3B511212... [🐦]
8 120205737 92503b5c26f1da882c65713c665678fd 2022-06-04 23:03:02 0101000020E61000003259DC7F64904B401F2C63433713... [🐦]
9 36524505 2c5eba98a6b8c04d1e726ec2516434dd 2019-12-10 10:53:01 0101000020E6100000F0DDE68D93BE4B406E13EE95794F... [🐦]

Commit changes to hlldb

In [23]:
db_connection_hll.commit()

Prepare conversion of raw data to hll

HyperLogLog parameters

The Citus HyperLogLog extension for Postgres that we're using here contains several tweaks to optimize performance that can affect the sensitivity of the data.

From a privacy perspective, for example, it is recommended to disable explicit mode.

Explicit mode? When explicit mode is active, full IDs will be stored for small sets. In our case, any coordinates frequented by only a few users (outliers) would store full user and post IDs.

To disable explicit mode:

In [28]:
db_conn_hll.query("SELECT hll_set_defaults(11, 5, 0, 1);")
Out[28]:
hll_set_defaults
0 (11,5,-1,1)
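For reference, the four arguments of hll_set_defaults() are log2m, regwidth, expthresh, and sparseon. With log2m = 11, the expected counting error can be estimated from the standard HyperLogLog error formula, ~1.04/√m (a reading aid, not part of the processing):

import math

log2m = 11                            # number of registers = 2^log2m
registers = 2 ** log2m                # 2048 registers, 5 bits each (regwidth = 5)
rel_error = 1.04 / math.sqrt(registers)
print(f"Expected relative error: {rel_error:.1%}")  # ~2.3%
# expthresh = 0 disables explicit mode (no raw IDs stored for small sets)
# sparseon = 1 keeps the sparse storage format enabled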
In [57]:
db_conn_hll.query("SELECT wkb_geometry from spatial.milvus_milvus_range_sub;")
Out[57]:
wkb_geometry
0 0106000020E61000005A00000001030000000100000004...

Aggregation step

  • Convert data to HLL
  • filter by space (Milvus milvus range)
  • group by month, year
  • order by year, month
In [61]:
def materialized_view_hll(
        table_name_src: str, table_name_dest: str,
        schema: Optional[str] = None,
        additional_cols: Optional[List[str]] = None) -> str:
    """Returns raw SQL for creating a materialized view with HLL aggregates"""
    if schema is None:
        schema = 'mviews'
    if additional_cols is None:
        additional_cols = []
    return f"""
        DROP MATERIALIZED VIEW IF EXISTS {schema}.{table_name_dest};
        
        CREATE MATERIALIZED VIEW {schema}.{table_name_dest} AS
            WITH polies AS (SELECT wkb_geometry from spatial.milvus_milvus_range_sub)
            SELECT 
                EXTRACT(MONTH FROM post_create_date) AS "month",
                EXTRACT(YEAR FROM post_create_date) AS "year",
                hll_add_agg((hll_hash_text(post_guid))) AS "post_hll",
                hll_add_agg((hll_hash_text(user_guid))) AS "user_hll"
                {''.join([f",{x}" for x in additional_cols])}
            FROM {schema}.{table_name_src}, polies
            WHERE ST_Intersects(post_latlng, wkb_geometry)
            GROUP BY year, month{''.join([f",{x}" for x in additional_cols])}
            ORDER BY year ASC, month ASC;
        """
In [60]:
db_connection_hll.rollback()
In [62]:
%%time
destination_table = "inaturalist_birds_month"
origin_table = raw_table_name
sql_query = materialized_view_hll(
    table_name_src=origin_table, table_name_dest=destination_table)
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
SELECT 481
CPU times: user 999 µs, sys: 385 µs, total: 1.38 ms
Wall time: 18.2 s

Test:

In [63]:
db_conn.query(f"SELECT * FROM mviews.{destination_table} LIMIT 10;")
Out[63]:
month year post_hll user_hll
0 5.0 1920.0 \x138b4038c25fc1 \x138b404023
1 4.0 1962.0 \x138b4012c4 \x138b40fee1
2 9.0 1962.0 \x138b407222 \x138b40d6e2
3 7.0 1965.0 \x138b4068a1 \x138b4073a3
4 9.0 1965.0 \x138b40594486e19323 \x138b4073a3
5 1.0 1966.0 \x138b404782 \x138b408b85
6 2.0 1966.0 \x138b40de01 \x138b404b41
7 5.0 1966.0 \x138b40064445614be17b448541c9c3 \x138b4073a3
8 6.0 1966.0 \x138b400702 \x138b40fee1
9 5.0 1967.0 \x138b4049a1d2e3 \x138b4073a37582
In [64]:
db_connection_hll.commit()

Export data as CSV

Save the HLL data to CSV. The following columns are available from the materialized view created above:

  • year - distinct year
  • month - distinct month
  • post_hll - approximate post GUIDs, stored as an HLL set
  • user_hll - approximate user GUIDs, stored as an HLL set
In [65]:
sql_query = f"""
    SELECT  year,
            month,
            post_hll,
            user_hll
    FROM mviews.{destination_table};
    """
df = db_conn.query(sql_query)
# use type int instead of float
time_cols = ["year", "month"]
# drop where time cols are invalid 
df.dropna(subset=time_cols, inplace=True)
# turn float to int
for col in time_cols:
    df[col] = df[col].astype(int)
# we can also remove any rows where the year is < 2007
df.drop(df[df['year'] < 2007].index, inplace=True)
In [66]:
df.head()
Out[66]:
year month post_hll user_hll
284 2007 1 \x138b40026402a204010542096210a116221a811b011c... \x138b400841136518e12681276228e1296232c355a15c...
285 2007 2 \x138b4004a105a30e810f45106211c115c1166519c11a... \x138b400501296232c33a413ae14b214ca150e155a16c...
286 2007 3 \x138b40014101a2066309210a220a810ac20b050be20c... \x138b4002210282070116611862276229622e8332c33a...
287 2007 4 \x138b40016101a203e10463058306c107e1084308810b... \x138b400421070108640b010fe31342252125e426a127...
288 2007 5 \x138b40022102a302e304a5078207a808a10b210c410d... \x138b4007010fa11ba526a1276228022b012da12e8332...
In [67]:
usecols = ["year", "month", "post_hll", "user_hll"]
df.to_csv(
    OUTPUT / f"{destination_table}.csv",
    mode='w', columns=usecols,
    index=False, header=True)
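The exported HLL sets remain useful after the raw data is removed: for example, the monthly sets can be unioned per year to estimate distinct yearly users. A sketch, run against the materialized view (hll_union_agg and hll_cardinality are provided by the Citus HLL extension):

sql_query = f"""
    SELECT year,
           hll_cardinality(hll_union_agg(user_hll))::int AS user_estimate
    FROM mviews.{destination_table}
    GROUP BY year
    ORDER BY year;
    """
display(db_conn.query(sql_query).head())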

Flickr Cherry dataset

Check if foreign table has been imported already:

In [20]:
raw_table_name = 'flickr_cherries_reduced'
In [21]:
result = tools.check_table_exists(db_conn_hll, raw_table_name)
print(result)
False

Conditionally load the password. This only needs to be done once, if the server hasn't been added before.

Import the foreign table definition on hlldb.

In [22]:
sql_query = f"""
IMPORT FOREIGN SCHEMA mviews
    LIMIT TO (
        {raw_table_name})
    FROM SERVER lbsnraw 
    INTO mviews;
"""
# only import table 
# if it hasn't been imported already
if not result:
    cur_hll.execute(sql_query)
    print(cur_hll.statusmessage)
IMPORT FOREIGN SCHEMA

Test:

In [23]:
db_conn.query(f"SELECT * FROM mviews.{raw_table_name} LIMIT 10;")
Out[23]:
post_guid user_guid post_create_date post_latlng
0 k0CVR8YLJcX+txr2omtAqQRcSobKUU5NWc76XlZWG/4 xXwaWu7h0F8LZp1ZqPbGxaijfYZ65U7omtnkf3F8hxs 2013-05-08 16:36:46 0101000020E610000035CF11F92EE5F7BF0FD253E410B1...
1 lzCL/fdFUuGA7cCPOrsECZC/4LzpSxbH7khVAbXBqoA JZkKtw6Ds8qMDqM33z6jmqck4BwoQ4AIirdcy0fVGno 2010-04-30 16:24:53 0101000020E61000003A4030478FDFF7BFF3E7DB82A5B0...
2 a9n1ML9khGs1bVx4DwaHfgM1qxZqMwP2HOep8CiU+ME JZkKtw6Ds8qMDqM33z6jmqck4BwoQ4AIirdcy0fVGno 2010-04-30 16:24:30 0101000020E61000003A4030478FDFF7BFF3E7DB82A5B0...
3 vatbfVaVEOgUQcaNkJEiXbm1hBYB7TISWxddkCBTsW8 JZkKtw6Ds8qMDqM33z6jmqck4BwoQ4AIirdcy0fVGno 2010-04-30 16:23:53 0101000020E61000003A4030478FDFF7BFF3E7DB82A5B0...
4 EfKlhwJQEaobcBm8SmH3VZ1h9X3+J4NtxZQdGYIF0Fs 63F6LbXI5lij45Tz0CLvC8ezEKBzQ5IkWNwq598CtB4 2014-04-07 14:47:08 0101000020E6100000FB22A12DE7D2F7BF5053CBD6FAAA...
5 Ki2CL+n2fRFU3vtfJz1+dJTyaPlmgmwDNNZZcvj+pII vb7CRl+k+/XS5/q+aJv6hXOQkr6br2UF0xETgL58ziE 2008-05-03 18:22:06 0101000020E610000050AA7D3A1E33F7BF397D3D5FB370...
6 UgKW7+6kbhUCNtQLo6rGdCtNf5kJp3+wnAXvr3ez1qE vb7CRl+k+/XS5/q+aJv6hXOQkr6br2UF0xETgL58ziE 2008-05-03 18:22:06 0101000020E610000050AA7D3A1E33F7BF397D3D5FB370...
7 gP6R2IN1GfX3SXHSMu4wLwa6GZkT6ukyCCiPdPrTEGI vb7CRl+k+/XS5/q+aJv6hXOQkr6br2UF0xETgL58ziE 2008-05-03 18:21:13 0101000020E610000050AA7D3A1E33F7BF397D3D5FB370...
8 5y+xOCu9JvUSXkISwYoU5vFHEX6CPjdAT3Ex9axM6wI vb7CRl+k+/XS5/q+aJv6hXOQkr6br2UF0xETgL58ziE 2008-05-03 18:14:18 0101000020E610000050AA7D3A1E33F7BF397D3D5FB370...
9 H0NmLpSoiZrEdT2dA3UzF4OZfECX4kQH1SbQw8BXrng vb7CRl+k+/XS5/q+aJv6hXOQkr6br2UF0xETgL58ziE 2008-05-03 18:13:33 0101000020E610000050AA7D3A1E33F7BF397D3D5FB370...
In [24]:
db_conn.query(f"SELECT count(*) FROM mviews.{raw_table_name};")
Out[24]:
count
0 100662

Commit changes to hlldb

In [25]:
db_connection_hll.commit()

Prepare conversion of raw data to hll

HyperLogLog parameters

The Citus HyperLogLog extension for Postgres that we're using here contains several tweaks to optimize performance that can affect the sensitivity of the data.

From a privacy perspective, for example, it is recommended to disable explicit mode.

Explicit mode? When explicit mode is active, full IDs will be stored for small sets. In our case, any coordinates frequented by only a few users (outliers) would store full user and post IDs.

To disable explicit mode:

In [26]:
db_conn_hll.query("SELECT hll_set_defaults(11, 5, 0, 1);")
Out[26]:
hll_set_defaults
0 (11,5,-1,1)

Aggregation step

  • Convert data to HLL
  • group by month, year
  • order by year, month
In [32]:
def materialized_view_hll(
        table_name_src: str, table_name_dest: str,
        schema: Optional[str] = None,
        additional_cols: Optional[List[str]] = None) -> str:
    """Returns raw SQL for creating a materialized view with HLL aggregates"""
    if schema is None:
        schema = 'mviews'
    if additional_cols is None:
        additional_cols = []
    return f"""
        DROP MATERIALIZED VIEW IF EXISTS {schema}.{table_name_dest};
        
        CREATE MATERIALIZED VIEW {schema}.{table_name_dest} AS
            SELECT 
                EXTRACT(MONTH FROM post_create_date) AS "month",
                EXTRACT(YEAR FROM post_create_date) AS "year",
                hll_add_agg((hll_hash_text(post_guid))) AS "post_hll",
                hll_add_agg((hll_hash_text(user_guid))) AS "user_hll"
                {''.join([f",{x}" for x in additional_cols])}
            FROM {schema}.{table_name_src}
            GROUP BY year, month{''.join([f",{x}" for x in additional_cols])}
            ORDER BY year ASC, month ASC;
        """
In [33]:
db_connection_hll.rollback()
In [34]:
%%time
destination_table = "flickr_cherries_hll"
origin_table = raw_table_name
sql_query = materialized_view_hll(
    table_name_src=origin_table, table_name_dest=destination_table)
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
SELECT 271
CPU times: user 0 ns, sys: 1.33 ms, total: 1.33 ms
Wall time: 297 ms

Test:

In [35]:
db_conn.query(f"SELECT * FROM mviews.{destination_table} LIMIT 10;")
Out[35]:
month year post_hll user_hll
0 1.0 1910.0 \x138b4089c1 \x138b405fe1
1 5.0 1920.0 \x138b407a01 \x138b405fe1
2 4.0 1922.0 \x138b40e6c1 \x138b40e283
3 1.0 1947.0 \x138b4074e2 \x138b40ce81
4 1.0 1965.0 \x138b408942 \x138b407847
5 4.0 1965.0 \x138b404fc5 \x138b407381
6 5.0 1968.0 \x138b407703 \x138b4060e1
7 1.0 1970.0 \x138b400323 \x138b40c5a2
8 5.0 1970.0 \x138b400b0271e2 \x138b40dca7
9 3.0 1976.0 \x138b4082e1 \x138b407b02
In [36]:
db_connection_hll.commit()

Export data as CSV

Save the HLL data to CSV. The following columns are available from the materialized view created above:

  • year - distinct year
  • month - distinct month
  • post_hll - approximate post GUIDs, stored as an HLL set
  • user_hll - approximate user GUIDs, stored as an HLL set
In [37]:
sql_query = f"""
    SELECT  year,
            month,
            post_hll,
            user_hll
    FROM mviews.{destination_table};
    """
df = db_conn.query(sql_query)
# use type int instead of float
time_cols = ["year", "month"]
# drop where time cols are invalid 
df.dropna(subset=time_cols, inplace=True)
# turn float to int
for col in time_cols:
    df[col] = df[col].astype(int)
# we can also remove any rows where the year is < 2007
df.drop(df[df['year'] < 2007].index, inplace=True)
In [38]:
df.head()
Out[38]:
year month post_hll user_hll
83 2007 1 \x138b400023038106e109820b210b41124115c116411b... \x138b400ea10f441ba5200128a12a623a233be15c4265...
84 2007 2 \x138b4004a30ae22d612e83308241a14ea25021504253... \x138b4019812262246133413d423d8140a445635dc25f...
85 2007 3 \x148b400040108c0100000100010040110c6000c00000... \x138b400327058206610a220aa30b860ba70cc10ce30f...
86 2007 4 \x148b401046008c2008c02084441880008cc100820108... \x138b40000101610183030103a303e20482050305e106...
87 2007 5 \x138b40010301e304c404e40546066209a10fa2114312... \x138b4000a3042106c115c1182319031a821d241e0121...
In [39]:
usecols = ["year", "month", "post_hll", "user_hll"]
df.to_csv(
    OUTPUT / f"{destination_table}.csv",
    mode='w', columns=usecols,
    index=False, header=True)

Create notebook HTML

In [1]:
!jupyter nbconvert --to html_toc \
    --output-dir=../resources/html/ ./08_milvus_conversion.ipynb \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False >&- 2>&-

IOER RDC Jupyter Base Template v0.10.0