Temporal landscapes: Reddit Data Import¶
Alexander Dunkel, Leibniz Institute of Ecological Urban and Regional Development,
Transformative Capacities & Research Data Centre (IÖR-FDZ)
Publication:
Dunkel, A., Burghardt, D. (2024). Assessing perceived landscape change from opportunistic spatio-temporal occurrence data. Land 2024
This notebook transforms collected data into quantitative HyperLogLog data for analysis and archiving. The original raw data can be removed afterwards. See publication:
Dunkel, A., Löchner, M., & Burghardt, D. (2020). Privacy-Aware Visualization of Volunteered Geographic Information (VGI) to Analyze Spatial Activity: A Benchmark Implementation. ISPRS International Journal of Geo-Information, 9(10), 607. https://doi.org/10.3390/ijgi9100607
Prepare environment¶
To run this notebook, you have two options as a starting point:
1. Create an environment with the packages and versions shown in the following cell. You may use the latest conda environment_default.yml from our Carto-Lab Docker container.
2. If Docker is available to you, we suggest using the Carto-Lab Docker Container.
Clone the repository and edit the JUPYTER_NOTEBOOKS value in your .env to point to the directory where this notebook can be found, e.g.:
git clone https://gitlab.vgiscience.de/lbsn/tools/jupyterlab.git
cd jupyterlab
cp .env.example .env
nano .env
## Enter:
# JUPYTER_NOTEBOOKS=~/notebooks/ephemeral_events
# TAG=v0.12.3
docker network create lbsn-network
docker-compose pull && docker-compose up -d
Load dependencies:
import os, sys
from pathlib import Path
import psycopg2
import geopandas as gp
import pandas as pd
import matplotlib.pyplot as plt
from typing import List, Tuple, Dict, Optional
from IPython.display import clear_output, display, HTML
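Optionally, report the installed versions of the core dependencies to document the environment (a minimal sketch using importlib.metadata; the package list is an assumption based on the imports above):
from importlib.metadata import version, PackageNotFoundError
# Print installed versions of the main dependencies; psycopg2 may be
# installed as "psycopg2-binary", hence both names are tried
for pkg in ["geopandas", "pandas", "matplotlib", "psycopg2", "psycopg2-binary"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        pass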
To reduce the code shown in this notebook, some helper methods are made available in a separate file.
Load the helper module from ../py/modules/base/tools.py.
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules.base import tools
Activate autoreload of changed python files:
%load_ext autoreload
%autoreload 2
Parameters¶
Define initial parameters that affect processing
WORK_DIR = Path.cwd().parents[0] / "tmp" # Working directory
OUTPUT = Path.cwd().parents[0] / "out" # Define path to output directory (figures etc.)
for folder in [WORK_DIR, OUTPUT]:
    folder.mkdir(exist_ok=True)
Load environment variables from the .env dotfile:
from dotenv import load_dotenv
load_dotenv(
Path.cwd().parents[0] / '.env', override=True)
DB_NAME_RAWDB = os.getenv("DB_NAME_RAWDB") # lbsn-rawdb name
DB_HOST_RAWDB = os.getenv("DB_HOST_RAWDB") # lbsn-rawdb host
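As an optional sanity check (a sketch; variable names as defined above), fail early if required variables were not found in the .env file:
# These variables are needed for the connection steps below
for env_var in ("DB_NAME_RAWDB", "DB_HOST_RAWDB", "POSTGRES_PASSWORD"):
    if os.getenv(env_var) is None:
        raise ValueError(f"Environment variable {env_var} is not set.")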
Raw to HLL conversion¶
db_user = "postgres"
db_pass = os.getenv('POSTGRES_PASSWORD')
db_host = "lbsn-hlldb"
db_port = "5432"
db_name = "hlldb"
db_connection_hll = psycopg2.connect(
host=db_host,
port=db_port,
dbname=db_name,
user=db_user,
password=db_pass
)
db_conn_hll = tools.DbConn(db_connection_hll)
cur_hll = db_connection_hll.cursor()
cur_hll.execute("SELECT 1;")
print(cur_hll.statusmessage)
Simplify query access:
db_conn = tools.DbConn(db_connection_hll)
db_conn.query("SELECT 1;")
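For reference, a minimal, hypothetical sketch of what such a wrapper could look like (the actual implementation lives in ../py/modules/base/tools.py and may differ):
import pandas as pd
class DbConnSketch:
    """Hypothetical stand-in for tools.DbConn: wraps a DBAPI2 connection
    and returns query results as a pandas DataFrame."""
    def __init__(self, connection):
        self.connection = connection
    def query(self, sql: str) -> pd.DataFrame:
        # pd.read_sql accepts DBAPI2 connections such as psycopg2's
        return pd.read_sql(sql, self.connection)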
Create Query Schema¶
Create a new schema called mviews and update the Postgres search_path to include the new schema:
sql_query = """
CREATE SCHEMA IF NOT EXISTS mviews;
ALTER DATABASE hlldb
SET search_path = "$user",
social,
spatial,
temporal,
topical,
interlinkage,
extensions,
mviews;"""
Since the above query will not return any result, we'll directly use the psycopg2 cursor object:
cur = db_connection_hll.cursor()
cur.execute(sql_query)
print(cur.statusmessage)
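Note that ALTER DATABASE ... SET only takes effect for new sessions. To confirm what the current session uses, you can check explicitly (optional):
# SHOW returns the search_path of the current session; reconnect or run
# SET search_path manually if the new schema list is not yet active here
db_conn.query("SHOW search_path;")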
By using a Foreign Table, this step will establish the connection between hlldb and rawdb.
On hlldb, install postgres_fdw extension:
sql_query = """
CREATE EXTENSION IF NOT EXISTS postgres_fdw SCHEMA extensions;
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
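Optionally, verify that the extension is installed by querying the system catalog:
# pg_extension lists the extensions installed in the current database
db_conn.query("SELECT extname, extversion FROM pg_extension WHERE extname = 'postgres_fdw';")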
Check if the foreign table has been imported already:
sql_query = f"""
SELECT EXISTS (
SELECT FROM information_schema.tables
WHERE table_schema = 'mviews'
AND table_name = 'reddit_comments_nationalparks'
);
"""
result = db_conn_hll.query(sql_query)
result["exists"][0]
Conditionally load the password; this only needs to be done once, if the server hasn't been added before.
USER_KEY = None
if not result["exists"][0]:
    import getpass
    USER_KEY = getpass.getpass()
Create Foreign Server connection to rawdb, on hlldb:
sql_query = f"""
CREATE SERVER IF NOT EXISTS lbsnraw
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (
host '{DB_HOST_RAWDB}',
dbname '{DB_NAME_RAWDB}',
port '5432',
keepalives '1',
keepalives_idle '30',
keepalives_interval '10',
keepalives_count '5',
fetch_size '500000');
CREATE USER MAPPING IF NOT EXISTS for postgres
SERVER lbsnraw
OPTIONS (user 'lbsn_reader', password '{USER_KEY}');
"""
if not result["exists"][0]:
    cur_hll.execute(sql_query)
    print(cur_hll.statusmessage)
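Optionally, verify that the foreign server is now defined (a minimal check against the system catalog):
# pg_foreign_server lists foreign servers known to the current database
db_conn.query("SELECT srvname FROM pg_foreign_server;")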
Optional cleanup step:
result["exists"][0]Import foreign table definition on the hlldb.
sql_query = f"""
IMPORT FOREIGN SCHEMA mviews
LIMIT TO (
reddit_comments_nationalparks, reddit_nationalparks)
FROM SERVER lbsnraw
INTO mviews;
"""
# only import table
# if it hasn't been imported already
if not result["exists"][0]:
    cur_hll.execute(sql_query)
    print(cur_hll.statusmessage)
Test:
db_conn.query("SELECT * FROM mviews.reddit_nationalparks LIMIT 10;")
db_conn.query("SELECT * FROM mviews.reddit_comments_nationalparks LIMIT 10;")
Commit changes to hlldb
db_connection_hll.commit()
Prepare conversion of raw data to hll¶
HyperLogLog parameters
The HyperLogLog extension for Postgres from Citus that we're using here contains several performance tweaks that can affect the sensitivity of the data.
From a privacy perspective, for example, it is recommended to disable explicit mode.
What is explicit mode? When explicit mode is active, full IDs are stored for small sets. In our case, any coordinates frequented by only a few users (outliers) would store full user and post IDs.
To disable explicit mode:
db_conn_hll.query("SELECT hll_set_defaults(11, 5, 0, 1);")
Aggregation step: Convert data to Hll¶
cur_hll.execute("DROP MATERIALIZED VIEW IF EXISTS mviews.reddit_all_months;")
cur_hll.execute("DROP MATERIALIZED VIEW IF EXISTS mviews.reddit_comments_all_months;")
%%time
sql_query = f"""
CREATE MATERIALIZED VIEW mviews.reddit_all_months AS
SELECT
month,
year,
hll_add_agg((hll_hash_text(post_guid))) AS "post_hll",
hll_add_agg((hll_hash_text(user_guid))) AS "user_hll",
topic_group
FROM mviews.reddit_nationalparks
GROUP BY year,month,topic_group
ORDER BY
year ASC,
month ASC,
topic_group;
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
Test:
db_conn.query("SELECT * FROM mviews.reddit_all_months LIMIT 10;")
Repeat for comments. Here, we need to join (with submissions); otherwise, not all comments will have year and month values populated.
TODO: Some entries appear to be malformed (null for year, month, and reaction_guid/post_guid). These are excluded below.
%%time
sql_query = f"""
CREATE MATERIALIZED VIEW mviews.reddit_comments_all_months AS
SELECT
month,
year,
hll_add_agg((hll_hash_text(reaction_guid))) AS "post_hll",
hll_add_agg((hll_hash_text(user_guid))) AS "user_hll",
hll_add_agg((hll_hash_text(post_guid))) AS "referenced_post_hll",
topic_group
FROM mviews.reddit_comments_nationalparks
WHERE year IS NOT NULL
GROUP BY year,month,topic_group
ORDER BY
year ASC,
month ASC,
topic_group;
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
db_connection_hll.commit()
Export data as CSV¶
Save hll data to CSV. The following columns are available from mviews.reddit_all_months:
- year - distinct year
- month - month
- post_hll - approximate post guids stored as hll set
- user_hll - approximate user guids stored as hll set
- topic_group - the topical group (here: national park subreddits)
sql_query = f"""
SELECT year,
month,
post_hll,
user_hll,
topic_group
FROM mviews.reddit_all_months;
"""
df = db_conn.query(sql_query)
# use type int instead of float
for col in ["year", "month"]:
    df[col] = df[col].astype(int)
df.head()
usecols = ["year", "month", "post_hll", "user_hll", "topic_group"]
df.to_csv(
OUTPUT / "reddit_all_months.csv",
mode='w', columns=usecols,
index=False, header=True)
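Optionally, read the file back to verify the export (a minimal sanity check):
# Confirm the CSV round-trips with the expected columns and row count
df_check = pd.read_csv(OUTPUT / "reddit_all_months.csv")
print(len(df_check), list(df_check.columns))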
Repeat for comments:
sql_query = f"""
SELECT year,
month,
post_hll,
user_hll,
referenced_post_hll,
topic_group
FROM mviews.reddit_comments_all_months;
"""
df = db_conn.query(sql_query)
# use type int instead of float
for col in ["year", "month"]:
    df[col] = df[col].fillna(-1).astype(int)
usecols = ["year", "month", "post_hll", "user_hll", "referenced_post_hll", "topic_group"]
df.to_csv(
OUTPUT / "reddit_comments_all_months.csv",
mode='w', columns=usecols,
index=False, header=True)
Create notebook HTML¶
!jupyter nbconvert --to html_toc \
--output-dir=../resources/html/ ./04_reddit_privacy.ipynb \
--template=../nbconvert.tpl \
--ExtractOutputPreprocessor.enabled=False >&- 2>&-