Temporal landscapes: Reddit Data Import¶
Alexander Dunkel, Leibniz Institute of Ecological Urban and Regional Development,
Transformative Capacities & Research Data Centre (IÖR-FDZ)
Publication:
Dunkel, A., Burghardt, D. (2024). Assessing perceived landscape change from opportunistic spatio-temporal occurrence data. Land 2024
This notebook transforms collected data into quantitative HyperLogLog data for analysis and archiving. The original raw data can be removed afterwards. See publication:
Dunkel, A., Löchner, M., & Burghardt, D. (2020). Privacy-Aware Visualization of Volunteered Geographic Information (VGI) to Analyze Spatial Activity: A Benchmark Implementation. ISPRS International Journal of Geo-Information, 9(10), 607. https://doi.org/10.3390/ijgi9100607
Prepare environment¶
To run this notebook, you have two options as a starting point:
1. Create an environment with the packages and versions shown in the following cell. You may use the latest conda environment_default.yml from our Carto-Lab Docker container.
2. If Docker is available to you, we suggest using the Carto-Lab Docker Container.
Clone the repository and edit the JUPYTER_NOTEBOOKS value in your .env to point to the directory where this notebook can be found, e.g.:
git clone https://gitlab.vgiscience.de/lbsn/tools/jupyterlab.git
cd jupyterlab
cp .env.example .env
nano .env
## Enter:
# JUPYTER_NOTEBOOKS=~/notebooks/ephemeral_events
# TAG=v0.12.3
docker network create lbsn-network
docker-compose pull && docker-compose up -d
Load dependencies:
import os, sys
from pathlib import Path
import psycopg2
import geopandas as gp
import pandas as pd
import matplotlib.pyplot as plt
from typing import List, Tuple, Dict, Optional
from IPython.display import clear_output, display, HTML
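Optionally, report the installed versions of the core dependencies to document the environment (a minimal sketch using importlib.metadata; the package list is an assumption based on the imports above):
from importlib.metadata import version, PackageNotFoundError
# Print installed versions of the main dependencies; psycopg2 may be
# installed as "psycopg2-binary", hence both names are tried
for pkg in ["geopandas", "pandas", "matplotlib", "psycopg2", "psycopg2-binary"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        pass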
To reduce the code shown in this notebook, some helper methods are made available in a separate file.
Load the helper module from ../py/modules/base/tools.py.
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules.base import tools
Activate autoreload of changed python files:
%load_ext autoreload
%autoreload 2
Parameters¶
Define initial parameters that affect processing
WORK_DIR = Path.cwd().parents[0] / "tmp" # Working directory
OUTPUT = Path.cwd().parents[0] / "out" # Define path to output directory (figures etc.)
for folder in [WORK_DIR, OUTPUT]:
    folder.mkdir(exist_ok=True)
Load environment variables from the .env dotfile:
from dotenv import load_dotenv
load_dotenv(
Path.cwd().parents[0] / '.env', override=True)
DB_NAME_RAWDB = os.getenv("DB_NAME_RAWDB") # lbsn-rawdb name
DB_HOST_RAWDB = os.getenv("DB_HOST_RAWDB") # lbsn-rawdb host
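As an optional sanity check (a sketch; variable names as defined above), fail early if required variables were not found in the .env file:
# These variables are needed for the connection steps below
for env_var in ("DB_NAME_RAWDB", "DB_HOST_RAWDB", "POSTGRES_PASSWORD"):
    if os.getenv(env_var) is None:
        raise ValueError(f"Environment variable {env_var} is not set.")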
Raw to HLL conversion¶
db_user = "postgres"
db_pass = os.getenv('POSTGRES_PASSWORD')
db_host = "lbsn-hlldb"
db_port = "5432"
db_name = "hlldb"
db_connection_hll = psycopg2.connect(
host=db_host,
port=db_port,
dbname=db_name,
user=db_user,
password=db_pass
)
db_conn_hll = tools.DbConn(db_connection_hll)
cur_hll = db_connection_hll.cursor()
cur_hll.execute("SELECT 1;")
print(cur_hll.statusmessage)
Simplify query access:
db_conn = tools.DbConn(db_connection_hll)
db_conn.query("SELECT 1;")
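For reference, a minimal, hypothetical sketch of what such a wrapper could look like (the actual implementation lives in ../py/modules/base/tools.py and may differ):
import pandas as pd
class DbConnSketch:
    """Hypothetical stand-in for tools.DbConn: wraps a DBAPI2 connection
    and returns query results as a pandas DataFrame."""
    def __init__(self, connection):
        self.connection = connection
    def query(self, sql: str) -> pd.DataFrame:
        # pd.read_sql accepts DBAPI2 connections such as psycopg2's
        return pd.read_sql(sql, self.connection)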
Create Query Schema¶
Create a new schema called mviews and update the Postgres search_path to include the new schema:
sql_query = """
CREATE SCHEMA IF NOT EXISTS mviews;
ALTER DATABASE hlldb
SET search_path = "$user",
social,
spatial,
temporal,
topical,
interlinkage,
extensions,
mviews;"""
Since the above query will not return any result, we'll directly use the psycopg2 cursor object:
cur = db_connection_hll.cursor()
cur.execute(sql_query)
print(cur.statusmessage)
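Note that ALTER DATABASE ... SET only takes effect for new sessions. To confirm what the current session uses, you can check explicitly (optional):
# SHOW returns the search_path of the current session; reconnect or run
# SET search_path manually if the new schema list is not yet active here
db_conn.query("SHOW search_path;")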
By using a Foreign Table, this step will establish the connection between hlldb and rawdb.
On hlldb, install postgres_fdw extension:
sql_query = """
CREATE EXTENSION IF NOT EXISTS postgres_fdw SCHEMA extensions;
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
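Optionally, verify that the extension is installed by querying the system catalog:
# pg_extension lists the extensions installed in the current database
db_conn.query("SELECT extname, extversion FROM pg_extension WHERE extname = 'postgres_fdw';")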
Check if the foreign table has been imported already:
sql_query = f"""
SELECT EXISTS (
SELECT FROM information_schema.tables
WHERE table_schema = 'mviews'
AND table_name = 'reddit_comments_nationalparks'
);
"""
result = db_conn_hll.query(sql_query)
result["exists"][0]
Conditionally load the password; this only needs to be done once, if the server hasn't been added before.
USER_KEY = None
if not result["exists"][0]:
    import getpass
    USER_KEY = getpass.getpass()
Create Foreign Server connection to rawdb, on hlldb:
sql_query = f"""
CREATE SERVER IF NOT EXISTS lbsnraw
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (
host '{DB_HOST_RAWDB}',
dbname '{DB_NAME_RAWDB}',
port '5432',
keepalives '1',
keepalives_idle '30',
keepalives_interval '10',
keepalives_count '5',
fetch_size '500000');
CREATE USER MAPPING IF NOT EXISTS for postgres
SERVER lbsnraw
OPTIONS (user 'lbsn_reader', password '{USER_KEY}');
"""
if not result["exists"][0]:
    cur_hll.execute(sql_query)
    print(cur_hll.statusmessage)
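Optionally, verify that the foreign server is now defined (a minimal check against the system catalog):
# pg_foreign_server lists foreign servers known to the current database
db_conn.query("SELECT srvname FROM pg_foreign_server;")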
Optional cleanup step:
result["exists"][0]Import foreign table definition on the hlldb.
sql_query = f"""
IMPORT FOREIGN SCHEMA mviews
LIMIT TO (
reddit_comments_nationalparks, reddit_nationalparks)
FROM SERVER lbsnraw
INTO mviews;
"""
# only import table
# if it hasn't been imported already
if not result["exists"][0]:
    cur_hll.execute(sql_query)
    print(cur_hll.statusmessage)
Test:
db_conn.query("SELECT * FROM mviews.reddit_nationalparks LIMIT 10;")
db_conn.query("SELECT * FROM mviews.reddit_comments_nationalparks LIMIT 10;")
Commit changes to hlldb
db_connection_hll.commit()
Prepare conversion of raw data to hll¶
HyperLogLog parameters
The HyperLogLog extension for Postgres from Citus that we're using here contains several performance tweaks that can affect the sensitivity of the data.
From a privacy perspective, for example, it is recommended to disable explicit mode.
What is explicit mode? When explicit mode is active, full IDs are stored for small sets. In our case, any coordinates frequented by only a few users (outliers) would store full user and post IDs.
To disable explicit mode:
db_conn_hll.query("SELECT hll_set_defaults(11, 5, 0, 1);")
Aggregation step: Convert data to Hll¶
cur_hll.execute("DROP MATERIALIZED VIEW IF EXISTS mviews.reddit_all_months;")
cur_hll.execute("DROP MATERIALIZED VIEW IF EXISTS mviews.reddit_comments_all_months;")
%%time
sql_query = f"""
CREATE MATERIALIZED VIEW mviews.reddit_all_months AS
SELECT
month,
year,
hll_add_agg((hll_hash_text(post_guid))) AS "post_hll",
hll_add_agg((hll_hash_text(user_guid))) AS "user_hll",
topic_group
FROM mviews.reddit_nationalparks
GROUP BY year,month,topic_group
ORDER BY
year ASC,
month ASC,
topic_group;
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
Test:
db_conn.query("SELECT * FROM mviews.reddit_all_months LIMIT 10;")
Repeat for comments. Here, we need to join (with submissions); otherwise, not all comments will have year and month values populated.
TODO: Some entries appear to be malformed (null for year, month, and reaction_guid/post_guid). These are excluded below.
%%time
sql_query = f"""
CREATE MATERIALIZED VIEW mviews.reddit_comments_all_months AS
SELECT
month,
year,
hll_add_agg((hll_hash_text(reaction_guid))) AS "post_hll",
hll_add_agg((hll_hash_text(user_guid))) AS "user_hll",
hll_add_agg((hll_hash_text(post_guid))) AS "referenced_post_hll",
topic_group
FROM mviews.reddit_comments_nationalparks
WHERE year IS NOT NULL
GROUP BY year,month,topic_group
ORDER BY
year ASC,
month ASC,
topic_group;
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
db_connection_hll.commit()
Export data as CSV¶
Save hll data to CSV. The following columns are available from mviews.reddit_all_months:
- year - distinct year
- month - month
- post_hll - approximate post guids stored as hll set
- user_hll - approximate user guids stored as hll set
- topic_group - the topical group (here: national park subreddits)
sql_query = f"""
SELECT year,
month,
post_hll,
user_hll,
topic_group
FROM mviews.reddit_all_months;
"""
df = db_conn.query(sql_query)
# use type int instead of float
for col in ["year", "month"]:
    df[col] = df[col].astype(int)
df.head()
usecols = ["year", "month", "post_hll", "user_hll", "topic_group"]
df.to_csv(
OUTPUT / "reddit_all_months.csv",
mode='w', columns=usecols,
index=False, header=True)
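Optionally, read the file back to verify the export (a minimal sanity check):
# Confirm the CSV round-trips with the expected columns and row count
df_check = pd.read_csv(OUTPUT / "reddit_all_months.csv")
print(len(df_check), list(df_check.columns))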
Repeat for comments:
sql_query = f"""
SELECT year,
month,
post_hll,
user_hll,
referenced_post_hll,
topic_group
FROM mviews.reddit_comments_all_months;
"""
df = db_conn.query(sql_query)
# use type int instead of float
for col in ["year", "month"]:
    df[col] = df[col].fillna(-1).astype(int)
usecols = ["year", "month", "post_hll", "user_hll", "referenced_post_hll", "topic_group"]
df.to_csv(
OUTPUT / "reddit_comments_all_months.csv",
mode='w', columns=usecols,
index=False, header=True)
Create notebook HTML¶
!jupyter nbconvert --to html_toc \
--output-dir=../resources/html/ ./04_reddit_privacy.ipynb \
--template=../nbconvert.tpl \
--ExtractOutputPreprocessor.enabled=False >&- 2>&-