Temporal landscapes: Reddit Data Import

Alexander Dunkel, Leibniz Institute of Ecological Urban and Regional Development,
Transformative Capacities & Research Data Centre (IÖR-FDZ)

Publication:
Dunkel, A., Burghardt, D. (2024). Assessing perceived landscape change from opportunistic spatio-temporal occurrence data. Land 2024


Last updated: Jul-18-2024, Carto-Lab Docker Version 0.14.0

This notebook transforms collected data into quantitative HyperLogLog data for analysis and archiving. The original raw data can be removed afterwards. See publication:

Dunkel, A., Löchner, M., & Burghardt, D. (2020). Privacy-Aware Visualization of Volunteered Geographic Information (VGI) to Analyze Spatial Activity: A Benchmark Implementation. ISPRS International Journal of Geo-Information, 9(10), 607. https://doi.org/10.3390/ijgi9100607

Prepare environment

To run this notebook, you have two options:

1. Create an environment with the packages and versions shown in the following cell.

You may use the latest conda environment_default.yml from our Carto-Lab Docker container as a starting point.

2. If Docker is available to you, we suggest using the Carto-Lab Docker Container.

Clone the repository and edit your .env values to point to the repository where this notebook can be found, e.g.:

git clone https://gitlab.vgiscience.de/lbsn/tools/jupyterlab.git
cd jupyterlab
cp .env.example .env
nano .env
## Enter:
# JUPYTER_NOTEBOOKS=~/notebooks/ephemeral_events
# TAG=v0.12.3
docker network create lbsn-network
docker-compose pull && docker-compose up -d
List of package versions used in this notebook
package          version
python           3.9.15
Fiona            1.8.20
Shapely          1.7.1
bokeh            2.4.3
colorcet         3.0.1
geopandas        0.13.2
geoviews         1.9.5
holoviews        1.14.8
hvplot           0.8.4
ipywidgets       8.0.7
mapclassify      2.5.0
matplotlib       3.7.1
matplotlib-venn  0.11.9
numpy            1.22.4
pandas           2.0.3
python-dotenv    1.0.0
xarray           2023.6.0

Load dependencies:

In [5]:
import os, sys
from pathlib import Path
import psycopg2
import geopandas as gp
import pandas as pd
import matplotlib.pyplot as plt
from typing import List, Tuple, Dict, Optional
from IPython.display import clear_output, display, HTML

To reduce the code shown in this notebook, some helper methods are made available in a separate file.

Load helper module from ../py/modules/base/tools.py.

In [6]:
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules.base import tools
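For orientation, the DbConn helper used throughout this notebook might look roughly like the following minimal sketch (hypothetical stand-in; the actual implementation is in ../py/modules/base/tools.py):

```python
import pandas as pd

class DbConn:
    """Minimal stand-in for the helper in tools.py (sketch, not the original)."""

    def __init__(self, db_connection):
        # any DB-API 2.0 connection, e.g. from psycopg2.connect()
        self.db_connection = db_connection

    def query(self, sql_query: str) -> pd.DataFrame:
        """Execute a SQL query and return the result as a DataFrame."""
        return pd.read_sql(sql_query, self.db_connection)
```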

Activate autoreload of changed python files:

In [7]:
%load_ext autoreload
%autoreload 2

Parameters

Define initial parameters that affect processing

In [8]:
WORK_DIR = Path.cwd().parents[0] / "tmp"     # Working directory                     
OUTPUT = Path.cwd().parents[0] / "out"       # Define path to output directory (figures etc.)
In [9]:
for folder in [WORK_DIR, OUTPUT]:
    folder.mkdir(exist_ok=True)

Load environment variables from the .env dotfile:

In [10]:
from dotenv import load_dotenv
load_dotenv(
    Path.cwd().parents[0] / '.env', override=True)
Out[10]:
True
In [11]:
DB_NAME_RAWDB = os.getenv("DB_NAME_RAWDB")    # lbsn-rawdb name
DB_HOST_RAWDB = os.getenv("DB_HOST_RAWDB")    # lbsn-rawdb host

Raw to HLL conversion

In [12]:
db_user = "postgres"
db_pass = os.getenv('POSTGRES_PASSWORD')
db_host = "lbsn-hlldb"
db_port = "5432"
db_name = "hlldb"
In [ ]:
db_connection_hll = psycopg2.connect(
        host=db_host,
        port=db_port,
        dbname=db_name,
        user=db_user,
        password=db_pass
)
db_conn_hll = tools.DbConn(db_connection_hll)
cur_hll = db_connection_hll.cursor()
cur_hll.execute("SELECT 1;")
print(cur_hll.statusmessage)

Simplify query access:

In [12]:
db_conn = tools.DbConn(db_connection_hll)
db_conn.query("SELECT 1;")
Out[12]:
?column?
0 1
If any SQL statement results in an error, the transaction is aborted and the cursor cannot be used again. In this case, run db_connection_hll.rollback() once, to reset the cursor:
db_connection_hll.rollback()
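This pattern can be wrapped in a small helper (a sketch using the generic DB-API; not part of tools.py):

```python
def run_with_rollback(connection, sql):
    """Run `sql`; on error, roll back so the connection stays usable."""
    cur = connection.cursor()
    try:
        cur.execute(sql)
        return cur.fetchall()
    except Exception:
        connection.rollback()  # reset the aborted transaction
        return None
```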

Create Query Schema

Create a new schema called mviews and update Postgres search_path, to include new schema:

In [14]:
sql_query = """
CREATE SCHEMA IF NOT EXISTS mviews;
ALTER DATABASE hlldb
SET search_path = "$user",
                  social,
                  spatial,
                  temporal,
                  topical,
                  interlinkage,
                  extensions,
                  mviews;"""

Since the above query will not return any result, we'll directly use the psycopg2 cursor object:

In [15]:
cur = db_connection_hll.cursor()
cur.execute(sql_query)
print(cur.statusmessage)
ALTER DATABASE

By using a Foreign Table, this step establishes the connection between hlldb and rawdb.

On hlldb, install postgres_fdw extension:

In [16]:
sql_query = """
CREATE EXTENSION IF NOT EXISTS postgres_fdw SCHEMA extensions;
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
CREATE EXTENSION

Check if foreign table has been imported already:

In [17]:
sql_query = f"""
SELECT EXISTS (
   SELECT FROM information_schema.tables 
   WHERE  table_schema = 'mviews'
   AND    table_name   = 'reddit_comments_nationalparks'
   );
"""
result = db_conn_hll.query(sql_query)
In [18]:
result["exists"][0]
Out[18]:
True

Conditionally load the password: this only needs to be done once, if the server hasn't been added before.

In [19]:
USER_KEY = None
if not result["exists"][0]:
    import getpass
    USER_KEY = getpass.getpass()

Create Foreign Server connection to rawdb, on hlldb:

In [20]:
sql_query = f"""
CREATE SERVER IF NOT EXISTS lbsnraw 
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (
    host '{DB_HOST_RAWDB}',
    dbname '{DB_NAME_RAWDB}',
    port '5432',
    keepalives '1',
    keepalives_idle '30',
    keepalives_interval '10',
    keepalives_count '5',
    fetch_size '500000');
CREATE USER MAPPING IF NOT EXISTS for postgres
    SERVER lbsnraw 
    OPTIONS (user 'lbsn_reader', password '{USER_KEY}');
"""
if not result["exists"][0]:
    cur_hll.execute(sql_query)
    print(cur_hll.statusmessage)

Optional cleanup step:

cur_hll.execute("DROP MATERIALIZED VIEW IF EXISTS mviews.reddit_all_months")
cur_hll.execute("DROP MATERIALIZED VIEW IF EXISTS mviews.reddit_comments_all_months")
cur_hll.execute("DROP FOREIGN TABLE IF EXISTS mviews.reddit_comments_nationalparks")
cur_hll.execute("DROP FOREIGN TABLE IF EXISTS mviews.reddit_nationalparks")

Import the foreign table definition on the hlldb.

In [21]:
sql_query = f"""
IMPORT FOREIGN SCHEMA mviews
    LIMIT TO (
        reddit_comments_nationalparks, reddit_nationalparks)
    FROM SERVER lbsnraw 
    INTO mviews;
"""
# only import table 
# if it hasn't been imported already
if not result["exists"][0]:
    cur_hll.execute(sql_query)
    print(cur_hll.statusmessage)

Test:

In [22]:
db_conn.query("SELECT * FROM mviews.reddit_nationalparks LIMIT 10;")
Out[22]:
post_guid user_guid month year topic_group
0 u6LMy2RvI63EuMJAobikLVUlFG1appRRTWgTggU4Jc8 jnQWV/Eh03nol0B70Bj1kmoN/70ZgoyoBYi042MfpgE 8.0 2020.0 everglades
1 ZizMBTiWGUHaZYaj5gR31G5nMLrNgFD3MbWJwM0IOGI 0fDoD91AaIFBmvfcDGdWe+8TA1bR54LdxZouVAnii+o 10.0 2022.0 shenandoah
2 9IAgHueGViF5154WC00lNf3w+YkIM8RsrntXV7dAefQ b7yoyrvtsgPY+g3MwUtVELjnbGrIHLlOrx3Ls4V7h2c 8.0 2022.0 yosemite
3 dBbOjZh4wA1s8DqBz7xleljR10mdkrJSu6BVXzg1+zI JEksCG13v1BDv+OQJkZLAzsQ9LAsVCzHPWT7o3VYMLM 10.0 2021.0 shenandoahpark
4 rLTkMjJNmsOfcdBS1DM0DGFb91JXqp/zV9Rr6+pOFJw bFbmf7bwbdT5j9Osv3Ug+2MUV2oIVKmsGjB3j5gf/+E 12.0 2017.0 yosemite
5 wLwTJvCzAshEgW70GZ0mPIm2W8V438yICCm8Z+W2bPw bFbmf7bwbdT5j9Osv3Ug+2MUV2oIVKmsGjB3j5gf/+E 11.0 2017.0 yosemite
6 MLVcMIPQdAl31NIlZIdAIhWE4wCzuATlTBB7LyQDUZw /rhonLKyQdfxGVQIQ+OTHJIoA52U69QG0JyEGkXEASg 6.0 2019.0 zionnationalpark
7 5du6h+UT8sCWcHlkzpWXJFl/N1d7TD0HGGhzxxiEj2E nDvEvOqF+CunYLuh4ZsvECNCu/vwom8YVEVRsdB5v3M 5.0 2021.0 glaciernationalpark
8 l4M0L/cAd88nGFuuDRZwDE5N93DMAwku3dbZlAL6YlI BCvbpiZGg0krhg4u2GDK4oRbraRo51/GsFLqw98AdCk 5.0 2018.0 glacier
9 KzrcC4fVq9bob5gh/5wQVV/LId7/ladTeXSE+Znezqo x1ZABc3rDCwM3JNJH9jCkYQQM9ynZVOKUtZapGRoHa0 1.0 2023.0 craterlake
In [23]:
db_conn.query("SELECT * FROM mviews.reddit_comments_nationalparks LIMIT 10;")
Out[23]:
reaction_guid user_guid post_guid month year topic_group
0 JCC3lr4wfS2LrKmUjhZuRsfWU5EWt9qOVAgthMrydqk ESMdDvo+921msoMilJuBLi/j7Qgusd7dKetYSfizDZ8 +WVIFvN8uZBkqWegU6QqpIwH4aboorxPQE2tgfslgFI 9.0 2018.0 glacier
1 MM77BsjROOJ3SrMe3Fp/94804/mLMPEohSIiMYoLHZw ESMdDvo+921msoMilJuBLi/j7Qgusd7dKetYSfizDZ8 +WVIFvN8uZBkqWegU6QqpIwH4aboorxPQE2tgfslgFI 9.0 2018.0 glacier
2 BYir56yywCJ1O+iKWrdfgwZhZJWZxMFiTGdEuZxENzE ESMdDvo+921msoMilJuBLi/j7Qgusd7dKetYSfizDZ8 +WVIFvN8uZBkqWegU6QqpIwH4aboorxPQE2tgfslgFI 9.0 2018.0 glacier
3 xz9Tl8fG/yp9e4mTop4byY1+mQmSW7gKcaju6P4N/U8 ESMdDvo+921msoMilJuBLi/j7Qgusd7dKetYSfizDZ8 +WVIFvN8uZBkqWegU6QqpIwH4aboorxPQE2tgfslgFI 9.0 2018.0 glacier
4 C+7NrjCr3X1c/M7nocgtZjJAPUinVIPW86oBx68/xqs ESMdDvo+921msoMilJuBLi/j7Qgusd7dKetYSfizDZ8 +WVIFvN8uZBkqWegU6QqpIwH4aboorxPQE2tgfslgFI 9.0 2018.0 glacier
5 g3N8Zw2TraT0Vylq3Z4+9cFZcs43WB+TGngrRmSLlww ESMdDvo+921msoMilJuBLi/j7Qgusd7dKetYSfizDZ8 +WVIFvN8uZBkqWegU6QqpIwH4aboorxPQE2tgfslgFI 9.0 2018.0 glacier
6 FUc3WkwKo8ZzLkRxq4aAPT9K6uczsQ0JGtA0cp9piEg AWPi6ULcoc+/KpM0E+d5MFiwgayztNOkRmAuzj9hfSM IjPrSkAFBpoWPAXJmfdnUU1QmhXfNINR8aHHDi4Ki3E 9.0 2018.0 glacier
7 h/wAD1EbA2WxEhkqLfE9c8dWUrsbrFiEEuNECy9OZjE AWPi6ULcoc+/KpM0E+d5MFiwgayztNOkRmAuzj9hfSM IjPrSkAFBpoWPAXJmfdnUU1QmhXfNINR8aHHDi4Ki3E 9.0 2018.0 glacier
8 vKindfySoQepwqPH+/fOVJx6NtptQDn3YcXLN8qKk9M AWPi6ULcoc+/KpM0E+d5MFiwgayztNOkRmAuzj9hfSM IjPrSkAFBpoWPAXJmfdnUU1QmhXfNINR8aHHDi4Ki3E 9.0 2018.0 glacier
9 trdIPaoy6CdxIpU+a3DrHhClxSyPEA4asdJRcsLMln8 AWPi6ULcoc+/KpM0E+d5MFiwgayztNOkRmAuzj9hfSM IjPrSkAFBpoWPAXJmfdnUU1QmhXfNINR8aHHDi4Ki3E 9.0 2018.0 glacier

Commit changes to hlldb

In [24]:
db_connection_hll.commit()

Prepare conversion of raw data to hll

HyperLogLog parameters

The HyperLogLog extension for Postgres from Citus that we're using here contains several tweaks to optimize performance that can affect the sensitivity of the data.

From a privacy perspective, for example, it is recommended to disable explicit mode.

What is explicit mode? When explicit mode is active, full IDs are stored for small sets. In our case, any coordinates frequented by only a few users (outliers) would store full user and post IDs.
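The mechanics can be illustrated with a toy model (purely conceptual; the threshold value and the re-encoding are simplifications, not the actual Citus implementation):

```python
EXPLICIT_THRESHOLD = 4  # hypothetical cutoff for the toy model

def add_item(state, item_hash):
    """Add a hashed item to a toy hll-like structure."""
    if state["mode"] == "explicit":
        state["items"].add(item_hash)        # raw hashes retained -> privacy risk
        if len(state["items"]) > EXPLICIT_THRESHOLD:
            state["mode"] = "probabilistic"  # a real hll re-encodes here
            state["items"] = set()           # raw hashes discarded
    return state
```

As long as the set stays small, the raw hashes remain recoverable, which is exactly what disabling explicit mode prevents.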

To disable explicit mode:

In [25]:
db_conn_hll.query("SELECT hll_set_defaults(11, 5, 0, 1);")
Out[25]:
hll_set_defaults
0 (11,5,-1,1)

Aggregation step: Convert data to Hll

In [44]:
cur_hll.execute("DROP MATERIALIZED VIEW IF EXISTS mviews.reddit_all_months;")
cur_hll.execute("DROP MATERIALIZED VIEW IF EXISTS mviews.reddit_comments_all_months;")
In [45]:
%%time
sql_query = f"""
CREATE MATERIALIZED VIEW mviews.reddit_all_months AS
    SELECT 
        month,
        year,
        hll_add_agg((hll_hash_text(post_guid))) AS "post_hll",
        hll_add_agg((hll_hash_text(user_guid))) AS "user_hll",
        topic_group
    FROM mviews.reddit_nationalparks
    GROUP BY year,month,topic_group
    ORDER BY
    year ASC,
    month ASC,
    topic_group;
    """
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
SELECT 2447
CPU times: user 915 µs, sys: 399 µs, total: 1.31 ms
Wall time: 226 ms

Test:

In [46]:
db_conn.query("SELECT * FROM mviews.reddit_all_months LIMIT 10;")
Out[46]:
month year post_hll user_hll topic_group
0 10.0 2010.0 \x138b40c8a2 \x138b40c303 everglades
1 2.0 2011.0 \x138b40fd82 \x138b40c303 everglades
2 2.0 2011.0 \x138b40014238a147c16a8185618a648c83d3c1efc2 \x138b4018214861750197a1c303dee1e021 yosemite
3 3.0 2011.0 \x138b409481 \x138b405e44 yosemite
4 4.0 2011.0 \x138b4020c177c27be294a3 \x138b4076a197a1c303 yosemite
5 5.0 2011.0 \x138b4090e5c1e4 \x138b404961c303 grandcanyon
6 5.0 2011.0 \x138b4011815c826621a341bf81d505e1a2fc43 \x138b4036c3c303 hotsprings
7 5.0 2011.0 \x138b400d4212c19cc49e41ae64af41 \x138b4003e21ac25e44c303dee1 yosemite
8 6.0 2011.0 \x138b4004c316e42782286131013722488281e4b084c0... \x138b4008a10f835e448062c303dca1e1a2ffa2 yosemite
9 7.0 2011.0 \x138b403081 \x138b405543 hotsprings
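As a sanity check, approximate distinct counts can be recovered from the hll columns with hll_union_agg and hll_cardinality (a sketch, not executed here; table and column names as in the view above):

```python
# Sketch: estimate distinct posts/users per year and topic_group.
# Assumes a hll-enabled Postgres connection `db_conn` as created above.
sql_query = """
SELECT year,
       topic_group,
       hll_cardinality(hll_union_agg(post_hll))::int AS post_count_est,
       hll_cardinality(hll_union_agg(user_hll))::int AS user_count_est
FROM mviews.reddit_all_months
GROUP BY year, topic_group
ORDER BY year, topic_group;
"""
# df_counts = db_conn.query(sql_query)
```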

Repeat for comments. Here, we need to join (with the submission), as otherwise not all comments will have year and month values populated.

TODO: Some entries appear to be malformed (null for year, month and reaction_guid/post_guid). These are excluded below.

In [47]:
%%time
sql_query = f"""
CREATE MATERIALIZED VIEW mviews.reddit_comments_all_months AS
    SELECT 
        month,
        year,
        hll_add_agg((hll_hash_text(reaction_guid))) AS "post_hll",
        hll_add_agg((hll_hash_text(user_guid))) AS "user_hll",
        hll_add_agg((hll_hash_text(post_guid))) AS "referenced_post_hll",
        topic_group
    FROM mviews.reddit_comments_nationalparks
    WHERE year IS NOT NULL
    GROUP BY year,month,topic_group
    ORDER BY
    year ASC,
    month ASC,
    topic_group;
    """
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
SELECT 2220
CPU times: user 0 ns, sys: 1.41 ms, total: 1.41 ms
Wall time: 1.12 s
In [48]:
db_connection_hll.commit()

Export data as CSV

Save hll data to CSV. The following columns are available from mviews.reddit_all_months:

  • year - distinct year
  • month - distinct month
  • post_hll - approximate post guids stored as hll set
  • user_hll - approximate user guids stored as hll set
  • topic_group - the national park topic group
In [49]:
sql_query = f"""
    SELECT  year,
            month,
            post_hll,
            user_hll,
            topic_group
    FROM mviews.reddit_all_months;
    """
df = db_conn.query(sql_query)
# use type int instead of float
for col in ["year", "month"]:
    df[col] = df[col].astype(int)
In [50]:
df.head()
Out[50]:
year month post_hll user_hll topic_group
0 2010 10 \x138b40c8a2 \x138b40c303 everglades
1 2011 2 \x138b40fd82 \x138b40c303 everglades
2 2011 2 \x138b40014238a147c16a8185618a648c83d3c1efc2 \x138b4018214861750197a1c303dee1e021 yosemite
3 2011 3 \x138b409481 \x138b405e44 yosemite
4 2011 4 \x138b4020c177c27be294a3 \x138b4076a197a1c303 yosemite
In [51]:
usecols = ["year", "month", "post_hll", "user_hll", "topic_group"]
df.to_csv(
    OUTPUT / "reddit_all_months.csv",
    mode='w', columns=usecols,
    index=False, header=True)
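The exported file can later be read back with pandas (a minimal sketch; the hll columns remain opaque strings until they are sent to a hll-enabled Postgres instance for cardinality estimation):

```python
import pandas as pd

def load_hll_csv(path_or_buf):
    """Read an exported hll CSV; hll columns stay opaque hex strings."""
    return pd.read_csv(
        path_or_buf, dtype={"post_hll": str, "user_hll": str})
```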

Repeat for comments:

In [52]:
sql_query = f"""
    SELECT  year,
            month,
            post_hll,
            user_hll,
            referenced_post_hll,
            topic_group
    FROM mviews.reddit_comments_all_months;
    """
df = db_conn.query(sql_query)
# use type int instead of float
for col in ["year", "month"]:
    df[col] = df[col].fillna(-1).astype(int)
usecols = ["year", "month", "post_hll", "user_hll", "referenced_post_hll", "topic_group"]
df.to_csv(
    OUTPUT / "reddit_comments_all_months.csv",
    mode='w', columns=usecols,
    index=False, header=True)

Create notebook HTML

In [1]:
!jupyter nbconvert --to html_toc \
    --output-dir=../resources/html/ ./04_reddit_privacy.ipynb \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False >&- 2>&-

IOER RDC Jupyter Base Template v0.10.0