Digital Landscape Traces: Inferring User Origin (Local vs. Tourist)

Alexander Dunkel, Leibniz Institute of Ecological Urban and Regional Development,
Transformative Capacities & Research Data Centre (IÖR-FDZ)


Last updated: Sep-29-2025, Carto-Lab Docker Version 1.0.1

This notebook documents the data preparation process for differentiating between "locals" and "tourists" based on their social media activity. The primary goal is to determine a probable home location for each user who has posted from within Germany. This allows us to classify them and subsequently analyze how these different groups interact with and perceive German landscapes.

The workflow involves several key stages:

  1. Identify Active Users: We begin by isolating the 6.2 million users who have posted at least once from Germany out of a global dataset of 127.6 million users.
  2. Gather Global Activity: For this cohort of users, we retrieve the locations of all their posts worldwide. A user's complete spatial footprint is necessary to reliably infer a home location.
  3. Home Location Inference: We then process this location history. By clustering the spatiotemporal data (specifically, "user-days," which represent a unique day a user was active in a location), we can identify the most probable home location for each individual.
  4. Classification: Based on whether a user's inferred home is inside or outside Germany, we classify them as a "local" or a "tourist." Users with insufficient data or ambiguous patterns are classified as "undetermined."
  5. Privacy-Preserving Export: Finally, the classified user data is aggregated and exported. To protect user privacy while enabling reproducibility, we use privacy-enhancing techniques, such as HyperLogLog (HLL), to create a dataset that supports aggregated mapping and analysis without exposing individual-level data.

Prepare environment

To run this notebook, you have two options:

1. Create an environment with the packages and versions shown in the following cell.

As a starting point, you may use the latest conda environment_default.yml from our Carto-Lab Docker container.

2. If Docker is available to you, we suggest using the Carto-Lab Docker Container.

Clone the repository and edit your .env values to point to the repository where this notebook can be found, e.g.:

git clone https://github.com/ioer-dresden/carto-lab-docker.git
cd carto-lab-docker
cp .env.example .env
nano .env
## Enter:
# JUPYTER_NOTEBOOKS=~/notebooks/ephemeral_events
# TAG=v0.12.3
docker network create lbsn-network
docker compose pull && docker compose up -d
List of package versions used in this notebook:

package          version
python           3.9.23
xarray           2024.7.0
ipywidgets       8.1.7
shapely          2.0.7
mapclassify      2.8.1
memory_profiler  0.61.0
python-dotenv    1.1.1
numpy            2.0.2
bokeh            3.4.2
matplotlib-venn  1.1.2
matplotlib       3.9.4
pandas           2.3.1
hvplot           0.11.3
colorcet         3.1.0
holoviews        1.20.2
geopandas        1.0.1
geoviews         1.12.0

Load dependencies:

In [6]:
import os, sys
from pathlib import Path
import psycopg2
import geopandas as gp
import pandas as pd
import matplotlib.pyplot as plt
from typing import List, Tuple, Dict, Optional
from IPython.display import clear_output, display, HTML

To reduce the code shown in this notebook, some helper methods are made available in a separate file.

Load helper module from ../py/modules/base/tools.py.

In [7]:
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules.base import tools

Activate autoreload of changed python files:

In [8]:
%load_ext autoreload
%autoreload 2

Parameters

Define initial parameters that affect processing

A 5-character geohash is approximately a 5 km × 5 km grid; the sketch below illustrates how precision maps to cell size. That is precise enough to distinguish neighborhoods but not individual buildings. Reduce the precision to increase privacy; increase it to improve granularity.

In [9]:
# Initial accuracy to reduce granularity 
# of spatial data on hll convert
GEOHASH_PRECISION = 5
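
For intuition, the cell dimensions for a given precision can be derived from first principles. This is a minimal pure-Python sketch (no geohash library required); the kilometre figures assume cells near the equator:

import math

def geohash_cell_size(precision: int):
    """Approximate geohash cell size in degrees (lon, lat).
    A geohash of n characters encodes 5*n bits, interleaved between
    longitude (even bit positions) and latitude (odd positions)."""
    bits = precision * 5
    lon_bits = math.ceil(bits / 2)
    lat_bits = bits // 2
    return 360 / 2 ** lon_bits, 180 / 2 ** lat_bits

for p in [4, 5, 6]:
    d_lon, d_lat = geohash_cell_size(p)
    # ~111 km per degree at the equator
    print(f"precision {p}: ~{d_lon * 111:.1f} km x {d_lat * 111:.1f} km")
# precision 4: ~39.0 km x 19.5 km
# precision 5: ~4.9 km x 4.9 km
# precision 6: ~1.2 km x 0.6 km
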
In [10]:
WORK_DIR = Path.cwd().parents[0] / "tmp"     # Working directory                     
OUTPUT = Path.cwd().parents[0] / "out"       # Define path to output directory (figures etc.)
In [11]:
for folder in [WORK_DIR, OUTPUT]:
    folder.mkdir(exist_ok=True)

Load environment variables from the .env dotfile:

In [12]:
from dotenv import load_dotenv
load_dotenv(
    Path.cwd().parents[0] / '.env', override=True)
Out[12]:
False
In [13]:
DB_NAME_RAWDB = os.getenv("DB_NAME_RAWDB")    # lbsn-rawdb database name
DB_HOST_RAWDB = os.getenv("DB_HOST_RAWDB")    # lbsn-rawdb host

Initial Query for DE posts

First, create a table with the geometry for Germany:
CREATE TABLE spatial.germany_boundary (name VARCHAR(255), geom GEOMETRY(Polygon, 4326));
INSERT INTO spatial.germany_boundary (name, geom) VALUES ('Germany', ST_SetSRID(ST_GeomFromText(
'SRID=4326; Polygon ((9.1433304674271767 54.87302263558621007, 9.29480994002676653 54.80168366606841346, 9.59188580908289623 54.8869844880437654, 9.89410309238382979 54.84183198146081395, 10.59215134630761668 54.51989772869728768, 11.14133007712143808 54.57629296235040783, 11.64437791630786023 54.33073223410536912, 12.16520515899094335 54.38463668923185423, 12.32216612986303517 54.57714431725651139, 13.06899846071041793 54.84973594845530442, 13.42317610499327429 54.90912276250554669, 13.6613590011170345 54.86449207854661836, 14.05716348310829744 54.61181819135651949, 14.06966560638267083 54.27763627849151362, 14.16803558587361067 54.23880270088397992, 14.24221454696379396 53.98771964604486584, 14.18527042109309377 53.91197215758904804, 14.45052538304003065 53.26225137859830738, 14.34852597989009837 53.05471971874010251, 14.1436577746865737 52.96136543546651865, 14.12288246055715035 52.83765843329013023, 14.63909584120017371 52.57299936662752771, 14.53435722180836365 52.39500777578825819, 14.57599486577049674 52.28835922389913549, 14.71568838750914665 52.23588560993300689, 14.68153710045686466 52.11665381861888591, 14.75899936762675679 52.06476543577628036, 14.59014437256479368 51.82100998772079947, 14.75750455139399264 51.66150559111019902, 14.72911862642354208 51.53143962124158861, 14.97434511713350958 51.44213090221938955, 15.04181552066978611 51.27421588035866762, 14.9793136600973753 51.07703822223624002, 14.76638305793181871 50.81933252297761783, 14.61880015620226914 50.85780455873344863, 14.65014207539650215 50.9315402808593376, 14.56403797087568819 50.91849758409625792, 14.59917754360878916 50.98711099987281159, 14.50821195321913137 51.04315459410929634, 14.30164070725288639 51.05495050765983933, 14.25854303463341921 50.98751826955253819, 14.3876350063505356 50.89908797413073671, 13.90038356464981462 50.79324026222838029, 13.85526158645308215 50.72711999918868742, 13.55215096069832725 50.71383131086633966, 13.46562195618838587 50.60232153031444113, 13.37098295180180685 50.65054589437755794, 13.19525624507969042 50.50326319983685153, 13.03213105020785179 50.50996067954179125, 12.948144730395029 50.404311645865846, 12.81939501546764859 50.4597950326146929, 12.51204793536396664 50.39725816862960528, 12.33108448895802667 50.24271404931397456, 12.33320130239820855 50.17155191349007737, 12.18445310862409769 50.32232471681226116, 12.10019899948429156 50.31832794437917755, 12.2014080343533351 50.10849193455516115, 12.54782478838330917 49.92026953740929684, 12.40074313458802635 49.75553183832374771, 12.52185046292856896 49.6865330432653991, 12.65587915330036139 49.43453810554916572, 13.02921064395610529 49.30451982223206642, 13.40289555288751444 48.9873427339800287, 13.62827657000775616 48.94930714460906529, 13.83973263295467859 48.77148568211856627, 13.82569624828329324 48.61841526565206095, 13.73062545330208195 48.51446092253360121, 13.50879208898379602 48.59022953606427109, 13.32976817266307989 48.32359206612102298, 12.75814504086099532 48.12619261719555652, 13.00253663695627893 47.85328766219589625, 12.90516623683242869 47.72360115407008152, 13.08022144406442067 47.68711627559571298, 13.04750486166005885 47.49208302571111062, 12.80384823830559071 47.54988690075992963, 12.78106185032419262 47.67414247229731927, 12.49888118252061986 47.62494554180335626, 12.44012646416501866 47.69517254637594306, 12.2551562061939876 47.67929967054340068, 12.25707210839112804 47.74283907176578623, 12.16257153215053677 47.70112119928398897, 12.20320083824810808 47.60655540780589945, 11.63551183390404731 
47.5944925556869407, 11.2728918533763931 47.39785256984026773, 10.9703287861445915 47.40002058258386342, 10.89015587599294577 47.53719340445212538, 10.45445953557913299 47.55573834235065078, 10.43647213957763142 47.38064780345342797, 10.17419188314892153 47.27020119333047887, 10.23594858525905238 47.38181776567739689, 10.09978048238338033 47.35473600246461956, 10.09136801981833287 47.45875007585539151, 9.97084547833622992 47.54568017799789459, 9.77728950610588754 47.59490289205490399, 9.56184471246353951 47.50455693005835656, 9.25583223976610725 47.65909750244782117, 8.89535179764737904 47.64909162917541607, 8.80644066260651925 47.73826225010282087, 8.79566886012986515 47.675586000193789, 8.72795487162005657 47.69268014007644751, 8.65697507693829493 47.80034710313634605, 8.40439955129829741 47.69805376287092713, 8.62888483466741718 47.65174054269192538, 8.43714312685386858 47.56708343205480105, 8.20607236264947915 47.62099989582856097, 7.67923753718264379 47.53277577987836366, 7.63407099577989356 47.5611040421179041, 7.69355335740158353 47.60053205810805821, 7.60448410715588352 47.57779110297707348, 7.52107927548820498 47.66384788966843189, 7.6221212397035174 47.97279107394894027, 7.5769816401905814 48.11905638794200968, 7.74340888840953312 48.32546477927621709, 7.80214138631595233 48.58884316637909251, 8.23263282034844224 48.96657144521606142, 7.93564809218577238 49.05765511479177121, 7.63120134636511693 49.0548857056109, 7.44564667835157934 49.18415102884694079, 7.29317690469042645 49.11490732861039987, 7.05244591624608574 49.11275784369956909, 6.93810712304531307 49.22243763608810241, 6.73839761345914212 49.16359779690088772, 6.55081584387613702 49.42537952404586576, 6.36703072508794321 49.4694830820786251, 6.35758927802282869 49.57347554655927979, 6.52992601637259895 49.80704757898263324, 6.32226166784175803 49.8392608441127436, 6.11309021923830187 50.06076301060448941, 6.17568740345097922 50.23541597716564411, 6.40502825514988672 50.32330869294190734, 6.34221872841163758 50.38000552503501694, 6.37487303006633965 50.45124892989423415, 6.1984988852154288 50.52836615777204088, 6.26559804771176232 50.64270602206534022, 5.97480474174545861 50.7979707384588437, 6.09389718580655426 50.92131540468221829, 5.89752511249400868 50.97501346989577087, 5.86687398774438407 51.04838894699236107, 6.17541526680116704 51.15847977272341041, 6.08215105872443473 51.17179911623628641, 6.07213427667556971 51.24245499220415923, 6.22612991836109586 51.36051764507328699, 6.21230003242061457 51.51361357530629448, 6.09123534633420149 51.60590930886758088, 6.11820963876385804 51.65610310637333669, 5.95510381438958802 51.73828468199252484, 5.94515322244330946 51.82451729513712735, 6.1660793929496549 51.8409024535854428, 6.10334125709340469 51.89261697229551373, 6.15511712665414734 51.90541782067595022, 6.40506434674227876 51.82740354243958336, 6.3907900554168009 51.8742190043036544, 6.72166407024070622 51.89617477214807195, 6.82851309630240166 51.96406670771244762, 6.69507834878680796 52.07026212539682319, 7.0609954894503062 52.23478240438453213, 7.07243800091305275 52.37311639220780535, 6.98785513290812332 52.46916016640858516, 6.69772000387513344 52.48634521246162876, 6.6814024699480683 52.55359869772178172, 6.76649011302833969 52.56230612083083997, 6.70973231049777041 52.62782351794106717, 6.75214101519571841 52.6482893402808827, 7.05100158975159275 52.63994356514348283, 7.08700963868392364 52.85051212403533327, 7.21693779014697157 53.0067232392861456, 7.19205451590323719 53.3143070866028097, 6.9316102596058613 
53.33411061581233525, 6.88866268113787328 53.44247621270710624, 6.68362781717007692 53.49664088127173045, 6.34585998714874222 53.72450141470858398, 6.72671131827274849 53.86254353925301075, 7.72794138558651866 53.99333519546297566, 7.53412317824211186 54.13912000258608259, 7.58474754686181996 54.31008616784242804, 7.84582089511826553 54.39292727477396738, 8.16128570027490241 54.33250950670844759, 7.93236566924031195 54.73415971779678557, 8.05011711081607473 55.09845306415763844, 8.47261129387618439 55.05409929404726199, 8.55470193184874006 54.92094757397143212, 9.1433304674271767 54.87302263558621007))'
), 4326));


To speed up subsequent queries, create a spatial index:

CREATE INDEX idx_germany_boundary_geom ON spatial.germany_boundary USING GIST (geom);
ANALYZE spatial.germany_boundary;

The original big query to select all posts for Germany is shown below. This must be executed directly via the Postgres terminal, ideally in a terminal multiplexer.

CREATE MATERIALIZED VIEW mviews.all_posts_de AS
    SELECT
        p.user_guid,
        ST_Y(p.post_latlng) AS latitude,
        ST_X(p.post_latlng) AS longitude,
        p.post_create_date,
        p.post_publish_date
    FROM
        topical.post AS p
    JOIN
        spatial.germany_boundary AS de
    ON
        ST_Intersects(p.post_latlng, de.geom);
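
For intuition, the same point-in-polygon selection can be reproduced client-side with geopandas. A minimal sketch, using a rough bounding box as a stand-in for the actual Germany polygon above:

import geopandas as gp
import pandas as pd
from shapely.geometry import box

# Rough bounding box as a stand-in for spatial.germany_boundary
de_gdf = gp.GeoDataFrame(
    {"name": ["Germany"]}, geometry=[box(5.9, 47.3, 15.0, 55.1)], crs=4326)

# Two toy posts: one in Dresden, one in Vienna
posts = gp.GeoDataFrame(
    pd.DataFrame({"user_guid": ["u1", "u2"]}),
    geometry=gp.points_from_xy([13.74, 16.37], [51.05, 48.21]), crs=4326)

# The spatial join mirrors the ST_Intersects join of the materialized view
de_posts = gp.sjoin(posts, de_gdf, predicate="intersects")
print(de_posts["user_guid"].tolist())  # ['u1'] - only the Dresden post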

To also show posts outside of Germany, as surrounding context (without classification and with reduced opacity), the following query was used:

COPY (
    WITH de_geometries AS (
        SELECT
            geom::geometry AS germany_geom
        FROM
            spatial.germany_boundary
        LIMIT 1
    ),
    expanded_bbox AS (
        SELECT
            ST_Buffer(
                (ST_Envelope(germany_geom))::geography,
                100000
            )::geometry AS buffered_geom,
            germany_geom
        FROM
            de_geometries
    )
    SELECT
        ST_Y(p.post_latlng) AS latitude,
        ST_X(p.post_latlng) AS longitude
    FROM
        topical.post AS p,
        expanded_bbox
    WHERE
        ST_Intersects(p.post_latlng, expanded_bbox.buffered_geom)
        AND NOT ST_Intersects(p.post_latlng, expanded_bbox.germany_geom)
)
TO '/pg_exp/2025-10-18_DE_outside_buffer_coords.csv'
WITH (FORMAT CSV, HEADER, ENCODING 'UTF-8');
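
The 100 km context buffer can likewise be sketched in geopandas. Here, instead of casting to geography, the envelope is buffered in a metric CRS (EPSG:3035, a European equal-area projection chosen just for this illustration):

import geopandas as gp
from shapely.geometry import box

# Rough stand-in for the Germany polygon
de = gp.GeoSeries([box(5.9, 47.3, 15.0, 55.1)], crs=4326)

# Buffer the envelope by 100 km in a metric CRS,
# mirroring ST_Buffer(ST_Envelope(geom)::geography, 100000)
buffered = de.envelope.to_crs(3035).buffer(100_000).to_crs(4326)

# Keep only the surrounding context outside the boundary itself
context = buffered.difference(de)
print(context.bounds)  # extent of the context ring around the box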

Connecting to the Databases

Our setup uses two PostgreSQL databases. The first (lbsn-rawdb) contains the original, sensitive social media data. The second (hlldb) is our local working database, where we will perform calculations and store aggregated, privacy-preserving results using the HyperLogLog (HLL) extension. This separation ensures that the raw data remains isolated and secure.

In [14]:
db_user = "postgres"
db_pass = os.getenv('POSTGRES_PASSWORD')
db_host = "hlldb"
db_port = "5432"
db_name = "hlldb"
In [15]:
db_connection_hll = psycopg2.connect(
        host=db_host,
        port=db_port,
        dbname=db_name,
        user=db_user,
        password=db_pass
)
db_conn_hll = tools.DbConn(db_connection_hll)
cur_hll = db_connection_hll.cursor()
cur_hll.execute("SELECT 1;")
print(cur_hll.statusmessage)
SELECT 1

Simplify query access:

In [16]:
db_conn = tools.DbConn(db_connection_hll)
db_conn.query("SELECT 1;")
Out[16]:
?column?
0 1
If any SQL statement results in an error, the cursor cannot be used again. In this case, run db_connection_hll.rollback() once, to reset the cursor.
db_connection_hll.rollback()

Create Query Schema

Create a new schema called mviews and update the Postgres search_path parameter to include the new schema:

In [16]:
sql_query = """
CREATE SCHEMA IF NOT EXISTS mviews;
ALTER DATABASE hlldb
SET search_path = "$user",
                  social,
                  spatial,
                  temporal,
                  topical,
                  interlinkage,
                  extensions,
                  mviews;"""

Since the above query will not return any result, we'll directly use the psycopg2 cursor object:

In [17]:
cur = db_connection_hll.cursor()
cur.execute(sql_query)
print(cur.statusmessage)
ALTER DATABASE

To access the raw post data without copying it into our working database, we use a PostgreSQL feature called a Foreign Data Wrapper (FDW). This creates a secure, read-only link from hlldb to the tables in lbsn-rawdb. This approach is highly efficient for large datasets and maintains a clear separation between raw and processed data.

On hlldb, install postgres_fdw extension:

In [18]:
sql_query = """
CREATE EXTENSION IF NOT EXISTS postgres_fdw SCHEMA extensions;
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
CREATE EXTENSION
In [19]:
MVIEWS_REF = "all_posts_de"

Check if foreign table has been imported already:

In [20]:
sql_query = f"""
SELECT EXISTS (
   SELECT FROM information_schema.tables 
   WHERE  table_schema = 'mviews'
   AND    table_name   = '{MVIEWS_REF}'
   );
"""
result = db_conn_hll.query(sql_query)
In [21]:
result["exists"][0]
Out[21]:
np.True_

Conditionally load the password - this only needs to be done once, if the server hasn't been added before.

Enter the password for the pg user lbsn_reader:

In [22]:
USER_KEY = None
if not result["exists"][0]:
    import getpass
    USER_KEY = getpass.getpass()

Create Foreign Server connection to rawdb, on hlldb:

In [23]:
sql_query = f"""
CREATE SERVER IF NOT EXISTS lbsnraw 
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (
    host '{DB_HOST_RAWDB}',
    dbname '{DB_NAME_RAWDB}',
    port '5432',
    keepalives '1',
    keepalives_idle '30',
    keepalives_interval '10',
    keepalives_count '5',
    fetch_size '500000');
CREATE USER MAPPING IF NOT EXISTS for postgres
    SERVER lbsnraw 
    OPTIONS (user 'lbsn_reader', password '{USER_KEY}');
"""
if not result["exists"][0]:
    cur_hll.execute(sql_query)
    print(cur_hll.statusmessage)

Optional cleanup step (deactivated cell):

cur_hll.execute("DROP MATERIALIZED VIEW IF EXISTS mviews.all_posts_de")

If the foreign table has not been imported yet (not result["exists"][0]), import the foreign table definition on the hlldb.

In [24]:
sql_query = f"""
IMPORT FOREIGN SCHEMA mviews
    LIMIT TO (
        {MVIEWS_REF})
    FROM SERVER lbsnraw 
    INTO mviews;
"""
# only import table 
# if it hasn't been imported already
if not result["exists"][0]:
    cur_hll.execute(sql_query)
    print(cur_hll.statusmessage)

Test the imported foreign table:

In [25]:
db_conn.query(f"SELECT * FROM mviews.{MVIEWS_REF} LIMIT 10;")
Out[25]:
origin_id post_guid post_latlng user_guid post_create_date post_publish_date post_thumbnail_url post_views_count post_like_count post_url hashtags emoji post_title post_body post_geoaccuracy post_comment_count post_type post_filter place_guid
0 3 VHk0t41dPrAA83yDvBz2JxFg/u1PI0D0xjJXjOAE/hw 0101000020E610000025C1866DACFA2340C477A4F2DDC6... NkxKboB5Al0wGwYzuyT0+xepLnCeeBY42hgFRp3H8rY None 2016-06-17 02:09:28 None None None https://twitter.com/170210070/statuses/7436265... [alsterhaus, germany, hamburg, mutterland, rai... [] None A rainyday in Hamburg. \n#mutterland #alsterha... latlng None text None None
1 3 MSW12kCKg9YSsKPz1+rgnEvnqkRK/8ecILl40cK54Rs 0101000020E6100000850A0E2F88E82A40810A47904A3B... 1fdkAMexfLycm8IAo0Up3mD4HVt6eP1Q5aCPayJkr/E None 2016-06-17 02:11:28 None None None https://twitter.com/121565208/statuses/7436270... [] [] None I'm at Dreieck Neukölln (25) (1) in Berlin htt... latlng None text None None
2 3 Lv1+fUXS7wBqHzxk+Psx3ngqenrSFYQIPyWT2bytPCY 0101000020E6100000384888F2052D2440FB230C0396C6... Dr5jwbDjoxk1JaDChxL7OlS7bP2d9+O3YMmMNOIGn5E None 2016-06-17 02:14:42 None None None https://twitter.com/743748012/statuses/7436279... [] [😂, 😭] None 😂😭😭😭😭😭😭 Yo People Are Really Dumb Beloved 😭😭😭😭😭 latlng None text None None
3 3 zDzR2yzUqpL7yxByd6c1nis7kunYAlfw1AqXciKUmQo 0101000020E6100000A88E554ACFFC2340B7D3D68860C6... MLdIkxXo7NI1Bf02L9eDK2DoHvRQfB9blo9Faqmpccg None 2016-06-17 02:15:27 None None None https://twitter.com/2834844017/statuses/743628... [hamburg] [] None Niederschlag (Intensität 40%, Fläche 11%) - me... latlng None image None None
4 3 wieoI7kek0r3GBKtnXo2a63poRA5f6e/pYl63XdbL9M 0101000020E6100000D11ED4EA50F62A4083E500A8FF3C... 1fdkAMexfLycm8IAo0Up3mD4HVt6eP1Q5aCPayJkr/E None 2016-06-17 02:16:38 None None None https://twitter.com/121565208/statuses/7436283... [] [] None I'm at Plänterwald in Berlin https://t.co/cyve... latlng None text None None
5 3 jbDI9Sw+CNjUGK+37LZNEt2eR0QrgCdchOCPY+CYBS8 0101000020E6100000C66D3480B70024406666666666C6... 7CsievJjbgYtZNiS+t1SPOW1GbD3/ynJckihYfJDXJ4 None 2016-06-17 02:18:54 None None None https://twitter.com/122542309/statuses/7436289... [BetAwards, IChooseBreezy, cb] [] None June 7: Chris Brown at Hamburger Barclaycard A... latlng None text None None
6 3 omEopmnZiI+HP0XekR5V11ONYU2gVcarRcxo/j7Ql7Q 0101000020E6100000C66D3480B70024406666666666C6... 7CsievJjbgYtZNiS+t1SPOW1GbD3/ynJckihYfJDXJ4 None 2016-06-17 02:19:49 None None None https://twitter.com/122542309/statuses/7436291... [BetAwards, IChooseBreezy, cb] [] None June 7: Chris Brown at Hamburger Barclaycard A... latlng None text None None
7 3 zXpQJbOqT8nq5TicVqxKQP/ajPU0S6ulZVABvdwbrUI 0101000020E61000001B2FDD2406C12A4099BB96900F42... +v/H1/hOSfiCrs5J5BKINpCzYiJpx3+WZEftNXfHxLk None 2016-06-17 02:20:17 None None None https://twitter.com/1336218432/statuses/743629... [Frauentausch, GERPOL, JoCox, Lanz, rawtele5, ... [] None 1. #GERPOL\n2. #Lanz\n3. #JoCox\n4. #rawtele5\... latlng None text None None
8 3 KR0WllQh7UEWwYQfjYBUfIpvspzSPuv/vxVKZiENIdE 0101000020E6100000C66D3480B70024406666666666C6... 7CsievJjbgYtZNiS+t1SPOW1GbD3/ynJckihYfJDXJ4 None 2016-06-17 02:20:29 None None None https://twitter.com/122542309/statuses/7436293... [BetAwards, IChooseBreezy, cb] [] None June 7: Chris Brown at Hamburger Barclaycard A... latlng None text None None
9 3 0OjjGVma2W6EnjVtAjsUnAd2tfv0dlq7kGgpZrwUgNs 0101000020E610000069006F81042D24400C8FFD2C96C6... Dr5jwbDjoxk1JaDChxL7OlS7bP2d9+O3YMmMNOIGn5E None 2016-06-17 02:20:56 None None None https://twitter.com/743748012/statuses/7436294... [] [😂, 😭] None People Really Retarded And Google Is Really Fr... latlng None text None None

If you see a permission denied for materialized view ..., grant the rights first:

GRANT SELECT ON ALL TABLES IN SCHEMA mviews TO lbsn_reader;

Commit changes to hlldb (deactivated, enable manually after check):

db_connection_hll.commit()

Preview samples

  • 1 Instagram
  • 2 Flickr
  • 3 Twitter
  • 4 Facebook
  • 5 Foursquare
  • 8 Reddit
  • 23 iNaturalist
In [26]:
cols = "origin_id, post_body, hashtags, emoji"
samples = db_conn.query(
    f"""
    (SELECT {cols} FROM mviews.{MVIEWS_REF} WHERE origin_id = 1 LIMIT 10)
    UNION
    (SELECT {cols} FROM mviews.{MVIEWS_REF} WHERE origin_id = 2 LIMIT 10)
    UNION
    (SELECT {cols} FROM mviews.{MVIEWS_REF} WHERE origin_id = 3 LIMIT 10)
    UNION
    (SELECT {cols} FROM mviews.{MVIEWS_REF} WHERE origin_id = 23 LIMIT 10);
    """)
In [27]:
from pandas import option_context

with option_context('display.max_colwidth', 400):
    display(samples.head(40))
origin_id post_body hashtags emoji
0 1 So unglaublich schön, wenn man den Strand für sich alleine hat. Der frühe Vogel und so...#meer #strand #morgensamstrand #düne #aufmdarß #kaltaberschön #natur #amliebstenbarfuß #meeresluft #brise #küstenliebe #erholungpur #gleicherstmalkaffee [meer, strand, morgensamstrand, düne, aufmdarß, kaltaberschön, natur, amliebstenbarfuß, meeresluft, brise, küstenliebe, erholungpur, gleicherstmalkaffee] []
1 3 I'm at Steintor in Bremen https://t.co/q1yCZXWamH [] []
2 23 None [] [🐦]
3 3 I'm at Alt-Berliner Biersalon in Berlin https://t.co/RtwZ5kzaB1 [] []
4 2 None [balticsea, bodden, canoneos6d, dars, darss, de, deu, deutschland, ef24105mmf4lisusm, europa, europe, fischlanddarszingst, flipri, fok, germany, img9852, mecklenburgvorpommern, meer, mv, nationalparkvorpommerscheboddenlandschaft, ocean, ostsee, ozean, prerow, sea, strand, zingst] []
5 23 None [] [🕷]
6 3 I'm at El Pikosito in Hamburg https://t.co/W1tsxWOtUN https://t.co/VvC5B0JX03 [] []
7 1 I’ll see you soon again my loved #balticsea 😊 [balticsea] [😊]
8 1 Daily gear - Red Wing Irish Setter model 9874 Moc Toes in wonderful Black Klondike leather, taking a beating the sand.The boots are topped by Indigofera Jeans model Clint STPF in unSanforized Fabric Nr.2, also gathering sand in the cuffs, as a little memory of German beaches. ❤❤,#redwing #redwingmoctoe #redwing9874 #moctoes #brund #indigoferajeans #shrinktofit #hepcatstore #camouflage #german... [redwing, redwingmoctoe, redwing, moctoes, brund, indigoferajeans, shrinktofit, hepcatstore, camouflage, germany, winterbeach] [❤, ❤, 🇩, 🇪]
9 23 None [] [🐛]
10 2 Abenddmmerung am Darer Bodden bei Fuhlendorf (Mecklenburg).~Dawning at Darer Bodden near Fuhlendorf (Mecklenburg). [landschaft, boot, abenddämmerung, fuhlendorf, mecklenburg, deutschland, schilf, see, bodden, sonnenuntergang] []
11 2 None [bank, fuhlendorf, mecklenburgvorpommern, ostsee, sonnenuntergang, urlaub2007] []
12 23 None [] [🌱]
13 1 Am Nordstrand,.,📷 = E-M1+7-14 f2.8,.,Wünsche euch allen einen schönen Abend 🙋,Wish you all a great evening 🙋,.,#blacknwhite_perfection #pocket_bnw #blackandwhite #blackandwhiteonly #only_blackandwhite #blackandwhite_photographers #bnw_fanatics #bnwsouls #bnw_greatshots #bnw_rose #bnw_planet #bnw_international #noir_shots #show_us_bw #amateurs_bnw #getolympus #olympuskameras #olympuscamera #sha... [blacknwhite, pocket, blackandwhite, blackandwhiteonly, only, blackandwhite, bnw, bnwsouls, bnw, bnw, bnw, bnw, noir, show, amateurs, getolympus, olympuskameras, olympuscamera, sharegermany, kodakmomentmentsde, meckpomm, mecklenburgvorpommern, pocket, microfournerds, mft, mftphotos, batpixs, bns, batpixs, germanvision] [📷, 🙋, 🙋]
14 1 Prerow [] []
15 3 Da isser. Der Silvestermann https://t.co/WnsnXlJ7z3 [] []
16 1 Urlauberschiff Neptun, das schwimmende Ferienhaus für bis zu 5 Personen im.Ostseebad Prerow [] []
17 3 I'm at Espresso Perfetto Caffè Bar in Düsseldorf, Nordrhein-Westfalen https://t.co/XSP1XrKziP https://t.co/0eb8c5KjRs [] []
18 2 None [wieckamdars] []
19 23 None [] [🦟]
20 3 THREAD: Ich liebe Waschmittelwerbung. In kaum einem anderen Gattungsmarketing wird das fast immer gleiche, so eindrucksvoll als das ewig neue verkauft. Zum Beispiel Persil. Persil war in der Werbung immer, „das beste Persil, dass es je gab.“ [] []
21 1 ...wenn sich der Tag dem Ende neigt. ,#meer #strand #abendrot #ostsee #küste #meeresluft #auszeit #einfachschön #daslebengeniessen [meer, strand, abendrot, ostsee, küste, meeresluft, auszeit, einfachschön, daslebengeniessen] []
22 1 Der Tatzenabdruck eines #ostseebärs. Nehm ich zumindest so an, vielleicht war es aber auch nur ein Hund 🤔🐶🐻🌊🌞 [ostseebärs] [🐶, 🐻, 🌊, 🌞]
23 23 None [] [🌳]
24 2 am Strand bei Prerow [2470mm, balticsea, beach, buhne, clouds, d750, dars, darss, deutschland, farben, germany, goldenestunde, himmel, meer, mohle, morgen, nikon, nordufer, ostsee, prerow, sky, sonne, sonnenaufgang, sunrise, tamron, teiltonung, wellen, wellenbrecher, wolken] []
25 2 None [strand, dars, meer, wasser, a, leuchtturm, wieck] []
26 1 Beachselfie. #tbt #throwbackthursday #handinhand #coupleshot #couplegoals #meandmyman #beachselfie #sundowner #sandindenschuhen #aufnachmv @aufnachmv #fischlanddarsszingst @fischlanddarsszingst #urlaubindeutschland #urlaubingermany #pocket_germany @germanytourism #prerowstrand #prerow #perspective #ig_color #tv_pointofview #tv_perspective #ig_perspective #canoneos70d #imkesfotoperlen [tbt, throwbackthursday, handinhand, coupleshot, couplegoals, meandmyman, beachselfie, sundowner, sandindenschuhen, aufnachmv, fischlanddarsszingst, urlaubindeutschland, urlaubingermany, pocket, prerowstrand, prerow, perspective, ig, tv, tv, ig, canoneos, imkesfotoperlen] []
27 3 Ich komme wieder! #zut #ultratrailrun #grainau #basetrail… https://t.co/8PKwmbkCj2 [basetrail, grainau, ultratrailrun, zut] []
28 3 Ein Meer, eine Frisur. https://t.co/qJbYXx1yKt [] []
29 2 None [strand, sandstrand, sea, küste, sun, weather, sand, partlycloudy, natur, rostock, 2016, meer, germany, deutschland, ufer, outdoor, cloudy, landschaft, mecklenburg, mecklenburgvorpommern, wetter, landscape, cyclon, clouds, prerow, seebrücke, nature, beach, ostsee, meckelnburg] []
30 2 None [fuhlendorf, mecklenburgvorpommern, ostsee, sonnenuntergang, urlaub2007] []
31 1 Füsse im Sand ,#urlaub #ferien #strand #sand #ostsee #dünen #ferienhaus #hausboot #urlauberschiff #störtebeker #fischlanddarsszingst #prerow #reisen #travelling #holiday [urlaub, ferien, strand, sand, ostsee, dünen, ferienhaus, hausboot, urlauberschiff, störtebeker, fischlanddarsszingst, prerow, reisen, travelling, holiday] []
32 3 Eis essen, Wetter ist wieder schön (@ Müritzterasse Hotel & Restaurant) https://t.co/qFfqcgWDeN https://t.co/YzRG7ItCdZ [] []
33 23 None [] [🦋]
34 3 Just posted a photo @ Wickede (Dortmund) https://t.co/BkFZoSXF7S [] []
35 2 None [] []
36 23 None [] [🦎]
37 2 Got loads of pictures to upload and there is still one week of holiday left. It's just a great place for taking pictures! [canon, detail, eos, sun, pov, bw, cold, baltic, color, beauty, nd, coldtemperature, blue, filter, harbour, mood, colour, shadow, photographer, wood, angle, details, long, germany, wide, pier, longexposure, evening, purple, lighting, zingst, sky, harbor, colors, shadows, light, photography, composition, exposure, focus, atmosphere, 1000d, water, sundown, beautiful, balticsea] []

Extract all user posts globally for those users who have been active in Germany

Working with user data to extract home locations is challenging, due to privacy conflicts. What we want is an optimal tradeoff between acceptable granularity of results (estimated home location) and the coarsest possible granularity of location information, to prevent compromising user privacy. The output only needs to be a three-fold classifier for each user who has been active in Germany:

  • 1 - tourist (home location outside of Germany)
  • 2 - local (home location inside Germany)
  • 3 - both (either not enough data available or ambiguous results)

We can store this output in a HLL shard or a raw list of user IDs and classifications. To reach this output, follow these steps:

  1. Extract a raw list of all unique user ids active in Germany
  2. Extract a raw list of all post or userday locations per user (use a Geohash precision of e.g. 6 to reduce spatial granularity)
     2a. Exclude all users with fewer than e.g. 50 posts and classify these IDs as "both"
  3. For all other users, identify the most probable home location based on HDBSCAN clustering; store this as a lat/lng coordinate
  4. Classify each user as either local or tourist, based on lat/lng intersection with Germany (see the sketch below)
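
A minimal sketch of the classification in step 4, assuming the home location has already been inferred; a rough bounding box stands in for the actual Germany polygon:

from typing import Optional, Tuple
from shapely.geometry import Point, box

# Toy stand-in for the Germany boundary polygon
GERMANY = box(5.9, 47.3, 15.0, 55.1)

def classify_user(home: Optional[Tuple[float, float]]) -> int:
    """Return 1 = tourist, 2 = local, 3 = both/undetermined."""
    if home is None:
        # not enough data or ambiguous clustering result
        return 3
    lat, lng = home
    return 2 if GERMANY.intersects(Point(lng, lat)) else 1

print(classify_user((51.05, 13.74)))  # 2 (Dresden -> local)
print(classify_user((48.21, 16.37)))  # 1 (Vienna -> tourist)
print(classify_user(None))            # 3 (undetermined)
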

We can make use of HyperLogLog, to a small degree. For instance, we could use HLL to aggregate userdays per GeoHash for each user, giving us an immediate aggregation step that reduces the bulk of post data a bit.

HyperLogLog parameters

The HyperLogLog extension for Postgres from Citus that we're using here contains several tweaks to optimize performance, which can affect the sensitivity of the data.

From a privacy perspective, for example, it is recommended to disable explicit mode.

Explicit mode? When explicit mode is active, full IDs will be stored for small sets. In our case, any coordinates frequented by few users (outliers) would store full user and post IDs.

To disable explicit mode:

In [28]:
db_conn_hll.query("SELECT hll_set_defaults(11, 5, 0, 1);")
Out[28]:
hll_set_defaults
0 (11,5,-1,1)

Data Pseudonymization and Generalization

Before processing, we establish two crucial functions to protect user privacy, in line with ethical research standards.

1. Cryptographic Hashing for Pseudonymization: The HyperLogLog algorithm uses MurmurHash, a non-cryptographic hash function. While fast, it can be vulnerable to certain attacks (see Desfontaines et al. 2018). To mitigate this and prevent the re-identification of users, we add a robust layer of security: we use the pgcrypto extension to apply a seeded cryptographic hash (HMAC with SHA-256) to all user identifiers.

This process converts a user's original ID into a secure, non-reversible pseudonym. The use of a secret key (or "salt") ensures that the same user ID will always produce the same pseudonym within our dataset, but this pseudonym cannot be reverse-engineered or matched against other datasets without the key.

2. Spatial Generalization with Geohash: To reduce the spatial precision of location data, we snap GPS coordinates to a coarser grid using the Geohash algorithm. This function rounds coordinates to a specified precision, effectively aggregating points within a defined area. This protects privacy by obfuscating exact locations while retaining sufficient detail for neighborhood-level or regional analysis.

References:

Desfontaines, D., Lochbihler, A., & Basin, D. (2018). Cardinality Estimators do not Preserve Privacy. 1–21.

Create the pgcrypto extension:

In [28]:
sql_query = "CREATE EXTENSION IF NOT EXISTS pgcrypto SCHEMA extensions;"
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
CREATE EXTENSION

Prepare the cryptographic hash function. The following function takes an id and a key (the seed value) and produces a new, unique hash that is returned in base64 encoding.

In [29]:
sql_query = """
/* Produce pseudonymized hash of input id with skey
 * - using skey as seed value
 * - sha256 cryptographic hash function
 * - encode in base64 to reduce length of hash
 * - remove trailing '=' from base64 string
 * - return as text
 */
CREATE OR REPLACE FUNCTION 
extensions.crypt_hash (id text, skey text)
RETURNS text
AS $$
    SELECT 
        RTRIM(
            ENCODE(
                HMAC(
                    id::bytea,
                    skey::bytea,
                    'sha256'), 
                'base64'),
            '=')
$$
LANGUAGE SQL
STRICT;
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
CREATE FUNCTION

Note: Cryptographic hashing alone will only produce pseudonymized data, since any id still relates to a single user (or post, etc.). Therefore, pseudonymization is considered a weak measure, which can be easily reversed, e.g. through rainbow tables or context lookup.

It is used here as an additional means to protect HLL sets from intersection attacks, e.g. as is discussed by Desfontaines et al. (2018).

What is a seed value? The seed value (skey) is a secret that is used, together with the hash function (e.g. sha256), to produce a unique, collision-resistant output value (the hashed id). The same ID will be converted to a different hash if the seed value is changed. Therefore, in our case, the seed value must remain the same during the entire processing of data. In this case, the seed is called a key. This key can be destroyed afterwards, if no subsequent updates are necessary.
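
For reference, the same pseudonymization can be reproduced in Python with the standard library. A sketch mirroring the SQL function above (the key is a placeholder):

import base64
import hashlib
import hmac

def crypt_hash(id_: str, skey: str) -> str:
    """Python twin of extensions.crypt_hash:
    HMAC-SHA256, base64-encoded, trailing '=' padding removed."""
    digest = hmac.new(skey.encode(), id_.encode(), hashlib.sha256).digest()
    return base64.b64encode(digest).decode().rstrip("=")

print(crypt_hash("user-1234", "placeholder-key"))
# same id + same key -> same pseudonym; a different key changes it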

In [30]:
sql_query = """
/* Reduce spatial granularity of coordinates by GeoHash
   - will keep Null Island unmodified
 */
CREATE OR REPLACE FUNCTION 
extensions.geohash_reduce (IN coord geometry, geohash integer DEFAULT 5)
RETURNS geometry
AS $$
    SELECT
        CASE WHEN ST_Y(coord) = 0 AND ST_X(coord) = 0
        THEN
            coord
        ELSE
            ST_PointFromGeoHash(ST_GeoHash(coord, geohash), geohash)
        END as "coord"    
$$
LANGUAGE SQL
STRICT;
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
CREATE FUNCTION
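
The same snapping can be illustrated in pure Python. This sketch reproduces the bit-interleaved subdivision that geohashing performs and returns the cell center, analogous to ST_PointFromGeoHash:

def geohash_cell_center(lat: float, lng: float, precision: int = 5):
    """Snap a WGS84 coordinate to the center of its geohash cell."""
    bits = precision * 5
    lat_rng, lng_rng = [-90.0, 90.0], [-180.0, 180.0]
    for i in range(bits):
        # geohash interleaves bits: even positions refine longitude
        rng, val = (lng_rng, lng) if i % 2 == 0 else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            rng[0] = mid
        else:
            rng[1] = mid
    return ((lat_rng[0] + lat_rng[1]) / 2, (lng_rng[0] + lng_rng[1]) / 2)

print(geohash_cell_center(53.5511, 9.9937))  # Hamburg -> (53.547363..., 9.997558...)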

Step 1: Extract Unique Users Active in Germany

The first step in our analysis is to create a definitive list of all users who have posted at least once from within Germany. We create a materialized view for this list, mviews.unique_user_de, which improves performance in subsequent queries.

Optional cleanup step:

sql_query = f""" DROP MATERIALIZED VIEW IF EXISTS mviews.unique_user_de; """ cur_hll.execute(sql_query) print(cur_hll.statusmessage)
In [53]:
%%time
sql_query = f"""
CREATE MATERIALIZED VIEW IF NOT EXISTS mviews.unique_user_de AS
SELECT DISTINCT t1.origin_id, t1.user_guid
FROM   mviews.all_posts_de t1;
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
SELECT 6233753
CPU times: user 8.87 ms, sys: 0 ns, total: 8.87 ms
Wall time: 4min 5s

Consistency check and validation: The exact count of 6,233,753 users above confirms our earlier abstracted HLL cardinality estimate. We can also query the count, to check again:

In [54]:
db_conn_hll.query(f"SELECT count(user_guid) FROM mviews.unique_user_de;")
Out[54]:
count
0 6233753
In [55]:
db_connection_hll.commit()

Step 2: Fetch Global Post History for German Users

With our list of Germany-active users, we now need to retrieve their entire global posting history. This is the most computationally intensive step, as it involves joining our list of ~6 million users against the main posts table containing hundreds of millions of entries.

The following query is executed directly on the lbsn-rawdb server for performance reasons. It performs several actions simultaneously:

  • It joins the unique_user_de list with the main topical.post table.
  • It applies the cryptographic hash function (extensions.crypt_hash) to the user_guid to pseudonymize the data at the source.
  • It reduces the precision of the coordinates using our geohash_reduce function.
  • It creates a userday identifier by concatenating the hashed user ID with the date (illustrated below). This unique marker is central to our home location inference, as it represents a day of activity for a specific user.
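
To make the userday marker concrete, a small self-contained illustration (guid and key are placeholders; the hash helper repeats the sketch from the pseudonymization section):

import base64, hashlib, hmac
from datetime import date

def crypt_hash(id_: str, skey: str) -> str:
    # HMAC-SHA256 -> base64 without '=' padding (see pseudonymization section)
    digest = hmac.new(skey.encode(), id_.encode(), hashlib.sha256).digest()
    return base64.b64encode(digest).decode().rstrip("=")

user_guid = "example-user-guid"  # placeholder
day = date(2016, 6, 17)
userday = crypt_hash(user_guid, "placeholder-key") + day.strftime("%Y-%m-%d")
print(userday)  # '<43-char pseudonym>2016-06-17'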

First, to speed up the calculation, we will add an index to the materialized view, since indexes are not inherited from the base table.

%%time
# optional cleanup
sql_query = f"""
DROP INDEX IF EXISTS idx_uniqueuser_userguid;
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
In [56]:
%%time
sql_query = f"""
CREATE INDEX idx_uniqueuser_userguid 
ON mviews.unique_user_de (user_guid);
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
CREATE INDEX
CPU times: user 1.05 ms, sys: 194 μs, total: 1.24 ms
Wall time: 3.97 s

Note that the big query below cannot be computed through the notebook connection: topical.post is a foreign table, and joining against ~800M rows through the fdw wrapper is impractical.

byobu
docker exec -it lbsn-rawdb /bin/bash
psql -h localhost -p 5432 -U postgres rawdb
-- Enable timing:
\timing
-- test
select * from mviews.all_posts_de limit 1;

Therefore, execute the two queries below directly on the foreign database, connected to the Postgres backend, as they may take a while to compute.

CREATE MATERIALIZED VIEW IF NOT EXISTS mviews.unique_user_de AS
SELECT DISTINCT t1.origin_id, t1.user_guid
FROM   mviews.all_posts_de t1;

CREATE INDEX idx_uniqueuser_userguid 
ON mviews.unique_user_de (user_guid);

Below is the "big" query:

CREATE MATERIALIZED VIEW IF NOT EXISTS mviews.posts_from_de_users AS
SELECT p.origin_id,
       extensions.crypt_hash(p.user_guid, '[redacted]') AS user_guid,
       ST_Y(extensions.geohash_reduce(p.post_latlng, {GEOHASH_PRECISION})) AS latitude,
       ST_X(extensions.geohash_reduce(p.post_latlng, {GEOHASH_PRECISION})) AS longitude,
       (extensions.crypt_hash(p.user_guid, '[redacted]') || to_char(COALESCE(NULLIF(p.post_create_date, ''), p.post_publish_date), 'yyyy-MM-dd')) AS userday
FROM   topical.post p
JOIN   mviews.unique_user_de u
       ON p.user_guid = u.user_guid;
> SELECT 291557025
> Time: 7977669.774 ms (02:12:57.670)

The query took about two hours and selected 291,557,025 posts (~292 million) made worldwide by users who have been active in Germany.

Step 3: Filter Data for Reliable Home Location Inference

Before we can infer home locations, we must filter the 292 million collected posts to improve the quality and reliability of our input data. There are two main filtering criteria:

  1. User Activity: Reliably inferring a home location requires a sufficient amount of data per user. Following best-practice parameter selection from previous study results (e.g., Huang et al., 2014), we set a minimum threshold of user activity. In this case, we exclude users with fewer than 30 unique userdays (days on which they posted). This removes users whose data is too sparse to yield a meaningful result.
  2. Invalid Coordinates: The dataset may contain erroneous geotags, with a common error being posts located at (0.0, 0.0), often referred to as "Null Island". Since these points provide no valid geographical information, they are excluded from the analysis.

These filtering steps are performed on the lbsn-rawdb server to create a new materialized view, posts_from_de_users_filtered. This view contains the cleaned and pseudonymized data that will serve as the basis for our home location analysis.

Import worldwide user posts via FDW

Check if already imported:

In [41]:
MVIEWS_REF_USER = 'posts_from_de_users'

sql_query = f"""
SELECT EXISTS (
   SELECT FROM information_schema.tables 
   WHERE  table_schema = 'mviews'
   AND    table_name   = '{MVIEWS_REF_USER}'
   );
"""
result = db_conn_hll.query(sql_query)
In [42]:
sql_query = f"""
IMPORT FOREIGN SCHEMA mviews
    LIMIT TO (
        {MVIEWS_REF_USER})
    FROM SERVER lbsnraw 
    INTO mviews;
"""
# only import table 
# if it hasn't been imported already
if not result["exists"][0]:
    cur_hll.execute(sql_query)
    print(cur_hll.statusmessage)
IMPORT FOREIGN SCHEMA

If you see a permission denied for materialized view ..., grant the rights first:

GRANT SELECT ON ALL TABLES IN SCHEMA mviews TO lbsn_reader;

Make a sample selection (similar to the one we used at the beginning of the notebook).

In [45]:
cols = "origin_id, user_guid, latitude, longitude, userday"
samples = db_conn_hll.query(
    f"""
    (SELECT {cols} FROM mviews.{MVIEWS_REF_USER} WHERE origin_id = 1 LIMIT 10)
    UNION
    (SELECT {cols} FROM mviews.{MVIEWS_REF_USER} WHERE origin_id = 2 LIMIT 10)
    UNION
    (SELECT {cols} FROM mviews.{MVIEWS_REF_USER} WHERE origin_id = 3 LIMIT 10)
    UNION
    (SELECT {cols} FROM mviews.{MVIEWS_REF_USER} WHERE origin_id = 23 LIMIT 10);
    """)
In [46]:
with option_context('display.max_colwidth', 400):
    display(samples.head(40))
origin_id user_guid latitude longitude userday
0 1 kMDfOJvENI1Ng2eoodUiXzpcFpIBg+9oggGHzQt2vgg 0.000000 0.000000 kMDfOJvENI1Ng2eoodUiXzpcFpIBg+9oggGHzQt2vgg2018-03-30
1 3 +4bkudodB5teCvCf/Utbv8prpoBGGn+3g/ag/m5avds 41.022949 28.674316 +4bkudodB5teCvCf/Utbv8prpoBGGn+3g/ag/m5avds2016-06-10
2 23 6su6xTcLVjs8dizW895H6DPXjTNMleiDjjMYTHxOzRI 29.992676 -99.382324 6su6xTcLVjs8dizW895H6DPXjTNMleiDjjMYTHxOzRI2020-05-10
3 1 g+rlgiPaziS2VCzxYgjWummcJOw25Jo5avqBrJqGtB8 0.000000 0.000000 g+rlgiPaziS2VCzxYgjWummcJOw25Jo5avqBrJqGtB82017-10-15
4 1 nrgLCixMp7CaSXVI7wTrAClQkPFc3nkeM00Bdu23wC8 0.000000 0.000000 nrgLCixMp7CaSXVI7wTrAClQkPFc3nkeM00Bdu23wC82017-04-06
5 2 wKm5WC6nFSu9D9kxxxyvRp2rmQLmRtKKtt9IbrkzrOo 0.000000 0.000000 wKm5WC6nFSu9D9kxxxyvRp2rmQLmRtKKtt9IbrkzrOo2013-05-14
6 3 a/5nS9vxACtbibGP+dr844V0XCvgjfaj5OFJSI7GUB0 48.273926 16.413574 a/5nS9vxACtbibGP+dr844V0XCvgjfaj5OFJSI7GUB02016-06-10
7 23 SwmG8jB1LIz9XJ9ph2jnqABjQLZLARwUcY2xmXD1ju0 63.874512 -149.040527 SwmG8jB1LIz9XJ9ph2jnqABjQLZLARwUcY2xmXD1ju02022-07-29
8 23 FGdFg9Y3VmHKRh0AaYDIV54Fud3bs6g2njF/TPZIBq0 53.635254 -2.922363 FGdFg9Y3VmHKRh0AaYDIV54Fud3bs6g2njF/TPZIBq02020-06-20
9 1 nrgLCixMp7CaSXVI7wTrAClQkPFc3nkeM00Bdu23wC8 0.000000 0.000000 nrgLCixMp7CaSXVI7wTrAClQkPFc3nkeM00Bdu23wC82018-01-04
10 23 yuZnGBpOiOcAHz4MY+hWnMtUcj7ytv1u0Uco6nbdXyM 50.954590 4.196777 yuZnGBpOiOcAHz4MY+hWnMtUcj7ytv1u0Uco6nbdXyM2022-07-17
11 23 GuanIHrPSpXXHbxWvKS5CAoDpFY8Gx0cHKHFmsB5yaI 41.857910 -73.498535 GuanIHrPSpXXHbxWvKS5CAoDpFY8Gx0cHKHFmsB5yaI2021-05-21
12 23 vqRZEVNQBizXpuuRTT1A6LJiv2TMZ1DZGQ33qJTByVU -33.420410 -70.598145 vqRZEVNQBizXpuuRTT1A6LJiv2TMZ1DZGQ33qJTByVU2020-10-16
13 3 eGqTi0qpyJ4Kd03bt+IcsZYWTaUp3V6JP2AHlDDuFDk 52.404785 12.546387 eGqTi0qpyJ4Kd03bt+IcsZYWTaUp3V6JP2AHlDDuFDk2016-09-06
14 3 Q8O36wmadMt5zhegNh7bIQEKH/Yaxrc5BgeQevlXm4A 40.056152 29.509277 Q8O36wmadMt5zhegNh7bIQEKH/Yaxrc5BgeQevlXm4A2016-06-10
15 2 WExKGKIuJDJUvdO4cAZ4WxY63VAheBcZHKINh+AG7YQ 0.000000 0.000000 WExKGKIuJDJUvdO4cAZ4WxY63VAheBcZHKINh+AG7YQ2014-03-07
16 1 hRa2CLizJFv4H8z3/T/h+L24rVMV5XnbqGux3b5TrbU 0.000000 0.000000 hRa2CLizJFv4H8z3/T/h+L24rVMV5XnbqGux3b5TrbU2018-03-08
17 2 wKm5WC6nFSu9D9kxxxyvRp2rmQLmRtKKtt9IbrkzrOo 0.000000 0.000000 wKm5WC6nFSu9D9kxxxyvRp2rmQLmRtKKtt9IbrkzrOo2011-09-30
18 2 2RatSf70eBDaz3NrjqqYNMbSFtKqJ83rj60Ym6TavpI 0.000000 0.000000 2RatSf70eBDaz3NrjqqYNMbSFtKqJ83rj60Ym6TavpI2011-04-17
19 2 WiT0e/H/v9ym8qJRdhfJ+03SmHfRwtQrzaef2OcG9Pc 36.716309 -4.416504 WiT0e/H/v9ym8qJRdhfJ+03SmHfRwtQrzaef2OcG9Pc2011-12-21
20 3 v9kldUBUa9e65APzWWwN9kW4SleOTzUA9FOKVaNdzLI 46.472168 30.739746 v9kldUBUa9e65APzWWwN9kW4SleOTzUA9FOKVaNdzLI2016-06-10
21 1 8jH8kpE8mohkgH1tLM6gHfvwsi8neKjK7Our6/YUB40 0.000000 0.000000 8jH8kpE8mohkgH1tLM6gHfvwsi8neKjK7Our6/YUB402018-03-05
22 1 9KuMJzUymuJP4v6yIQ3cWW5zYwo5/wSTtarrgOqFvQ8 0.000000 0.000000 9KuMJzUymuJP4v6yIQ3cWW5zYwo5/wSTtarrgOqFvQ82018-02-21
23 2 WExKGKIuJDJUvdO4cAZ4WxY63VAheBcZHKINh+AG7YQ 0.000000 0.000000 WExKGKIuJDJUvdO4cAZ4WxY63VAheBcZHKINh+AG7YQ2014-03-06
24 23 s4E5fdRmS12LoYFgclueJM56PUguVqg8by8cl6SCiM4 54.733887 158.576660 s4E5fdRmS12LoYFgclueJM56PUguVqg8by8cl6SCiM42011-04-10
25 2 wKm5WC6nFSu9D9kxxxyvRp2rmQLmRtKKtt9IbrkzrOo 0.000000 0.000000 wKm5WC6nFSu9D9kxxxyvRp2rmQLmRtKKtt9IbrkzrOo2011-10-01
26 23 s4E5fdRmS12LoYFgclueJM56PUguVqg8by8cl6SCiM4 54.733887 158.620605 s4E5fdRmS12LoYFgclueJM56PUguVqg8by8cl6SCiM42022-06-25
27 23 rlMKzQ9G9xbhEtO2rG7wXHkOp+Ym9JxDdKzTpqtdQ7c 39.177246 -76.838379 rlMKzQ9G9xbhEtO2rG7wXHkOp+Ym9JxDdKzTpqtdQ7c2022-07-30
28 1 XHnmV1NFudCFq2o8FvGf8AH1IeTC0AmkGgKIfa7eaMg 0.000000 0.000000 XHnmV1NFudCFq2o8FvGf8AH1IeTC0AmkGgKIfa7eaMg2017-09-29
29 2 gC9ugbSqRWM/0jfyxPO/V2niP78vmztSohtjricIk30 0.000000 0.000000 gC9ugbSqRWM/0jfyxPO/V2niP78vmztSohtjricIk302017-08-25
30 1 hRa2CLizJFv4H8z3/T/h+L24rVMV5XnbqGux3b5TrbU 0.000000 0.000000 hRa2CLizJFv4H8z3/T/h+L24rVMV5XnbqGux3b5TrbU2018-03-06
31 1 Gh1781CIEz+wAV9073ckp56VBOVwVbbjOB52is4L3uo 0.000000 0.000000 Gh1781CIEz+wAV9073ckp56VBOVwVbbjOB52is4L3uo2017-08-28
32 3 +4bkudodB5teCvCf/Utbv8prpoBGGn+3g/ag/m5avds 41.022949 28.981934 +4bkudodB5teCvCf/Utbv8prpoBGGn+3g/ag/m5avds2016-06-10
33 23 GuanIHrPSpXXHbxWvKS5CAoDpFY8Gx0cHKHFmsB5yaI 56.315918 -120.959473 GuanIHrPSpXXHbxWvKS5CAoDpFY8Gx0cHKHFmsB5yaI2020-08-08
34 3 jTgw7uZ/+q7lhxMxYvjOgkJchbC4oAis7DLAKIIblo8 39.484863 -8.195801 jTgw7uZ/+q7lhxMxYvjOgkJchbC4oAis7DLAKIIblo82016-06-10

If the sample was extracted without errors, commit changes to the local working DB:

In [47]:
db_connection_hll.commit()

Filter data

Before classification, let's filter the data for the categories of both or ambiguous. In line with the general parameter recommendations set out by Huang et al. (2014, 'From Where Do Tweets Originate?'), we can set a lower limit of 50 posts per user, below which it is no longer possible to unambiguously identify the user's home location. Rather than post count, we use userdays as a global measure of the total number of days on which a user has posted at least once. We reduce the threshold to 30 userdays, to account for the slightly increased aggregation granularity.

We can also see what to filter additionally in the query output above: Null Island is prominently present among the first elements selected from the database. These are erroneous posts with incorrect geotags. As they don't provide reliable geo-information for inferring home locations, we can exclude them too.
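
Both filters can be expressed compactly in pandas. A toy sketch, with a threshold of 2 userdays standing in for the 30 used on the server:

import pandas as pd

df = pd.DataFrame({
    "user_guid": ["a", "a", "a", "b", "b"],
    "latitude":  [51.0, 51.0, 0.0, 48.2, 48.2],
    "longitude": [13.7, 13.8, 0.0, 16.4, 16.4],
    "userday":   ["a2020-01-01", "a2020-01-02", "a2020-01-03",
                  "b2020-01-01", "b2020-01-01"],
})
MIN_USERDAYS = 2  # stand-in for the 30-userday threshold

# Keep users with enough distinct userdays, then drop Null Island rows
userdays = df.groupby("user_guid")["userday"].nunique()
active = userdays[userdays > MIN_USERDAYS].index
filtered = df[df["user_guid"].isin(active)
              & ~((df["latitude"] == 0.0) & (df["longitude"] == 0.0))]
print(filtered)  # only user "a" remains, without the Null Island row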

Let's first test excluding Null Island:

In [49]:
cols = "origin_id, user_guid, latitude, longitude, userday"
samples = db_conn_hll.query(
    f"""
    (SELECT {cols} FROM mviews.{MVIEWS_REF_USER} WHERE origin_id = 1 AND NOT (latitude = 0.0 AND longitude = 0.0) LIMIT 10)
    UNION
    (SELECT {cols} FROM mviews.{MVIEWS_REF_USER} WHERE origin_id = 2 AND NOT (latitude = 0.0 AND longitude = 0.0) LIMIT 10)
    UNION
    (SELECT {cols} FROM mviews.{MVIEWS_REF_USER} WHERE origin_id = 3 AND NOT (latitude = 0.0 AND longitude = 0.0) LIMIT 10)
    UNION
    (SELECT {cols} FROM mviews.{MVIEWS_REF_USER} WHERE origin_id = 23 AND NOT (latitude = 0.0 AND longitude = 0.0) LIMIT 10);
    """)
In [51]:
with option_context('display.max_colwidth', 400):
    display(samples.head(20))
origin_id user_guid latitude longitude userday
0 23 6su6xTcLVjs8dizW895H6DPXjTNMleiDjjMYTHxOzRI 54.470215 37.727051 6su6xTcLVjs8dizW895H6DPXjTNMleiDjjMYTHxOzRI2020-07-24
1 3 AQ4IzvgwM/0bxTZwcIMCfbRVlZMMnqxLEZ+nLfYXpTY 42.429199 11.140137 AQ4IzvgwM/0bxTZwcIMCfbRVlZMMnqxLEZ+nLfYXpTY2017-12-31
2 3 vT2uHYkzfeG0AYyDXDZFydwcu8/C3uEj5TvTuUDmP/c 39.836426 28.146973 vT2uHYkzfeG0AYyDXDZFydwcu8/C3uEj5TvTuUDmP/c2016-06-10
3 23 GuanIHrPSpXXHbxWvKS5CAoDpFY8Gx0cHKHFmsB5yaI 42.604980 -80.441895 GuanIHrPSpXXHbxWvKS5CAoDpFY8Gx0cHKHFmsB5yaI2017-09-19
4 3 EN1L3zUyVWSF8c1ylS85hs8Qafd6S3hm09XkxltgQN8 43.791504 11.271973 EN1L3zUyVWSF8c1ylS85hs8Qafd6S3hm09XkxltgQN82016-06-10
5 2 xHIYRwFpei5Xy4Y3twbzazIP3HuV37nqidyZZVIXEgU 42.780762 9.470215 xHIYRwFpei5Xy4Y3twbzazIP3HuV37nqidyZZVIXEgU2014-04-27
6 23 UoeC6L/PHeNPlBGKSiLGg2xZyaPTnGIMqw49iY2N+bM 42.077637 -71.125488 UoeC6L/PHeNPlBGKSiLGg2xZyaPTnGIMqw49iY2N+bM2016-04-02
7 2 KBqrqtiu48jX+qm3nCDTbtxw0Nr3qSRr8VyGBdUTqhI 42.429199 12.150879 KBqrqtiu48jX+qm3nCDTbtxw0Nr3qSRr8VyGBdUTqhI2017-04-14
8 3 RZFlO2YBRyr1+XiofHlBBmlDCq1soe/w+DkZ2B2QG6I 41.418457 2.131348 RZFlO2YBRyr1+XiofHlBBmlDCq1soe/w+DkZ2B2QG6I2017-09-30
9 2 dE4nsP0isPLsNlHvbGsKuDj5ObDwhgJcrVdzIUOWKwY 42.780762 9.470215 dE4nsP0isPLsNlHvbGsKuDj5ObDwhgJcrVdzIUOWKwY2013-09-29
10 2 2hX85AQbTuqme9Hg/ZgvHWfHA3mOP/hMKrB+MsgXfvc 42.429199 12.106934 2hX85AQbTuqme9Hg/ZgvHWfHA3mOP/hMKrB+MsgXfvc2013-07-20
11 1 LQLwusE/3u9nzB7xLShbNSt7wnho7qCe3gnavoJkhvg 42.429199 12.150879 LQLwusE/3u9nzB7xLShbNSt7wnho7qCe3gnavoJkhvg2018-03-18
12 23 HLgGwqXpnegmQ11O0Y+WyNxosxA2tPUx6qLpbmvpO3E 33.728027 -111.862793 HLgGwqXpnegmQ11O0Y+WyNxosxA2tPUx6qLpbmvpO3E2023-04-03
13 2 5pgPpRlCRLzu8BEVyd+XYFu004NzEt/I6qxxgW4VgKI 42.604980 13.864746 5pgPpRlCRLzu8BEVyd+XYFu004NzEt/I6qxxgW4VgKI2007-05-01
14 2 WaJzlT61ZRiSgYDacXp7iDbghCJhG3g1kMlOO6vzRWs 42.736816 13.425293 WaJzlT61ZRiSgYDacXp7iDbghCJhG3g1kMlOO6vzRWs2012-12-22
15 3 0pRQ/d20IdqNFLmvoQlF1ubgdZViY0zI/AvwXtc/SrU 41.374512 2.175293 0pRQ/d20IdqNFLmvoQlF1ubgdZViY0zI/AvwXtc/SrU2016-06-10
16 23 SwmG8jB1LIz9XJ9ph2jnqABjQLZLARwUcY2xmXD1ju0 37.551270 128.430176 SwmG8jB1LIz9XJ9ph2jnqABjQLZLARwUcY2xmXD1ju02020-08-16
17 3 r90or3zDNrQ8EsHDT14WDRbIpV5BPaLsGqlUYYNuNTY 53.547363 9.997559 r90or3zDNrQ8EsHDT14WDRbIpV5BPaLsGqlUYYNuNTY2016-06-10
18 1 wIcgBc5grJxwvrxnIQzfj/lWRe83CusGYcivhbEvB8k 42.429199 12.150879 wIcgBc5grJxwvrxnIQzfj/lWRe83CusGYcivhbEvB8k2017-05-07
19 3 rJgzOhWgw5ebg2cARau33jlnKIw1/Fs5m+DqVhEu+No 51.437988 7.009277 rJgzOhWgw5ebg2cARau33jlnKIw1/Fs5m+DqVhEu+No2016-06-10

This looks good.

We are now only missing the filter step that excludes all users with fewer than 30 total userdays. We will do this together with the Null Island exclusion, and again on the foreign raw server, so we do not send 292 million rows through the Foreign Data Wrapper.

First, create indexes on the original materialized view to increase query speed:

-- For user-level lookups
CREATE INDEX idx_posts_from_de_users_userguid
ON mviews.posts_from_de_users (user_guid);

-- For userday-based queries
CREATE INDEX idx_posts_from_de_users_userday
ON mviews.posts_from_de_users (userday);

-- Optional for spatial queries
CREATE INDEX idx_posts_from_de_users_lat_lon
ON mviews.posts_from_de_users (latitude, longitude);

Then, create a filtered mview containing only the valid post data required to infer user home locations. We split this into two distinct queries, to speed up computation and reduce the complexity of the SQL.

First, group by distinct userdays and select only the userday column in the output. We can later extract the hashed user_guid as a substring from the result.

CREATE MATERIALIZED VIEW IF NOT EXISTS mviews.distinct_user_days_de_users AS
SELECT userday
FROM   mviews.posts_from_de_users
GROUP  BY userday;
> SELECT 106540196
> Time: 96678.938 ms (01:36.679)

This step reduced the dataset from ~292 million original posts to ~106 million distinct userdays, a size reduction of roughly 64%.
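
A quick check of the reduction, based on the two counts reported above:

(1 - 106540196 / 291557025) * 100
# 63.46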

How many users will be excluded, if we filter by > 30 userdays?

WITH excluded_users AS (
    SELECT left(userday, length(userday) - 10) AS user_guid
    FROM mviews.distinct_user_days_de_users
    GROUP BY left(userday, length(userday) - 10)
    HAVING COUNT(*) <= 30
)
SELECT COUNT(*) AS number_of_excluded_users
FROM excluded_users;
>  number_of_excluded_users
> --------------------------
>                   5646069
> (1 row)
> 
> Time: 122854.215 ms (02:02.854)
In [3]:
5646069 / 6233753 * 100
Out[3]:
90.57254915297413

This will exclude about 90% of the users in our dataset. These users do not have enough activity to derive a home location. These numbers tell a very important story about the nature of social media data (see the long-tail or power-law distribution).

Lastly, run the query below to create a new mview, selecting active users with > 30 distinct userdays and filtering out Null Island.

CREATE MATERIALIZED VIEW IF NOT EXISTS mviews.posts_from_de_users_filtered AS
WITH active_users AS (
    SELECT left(userday, length(userday) - 10) AS user_guid
    FROM   mviews.distinct_user_days_de_users
    GROUP  BY left(userday, length(userday) - 10)
    HAVING COUNT(*) > 30
)
SELECT p.*
FROM   mviews.posts_from_de_users p
JOIN   active_users a
       ON p.user_guid = a.user_guid
WHERE  NOT (p.latitude = 0.0 AND p.longitude = 0.0);
SELECT 248294113
Time: 704626.323 ms (11:44.626)

This step filtered the dataset to ~248 million valid posts (excluding Null Island and low-activity users), down from the original ~292 million. Despite the large scale, the query completed in under 12 minutes.

Import the filtered view into our local working database.

In [66]:
MVIEWS_REF_USER_FILTERED = 'posts_from_de_users_filtered'

sql_query = f"""
SELECT EXISTS (
   SELECT FROM information_schema.tables 
   WHERE  table_schema = 'mviews'
   AND    table_name   = '{MVIEWS_REF_USER_FILTERED}'
   );
"""
result = db_conn_hll.query(sql_query)
In [67]:
sql_query = f"""
IMPORT FOREIGN SCHEMA mviews
    LIMIT TO (
        {MVIEWS_REF_USER_FILTERED})
    FROM SERVER lbsnraw 
    INTO mviews;
"""
# only import table 
# if it hasn't been imported already
if not result["exists"][0]:
    cur_hll.execute(sql_query)
    print(cur_hll.statusmessage)
IMPORT FOREIGN SCHEMA

If you see a permission denied for materialized view ..., grant the rights first:

GRANT SELECT ON ALL TABLES IN SCHEMA mviews TO lbsn_reader;
In [68]:
cols = "origin_id, user_guid, latitude, longitude, userday"
samples = db_conn_hll.query(
    f"""
    (SELECT {cols} FROM mviews.{MVIEWS_REF_USER_FILTERED} WHERE origin_id = 1 AND NOT (latitude = 0.0 AND longitude = 0.0) LIMIT 10)
    UNION
    (SELECT {cols} FROM mviews.{MVIEWS_REF_USER_FILTERED} WHERE origin_id = 2 AND NOT (latitude = 0.0 AND longitude = 0.0) LIMIT 10)
    UNION
    (SELECT {cols} FROM mviews.{MVIEWS_REF_USER_FILTERED} WHERE origin_id = 3 AND NOT (latitude = 0.0 AND longitude = 0.0) LIMIT 10)
    UNION
    (SELECT {cols} FROM mviews.{MVIEWS_REF_USER_FILTERED} WHERE origin_id = 23 AND NOT (latitude = 0.0 AND longitude = 0.0) LIMIT 10);
    """)
In [69]:
with option_context('display.max_colwidth', 400):
    display(samples.head(40))
origin_id user_guid latitude longitude userday
0 23 6su6xTcLVjs8dizW895H6DPXjTNMleiDjjMYTHxOzRI 54.470215 36.188965 6su6xTcLVjs8dizW895H6DPXjTNMleiDjjMYTHxOzRI2020-06-27
1 1 sgr84Ax2dPOzDsMEnVD5OEVsF5B7nqbzOk81Bxi8/ds 42.341309 -71.037598 sgr84Ax2dPOzDsMEnVD5OEVsF5B7nqbzOk81Bxi8/ds2014-09-13
2 23 6su6xTcLVjs8dizW895H6DPXjTNMleiDjjMYTHxOzRI 56.403809 43.835449 6su6xTcLVjs8dizW895H6DPXjTNMleiDjjMYTHxOzRI2020-09-12
3 23 kw9VFBn4p/OwH4Xl7qJCEkP+jfXckt6cMYE2Ss7HTPk 41.110840 -74.245605 kw9VFBn4p/OwH4Xl7qJCEkP+jfXckt6cMYE2Ss7HTPk2019-10-01
4 3 JxRpnA+K3soeKCprx++GZDGDh2VQCGZXBR+uw/j7JGQ 48.845215 2.351074 JxRpnA+K3soeKCprx++GZDGDh2VQCGZXBR+uw/j7JGQ2016-06-17
5 23 6su6xTcLVjs8dizW895H6DPXjTNMleiDjjMYTHxOzRI 57.722168 59.875488 6su6xTcLVjs8dizW895H6DPXjTNMleiDjjMYTHxOzRI2020-06-19
6 23 kw9VFBn4p/OwH4Xl7qJCEkP+jfXckt6cMYE2Ss7HTPk 40.495605 -73.806152 kw9VFBn4p/OwH4Xl7qJCEkP+jfXckt6cMYE2Ss7HTPk2020-10-08
7 3 KwZVnrdOZGapvUKnK3j92efAFivvzYMXvHGYEJ/X3ag 38.474121 27.224121 KwZVnrdOZGapvUKnK3j92efAFivvzYMXvHGYEJ/X3ag2016-06-17
8 2 Ucptu7IM4imYZWol4hgBvCc0WYN1ZQkn3Uijusx9eqw 42.429199 -71.213379 Ucptu7IM4imYZWol4hgBvCc0WYN1ZQkn3Uijusx9eqw2009-09-13
9 3 3e9FlL+226y+XtNIZeXMAbNKaeZmOGbJfgtbsFTdP70 50.822754 3.493652 3e9FlL+226y+XtNIZeXMAbNKaeZmOGbJfgtbsFTdP702016-06-17
10 3 0a2EKEblW0ABA3MOv5w0YihLBuiOKM6s+zgSf0eUFjc 53.635254 9.997559 0a2EKEblW0ABA3MOv5w0YihLBuiOKM6s+zgSf0eUFjc2016-06-17
11 23 u7g5IxX+uvCaNWF9qNnC0+ndebi1BrNgPlyBckas760 -33.859863 151.237793 u7g5IxX+uvCaNWF9qNnC0+ndebi1BrNgPlyBckas7602021-02-14
12 23 6su6xTcLVjs8dizW895H6DPXjTNMleiDjjMYTHxOzRI 55.832520 48.669434 6su6xTcLVjs8dizW895H6DPXjTNMleiDjjMYTHxOzRI2020-07-01
13 3 4rVXNRfMljXN3fyucJ3jLfYbjusYP3PXKKTi246lXv0 53.811035 -9.514160 4rVXNRfMljXN3fyucJ3jLfYbjusYP3PXKKTi246lXv02016-06-17
14 23 UoeC6L/PHeNPlBGKSiLGg2xZyaPTnGIMqw49iY2N+bM 33.859863 -90.109863 UoeC6L/PHeNPlBGKSiLGg2xZyaPTnGIMqw49iY2N+bM2018-10-11
15 3 1oIlXDHUkr0lJJQdSCRbVjB+VJ+uCMmHkbmuJXHOhtw 52.272949 6.174316 1oIlXDHUkr0lJJQdSCRbVjB+VJ+uCMmHkbmuJXHOhtw2016-06-17
16 23 kw9VFBn4p/OwH4Xl7qJCEkP+jfXckt6cMYE2Ss7HTPk 41.462402 -73.850098 kw9VFBn4p/OwH4Xl7qJCEkP+jfXckt6cMYE2Ss7HTPk2019-09-03
17 3 rPKZFCqfQCDJA0o8bgE5CaJu3Jj85uSgNh8JrxhKqXE 46.647949 14.436035 rPKZFCqfQCDJA0o8bgE5CaJu3Jj85uSgNh8JrxhKqXE2016-06-17
18 1 k0b0w+Tnxhg709TD0TGDeq3u9wKaiRfXgYVV5rpHy/g 42.473145 -71.213379 k0b0w+Tnxhg709TD0TGDeq3u9wKaiRfXgYVV5rpHy/g2018-03-15
19 3 dhLLtHeqS/RxgDrDzO+ad5zm77/A7hY4T3sBypZDLlY 52.448730 4.855957 dhLLtHeqS/RxgDrDzO+ad5zm77/A7hY4T3sBypZDLlY2016-06-17
20 23 wcQDol0ILvFf6UvVnASVn6+eFw9n8hMNpTUJVYMEFa0 33.508301 -82.199707 wcQDol0ILvFf6UvVnASVn6+eFw9n8hMNpTUJVYMEFa02020-06-27
21 3 kWBisviS1tQ1997rn4Q1BuxpbPFzmnaB/zmg+Ywh2Mg 41.022949 29.113770 kWBisviS1tQ1997rn4Q1BuxpbPFzmnaB/zmg+Ywh2Mg2016-06-17
22 3 4rVXNRfMljXN3fyucJ3jLfYbjusYP3PXKKTi246lXv0 54.118652 -9.162598 4rVXNRfMljXN3fyucJ3jLfYbjusYP3PXKKTi246lXv02016-06-17

Step 4: Aggregate Data for Clustering Analysis

This is the final data preparation step before the home location clustering can be performed (in a subsequent process). The goal here is to aggregate the data by user and unique location, counting the number of distinct user-days for each point. This creates a summary of how frequently each user visits specific locations.

To do this efficiently, we insert the filtered data into a new table in our local hlldb, named spatial.latlng_de_hll. The query performs the following aggregation:

  • It groups the data by origin_id, user_guid and location (latitude, longitude).
  • For each group, it uses hll_add_agg to create a HyperLogLog structure containing all the unique userday hashes. The cardinality of this HLL set yields a memory-efficient, approximate count (typically accurate to within a few percent) of the number of days a user was active at that specific location.

The resulting table provides a compact summary for each user, listing all the locations they have visited and the number of days they spent at each, ready for a density-based clustering algorithm like HDBSCAN to identify their most probable home location.
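To make the intended clustering concrete, the following minimal sketch shows how HDBSCAN could be applied to a single user's aggregated locations. This is our illustration, not the production pipeline: it assumes the hdbscan package and a hypothetical per-user DataFrame user_df with columns latitude, longitude and userdays (the HLL-estimated day counts):

import numpy as np
import hdbscan

# Hypothetical per-user input (illustration only):
# user_df columns: latitude, longitude, userdays
coords = user_df[["latitude", "longitude"]].to_numpy()
weights = user_df["userdays"].to_numpy().astype(int)

# Weight locations by user-days by repeating coordinates, so places
# with frequent, recurring activity dominate the density estimate
weighted = np.repeat(coords, weights, axis=0)

# The haversine metric expects coordinates in radians
clusterer = hdbscan.HDBSCAN(min_cluster_size=5, metric="haversine")
labels = clusterer.fit_predict(np.radians(weighted))

# The cluster with the most user-days is the home-location candidate
if (labels >= 0).any():
    top_cluster = np.bincount(labels[labels >= 0]).argmax()
    home = weighted[labels == top_cluster].mean(axis=0)
    print(f"Probable home location: lat={home[0]:.4f}, lng={home[1]:.4f}")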

In [64]:
sql_query = """
CREATE TABLE IF NOT EXISTS spatial.latlng_de_hll (
    origin_id   int,
    user_guid   text,
    userday_hll hll,
    latlng_geom geometry(Point, 4326)
);
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
CREATE TABLE
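If the CREATE TABLE statement fails with an unknown type "hll", the postgresql-hll extension has not yet been enabled in the local hlldb. This is a one-off setup step (sketch, assuming sufficient rights on hlldb):

# One-off setup: enable the postgresql-hll extension
cur_hll.execute("CREATE EXTENSION IF NOT EXISTS hll;")
print(cur_hll.statusmessage)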

Store changes to DB.

In [70]:
db_connection_hll.commit()

Aggregate HLL by user + lat/lng:

  • connect to the backend of the local hll db
  • execute the following directly in the psql shell
byobu
docker exec -it lbsn-hlldb /bin/bash
psql -h localhost -p 5432 -U postgres hlldb
-- Enable timing:
\timing
-- test
select * from mviews.posts_from_de_users_filtered limit 1;
>  origin_id |                  user_guid                  |    latitude     |    longitude    |                        userday
> -----------+---------------------------------------------+-----------------+-----------------+-------------------------------------------------------
>          1 | DxC+3nuAOXEF5fQN8ngX3AFDLuAF/algMxhp0dpbrU4 | -37.81494140625 | 144.99755859375 | DxC+3nuAOXEF5fQN8ngX3AFDLuAF/algMxhp0dpbrU42017-10-19
> (1 row)

On the raw/remote table, add an index covering the grouping columns:

CREATE INDEX idx_posts_grouping
ON posts_from_de_users_filtered (origin_id, latitude, longitude, user_guid);

On the local/hll db, set the HLL aggregation parameters (see above):

SELECT hll_set_defaults(11, 5, 0, 1);
  • The four arguments are log2m=11, regwidth=5, expthresh=0 and sparseon=1. With log2m=11 (2¹¹ = 2048 registers), the expected relative error of the cardinality estimate is about ±1.04/√2048 ≈ 2.3%.
  • expthresh=0 disables explicit mode, which is important for privacy preservation: small sets are never stored as plain lists of hashes.

Then run the aggregation query on the local/hlldb.

INSERT INTO spatial.latlng_de_hll(
            origin_id, 
            user_guid,
            userday_hll,
            latlng_geom)
SELECT  origin_id,
        user_guid,
        hll_add_agg(hll_hash_text(userday)) AS userday_hll,
        ST_SetSRID(ST_MakePoint(longitude, latitude), 4326) AS latlng_geom
FROM mviews.posts_from_de_users_filtered
GROUP BY origin_id, latitude, longitude, user_guid;
> INSERT 0 27409403
> Time: 1479633.236 ms (24:39.633)

In this final data preparation step, the raw post data is transformed into a compact summary of each user's geographic footprint. The filtered input table contained approximately 248 million rows, each representing a single post. The aggregation query collapses this into one row per unique combination of user and location (geohash cell): if an individual posted hundreds of times from the same place across different days, all those entries are now represented by a single row. The temporal detail is not lost; the number of distinct days the user was active at that spot is stored in the userday_hll structure and can be estimated with an HLL cardinality calculation. The result is the expected reduction from ~248 million to roughly 27 million rows: a clean, efficient table in which each row signifies a unique place a person has visited, ready for the clustering step that determines their home location.
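As a small illustration (our addition), the per-row user-day estimate can be retrieved with hll_cardinality:

# Illustrative query: estimate distinct user-days per user/location row
sql_query = """
SELECT user_guid,
       hll_cardinality(userday_hll) AS userdays_estimated
FROM spatial.latlng_de_hll
LIMIT 5;
"""
display(db_conn_hll.query(sql_query))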

Have a look at the resulting data structure.

In [17]:
db_conn_hll.query("SELECT * FROM spatial.latlng_de_hll LIMIT 10;")
Out[17]:
origin_id user_guid userday_hll latlng_geom
0 1 DCdJx28GmltEEolm0yQwxvY+Apsey43pLYYKNa6tPxc \x138b40e7e3 0101000020E6100000000000004CDC63C000000000987E...
1 1 HFdpiy6Zwdfo7yf5UwbKu/PGUtq9396y/zVfTly5yEg \x138b401a832d2136e2524563a1a783b741 0101000020E6100000000000004CDC63C000000000987E...
2 1 VncevvIHzngxF/O2wOJLULVqfxqRAcV9rpw6/S2VzxM \x138b40f061 0101000020E6100000000000004CDC63C000000000987E...
3 1 qEVra/yUsCCriW/qawmZ7C7Jn5/QqPuY8kisnMLT0TI \x138b40c603 0101000020E6100000000000004CDC63C000000000987E...
4 1 vjbncjQPC+6jnAJ7mpc01kiVNwP3OkRcADi/LqoEhSA \x138b402ea24d61cae2fb21 0101000020E6100000000000004CDC63C000000000987E...
5 1 N8XE75etPF/LYOYVgqDwjACW/V5sDCV37HnlesH41u8 \x138b4003018a82 0101000020E6100000000000000080963F00000000987E...
6 1 Lb1OdaArMrTjtYvDLOwe9HpacpuDHBxe6DjVWvQWDpE \x138b406142 0101000020E610000000000000545F60C000000000F878...
7 1 ZNwYBRdbVDs5s6JWqXD29e4x12QrQqJe+n3boR4/SWQ \x138b401002 0101000020E610000000000000545F60C000000000F878...
8 1 8XKEx4xN4E7jdw1L5WMxDh5xJWt4wnnT2hfWPud2lhE \x138b408566 0101000020E610000000000000985453C0000000002876...
9 1 Lrr/WqThxOK1MfS6c2s0cjxKNBEB3yAQIdMDhyhwxKc \x138b402e44 0101000020E610000000000000985453C0000000002876...

Add an index on column user_guid, so we can export the CSV sorted/grouped by user_guid (= all posts of a single user together).

In [18]:
sql_query = """
CREATE INDEX idx_latlng_de_hll_user_guid 
ON spatial.latlng_de_hll (user_guid);
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
CREATE INDEX
In [19]:
db_connection_hll.commit()

Export data as CSV

Save the HLL data to CSV. The following records are available from table spatial.latlng_de_hll:

  • distinct origin_id plus user_guid (hashed) and latitude/longitude coordinates (clear text); this is the "base" we're working with
  • userday_hll: the approximate user-days, stored as an HLL set
In [20]:
def get_latlng_fromdb(
        table_ref: str, chunk_size: int = 5000000) -> List[pd.DataFrame]:
    """Returns user/location HLL data from db in chunks

    Null Island and low-activity users were already excluded upstream.
    Important: The data will be grouped/sorted by hashed user_guid,
    so that all records of a single user are stored together.
    """
    sql = f"""
    SELECT  origin_id,
            user_guid,
            ST_Y(p1.latlng_geom) As "latitude",
            ST_X(p1.latlng_geom) As "longitude",
            userday_hll
    FROM {table_ref} p1
    ORDER BY user_guid
    """
    # execute query, enable chunked return
    return pd.read_sql(sql, con=db_connection_hll, chunksize=chunk_size)

def write_chunkeddf_tocsv(
    filename: str, usecols: List[str], chunked_df: List[pd.DataFrame],
    chunk_size: int = 5000000):
    """Write chunked dataframe to CSV"""
    for ix, chunk_df in enumerate(chunked_df):
        mode = 'a'
        header = False
        if ix == 0:
            mode = 'w'
            header = True
        chunk_df.to_csv(
            filename,
            mode=mode, columns=usecols,
            index=False, header=header)
        clear_output(wait=True)
        display(
            f"Stored {(ix*chunk_size)+len(chunk_df)} "
            f"post-locations to CSV..")

Execute Query:

In [22]:
%%time
root = Path.cwd().parents[0] / "00_data"
usecols = ["origin_id", "user_guid", "latitude", "longitude", "userday_hll"]
filename = root / "2025-09-19_userdays_DE_HLL.csv"
if Path(filename).exists():
    print("CSV already exists, skipping load from db.. (to reload, delete file)")
else:
    table_ref = "spatial.latlng_de_hll"
    write_chunkeddf_tocsv(
        chunked_df=get_latlng_fromdb(
            table_ref=table_ref),
        filename=filename,
        usecols=usecols)
'Stored 27409403 post-locations to CSV..'
CPU times: user 1min 42s, sys: 15.3 s, total: 1min 57s
Wall time: 2min 47s
In [23]:
%%time
data_files = {
    "LatLng-userday (DE)":filename, 
    }
tools.display_file_stats(data_files)
name LatLng-userday (DE)
size 2.53 GB
records 27,409,404
CPU times: user 2.77 s, sys: 468 ms, total: 3.24 s
Wall time: 3.23 s
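Because the CSV is sorted by user_guid, downstream processes can stream it one user at a time instead of loading the full 2.5 GB file into memory. A minimal sketch of this pattern (our addition, assuming the filename variable from above):

from itertools import groupby
import csv

# Stream the CSV user by user; this works because rows are
# sorted/grouped by user_guid (see the index created above)
with open(filename, encoding="utf-8", newline="") as f:
    reader = csv.DictReader(f)
    for user_guid, rows in groupby(reader, key=lambda r: r["user_guid"]):
        locations = list(rows)
        # ... run per-user home-location clustering on `locations`
        break  # demonstrate the first user only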

Export lat/lng data

We will also need to export the full-resolution lat/lng data for all posts in DE. This needs to be run in the SQL backend:

COPY (
    SELECT
        user_guid,
        ST_Y(post_latlng) AS latitude,
        ST_X(post_latlng) AS longitude
    FROM
        mviews.all_posts_de
)
TO '/pg_exp/2025-09-30_DE_coords_user.csv'
WITH (FORMAT CSV, HEADER, ENCODING 'UTF-8');
> COPY 66632117
> Time: 131900.793 ms (02:11.901)

Create notebook HTML

In [ ]:
!jupyter nbconvert --to html_toc \
    --output-dir=../resources/html/ ./00_user_origin_conversion.ipynb \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False >&- 2>&-

IOER FDZ Jupyter Base Template v0.13.0