Digital Landscape Traces: Inferring User Origin (Local vs. Tourist) ¶
Alexander Dunkel, Leibniz Institute of Ecological Urban and Regional Development,
Transformative Capacities & Research Data Centre (IÖR-FDZ)
This notebook documents the data preparation process for differentiating between "locals" and "tourists" based on their social media activity. The primary goal is to determine a probable home location for each user who has posted from within Germany. This allows us to classify them and subsequently analyze how these different groups interact with and perceive German landscapes.
The workflow involves several key stages:
- Identify Active Users: We begin by isolating the 6.3 million users who have posted at least once from Germany out of a global dataset of 127.6 million users.
- Gather Global Activity: For this cohort of users, we retrieve the locations of all their posts worldwide. A user's complete spatial footprint is necessary to reliably infer a home location.
- Home Location Inference: We then process this location history. By clustering the spatiotemporal data (specifically, "user-days," which represent a unique day a user was active in a location), we can identify the most probable home location for each individual.
- Classification: Based on whether a user's inferred home is inside or outside Germany, we classify them as a "local" or a "tourist." Users with insufficient data or ambiguous patterns are classified as "undetermined."
- Privacy-Preserving Export: Finally, the classified user data is aggregated and exported. To protect user privacy while enabling reproducibility, we use privacy-enhancing techniques, such as HyperLogLog (HLL), to create a dataset that supports aggregated mapping and analysis without exposing individual-level data.
Prepare environment¶
To run this notebook, you have two options:
1. Create an environment with the packages and versions shown in the following cell.
As a starting point, you may use the latest conda environment_default.yml from our Carto-Lab Docker container.
2. If Docker is available to you, we suggest using the Carto-Lab Docker Container.
Clone the repository and edit your .env to point to the repository where this notebook can be found, e.g.:
git clone https://github.com/ioer-dresden/carto-lab-docker.git
cd carto-lab-docker
cp .env.example .env
nano .env
## Enter:
# JUPYTER_NOTEBOOKS=~/notebooks/ephemeral_events
# TAG=v0.12.3
docker network create lbsn-network
docker compose pull && docker compose up -d
Load dependencies:
import os, sys
from pathlib import Path
import psycopg2
import geopandas as gp
import pandas as pd
import matplotlib.pyplot as plt
from typing import List, Tuple, Dict, Optional
from IPython.display import clear_output, display, HTML
To reduce the code shown in this notebook, some helper methods are made available in a separate file.
Load helper module from ../py/modules/base/tools.py.
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
sys.path.append(module_path)
from modules.base import tools
Activate autoreload of changed python files:
%load_ext autoreload
%autoreload 2
Parameters¶
Define initial parameters that affect processing
A 5-character geohash corresponds to a grid cell of roughly 5 km × 5 km. That is precise enough to distinguish neighborhoods but not individual buildings. Reduce the precision to increase privacy; increase it to improve granularity.
# Initial accuracy to reduce granularity
# of spatial data on hll convert
GEOHASH_PRECISION = 5
WORK_DIR = Path.cwd().parents[0] / "tmp" # Working directory
OUTPUT = Path.cwd().parents[0] / "out" # Define path to output directory (figures etc.)
for folder in [WORK_DIR, OUTPUT]:
folder.mkdir(exist_ok=True)
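As a quick sanity check of the chosen precision, the approximate cell dimensions can be derived from the geohash bit layout (5 bits per character, alternating between longitude and latitude). The snippet below is a minimal sketch; the kilometre conversion assumes roughly 111 km per degree, which holds at the equator and shrinks with latitude for the east-west extent.
import math

def geohash_cell_size(precision: int) -> tuple:
    """Approximate geohash cell size (width, height) in degrees."""
    bits = precision * 5
    lon_bits = math.ceil(bits / 2)  # longitude gets the extra bit for odd totals
    lat_bits = bits // 2
    return 360 / 2**lon_bits, 180 / 2**lat_bits

width_deg, height_deg = geohash_cell_size(GEOHASH_PRECISION)
print(f"Geohash-{GEOHASH_PRECISION} cell: "
      f"~{width_deg * 111:.1f} km × {height_deg * 111:.1f} km (at the equator)")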
Load environment variables from the .env dotfile:
from dotenv import load_dotenv
load_dotenv(
Path.cwd().parents[0] / '.env', override=True)
DB_NAME_RAWDB = os.getenv("DB_NAME_RAWDB") # lbsn-rawdb name
DB_HOST_RAWDB = os.getenv("DB_HOST_RAWDB") # lbsn-rawdb host
Initial Query for DE posts¶
First, create a table with the Geometry for Germany
CREATE TABLE spatial.germany_boundary (name VARCHAR(255), geom GEOMETRY(Polygon, 4326));
INSERT INTO spatial.germany_boundary (name, geom) VALUES ('Germany', ST_SetSRID(ST_GeomFromText(
'SRID=4326; Polygon ((9.1433304674271767 54.87302263558621007, 9.29480994002676653 54.80168366606841346, 9.59188580908289623 54.8869844880437654, 9.89410309238382979 54.84183198146081395, 10.59215134630761668 54.51989772869728768, 11.14133007712143808 54.57629296235040783, 11.64437791630786023 54.33073223410536912, 12.16520515899094335 54.38463668923185423, 12.32216612986303517 54.57714431725651139, 13.06899846071041793 54.84973594845530442, 13.42317610499327429 54.90912276250554669, 13.6613590011170345 54.86449207854661836, 14.05716348310829744 54.61181819135651949, 14.06966560638267083 54.27763627849151362, 14.16803558587361067 54.23880270088397992, 14.24221454696379396 53.98771964604486584, 14.18527042109309377 53.91197215758904804, 14.45052538304003065 53.26225137859830738, 14.34852597989009837 53.05471971874010251, 14.1436577746865737 52.96136543546651865, 14.12288246055715035 52.83765843329013023, 14.63909584120017371 52.57299936662752771, 14.53435722180836365 52.39500777578825819, 14.57599486577049674 52.28835922389913549, 14.71568838750914665 52.23588560993300689, 14.68153710045686466 52.11665381861888591, 14.75899936762675679 52.06476543577628036, 14.59014437256479368 51.82100998772079947, 14.75750455139399264 51.66150559111019902, 14.72911862642354208 51.53143962124158861, 14.97434511713350958 51.44213090221938955, 15.04181552066978611 51.27421588035866762, 14.9793136600973753 51.07703822223624002, 14.76638305793181871 50.81933252297761783, 14.61880015620226914 50.85780455873344863, 14.65014207539650215 50.9315402808593376, 14.56403797087568819 50.91849758409625792, 14.59917754360878916 50.98711099987281159, 14.50821195321913137 51.04315459410929634, 14.30164070725288639 51.05495050765983933, 14.25854303463341921 50.98751826955253819, 14.3876350063505356 50.89908797413073671, 13.90038356464981462 50.79324026222838029, 13.85526158645308215 50.72711999918868742, 13.55215096069832725 50.71383131086633966, 13.46562195618838587 50.60232153031444113, 13.37098295180180685 50.65054589437755794, 13.19525624507969042 50.50326319983685153, 13.03213105020785179 50.50996067954179125, 12.948144730395029 50.404311645865846, 12.81939501546764859 50.4597950326146929, 12.51204793536396664 50.39725816862960528, 12.33108448895802667 50.24271404931397456, 12.33320130239820855 50.17155191349007737, 12.18445310862409769 50.32232471681226116, 12.10019899948429156 50.31832794437917755, 12.2014080343533351 50.10849193455516115, 12.54782478838330917 49.92026953740929684, 12.40074313458802635 49.75553183832374771, 12.52185046292856896 49.6865330432653991, 12.65587915330036139 49.43453810554916572, 13.02921064395610529 49.30451982223206642, 13.40289555288751444 48.9873427339800287, 13.62827657000775616 48.94930714460906529, 13.83973263295467859 48.77148568211856627, 13.82569624828329324 48.61841526565206095, 13.73062545330208195 48.51446092253360121, 13.50879208898379602 48.59022953606427109, 13.32976817266307989 48.32359206612102298, 12.75814504086099532 48.12619261719555652, 13.00253663695627893 47.85328766219589625, 12.90516623683242869 47.72360115407008152, 13.08022144406442067 47.68711627559571298, 13.04750486166005885 47.49208302571111062, 12.80384823830559071 47.54988690075992963, 12.78106185032419262 47.67414247229731927, 12.49888118252061986 47.62494554180335626, 12.44012646416501866 47.69517254637594306, 12.2551562061939876 47.67929967054340068, 12.25707210839112804 47.74283907176578623, 12.16257153215053677 47.70112119928398897, 12.20320083824810808 47.60655540780589945, 11.63551183390404731 
47.5944925556869407, 11.2728918533763931 47.39785256984026773, 10.9703287861445915 47.40002058258386342, 10.89015587599294577 47.53719340445212538, 10.45445953557913299 47.55573834235065078, 10.43647213957763142 47.38064780345342797, 10.17419188314892153 47.27020119333047887, 10.23594858525905238 47.38181776567739689, 10.09978048238338033 47.35473600246461956, 10.09136801981833287 47.45875007585539151, 9.97084547833622992 47.54568017799789459, 9.77728950610588754 47.59490289205490399, 9.56184471246353951 47.50455693005835656, 9.25583223976610725 47.65909750244782117, 8.89535179764737904 47.64909162917541607, 8.80644066260651925 47.73826225010282087, 8.79566886012986515 47.675586000193789, 8.72795487162005657 47.69268014007644751, 8.65697507693829493 47.80034710313634605, 8.40439955129829741 47.69805376287092713, 8.62888483466741718 47.65174054269192538, 8.43714312685386858 47.56708343205480105, 8.20607236264947915 47.62099989582856097, 7.67923753718264379 47.53277577987836366, 7.63407099577989356 47.5611040421179041, 7.69355335740158353 47.60053205810805821, 7.60448410715588352 47.57779110297707348, 7.52107927548820498 47.66384788966843189, 7.6221212397035174 47.97279107394894027, 7.5769816401905814 48.11905638794200968, 7.74340888840953312 48.32546477927621709, 7.80214138631595233 48.58884316637909251, 8.23263282034844224 48.96657144521606142, 7.93564809218577238 49.05765511479177121, 7.63120134636511693 49.0548857056109, 7.44564667835157934 49.18415102884694079, 7.29317690469042645 49.11490732861039987, 7.05244591624608574 49.11275784369956909, 6.93810712304531307 49.22243763608810241, 6.73839761345914212 49.16359779690088772, 6.55081584387613702 49.42537952404586576, 6.36703072508794321 49.4694830820786251, 6.35758927802282869 49.57347554655927979, 6.52992601637259895 49.80704757898263324, 6.32226166784175803 49.8392608441127436, 6.11309021923830187 50.06076301060448941, 6.17568740345097922 50.23541597716564411, 6.40502825514988672 50.32330869294190734, 6.34221872841163758 50.38000552503501694, 6.37487303006633965 50.45124892989423415, 6.1984988852154288 50.52836615777204088, 6.26559804771176232 50.64270602206534022, 5.97480474174545861 50.7979707384588437, 6.09389718580655426 50.92131540468221829, 5.89752511249400868 50.97501346989577087, 5.86687398774438407 51.04838894699236107, 6.17541526680116704 51.15847977272341041, 6.08215105872443473 51.17179911623628641, 6.07213427667556971 51.24245499220415923, 6.22612991836109586 51.36051764507328699, 6.21230003242061457 51.51361357530629448, 6.09123534633420149 51.60590930886758088, 6.11820963876385804 51.65610310637333669, 5.95510381438958802 51.73828468199252484, 5.94515322244330946 51.82451729513712735, 6.1660793929496549 51.8409024535854428, 6.10334125709340469 51.89261697229551373, 6.15511712665414734 51.90541782067595022, 6.40506434674227876 51.82740354243958336, 6.3907900554168009 51.8742190043036544, 6.72166407024070622 51.89617477214807195, 6.82851309630240166 51.96406670771244762, 6.69507834878680796 52.07026212539682319, 7.0609954894503062 52.23478240438453213, 7.07243800091305275 52.37311639220780535, 6.98785513290812332 52.46916016640858516, 6.69772000387513344 52.48634521246162876, 6.6814024699480683 52.55359869772178172, 6.76649011302833969 52.56230612083083997, 6.70973231049777041 52.62782351794106717, 6.75214101519571841 52.6482893402808827, 7.05100158975159275 52.63994356514348283, 7.08700963868392364 52.85051212403533327, 7.21693779014697157 53.0067232392861456, 7.19205451590323719 53.3143070866028097, 6.9316102596058613 
53.33411061581233525, 6.88866268113787328 53.44247621270710624, 6.68362781717007692 53.49664088127173045, 6.34585998714874222 53.72450141470858398, 6.72671131827274849 53.86254353925301075, 7.72794138558651866 53.99333519546297566, 7.53412317824211186 54.13912000258608259, 7.58474754686181996 54.31008616784242804, 7.84582089511826553 54.39292727477396738, 8.16128570027490241 54.33250950670844759, 7.93236566924031195 54.73415971779678557, 8.05011711081607473 55.09845306415763844, 8.47261129387618439 55.05409929404726199, 8.55470193184874006 54.92094757397143212, 9.1433304674271767 54.87302263558621007))'
), 4326));
To speed up, create a spatial index:
CREATE INDEX idx_germany_boundary_geom ON spatial.germany_boundary USING GIST (geom);
ANALYZE spatial.germany_boundary;
The original big query to select all posts for Germany is shown below. This must be executed directly via postgres terminal, ideally with a terminal multiplexer.
CREATE MATERIALIZED VIEW mviews.all_posts_de AS
SELECT
    p.origin_id,
    p.user_guid,
    p.post_latlng,
    ST_Y(p.post_latlng) AS latitude,
    ST_X(p.post_latlng) AS longitude,
    p.post_create_date,
    p.post_publish_date,
    p.post_body,
    p.hashtags,
    p.emoji
FROM
topical.post AS p
JOIN
spatial.germany_boundary AS de
ON
ST_Intersects(p.post_latlng, de.geom)
To also show posts outside of Germany as surrounding context (without classification and with reduced opacity), the following query was used:
COPY (
WITH de_geometries AS (
SELECT
geom::geometry AS germany_geom
FROM
spatial.germany_boundary
LIMIT 1
),
expanded_bbox AS (
SELECT
ST_Buffer(
(ST_Envelope(germany_geom))::geography,
100000
)::geometry AS buffered_geom,
germany_geom
FROM
de_geometries
)
SELECT
ST_Y(p.post_latlng) AS latitude,
ST_X(p.post_latlng) AS longitude
FROM
topical.post AS p,
expanded_bbox
WHERE
ST_Intersects(p.post_latlng, expanded_bbox.buffered_geom)
AND NOT ST_Intersects(p.post_latlng, expanded_bbox.germany_geom)
)
TO '/pg_exp/2025-10-18_DE_outside_buffer_coords.csv'
WITH (FORMAT CSV, HEADER, ENCODING 'UTF-8');
Connecting to the Databases¶
Our setup uses two PostgreSQL databases. The first (lbsn-rawdb) contains the original, sensitive social media data. The second (hlldb) is our local working database, where we will perform calculations and store aggregated, privacy-preserving results using the HyperLogLog (HLL) extension. This separation ensures that the raw data remains isolated and secure.
db_user = "postgres"
db_pass = os.getenv('POSTGRES_PASSWORD')
db_host = "hlldb"
db_port = "5432"
db_name = "hlldb"
db_connection_hll = psycopg2.connect(
host=db_host,
port=db_port,
dbname=db_name,
user=db_user,
password=db_pass
)
db_conn_hll = tools.DbConn(db_connection_hll)
cur_hll = db_connection_hll.cursor()
cur_hll.execute("SELECT 1;")
print(cur_hll.statusmessage)
Simplify query access:
db_conn = tools.DbConn(db_connection_hll)
db_conn.query("SELECT 1;")
Create Query Schema¶
Create a new schema called mviews and update Postgres search_path parameter, to include new schema:
sql_query = """
CREATE SCHEMA IF NOT EXISTS mviews;
ALTER DATABASE hlldb
SET search_path = "$user",
social,
spatial,
temporal,
topical,
interlinkage,
extensions,
mviews;"""
Since the above query will not return any result, we'll directly use the psycopg2 cursor object:
cur = db_connection_hll.cursor()
cur.execute(sql_query)
print(cur.statusmessage)
To access the raw post data without copying it into our working database, we use a PostgreSQL feature called a Foreign Data Wrapper (FDW). This creates a secure, read-only link from hlldb to the tables in lbsn-rawdb. This approach is highly efficient for large datasets and maintains a clear separation between raw and processed data.
On hlldb, install postgres_fdw extension:
sql_query = """
CREATE EXTENSION IF NOT EXISTS postgres_fdw SCHEMA extensions;
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
MVIEWS_REF = "all_posts_de"
Check if foreign table has been imported already:
sql_query = f"""
SELECT EXISTS (
SELECT FROM information_schema.tables
WHERE table_schema = 'mviews'
AND table_name = '{MVIEWS_REF}'
);
"""
result = db_conn_hll.query(sql_query)
result["exists"][0]
Conditionally load the password - this only needs to be done once, if the server hasn't been added before.
Enter the password for pg user lbsn_reader
USER_KEY = None
if not result["exists"][0]:
import getpass
USER_KEY = getpass.getpass()
Create Foreign Server connection to rawdb, on hlldb:
sql_query = f"""
CREATE SERVER IF NOT EXISTS lbsnraw
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (
host '{DB_HOST_RAWDB}',
dbname '{DB_NAME_RAWDB}',
port '5432',
keepalives '1',
keepalives_idle '30',
keepalives_interval '10',
keepalives_count '5',
fetch_size '500000');
CREATE USER MAPPING IF NOT EXISTS for postgres
SERVER lbsnraw
OPTIONS (user 'lbsn_reader', password '{USER_KEY}');
"""
if not result["exists"][0]:
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
Optional cleanup step (deactivated cell):
If not result["exists"][0]: Import the foreign table definition on the hlldb.
sql_query = f"""
IMPORT FOREIGN SCHEMA mviews
LIMIT TO (
{MVIEWS_REF})
FROM SERVER lbsnraw
INTO mviews;
"""
# only import table
# if it hasn't been imported already
if not result["exists"][0]:
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
Test access to the imported foreign table:
db_conn.query(f"SELECT * FROM mviews.{MVIEWS_REF} LIMIT 10;")
If you see a permission denied for materialized view ..., grant the rights first:
GRANT SELECT ON ALL TABLES IN SCHEMA mviews TO lbsn_reader;
Commit changes to hlldb (deactivated, enable manually after check):
Preview samples¶
Origin IDs: 1 = Instagram, 2 = Flickr, 3 = Twitter, 4 = Facebook, 5 = Foursquare, 8 = Reddit, 23 = iNaturalist
cols = "origin_id, post_body, hashtags, emoji"
samples = db_conn.query(
f"""
(SELECT {cols} FROM mviews.{MVIEWS_REF} WHERE origin_id = 1 LIMIT 10)
UNION
(SELECT {cols} FROM mviews.{MVIEWS_REF} WHERE origin_id = 2 LIMIT 10)
UNION
(SELECT {cols} FROM mviews.{MVIEWS_REF} WHERE origin_id = 3 LIMIT 10)
UNION
(SELECT {cols} FROM mviews.{MVIEWS_REF} WHERE origin_id = 23 LIMIT 10);
""")
from pandas import option_context
with option_context('display.max_colwidth', 400):
display(samples.head(40))
Extract all user posts globally for those users who have been active in Germany¶
Working with user data to extract home locations is challenging due to privacy conflicts. What we want is an optimal tradeoff between acceptable granularity of results (estimated home location) and the coarsest possible granularity of location information, to prevent compromising user privacy. The output only needs to be a three-fold classifier for each user who has been active in Germany:
1. tourist (home location outside of Germany)
2. local (home location inside Germany)
3. both (either not enough data available or ambiguous results)
We can store this output in an HLL shard or a raw list of user IDs and classifications. To reach this output, follow these steps:
1. Extract a raw list of all unique user IDs active in Germany.
2. Extract a raw list of all post or userday locations per user (use a Geohash of e.g. 6 to reduce spatial granularity).
   2a. Exclude all users with fewer than e.g. 50 posts and classify these IDs as "both".
3. For all other users, identify the most probable home location based on HDBSCAN clustering; store this as a lat/lng coordinate (a sketch follows below).
4. Classify each user as either local or tourist, based on a lat/lng intersection with Germany.
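To make step 3 more concrete, below is a minimal sketch of what such a per-user classification could look like. This is not the implementation used downstream: it assumes the optional hdbscan and shapely packages, a germany_polygon shapely geometry of the boundary created above, and illustrative parameter choices (min_cluster_size, the userday threshold).
import numpy as np
import hdbscan
from shapely.geometry import Point

def classify_user(userday_coords: np.ndarray, germany_polygon,
                  min_userdays: int = 30) -> str:
    """Classify one user as 'local', 'tourist' or 'both' from an
    array of (lat, lon) userday coordinates (illustrative sketch)."""
    if len(userday_coords) < min_userdays:
        return "both"                                  # too little data to decide
    labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(userday_coords)
    if (labels >= 0).sum() == 0:
        return "both"                                  # only noise points, ambiguous
    top = np.bincount(labels[labels >= 0]).argmax()    # cluster with most userdays
    lat, lon = userday_coords[labels == top].mean(axis=0)
    home = Point(lon, lat)                             # shapely expects (x=lon, y=lat)
    return "local" if germany_polygon.contains(home) else "tourist"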
We can make use of HyperLogLog to a small degree. For instance, we could use HLL to aggregate userdays per Geohash for each user, giving us an immediate aggregation step that somewhat reduces the bulk of post data.
HyperLogLog parameters
The HyperLogLog extension for Postgres from Citus that we're using here contains several tweaks to optimize performance, which can affect the sensitivity of the data.
From a privacy perspective, for example, it is recommended to disable explicit mode.
Explicit mode? When explicit mode is active, full IDs will be stored for small sets. In our case, any coordinates frequented by few users (outliers) would store full user and post IDs.
To disable explicit mode:
db_conn_hll.query("SELECT hll_set_defaults(11, 5, 0, 1);")
Data Pseudonymization and Generalization¶
Before processing, we establish two crucial functions to protect user privacy, in line with ethical research standards.
1. Cryptographic Hashing for Pseudonymization The HyperLogLog algorithm uses MurmurHash, a non-cryptographic hashing function. While fast, it can be vulnerable to certain attacks (see Desfontaines et al. 2018). To mitigate this and prevent the re-identification of users, we add a robust layer of security. We use the pgcrypto extension to apply a seeded cryptographic hash (SHA-256 with HMAC) to all user identifiers.
This process converts a user's original ID into a secure, non-reversible pseudonym. The use of a secret key (or "salt") ensures that the same user ID will always produce the same pseudonym within our dataset, but this pseudonym cannot be reverse-engineered or matched against other datasets without the key.
2. Spatial Generalization with Geohash To reduce the spatial precision of location data, we snap GPS coordinates to a coarser grid using the Geohash algorithm. This function rounds coordinates to a specified precision, effectively aggregating points within a defined area. This protects privacy by obfuscating exact locations while retaining sufficient detail for neighborhood-level or regional analysis.
References:
Desfontaines, D., Lochbihler, A., & Basin, D. (2018). Cardinality Estimators do not Preserve Privacy. 1–21.
Create the pgcrypto extension:
sql_query = "CREATE EXTENSION IF NOT EXISTS pgcrypto SCHEMA extensions;"
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
Prepare the cryptographic hash function. The following function takes an id and a key (the seed value) and produces a new, unique hash that is returned in base64 encoding.
sql_query = """
/* Produce pseudonymized hash of input id with skey
* - using skey as seed value
* - sha256 cryptographic hash function
* - encode in base64 to reduce length of hash
* - remove trailing '=' from base64 string
* - return as text
*/
CREATE OR REPLACE FUNCTION
extensions.crypt_hash (id text, skey text)
RETURNS text
AS $$
SELECT
RTRIM(
ENCODE(
HMAC(
id::bytea,
skey::bytea,
'sha256'),
'base64'),
'=')
$$
LANGUAGE SQL
STRICT;
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
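For reference, the same pseudonymization can be reproduced in Python, which is handy for verifying the SQL function or for hashing single IDs outside the database. This is a sketch only; the IDs and keys shown are placeholders, not the values used in processing.
import hmac
import hashlib
import base64

def crypt_hash(id_: str, skey: str) -> str:
    """Python equivalent of extensions.crypt_hash:
    HMAC-SHA256 of id_ with key skey, base64-encoded, trailing '=' removed."""
    digest = hmac.new(skey.encode(), id_.encode(), hashlib.sha256).digest()
    return base64.b64encode(digest).decode().rstrip("=")

# The same ID yields a different pseudonym under a different key:
print(crypt_hash("12345678@N00", "example-key-1"))
print(crypt_hash("12345678@N00", "example-key-2"))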
Note: Cryptographic hashing alone will only produce pseudonymized data, since any id still relates to a single user (or post, etc.). Therefore, pseudonymization is considered a weak measure, which can be easily reversed, e.g. through rainbow tables or context lookup.
It is used here as an additional means to protect HLL sets from intersection attacks, e.g. as is discussed by Desfontaines et al. (2018).
What is a seed value? The seed value (skey) is a secret that is used, together with the encryption function (e.g. sha256) to produce a unique, collision-free output value (the hashed id). The same ID will be converted to a different hash if the seed value is changed. Therefore, in our case, the seed value must remain the same during the entire processing of data. In this case, the seed is called a key. This key can be destroyed afterwards, if no subsequent updates are necessary.
sql_query = """
/* Reduce spatial granularity of coordinates by GeoHash
- will keep Null Island unmodified
*/
CREATE OR REPLACE FUNCTION
extensions.geohash_reduce (IN coord geometry, geohash integer DEFAULT 5)
RETURNS geometry
AS $$
SELECT
CASE WHEN ST_Y(coord) = 0 AND ST_X(coord) = 0
THEN
coord
ELSE
ST_PointFromGeoHash(ST_GeoHash(coord, geohash), geohash)
END as "coord"
$$
LANGUAGE SQL
STRICT;
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
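The effect of extensions.geohash_reduce can be illustrated in plain Python: the coordinate is bisected bit by bit (longitude first, following the geohash specification) and snapped to the centre of the resulting cell. The sketch below is for illustration only; the actual reduction is done in PostGIS via ST_GeoHash and ST_PointFromGeoHash.
def geohash_reduce(lat: float, lon: float, precision: int = 5) -> tuple:
    """Snap a coordinate to the centre of its geohash cell
    (Null Island is kept unmodified, mirroring the SQL function)."""
    if lat == 0.0 and lon == 0.0:
        return lat, lon
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    even = True  # geohash bits alternate, starting with longitude
    for _ in range(precision * 5):
        if even:
            mid = (lon_lo + lon_hi) / 2
            lon_lo, lon_hi = (mid, lon_hi) if lon >= mid else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            lat_lo, lat_hi = (mid, lat_hi) if lat >= mid else (lat_lo, mid)
        even = not even
    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2

print(geohash_reduce(51.0504, 13.7373))  # Dresden, snapped to a ~5 km cell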
Step 1: Extract Unique Users Active in Germany¶
The first step in our analysis is to create a definitive list of all users who have posted at least one time from within Germany. We create a materialized view for this list, mviews.unique_user_de, which improves performance in subsequent queries.
Optional cleanup step:
%%time
sql_query = f"""
CREATE MATERIALIZED VIEW IF NOT EXISTS mviews.unique_user_de AS
SELECT DISTINCT t1.origin_id, t1.user_guid
FROM mviews.all_posts_de t1;
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
Consistency check and validation: The exact count of 6,233,753 users confirms our earlier (abstracted) HLL cardinality estimate. We can also query the count, to check again:
db_conn_hll.query(f"SELECT count(user_guid) FROM mviews.unique_user_de;")
db_connection_hll.commit()
Step 2: Fetch Global Post History for German Users¶
With our list of Germany-active users, we now need to retrieve their entire global posting history. This is the most computationally intensive step, as it involves joining our list of ~6 million users against the main posts table containing hundreds of millions of entries.
The following query is executed directly on the lbsn-rawdb server for performance reasons. It performs several actions simultaneously:
- It joins the unique_user_de list with the main topical.post table.
- It applies the cryptographic hash function (extensions.crypt_hash) to the user_guid to pseudonymize the data at the source.
- It reduces the precision of the coordinates using our geohash_reduce function.
- It creates a userday identifier by concatenating the hashed user ID with the date (see the sketch below). This unique marker is central to our home location inference, as it represents a day of activity for a specific user.
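To illustrate the userday format (also visible in the sample output further down), here is a small self-contained sketch; the user ID and key are hypothetical.
import hmac, hashlib, base64
from datetime import date

uid, key = "12345678@N00", "example-key"  # hypothetical values
hashed_user = base64.b64encode(
    hmac.new(key.encode(), uid.encode(), hashlib.sha256).digest()).decode().rstrip("=")
userday = hashed_user + date(2017, 10, 19).strftime("%Y-%m-%d")
print(userday)  # '<43-character hash>2017-10-19'
# The date suffix is always 10 characters, so the hashed user_guid can later be
# recovered with left(userday, length(userday) - 10) in SQL (or slicing in Python):
assert userday[:-10] == hashed_user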
First, to speed up the calculation, we will add an index to the materialized view, since indexes are not inherited.
%%time
sql_query = f"""
CREATE INDEX idx_uniqueuser_userguid
ON mviews.unique_user_de (user_guid);
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
However, the join itself cannot be computed from hlldb: topical.post is a foreign table, and comparing ~800 million rows through the FDW wrapper is impractical.
byobu
docker exec -it lbsn-rawdb /bin/bash
psql -h localhost -p 5432 -U postgres rawdb
-- Enable timing:
\timing
-- test
select * from mviews.all_posts_de limit 1;
Therefore, execute both of the queries below directly on the foreign database, connected to the Postgres backend, as they may take a while to compute.
CREATE MATERIALIZED VIEW IF NOT EXISTS mviews.unique_user_de AS
SELECT DISTINCT t1.origin_id, t1.user_guid
FROM mviews.all_posts_de t1;
CREATE INDEX idx_uniqueuser_userguid
ON mviews.unique_user_de (user_guid);
Below is the "big" query:
CREATE MATERIALIZED VIEW IF NOT EXISTS mviews.posts_from_de_users AS
SELECT p.origin_id,
extensions.crypt_hash(p.user_guid, '[redacted]') AS user_guid,
ST_Y(extensions.geohash_reduce(p.post_latlng, {GEOHASH_PRECISION})) AS latitude,
ST_X(extensions.geohash_reduce(p.post_latlng, {GEOHASH_PRECISION})) AS longitude,
(extensions.crypt_hash(p.user_guid, '[redacted]') || to_char(COALESCE(NULLIF(p.post_create_date, ''), p.post_publish_date), 'yyyy-MM-dd')) AS userday
FROM topical.post p
JOIN mviews.unique_user_de u
ON p.user_guid = u.user_guid;
> SELECT 291557025
> Time: 7977669.774 ms (02:12:57.670)
The query took a little over two hours and selected 291,557,025 posts (~292 million) made worldwide by the Germany-active users (our future locals and tourists).
Step 3: Filter Data for Reliable Home Location Inference¶
Before we can infer home locations, we must filter the 292 million collected posts to improve the quality and reliability of our input data. There are two main filtering criteria:
- User Activity: Reliably inferring a home location requires a sufficient amount of data per user. Following best-practice parameter selection from previous study results (e.g., Huang et al., 2014), we set a minimum threshold of user activity. In this case, we exclude users with 30 or fewer unique userdays (days on which they posted). This removes users whose data is too sparse to yield a meaningful result.
- Invalid Coordinates: The dataset may contain erroneous geotags, with a common error being posts located at (0.0, 0.0), often referred to as "Null Island". Since these points provide no valid geographical information, they are excluded from the analysis.
These filtering steps are performed on the lbsn-rawdb server to create a new materialized view, posts_from_de_users_filtered. This view contains the cleaned and pseudonymized data that will serve as the basis for our home location analysis.
Import worldwide users posts via FDW¶
Check if already imported:
MVIEWS_REF_USER = 'posts_from_de_users'
sql_query = f"""
SELECT EXISTS (
SELECT FROM information_schema.tables
WHERE table_schema = 'mviews'
AND table_name = '{MVIEWS_REF_USER}'
);
"""
result = db_conn_hll.query(sql_query)
sql_query = f"""
IMPORT FOREIGN SCHEMA mviews
LIMIT TO (
{MVIEWS_REF_USER})
FROM SERVER lbsnraw
INTO mviews;
"""
# only import table
# if it hasn't been imported already
if not result["exists"][0]:
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
If you see a permission denied for materialized view ..., grant the rights first:
GRANT SELECT ON ALL TABLES IN SCHEMA mviews TO lbsn_reader;
Make a sample selection (similar to the one we used at the beginning of the notebook).
cols = "origin_id, user_guid, latitude, longitude, userday"
samples = db_conn_hll.query(
f"""
(SELECT {cols} FROM mviews.{MVIEWS_REF_USER} WHERE origin_id = 1 LIMIT 10)
UNION
(SELECT {cols} FROM mviews.{MVIEWS_REF_USER} WHERE origin_id = 2 LIMIT 10)
UNION
(SELECT {cols} FROM mviews.{MVIEWS_REF_USER} WHERE origin_id = 3 LIMIT 10)
UNION
(SELECT {cols} FROM mviews.{MVIEWS_REF_USER} WHERE origin_id = 23 LIMIT 10);
""")
with option_context('display.max_colwidth', 400):
display(samples.head(40))
If the sample was extracted without errors, commit changes to the local working DB:
db_connection_hll.commit()
Filter data¶
Before classification, let's filter the data for the categories of both or ambiguous. In line with the general parameter recommendations set out in the paper by Huang et al. (2014, 'From Where Do Tweets Originate?'), we can set a lower limit of 50 posts per user, below which it is no longer possible to unambiguously identify the user's home location. Rather than using post count, we use userdays as a global measure of the total number of days on which the user has posted at least once. We reduce the threshold to 30 userdays, to account for the slightly increased aggregation granularity.
We can also see what we can additionally filter in the query output above. The Null Island is prominently present in the first few elements selected from the database. These are erroneous posts with incorrect geotags. As they don't provide reliable geo-information for inferring home location, we can exclude them too.
Let's first test to exclude the Null Island
cols = "origin_id, user_guid, latitude, longitude, userday"
samples = db_conn_hll.query(
f"""
(SELECT {cols} FROM mviews.{MVIEWS_REF_USER} WHERE origin_id = 1 AND NOT (latitude = 0.0 AND longitude = 0.0) LIMIT 10)
UNION
(SELECT {cols} FROM mviews.{MVIEWS_REF_USER} WHERE origin_id = 2 AND NOT (latitude = 0.0 AND longitude = 0.0) LIMIT 10)
UNION
(SELECT {cols} FROM mviews.{MVIEWS_REF_USER} WHERE origin_id = 3 AND NOT (latitude = 0.0 AND longitude = 0.0) LIMIT 10)
UNION
(SELECT {cols} FROM mviews.{MVIEWS_REF_USER} WHERE origin_id = 23 AND NOT (latitude = 0.0 AND longitude = 0.0) LIMIT 10);
""")
with option_context('display.max_colwidth', 400):
display(samples.head(20))
This looks good.
We are now only missing the filter step to exclude all users with 30 or fewer total userdays. We will do this together with the Null Island exclusion, and we will again run it on our foreign raw server, to avoid sending 292 million elements through the Foreign Data Wrapper.
First, create indexes on the original materialized view, so the query speed is increased:
-- For user-level lookups
CREATE INDEX idx_posts_from_de_users_userguid
ON mviews.posts_from_de_users (user_guid);
-- For userday-based queries
CREATE INDEX idx_posts_from_de_users_userday
ON mviews.posts_from_de_users (userday);
-- Optional for spatial queries
CREATE INDEX idx_posts_from_de_users_lat_lon
ON mviews.posts_from_de_users (latitude, longitude);
Then, create a filtered mview containing only the valid post data required to infer user home locations. We split this into two distinct queries, to speed up compute and reduce complexity of the sql query.
First, group by distinct userdays and select only userday in the output selection. We can later extract the hashed user_guid substring from the result.
CREATE MATERIALIZED VIEW IF NOT EXISTS mviews.distinct_user_days_de_users AS
SELECT userday
FROM mviews.posts_from_de_users
GROUP BY userday;
> SELECT 106540196
> Time: 96678.938 ms (01:36.679)
This step reduced the dataset from ~292 million original posts to ~106 million distinct userdays, a reduction of roughly two thirds.
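A quick arithmetic check of the reported reduction:
posts, userdays = 291_557_025, 106_540_196
print(f"{userdays / posts:.1%} of the rows remain "
      f"({1 - userdays / posts:.1%} reduction)")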
How many users will be excluded, if we filter by > 30 userdays?
WITH excluded_users AS (
SELECT left(userday, length(userday) - 10) AS user_guid
FROM mviews.distinct_user_days_de_users
GROUP BY left(userday, length(userday) - 10)
HAVING COUNT(*) <= 30
)
SELECT COUNT(*) AS number_of_excluded_users
FROM excluded_users;
> number_of_excluded_users
> --------------------------
> 5646069
> (1 row)
>
> Time: 122854.215 ms (02:02.854)
5646069 / 6233753 * 100
This will exclude about 90% of the users in our dataset. These users do not have enough activity to derive a home location. These numbers tell a very important story about the nature of social media data (see the long-tail or power-law distribution).
Lastly, run the query below to create a new mview by selecting active users with > 30 distinct userdays and by filtering the Null Island.
CREATE MATERIALIZED VIEW IF NOT EXISTS mviews.posts_from_de_users_filtered AS
WITH active_users AS (
SELECT left(userday, length(userday) - 10) AS user_guid
FROM mviews.distinct_user_days_de_users
GROUP BY left(userday, length(userday) - 10)
HAVING COUNT(*) > 30
)
SELECT p.*
FROM mviews.posts_from_de_users p
JOIN active_users a
ON p.user_guid = a.user_guid
WHERE NOT (p.latitude = 0.0 AND p.longitude = 0.0);
SELECT 248294113
Time: 704626.323 ms (11:44.626)
This step filtered the dataset to ~248 million valid posts (excluding Null Island and low-activity users), down from the original ~292 million. Despite the large scale, the query completed in under 12 minutes.
Import to our local working database.
MVIEWS_REF_USER_FILTERED = 'posts_from_de_users_filtered'
sql_query = f"""
SELECT EXISTS (
SELECT FROM information_schema.tables
WHERE table_schema = 'mviews'
AND table_name = '{MVIEWS_REF_USER_FILTERED}'
);
"""
result = db_conn_hll.query(sql_query)
sql_query = f"""
IMPORT FOREIGN SCHEMA mviews
LIMIT TO (
{MVIEWS_REF_USER_FILTERED})
FROM SERVER lbsnraw
INTO mviews;
"""
# only import table
# if it hasn't been imported already
if not result["exists"][0]:
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
If you see a permission denied for materialized view ..., grant the rights first:
GRANT SELECT ON ALL TABLES IN SCHEMA mviews TO lbsn_reader;
cols = "origin_id, user_guid, latitude, longitude, userday"
samples = db_conn_hll.query(
f"""
(SELECT {cols} FROM mviews.{MVIEWS_REF_USER_FILTERED} WHERE origin_id = 1 AND NOT (latitude = 0.0 AND longitude = 0.0) LIMIT 10)
UNION
(SELECT {cols} FROM mviews.{MVIEWS_REF_USER_FILTERED} WHERE origin_id = 2 AND NOT (latitude = 0.0 AND longitude = 0.0) LIMIT 10)
UNION
(SELECT {cols} FROM mviews.{MVIEWS_REF_USER_FILTERED} WHERE origin_id = 3 AND NOT (latitude = 0.0 AND longitude = 0.0) LIMIT 10)
UNION
(SELECT {cols} FROM mviews.{MVIEWS_REF_USER_FILTERED} WHERE origin_id = 23 AND NOT (latitude = 0.0 AND longitude = 0.0) LIMIT 10);
""")
with option_context('display.max_colwidth', 400):
display(samples.head(40))
Step 4: Aggregate Data for Clustering Analysis¶
This is the final data preparation step before the home location clustering can be performed (in a subsequent process). The goal here is to aggregate the data by user and unique location, counting the number of distinct user-days for each point. This creates a summary of how frequently each user visits specific locations.
To do this efficiently, we insert the filtered data into a new table in our local hlldb, named spatial.latlng_de_hll. The query performs the following aggregation:
- It groups the data by user_guid and location (latitude, longitude).
- For each group, it uses hll_add_agg to create a HyperLogLog structure containing all the unique userday hashes. The cardinality of this HLL set gives us a highly accurate and memory-efficient estimate of the number of days a user was active at that specific location.
The resulting table provides a compact summary for each user, listing all the locations they have visited and the number of days they spent at each, ready for a density-based clustering algorithm like HDBSCAN to identify their most probable home location.
sql_query = """
CREATE TABLE IF NOT EXISTS spatial.latlng_de_hll (
origin_id int,
user_guid text,
userday_hll hll,
latlng_geom geometry(Point, 4326)
);
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
Store changes to DB.
db_connection_hll.commit()
Aggregate HLL by user + lat/lng:
- connect to the backend of the local hll db
- execute directly in the psql shell
byobu
docker exec -it lbsn-hlldb /bin/bash
psql -h localhost -p 5432 -U postgres hlldb
-- Enable timing:
\timing
-- test
select * from mviews.posts_from_de_users_filtered limit 1;
> origin_id | user_guid | latitude | longitude | userday
> -----------+---------------------------------------------+-----------------+-----------------+-------------------------------------------------------
> 1 | DxC+3nuAOXEF5fQN8ngX3AFDLuAF/algMxhp0dpbrU4 | -37.81494140625 | 144.99755859375 | DxC+3nuAOXEF5fQN8ngX3AFDLuAF/algMxhp0dpbrU42017-10-19
> (1 row)
On the raw/remote table: Add indexes
CREATE INDEX idx_posts_grouping
ON posts_from_de_users_filtered (origin_id, latitude, longitude, user_guid);
On the local/hll db, set HLL aggregation parameters (see above):
SELECT hll_set_defaults(11, 5, 0, 1);
- this disables explicit mode, which is important for privacy preservation
Then run the aggregation query on the local/hlldb.
INSERT INTO spatial.latlng_de_hll(
origin_id,
user_guid,
userday_hll,
latlng_geom)
SELECT origin_id,
user_guid,
hll_add_agg(hll_hash_text(userday)) AS userday_hll,
ST_SetSRID(ST_MakePoint(longitude, latitude), 4326) AS latlng_geom
FROM mviews.posts_from_de_users_filtered
GROUP BY origin_id, latitude, longitude, user_guid;
> INSERT 0 27,409,403
> Time: 1479633.236 ms (24:39.633)
In this final data preparation step, the raw post data is transformed into a compact summary of each user's geographic footprint. The initial filtered table contained approximately 248 million rows, with each row representing a single post from a user. The aggregation query then collapses this data, creating just one row for each unique combination of a user and a specific location (geohash cell). Consequently, if an individual posted hundreds of times from the same location across different days, all those entries are now represented by a single row, stored in the userday_hll column.
The detailed temporal information is not lost; rather, the number of distinct days the user was active at that spot is efficiently stored within the userday_hll data structure and can be estimated with an HLL cardinality calculation. This crucial transformation resulted in a significant and expected reduction in dataset size, from 248 million rows to roughly 27 million rows, producing a clean, efficient table where each row signifies a unique place a person has visited. This sets up the data for the clustering step that determines each user's home location.
Have a look at the resulting data structure.
db_conn_hll.query(f"SELECT * FROM spatial.latlng_de_hll LIMIT 10;")
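To inspect the aggregated userday counts per location, the cardinality of each HLL set can be estimated directly in SQL. The following is a short sketch using the notebook's query helper; hll_cardinality() is provided by the postgresql-hll extension.
# Estimate distinct userdays per (user, location) row for a few sample rows
db_conn_hll.query("""
    SELECT user_guid,
           ST_Y(latlng_geom) AS latitude,
           ST_X(latlng_geom) AS longitude,
           hll_cardinality(userday_hll) AS approx_userdays
    FROM spatial.latlng_de_hll
    LIMIT 10;
    """)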
Add an index on column user_guid, so we can export the CSV sorted/grouped by user_guid (= all posts of a single user together).
sql_query = """
CREATE INDEX idx_latlng_de_hll_user_guid
ON spatial.latlng_de_hll (user_guid);
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
db_connection_hll.commit()
Export data as CSV¶
Save hll data to CSV. The following records are available from table spatial.latlng_de_hll:
- distinct origin_id plus user_guid and latitude and longitude coordinates (clear text / hashed), this is the "base" we're working with
- userday_hll - approximate user days stored as an hll set
def get_latlng_fromdb(
table_ref: str, chunk_size: int = 5000000) -> List[pd.DataFrame]:
"""Returns spatial.latlng data from db, excluding Null Island
Important: The data will be grouped/sorted by hashed user_guid.
"""
sql = f"""
SELECT origin_id,
user_guid,
ST_Y(p1.latlng_geom) As "latitude",
ST_X(p1.latlng_geom) As "longitude",
userday_hll
FROM {table_ref} p1
ORDER BY user_guid
"""
# execute query, enable chunked return
return pd.read_sql(sql, con=db_connection_hll, chunksize=chunk_size)
def write_chunkeddf_tocsv(
filename: str, usecols: List[str], chunked_df: List[pd.DataFrame],
chunk_size: int = 5000000):
"""Write chunked dataframe to CSV"""
for ix, chunk_df in enumerate(chunked_df):
mode = 'a'
header = False
if ix == 0:
mode = 'w'
header = True
chunk_df.to_csv(
filename,
mode=mode, columns=usecols,
index=False, header=header)
clear_output(wait=True)
display(
f"Stored {(ix*chunk_size)+len(chunk_df)} "
f"post-locations to CSV..")
Execute Query:
%%time
root = Path.cwd().parents[0] / "00_data"
usecols = ["origin_id", "user_guid", "latitude", "longitude", "userday_hll"]
filename = root / "2025-09-19_userdays_DE_HLL.csv"
if Path(filename).exists():
print(f"CSV already exists, skipping load from db.. (to reload, delete file)")
else:
table_ref = f"spatial.latlng_de_hll"
write_chunkeddf_tocsv(
chunked_df=get_latlng_fromdb(
table_ref=table_ref),
filename=filename,
usecols=usecols)
%%time
data_files = {
"LatLng-userday (DE)":filename,
}
tools.display_file_stats(data_files)
Export lat/lng data¶
We will also need to export the full-resolution lat/lng data for all posts in DE. This needs to be run in the SQL backend:
COPY (
SELECT
user_guid,
ST_Y(post_latlng) AS latitude,
ST_X(post_latlng) AS longitude
FROM
mviews.all_posts_de
)
TO '/pg_exp/2025-09-30_DE_coords_user.csv'
WITH (FORMAT CSV, HEADER, ENCODING 'UTF-8');
> COPY 66632117
> Time: 131900.793 ms (02:11.901)
Create notebook HTML¶
!jupyter nbconvert --to html_toc \
--output-dir=../resources/html/ ./00_user_origin_conversion.ipynb \
--template=../nbconvert.tpl \
--ExtractOutputPreprocessor.enabled=False >&- 2>&-