Digital Landscape Traces: Clustering

Alexander Dunkel, Leibniz Institute of Ecological Urban and Regional Development,
Transformative Capacities & Research Data Centre (IÖR-FDZ)

No description has been provided for this image
•••
Out[1]:

Last updated: Nov-03-2025, Carto-Lab Docker Version 1.1.0

This notebook continues the analysis of social media activity to determine user home locations. It takes the privacy-preserving, HLL-aggregated dataset of worldwide user activity generated in the previous notebook and performs the final home location inference for approximately 6 million users (who were active in Germany at least once).

The primary challenge at this stage is processing the multi-gigabyte dataset efficiently without loading it into memory. This notebook implements a scalable, resumable batch processing workflow with the following key stages:

  1. Memory-Efficient Data Streaming: The 2.5 GB sorted CSV file is read using a custom Python generator. This function streams the data chunk by chunk, yielding the complete location history for one user at a time, keeping memory usage minimal.
  2. Spatial Aggregation by Country: For each user, instead of computationally intensive point clustering, we perform an efficient spatial join. Each of the user's location points is associated with the country polygon it falls within.
  3. Cardinality Estimation with HyperLogLog: The core of the analysis is to determine where a user is most active. Rather than simply counting posts, we aggregate the userday_hll objects for each country. By performing a hll_union operation via a high-performance PostgreSQL backend, we can accurately estimate the total number of distinct active days a user has spent in each country.
  4. Home Location Inference and Export: The country with the highest estimated user-day cardinality is designated as that user's most probable home. This result (user_guid, countr) is immediately appended to a final output CSV. The script is designed to be resumable, allowing it to be stopped and restarted without losing progress.

The final output is a simple mapping of each user to their inferred home country, which will serve as the basis for the subsequent "local" vs. "tourist" classification.

Preparations

In [2]:
import os, sys
import csv
from pathlib import Path
import psycopg2
import geopandas as gp
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
from typing import Generator, Optional, Tuple
from IPython.display import clear_output, display
from pyproj import Transformer
In [3]:
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules.base import tools, hll
from modules.base.hll import union_hll, cardinality_hll
In [4]:
%load_ext autoreload
%autoreload 2
In [5]:
OUTPUT = Path.cwd().parents[0] / "out"               # output directory for figures (etc.)
WORK_DIR = Path.cwd().parents[0] / "tmp"             # Working directory
DATA_FOLDER = Path.cwd().parents[0] / "00_data"      #  Static Data Folder
In [6]:
OUTPUT.mkdir(exist_ok=True)
(OUTPUT / "figures").mkdir(exist_ok=True)
(OUTPUT / "svg").mkdir(exist_ok=True)
WORK_DIR.mkdir(exist_ok=True)

Plot styling

In [7]:
plt.style.use('default')
In [8]:
CRS_WGS = "epsg:4326" # WGS1984
CRS_PROJ = "esri:54009" # Mollweide

Define Transformer ahead of time with xy-order of coordinates

In [9]:
PROJ_TRANSFORMER = Transformer.from_crs(
    CRS_WGS, CRS_PROJ, always_xy=True)

Set global font

In [10]:
plt.rcParams['font.family'] = 'serif'
plt.rcParams['font.serif'] = ['Times New Roman'] + plt.rcParams['font.serif']

Load User Location Data

We begin by loading the user location data, which was pre-aggregated and approximated (HLL) and exported to a CSV in the previous notebook. This dataset is too large to load into memory at once, so we will only preview its structure here.

Data Structure:

  • user_guid: A unique identifier for each user.
  • latitude, longitude: The coordinates of an aggregated location.
  • userday_hll: A binary HyperLogLog (HLL) object. It efficiently represents the set of distinct days the user was active at this specific coordinate.

This file is sorted by user_guid, which is a critical precondition for the efficient streaming process that follows.

In [11]:
DE_USERDAYS = DATA_FOLDER / "2025-09-19_userdays_DE_HLL.csv"

Preview CSV

In [12]:
chunks = pd.read_csv(DE_USERDAYS, chunksize=20)
df = next(chunks)
In [13]:
df.head(20)
Out[13]:
origin_id user_guid latitude longitude userday_hll
0 3 +++aXv3XJOA9l2UK/2UNtThjsvmKpTFBV7SFxXFen8A 52.492676 13.337402 \x138b4024c1c961
1 3 +++aXv3XJOA9l2UK/2UNtThjsvmKpTFBV7SFxXFen8A 52.492676 13.425293 \x138b4002c209420f25128116e419211fc12e442ee134...
2 3 +++aXv3XJOA9l2UK/2UNtThjsvmKpTFBV7SFxXFen8A 52.536621 13.205566 \x138b407543
3 3 +++aXv3XJOA9l2UK/2UNtThjsvmKpTFBV7SFxXFen8A 52.536621 13.337402 \x138b405961
4 3 +++aXv3XJOA9l2UK/2UNtThjsvmKpTFBV7SFxXFen8A 52.536621 13.381348 \x138b403ba248015161bb62ed01
5 2 +++r4aw85H0B5jmc9qeu8A4N0uBNk53gB3xFNgNDPm4 41.901855 12.458496 \x138b4044c145a18a61cf21
6 2 +++r4aw85H0B5jmc9qeu8A4N0uBNk53gB3xFNgNDPm4 41.901855 12.502441 \x138b4044c1
7 2 +++r4aw85H0B5jmc9qeu8A4N0uBNk53gB3xFNgNDPm4 45.241699 13.601074 \x138b4018c1
8 2 +++r4aw85H0B5jmc9qeu8A4N0uBNk53gB3xFNgNDPm4 45.329590 13.557129 \x138b4068039381ad01
9 2 +++r4aw85H0B5jmc9qeu8A4N0uBNk53gB3xFNgNDPm4 47.087402 15.446777 \x138b4027c25c67
10 2 +++r4aw85H0B5jmc9qeu8A4N0uBNk53gB3xFNgNDPm4 47.263184 14.523926 \x138b407a41
11 2 +++r4aw85H0B5jmc9qeu8A4N0uBNk53gB3xFNgNDPm4 47.263184 15.314941 \x138b408943
12 2 +++r4aw85H0B5jmc9qeu8A4N0uBNk53gB3xFNgNDPm4 47.351074 15.446777 \x138b400843
13 2 +++r4aw85H0B5jmc9qeu8A4N0uBNk53gB3xFNgNDPm4 47.570801 13.645020 \x138b406442
14 2 +++r4aw85H0B5jmc9qeu8A4N0uBNk53gB3xFNgNDPm4 47.614746 15.886230 \x138b40efe1
15 2 +++r4aw85H0B5jmc9qeu8A4N0uBNk53gB3xFNgNDPm4 47.658691 15.798340 \x138b407821
16 2 +++r4aw85H0B5jmc9qeu8A4N0uBNk53gB3xFNgNDPm4 47.658691 15.842285 \x138b407821
17 2 +++r4aw85H0B5jmc9qeu8A4N0uBNk53gB3xFNgNDPm4 47.702637 13.469238 \x138b4009a3
18 2 +++r4aw85H0B5jmc9qeu8A4N0uBNk53gB3xFNgNDPm4 47.702637 14.304199 \x138b400581
19 2 +++r4aw85H0B5jmc9qeu8A4N0uBNk53gB3xFNgNDPm4 47.746582 16.721191 \x138b40aea2
In [14]:
%%time
data_files = {
    "DE_USERDAYS":DE_USERDAYS, 
    }
tools.display_file_stats(data_files)
name DE_USERDAYS
size 2.53 GB
records 27,409,404
CPU times: user 2.86 s, sys: 584 ms, total: 3.44 s
Wall time: 3.44 s

Streaming and Processing User Data

To handle the multi-gigabyte CSV file without running out of memory, we'll process it as a stream. The following function reads the file chunk by chunk and yields a complete DataFrame for one user at a time. This allows us to analyze each user's entire location history individually.

In [15]:
def stream_user_locations(
    filepath: Path, chunksize: int = 100_000
) -> Generator[pd.DataFrame, None, None]:
    """
    Reads a large CSV sorted by 'user_guid' and yields a complete
    DataFrame for one user at a time. Handles multiple user 
    boundaries within a single chunk.
    """
    user_chunks = []
    current_user_id = None
    # Use pd.read_csv with chunksize to create an iterator
    chunk_iterator = pd.read_csv(
        filepath, chunksize=chunksize)

    for chunk in chunk_iterator:
        if current_user_id is None:
            # Initialize with the first user ID from the very first chunk
            current_user_id = chunk[
                "user_guid"].iloc[0]

        # Use a while loop to process the current chunk until it's empty
        # This handles chunks that contain data for multiple users
        while not chunk.empty:
            # Find the row index where the user_guid changes
            last_idx = chunk["user_guid"] \
                .searchsorted(current_user_id, side="right")

            if last_idx < len(chunk):
                # Boundary is within this chunk.
                # 1. Take the part for the current user and append it
                user_part = chunk.iloc[:last_idx]
                user_chunks.append(user_part)
                # 2. Yield the complete data for this user
                yield pd.concat(user_chunks)
                # 3. The rest of the chunk belongs to the next user(s)
                chunk = chunk.iloc[last_idx:]
                # 4. Reset for the next user
                user_chunks = []
                current_user_id = chunk["user_guid"].iloc[0]
            else:
                # This entire remaining chunk belongs to the current user.
                user_chunks.append(chunk)
                # Break the while loop to get the next chunk from the file
                break

    # After the loop, yield the data for the very last user in the file
    if user_chunks:
        yield pd.concat(user_chunks)

Test on the first user

In [16]:
user_generator = stream_user_locations(DE_USERDAYS)
df = next(user_generator)
In [17]:
len(df)
Out[17]:
5
In [18]:
df.head(10)
Out[18]:
origin_id user_guid latitude longitude userday_hll
0 3 +++aXv3XJOA9l2UK/2UNtThjsvmKpTFBV7SFxXFen8A 52.492676 13.337402 \x138b4024c1c961
1 3 +++aXv3XJOA9l2UK/2UNtThjsvmKpTFBV7SFxXFen8A 52.492676 13.425293 \x138b4002c209420f25128116e419211fc12e442ee134...
2 3 +++aXv3XJOA9l2UK/2UNtThjsvmKpTFBV7SFxXFen8A 52.536621 13.205566 \x138b407543
3 3 +++aXv3XJOA9l2UK/2UNtThjsvmKpTFBV7SFxXFen8A 52.536621 13.337402 \x138b405961
4 3 +++aXv3XJOA9l2UK/2UNtThjsvmKpTFBV7SFxXFen8A 52.536621 13.381348 \x138b403ba248015161bb62ed01

--> The first user in our dataset has 5 aggregated locations. As a test case, we will visualize these points on a map.

In [19]:
userday_locations = gp.GeoDataFrame(
    df, geometry=gp.points_from_xy(df.longitude, df.latitude), crs=CRS_WGS)
userday_locations.to_crs(CRS_PROJ, inplace=True)

Get worldmap

In [20]:
world = tools.get_shapes("world", shape_dir=OUTPUT / "shapes")
Already exists
In [21]:
world.to_crs(CRS_PROJ, inplace=True)
In [22]:
OBSERVATION_COLOR = "red"

Create legend manually

In [23]:
def plot_map(
    gdf: gp.GeoDataFrame, x_lim: [Tuple[float, float]] = None, y_lim: [Tuple[float, float]] = None,
    world: gp.GeoDataFrame = None, observation_color: str = OBSERVATION_COLOR, fig_size: Tuple[int, int] = None,
    return_ax: bool = None):
    """Plot a point map with legend and optional range-shape"""
    if fig_size is None:
        fig_size = (6, 9)
    fig, ax = plt.subplots(1, 1, figsize=fig_size)
    gdf.plot(
        ax=ax,
        markersize=.1,
        facecolor=observation_color,
        edgecolor=None,
        alpha=0.9)
    if world is not None:
        world.plot(
            ax=ax, color='none', edgecolor='black', linewidth=0.3)
    if not None in [x_lim, y_lim]:
        ax.set_xlim(*x_lim)
        ax.set_ylim(*y_lim)
    ax.axis('off')
    fig.tight_layout()
    if return_ax:
        return fig, ax
    fig.show()

plot

In [24]:
plot_map(
    userday_locations, world=world)
No description has been provided for this image

This user was obviously only active in Germany. Let's zoom in to Europe:

In [25]:
bbox_europe = -25, 35, 35, 59
# convert to Mollweide
minx, miny = PROJ_TRANSFORMER.transform(
    bbox_europe[0], bbox_europe[1])
maxx, maxy = PROJ_TRANSFORMER.transform(
    bbox_europe[2], bbox_europe[3])
In [26]:
plot_map(
    userday_locations, world=world, x_lim=(minx, maxx), y_lim=(miny, maxy))
No description has been provided for this image

Let's see if we can find a user with more userday locations, by looping through every user in the huge CSV, one at a time, until we find one with more than 500 userdays.

In [27]:
for i, user_df in enumerate(user_generator):
    user_id = user_df['user_guid'].iloc[0]
    print(f"Processing user {i+1}: {user_id} with {len(user_df)} locations...")
    # For demonstration, let's just process the first 100 users (max) and stop
    if i >= 100 or (len(user_df) >= 500):
        break
Processing user 1: +++r4aw85H0B5jmc9qeu8A4N0uBNk53gB3xFNgNDPm4 with 34 locations...
Processing user 2: +++zZva3gvRgQ2QVLIYnozx2mYqg3IiDEO0O8J5/pbI with 40 locations...
Processing user 3: ++/edOEVTiQxDBPYJkejpdh/ipIqjgdmWFAwL+66bj4 with 130 locations...
Processing user 4: ++0+/uX8aZy+DROJ78UqLtcTmoaC+2FzbAp6iL3liEY with 11 locations...
Processing user 5: ++0qcyrmUB3ASb/ev2f4XJ+4BmLxNrDMRhDldM3okS0 with 17 locations...
Processing user 6: ++1Ga88skVXUmFmP+ZBs+w+/lE0p2qylbtBkNbDGhRc with 37 locations...
Processing user 7: ++1VTzwt27rtSCfXIh+wMY2HE7NIm8Ct9O27W5wnAJ0 with 11 locations...
Processing user 8: ++2L7Gr5PEc0uxD11qGgvuuOUGoIgXhjWstEGEHDiTY with 4 locations...
Processing user 9: ++2Wq22jZ7QJDb+xzK2Hf+LS1gGmNOPFqHmgVVfdSUs with 18 locations...
Processing user 10: ++37vFPDtG9DcydhyDgfnp6DpN8ptAofwLcFji7RU2Q with 6 locations...
Processing user 11: ++3B7KySc5rGlY31x0ISpnU5taKAuzDdS1Yt1AKn00E with 1523 locations...
In [28]:
userday_locations = gp.GeoDataFrame(
    user_df, geometry=gp.points_from_xy(user_df.longitude, user_df.latitude), crs=CRS_WGS)
userday_locations.to_crs(CRS_PROJ, inplace=True)
In [29]:
plot_map(
    userday_locations, world=world)
No description has been provided for this image

This user was obviously active a lot in UK and Japan. We'll have to aggregate userdays to find out which country has more userdays. For aggregating userdays, we can use HLL union to sum the estimate number of distinct userdays per country. Let's look at the world geometry, to see the aggregation base:

In [30]:
world.head()
Out[30]:
geometry
country
Fiji MULTIPOLYGON (((17601618.097 -1976619.151, 175...
United Republic of Tanzania POLYGON ((3397635.186 -117460.561, 3414491.441...
Western Sahara POLYGON ((-805662.249 3368332.366, -805924.947...
Canada MULTIPOLYGON (((-9464830.276 5768352.35, -9474...
United States of America MULTIPOLYGON (((-9464830.276 5768352.35, -9246...

Connect to HLL Calculation Backend

While HLL operations can be performed in Python, they are significantly slower than optimized database implementations. To speed up cardinality estimation, we connect to a local PostgreSQL instance with the citus-hll extension installed. This database acts as a high-performance "calculation backend" for two key operations:

  1. hll_union_agg(): To merge all of a user's HLL objects for a given country.
  2. hll_cardinality(): To estimate the number of distinct days from the final unioned HLL.
In [31]:
db_user = "postgres"
db_pass = os.getenv('POSTGRES_PASSWORD')
db_host = "hlldb"
db_port = "5432"
db_name = "hlldb"
In [32]:
db_connection_hll = psycopg2.connect(
        host=db_host,
        port=db_port,
        dbname=db_name,
        user=db_user,
        password=db_pass
)
db_conn_hll = tools.DbConn(db_connection_hll)
cur_hll = db_connection_hll.cursor()
cur_hll.execute("SELECT 1;")
print(cur_hll.statusmessage)
SELECT 1

Simplify query access:

In [33]:
db_conn = tools.DbConn(db_connection_hll)
db_conn.query("SELECT 1;")
Out[33]:
?column?
0 1

Determine Home Location by Country Aggregation

For our selected user, we will now determine their most likely home country. The core logic is to spatially join the user's location points with country polygons, and then aggregate the userday_hll data for each country to find which one contains the most activity.

In [34]:
joined_gdf = gp.sjoin(
    userday_locations,
    world, how="inner", predicate='within')
In [35]:
joined_gdf.head()
Out[35]:
origin_id user_guid latitude longitude userday_hll geometry country
313 2 ++3B7KySc5rGlY31x0ISpnU5taKAuzDdS1Yt1AKn00E -45.417480 167.717285 \x138b406f638ba2 POINT (13484242.215 -5385474.353) New Zealand
314 2 ++3B7KySc5rGlY31x0ISpnU5taKAuzDdS1Yt1AKn00E -45.329590 168.728027 \x138b4008666f63 POINT (13578758.327 -5375962.631) New Zealand
315 2 ++3B7KySc5rGlY31x0ISpnU5taKAuzDdS1Yt1AKn00E -45.153809 168.771973 \x138b406f63 POINT (13608723.338 -5356922.74) New Zealand
316 2 ++3B7KySc5rGlY31x0ISpnU5taKAuzDdS1Yt1AKn00E -45.109863 167.937012 \x138b408ba2 POINT (13547953.859 -5352159.35) New Zealand
317 2 ++3B7KySc5rGlY31x0ISpnU5taKAuzDdS1Yt1AKn00E -45.065918 168.552246 \x138b4056a4 POINT (13604160.008 -5347394.596) New Zealand

Group by country and collect HLLs. This creates a Series where the index is the country and the value is a list of HLL strings.

In [36]:
country_hll_groups = joined_gdf.groupby('country')['userday_hll'].apply(list)
In [37]:
country_hll_groups[:2]
Out[37]:
country
Austria    [\x138b403d21, \x138b40e481, \x138b4037028941,...
Canada     [\x138b40ad62, \x138b406842ad62, \x138b40eaa1,...
Name: userday_hll, dtype: object

Union all HLLs for each country

In [38]:
unioned_hlls_per_country = country_hll_groups.apply(
    lambda hll_list: union_hll(hll_list, cur_hll)
)
In [39]:
unioned_hlls_per_country[:2]
Out[39]:
country
Austria    \x138b40020102a4066306c20723090109c10f6210c217...
Canada     \x138b400f2131a13e81522268426d23a901ad62ae61b5...
Name: userday_hll, dtype: object

Calculate the cardinality estimate of each resulting unioned HLL

In [40]:
userdays_per_country = unioned_hlls_per_country.apply(
    lambda final_hll: cardinality_hll(final_hll, cur_hll)
)
In [41]:
userdays_per_country[:2]
Out[41]:
country
Austria    72
Canada     12
Name: userday_hll, dtype: int64

Let's display the results on a choropleth world map. A critical first step is to dissolve the country geometries. The original world shapefile contains multiple polygons for countries with islands or exclaves (e.g., UK, France). Dissolving merges these into a single geometry per country, ensuring a clean spatial join and correct visualization. There is also a buffer operation applied (world.buffer(0.0001)), this is a fix to prevent tiny overlapping shapes see my explanation here).

In [42]:
world['geometry'] = world.buffer(0.0001)
world_dissolved = world.dissolve(by='country')
In [43]:
userdays_per_country.rename("estimated_userdays", inplace=True)
Out[43]:
country
Austria            72
Canada             12
China               3
Cuba               15
Czechia            14
Denmark             4
France             18
Germany            12
Hungary             5
India              15
Italy              41
Malaysia            5
New Zealand        20
Portugal            9
South Africa       15
Spain              21
Switzerland        51
Turkey             13
United Kingdom    918
Name: estimated_userdays, dtype: int64

Create map

In [44]:
map_data = world_dissolved.join(userdays_per_country)
map_data["estimated_userdays"] = map_data["estimated_userdays"].fillna(0)
display(map_data[map_data['estimated_userdays'] > 0])
geometry estimated_userdays
country
Austria POLYGON ((1322649.394 5675570.066, 1322649.394... 72.0
Canada MULTIPOLYGON (((-5052148.683 5559638.422, -505... 12.0
China MULTIPOLYGON (((10629598.599 2235514.57, 10629... 3.0
Cuba POLYGON ((-7826697.878 2837014.777, -7826697.8... 15.0
Czechia POLYGON ((1125436.794 5988867.241, 1125436.794... 14.0
Denmark MULTIPOLYGON (((859965.608 6497626.273, 859965... 4.0
France MULTIPOLYGON (((5336545.596 -5728729.523, 5336... 18.0
Germany POLYGON ((1018674.63 6261050.121, 1018674.63 6... 12.0
Hungary POLYGON ((1714056.797 5707262.82, 1714056.797 ... 5.0
India POLYGON ((9017301.318 3439700.743, 9017301.318... 15.0
Italy MULTIPOLYGON (((1338964.919 4590742.897, 13389... 41.0
Malaysia MULTIPOLYGON (((11795457.284 511416.598, 11795... 5.0
New Zealand MULTIPOLYGON (((14063357.618 -5125369.621, 141... 20.0
Portugal POLYGON ((-753746.126 4998498.047, -753746.126... 9.0
South Africa POLYGON ((1511555.534 -3476803.772, 1511555.53... 15.0
Spain POLYGON ((-649206.283 4462492.961, -649206.283... 21.0
Switzerland POLYGON ((752798.109 5611883.754, 752798.109 5... 51.0
Turkey MULTIPOLYGON (((3897283.57 4470735.777, 389728... 13.0
United Kingdom MULTIPOLYGON (((-4539591.183 -6065808.364, -45... 918.0

We can already see that UK is likely this user's home location, with 918 total userdays of activity.

In [45]:
def choropleth_map(df: pd.DataFrame):
    fig, ax = plt.subplots(1, 1, figsize=(15, 10))

    df.plot(
        column="estimated_userdays",
        cmap="Reds",
        linewidth=0.5,
        ax=ax,
        edgecolor="0.6",
        legend=True,
        legend_kwds={
            "label": "Estimated Number of User-Days",
            "orientation": "horizontal",
            "shrink": 0.6,
            "pad": 0.01
        },
        missing_kwds={
            "color": "lightgrey",
            "label": "No Activity"
        }
    )
    
    user_id = user_df['user_guid'].iloc[0]
    ax.set_title(f"User-Day Activity per Country for User {i}", fontsize=16)
    ax.set_axis_off()
    
    plt.show()
In [46]:
choropleth_map(map_data)
No description has been provided for this image

We can extract this by sorting by userday estimage in descending order

In [47]:
map_data_sorted = map_data.sort_values(by="estimated_userdays", ascending=False)
print(f"\n=> The most likely home country for this user is: {map_data_sorted.index[0]}")
=> The most likely home country for this user is: United Kingdom

Batch Processing the Entire Dataset

Now that the logic for a single user is established, we will apply it to the entire dataset. The following script is designed to be robust and resumable:

  • It reads an output CSV to identify and skip users that have already been processed.
  • It streams users one by one to keep memory usage low.
  • It appends each new result to the CSV immediately, ensuring progress is saved continuously.

This allows the job to be stopped and restarted at any time without losing work.

In [48]:
def determine_home_country(
    user_df: pd.DataFrame,
    world_gdf: gp.GeoDataFrame,
    db_cursor: psycopg2.extensions.cursor
) -> Optional[str]:
    """
    Processes a single user's DataFrame to determine their most likely home country.

    Args:
        user_df: The DataFrame containing all locations for a single user.
        world_gdf: The pre-dissolved world GeoDataFrame.
        db_cursor: An active psycopg2 database cursor.

    Returns:
        The name of the most likely home country as a string, or None if no locations match.
    """
    # 1. Convert user's posts to a GeoDataFrame
    user_locations_proj = gp.GeoDataFrame(
        user_df,
        geometry=gp.points_from_xy(user_df.longitude, user_df.latitude),
        crs=CRS_WGS
    ).to_crs(world_gdf.crs)

    # 2. Spatial Join
    joined_gdf = gp.sjoin(user_locations_proj, world_gdf, how="inner", predicate='within')
    
    # If user has no locations within any country polygons, return None
    if joined_gdf.empty:
        return None

    # 3. Group by country and collect HLLs
    country_hll_groups = joined_gdf.groupby('country')['userday_hll'].apply(list)
    
    # 4. Calculate total user-days per country
    userdays_per_country = country_hll_groups.apply(
        lambda hlls: union_hll(hlls, db_cursor)
    )
    
    # 5. Determine home country by finding the max
    if userdays_per_country.empty:
        return None
        
    return userdays_per_country.idxmax()

Reset/load base data

In [49]:
OUTPUT_CSV = OUTPUT / "user_home_locations.csv"

world = tools.get_shapes("world", shape_dir=OUTPUT / "shapes")
world.to_crs(CRS_PROJ, inplace=True)
world['geometry'] = world.buffer(0.0001)
world_dissolved = world.dissolve(by='country')
Already exists

Optionally resume from previous run

In [50]:
def check_output():
    if not OUTPUT_CSV.exists():
        return
    df_processed = pd.read_csv(OUTPUT_CSV)
    processed_users = set(df_processed['user_guid'])
    print(f"Found {len(processed_users)} users in existing output file. These will be skipped.")

processed_users = set()
try:
    check_output()
except pd.errors.EmptyDataError:
    print("Output file exists but is empty. Starting from scratch.")
Found 582852 users in existing output file. These will be skipped.

The final loop. This opens the output file in append mode. The with statement ensures it's properly closed.

In [51]:
if OUTPUT_CSV.exists():
    print("Skipping existing output file. Remove it to process again")
else: 
    with open(OUTPUT_CSV, 'a', newline='', encoding='utf-8') as f:
        # Create a CSV writer object
        writer = csv.writer(f)
        # Write the header only if the file is new (i.e., was just created)
        if not processed_users:
            writer.writerow(['user_guid', 'country'])
        # Create the user data streamer
        user_generator = stream_user_locations(DE_USERDAYS)
        print("\nStarting user processing...")
        # Loop through every user in the huge input file
        for i, user_df in enumerate(user_generator):
            user_id = user_df['user_guid'].iloc[0]
            if user_id in processed_users:
                # Skip to the next user
                continue
            home_country = determine_home_country(
                user_df, world_dissolved, cur_hll)
            if home_country:
                writer.writerow([user_id, home_country])
            if (i + 1) % 100 == 0:
                clear_output(wait=True)
                display(f"Processed {i + 1} users...")

print(f"Processing complete. Results are in {OUTPUT_CSV}")
Skipping existing output file. Remove it to process again
Processing complete. Results are in /home/jovyan/work/digital_traces_map/out/user_home_locations.csv

Running the Batch Process as a Standalone Script

For a long-running task like this, it is recommended to execute the core logic as a Python script within a persistent terminal session. This ensures the process continues to run even if the connection to JupyterLab is lost or the user logs out of JupyterLab.

Using the Jupytext Paired Script

This notebook is configured with the Jupytext extension to automatically sync its content to a paired Python file at ../py/_01_clustering.py. The logic of pairing is configured in ../jupytext.toml. To ensure the script is runnable, exploratory cells intended only for interactive use in Jupyter have been tagged with active-ipynb. Jupytext is set up to exclude these tagged cells from the final Python script, leaving only the necessary code for the batch process.

Execution Steps

To run the script, connect to the Carto-Lab instance via shell and use a terminal multiplexer like byobu to create a detachable session.

# 1. Start a new persistent session
byobu

# 2. Open a shell inside the running jupyterlab container
docker compose exec jupyterlab /bin/bash

# 3. Navigate to the directory containing the paired Python script
cd /home/jovyan/work/digital_traces_map/py

# 4. Activate the correct Conda environment
conda activate worker_env

# 5. Run the script.
python _01_clustering.py

Once the script is running, you can safely disconnect from the byobu session by pressing F6. To check on the process later, reconnect to the server and run byobu to re-attach to your session.

Verify the output CSV

Once the script is finished, check the CSV again with:

In [52]:
check_output()
Found 582852 users in existing output file. These will be skipped.

These ~600k users exclude the 90% of users who were active in DE that did not have enough userday activity. Attempting to infer a home location for the 5.6 million casual users would have produced extremely noisy, unreliable, and often incorrect results is a benefit here. By filtering them out, we ensure that when our script assigns a home country to a user, it is based on a substantial amount of evidence. This significantly increases the confidence and validity of our final ~600k results.

Classify Tourists and Locals

The final step in this workflow is to classify users as either "Locals" or "Tourists" based on their inferred home location. The classification rule is straightforward: if a user's home is determined to be Germany, they are classified as a Local; otherwise, they are a Tourist.

This simple binary classification is a deliberate choice. While more nuanced methods (e.g., based on spatial distance or fine-grained clustering) may be applicable in other contexts, this robust, high-level scheme is the most appropriate starting point given the coarse, country-level granularity of our home location inference.

The code infrastructure established in this and the preceding notebook provides a solid foundation for future extensions, such as implementing a more detailed classification based on distance measures or sub-national clustering.

In [53]:
INPUT_ORIGIN_CSV = OUTPUT_CSV
OUTPUT_CLASSIFICATION_CSV = OUTPUT / "user_classification.csv"

Load the CSV

In [54]:
df_homes = pd.read_csv(
    INPUT_ORIGIN_CSV,
    header=0,        # Explicitly state the first row (index 0) is the header.
    skiprows=[1]     # Skip the second row (index 1), which is a duplicate header accidentally added in our above script.
)
In [55]:
df_homes.head()
Out[55]:
user_guid country
0 +++r4aw85H0B5jmc9qeu8A4N0uBNk53gB3xFNgNDPm4 Czechia
1 +++zZva3gvRgQ2QVLIYnozx2mYqg3IiDEO0O8J5/pbI Russia
2 ++/edOEVTiQxDBPYJkejpdh/ipIqjgdmWFAwL+66bj4 Australia
3 ++0+/uX8aZy+DROJ78UqLtcTmoaC+2FzbAp6iL3liEY United Kingdom
4 ++0qcyrmUB3ASb/ev2f4XJ+4BmLxNrDMRhDldM3okS0 Belgium

Default classification as Tourist, then select the subset with Germany as origin and classify as Local.

In [56]:
df_homes['classification'] = 'Tourist'
df_homes.loc[
    df_homes['country'] == 'Germany',
    'classification'] = 'Local'

Write to new output file.

In [57]:
usecols = ["user_guid", "classification"]
df_homes.to_csv(OUTPUT_CLASSIFICATION_CSV, index=False, columns=usecols)

We can also visualize the distribution of Locals vs. Tourists in our dataset.

In [58]:
classification_counts = df_homes['classification'].value_counts()

print("Distribution of Users:")
print(classification_counts)

ratio_value = classification_counts.iloc[0] / classification_counts.iloc[1]
larger_group_name = classification_counts.index[0]
smaller_group_name = classification_counts.index[1]

print(f"The ratio of {larger_group_name}s to {smaller_group_name}s is {ratio_value:.1f} to 1.")
Distribution of Users:
classification
Tourist    409177
Local      173674
Name: count, dtype: int64
The ratio of Tourists to Locals is 2.4 to 1.

This is what other publications reported as well, a 1:2 to 1:3 distribution of Locals vs. Tourists in Geo-Social Media data.

In [59]:
plt.figure(figsize=(9, 4))

sns.set_theme(style="white")
plt.rcParams['font.family'] = 'serif'
plt.rcParams['font.serif'] = ['Times New Roman'] + plt.rcParams['font.serif']

plt.rcParams.update({
    "axes.labelsize": 14,
    "axes.titlesize": 16,
    "xtick.labelsize": 12,
    "ytick.labelsize": 12,
    "legend.fontsize": 12
})

order = df_homes['classification'].value_counts().sort_values().index

ax = sns.countplot(
    x='classification',
    data=df_homes,
    order=order,
    hue='classification',
    palette=sns.color_palette("vlag", n_colors=2)[::-1],  # reversed colors
    legend=False
)

ax.yaxis.set_major_formatter(
    mticker.FuncFormatter(lambda x, pos: f'{x:,.0f}'))

fig = ax.get_figure()
# ax.set_title(
#    'Distribution of Locals vs. Tourists in Germany Dataset',
#    fontsize=16)
ax.set_xlabel('User Classification', fontsize=12)
ax.set_ylabel('Number of Users', fontsize=12)

for p in ax.patches:
    ax.annotate(
        f'{p.get_height():,.0f}',
        (p.get_x() + p.get_width() / 2., p.get_height()),
        ha='center', va='center',
        xytext=(0, 9), textcoords='offset points',
        fontsize=16, fontweight='bold')
sns.despine()
plt.show()
No description has been provided for this image
In [60]:
tools.save_fig(fig, output=OUTPUT, name="barplot_locals_tourists_de")

Create notebook HTML

In [ ]:
!jupyter nbconvert --to html_toc \
    --output-dir=../resources/html/ ./01_clustering.ipynb \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False >&- 2>&-
•••

IOER FDZ Jupyter Base Template v0.13.0