PLZ Visualization: Geosocial Data for DE

Alexander Dunkel, Leibniz Institute of Ecological Urban and Regional Development,
Transformative Capacities & Research Data Centre (IÖR-FDZ)

Summarize Geosocial Media data for PLZ (postal code) areas in Germany.

Base data:

  • 50 Million Geosocial Media posts in HLL format
  • sorted by user-id, with the number of posts per user-id in ascending order
  • ~9000 PLZ areas (shapes)
Out[1]:

Last updated: Jun-12-2025, Carto-Lab Docker Version 0.28.0

Preparations

Install additional dependencies not in worker_env

List of package versions used in this notebook:

python 3.13.3, bokeh 3.7.2, colorcet 3.1.0, fiona 1.10.1, geopandas 1.0.1,
geoviews 1.14.0, holoviews 1.20.2, hvplot 0.11.3, ipywidgets 8.1.7,
mapclassify 2.8.1, matplotlib 3.10.1, matplotlib-venn 1.1.2, numpy 2.2.5,
owslib 0.34.0, pandas 2.2.3, python-dotenv 1.1.0, shapely 2.1.0, xarray 2025.4.0

Load dependencies:

In [3]:
import sys
import mapclassify as mc
import geopandas as gp
import pandas as pd
import numpy as np
import hdbscan
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import textwrap
from pathlib import Path
from datetime import datetime

Activate autoreload of changed python files:

In [4]:
%load_ext autoreload
%autoreload 2

Load helper module.

In [5]:
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules.base import tools, preparations

Initialize Bokeh and shapely.speedups

In [6]:
preparations.init_imports()

Parameters

Define initial parameters that affect processing

In [7]:
WORK_DIR = Path.cwd().parents[0] / "tmp"     # Working directory
EPSG_CODE = 25832                            # Target projection EPSG Code
CRS_PROJ = f"epsg:{EPSG_CODE}"               # Target projection
CRS_WGS = "epsg:4326"                        # Input projection (WGS84)
OUTPUT = Path.cwd().parents[0] / "out"       # Define path to output directory (figures etc.)
SHAPE_DIR = OUTPUT / "shapes"                # Shapefiles
In [8]:
(OUTPUT / "figures").mkdir(exist_ok=True)
(OUTPUT / "svg").mkdir(exist_ok=True)
WORK_DIR.mkdir(exist_ok=True)

today = datetime.today().strftime('%Y-%m-%d')

Suppress warnings (turn this off during development):

In [9]:
warnings.filterwarnings("ignore")
warnings.filterwarnings('ignore', category=DeprecationWarning)
In [10]:
tools.create_paths(OUTPUT)

Set global scientific font

In [11]:
plt.rcParams['font.family'] = 'serif'
plt.rcParams['font.serif'] = ['Times New Roman'] + plt.rcParams['font.serif']

PLZ shapes

This file was prepared in the previous notebook.

In [12]:
plz_shapes = OUTPUT / "plz-5stellig.shp"

Load from file and plot preview

In [13]:
shapes = gp.read_file(plz_shapes, encoding='utf-8')
In [14]:
shapes.columns
Out[14]:
Index(['plz', 'note', 'einwohner', 'qkm', 'geometry'], dtype='object')
In [15]:
ax = shapes.plot()
ax.set_axis_off()
len(shapes)
Out[15]:
8170
[Figure: preview plot of the PLZ shapes]

Preview PLZ Geosocial Counts

Load CSV from the previous notebook.

In [16]:
df_plz = pd.read_csv(OUTPUT / "2025-06-11_PLZ_DE_Geosocial.csv", dtype={"PLZ": str})

Join to shapes based on PLZ:

In [17]:
shapes.set_index("plz", inplace=True)
df_plz.set_index("PLZ", inplace=True)

Convert both indices to string and strip spaces

In [18]:
shapes.index = shapes.index.astype(str).str.strip()
df_plz.index = df_plz.index.astype(str).str.strip()
In [19]:
df_plz.head()
Out[19]:
postcount usercount
PLZ
10179 3638747 286920
21706 1157988 58055
10117 1097835 367060
80636 1026438 83436
10178 1015720 389794
In [20]:
plz_counts_df = shapes.merge(
    df_plz[["postcount", "usercount"]],
    left_index=True, right_index=True, how="left")

Convert the counts from float to int:

In [21]:
plz_counts_df['usercount'] = plz_counts_df['usercount'].fillna(0).astype(int)
plz_counts_df['postcount'] = plz_counts_df['postcount'].fillna(0).astype(int)
In [22]:
plz_counts_df.head()
plz_counts_df.loc["10179"]
Out[22]:
note                                        10179 Berlin Mitte
einwohner                                                18664
qkm                                                   2.175681
geometry     POLYGON ((13.4016439 52.511369, 13.4021162 52....
postcount                                              3638747
usercount                                               286920
Name: 10179, dtype: object

Visualize

First, project to UTM zone 32N (EPSG:25832), suitable for Germany.

In [23]:
plz_counts_df.to_crs(f"EPSG:{EPSG_CODE}", inplace=True)  # reuse the EPSG code defined above

Get Shapefile for Germany

In [24]:
source_zip = "https://daten.gdz.bkg.bund.de/produkte/vg/vg2500/aktuell/"
filename = "vg2500_12-31.utm32s.shape.zip"
shapes_name = "vg2500_12-31.utm32s.shape/vg2500/VG2500_LAN.shp"
In [25]:
SHAPE_DIR.mkdir(exist_ok=True)  # SHAPE_DIR was defined above as OUTPUT / "shapes"

if not (SHAPE_DIR / shapes_name).exists():
    tools.get_zip_extract(uri=source_zip, filename=filename, output_path=SHAPE_DIR)
else:
    print("Already exists")
Already exists
In [26]:
shapes = gp.read_file(SHAPE_DIR / shapes_name)
shapes = shapes.to_crs(f"EPSG:{EPSG_CODE}")

Create overlay

In [27]:
def plot_map(ax, title, caption=None):
    leg = ax.get_legend()
    leg.set_bbox_to_anchor((0., 0., 0.0, 0.3))
    leg.get_frame().set_edgecolor('none')  # remove edge line

    ax.set_title(title)
    ax.set_axis_off()
    if caption is None:
        return
    wrapped_caption = textwrap.fill(caption, width=80)
    plt.figtext(0.5, 0.15, wrapped_caption, ha='center', va='top', fontsize=9)

title = "PLZ Geosocial Media Usercount\nQuantiles \n (Instagram, Flickr, Twitter, iNaturalist)"
ax = plz_counts_df.plot(
    column='usercount', cmap='OrRd', scheme='Quantiles', figsize=(5, 8), legend=True)
ax = shapes.plot(ax=ax, color='none', edgecolor='white', linewidth=1.5)

plot_map(ax, title)
[Figure: choropleth map of PLZ usercount, Quantiles classification]

A different classification scheme for improved interpretability:

In [28]:
title = "PLZ Geosocial Media Usercount \nNatural Breaks\n (Instagram, Flickr, Twitter, iNaturalist)"

ax = plz_counts_df.plot(
    column='usercount', cmap='OrRd', scheme='NaturalBreaks', k=5, figsize=(5, 8), legend=True)
ax = shapes.plot(ax=ax, color='none', edgecolor='white', linewidth=1.5)

plot_map(ax, title)
[Figure: choropleth map of PLZ usercount, Natural Breaks classification]
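
To see why the two schemes read differently, their class breaks can be compared directly; a minimal sketch (an ad-hoc check, not part of the pipeline):

counts = plz_counts_df['usercount']
print("Quantiles:    ", mc.Quantiles(counts, k=5).bins)
print("NaturalBreaks:", mc.NaturalBreaks(counts, k=5).bins)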

Active users per population ratio

  1. Replace zero or missing population with np.nan to avoid division errors
  2. Divide usercount by einwohner (population counts), per PLZ area
In [29]:
plz_counts_df['einwohner'] = plz_counts_df['einwohner'].replace(0, np.nan)
plz_counts_df["user_pop_ratio"] = plz_counts_df["usercount"] / plz_counts_df["einwohner"]

Let's look at the distribution of values

In [30]:
classifier = mc.Quantiles(plz_counts_df['user_pop_ratio'], k=5)
print("Class breaks:", classifier.bins)
Class breaks: [0.02939472 0.04825401 0.07621684 0.14366433        nan]

The last class break is NaN because the ratio column contains missing values (PLZ without population data). Apart from that, the values are very skewed, which is expected for ratios derived from small denominators (populations).
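
As a quick check (outside the main pipeline), the class breaks can be recomputed on the non-missing values only:

valid = plz_counts_df['user_pop_ratio'].dropna()
print("Class breaks:", mc.Quantiles(valid, k=5).bins)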

We can create a more balanced map by applying a log transformation first and then classifying the log-transformed data.
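
For intuition, a tiny numeric sketch (illustrative values only) of how log1p compresses the skewed upper range:

import numpy as np  # imported above

ratios = np.array([0.0, 0.03, 0.08, 0.14, 1.0, 5.0])
print(np.log1p(ratios).round(3))  # [0.    0.03  0.077 0.131 0.693 1.792]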

  1. Apply log1p to handle zeroes safely
  2. Classify and plot the log-transformed data
In [31]:
plz_counts_df['log_user_pop_ratio'] = np.log1p(plz_counts_df['user_pop_ratio'])
cluster_input = plz_counts_df.dropna(subset=['log_user_pop_ratio'])

ax = cluster_input.plot(
    column='log_user_pop_ratio', cmap='OrRd', scheme='NaturalBreaks', k=5, figsize=(5, 8), legend=True)
ax = shapes.plot(ax=ax, color='none', edgecolor='white', linewidth=1.5)

plot_map(ax, "Log-scaled Active Geosocial media Users\nper Population Ratio\n(PLZ based evaluation)")
[Figure: log-scaled active geosocial media users per population ratio]

The red areas have an unusually high level of social media activity compared to population density. Areas such as Saxon Switzerland, a popular recreational area near Dresden, stand out. Other areas include the Alpine foothills, the areas surrounding Berlin, the Baltic coast and Freiburg. All of these areas are particularly popular tourist destinations in Germany. This map therefore confirms our expectations, thereby increasing trust in the data.

Areas where the population is high but social media user activity is low

Let's look at the inverse mapping of user concentration relative to population density, to highlight underrepresented or "locally dense but socially quiet" areas.

In [32]:
plz_counts_df["pop_user_ratio"] = plz_counts_df["einwohner"] / (plz_counts_df["usercount"] + .1)
plz_counts_df['log_pop_user_ratio'] = np.log1p(plz_counts_df['pop_user_ratio'])
cluster_input = plz_counts_df.dropna(subset=['log_pop_user_ratio'])

ax = cluster_input.plot(
    column='log_pop_user_ratio', cmap='YlGnBu', scheme='NaturalBreaks', k=5, figsize=(5, 8), legend=True)
ax = shapes.plot(ax=ax, color='none', edgecolor='white', linewidth=1.5)

plot_map(ax, "Local Dominance: High Population / Low Geosocial Media Activity")
[Figure: local dominance, high population / low geosocial media activity]

Account for area bias

We also need to account for area bias. Large PLZs often cover rural or suburban regions and can artificially appear to have low user density or high local dominance simply because their population and user counts are spread over a larger area.

Calculate area.

In [33]:
plz_counts_df['area_m2'] = plz_counts_df.geometry.area
plz_counts_df['area_km2'] = plz_counts_df['area_m2'] / 1e6

Compute densities.

In [34]:
plz_counts_df['pop_density'] = \
    plz_counts_df['einwohner'] / plz_counts_df['area_km2']
plz_counts_df['user_density'] = \
    plz_counts_df['usercount'] / plz_counts_df['area_km2']

Update the ratio to a density-based metric, density_ratio = pop_density / (user_density + 0.1); the small constant again avoids division by zero.

In [35]:
plz_counts_df['density_ratio'] = \
    plz_counts_df['pop_density'] / (plz_counts_df['user_density']+.1)
plz_counts_df['log_density_ratio'] = \
    np.log1p(plz_counts_df['density_ratio'])
In [36]:
cluster_input = plz_counts_df.dropna(subset=['log_density_ratio'])

ax = cluster_input.plot(
    column='log_density_ratio', cmap='YlGnBu', scheme='NaturalBreaks', k=5, figsize=(5, 8), legend=True)
ax = shapes.plot(ax=ax, color='none', edgecolor='white', linewidth=1.5)

plot_map(ax, "Local Dominance: High Population / Low Geosocial Media Activity \n(area normalized)")
[Figure: local dominance, area normalized]

Export map to SVG and PNG

In [37]:
fig = ax.figure
tools.save_fig(fig, output=OUTPUT, name=f"{today}_log_density_ratio_map")
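
tools.save_fig is a project helper. A minimal stand-in, under the assumption that it writes a PNG and an SVG into the output subfolders created earlier (the function name and exact behavior here are assumptions):

def save_fig_fallback(fig, output, name):
    # assumed behavior: PNG to out/figures, SVG to out/svg
    fig.savefig(output / "figures" / f"{name}.png", dpi=300, bbox_inches="tight")
    fig.savefig(output / "svg" / f"{name}.svg", bbox_inches="tight")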

Extract PLZ selection ranking

List the top PLZ with more than 5000 population and more than 5000 active geosocial media users.

In [38]:
cluster_input.head()
Out[38]:
note einwohner qkm geometry postcount usercount user_pop_ratio log_user_pop_ratio pop_user_ratio log_pop_user_ratio area_m2 area_km2 pop_density user_density density_ratio log_density_ratio
plz
64743 Situation unklar, evtl. haben die Häuser Marba... 3.0 0.082066 POLYGON ((498644.759 5495004.341, 498780.627 5... 1 1 0.333333 0.287682 2.727273 1.315677 8.206624e+04 0.082066 36.555835 12.185278 2.975581 1.380171
81248 81248 München 121.0 1.984763 POLYGON ((678117.549 5335444.089, 678132.152 5... 223 76 0.628099 0.487413 1.590013 0.951663 1.984763e+06 1.984763 60.964473 38.291735 1.587958 0.950869
99331 99331 Geratal 4523.0 20.207080 POLYGON ((626529.596 5617413.746, 626538.563 5... 355 142 0.031395 0.030912 31.829697 3.491334 2.020708e+07 20.207080 223.832440 7.027240 31.405206 3.478319
98694 98694 Ilmenau 7028.0 80.375011 MULTIPOLYGON (((629113.368 5610039.263, 629154... 988 293 0.041690 0.040845 23.978164 3.218002 8.037501e+07 80.375011 87.440112 3.645412 23.345929 3.192365
39628 39628 Bismark 3147.0 125.135930 POLYGON ((674584.962 5829715.322, 674590.038 5... 253 132 0.041945 0.041089 23.822861 3.211765 1.251359e+08 125.135930 25.148652 1.054853 21.776498 3.125729
In [39]:
top_plz = (
    cluster_input[
        (cluster_input['einwohner'] >= 5000) &
        (cluster_input['usercount'] >= 5000) &
        (cluster_input['area_km2'] <= 50)  # adjust threshold
    ]
    .sort_values(by='log_density_ratio', ascending=False)
)
print(f"Identified {len(top_plz)} PLZ with more than 5000 population and more than 5000 active geosocial media users")

print(top_plz.assign(plz=top_plz.index).head(10)[
    ['plz', 'einwohner', 'usercount', 'pop_user_ratio', 'log_pop_user_ratio', 'log_density_ratio']]
    .to_string(index=False, float_format='{:,.2f}'.format))
Identified 423 PLZ with more than 5000 population and more than 5000 active geosocial media users
  plz  einwohner  usercount  pop_user_ratio  log_pop_user_ratio  log_density_ratio
50226  49,670.00       5221            9.51                2.35               2.35
68723  41,978.00       5282            7.95                2.19               2.19
71229  44,608.00       6032            7.40                2.13               2.13
09599  39,861.00       6855            5.81                1.92               1.92
51103  36,171.00       6227            5.81                1.92               1.92
01309  32,890.00       5672            5.80                1.92               1.92
60439  35,056.00       6052            5.79                1.92               1.92
70771  36,404.00       6503            5.60                1.89               1.89
81737  28,379.00       5393            5.26                1.83               1.83
60385  33,049.00       6361            5.20                1.82               1.82

What we did:

  • Calculated ratios per PLZ between population and geosocial media activity
  • Normalized ratios by PLZ area, to avoid area bias for large PLZ
  • Found areas where the local population is large relative to geosocial media user activity
  • Applied minimum thresholds (5000 population and 5000 users), to ensure statistical robustness and avoid noisy small-sample PLZ
  • Applied a maximum threshold of 50 km² for PLZ selection, to avoid overselecting a few large-area PLZ with sparse user densities
  • Sorted by log_density_ratio in descending order, to prioritize areas with the highest local dominance

Visualize these 423 PLZ

In [40]:
ax = top_plz.plot(
    column='log_density_ratio', cmap='YlGnBu', scheme='NaturalBreaks', k=5, figsize=(5, 8), legend=True)
ax = shapes.plot(ax=ax, color='none', edgecolor='white', linewidth=1.5)

plot_map(
    ax=ax,
    title="Local Dominance: High Population / Low Geosocial Media Activity\n" \
          "(but more than 5000 population and\nmore than 5000 active geosocial media users)",
    caption = (
        "The map highlights postal code areas (PLZ) where the local population is significantly "
        "overrepresented compared to the number of active geosocial media users. "
        "These areas are likely less touristic or less digitally active in terms of geosocial platforms. "
        "Only PLZs with more than 5,000 residents and more than 5,000 active users were considered to "
        "ensure statistical relevance. In addition, very large and sparsely populated PLZ where excluded "
        "by a maximum 50 km² area threshold."))
[Figure: local dominance for the selected PLZ subset]
In [41]:
fig = ax.figure
tools.save_fig(fig, output=OUTPUT, name=f"{today}_log_density_ratio_PLZ_selection")

Identify clusters

We just want to see the distribution and identify major clusters. One approach is shown in João Paulo Figueira's Geographic Clustering with HDBSCAN, which I used in a previous notebook.

The critical parameter below is min_cluster_size: lower values result in more, but more closely located clusters; larger values aggregate more, but may produce clusters with far-apart islands.
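
The effect of min_cluster_size can be checked on synthetic points; a toy sketch (illustrative data, not the project pipeline):

import numpy as np
import hdbscan

rng = np.random.default_rng(42)
# three synthetic blobs of 20 points each
pts = np.vstack([rng.normal(loc, 0.05, (20, 2)) for loc in (0, 1, 2)])
for size in (2, 5, 15):
    labels = hdbscan.HDBSCAN(min_cluster_size=size).fit_predict(pts)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"min_cluster_size={size}: {n_clusters} clusters")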

In [42]:
top_plz_centroids_wgs84 = top_plz.geometry.centroid.to_crs(epsg=4326)  # centroids in the projected CRS, then reprojected to WGS84

lon = top_plz_centroids_wgs84.x
lat = top_plz_centroids_wgs84.y

data_for_hdbscan = np.array(list(zip(lat, lon)))

X_radians = np.radians(data_for_hdbscan)

min_pts_in_cluster = 2
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=min_pts_in_cluster,
    metric='haversine',
    allow_single_cluster=True
)
clusterer.fit(X_radians)

top_plz['cluster_label'] = clusterer.labels_

num_clusters = len(set(clusterer.labels_)) - (1 if -1 in clusterer.labels_ else 0)
num_noise = (clusterer.labels_ == -1).sum()
print(f"HDBSCAN found {num_clusters} clusters and {num_noise} noise points.")

unique_labels = sorted(list(set(clusterer.labels_)))


palette = sns.color_palette('viridis', num_clusters)
cluster_color_map = {
    label: palette[i] for i, label in enumerate(
        l for l in unique_labels if l != -1)}
cluster_color_map[-1] = (0.7, 0.7, 0.7, 1)

fig, ax = plt.subplots(1, 1, figsize=(8, 12))

top_plz.plot(
    ax=ax,
    column='cluster_label',
    categorical=True,
    legend=True,
    legend_kwds={'title': "Cluster ID", 'loc': 'lower left', 'bbox_to_anchor': (0.0, 0.0)},
    cmap=plt.cm.colors.ListedColormap([cluster_color_map[label] for label in unique_labels]),
    missing_kwds={
        "color": "lightgrey",
        "edgecolor": "red",
        "hatch": "///",
        "label": "No Cluster (Noise)"
    }
)

shapes.plot(ax=ax, color='none', edgecolor='black', linewidth=0.5, zorder=0)

plot_map(
    ax=ax,
    title=f"Geographic Clusters (HDBSCAN) of Selected PLZ Areas\n({num_clusters} clusters, min_size={min_pts_in_cluster})",
)

plt.tight_layout()
plt.show()
HDBSCAN found 42 clusters and 40 noise points.
[Figure: HDBSCAN cluster map of the selected PLZ areas]
In [43]:
fig = ax.figure
tools.save_fig(fig, output=OUTPUT, name=f"{today}_cluster_map")

Convert to Convex Hulls/Alpha Shapes

Ideally, we want to extract contiguous areas of PLZ with no gaps or isolated areas, including PLZ that are not directly identified as clusters, provided they are in close proximity to clusters. We can achieve this using a Convex Hull/Alpha Shapes approach.
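
The underlying geometric operation is straightforward; a minimal shapely sketch with toy coordinates:

from shapely.geometry import MultiPoint

hull = MultiPoint([(0, 0), (2, 0), (1, 2)]).convex_hull  # triangle around the points
expanded = hull.buffer(0.5)                              # grow outward by 0.5 units
print(round(hull.area, 2), round(expanded.area, 2))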

In [44]:
BUFFER_DISTANCE_METERS = 1000

all_plz_shapes = gp.read_file(plz_shapes, encoding='utf-8')

if top_plz.crs != all_plz_shapes.crs:
    all_plz_shapes = all_plz_shapes.to_crs(top_plz.crs)

expanded_cluster_plzs_list = []
unique_cluster_ids = sorted([cid for cid in top_plz['cluster_label'].unique() if cid != -1])

for cid in unique_cluster_ids:
    cluster_geometries = top_plz[top_plz['cluster_label'] == cid].geometry
    if cluster_geometries.empty:
        continue
    cluster_union = cluster_geometries.union_all()  # unary_union is deprecated in geopandas 1.0
    cluster_hull = cluster_union.convex_hull
    buffered_hull = cluster_hull.buffer(BUFFER_DISTANCE_METERS)
    hull_gdf = gp.GeoDataFrame(
        [{'geometry': buffered_hull, 'cluster_id_hull': cid}], crs=top_plz.crs)
    plzs_in_buffered_hull = gp.sjoin(
        all_plz_shapes, hull_gdf, how="inner", predicate="intersects")

    if not plzs_in_buffered_hull.empty:
        selected_plzs_with_id = plzs_in_buffered_hull[['plz', 'geometry']].copy()
        selected_plzs_with_id['expanded_cluster_id'] = cid
        expanded_cluster_plzs_list.append(selected_plzs_with_id)

all_expanded_plzs = pd.concat(expanded_cluster_plzs_list)
final_expanded_clusters = all_expanded_plzs.dissolve(
    by='expanded_cluster_id', aggfunc='first')
final_expanded_clusters = final_expanded_clusters.reset_index()

print(f"Created {len(final_expanded_clusters)} expanded cluster areas.")
Created 42 expanded cluster areas.
In [45]:
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
final_expanded_clusters.plot(
    ax=ax, column='expanded_cluster_id', categorical=True, 
    legend=True, alpha=0.7, legend_kwds={'ncol': 2, 'title': 'Expanded Cluster ID'})
top_plz[top_plz['cluster_label'] != -1].plot(
    ax=ax, column='cluster_label', categorical=True, legend=False,
    edgecolor='black', linewidth=0.5, alpha=0.5)
shapes.plot(ax=ax, color='none', edgecolor='black', linewidth=0.2, zorder=0)
ax.set_axis_off()
ax.set_title("Geographic Clusters (HDBSCAN) of Selected PLZ Areas as Convex Hulls/Alpha Shapes")
leg = ax.get_legend()
leg.set_bbox_to_anchor((0., 0., 0.0, 0.6))
leg.get_frame().set_edgecolor('none')
plt.show()
[Figure: expanded cluster areas as buffered convex hulls]

Merge back to original PLZ shapes, so we can export this.

In [46]:
if final_expanded_clusters.crs != all_plz_shapes.crs:
    all_plz_shapes = all_plz_shapes.to_crs(final_expanded_clusters.crs)
In [47]:
plz_in_expanded_clusters = gp.sjoin(
    all_plz_shapes, 
    final_expanded_clusters[['expanded_cluster_id', 'geometry']], 
    how="inner", 
    predicate="intersects")
In [48]:
if 'index_right' in plz_in_expanded_clusters.columns:
    plz_in_expanded_clusters = plz_in_expanded_clusters.drop(columns=['index_right'])
In [49]:
len(plz_in_expanded_clusters)
Out[49]:
3043

Export PLZ selection

Export to GeoPackage, Shapefile, and CSV (geometry dropped). Remove irrelevant columns first.

In [50]:
top_plz.columns
Out[50]:
Index(['note', 'einwohner', 'qkm', 'geometry', 'postcount', 'usercount',
       'user_pop_ratio', 'log_user_pop_ratio', 'pop_user_ratio',
       'log_pop_user_ratio', 'area_m2', 'area_km2', 'pop_density',
       'user_density', 'density_ratio', 'log_density_ratio', 'cluster_label'],
      dtype='object')
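
tools.drop_cols_except is another project helper. A plausible stand-in, with name and signature taken from the call below (the in-place behavior is an assumption):

def drop_cols_except(df, columns_keep):
    # assumed behavior: drop all columns not listed in columns_keep, in place
    df.drop(columns=df.columns.difference(columns_keep), inplace=True)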
In [51]:
tools.drop_cols_except(
    df=top_plz, 
    columns_keep=[
        "geometry", "einwohner", "plz", "qkm", "cluster_label",
        "note", "usercount", "postcount", "log_density_ratio"])
top_plz['einwohner'] = top_plz['einwohner'].astype('Int64')
In [52]:
top_plz.to_file(
    OUTPUT / f"{today}_top_plz.gpkg", layer="top_plz", driver="GPKG")
top_plz.drop(columns='geometry').to_csv(
    OUTPUT / f"{today}_top_plz.csv", index=True, encoding='utf-8')

# Note: the ESRI Shapefile format truncates column names to 10 characters
top_plz.to_file(
    OUTPUT / f"{today}_top_plz.shp", driver="ESRI Shapefile")

Do the same for final_expanded_clusters

In [53]:
plz_in_expanded_clusters.to_file(
    OUTPUT / f"{today}_expanded_clusters.gpkg", layer="expanded_clusters", driver="GPKG")
plz_in_expanded_clusters.drop(columns='geometry').to_csv(
    OUTPUT / f"{today}_expanded_clusters.csv", index=True, encoding='utf-8')

plz_in_expanded_clusters.to_file(
    OUTPUT / f"{today}_expanded_clusters.shp", driver="ESRI Shapefile")

Create notebook HTML

In [45]:
!jupyter nbconvert --to html_toc \
    --output-dir=../resources/html/ ./07_plz_visualization.ipynb \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False >&- 2>&-

IOER RDC Jupyter Base Template v0.10.0