PLZ Intersection Geosocial Data for DE

Alexander Dunkel, Leibniz Institute of Ecological Urban and Regional Development,
Transformative Capacities & Research Data Centre (IÖR-FDZ)

No description has been provided for this image
Summarize Geosocial Media for PLZ areas in Germany.

Base data:

  • 50 Million Geosocial Media posts in HLL format
  • sorted based on user-id, number of posts per user-id ascending
  • ~9000 PLZ areas (shapes)
•••
Out[1]:

Last updated: Jun-11-2025, Carto-Lab Docker Version 0.28.0

Preparations

Install additional dependencies not in worker_env

In [2]:
import sys
pyexec = sys.executable
print(f"Current Kernel {pyexec}")
!../py/modules/base/pkginstall.sh "{pyexec}" owslib
Current Kernel /opt/conda/envs/worker_env/bin/python3.13
owslib already installed (version 0.34.0).
•••
List of package versions used in this notebook
package python bokeh colorcet fiona geopandas geoviews holoviews hvplot ipywidgets mapclassify
version 3.13.3 3.7.2 3.1.0 1.10.1 1.0.1 1.14.0 1.20.2 0.11.3 8.1.7 2.8.1
package matplotlib matplotlib-venn numpy owslib pandas python-dotenv shapely xarray
version 3.10.1 1.1.2 2.2.5 0.34.0 2.2.3 1.1.0 2.1.0 2025.4.0

Load dependencies:

In [4]:
import os
import csv
import sys
import colorcet
import psycopg2 # Postgres API
import geoviews as gv
import holoviews as hv
import mapclassify as mc
import geopandas as gp
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import geoviews.feature as gf
import requests
from owslib.wfs import WebFeatureService
from pathlib import Path
from collections import namedtuple
from pathlib import Path
from typing import List, Tuple, Dict, Optional
from pyproj import Transformer, CRS, Proj
from geoviews import opts
from shapely.geometry import shape, Point, Polygon
from shapely.ops import transform
from holoviews import dim
from cartopy import crs
from matplotlib import colors
from IPython.display import clear_output, display, HTML, Markdown, Code
from bokeh.models import HoverTool, CustomJSTickFormatter, FixedTicker
# optionally, enable shapely.speedups 
# which makes some of the spatial 
# queries running faster
import shapely.speedups as speedups
from shapely import geometry

Activate autoreload of changed python files:

In [5]:
%load_ext autoreload
%autoreload 2

Load helper module.

In [6]:
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules.base import tools, preparations

Initialize Bokeh and shapely.speedups

In [7]:
preparations.init_imports()
No description has been provided for this image No description has been provided for this image

Parameters

Define initial parameters that affect processing

In [8]:
WORK_DIR = Path.cwd().parents[0] / "tmp"     # Working directory
EPSG_CODE = 25832                            # Target projection EPSG Code
CRS_PROJ = f"epsg:{EPSG_CODE}"               # Target projection
CRS_WGS = "epsg:4326"                        # Input projection (Web Mercator)
OUTPUT = Path.cwd().parents[0] / "out"       # Define path to output directory (figures etc.)
SHAPE_DIR = OUTPUT / "shapes"                # Shapefiles
In [9]:
WORK_DIR.mkdir(exist_ok=True)
In [10]:
tools.create_paths(OUTPUT)

PLZ shapes

The BKG says that access to the PLZ dataset is restricted to Bundesbehörden: https://gdz.bkg.bund.de/index.php/default/postleitzahlgebiete-deutschland-plz.html

However, there is an open data available at suche-postleitzahl.org

We will query the Shapefile and write all shapes to a local geopackage.

In [11]:
plz_url = "https://downloads.suche-postleitzahl.org/v2/public/plz-5stellig.shp.zip"
In [12]:
tools.get_zip_extract(
        output_path=OUTPUT,
        uri_filename=plz_url)
File already exists.. skipping download..
In [13]:
plz_shapes = OUTPUT / "plz-5stellig.shp"

Load from file and plot preview

In [14]:
shapes = gp.read_file(plz_shapes)
In [15]:
shapes.columns
Out[15]:
Index(['plz', 'note', 'einwohner', 'qkm', 'geometry'], dtype='object')
In [16]:
ax = shapes.plot()
ax.set_axis_off()
No description has been provided for this image

Select all larger Cities

In [17]:
tools.drop_cols_except(df=shapes, columns_keep=["geometry", "einwohner", "plz", "qkm"])
In [18]:
shapes.head()
Out[18]:
plz einwohner qkm geometry
0 64743 3 0.082066 POLYGON ((8.98124 49.60761, 8.98312 49.60748, ...
1 81248 121 1.984763 POLYGON ((11.39468 48.14729, 11.3949 48.1478, ...
2 60315 0 0.017285 POLYGON ((8.67254 50.11264, 8.67258 50.11265, ...
3 99331 4523 20.207080 POLYGON ((10.79153 50.69477, 10.79178 50.69819...
4 60312 0 0.001829 POLYGON ((8.67262 50.11164, 8.67311 50.11182, ...

Calculate Einwohner/sqkm

In [19]:
shapes["esqkm"] = shapes["einwohner"] / shapes["qkm"]
In [20]:
shapes["esqkm"].sort_values(ascending=False)
Out[20]:
3898    26645.677088
7623    26152.091224
3902    23057.635920
2198    20466.515536
7587    20157.728988
            ...     
1047        0.000000
5405        0.000000
2           0.000000
4           0.000000
17          0.000000
Name: esqkm, Length: 8170, dtype: float64

Project to UTM32N

In [21]:
shapes.to_crs(epsg=25832, inplace=True)
In [22]:
lcities_shapes = shapes[shapes["esqkm"]>5000]
lcities_shapes.set_index("plz", inplace=True)
In [23]:
lcities_shapes.head()
Out[23]:
einwohner qkm geometry esqkm
plz
47051 13928 2.389929 POLYGON ((343786.58 5700477.48, 344031.86 5700... 5827.788190
97072 16840 1.598911 POLYGON ((566734.126 5515149.246, 566805.786 5... 10532.168457
97070 12029 2.211778 POLYGON ((566453.837 5516679.614, 566671.471 5... 5438.610927
22089 19734 1.775989 POLYGON ((568309.587 5935739.741, 568337.606 5... 11111.555308
40219 10449 0.825272 POLYGON ((342706.854 5675946.134, 342743.25 56... 12661.280160

Load Bundesländer shapes, for plotting references.

In [24]:
de_shapes = tools.get_shapes("de", shape_dir=SHAPE_DIR)
de_shapes = de_shapes.to_crs(f"EPSG:{EPSG_CODE}")
Already exists

Plot preview in two layers, select all larger cities.

In [25]:
plt_kwags = {
    "color":'none',
    "edgecolor":'black',
    "linewidth":0.1,
    "figsize":(2, 4),
}
ax = de_shapes.plot(**plt_kwags)
ax = lcities_shapes.plot(ax=ax, color="red")
ax.set_axis_off()
ax.set_title("PLZ with more than 5000 Einwohner/ square kilometre")
Out[25]:
Text(0.5, 1.0, 'PLZ with more than 5000 Einwohner/ square kilometre')
No description has been provided for this image
In [26]:
shapes.crs
Out[26]:
<Projected CRS: EPSG:25832>
Name: ETRS89 / UTM zone 32N
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: Europe between 6°E and 12°E: Austria; Denmark - onshore and offshore; Germany - onshore and offshore; Italy - onshore and offshore; Norway including Svalbard - onshore and offshore; Spain - offshore.
- bounds: (6.0, 36.53, 12.01, 84.01)
Coordinate Operation:
- name: UTM zone 32N
- method: Transverse Mercator
Datum: European Terrestrial Reference System 1989 ensemble
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

Intersect with Social Media Data

Select a single PLZ from our previous dataset

In [27]:
berlin_shp = lcities_shapes[lcities_shapes.index=="97072"]
In [28]:
ax = berlin_shp.plot(**plt_kwags)
ax.set_axis_off()
No description has been provided for this image

Simplify shape with a 1000 m tolerance.

In [29]:
berlin_shp = berlin_shp.simplify(tolerance=50)
In [30]:
ax = berlin_shp.plot(**plt_kwags)
ax.set_axis_off()
No description has been provided for this image

Get the Well-Known Text (WKT) from the shape, for using this in the PostGis query. Note that we project the shape back to WGS1984, as this is what PostGis expects for the respective column projection.

In [31]:
berlin_wkt = berlin_shp.to_crs(CRS_WGS).iloc[0].wkt
In [32]:
berlin_wkt[:50]
Out[32]:
'POLYGON ((9.927049399999998 49.7850944, 9.9372423 '

Intersect posts per PLZ with HLL Union

Connect to DB

In [33]:
from dotenv import load_dotenv
load_dotenv(
    Path.cwd().parents[0] / '.env', override=True)
Out[33]:
False
In [34]:
db_user = "postgres"
db_pass = os.getenv('POSTGRES_PASSWORD')
db_host = "hlldb"
db_port = "5432"
db_name = "hlldb"
In [35]:
db_connection_hll = psycopg2.connect(
        host=db_host,
        port=db_port,
        dbname=db_name,
        user=db_user,
        password=db_pass
)
db_conn_hll = tools.DbConn(db_connection_hll)
cur_hll = db_connection_hll.cursor()
cur_hll.execute("SELECT 1;")
db_conn = tools.DbConn(db_connection_hll)
print(cur_hll.statusmessage)
SELECT 1
In [36]:
MVIEWS_REF = "all_posts_de"

Check if foreign table has been imported already:

In [37]:
sql_query = f"""
SELECT EXISTS (
   SELECT FROM information_schema.tables 
   WHERE  table_schema = 'mviews'
   AND    table_name   = '{MVIEWS_REF}'
   );
"""
result = db_conn_hll.query(sql_query)
In [38]:
result["exists"][0]
Out[38]:
np.True_
db_connection_hll.rollback()

Postgres data work

PLZ Shapes insertion

To select all posts that have a post_latlng within a PLZ area, we need to insert PLZ areas first.

Optional cleanup:

In [40]:
sql_query = f"""
DROP TABLE IF EXISTS mviews.plz_intersect;
"""
cur_hll.execute(sql_query)
In [41]:
%%time
sql_query = f"""
CREATE TABLE IF NOT EXISTS mviews.plz_intersect (
    id SERIAL PRIMARY KEY,
    geom geometry(MultiPolygon, 4326) NOT NULL,
    plz text
    );
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
CREATE TABLE
CPU times: user 1.01 ms, sys: 114 μs, total: 1.13 ms
Wall time: 15.1 ms

Store PLZ as WKT:

  • first project to WGS1984
  • then convert geometry using geopandas .wkt function
In [42]:
from shapely.geometry import Polygon, MultiPolygon
shapes["geometry_wkt"] = shapes["geometry"].to_crs(epsg=4326).apply(
    lambda geom: geom.wkt if isinstance(geom, (Polygon, MultiPolygon)) else None
)
In [43]:
shapes.head()
Out[43]:
plz einwohner qkm geometry esqkm geometry_wkt
0 64743 3 0.082066 POLYGON ((498644.759 5495004.341, 498780.627 5... 36.555943 POLYGON ((8.981242 49.60760729999999, 8.983122...
1 81248 121 1.984763 POLYGON ((678117.549 5335444.089, 678132.152 5... 60.964458 POLYGON ((11.394678300000002 48.1472871, 11.39...
2 60315 0 0.017285 POLYGON ((476586.556 5551205.983, 476589.828 5... 0.000000 POLYGON ((8.6725366 50.11263989999999, 8.67258...
3 99331 4523 20.207080 POLYGON ((626529.596 5617413.746, 626538.563 5... 223.832439 POLYGON ((10.7915258 50.694768199999984, 10.79...
4 60312 0 0.001829 POLYGON ((476591.824 5551094.749, 476627.388 5... 0.000000 POLYGON ((8.6726171 50.11163969999999, 8.67311...

Create PLZ PostGIS Insert SQL

  • we want to insert PLZ shapes in PostGIS to speed up the intersction
  • we need to generate the PLZ insert statement
In [44]:
psql_values = []
srid_val = 4326
for ix, r in shapes.iterrows():
    geom_str = str(r["geometry_wkt"]).replace("'", "''")
    id_str = str(ix+1)
    plz_str = str(r["plz"]).replace("'", "''")
    psql_values.append(
        f"('SRID={srid_val};{geom_str}', '{id_str}', '{plz_str}')"
    )
In [45]:
psql_values[:1]
Out[45]:
["('SRID=4326;POLYGON ((8.981242 49.60760729999999, 8.9831226 49.607481599999986, 8.9839902 49.60715969999999, 8.9849306 49.6069623, 8.9857386 49.60706100000001, 8.987645900000002 49.60765759999999, 8.9902738 49.60808429999999, 8.9897078 49.607830699999994, 8.9893726 49.607674, 8.9892666 49.60749119999999, 8.9893634 49.6073804, 8.989273 49.60736619999999, 8.9888291 49.607296599999984, 8.9886432 49.60723809999999, 8.9883976 49.6070561, 8.988382900000001 49.6069986, 8.9885279 49.6069038, 8.9887896 49.6066448, 8.9887932 49.6065208, 8.9886254 49.6064056, 8.9884644 49.60633569999999, 8.9881945 49.606218599999984, 8.9879488 49.606137600000004, 8.9877379 49.606041600000005, 8.9874601 49.60593899999999, 8.9872689 49.605948000000005, 8.987180700000001 49.6058684, 8.9870694 49.6058154, 8.9868486 49.6057656, 8.9866614 49.605761199999996, 8.9862462 49.605642399999994, 8.9859084 49.60554520000001, 8.9853936 49.60546889999999, 8.9852962 49.605471800000004, 8.985121 49.6056152, 8.9850337 49.605632499999984, 8.9847728 49.6056233, 8.9846206 49.60563609999999, 8.9843123 49.6057247, 8.9841345 49.60576489999999, 8.9839234 49.6058125, 8.983551200000003 49.605976999999996, 8.9832349 49.6060953, 8.9830209 49.60623809999999, 8.9827014 49.60639489999999, 8.982366 49.6065717, 8.9821344 49.60667750000001, 8.9820461 49.606734499999995, 8.9819435 49.6069, 8.9817045 49.6071332, 8.9813973 49.6073025, 8.981242 49.60760729999999))', '1', '64743')"]
In [46]:
sql_query = f"""
INSERT INTO mviews.plz_intersect (geom, id, plz)\n
VALUES\n{',\n'.join(psql_values)};
"""
In [50]:
Code(sql_query[:5000], language='sql')
Out[50]:
INSERT INTO mviews.plz_intersect (geom, id, plz)

VALUES
('SRID=4326;POLYGON ((8.981242 49.60760729999999, 8.9831226 49.607481599999986, 8.9839902 49.60715969999999, 8.9849306 49.6069623, 8.9857386 49.60706100000001, 8.987645900000002 49.60765759999999, 8.9902738 49.60808429999999, 8.9897078 49.607830699999994, 8.9893726 49.607674, 8.9892666 49.60749119999999, 8.9893634 49.6073804, 8.989273 49.60736619999999, 8.9888291 49.607296599999984, 8.9886432 49.60723809999999, 8.9883976 49.6070561, 8.988382900000001 49.6069986, 8.9885279 49.6069038, 8.9887896 49.6066448, 8.9887932 49.6065208, 8.9886254 49.6064056, 8.9884644 49.60633569999999, 8.9881945 49.606218599999984, 8.9879488 49.606137600000004, 8.9877379 49.606041600000005, 8.9874601 49.60593899999999, 8.9872689 49.605948000000005, 8.987180700000001 49.6058684, 8.9870694 49.6058154, 8.9868486 49.6057656, 8.9866614 49.605761199999996, 8.9862462 49.605642399999994, 8.9859084 49.60554520000001, 8.9853936 49.60546889999999, 8.9852962 49.605471800000004, 8.985121 49.6056152, 8.9850337 49.605632499999984, 8.9847728 49.6056233, 8.9846206 49.60563609999999, 8.9843123 49.6057247, 8.9841345 49.60576489999999, 8.9839234 49.6058125, 8.983551200000003 49.605976999999996, 8.9832349 49.6060953, 8.9830209 49.60623809999999, 8.9827014 49.60639489999999, 8.982366 49.6065717, 8.9821344 49.60667750000001, 8.9820461 49.606734499999995, 8.9819435 49.6069, 8.9817045 49.6071332, 8.9813973 49.6073025, 8.981242 49.60760729999999))', '1', '64743'),
('SRID=4326;POLYGON ((11.394678300000002 48.1472871, 11.394898299999998 48.1477954, 11.3949251 48.1483573, 11.395251899999998 48.14959039999999, 11.3954508 48.1502721, 11.3959122 48.151149, 11.396405699999999 48.1520008, 11.3971889 48.1529098, 11.3983331 48.15404590000001, 11.4007079 48.156245199999994, 11.4021456 48.15762649999999, 11.4032495 48.1586135, 11.4052606 48.1604154, 11.4065501 48.15678140000001, 11.407023600000002 48.1566893, 11.407318299999998 48.15663609999999, 11.4078071 48.15655340000001, 11.4082949 48.15648079999999, 11.409023499999998 48.15638849999999, 11.4096578 48.15631460000001, 11.409747299999998 48.156304199999994, 11.409647800000002 48.1562481, 11.4096284 48.1562146, 11.409633300000001 48.1561622, 11.4097001 48.15608170000001, 11.409767799999997 48.1560365, 11.4098627 48.155996699999996, 11.409935400000002 48.155971599999994, 11.410056099999998 48.15594639999999, 11.4105684 48.1558723, 11.4107632 48.1558446, 11.4109473 48.1558122, 11.4110492 48.155786299999995, 11.4111416 48.15575189999999, 11.411212299999999 48.1556931, 11.411251699999998 48.1556203, 11.4112499 48.15556479999999, 11.411235900000001 48.155502299999995, 11.4111941 48.15538089999999, 11.411009 48.1550083, 11.410815199999998 48.1546384, 11.410721 48.15444939999999, 11.4106595 48.1543167, 11.4105222 48.153991500000004, 11.4104028 48.15370829999999, 11.4102724 48.15347999999999, 11.4101902 48.1533417, 11.4100479 48.15311079999999, 11.4100052 48.15306189999999, 11.409945499999997 48.15299079999999, 11.4098717 48.152919100000005, 11.4098572 48.15290209999999, 11.4097638 48.1527935, 11.4097128 48.15269119999999, 11.4096976 48.1525976, 11.409692499999998 48.152513899999995, 11.4092912 48.15243160000001, 11.4092163 48.152418000000004, 11.4088783 48.152356399999995, 11.4087817 48.1523305, 11.4087506 48.1523001, 11.408731599999998 48.15225419999999, 11.4087432 48.1521922, 11.4090396 48.151277799999995, 11.4094051 48.15009619999999, 11.4094201 48.14996970000001, 11.4094197 48.1498532, 11.409400699999999 48.149699899999995, 11.4093732 48.1495581, 11.409288 48.14907579999999, 11.409315599999998 48.1490763, 11.4096411 48.148544799999996, 11.409742 48.148144, 11.4098576 48.1476477, 11.410153999999999 48.1469563, 11.4102982 48.1466259, 11.4103223 48.14638289999999, 11.4105088 48.1461167, 11.410437299999998 48.14607039999999, 11.4103874 48.146018999999995, 11.4103445 48.1459668, 11.410329999999998 48.145946699999996, 11.410312500000002 48.1459073, 11.4102935 48.14586549999999, 11.410280799999999 48.14578, 11.410291500000001 48.14568919999999, 11.4103225 48.14558559999999, 11.4103905 48.1455922, 11.410422799999997 48.145496299999984, 11.4109737 48.14357609999999, 11.4115328 48.141574399999996, 11.4116783 48.14113340000001, 11.4041009 48.14013169999999, 11.4021671 48.1398722, 11.4015073 48.139741599999994, 11.401011 48.13961630000001, 11.400193 48.139379999999996, 11.398997399999999 48.13905969999999, 11.398798199999998 48.13908819999999, 11.398664099999998 48.13922959999999, 11.3983718 48.13965559999999, 11.398154499999999 48.139841799999985, 11.3977575 48.14002790000001, 11.397199599999999 48.14007449999999, 11.3967329 48.14025349999999, 11.3965505 48.14055419999999, 11.3956869 48.14229739999999, 11.3952362 48.14355029999999, 11.39507 48.14441289999999, 11.395037799999999 48.145085900000005, 11.395075300000002 48.1459234, 11.395000200000002 48.1463816, 11.3948446 48.14654269999999, 11.394678300000002 48.1472871))', '2', '81248'),
('SRID=4326;POLYGON ((8.6725366 50.1126
In [48]:
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
INSERT 0 8170

PostGIS SQL Query DB: Performance tweaks

  • Create a spatial index, to speed up intersection
  • Also, allow the foreign data wrapper to use remote indexes
In [7]:
sql_query = f"""
CREATE INDEX idx_plz_intersect_geom ON mviews.plz_intersect USING GIST (geom);
ANALYZE mviews.plz_intersect;
ALTER SERVER lbsnraw OPTIONS (SET use_remote_estimate 'on');
ALTER FOREIGN TABLE mviews.all_posts_de OPTIONS (ADD use_remote_estimate 'true');
"""
In [ ]:
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)

We have now inserted 8170 PLZ shapes into our PostGIS table.

Store to DB:

In [49]:
db_connection_hll.commit()

Run the Query

The last thing to do is the actual HLL aggregate SQL. Since this is a long running query, you may want to directly connect to the postgres backend and issue the command below with a terminal multiplexer (e.g. byobu).

E.g.:

byobu
docker exec -it lbsn-hlldb /bin/bash
psql -h localhost -p 5432 -U postgres hlldb
...
[F6]

Explain analyze before:

SET work_mem = '4GB';
EXPLAIN (VERBOSE, ANALYZE, BUFFERS)
SELECT
    plz.plz,
    plz.geom,
    hll_add_agg(hll_hash_text(s.user_guid)) as user_hll,
    hll_add_agg(hll_hash_text(s.post_guid)) as post_hll
FROM (
    SELECT * FROM mviews.plz_intersect LIMIT 5
) AS plz
JOIN mviews.all_posts_de AS s -- This is your FDW
ON plz.geom && s.post_latlng AND ST_Intersects(plz.geom, s.post_latlng)
GROUP BY plz.plz, plz.geom;
See my Query Plan results
                                                                     QUERY PLAN
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=674.01..674.09 rows=5 width=11475) (actual time=927.549..927.572 rows=5 loops=1)
   Group Key: plz_intersect.plz, plz_intersect.geom
   Batches: 1  Memory Usage: 1394kB
   Buffers: shared hit=1
   ->  Nested Loop  (cost=100.44..669.02 rows=333 width=11475) (actual time=4.069..926.152 rows=1357 loops=1)
         Buffers: shared hit=1
         ->  Limit  (cost=0.00..1.52 rows=5 width=11415) (actual time=0.009..0.021 rows=5 loops=1)
               Buffers: shared hit=1
               ->  Seq Scan on plz_intersect  (cost=0.00..2482.70 rows=8170 width=11415) (actual time=0.008..0.017 rows=5 loops=1)
                     Buffers: shared hit=1
         ->  Foreign Scan on all_posts_de s  (cost=100.44..133.48 rows=1 width=120) (actual time=184.955..184.968 rows=271 loops=5)
 Planning:
   Buffers: shared hit=20 read=4 dirtied=1
 Planning Time: 289.619 ms                                                                                 
 Execution Time: 928.019 ms
(15 rows)

Summary:

  1. Time per PLZ (average from this test): ~185 ms (from the foreign scan time per loop).
  2. Total PLZs: 8170 (from Seq Scan on plz_intersect ... rows=8170).
  3. Estimated execution time for FDW part: 8170 * 0.185 seconds = ~1511 seconds = ~25 minutes.
  4. Add some time for local aggregation and overhead.

Then run the full query:

SET work_mem = '4GB';
CREATE MATERIALIZED VIEW mviews.plz_social_de AS
    SELECT
        plz.plz,
        plz.geom,
        hll_add_agg(hll_hash_text(user_guid)) as user_hll,
        hll_add_agg(hll_hash_text(post_guid)) as post_hll
    FROM mviews.plz_intersect AS plz
    JOIN mviews.all_posts_de AS s
    ON plz.geom && s.post_latlng AND ST_Intersects(plz.geom, s.post_latlng)
    GROUP BY plz.plz, plz.geom;

The optimized query took about 2 hours.

To speed up the process:

  • make sure that there is an index on the foreign data table for mviews.all_posts_de

  • since mviews.all_posts_de is a materialized view itself, an index needs to be explicitly created, even if the original parent table from which the mview was pulled already has an index

    • This is the SQL command:
    CREATE INDEX IF NOT EXISTS idx_mv_all_posts_de_post_latlng_gist
    ON mviews.all_posts_de USING GIST (post_latlng);
    ANALYZE mviews.all_posts_de;
    

  • otherwise, without the two indexes (plz_intersect, all_posts_de), PostgreSQL is doing up to 9000 × 50,000,000 = 450 billion geometry comparisons.

Query results

Query results and store as CSV, to be visualized in the next notebook.

In [39]:
db_conn_hll.query(f"SELECT * FROM mviews.plz_social_de LIMIT 10;")
Out[39]:
plz geom user_hll post_hll
0 01067 0106000020E61000000100000001030000000100000059... \x148b7f31ce943508294c7319082a0e75a0c62950831d... \x148b7f41d094990749989429074a50c49ce841d694a1...
1 01069 0106000020E61000000200000001030000000100000035... \x148b7f2990c43548314cc394e839cc632125320a629c... \x148b7f7a1294a12741cc75256842108524e94a8e849d...
2 01097 0106000020E6100000010000000103000000010000003D... \x148b7f3148938c8821486324e329864320c521483110... \x148b7f498c84ace7314c5390cb39525418a6310a6418...
3 01099 0106000020E61000000100000001030000000200000099... \x148b7f214c9218e819cc6314e8288a63208721106214... \x148b7f41cc83a507421264a8c8325473a8a839cea59c...
4 01108 0106000020E6100000010000000103000000010000001B... \x138b7f00a300e201010143034203c204c1058105e506... \x148b7f00c883086449441094c3108a5208621886010c...
5 01109 0106000020E61000000100000001030000000100000017... \x148b7f08c80090821948331042208233048511c6221c... \x148b7f29866110c5110c610d0329ca51988619c6418c...
6 01127 0106000020E6100000020000000103000000010000004A... \x148b7f10c2009063115023042120863208603044611c... \x148b7f19043214842888528963314641946430cc220c...
7 01129 0106000020E61000000100000001030000000100000085... \x148b7f10c4010440080813080100c401008100400188... \x148b7f0844110da208c4418442188c61088210420198...
8 01139 0106000020E6100000010000000103000000010000001B... \x148b7f1906109c2431022308e4108020a08500c6111c... \x148b7f20c861a4a32984310883398e11986419cc328c...
9 01156 0106000020E61000000100000001030000000100000036... \x148b7f28845384621142130882184600a04008841110... \x148b7f30c8118441188441084110ca41188510c41104...

We can see that the HLL shards are populated. To retrieve estimate numbers of user and post count, we calculate the cardinality below (2.5% estimation error).

Since some PLZ polygons have islands, we do another "group-by-and-union-hll" operation to summarize total user and post counts per PLZ.

In [61]:
sql_query = f"""
    SELECT
        s.plz as "PLZ",
        hll_cardinality(hll_union_agg(post_hll))::int AS "postcount",
        hll_cardinality(hll_union_agg(user_hll))::int AS "usercount"
    FROM mviews.plz_social_de AS s
    GROUP BY s.plz
    ORDER BY postcount DESC;
    """
result = db_conn_hll.query(sql_query)
result.head()
Out[61]:
PLZ postcount usercount
0 10179 3638747 286920
1 21706 1157988 58055
2 10117 1097835 367060
3 80636 1026438 83436
4 10178 1015720 389794

PLZ 10179 has 3,638,747 posts from 286920 users. What is this PLZ?

In [59]:
shapes = gp.read_file(plz_shapes)
In [60]:
shapes.set_index("plz", inplace=True)
shapes.loc["10179"]
Out[60]:
note                                        10179 Berlin Mitte
einwohner                                                18664
qkm                                                   2.175681
geometry     POLYGON ((13.4016439 52.511369, 13.4021162 52....
Name: 10179, dtype: object

This is one confirmation check that our spatial intersection was correct, as Berlin Centre is expected to show the most social media posts.

Write to CSV

In [63]:
%%time
usecols = ["PLZ", "postcount", "usercount"]
filename = OUTPUT / f"2025-06-11_PLZ_DE_Geosocial.csv"

result.to_csv(
    filename, columns=usecols,
    index=False, header=True)
CPU times: user 9.12 ms, sys: 0 ns, total: 9.12 ms
Wall time: 11.2 ms

Create notebook HTML

In [8]:
!jupyter nbconvert --to html_toc \
    --output-dir=../resources/html/ ./06_plz_intersection.ipynb \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False >&- 2>&-
In [ ]:
 

IOER RDC Jupyter Base Template v0.10.0