PLZ Intersection Geosocial Data for DE ¶
Alexander Dunkel, Leibniz Institute of Ecological Urban and Regional Development,
Transformative Capacities & Research Data Centre (IÖR-FDZ)
Base data:
- 50 million geosocial media posts in HyperLogLog (HLL) format
- sorted by user ID, with the number of posts per user ID ascending
- ~9000 PLZ (postal code) areas (shapes)
Preparations¶
Install additional dependencies not in worker_env
import sys
pyexec = sys.executable
print(f"Current Kernel {pyexec}")
!../py/modules/base/pkginstall.sh "{pyexec}" owslib
Load dependencies:
import os
import csv
import sys
import colorcet
import psycopg2 # Postgres API
import geoviews as gv
import holoviews as hv
import mapclassify as mc
import geopandas as gp
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import geoviews.feature as gf
import requests
from owslib.wfs import WebFeatureService
from pathlib import Path
from collections import namedtuple
from typing import List, Tuple, Dict, Optional
from pyproj import Transformer, CRS, Proj
from geoviews import opts
from shapely.geometry import shape, Point, Polygon
from shapely.ops import transform
from holoviews import dim
from cartopy import crs
from matplotlib import colors
from IPython.display import clear_output, display, HTML, Markdown, Code
from bokeh.models import HoverTool, CustomJSTickFormatter, FixedTicker
# optionally, enable shapely.speedups,
# which makes some spatial queries run faster
# (a no-op in Shapely >= 2.0, where speedups are always on)
import shapely.speedups as speedups
from shapely import geometry
Activate autoreload of changed python files:
%load_ext autoreload
%autoreload 2
Load helper module.
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
sys.path.append(module_path)
from modules.base import tools, preparations
Initialize Bokeh and shapely.speedups
preparations.init_imports()
Parameters¶
Define initial parameters that affect processing
WORK_DIR = Path.cwd().parents[0] / "tmp" # Working directory
EPSG_CODE = 25832 # Target projection EPSG Code
CRS_PROJ = f"epsg:{EPSG_CODE}" # Target projection
CRS_WGS = "epsg:4326" # Input projection (WGS 1984)
OUTPUT = Path.cwd().parents[0] / "out" # Define path to output directory (figures etc.)
SHAPE_DIR = OUTPUT / "shapes" # Shapefiles
WORK_DIR.mkdir(exist_ok=True)
tools.create_paths(OUTPUT)
PLZ shapes¶
The BKG restricts access to its PLZ dataset to Bundesbehörden (federal agencies): https://gdz.bkg.bund.de/index.php/default/postleitzahlgebiete-deutschland-plz.html
However, an open dataset is available at suche-postleitzahl.org.
We will download the Shapefile and write all shapes to a local geopackage.
plz_url = "https://downloads.suche-postleitzahl.org/v2/public/plz-5stellig.shp.zip"
tools.get_zip_extract(
output_path=OUTPUT,
uri_filename=plz_url)
plz_shapes = OUTPUT / "plz-5stellig.shp"
Load from file and plot preview
shapes = gp.read_file(plz_shapes)
shapes.columns
ax = shapes.plot()
ax.set_axis_off()
Select all larger Cities¶
tools.drop_cols_except(df=shapes, columns_keep=["geometry", "einwohner", "plz", "qkm"])
shapes.head()
Calculate Einwohner (inhabitants) per square kilometre
shapes["esqkm"] = shapes["einwohner"] / shapes["qkm"]
shapes["esqkm"].sort_values(ascending=False)
Project to UTM zone 32N (EPSG:25832)
shapes.to_crs(epsg=EPSG_CODE, inplace=True)
lcities_shapes = shapes[shapes["esqkm"]>5000]
lcities_shapes.set_index("plz", inplace=True)
lcities_shapes.head()
Load Bundesländer shapes, for plotting references.
de_shapes = tools.get_shapes("de", shape_dir=SHAPE_DIR)
de_shapes = de_shapes.to_crs(f"EPSG:{EPSG_CODE}")
Plot a preview in two layers, highlighting all larger cities.
plt_kwags = {
"color":'none',
"edgecolor":'black',
"linewidth":0.1,
"figsize":(2, 4),
}
ax = de_shapes.plot(**plt_kwags)
ax = lcities_shapes.plot(ax=ax, color="red")
ax.set_axis_off()
ax.set_title("PLZ with more than 5000 Einwohner per square kilometre")
shapes.crs
Intersect with Social Media Data¶
Select a single PLZ from the previous dataset. Note that PLZ 97072 is in Würzburg, not Berlin, so we use a neutral variable name.
plz_shp = lcities_shapes[lcities_shapes.index=="97072"]
ax = plz_shp.plot(**plt_kwags)
ax.set_axis_off()
Simplify the shape with a 50 m tolerance.
plz_shp = plz_shp.simplify(tolerance=50)
ax = plz_shp.plot(**plt_kwags)
ax.set_axis_off()
Get the Well-Known Text (WKT) from the shape, for use in the PostGIS query. Note that we project the shape back to WGS 1984, as this is what PostGIS expects for the respective column projection.
plz_wkt = plz_shp.to_crs(CRS_WGS).iloc[0].wkt
plz_wkt[:50]
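The simplify-then-WKT step can be sketched in isolation with shapely. The polygon below is hypothetical; the tolerance is in the units of the geometry's CRS (metres for a UTM-projected shape):

```python
from shapely.geometry import Polygon

# Hypothetical polygon with a redundant (collinear) vertex on one edge
poly = Polygon([(0, 0), (50, 0), (100, 0), (100, 100), (0, 100)])

# Douglas-Peucker simplification; the collinear vertex deviates by 0 m,
# so any positive tolerance removes it
simplified = poly.simplify(tolerance=10)

# The WKT string can be embedded directly in a PostGIS query
wkt = simplified.wkt
print(wkt)
```

With real data, a larger tolerance shrinks the WKT payload sent to the database at the cost of boundary accuracy.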
from dotenv import load_dotenv
load_dotenv(
Path.cwd().parents[0] / '.env', override=True)
db_user = "postgres"
db_pass = os.getenv('POSTGRES_PASSWORD')
db_host = "hlldb"
db_port = "5432"
db_name = "hlldb"
db_connection_hll = psycopg2.connect(
host=db_host,
port=db_port,
dbname=db_name,
user=db_user,
password=db_pass
)
db_conn_hll = tools.DbConn(db_connection_hll)
cur_hll = db_connection_hll.cursor()
cur_hll.execute("SELECT 1;")
db_conn = tools.DbConn(db_connection_hll)
print(cur_hll.statusmessage)
MVIEWS_REF = "all_posts_de"
Check if foreign table has been imported already:
sql_query = f"""
SELECT EXISTS (
SELECT FROM information_schema.tables
WHERE table_schema = 'mviews'
AND table_name = '{MVIEWS_REF}'
);
"""
result = db_conn_hll.query(sql_query)
result["exists"][0]
Optional cleanup:
sql_query = f"""
DROP TABLE IF EXISTS mviews.plz_intersect;
"""
cur_hll.execute(sql_query)
%%time
sql_query = f"""
CREATE TABLE IF NOT EXISTS mviews.plz_intersect (
id SERIAL PRIMARY KEY,
geom geometry(MultiPolygon, 4326) NOT NULL,
plz text
);
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
Store PLZ as WKT:
- first project to WGS 1984
- then convert each geometry using shapely's `.wkt` attribute
from shapely.geometry import Polygon, MultiPolygon
shapes["geometry_wkt"] = shapes["geometry"].to_crs(epsg=4326).apply(
lambda geom: geom.wkt if isinstance(geom, (Polygon, MultiPolygon)) else None
)
shapes.head()
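As an aside, newer geopandas versions (0.11 and later) ship a vectorized `GeoSeries.to_wkt()`, which can replace the per-row lambda above; a minimal sketch with a hypothetical triangle:

```python
import geopandas as gp
from shapely.geometry import Polygon

# One hypothetical triangle in WGS 1984
gs = gp.GeoSeries(
    [Polygon([(13.4, 52.5), (13.5, 52.5), (13.5, 52.6)])],
    crs="EPSG:4326")

# Vectorized WKT conversion (geopandas >= 0.11); equivalent to reading
# shapely's .wkt attribute per geometry
wkt_series = gs.to_wkt()
print(wkt_series.iloc[0])
```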
Create PLZ PostGIS Insert SQL
- we want to insert the PLZ shapes into PostGIS to speed up the intersection
- we need to generate the PLZ insert statement
psql_values = []
srid_val = 4326
for ix, r in shapes.iterrows():
geom_str = str(r["geometry_wkt"]).replace("'", "''")
id_str = str(ix+1)
plz_str = str(r["plz"]).replace("'", "''")
psql_values.append(
f"('SRID={srid_val};{geom_str}', '{id_str}', '{plz_str}')"
)
psql_values[:1]
values_str = ",\n".join(psql_values)
sql_query = f"""
INSERT INTO mviews.plz_intersect (geom, id, plz)
VALUES
{values_str};
"""
Code(sql_query[:5000], language='sql')
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
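As an alternative to hand-escaped VALUES strings, psycopg2 can bind the rows as parameters via `psycopg2.extras.execute_values`, which sidesteps quoting entirely. The row construction below is pure Python; the connection `conn` and the commented insert are assumptions matching the table created above:

```python
# Build (EWKT, plz) tuples; EWKT embeds the SRID in the value itself
def build_rows(records, srid=4326):
    return [(f"SRID={srid};{wkt}", plz) for wkt, plz in records]

rows = build_rows(
    [("MULTIPOLYGON (((0 0, 1 0, 1 1, 0 0)))", "10179")])

# With a live psycopg2 connection `conn`, the batched insert would be:
# from psycopg2.extras import execute_values
# with conn.cursor() as cur:
#     execute_values(
#         cur,
#         "INSERT INTO mviews.plz_intersect (geom, plz) VALUES %s",
#         rows)
# conn.commit()
```

Because the values travel as bound parameters, no manual `'` doubling is needed.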
PostGIS SQL Query DB: Performance tweaks¶
- Create a spatial index to speed up the intersection
- Also, allow the foreign data wrapper to use remote indexes
sql_query = f"""
CREATE INDEX idx_plz_intersect_geom ON mviews.plz_intersect USING GIST (geom);
ANALYZE mviews.plz_intersect;
ALTER SERVER lbsnraw OPTIONS (SET use_remote_estimate 'on');
ALTER FOREIGN TABLE mviews.all_posts_de OPTIONS (ADD use_remote_estimate 'true');
"""
cur_hll.execute(sql_query)
print(cur_hll.statusmessage)
We have now inserted 8170 PLZ shapes into our PostGIS table.
Store to DB:
db_connection_hll.commit()
Run the Query¶
The last thing to do is the actual HLL aggregate SQL. Since this is a long-running query, you may want to connect directly to the Postgres backend and issue the command below in a terminal multiplexer (e.g. byobu).
E.g.:
byobu
docker exec -it lbsn-hlldb /bin/bash
psql -h localhost -p 5432 -U postgres hlldb
...
[F6]
Run EXPLAIN ANALYZE on a small sample (LIMIT 5) first:
SET work_mem = '4GB';
EXPLAIN (VERBOSE, ANALYZE, BUFFERS)
SELECT
plz.plz,
plz.geom,
hll_add_agg(hll_hash_text(s.user_guid)) as user_hll,
hll_add_agg(hll_hash_text(s.post_guid)) as post_hll
FROM (
SELECT * FROM mviews.plz_intersect LIMIT 5
) AS plz
JOIN mviews.all_posts_de AS s -- This is your FDW
ON plz.geom && s.post_latlng AND ST_Intersects(plz.geom, s.post_latlng)
GROUP BY plz.plz, plz.geom;
See my Query Plan results
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=674.01..674.09 rows=5 width=11475) (actual time=927.549..927.572 rows=5 loops=1)
Group Key: plz_intersect.plz, plz_intersect.geom
Batches: 1 Memory Usage: 1394kB
Buffers: shared hit=1
-> Nested Loop (cost=100.44..669.02 rows=333 width=11475) (actual time=4.069..926.152 rows=1357 loops=1)
Buffers: shared hit=1
-> Limit (cost=0.00..1.52 rows=5 width=11415) (actual time=0.009..0.021 rows=5 loops=1)
Buffers: shared hit=1
-> Seq Scan on plz_intersect (cost=0.00..2482.70 rows=8170 width=11415) (actual time=0.008..0.017 rows=5 loops=1)
Buffers: shared hit=1
-> Foreign Scan on all_posts_de s (cost=100.44..133.48 rows=1 width=120) (actual time=184.955..184.968 rows=271 loops=5)
Planning:
Buffers: shared hit=20 read=4 dirtied=1
Planning Time: 289.619 ms
Execution Time: 928.019 ms
(15 rows)
Summary:
- Time per PLZ (average from this test): ~185 ms (from the foreign scan time per loop).
- Total PLZs: 8170 (from Seq Scan on plz_intersect ... rows=8170).
- Estimated execution time for FDW part: 8170 * 0.185 seconds = ~1511 seconds = ~25 minutes.
- Add some time for local aggregation and overhead.
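The estimate above reduces to a few lines of arithmetic:

```python
# Back-of-envelope runtime estimate from the EXPLAIN ANALYZE sample
n_plz = 8170          # total PLZ shapes (Seq Scan row count)
secs_per_plz = 0.185  # ~185 ms foreign scan time per loop

total_secs = n_plz * secs_per_plz
print(f"{total_secs:.0f} s ≈ {total_secs / 60:.0f} minutes")
```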
Then run the full query:
SET work_mem = '4GB';
CREATE MATERIALIZED VIEW mviews.plz_social_de AS
SELECT
plz.plz,
plz.geom,
hll_add_agg(hll_hash_text(user_guid)) as user_hll,
hll_add_agg(hll_hash_text(post_guid)) as post_hll
FROM mviews.plz_intersect AS plz
JOIN mviews.all_posts_de AS s
ON plz.geom && s.post_latlng AND ST_Intersects(plz.geom, s.post_latlng)
GROUP BY plz.plz, plz.geom;
The optimized query took about 2 hours.
To speed up the process:
- Make sure there is a spatial index on the foreign data table mviews.all_posts_de. Since mviews.all_posts_de is itself a materialized view, the index needs to be created explicitly, even if the original parent table from which the mview was pulled already has an index. This is the SQL command:
CREATE INDEX IF NOT EXISTS idx_mv_all_posts_de_post_latlng_gist ON mviews.all_posts_de USING GIST (post_latlng);
ANALYZE mviews.all_posts_de;
- Without the two indexes (on plz_intersect and all_posts_de), PostgreSQL performs up to ~9000 × 50,000,000 = 450 billion geometry comparisons.
Query results¶
Query the results and store them as CSV, to be visualized in the next notebook.
db_conn_hll.query(f"SELECT * FROM mviews.plz_social_de LIMIT 10;")
We can see that the HLL shards are populated. To retrieve estimated user and post counts, we calculate the cardinality below (±2.5% estimation error).
Since some PLZ polygons have islands, we do another group-by-and-union-HLL operation to summarize total user and post counts per PLZ.
sql_query = f"""
SELECT
s.plz as "PLZ",
hll_cardinality(hll_union_agg(post_hll))::int AS "postcount",
hll_cardinality(hll_union_agg(user_hll))::int AS "usercount"
FROM mviews.plz_social_de AS s
GROUP BY s.plz
ORDER BY postcount DESC;
"""
result = db_conn_hll.query(sql_query)
result.head()
PLZ 10179 has 3,638,747 posts from 286,920 users. What is this PLZ?
shapes = gp.read_file(plz_shapes)
shapes.set_index("plz", inplace=True)
shapes.loc["10179"]
This is one confirmation that our spatial intersection was correct: PLZ 10179 is Berlin-Mitte, and central Berlin is expected to show the most social media posts.
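Given the ±2.5% estimation error stated above, the true counts for PLZ 10179 fall roughly within these bounds (a quick sketch using the reported estimates):

```python
err = 0.025  # assumed relative HLL estimation error

postcount = 3_638_747
usercount = 286_920

post_lo, post_hi = postcount * (1 - err), postcount * (1 + err)
user_lo, user_hi = usercount * (1 - err), usercount * (1 + err)
print(f"posts: {post_lo:,.0f} – {post_hi:,.0f}")
print(f"users: {user_lo:,.0f} – {user_hi:,.0f}")
```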
Write to CSV¶
%%time
usecols = ["PLZ", "postcount", "usercount"]
filename = OUTPUT / "2025-06-11_PLZ_DE_Geosocial.csv"
result.to_csv(
filename, columns=usecols,
index=False, header=True)
Create notebook HTML¶
!jupyter nbconvert --to html_toc \
--output-dir=../resources/html/ ./06_plz_intersection.ipynb \
--template=../nbconvert.tpl \
--ExtractOutputPreprocessor.enabled=False >&- 2>&-