Digital Landscape Traces: Visualization ¶
Alexander Dunkel, Leibniz Institute of Ecological Urban and Regional Development,
Transformative Capacities & Research Data Centre (IÖR-FDZ)
This notebook creates the final high-resolution spatial visualization of social media posts in Germany. It builds upon all previous steps by combining the full-resolution user coordinates with the final user classifications.
The key objectives are:
- Full Resolution: To eliminate the grid-like artifacts of the second notebook by using the original post coordinates (only slightly geohash-aggregated) from the Parquet files.
- Categorical Representation: To color each pixel on the map based on the majority user type (Local, Tourist, or Unclassified) present in that area.
- Geographic Context: To overlay Germany's national border for clear geographic reference.
The workflow uses a Dask/Parquet pipeline to handle the large volume of full-resolution coordinates efficiently.
Preparations¶
First, we import the necessary libraries and configure the notebook for development with automatic module reloading.
import os
import sys
import numpy as np
import pyarrow as pa
import xarray as xr
from pathlib import Path
import pandas as pd
import geopandas as gp
import dask.dataframe as dd
import dask.diagnostics as diag
import datashader as ds
import matplotlib.pyplot as plt
import datashader.transfer_functions as tf
from datashader.utils import lnglat_to_meters
from IPython.display import clear_output, display
from datashader.colors import rgb
import holoviews as hv
from holoviews.operation.datashader import datashade, dynspread
import hvplot.pandas
import xyzservices.providers as xyz
import hvplot.dask
from typing import Tuple, Optional
Load the local module dependencies. The tools and raster helpers are used below (for streaming the NUTS download and for the GeoTIFF export).
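The import below is a sketch that assumes the helper modules live under py/modules, as suggested by the paths excluded in the release step at the end of this notebook; adjust the path if your repository layout differs.
# Assumption: local helper modules (tools, raster) live in ../py/modules.
module_path = str(Path.cwd().parents[0] / "py")
if module_path not in sys.path:
    sys.path.append(module_path)
from modules import tools, raster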
Activate autoreload of changed python files:
%load_ext autoreload
%autoreload 2
Parameters¶
Define initial parameters that affect processing
OUTPUT = Path.cwd().parents[0] / "out"
WORK_DIR = Path.cwd().parents[0] / "tmp"
PARQUET_OUTPUT = OUTPUT / "de_classified_points.parquet"
PARQUET_OUTSIDE_DE = OUTPUT / "de_outside_background.parquet"
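Both Parquet inputs are produced by the previous notebooks. A quick guard (a minimal sketch) fails early if they are missing:
# Fail early if inputs from the previous notebooks are missing.
for parquet_path in (PARQUET_OUTPUT, PARQUET_OUTSIDE_DE):
    if not parquet_path.exists():
        raise FileNotFoundError(f"Missing input: {parquet_path}")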
Categorical Visualization with Datashader¶
With the data prepared, we can now create our visualization. We follow the method from the Datashader census example, defining a color key and using ds.by() to aggregate the data by our classification column.
df = dd.read_parquet(PARQUET_OUTPUT)
df = df.persist()
print(f"Data loaded with {len(df):,} points.")
df.columns
df.head(20)
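A note on dtypes: datashader's ds.by() aggregation used below requires the classification column to be categorical with known categories. If the Parquet files were written with a categorical column, nothing needs to be done; otherwise, a conversion along these lines (a sketch) would be required:
# ds.by('classification') needs a categorical column with known categories.
# Only required if the Parquet file stores the column as plain strings.
if df['classification'].dtype != 'category':
    df['classification'] = df['classification'].astype('category').cat.as_known()
    df = df.persist()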
Define the color key for our categories and the boundary for the map.
DE_BOUNDS = ((4.605469, 16.37207), (46.697243, 55.685885))
DRESDEN_BOUNDS = ((13.415680, 14.703827), (50.740090, 51.194905))
BERLIN_BOUNDS = ((12.843018, 14.149704), (52.274880, 52.684292))
LEIPZIG_BOUNDS = ((12.203201, 12.554764), (51.266002, 51.375322))
MUNICH_BOUNDS = ((11.240593, 11.943718), (48.004232, 48.237774))
FRANKFURT_BOUNDS = ((5.478882, 11.103882), (49.150450, 50.947349))
BALTIC_COAST_BOUNDS = ((8.942322, 14.567322), (53.308202, 54.947941))
color_key = {
'Local': 'blue',
'Tourist': 'red',
'Unclassified': 'cornflowerblue'
# 'Unclassified': 'slategrey'
}
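The color names are resolved to RGB tuples during the manual compositing step further below via datashader.colors.rgb, for example:
# Resolve a color name to its RGB tuple.
rgb('cornflowerblue')  # -> (100, 149, 237)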
NUTS_GPKG_FILE = "NUTS_RG_01M_2024_4326.gpkg"
if not Path(OUTPUT / NUTS_GPKG_FILE).exists():
tools.get_stream_file(
f"https://gisco-services.ec.europa.eu/distribution/v2/nuts/gpkg/{NUTS_GPKG_FILE}", OUTPUT / NUTS_GPKG_FILE)
else:
    print("Already exists")
nuts = gp.read_file(OUTPUT / NUTS_GPKG_FILE)
# NUTS level 0 corresponds to national boundaries; the country filter is left
# commented out so that neighboring countries' borders remain available for context.
nuts0 = nuts[
    # (nuts['CNTR_CODE'] == 'DE') &
    (nuts['LEVL_CODE'] == 0)
]
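As a quick sanity check, we can preview the selected boundaries (a minimal sketch; note that all countries at NUTS level 0 are included, not only Germany):
# Preview the selected national boundaries.
ax = nuts0.boundary.plot(figsize=(6, 6), linewidth=0.5, color='black')
ax.set_axis_off()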
Initial Visualization: A "Winner-Takes-All" Approach¶
With the data prepared, our first step is to create a baseline visualization. The function below uses a "winner-takes-all" method: for each pixel on the map, it determines which category of user (Local, Tourist, or Unclassified) is most numerous and assigns that category's color to the pixel.
This approach is effective for quickly showing the dominant user type in any given area. However, as we will see, it has a key limitation: it treats a pixel with 10,000 points the same as a pixel with 10 points, hiding the crucial information about data density and activity levels.
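The core of the method is xarray's argmax() over the category dimension. A toy example with hypothetical counts illustrates how the 3D (y, x, classification) counts reduce to a 2D index array:
# Toy (y, x, classification) counts: 2x2 pixels, three categories.
toy = xr.DataArray(
    np.array([[[5, 1, 0], [0, 0, 0]],
              [[2, 9, 1], [3, 3, 4]]]),
    dims=('y', 'x', 'classification'),
    coords={'classification': ['Local', 'Tourist', 'Unclassified']})
winner = toy.argmax(dim='classification')  # 0=Local, 1=Tourist, 2=Unclassified
winner = winner.where(toy.sum(dim='classification') > 0)  # mask empty pixels
print(winner.values)  # [[0., nan], [1., 2.]]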
def create_bordered_map(
    df, bounds: Tuple[Tuple[float, float], Tuple[float, float]], border_gdf, color_key, plot_width=1200, background='white'
):
"""
Generates a datashaded categorical map using a "winner-takes-all" method
and overlays country borders, correctly zoomed to the target area.
"""
x_range, y_range = bounds
    # Calculate the plot height to preserve the aspect ratio; the 1.5 factor
    # roughly compensates for Web Mercator's latitude stretching over Germany.
lng_width = x_range[1] - x_range[0]
lat_height = y_range[1] - y_range[0]
plot_height = int(((plot_width * lat_height) / lng_width) * 1.5)
x_coords, y_coords = lnglat_to_meters(x_range, y_range)
    # Create the Datashader canvas
cvs = ds.Canvas(
plot_width=plot_width, plot_height=plot_height,
x_range=x_coords, y_range=y_coords
)
# Create the 3D (y, x, classification) aggregate object.
print("Aggregating points by category...")
with diag.ProgressBar():
agg = cvs.points(df, x='x', y='y', agg=ds.by('classification'))
# Use argmax() to find the integer index of the majority category for each pixel.
# The result is a 2D array of integers (0 for 'Local', 1 for 'Tourist', etc.).
print("Finding majority category for each pixel...")
majority_indices = agg.argmax(dim='classification')
# Mask out pixels that have no data points at all.
majority_indices = majority_indices.where(agg.sum(dim='classification') > 0)
# Create a simple list of colors in the correct order.
# This order matches the integer indices from the argmax() step.
cats = list(agg.coords['classification'].values)
color_list = [color_key[cat] for cat in cats]
print("Shading image...")
img = tf.shade(majority_indices, cmap=color_list)
# --- Create the Final Plot with Matplotlib ---
print("Rendering final plot with borders...")
fig, ax = plt.subplots(1, 1, figsize=(15, 15), facecolor=background)
ax.imshow(img.to_pil(), extent=[x_coords[0], x_coords[1], y_coords[0], y_coords[1]])
border_gdf.to_crs(epsg=3857).plot(
ax=ax, facecolor='none', edgecolor='black', linewidth=0.5, alpha=0.7
)
ax.set_xlim(x_coords)
ax.set_ylim(y_coords)
ax.set_axis_off()
plt.tight_layout()
plt.show()
return img
img = create_bordered_map(
df,
DE_BOUNDS,
    nuts0,
color_key,
background='white'
)
Advanced Visualization: Adding Density and Weighting¶
The simple map above is a good start, but it has a key limitation: every data point is rendered with the same intensity, regardless of whether a pixel represents a single post or ten thousand. This makes it difficult to distinguish high-activity urban centers from sparse rural areas.
The new create_bordered_map_alpha() function creates a much richer visualization by independently controlling three key visual elements:
- The "Winner-Takes-All" Color: The color of each pixel (red, blue, or grey) is still determined by the majority category. Crucially, this version introduces a `local_weight` parameter, which lets us apply a "boost" to the 'Local' category to computationally balance the 1:2.4 class imbalance, giving a more representative view of local activity.
- The Data Density (Opacity): The opacity of each pixel is scaled by the total number of original data points it contains. We use histogram equalization (`eq_hist`) to ensure that the full range of transparency is used effectively (compare the short sketch after this list). The result is that dense urban areas appear crisp and opaque, while sparse rural points are visible but faint, providing a clear visual guide to data validity and activity levels.
- A Solid Background: To ensure a correct and consistent output, the function uses a robust "White Canvas" method: it builds the final image on an opaque canvas of the desired background color, avoiding complex blending errors that can occur when compositing transparent layers.
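To see why eq_hist matters for the opacity scaling, compare it with linear scaling on a heavily skewed toy aggregate (hypothetical counts; tf.shade packs the alpha value into the top byte of each uint32 pixel):
# Compare linear vs. eq_hist alpha scaling on heavily skewed counts.
counts = xr.DataArray(np.array([[1, 2, 4], [10, 100, 10000]]))
img_linear = tf.shade(counts, cmap='white', how='linear')
img_eqhist = tf.shade(counts, cmap='white', how='eq_hist')
print((img_linear.data >> 24) & 0xFF)  # only the densest pixel stands out
print((img_eqhist.data >> 24) & 0xFF)  # ranks spread across the alpha range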
We also want to show data surrounding Germany, as supplementary information, without classification. This is done below, before defining the core create_bordered_map_alpha() method.
print("\nLoading background context data (outside Germany)...")
df_outside = dd.read_parquet(PARQUET_OUTSIDE_DE).persist()
print(f"-> Loaded {len(df_outside):,} background points.")
This data contains only the coordinates, no classification.
df_outside.head(5)
def create_bordered_map_alpha(
    df, df_bg, bounds: Tuple[Tuple[float, float], Tuple[float, float]], border_gdf, color_key, background='white',
plot_width=2400, local_weight=1.0, return_raster: Optional[bool] = None
):
"""
Creates a high-resolution static map of categorical point data.
Renders a "winner-takes-all" color for each pixel with opacity scaled
by point density, composited on a solid background color. Allows for
re-weighting of the 'Local' category to adjust for class imbalance.
Uses df_bg to add grayscale background data.
"""
x_range, y_range = bounds
# 1. Prepare a high-resolution canvas with the correct aspect ratio.
lng_width = x_range[1] - x_range[0]
lat_height = y_range[1] - y_range[0]
plot_height = int(((plot_width * lat_height) / lng_width) * 1.5)
x_coords, y_coords = lnglat_to_meters(x_range, y_range)
cvs = ds.Canvas(
plot_width=plot_width, plot_height=plot_height,
x_range=x_coords, y_range=y_coords
)
# 2.1 Aggregate points by category to get counts for each pixel.
with diag.ProgressBar():
agg = cvs.points(df, x='x', y='y', agg=ds.by('classification'))
# 2.2 Create BACKGROUND Layer
print("Aggregating and shading background layer with density...")
with diag.ProgressBar():
agg_bg = cvs.points(df_bg, x='x', y='y')
# Shade the background using a gradient and log scaling.
img_bg = tf.shade(
agg_bg,
cmap='#000000', # A single color for all points
how='log', # Use log scale to see both sparse and dense areas
# span=[0, 250], # Map the data range from 0 to [max_count] to the full alpha gradient.
alpha=240, # MAX opacity for the densest areas
min_alpha=30 # MIN opacity for the sparsest areas
)
# 3. Determine the "winning" category, applying weights if necessary.
cats = list(agg.coords['classification'].values)
weighted_agg = agg.astype(np.float64) # Use a float copy for weighting
if local_weight != 1.0 and 'Local' in cats:
print(f"Applying a weight of {local_weight} to the 'Local' category...")
local_idx = cats.index('Local')
weighted_agg.data[:, :, local_idx] *= local_weight
majority_indices = weighted_agg.argmax(dim='classification')
# 4. Create a density-based alpha channel from original (unweighted) counts.
original_total_counts = agg.sum(dim='classification')
img_alpha = tf.shade(original_total_counts, cmap="white", how='eq_hist')
    # tf.shade returns packed uint32 RGBA pixels; extract the density-scaled
    # alpha values from the top byte of each pixel.
    alpha_channel = ((img_alpha.data >> 24) & 0xFF).astype(np.uint8)
# 5. Build the final image using the "White Canvas" method.
height, width = alpha_channel.shape
data_mask = original_total_counts.data > 0
# Start with an opaque canvas of the specified background color.
bg_color_tuple = rgb(background)
final_rgba = np.full((height, width, 4), list(bg_color_tuple) + [255], dtype=np.uint8)
# Paint data pixels with their majority category color.
for i, cat in enumerate(cats):
if cat in color_key:
mask = (majority_indices.data == i) & data_mask
final_rgba[mask, :3] = rgb(color_key[cat])
# Apply the density-based alpha only to the data pixels.
final_rgba[data_mask, 3] = alpha_channel[data_mask]
final_image = tf.Image(final_rgba)
# 6. Render the final, opaque image with Matplotlib.
fig, ax = plt.subplots(1, 1, figsize=(15, 15), facecolor=background)
ax.imshow(
final_image.to_pil(),
extent=[x_coords[0], x_coords[1], y_coords[0], y_coords[1]])
ax.imshow(
img_bg.to_pil(),
extent=[x_coords[0], x_coords[1], y_coords[0], y_coords[1]])
border_gdf.to_crs(epsg=3857).plot(
ax=ax, facecolor='none', edgecolor='black', linewidth=0.5, alpha=0.7
)
ax.set_xlim(x_coords)
ax.set_ylim(y_coords)
ax.set_axis_off()
plt.tight_layout()
if return_raster:
return fig, final_image, img_bg
return fig
opts = {
    "df": df,
    "df_bg": df_outside,
    "bounds": DE_BOUNDS,
    "border_gdf": nuts0,
"color_key": color_key,
"background": 'white',
"local_weight": 1.0,
"return_raster": True,
}
fig, img_fg, img_bg = create_bordered_map_alpha(**opts)
Save this image as the final map.
def save_image(fig, output_name, img_fg=None, img_bg=None, bounds=None):
"""
Saves a Matplotlib figure (PNG) and optionally its constituent
datashader layers (foreground and background) as GeoTIFFs.
"""
# 1. Save the visual PNG from the matplotlib figure
png_filepath = OUTPUT / f"{output_name}.png"
relative_png_path = png_filepath.relative_to(Path.cwd().parents[1])
print(f"Saving composite PNG to ./{relative_png_path}")
fig.savefig(png_filepath, dpi=300, bbox_inches='tight', pad_inches=0)
plt.close(fig) # Close the figure to free up memory
# 2. Save the foreground GeoTIFF, if provided
if img_fg is not None:
tif_filepath_fg = OUTPUT / f'{output_name}_foreground.tif'
print(
f"Saving GeoTiff to ./"
f"{tif_filepath_fg.relative_to(Path.cwd().parents[1])}")
raster.save_datashader_to_geotiff(
img=img_fg,
filepath=tif_filepath_fg,
bounds_ll=bounds
)
# 3. Save the background GeoTIFF, if provided
if img_bg is not None:
tif_filepath_bg = OUTPUT / f'{output_name}_background.tif'
print(
f"Saving GeoTiff to ./"
f"{tif_filepath_bg.relative_to(Path.cwd().parents[1])}")
raster.save_datashader_to_geotiff(
img=img_bg,
filepath=tif_filepath_bg,
bounds_ll=bounds
)
print("Image saving complete.")
save_image(
fig=fig,
output_name='DE_locals_vs_tourists',
img_fg=img_fg,
img_bg=img_bg,
bounds=DE_BOUNDS
)
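The GeoTIFF export itself is delegated to the local raster module, which is not shown in this notebook. For readers without that module, a functionally similar helper could be sketched with rasterio as follows (an assumption; the actual save_datashader_to_geotiff may differ):
# Hypothetical stand-in for raster.save_datashader_to_geotiff, using rasterio.
import rasterio
from rasterio.transform import from_bounds

def save_rgba_geotiff_sketch(img, filepath, bounds_ll):
    """Write a datashader RGBA image as a GeoTIFF in EPSG:3857 (sketch)."""
    x_range, y_range = bounds_ll
    xs, ys = lnglat_to_meters(x_range, y_range)
    data = np.asarray(img.data)
    if data.dtype == np.uint32:
        # Unpack packed uint32 RGBA pixels into four uint8 bands.
        data = data.view(np.uint8).reshape(data.shape + (4,))
    data = np.flipud(data)  # row 0 becomes the northernmost row
    height, width = data.shape[:2]
    transform = from_bounds(xs[0], ys[0], xs[1], ys[1], width, height)
    with rasterio.open(
            filepath, 'w', driver='GTiff', width=width, height=height,
            count=4, dtype='uint8', crs='EPSG:3857', transform=transform) as dst:
        for band in range(4):
            dst.write(data[:, :, band], band + 1)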
Zoom into different regions¶
To examine selected areas in more detail, we can easily create several zoomed-in views of key regions across Germany.
opts["bounds"] = DRESDEN_BOUNDS
opts["return_raster"] = False
fig = create_bordered_map_alpha(**opts)
opts["bounds"] = BERLIN_BOUNDS
fig = create_bordered_map_alpha(**opts)
opts["bounds"] = LEIPZIG_BOUNDS
fig = create_bordered_map_alpha(**opts)
opts["bounds"] = MUNICH_BOUNDS
fig = create_bordered_map_alpha(**opts)
opts["bounds"] = FRANKFURT_BOUNDS
fig = create_bordered_map_alpha(**opts)
opts["bounds"] = BALTIC_COAST_BOUNDS
fig = create_bordered_map_alpha(**opts)
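The repeated calls above could equally be expressed as a loop:
# Equivalent loop over all regional views.
for bounds in (DRESDEN_BOUNDS, BERLIN_BOUNDS, LEIPZIG_BOUNDS,
               MUNICH_BOUNDS, FRANKFURT_BOUNDS, BALTIC_COAST_BOUNDS):
    opts["bounds"] = bounds
    fig = create_bordered_map_alpha(**opts)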
Create release file¶
!rm ../out/*.zip
This Bash command cleans up any previous ZIP files in the out/ directory. The ! indicates it's a shell command, not Python.
Prepare a release
Make sure that 7z is available. Carto-Lab Docker comes with 7z. If you are using this in a rootful container, you can run !apt install p7zip-full. Otherwise (e.g. in the Jupyter4NFDI Hub), we retrieve the binary below.
%%bash
# Check if '7z' is already available globally or in ~/bin
if ! command -v 7z >/dev/null 2>&1 && [ ! -x "$HOME/bin/7z" ]; then
echo "7z not found. Installing local copy..."
mkdir -p ~/bin && cd ~/bin
wget -q https://www.7-zip.org/a/7z2301-linux-x64.tar.xz
tar -xf 7z2301-linux-x64.tar.xz
ln -sf ~/bin/7zz ~/bin/7z
else
echo "7z is already available."
fi
Create a new release *.zip file
We want to create a ZIP file with the current release version in the name. We can get this with the following command:
!git describe --tags --abbrev=0
Create the release file:
%%bash
export PATH="$HOME/bin:$PATH" \
&& cd .. && git config --local --add safe.directory '*' \
&& RELEASE_VERSION=$(git describe --tags --abbrev=0) \
&& 7z a -tzip -mx=9 out/release_$RELEASE_VERSION.zip \
py/* out/* md/* resources/* *.bib notebooks/*.ipynb \
*.md *.yml *.ipynb nbconvert.tpl conf.json pyproject.toml \
-x!out/user_classification.csv -x!out/user_home_locations.csv \
-x!py/__pycache__ -x!py/modules/__pycache__ -x!py/modules/.ipynb_checkpoints \
-y > /dev/null
- `export PATH="$HOME/bin:$PATH"` ensures that any binary (e.g. `7z`) retrieved earlier is accessible (see cell above).
- `git config --local --add safe.directory '*'` ensures Git doesn't prompt for confirmation when working with repositories owned by different users.
- `RELEASE_VERSION` is the bash variable that holds the version retrieved with `git describe --tags --abbrev=0`.
- `7z a -tzip -mx=9 out/release_$RELEASE_VERSION.zip` uses the `7z` archiving tool (`7z a`) to create a ZIP archive (`-tzip`) named `out/release_` followed by the retrieved version number (`$RELEASE_VERSION`); `-mx=9` sets the compression level to maximum.
- With `py/* out/* md/* resources/* notebooks/*.ipynb` (etc.) we explicitly select the folders and files that we want to include in the release.
- `-x!out/user_classification.csv -x!out/user_home_locations.csv` excludes intermediate data files that are not strictly necessary to replicate the final map, for increased privacy preservation.
- At the end, we exclude a number of temporary files that we do not need to archive (`-x!py/__pycache__`, `-x!py/modules/__pycache__`, etc.) and suppress output by piping to `/dev/null`.
Next, we check the generated file:
!RELEASE_VERSION=$(git describe --tags --abbrev=0) \
&& ls -alh ../out/release_$RELEASE_VERSION.zip
For the ioerDATA upload, we also generate a list of the files below.
ignore_files_folders = []
ignore_match = []
tools.tree(
Path.cwd().parents[0],
ignore_files_folders=ignore_files_folders, ignore_match=ignore_match)
Create notebook HTML¶
!jupyter nbconvert --to html_toc \
--output-dir=../resources/html/ ./03_visualization.ipynb \
--template=../nbconvert.tpl \
--ExtractOutputPreprocessor.enabled=False >&- 2>&- # create single output file