100 Million places visualized with geopandas, datashader and geoparquet

Alexander Dunkel*, Dominik Weckmüller**
*Leibniz Institute of Ecological Urban and Regional Development, Transformative Capacities & Research Data Centre (IÖR-FDZ); **TU Dresden, Institute of Cartography


TL;DR Visualization of 100 million Foursquare places with GeoParquet and datashader.

Sources:

Some of the code in this notebook is based on two previous explorations:


Out[1]:

Last updated: Nov-25-2024

Preparations

Use the lbsn Jupyterlab Docker container as a starting point.

You also need to install either pyarrow or fastparquet (recommended) from conda-forge.

In [2]:
import numpy as np
import pandas as pd
import geopandas as gp
import dask.diagnostics as diag
import datashader.transfer_functions as tf
import datashader as ds
from datashader.utils import lnglat_to_meters
from IPython.display import clear_output
from pathlib import Path


In [3]:
OUTPUT = Path.cwd().parents[0] / "out"
OUTPUT.mkdir(exist_ok=True)
In [4]:
INPUT = Path.cwd().parents[0] / "00_data"

Load Parquet Format

The parquet format stores data in individual files (partitions) and allows streamed processing, e.g. in dask. See a comparison of the data types supported by datashader here.

In [5]:
parquet_input = INPUT / "foursquare_places.parquet"
In [6]:
datasize = parquet_input.stat().st_size/(1024*1024*1024)
print(
    f"Size: {datasize:,.1f} GB")
Size: 9.9 GB

Read the parquet file, but limit reading to the geometry column. This will still read the full dataset at once, without chunking or streamed processing, so keep an eye on your memory.

In [7]:
df = gp.read_parquet(
    parquet_input, columns=["geometry"])

Project to Web Mercator (EPSG:3857)

In [8]:
gdf = df.to_crs(3857)
In [9]:
gdf.head()
Out[9]:
geometry
0 POINT (12348830.637 -890188.734)
1 POINT (12281917.323 -880747.774)
2 POINT (12278402.216 -894917.756)
3 POINT (12286748.849 -884964.258)
4 POINT (12275093.383 -887669.909)
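to_crs(3857) runs the full PROJ pipeline; for plain lng/lat points, datashader's lnglat_to_meters applies the same spherical Web Mercator projection with cheap numpy arithmetic. A sketch of the underlying formula (an illustrative reimplementation, not datashader's actual source):

```python
import numpy as np

def lnglat_to_webmercator(longitude, latitude):
    """Spherical Web Mercator (EPSG:3857) forward projection."""
    origin_shift = np.pi * 6378137  # half the projected earth circumference, in m
    x = np.asarray(longitude) * origin_shift / 180.0
    y = np.log(np.tan((90.0 + np.asarray(latitude)) * np.pi / 360.0)) * origin_shift / np.pi
    return x, y

# (0, 0) maps to the origin; lng 180 maps to the projection's eastern edge
x, y = lnglat_to_webmercator([0.0, 180.0], [0.0, 0.0])
```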

Visualize with datashader

Define areas

In [10]:
def bounds(x_range, y_range):
    """Convert lng/lat ranges to Web Mercator x/y ranges."""
    x, y = lnglat_to_meters(x_range, y_range)
    return dict(x_range=x, y_range=y)

Earth = ((-180.00, 180.00), (-59.00, 74.00))
France = ((-12.00,  16.00), (41.26, 51.27))
Berlin = ((12.843018,  14.149704), (52.274880, 52.684292))
USA = ((-126,  -64), (24.92, 49.35))
Paris = ((2.05, 2.65), (48.76, 48.97))
Dresden = ((13.415680, 14.703827), (50.740090, 51.194905))
DE = ((4.605469, 15.372070), (46.697243, 55.065885))

plot_width = 1000
plot_height = int(plot_width*0.5)
In [11]:
canvas = ds.Canvas(plot_width=plot_width, plot_height=plot_height, **bounds(*Earth))
In [12]:
with diag.ProgressBar(), diag.Profiler() as prof, diag.ResourceProfiler(0.5) as rprof:
    agg = canvas.points(gdf, geometry="geometry")
In [13]:
tf.shade(agg.where(agg > 1), cmap=["lightblue", "darkblue"])
Out[13]:
[Figure: world map of places]
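agg.where(agg > 1) masks cells containing at most one point before shading, so isolated single places do not wash out the colormap. The effect on a small count grid, with plain numpy standing in for the xarray aggregate:

```python
import numpy as np

# Toy 3x3 count grid, as a Canvas.points() aggregate would hold per pixel
agg = np.array([[0, 1, 5],
                [2, 0, 9],
                [1, 3, 0]], dtype=float)

# Equivalent of agg.where(agg > 1): keep counts > 1, mask the rest as NaN
masked = np.where(agg > 1, agg, np.nan)

# Only the masked (non-NaN) cells receive a color in tf.shade()
visible = np.count_nonzero(~np.isnan(masked))
```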
In [15]:
def plot(x_range, y_range, plot_width: int = 1000):
    """Plot gdf using tf.shade().
    Calculates the plot height from the lat/lng aspect
    ratio, corrected for Web Mercator distortion.
    """
    lng_width = x_range[1] - x_range[0]
    lat_height = y_range[1] - y_range[0]
    plot_height = int(((plot_width * lat_height) / lng_width) * 1.5)
    cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height, **bounds(x_range, y_range))
    with diag.ProgressBar(), diag.Profiler() as prof, diag.ResourceProfiler(0.5) as rprof:
        agg = cvs.points(gdf, geometry="geometry")
    return tf.shade(agg, cmap=["lightblue", "darkblue"])

def save_image(img, output_name, return_img: bool = True):
    """Saves the image as PNG"""
    ds.utils.export_image(
        img=img, filename=str(OUTPUT / output_name), fmt=".png", background='white')
    if return_img:
        return img
In [16]:
%time save_image(plot(*France), output_name='France_map')
CPU times: user 14.5 s, sys: 1.43 s, total: 16 s
Wall time: 17.6 s
Out[16]:
[Figure: France map]
In [17]:
%time save_image(plot(*Paris), output_name='Paris_map')
CPU times: user 503 ms, sys: 500 ms, total: 1 s
Wall time: 2.21 s
Out[17]:
[Figure: Paris map]
In [18]:
%time save_image(plot(*Berlin), output_name='Berlin_map')
CPU times: user 394 ms, sys: 508 ms, total: 901 ms
Wall time: 2.16 s
Out[18]:
[Figure: Berlin map]
In [19]:
%time save_image(plot(*USA), output_name='USA_map')
CPU times: user 24.9 s, sys: 2.3 s, total: 27.2 s
Wall time: 28.8 s
Out[19]:
[Figure: USA map]
In [20]:
%time save_image(plot(*DE), output_name='DE_map')
CPU times: user 8.27 s, sys: 1.09 s, total: 9.36 s
Wall time: 11 s
Out[20]:
[Figure: Germany map]
In [21]:
%time save_image(plot(*Dresden), output_name='Dresden_map')
CPU times: user 111 ms, sys: 523 ms, total: 635 ms
Wall time: 2.19 s
Out[21]:
[Figure: Dresden map]

Create Notebook HTML

In [22]:
!jupyter nbconvert --to html_toc \
    --output-dir=../resources/html/ ./2024-11-25_foursquare_parquet.ipynb \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False >&- 2>&- # create single output file

IOER RDC Jupyter Base Template v0.10.0