Cherry blossoms opportunistic occurrence data (Flickr, Twitter)

Alexander Dunkel, Leibniz Institute of Ecological Urban and Regional Development,
Transformative Capacities & Research Data Centre (IÖR-FDZ)

Publication:
Dunkel, A., Burghardt, D. (2024). Assessing perceived landscape change from opportunistic spatio-temporal occurrence data. Land 2024

Out[65]:

Last updated: Jul-19-2024, Carto-Lab Docker Version 0.14.0

Visualizations of temporal patterns for ephemeral events. This notebook builds upon a previous notebook.

The basis for this notebook is a set of 1.6 million tweets that were queried with cherry AND (flower OR blossom OR sakura). The query result contains both geotagged and non-geotagged content.

A second query with the same syntax was prepared for Flickr. For Flickr, only geotagged content was considered. This yielded 100,662 Flickr photographs.

For both sources, the query period covered 2007 to 2017.

Preparations

In [67]:
OUTPUT = Path.cwd().parents[0] / "out"       # output directory for figures (etc.)
WORK_DIR = Path.cwd().parents[0] / "tmp"     # Working directory
In [68]:
OUTPUT.mkdir(exist_ok=True)
(OUTPUT / "figures").mkdir(exist_ok=True)
(OUTPUT / "svg").mkdir(exist_ok=True)
WORK_DIR.mkdir(exist_ok=True)
In [69]:
%load_ext autoreload
%autoreload 2
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Select M for monthly aggregation, Y for yearly aggregation

In [70]:
AGG_BASE = "M"

First, define whether to study usercount or postcount

In [71]:
# METRIC = 'user'
METRIC = 'post'
In [72]:
metric_col = 'post_hll'
if METRIC == 'user':
    metric_col = 'user_hll'

Set global font

In [73]:
plt.rcParams['font.family'] = 'serif'
plt.rcParams['font.serif'] = ['Times New Roman'] + plt.rcParams['font.serif']

Set global colors

In [74]:
color_flickr = '#F89F5E'
color_twitter = '#85C87A'

Load HLL aggregate data

Load the data from CSV, generated in the previous notebook. Data is stored as aggregate HLL data (postcount, usercount) for each month.

In [76]:
%%time
data_files = {
    "FLICKR_CHERRY_ALL":FLICKR_CHERRY_ALL, 
    }
tools.display_file_stats(data_files)
name FLICKR_CHERRY_ALL
size 180.23 KB
records 188
CPU times: user 4.14 ms, sys: 107 µs, total: 4.24 ms
Wall time: 3.8 ms
In [77]:
pd.read_csv(FLICKR_CHERRY_ALL, nrows=10)
Out[77]:
year month post_hll user_hll
0 2007 1 \x138b400023038106e109820b210b41124115c116411b... \x138b400ea10f441ba5200128a12a623a233be15c4265...
1 2007 2 \x138b4004a30ae22d612e83308241a14ea25021504253... \x138b4019812262246133413d423d8140a445635dc25f...
2 2007 3 \x148b400040108c0100000100010040110c6000c00000... \x138b400327058206610a220aa30b860ba70cc10ce30f...
3 2007 4 \x148b401046008c2008c02084441880008cc100820108... \x138b40000101610183030103a303e20482050305e106...
4 2007 5 \x138b40010301e304c404e40546066209a10fa2114312... \x138b4000a3042106c115c1182319031a821d241e0121...
5 2007 6 \x138b4007e608e1112216e123a136a13d0242044a6153... \x138b401b471d242561554358425a816e2373c295c2a0...
6 2007 7 \x138b4006c10d02134115a11a211a811c831cc3238223... \x138b4009410d4722c124a33123414442e14481548159...
7 2007 8 \x138b40176153058ac39de1ad21ae44bac1c301c8c1c9... \x138b402b414ea16b6172e77a219683aec2be22c7a1d681
8 2007 9 \x138b4003e2088240615d2160626122624271e178e584... \x138b400a6719c223812f6158e28725a601aee3b8e2d4...
9 2007 10 \x138b400be117421da12b453e62612181829aa1a342b9... \x138b400402182546a1718186019781aac1b541bf63c2...

Connect hll worker db

For calculating with HLL data, we are using an empty Postgres database with the Citus HLL extension installed. Specifically, we are using pg-hll-empty here, a Dockerized Postgres container prepared for HLL calculations. Any Postgres instance can be used, as long as it has the Citus HLL extension installed.

If you haven't already, start the container locally next to Carto-Lab Docker:

cd pg-hll-empty
docker compose up -d
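The per-row cardinality estimates used later in this notebook (via hll.cardinality_hll) boil down to a single SQL call against this worker database. A minimal sketch of how such a query could look, assuming the Citus HLL extension is available; the actual hll.cardinality_hll helper from the tools module may differ:

```python
# Sketch only: assumes `cursor` is a psycopg2 cursor connected to a Postgres
# instance with the citus hll extension (e.g. pg-hll-empty); not executed here.
CARDINALITY_SQL = "SELECT hll_cardinality(CAST(%s AS hll))"

def cardinality_from_hex(hll_hex: str, cursor) -> int:
    """Estimate the distinct count stored in a single HLL hex string."""
    cursor.execute(CARDINALITY_SQL, (hll_hex,))
    return int(cursor.fetchone()[0])
```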
In [78]:
DB_USER = "hlluser"
DB_PASS = os.getenv('READONLY_USER_PASSWORD')
# set connection variables
DB_HOST = "127.0.0.1"
DB_PORT = "5452"
DB_NAME = "hllworkerdb"

Connect to empty Postgres database running HLL Extension:

In [79]:
DB_CONN = psycopg2.connect(
        host=DB_HOST,
        port=DB_PORT ,
        dbname=DB_NAME,
        user=DB_USER,
        password=DB_PASS
)
DB_CONN.set_session(
    readonly=True)
DB_CALC = tools.DbConn(
    DB_CONN)
CUR_HLL = DB_CONN.cursor()


Calculate HLL Cardinality per month and year

Define additional functions for reading and formatting CSV as pd.DataFrame

In [80]:
from datetime import datetime

def read_csv_datetime(csv: Path) -> pd.DataFrame:
    """Read CSV with parsing datetime index (months)
    
        First CSV column: Year
        Second CSV column: Month
    """
    date_cols = ["year", "month"]
    df = pd.read_csv(
        csv, index_col='datetime', 
        parse_dates={'datetime':date_cols},
        date_format='%Y %m',
        keep_date_col=True)  # bool, not string; source columns are dropped explicitly below
    df.drop(columns=date_cols, inplace=True)
    return df
    
def append_cardinality_df(df: pd.DataFrame, hll_col: str = "post_hll", cardinality_col: str = 'postcount_est'):
    """Calculate cardinality from HLL and append to extra column in df"""
    df[cardinality_col] = df.apply(
        lambda x: hll.cardinality_hll(
           x[hll_col], CUR_HLL),
        axis=1)
    df.drop(columns=[hll_col], inplace=True)
    return df

def filter_fill_time(
        df: pd.DataFrame, min_year: int, 
        max_year: int, val_col: str = "postcount_est",
        min_month: str = "01", max_month: str = "01", agg_base: str = None,
        agg_method = None):
    """Filter time values between min - max year and fill missing values"""
    max_day = "01"
    if agg_base is None:
        agg_base = "M"
    elif agg_base == "Y":
        max_month = "12"
        max_day = "31"
    min_date = pd.Timestamp(f'{min_year}-{min_month}-01')
    max_date = pd.Timestamp(f'{max_year}-{max_month}-{max_day}')
    # clip by start and end date
    if min_date not in df.index:
        df.loc[min_date, val_col] = 0
    if max_date not in df.index:
        df.loc[max_date, val_col] = 0
    df.sort_index(inplace=True)
    # mask min and max time
    time_mask = ((df.index >= min_date) & (df.index <= max_date))
    resampled = df.loc[time_mask][val_col].resample(agg_base)
    if agg_method is None:
        series = resampled.sum()
    elif agg_method == "count":
        series = resampled.count()
    elif agg_method == "nunique":
        series = resampled.nunique()
    # fill missing months with 0
    # this will also set the day to max of month
    return series.fillna(0).to_frame()
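The gap-filling in filter_fill_time relies on pandas resampling: empty monthly bins sum to zero. A minimal sketch with hypothetical values, where February has no observations:

```python
import pandas as pd

# Hypothetical toy data: two observed months with a gap in February
df = pd.DataFrame(
    {"postcount_est": [5, 7]},
    index=pd.to_datetime(["2007-01-15", "2007-03-20"]))
# Resample to month-end; the empty February bin sums to 0,
# mirroring the fill behavior in filter_fill_time
series = df["postcount_est"].resample("M").sum().fillna(0)
print(series.to_list())  # → [5, 0, 7]
```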

Select dataset to process below

Apply functions to all data sets.

  • Read from CSV
  • calculate cardinality
  • merge year and month to single column
  • filter 2007 - 2018 range, fill missing values
In [81]:
def process_dataset(
        dataset: Path = None, metric: str = None, df_post: pd.DataFrame = None,
        min_year: int = None, max_year: int = None, agg_base: str = None) -> pd.DataFrame:
    """Apply temporal filter/pre-processing to all data sets."""
    if metric is None:
        metric = 'post_hll'
        warn(f"Using default value {metric}")
    if metric == 'post_hll':
        cardinality_col = 'postcount_est'
    else:
        cardinality_col = 'usercount_est'
    if min_year is None:
        min_year = 2007
    if max_year is None:
        max_year = 2017
    if df_post is None:
        df_post = read_csv_datetime(dataset)
    df_post = append_cardinality_df(df_post, metric, cardinality_col)
    return filter_fill_time(df_post, min_year, max_year, cardinality_col, agg_base=agg_base)
In [82]:
%%time
df_post = process_dataset(FLICKR_CHERRY_ALL, agg_base=AGG_BASE, metric='post_hll')
CPU times: user 14.2 ms, sys: 4.66 ms, total: 18.8 ms
Wall time: 40.1 ms
In [83]:
df_post.head(5)
Out[83]:
postcount_est
datetime
2007-01-31 86
2007-02-28 51
2007-03-31 1078
2007-04-30 1852
2007-05-31 186
In [84]:
%%time
df_user = process_dataset(FLICKR_CHERRY_ALL, metric=metric_col, agg_base=AGG_BASE)
CPU times: user 12.9 ms, sys: 5.63 ms, total: 18.5 ms
Wall time: 39.5 ms
In [85]:
df_user.head(5)
Out[85]:
postcount_est
datetime
2007-01-31 86
2007-02-28 51
2007-03-31 1078
2007-04-30 1852
2007-05-31 186

Visualize Cardinality

Flickr

Define plot function.

In [86]:
def fill_plot_time(
        df: pd.DataFrame, ax: Axes, color: str,
        label: str, val_col: str = "postcount_est", fill: bool = None) -> Axes:
    """Matplotlib Barplot with time axis formatting

    If "significant" in df columns, applies different colors to fill/edge
    of non-significant values.
    """
    if fill is None:
        fill = True
    if color is None:
        colors = sns.color_palette("vlag", as_cmap=True, n_colors=2)
        color_rgba = colors([1.0])[0]
        color = mcolor.rgb2hex((color_rgba), keep_alpha=True)
    color_significant = color
    color_significant_edge = "white"
    if "significant" in df.columns:
        colors_bar = {True: color, False: "white"}
        color_significant = df['significant'].replace(colors_bar)
        colors_edge = {True: "white", False: "black"}
        color_significant_edge = df['significant'].replace(colors_edge)
    df_plot = df.set_index(
        df.index.map(lambda s: s.strftime('%Y')))
    ax = df_plot.plot(
            ax=ax, y=val_col, color=color_significant,
            label=label, linewidth=0.5, alpha=1.0) #0.6
    if fill:
        ax.fill_between(range(len(df_plot.index)), df_plot[val_col], facecolor=color, alpha=1.0)
    return ax

@ticker.FuncFormatter
def major_formatter(x, pos):
    s = format(x, '.0f')
    if len(s) <= 4:
        return s
    return format(x, ',.0f')

def plot_time(
        df: pd.DataFrame, title, color = None, filename = None, 
        output = OUTPUT, legend: str = "Postcount", val_col: str = None,
        trend: bool = None, seasonal: bool = None, residual: bool = None,
        agg_base: str = None, fig = None, ax = None, return_fig_ax = None, fill: bool = None):
    """Create dataframe(s) time plot"""
    x_ticks_every = 12
    fig_x = 10
    fig_y = 2
    font_mod = True
    x_label = "Year"
    linewidth = 3
    if agg_base and agg_base == "Y":
        x_ticks_every = 1
        fig_x = 3
        fig_y = 1.5
        font_mod = True
        x_label = "Year"
        linewidth = 1
    if fig is None or ax is None:
        fig, ax = plt.subplots()
        fig.set_size_inches(fig_x, fig_y)
    ylabel = f'{legend}'
    if val_col is None:
        val_col = f'{legend.lower()}_est'
    ax = fill_plot_time(
        df=df, ax=ax, color=color, val_col=val_col, label=legend, fill=fill)

    # TODO: below is a bit hacky way to format the x-axis;
    tick_loc = mticker.MultipleLocator(x_ticks_every)
    ax.xaxis.set_major_locator(tick_loc)
    ax.tick_params(axis='x', rotation=45, length=0.5)
    ax.yaxis.set_major_formatter(major_formatter)
    xrange_min_max = range(2006, 2018)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        ax.set_xticklabels(xrange_min_max, rotation=45)
    
    ax.set(xlabel=x_label, ylabel=ylabel)
    ax.spines["left"].set_linewidth(0.25)
    ax.spines["bottom"].set_linewidth(0.25)
    ax.spines["top"].set_linewidth(0)
    ax.spines["right"].set_linewidth(0)
    ax.yaxis.set_tick_params(width=0.5)
    # remove legend
    ax.get_legend().remove()
    ax.set_title(title)
    ax.set_xlim(-0.5, len(df)-0.5)
    ax.set_ylim(bottom=0)
    if font_mod:
        for item in (
            [ax.xaxis.label, ax.title, ax.yaxis.label] +
             ax.get_xticklabels() + ax.get_yticklabels()):
                item.set_fontsize(8)
    # store figure to file
    if filename:
        fig.savefig(
            output / "figures" / f"{filename}.png", dpi=300, format='PNG',
            bbox_inches='tight', pad_inches=1, facecolor="white")
        # also save as svg
        fig.savefig(
            output / "svg" / f"{filename}.svg", format='svg',
            bbox_inches='tight', pad_inches=1, facecolor="white")
    if return_fig_ax:
        return fig, ax
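The custom major_formatter above only inserts thousands separators once a value exceeds four digits, keeping short year-range tick labels compact. The threshold logic can be sketched standalone:

```python
def fmt(x: float) -> str:
    """Thousands separator only for numbers longer than four digits."""
    s = format(x, '.0f')
    if len(s) <= 4:
        return s
    return format(x, ',.0f')

print(fmt(999), fmt(1234), fmt(12345))  # → 999 1234 12,345
```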
In [87]:
def load_and_plot(
        dataset: Path = None, metric: str = None, src_ref: str = "flickr", colors: cm.colors.ListedColormap = None,
        agg_base: str = None, trend: bool = None, return_df: bool = None, df_post: pd.DataFrame = None, return_fig_ax = None):
    """Load data and plot"""
    if metric is None:
        metric = 'post_hll'
    if metric == 'post_hll':
        metric_label = 'postcount'
    else:
        metric_label = 'usercount'
    if colors is None:
        colors = sns.color_palette("vlag", as_cmap=True, n_colors=2)
        colors = colors([1.0])
    df = process_dataset(dataset, metric=metric, agg_base=agg_base, df_post=df_post)
    fig, ax = plot_time(
        df, legend=metric_label.capitalize(), color=colors,
        title=f'{src_ref}', 
        filename=f"temporal_{metric_label}_{src_ref}_absolute", trend=trend, agg_base=agg_base, return_fig_ax=True)
    fig.show()
    if return_fig_ax:
        return fig, ax
    if return_df:
        return df
In [88]:
colors = sns.color_palette("vlag", as_cmap=True, n_colors=2)
In [89]:
fig, ax = load_and_plot(
    FLICKR_CHERRY_ALL, src_ref=f" ", agg_base=AGG_BASE, trend=False, metric=metric_col, return_fig_ax=True,
    colors=color_flickr)
fig.show()
[Figure: Monthly Flickr postcount estimates, 2007-2017]

Twitter

In [90]:
src = Path.cwd().parents[0] / "00_data" / "twitter" / "2017-04-12_CherryTweetsWorldwide_2007to2017-01-01.csv"
In [91]:
load_twitter_kwargs = {
    "filepath_or_buffer":src,
    "index_col":'datetime', 
    "parse_dates":{'datetime':["timestamp"]},
    # "date_format":'%Y-%m-%d',
    "keep_date_col":'False',
    "usecols":["id", "timestamp"]
}
df = pd.read_csv(**load_twitter_kwargs)
In [92]:
df.drop(columns=['timestamp'], inplace=True)
In [93]:
df.head()
Out[93]:
id
datetime
2007-01-22 08:49:12 3665663
2007-02-17 20:39:46 5560411
2007-03-13 17:42:45 7462541
2007-03-23 04:40:10 11375161
2007-03-24 01:52:51 11860891
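The same loading logic can be demonstrated on a small in-memory CSV, using the first two rows from the preview above:

```python
import io
import pandas as pd

raw = io.StringIO(
    "id,timestamp\n"
    "3665663,2007-01-22 08:49:12\n"
    "5560411,2007-02-17 20:39:46\n")
# Parse the timestamp column directly as datetime index
df = pd.read_csv(
    raw, usecols=["id", "timestamp"],
    parse_dates=["timestamp"], index_col="timestamp")
df.index.name = "datetime"
print(df.shape)  # → (2, 1)
```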

Aggregate

In [94]:
val_col = "id"
agg_method = "count"
metric_label="observations"
In [95]:
df_twitter = filter_fill_time(
    df, 2007, 2017, val_col=val_col, agg_base=AGG_BASE, agg_method=agg_method)
In [96]:
df_twitter.rename(columns={val_col: metric_label}, inplace=True)
In [97]:
src_ref=" "
In [98]:
df_twitter.sum()
Out[98]:
observations    1599167
dtype: int64
In [99]:
df_twitter.head()
Out[99]:
observations
datetime
2007-01-31 2
2007-02-28 1
2007-03-31 14
2007-04-30 64
2007-05-31 3

Instantiate a second axis that shares the same x-axis

In [100]:
ax2 = ax.twinx()
ax2.set_zorder(ax.get_zorder() - 1)
ax.patch.set_visible(False)
ax.set_ylabel('Observations Flickr')
Out[100]:
Text(20.0, 0.5, 'Observations Flickr')
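The z-order trick above draws the second (Twitter) axis behind the first (Flickr) axis, which requires making the front axis' background patch transparent. A self-contained sketch of this pattern:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax2 = ax.twinx()                     # second y-axis sharing the x-axis
ax2.set_zorder(ax.get_zorder() - 1)  # draw ax2 below ax
ax.patch.set_visible(False)          # ax background transparent, so ax2 shows through
```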
In [101]:
fig, ax = plot_time(
        df_twitter, legend=metric_label.capitalize(), color=color_twitter,
        title=" ", val_col=metric_label,
        filename=f"temporal_{metric_label}_{src_ref}_absolute", trend=False, agg_base=AGG_BASE, fig=fig, ax=ax2, return_fig_ax=True)
ax2.set_ylabel('Observations Twitter')
ax2.set_title("")
Out[101]:
Text(0.5, 1.0, '')
In [102]:
range_patch = mpatches.Patch(
    color=color_twitter,
    label='Twitter', alpha=1.0)
obs_patch = mpatches.Patch(
    color=color_flickr,
    label='Flickr', alpha=1.0)
legend_entries = [range_patch, obs_patch]
legend_kwds = {
    "bbox_to_anchor": (0.0, 1),
    "loc":'upper left',
    "fontsize":8, "frameon":False,
    "alignment":"left"}
ax2.legend(handles=legend_entries, **legend_kwds)
Out[102]:
<matplotlib.legend.Legend at 0x7fb443667fa0>
In [103]:
fig
Out[103]:
[Figure: Combined temporal plot of Flickr postcount and Twitter observations, 2007-2017]
In [104]:
tools.save_fig(fig, output=OUTPUT, name="cherry_flickr_twitter")

Create notebook HTML

In [1]:
!jupyter nbconvert --to html_toc \
    --output-dir=../resources/html/ ./06_cherry_blossoms.ipynb \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False >&- 2>&-

IOER RDC Jupyter Base Template v0.10.0