Mass invasions: Temporal patterns for 13 Instagram vantage points

Alexander Dunkel, Leibniz Institute of Ecological Urban and Regional Development,
Transformative Capacities & Research Data Centre (IÖR-FDZ)

Publication:
Dunkel, A., Burghardt, D. (2024). Assessing perceived landscape change from opportunistic spatio-temporal occurrence data. Land 2024


Last updated: Jul-19-2024, Carto-Lab Docker Version 0.14.0

Visualizations of temporal patterns for places affected by "mass invasion". The list of 13 places was prepared by Claudia Tautenhahn for her master's thesis.

Tautenhahn, C. Das Phänomen der Masse: Landschaftliche Wirkungen sozialer Medien. Master's thesis. TU Dresden 2020, Department of Landscape Architecture and Environmental Planning

I was the second advisor of this thesis and performed the data collection. Based on Claudia's list, 1.5 million Instagram posts were queried. The query period was 2010 to 2019, but the lower bound of the analysis was set to 2015 because little data was available for the early years. This notebook uses the originally collected data to create visualizations based on privacy-aware HyperLogLog (HLL) cardinality estimation.
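To illustrate what HLL cardinality estimation does, here is a minimal pure-Python sketch. It is a toy model, not the Citus HLL implementation used later in this notebook: the class name `TinyHLL`, the SHA-1 hash, and the register layout are assumptions made for illustration only. It shows that a sketch keeps only register maxima (no raw IDs) and that two sketches can be merged without access to the original items.

```python
import hashlib
import math


class TinyHLL:
    """Toy HyperLogLog for illustration only (not the Postgres HLL format)."""

    def __init__(self, log2m: int = 11):
        self.log2m = log2m
        self.m = 1 << log2m              # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # 64-bit hash; the item itself is never stored
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h & (self.m - 1)           # low bits select the register
        rest = h >> self.log2m           # remaining bits
        width = 64 - self.log2m
        # rank = number of leading zero bits in `rest`, plus one
        rank = width - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def union(self, other: "TinyHLL") -> "TinyHLL":
        # merging is a register-wise maximum
        merged = TinyHLL(self.log2m)
        merged.registers = [
            max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def cardinality(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # small-range correction (linear counting)
            return self.m * math.log(self.m / zeros)
        return raw


# two overlapping sets of hypothetical user IDs
jan, feb = TinyHLL(), TinyHLL()
for i in range(5000):
    jan.add(f"user-{i}")
for i in range(2500, 7500):
    feb.add(f"user-{i}")
print(round(jan.cardinality()), round(feb.cardinality()))
print(round(jan.union(feb).cardinality()))  # estimate of distinct users in both
```

The union estimate approximates the number of distinct items across both sets, which is why merged sketches, rather than summed estimates, are the appropriate tool whenever the same user can appear in several groups.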

Preparations

In [3]:
OUTPUT = Path.cwd().parents[0] / "out"       # output directory for figures (etc.)
WORK_DIR = Path.cwd().parents[0] / "tmp"     # Working directory
In [4]:
OUTPUT.mkdir(exist_ok=True)
(OUTPUT / "figures").mkdir(exist_ok=True)
(OUTPUT / "svg").mkdir(exist_ok=True)
WORK_DIR.mkdir(exist_ok=True)
In [5]:
%load_ext autoreload
%autoreload 2

Select M for monthly aggregation, Y for yearly aggregation

In [6]:
AGG_BASE = "M"

First, define whether to study usercount or postcount

In [7]:
# METRIC = 'user'
METRIC = 'post'
In [8]:
metric_col = 'post_hll'
if METRIC == 'user':
    metric_col = 'user_hll'

Set global font

In [9]:
plt.rcParams['font.family'] = 'serif'
plt.rcParams['font.serif'] = ['Times New Roman'] + plt.rcParams['font.serif']
In [10]:
color_instagram = '#737EBD'

Load HLL aggregate data

Load the data from CSV, generated in the previous notebook. Data is stored as aggregate HLL data (postcount, usercount) for each month.

In [12]:
%%time
data_files = {
    "MASS_INVASION_ALL":MASS_INVASION_ALL, 
    }
tools.display_file_stats(data_files)
name MASS_INVASION_ALL
size 2.03 MB
records 735
CPU times: user 21.8 ms, sys: 2.93 ms, total: 24.7 ms
Wall time: 24.1 ms
In [13]:
pd.read_csv(MASS_INVASION_ALL, nrows=10)
Out[13]:
year month post_hll user_hll topic_group
0 2015 1 \x138b400ec285429d01 \x138b4037018fa19de2 Bastei
1 2015 1 \x138b40022109420ba10cc11104154218421a811b031c... \x138b4003e10481100110e11481162519e61ba21be31c... Carrick-a-Rede
2 2015 1 \x138b40046105620ae10ec110251323146118e219a11d... \x138b40032906840c0113a116411d81290133e435a13b... Cliffs of Moher
3 2015 1 \x138b4002610ae10ba30ea215621e211e422c2236a338... \x138b400921162521413943434243c145815361546355... Dark Hedges
4 2015 1 \x138b401a42 \x138b404561 Fjadrargljufur
5 2015 1 \x138b4010e5b201f721 \x138b4041817061d281 Giant's Causeway
6 2015 1 \x138b400087058109810c8110e616e518611a411ba224... \x138b40038107620b6116e318821a611bc21c22234326... Lago di Braies
7 2015 1 \x138b4025812a2134c135823ea241845e8164e6762176... \x138b4016a21d2251215a628de294419ec1a0e1aea1bd... Preikestolen
8 2015 1 \x138b40074110611761184247c162c17381c001d501 \x138b407762cbe2d582d8a1e341e361e622e822 Trolltunga
9 2015 1 \x138b4008a11d8226433e825c035ec276a188038f219b... \x138b4001c30d2211612ec22fe19b429e81c5c2f689f8... Valle Verzasca
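The hex strings in the `post_hll` and `user_hll` columns start with a short header. The sketch below decodes that header following my reading of the postgresql-hll storage specification (version/type nibbles, register width, log2 of the register count, sparse flag). The bit layout is stated here as an assumption and should be checked against the specification; `decode_hll_header` is a hypothetical helper, not part of the notebook's `tools` or `hll` modules.

```python
HLL_TYPES = {0: "UNDEFINED", 1: "EMPTY", 2: "EXPLICIT", 3: "SPARSE", 4: "FULL"}


def decode_hll_header(hex_str: str) -> dict:
    """Decode the three header bytes of a hex-encoded HLL value."""
    if hex_str.startswith("\\x"):
        hex_str = hex_str[2:]            # strip the Postgres bytea prefix
    raw = bytes.fromhex(hex_str)
    version_byte, param_byte, cutoff_byte = raw[0], raw[1], raw[2]
    return {
        "schema_version": version_byte >> 4,         # upper nibble
        "type": HLL_TYPES.get(version_byte & 0x0F, "UNKNOWN"),
        "regwidth": (param_byte >> 5) + 1,           # bits per register
        "log2m": param_byte & 0x1F,                  # 2**log2m registers
        "sparse_enabled": bool(cutoff_byte & 0x40),
        "payload_bytes": len(raw) - 3,
    }


# first row of the table above (Bastei, January 2015)
print(decode_hll_header(r"\x138b400ec285429d01"))
```

Under these assumptions, the `\x138b40` prefix seen in most rows corresponds to a sparse set with 2^11 registers of 5 bits each, while a prefix of `\x148b40` would indicate the full (dense) representation.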

Connect hll worker db

For calculating with HLL data, we are using an empty Postgres database with the Citus HLL extension installed. Specifically, we are using pg-hll-empty here, a Docker-postgres container that is prepared for HLL calculation.

If you haven't already, start the container locally alongside Carto-Lab Docker now:

cd pg-hll-empty
docker compose up -d
In [14]:
DB_USER = "hlluser"
DB_PASS = os.getenv('READONLY_USER_PASSWORD')
# set connection variables
DB_HOST = "127.0.0.1"
DB_PORT = "5452"
DB_NAME = "hllworkerdb"

Connect to empty Postgres database running HLL Extension:

In [15]:
DB_CONN = psycopg2.connect(
        host=DB_HOST,
        port=DB_PORT ,
        dbname=DB_NAME,
        user=DB_USER,
        password=DB_PASS
)
DB_CONN.set_session(
    readonly=True)
DB_CALC = tools.DbConn(
    DB_CONN)
CUR_HLL = DB_CONN.cursor()
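The worker database is only used as a calculator: an HLL value is sent as a parameter and the estimate comes back. A minimal sketch of such a query is shown below. `build_cardinality_query` is a hypothetical helper, and the exact SQL used inside the notebook's `hll.cardinality_hll` may differ; only `hll_cardinality()` from the HLL extension and the standard psycopg2 `cursor.execute(sql, params)` call are assumed.

```python
from typing import Tuple


def build_cardinality_query(hll_hex: str) -> Tuple[str, Tuple[str]]:
    """Return a parameterized query that asks Postgres for the HLL estimate."""
    # the value is passed as a parameter, never formatted into the SQL string
    sql = "SELECT hll_cardinality(%s::hll)::int;"
    return sql, (hll_hex,)


sql, params = build_cardinality_query(r"\x138b401a42")
print(sql, params)

# With a running pg-hll-empty container, the query could be sent like this:
# CUR_HLL.execute(sql, params)
# estimate = CUR_HLL.fetchone()[0]
```

Because the query is a plain `SELECT`, it also works on the read-only session configured above.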


Calculate HLL Cardinality per month and year

Define additional functions for reading and formatting CSV as pd.DataFrame

In [16]:
from datetime import datetime

def read_csv_datetime(csv: Path) -> pd.DataFrame:
    """Read CSV with parsing datetime index (months)
    
        First CSV column: Year
        Second CSV column: Month
    """
    date_cols = ["year", "month"]
    df = pd.read_csv(
        csv, index_col='datetime', 
        parse_dates={'datetime':date_cols},
        date_format='%Y %m',
        keep_date_col=True)  # keep year/month; they are dropped explicitly below
    df.drop(columns=date_cols, inplace=True)
    return df
    
def append_cardinality_df(df: pd.DataFrame, hll_col: str = "post_hll", cardinality_col: str = 'postcount_est'):
    """Calculate cardinality from HLL and append to extra column in df"""
    df[cardinality_col] = df.apply(
        lambda x: hll.cardinality_hll(
           x[hll_col], CUR_HLL),
        axis=1)
    df.drop(columns=[hll_col], inplace=True)
    return df

def filter_fill_time(
        df: pd.DataFrame, min_year: int, 
        max_year: int, val_col: str = "postcount_est",
        min_month: str = "01", max_month: str = "01", agg_base: str = None,
        agg_method = None):
    """Filter time values between min - max year and fill missing values"""
    max_day = "01"
    if agg_base is None:
        agg_base = "M"
    elif agg_base == "Y":
        max_month = "12"
        max_day = "31"
    min_date = pd.Timestamp(f'{min_year}-{min_month}-01')
    max_date = pd.Timestamp(f'{max_year}-{max_month}-{max_day}')
    # make sure start and end date exist in the index
    if min_date not in df.index:
        df.loc[min_date, val_col] = 0
    if max_date not in df.index:
        df.loc[max_date, val_col] = 0
    df.sort_index(inplace=True)
    # mask min and max time
    time_mask = ((df.index >= min_date) & (df.index <= max_date))
    resampled = df.loc[time_mask][val_col].resample(agg_base)
    if agg_method is None:
        series = resampled.sum()
    elif agg_method == "count":
        series = resampled.count()
    elif agg_method == "nunique":
        series = resampled.nunique()
    else:
        raise ValueError(f"Unsupported agg_method: {agg_method}")
    # fill missing months with 0
    # this will also set the day to max of month
    return series.fillna(0).to_frame()
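The gap-filling idea behind `filter_fill_time` can be illustrated without pandas. The sketch below is a simplified stand-in, not the function above: `fill_missing_months` is a hypothetical name, and a dictionary keyed by `(year, month)` tuples replaces the DatetimeIndex. Months inside the range that have no record receive 0, and records outside the range are dropped.

```python
from typing import Dict, Tuple

YearMonth = Tuple[int, int]


def fill_missing_months(
        counts: Dict[YearMonth, int],
        start: YearMonth, end: YearMonth) -> Dict[YearMonth, int]:
    """Return one entry per month from start to end (inclusive)."""
    filled = {}
    year, month = start
    while (year, month) <= end:
        # months without a record are set to 0
        filled[(year, month)] = counts.get((year, month), 0)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return filled


observed = {(2014, 12): 10, (2015, 1): 349, (2015, 3): 493}
print(fill_missing_months(observed, start=(2015, 1), end=(2015, 4)))
```

A continuous series without gaps is what allows the plots below to use evenly spaced positions on the time axis.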

Select dataset to process below

Apply functions to all data sets.

  • Read from CSV
  • Calculate cardinality
  • Merge year and month into a single column
  • Filter to the 2015 - 2019 range, fill missing values
In [17]:
def process_dataset(
        dataset: Path = None, metric: str = None, df_post: pd.DataFrame = None,
        min_year: int = None, max_year: int = None, agg_base: str = None) -> pd.DataFrame:
    """Apply temporal filter/pre-processing to all data sets."""
    if metric is None:
        metric = 'post_hll'
        warn(f"Using default value {metric}")
    if metric == 'post_hll':
        cardinality_col = 'postcount_est'
    else:
        cardinality_col = 'usercount_est'
    if min_year is None:
        min_year = 2015
    if max_year is None:
        max_year = 2019
    if df_post is None:
        df_post = read_csv_datetime(dataset)
    df_post = append_cardinality_df(df_post, metric, cardinality_col)
    return filter_fill_time(df_post, min_year, max_year, cardinality_col, agg_base=agg_base)
In [18]:
%%time
df_post = process_dataset(MASS_INVASION_ALL, agg_base=AGG_BASE, metric='post_hll')
CPU times: user 35.7 ms, sys: 22.3 ms, total: 57.9 ms
Wall time: 138 ms
In [19]:
df_post.head(5)
Out[19]:
postcount_est
datetime
2015-01-31 349
2015-02-28 368
2015-03-31 493
2015-04-30 547
2015-05-31 825
In [20]:
%%time
df_user = process_dataset(MASS_INVASION_ALL, metric=metric_col, agg_base=AGG_BASE)
CPU times: user 25.2 ms, sys: 26.5 ms, total: 51.6 ms
Wall time: 130 ms
In [21]:
df_user.head(5)
Out[21]:
postcount_est
datetime
2015-01-31 349
2015-02-28 368
2015-03-31 493
2015-04-30 547
2015-05-31 825

Visualize Cardinality

Instagram

Define plot function.

In [22]:
def fill_plot_time(
        df: pd.DataFrame, ax: Axes, color: str,
        label: str, val_col: str = "postcount_est") -> Axes:
    """Matplotlib Barplot with time axis formatting

    If "significant" in df columns, applies different colors to fill/edge
    of non-significant values.
    """
    if color is None:
        colors = sns.color_palette("vlag", as_cmap=True, n_colors=2)
        color_rgba = colors([1.0])[0]
        color = mcolor.rgb2hex((color_rgba), keep_alpha=True)
    color_significant = color
    color_significant_edge = "white"
    if "significant" in df.columns:
        colors_bar = {True: color, False: "white"}
        color_significant = df['significant'].replace(colors_bar)
        colors_edge = {True: "white", False: "black"}
        color_significant_edge = df['significant'].replace(colors_edge)
    df_plot = df.set_index(
        df.index.map(lambda s: s.strftime('%Y')))
    ax = df_plot.plot(
            ax=ax, y=val_col, color=color_significant,
            label=label, linewidth=0.5, alpha=0.6)
    ax.fill_between(range(len(df_plot.index)), df_plot[val_col], facecolor=color, alpha=0.6)
    return ax

def replace_hyphen(x):
    """Change hyphen (-) into a minus sign (−, U+2212)
    according to editor request
    """
    return x.replace('-','−')
    
@mticker.FuncFormatter
def major_formatter(x, pos):
    """Limit thousands separator (comma) to five or more digits
    according to editor request
    """
    s = format(x, '.0f')
    if len(s) <= 4:
        return replace_hyphen(s)
    return replace_hyphen(format(x, ',.0f'))
    
def plot_time(
        df: Tuple[pd.DataFrame, pd.DataFrame], title, color = None, filename = None, 
        output = OUTPUT, legend: str = "Postcount", val_col: str = None,
        trend: bool = None, seasonal: bool = None, residual: bool = None,
        agg_base: str = None, fig = None, ax = None, return_fig_ax = None):
    """Create dataframe(s) time plot"""
    x_ticks_every = 12
    fig_x = 10
    fig_y = 2
    font_mod = True
    x_label = "Year"
    linewidth = 3
    if agg_base and agg_base == "Y":
        x_ticks_every = 1
        fig_x = 3
        fig_y = 1.5
        font_mod = True
        x_label = "Year"
        linewidth = 1
    if fig is None or ax is None:
        fig, ax = plt.subplots()
        fig.set_size_inches(fig_x, fig_y)
    ylabel = f'{legend}'
    if val_col is None:
        val_col = f'{legend.lower()}_est'
    ax = fill_plot_time(
        df=df, ax=ax, color=color, val_col=val_col, label=legend)

    # TODO: below is a bit hacky way to format the x-axis;
    tick_loc = mticker.MultipleLocator(x_ticks_every)
    ax.xaxis.set_major_locator(tick_loc)
    ax.tick_params(axis='x', rotation=45, length=0.5)
    ax.yaxis.set_major_formatter(major_formatter)
    xrange_min_max = range(2015, 2021)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        ax.set_xticklabels(xrange_min_max, rotation=45)
    
    ax.set(xlabel=x_label, ylabel=ylabel)
    ax.spines["left"].set_linewidth(0.25)
    ax.spines["bottom"].set_linewidth(0.25)
    ax.spines["top"].set_linewidth(0)
    ax.spines["right"].set_linewidth(0)
    ax.yaxis.set_tick_params(width=0.5)
    # remove legend
    ax.get_legend().remove()
    ax.set_title(title)
    ax.set_xlim(-0.5, len(df)-0.5)
    ax.set_ylim(bottom=0)
    if font_mod:
        for item in (
            [ax.xaxis.label, ax.title, ax.yaxis.label] +
             ax.get_xticklabels() + ax.get_yticklabels()):
                item.set_fontsize(8)
    # store figure to file
    if filename:
        fig.savefig(
            output / "figures" / f"{filename}.png", dpi=300, format='PNG',
            bbox_inches='tight', pad_inches=1, facecolor="white")
        # also save as svg
        fig.savefig(
            output / "svg" / f"{filename}.svg", format='svg',
            bbox_inches='tight', pad_inches=1, facecolor="white")
    if return_fig_ax:
        return fig, ax
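The tick label rule implemented by `major_formatter` and `replace_hyphen` can be tried out without Matplotlib. The sketch below mirrors that logic in a single function; `format_tick` is a hypothetical name used only for this illustration.

```python
def format_tick(x: float) -> str:
    """Format a tick value: comma separator only for five or more characters."""
    s = format(x, ".0f")
    if len(s) > 4:
        # long labels get a thousands separator
        s = format(x, ",.0f")
    # typographic minus sign (U+2212) instead of a hyphen
    return s.replace("-", "\u2212")


for value in (825, 9999, 10000, 1234567, -500):
    print(format_tick(value))
```

Note that the length check counts the sign as a character, so a negative four-digit value also receives a separator.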
In [23]:
def load_and_plot(
        dataset: Path = None, metric: str = None, src_ref: str = "flickr", colors: cm.colors.ListedColormap = None,
        agg_base: str = None, trend: bool = None, return_df: bool = None, df_post: pd.DataFrame = None, return_fig_ax = None):
    """Load data and plot"""
    if metric is None:
        metric = 'post_hll'
    if metric == 'post_hll':
        metric_label = 'postcount'
    else:
        metric_label = 'usercount'
    if colors is None:
        colors = sns.color_palette("vlag", as_cmap=True, n_colors=2)
        colors = colors([1.0])
    df = process_dataset(dataset, metric=metric, agg_base=agg_base, df_post=df_post)
    fig, ax = plot_time(
        df, legend=metric_label.capitalize(), color=colors,
        title=f'{src_ref}', 
        filename=f"temporal_{metric_label}_{src_ref}_absolute", trend=trend, agg_base=agg_base, return_fig_ax=True)
    fig.show()
    if return_fig_ax:
        return fig, ax
    if return_df:
        return df
In [24]:
colors = sns.color_palette("vlag", as_cmap=True, n_colors=2)
In [25]:
fig, ax = load_and_plot(MASS_INVASION_ALL, src_ref=f"Mass invasion total posts 2015-2019, all 13 places", agg_base=AGG_BASE, trend=False, metric=metric_col, return_fig_ax=True)
fig.show()

Visualize individual places

Monthly

In [26]:
df = pd.read_csv(MASS_INVASION_ALL)
df.tail()
Out[26]:
year month post_hll user_hll topic_group
729 2019 10 \x138b40018101c3032109e20b810c210fa2114512e114... \x138b4000c10442054107e20a41108113a114c515a716... Rakotzbrücke
730 2019 10 \x138b40004400820161048104c304e20661078108a108... \x138b4000610241028103a105e706a1072107a207c207... Trolltunga
731 2019 10 \x138b4000240081020503c106e407e1084109620a020a... \x138b4000a1012102c103a20403044105c10721084108... Valle Verzasca
732 2019 11 \x148b4000040000600000000020008250000100000080... \x148b4000401180410880000400004400000000020000... Bastei
733 2019 11 \x138b40004104e206c2120115e118881c221ec4216122... \x138b4007a20ae20f621164128113e123e2304241614f... Cliffs of Moher
In [27]:
mass_invasion_places = df['topic_group'].unique()
In [28]:
mass_invasion_places
Out[28]:
array(['Bastei', 'Carrick-a-Rede', 'Cliffs of Moher', 'Dark Hedges',
       'Fjadrargljufur', "Giant's Causeway", 'Lago di Braies',
       'Preikestolen', 'Trolltunga', 'Valle Verzasca', 'Obersee',
       'Rakotzbrücke', 'Hängebrücke Geierlay'], dtype=object)
In [29]:
def plot_bars(
        df: pd.DataFrame, ax: Axes = None, title: str = None):
    """Plot mean post count per calendar month as a bar chart"""
    bar_param = {
        "width":1.0,
        "label":"Mass invasion total post count aggregated for months",
        "edgecolor":"white",
        "linewidth":0.5,
        "alpha":1.0
    }
    # create figure
    if not ax:
        fig, ax = plt.subplots(1, 1, figsize=(3, 1.5))
    # plot
    df.groupby(df.index.month)["postcount_est"] \
        .mean().plot.bar(ax=ax, color=color_instagram, y="postcount_est", **bar_param)
    # format
    ax.set_xlim(-0.5,11.5)
    month_names = ['Jan','Feb','Mar','Apr','May','Jun',
                   'Jul','Aug','Sep','Oct','Nov','Dec'] 
    ax.set_xticklabels(month_names)
    ax.tick_params(axis='x', rotation=45, length=0) # length: of ticks
    ax.spines["left"].set_linewidth(0.25)
    ax.spines["bottom"].set_linewidth(0.25)
    ax.spines["top"].set_linewidth(0)
    ax.spines["right"].set_linewidth(0)
    ax.yaxis.set_tick_params(width=0.5)
    ax.set(xlabel="", ylabel="")
    if not title:
        title = "Post Count per Month (mean)"
    ax.set_title(title, y=-0.2, pad=-14)
    for item in (
        [ax.xaxis.label, ax.title, ax.yaxis.label] +
         ax.get_xticklabels() + ax.get_yticklabels()):
            item.set_fontsize(8)
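The seasonal bars produced by `plot_bars` come from grouping the monthly series by calendar month and averaging across years. A plain-Python sketch of that grouping step is given below, assuming a dictionary keyed by `(year, month)` instead of a DataFrame; `mean_by_calendar_month` is a hypothetical helper and the numbers are made up.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Tuple


def mean_by_calendar_month(
        monthly_counts: Dict[Tuple[int, int], float]) -> Dict[int, float]:
    """Average the values of each calendar month (1-12) across all years."""
    grouped = defaultdict(list)
    for (_year, month), count in monthly_counts.items():
        grouped[month].append(count)
    return {month: mean(values) for month, values in sorted(grouped.items())}


example = {(2015, 1): 100, (2016, 1): 300, (2015, 7): 1000, (2016, 7): 2000}
print(mean_by_calendar_month(example))
```

Zero-filled months enter the mean as well, so a fill range wider than the data lowers the absolute bar heights while the relative seasonal shape stays largely the same.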
In [30]:
top_50_cnt = 13
top_50 = mass_invasion_places.tolist()
In [31]:
top_50.sort()
In [32]:
top_50
Out[32]:
['Bastei',
 'Carrick-a-Rede',
 'Cliffs of Moher',
 'Dark Hedges',
 'Fjadrargljufur',
 "Giant's Causeway",
 'Hängebrücke Geierlay',
 'Lago di Braies',
 'Obersee',
 'Preikestolen',
 'Rakotzbrücke',
 'Trolltunga',
 'Valle Verzasca']
In [33]:
df_post = read_csv_datetime(MASS_INVASION_ALL)
# create figure object with multiple subplots
fig, axes = plt.subplots(nrows=int(top_50_cnt/3), ncols=4, figsize=(14, 11))
fig.subplots_adjust(hspace=.5) # adjust vertical space, to allow title below plot
# iterate places
for ix, ax in enumerate(axes.reshape(-1)):
    if ix >= len(top_50):
        ax.set_axis_off()
        continue
    np_str = top_50[ix]
    # filter np_str and calculate cardinality
    df_post_filter = df_post[df_post["topic_group"]==np_str].copy()
    df_post_filter = append_cardinality_df(df_post_filter, 'post_hll', 'postcount_est')
    df_post_filter = filter_fill_time(df_post_filter, 2010, 2023, 'postcount_est')
    # plot bars individually
    plot_bars(
        df=df_post_filter, title=np_str, ax=ax)
In [34]:
tools.save_fig(fig, output=OUTPUT, name="barplot_massinvasion_seasonal")
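The subplot grid above is created with `nrows=int(top_50_cnt/3)` and `ncols=4`, which gives 4 x 4 = 16 axes for 13 places; the loop switches off the three unused axes. A sketch of a more general way to derive the grid for a fixed number of columns follows. `grid_shape` is a hypothetical helper, not part of the notebook.

```python
import math
from typing import Tuple


def grid_shape(n_items: int, ncols: int = 4) -> Tuple[int, int, int]:
    """Return (nrows, ncols, unused) for a grid that fits n_items subplots."""
    nrows = math.ceil(n_items / ncols)   # enough rows for all items
    unused = nrows * ncols - n_items     # axes to hide with set_axis_off()
    return nrows, ncols, unused


print(grid_shape(13))
```

Deriving the row count from the column count keeps the layout correct if the list of places grows or shrinks.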

Yearly

In [35]:
def bar_plot_time(
        df: pd.DataFrame, ax: Axes, color: str,
        label: str, val_col: str = "postcount_est") -> Axes:
    """Matplotlib Barplot with time axis formatting

    If "significant" in df columns, applies different colors to fill/edge
    of non-significant values.
    """
    if color is None:
        color = color_instagram
    color_significant = color
    color_significant_edge = "white"
    if "significant" in df.columns:
        colors_bar = {True: color, False: "white"}
        color_significant = df['significant'].replace(colors_bar)
        colors_edge = {True: "white", False: "black"}
        color_significant_edge = df['significant'].replace(colors_edge)
    bar_param = {
        "width":1.0,
        "label":label,
        "edgecolor":"white",
        "linewidth":0.5,
        "alpha":1.0
    }
    # use the color and value column passed to the function
    ax = df.set_index(
         df.index.map(lambda s: s.strftime('%Y'))).plot.bar(
            ax=ax, color=color, y=val_col, **bar_param)
    return ax

def plot_time(
        df: Tuple[pd.DataFrame, pd.DataFrame], title, color = None, filename = None, 
        output = OUTPUT, legend: str = "Postcount", val_col: str = None,
        trend: bool = None, seasonal: bool = None, residual: bool = None,
        agg_base: str = None, ax = None):
    """Create dataframe(s) time plot"""
    x_ticks_every = 12
    fig_x = 15.7
    fig_y = 4.27
    font_mod = False
    x_label = ""
    linewidth = 3
    if agg_base and agg_base == "Y":
        x_ticks_every = 1
        fig_x = 3
        fig_y = 1.5
        font_mod = True
        x_label = ""
        linewidth = 1
    if ax is None:
        fig, ax = plt.subplots()
        fig.set_size_inches(fig_x, fig_y)
    ylabel = ""
    if val_col is None:
        val_col = f'{legend.lower()}_est'
    ax = bar_plot_time(
        df=df, ax=ax, color=color, val_col=val_col, label=legend)
    # format
    tick_loc = mticker.MultipleLocator(x_ticks_every)
    ax.xaxis.set_major_locator(tick_loc)
    ax.tick_params(axis='x', rotation=45, length=0)
    ax.yaxis.set_major_formatter(major_formatter)
    ax.set(xlabel=x_label, ylabel=ylabel)
    ax.spines["left"].set_linewidth(0.25)
    ax.spines["bottom"].set_linewidth(0.25)
    ax.spines["top"].set_linewidth(0)
    ax.spines["right"].set_linewidth(0)
    ax.yaxis.set_tick_params(width=0.5)
    # remove legend
    ax.get_legend().remove()
    ax.set_title(title, y=-0.2, pad=-14)
    ax.set_xlim(-0.5,len(df)-0.5)
    for item in (
        [ax.xaxis.label, ax.title, ax.yaxis.label] +
         ax.get_xticklabels() + ax.get_yticklabels()):
            item.set_fontsize(8)
In [36]:
df_post = read_csv_datetime(MASS_INVASION_ALL)
# create figure object with multiple subplots
fig, axes = plt.subplots(nrows=int(top_50_cnt/3), ncols=4, figsize=(17, 8))
fig.subplots_adjust(hspace=.5) # adjust vertical space, to allow title below plot
# iterate places
for ix, ax in enumerate(axes.reshape(-1)):
    if ix >= len(top_50):
        ax.set_axis_off()
        continue
    np_str = top_50[ix]
    # filter np_str and calculate cardinality
    df_post_filter = df_post[df_post["topic_group"]==np_str].copy()
    df_post_filter = append_cardinality_df(df_post_filter, 'post_hll', 'postcount_est')
    df_post_filter = filter_fill_time(df_post_filter, 2015, 2019, 'postcount_est', max_month=8)
    # plot bars individually
    plot_time(
        df=df_post_filter, title=np_str, ax=ax)
In [37]:
tools.save_fig(fig, output=OUTPUT, name="barplot_massinvasion_trend")

Create notebook HTML

In [1]:
!jupyter nbconvert --to html_toc \
    --output-dir=../resources/html/ ./01_mass_invasion.ipynb \
    --output 01_mass_invasion \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False >&- 2>&-

IOER RDC Jupyter Base Template v0.10.0