Temporal chi visualization: Milvus milvus & Aves (iNaturalist)

Alexander Dunkel, Leibniz Institute of Ecological Urban and Regional Development,
Transformative Capacities & Research Data Centre (IÖR-FDZ)

Publication:
Dunkel, A., Burghardt, D. (2024). Assessing perceived landscape change from opportunistic spatio-temporal occurrence data. Land 2024

Out[152]:

Last updated: Jul-19-2024, Carto-Lab Docker Version 0.14.0

Chi visualization of temporal patterns for ephemeral events. This notebook builds upon a previous notebook.

The chi equation requires a specific query (such as all photographs of the Red Kite, Milvus milvus, in Europe from iNaturalist) and a generic query to correct for system-wide sampling effects (such as the growing number of iNaturalist contributors). However, different user groups follow different distribution patterns over time and space on geosocial media, which makes it difficult to select a suitable generic query. Rapacciuolo et al. (2021) emphasize that comparisons should be based on user selections from umbrella groups. Here, we explore whether chi for Milvus milvus photographers, compared to the umbrella group of "all bird photographers" on iNaturalist in Europe, increases or decreases between 2010 and 2022. Given the increasing population of Milvus milvus, we would expect chi to increase, as it becomes more likely that bird photographers can photograph the Red Kite.

Rapacciuolo, G., Young, A., & Johnson, R. (2021). Deriving indicators of biodiversity change from unstructured community‐contributed data. Oikos, 130(8), 1225–1239. https://doi.org/10.1111/oik.08215

For the selection of the generic query, we used the emoji-taxa mappings made available at lbsn.vgiscience.org:

The SQL syntax:


SELECT *
FROM topical.post t1
WHERE t1.origin_id = 23 and '🐦'=ANY(emoji);

This selected 9,147,013 bird photographs (about 17%) from a total of 53,155,437 nature and plant observations on iNaturalist (2010-2022).

This dataset is filtered spatially below, then used in the chi equation, and the result is visualized as a bar plot.

Preparations

In [154]:
OUTPUT = Path.cwd().parents[0] / "out"       # output directory for figures (etc.)
WORK_DIR = Path.cwd().parents[0] / "tmp"     # Working directory
In [155]:
OUTPUT.mkdir(exist_ok=True)
(OUTPUT / "figures").mkdir(exist_ok=True)
(OUTPUT / "svg").mkdir(exist_ok=True)
WORK_DIR.mkdir(exist_ok=True)
In [156]:
%load_ext autoreload
%autoreload 2
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Select M for monthly aggregation, Y for yearly aggregation

In [157]:
AGG_BASE = "Y"

First, define whether to study usercount or postcount

In [158]:
# METRIC = 'user'
METRIC = 'post'
In [159]:
metric_col = 'post_hll'
if METRIC == 'user':
    metric_col = 'user_hll'

Set global font

In [160]:
plt.rcParams['font.family'] = 'serif'
plt.rcParams['font.serif'] = ['Times New Roman'] + plt.rcParams['font.serif']

Set global color

In [161]:
color_flickr = '#F89F5E'
color_inaturalist = '#ED6F53'

Load HLL aggregate data

Load the data from the CSV files generated in the previous notebook. The data is stored as aggregate HLL data (postcount, usercount) for each month.

In [163]:
%%time
data_files = {
    "INATURALIST_ALL":INATURALIST_ALL,
    "INATURALIST_ALL_MILVUSRANGE":INATURALIST_ALL_MILVUSRANGE,
    "INATURALIST_ALL_AVES":INATURALIST_ALL_AVES_MILVUSRANGE,    
    }
tools.display_file_stats(data_files)
name     INATURALIST_ALL  INATURALIST_ALL_MILVUSRANGE  INATURALIST_ALL_AVES
size     989.88 KB        1.24 MB                      555.78 KB
records  198              702                          198
CPU times: user 5.8 ms, sys: 0 ns, total: 5.8 ms
Wall time: 5.37 ms
In [164]:
pd.read_csv(INATURALIST_ALL_AVES_MILVUSRANGE, nrows=10)
Out[164]:
year month post_hll user_hll
0 2007 1 \x138b40026402a204010542096210a116221a811b011c... \x138b400841136518e12681276228e1296232c355a15c...
1 2007 2 \x138b4004a105a30e810f45106211c115c1166519c11a... \x138b400501296232c33a413ae14b214ca150e155a16c...
2 2007 3 \x138b40014101a2066309210a220a810ac20b050be20c... \x138b4002210282070116611862276229622e8332c33a...
3 2007 4 \x138b40016101a203e10463058306c107e1084308810b... \x138b400421070108640b010fe31342252125e426a127...
4 2007 5 \x138b40022102a302e304a5078207a808a10b210c410d... \x138b4007010fa11ba526a1276228022b012da12e8332...
5 2007 6 \x138b40002100a2018102c10321048405230683072107... \x138b4004830864176318a118e11ba126c4276229622b...
6 2007 7 \x138b4000c10141024203e104a204c1058208e30a230e... \x138b40070108640b010b840c2411e112e1156215c118...
7 2007 8 \x138b400121088609810bc30e6114211662178217c218... \x138b4008640b0111e11502186218a11d031dc1224122...
8 2007 9 \x138b4000a1030309810c2313041c0123a424a126a127... \x138b400701094309640fe3142118a127622e832fe150...
9 2007 10 \x138b40040204c308e109430a41164117c119a227e130... \x138b40042104810fe32da12e832fe13a413da13f413f...

Connect hll worker db

For calculating with HLL data, we use an empty Postgres database with the Citus HLL extension installed. Specifically, we use pg-hll-empty here, a Dockerized Postgres container prepared for HLL calculation. Any Postgres instance can be used, as long as the Citus HLL extension is installed.

If you haven't already, start up the container locally next to Carto-Lab Docker now:

cd pg-hll-empty
docker compose up -d
In [165]:
DB_USER = "hlluser"
DB_PASS = os.getenv('READONLY_USER_PASSWORD')
# set connection variables
DB_HOST = "127.0.0.1"
DB_PORT = "5452"
DB_NAME = "hllworkerdb"

Connect to empty Postgres database running HLL Extension:

In [166]:
DB_CONN = psycopg2.connect(
        host=DB_HOST,
        port=DB_PORT ,
        dbname=DB_NAME,
        user=DB_USER,
        password=DB_PASS
)
DB_CONN.set_session(
    readonly=True)
DB_CALC = tools.DbConn(
    DB_CONN)
CUR_HLL = DB_CONN.cursor()
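
As a quick connection test, a single hex-encoded HLL sketch can be evaluated directly through this cursor. This is a minimal sketch of the round trip that the hll.cardinality_hll() helper used below presumably performs; hll_cardinality() is provided by the Postgres HLL extension:

def cardinality_test(hll_set: str, cur=CUR_HLL) -> int:
    """Estimate the cardinality of one hex-encoded HLL sketch via Postgres."""
    cur.execute("SELECT hll_cardinality(%s::hll)", (hll_set,))
    return int(cur.fetchone()[0])

# e.g. with one of the post_hll strings from the CSV preview above
# (use a raw string, since the literal contains backslash escapes):
# cardinality_test(r'\x138b40026402a2...')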


Calculate HLL Cardinality per month and year

Define additional functions for reading and formatting CSV as pd.DataFrame

In [167]:
from datetime import datetime

def read_csv_datetime(csv: Path) -> pd.DataFrame:
    """Read CSV with parsing datetime index (months)
    
        First CSV column: Year
        Second CSV column: Month
    """
    date_cols = ["year", "month"]
    df = pd.read_csv(
        csv, index_col='datetime', 
        parse_dates={'datetime':date_cols},
        date_format='%Y %m',
        keep_date_col=True)  # keep year/month columns; dropped explicitly below
    df.drop(columns=date_cols, inplace=True)
    return df
    
def append_cardinality_df(df: pd.DataFrame, hll_col: str = "post_hll", cardinality_col: str = 'postcount_est'):
    """Calculate cardinality from HLL and append to extra column in df"""
    df[cardinality_col] = df.apply(
        lambda x: hll.cardinality_hll(
           x[hll_col], CUR_HLL),
        axis=1)
    df.drop(columns=[hll_col], inplace=True)
    return df

def filter_fill_time(
        df: pd.DataFrame, min_year: int, 
        max_year: int, val_col: str = "postcount_est",
        min_month: str = "01", max_month: str = "01", agg_base: str = None,
        agg_method = None):
    """Filter time values between min - max year and fill missing values"""
    max_day = "01"
    if agg_base is None:
        agg_base = "M"
    elif agg_base == "Y":
        max_month = "12"
        max_day = "31"
    min_date = pd.Timestamp(f'{min_year}-{min_month}-01')
    max_date = pd.Timestamp(f'{max_year}-{max_month}-{max_day}')
    # clip by start and end date
    if min_date not in df.index:
        df.loc[min_date, val_col] = 0
    if max_date not in df.index:
        df.loc[max_date, val_col] = 0
    df.sort_index(inplace=True)
    # mask min and max time
    time_mask = ((df.index >= min_date) & (df.index <= max_date))
    resampled = df.loc[time_mask][val_col].resample(agg_base)
    if agg_method is None:
        series = resampled.sum()
    elif agg_method == "count":
        series = resampled.count()
    elif agg_method == "nunique":
        series = resampled.nunique()
    # fill missing months with 0
    # this will also set the day to max of month
    return series.fillna(0).to_frame()

Select dataset to process below

Apply functions to all data sets.

  • Read from CSV
  • Calculate cardinality
  • Merge year and month into a single column
  • Filter to the 2007-2022 range and fill missing values
In [168]:
def process_dataset(
        dataset: Path = None, metric: str = None, df_post: pd.DataFrame = None,
        min_year: int = None, max_year: int = None, agg_base: str = None) -> pd.DataFrame:
    """Apply temporal filter/pre-processing to all data sets."""
    if metric is None:
        metric = 'post_hll'
        warn(f"Using default value {metric}")
    if metric == 'post_hll':
        cardinality_col = 'postcount_est'
    else:
        cardinality_col = 'usercount_est'
    if min_year is None:
        min_year = 2007
    if max_year is None:
        max_year = 2022
    if df_post is None:
        df_post = read_csv_datetime(dataset)
    df_post = append_cardinality_df(df_post, metric, cardinality_col)
    return filter_fill_time(df_post, min_year, max_year, cardinality_col, agg_base=agg_base)
In [169]:
%%time
df_post = process_dataset(INATURALIST_ALL_AVES_MILVUSRANGE, agg_base=AGG_BASE, metric='post_hll')
CPU times: user 13.2 ms, sys: 4.74 ms, total: 18 ms
Wall time: 40 ms
In [170]:
df_post.head(5)
Out[170]:
postcount_est
datetime
2007-12-31 1358.0
2008-12-31 1300.0
2009-12-31 1950.0
2010-12-31 2464.0
2011-12-31 2184.0
In [171]:
%%time
df_user = process_dataset(INATURALIST_ALL_AVES_MILVUSRANGE, metric=metric_col, agg_base=AGG_BASE)
CPU times: user 9.94 ms, sys: 6.83 ms, total: 16.8 ms
Wall time: 41.7 ms
In [172]:
df_user.head(5)
Out[172]:
postcount_est
datetime
2007-12-31 1358.0
2008-12-31 1300.0
2009-12-31 1950.0
2010-12-31 2464.0
2011-12-31 2184.0

Visualize Cardinality

Define plot function.

In [173]:
def bar_plot_time(
        df: pd.DataFrame, ax: Axes, color: str,
        label: str, val_col: str = "postcount_est") -> Axes:
    """Matplotlib Barplot with time axis formatting

    If "significant" in df columns, applies different colors to fill/edge
    of non-significant values.
    """
    if color is None:
        # colors = sns.color_palette("vlag", as_cmap=True, n_colors=2)
        # color_rgba = colors([1.0])[0]
        # color = mcolor.rgb2hex((color_rgba), keep_alpha=True)
        color = color_inaturalist
    color_significant = color
    color_significant_edge = "white"
    if "significant" in df.columns:
        colors_bar = {True: color, False: "white"}
        color_significant = df['significant'].replace(colors_bar)
        colors_edge = {True: "white", False: "black"}
        color_significant_edge = df['significant'].replace(colors_edge)
    ax = df.set_index(
        df.index.map(lambda s: s.strftime('%Y'))).plot.bar(
            ax=ax, y=val_col, color=color_significant, width=1.0,
            label=label, edgecolor=color_significant_edge, linewidth=0.5, alpha=1.0)
    return ax

def replace_hyphen(x):
    """Change hyphen (-) into a minus sign (−, “U+2212”) python
    according to editor request
    """
    return x.replace('-','−')
    
@mticker.FuncFormatter
def major_formatter(x, pos):
    """Limit thousands separator (comma) to five or more digits
    according to editor request
    """
    s = format(x, '.0f')
    if len(s) <= 4:
        return replace_hyphen(s)
    return replace_hyphen(format(x, ',.0f'))
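
# Quick check of major_formatter on a few hypothetical tick values
# (FuncFormatter instances are callable; pos is unused here):
#   major_formatter(999, None)   -> '999'
#   major_formatter(1234, None)  -> '1234'   (no separator below five digits)
#   major_formatter(12345, None) -> '12,345'
#   major_formatter(-999, None)  -> '−999'   (true minus sign, U+2212)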

def plot_time(
        df: pd.DataFrame, title, color = None, filename = None,
        output = OUTPUT, legend: str = "Postcount", val_col: str = None,
        trend: bool = None, seasonal: bool = None, residual: bool = None,
        agg_base: str = None):
    """Create dataframe(s) time plot"""
    x_ticks_every = 12
    fig_x = 15.7
    fig_y = 4.27
    font_mod = False
    x_label = "Month"
    linewidth = 3
    if agg_base and agg_base == "Y":
        x_ticks_every = 1
        fig_x = 3
        fig_y = 1.5
        font_mod = True
        x_label = "Year"
        linewidth = 1
    fig, ax = plt.subplots()
    fig.set_size_inches(fig_x, fig_y)
    ylabel = f'{legend}'
    if val_col is None:
        val_col = f'{legend.lower()}_est'
    ax = bar_plot_time(
        df=df, ax=ax, color=color, val_col=val_col, label=legend)
    # x axis ticker formatting
    tick_loc = mticker.MultipleLocator(x_ticks_every)
    ax.xaxis.set_major_locator(tick_loc)
    ax.tick_params(axis='x', rotation=45, length=0)
    ax.yaxis.set_major_formatter(major_formatter)
    ax.set(xlabel=x_label, ylabel=ylabel)
    ax.spines["left"].set_linewidth(0.25)
    ax.spines["bottom"].set_linewidth(0.25)
    ax.spines["top"].set_linewidth(0)
    ax.spines["right"].set_linewidth(0)
    ax.yaxis.set_tick_params(width=0.5)
    # remove legend
    ax.get_legend().remove()
    ax.set_title(title)
    ax.set_xlim(-0.5,len(df)-0.5)
    if font_mod:
        for item in (
            [ax.xaxis.label, ax.title, ax.yaxis.label] +
             ax.get_xticklabels() + ax.get_yticklabels()):
                item.set_fontsize(8)
    if any([trend, seasonal, residual]):
        # seasonal decompose
        decomposition = sm.tsa.seasonal_decompose(
            df[val_col], model='additive')
        # plot trend part only
        if trend:
            plt.plot(list(decomposition.trend), color='black',
                label="Trend", linewidth=linewidth, alpha=0.8)
        if seasonal:
            plt.plot(list(decomposition.seasonal), color='black', linestyle='dotted',
                label="Seasonal", linewidth=1, alpha=0.8)
        if residual:
            plt.plot(list(decomposition.resid), color='black', linestyle='dashed',
                label="Residual", linewidth=1, alpha=0.8)
        # trend.plot(ax=ax)
    # store figure to file
    if filename:
        fig.savefig(
            output / "figures" / f"{filename}.png", dpi=300, format='PNG',
            bbox_inches='tight', pad_inches=1, facecolor="white")
        # also save as svg
        fig.savefig(
            output / "svg" / f"{filename}.svg", format='svg',
            bbox_inches='tight', pad_inches=1, facecolor="white")
In [174]:
def load_and_plot(
        dataset: Path = None, metric: str = None, src_ref: str = "flickr", colors: cm.colors.ListedColormap = None,
        agg_base: str = None, trend: bool = None, return_df: bool = None, df_post: pd.DataFrame = None,):
    """Load data and plot"""
    if metric is None:
        metric = 'post_hll'
    if metric == 'post_hll':
        metric_label = 'postcount'
    else:
        metric_label = 'usercount'
    if colors is None:
        # colors = sns.color_palette("vlag", as_cmap=True, n_colors=2)
        # colors = colors([1.0])
        colors = color_inaturalist
    df = process_dataset(dataset, metric=metric, agg_base=agg_base, df_post=df_post)
    plot_time(
        df, legend=metric_label.capitalize(), color=colors,
        title=f'{src_ref.capitalize().replace(" ", "_")} {metric_label} over time', 
        filename=f"temporal_{metric_label}_{src_ref}_absolute", trend=trend, agg_base=agg_base)
    if return_df:
        return df
In [175]:
colors = sns.color_palette("vlag", as_cmap=True, n_colors=2)
In [176]:
load_and_plot(INATURALIST_ALL_AVES_MILVUSRANGE, src_ref=f"iNaturalist Aves + Milvus milvus range", agg_base=AGG_BASE, trend=False, metric=metric_col)

Plot iNaturalist Milvus range

In [177]:
load_and_plot(INATURALIST_ALL_MILVUSRANGE, src_ref=f"iNaturalist All + Milvus milvus range", agg_base=AGG_BASE, trend=False, metric=metric_col)

Repeat for all iNaturalist data

In [178]:
load_and_plot(INATURALIST_ALL, src_ref=f"inaturalist_{AGG_BASE}", agg_base=AGG_BASE, trend=False)
load_and_plot(INATURALIST_ALL, src_ref=f"inaturalist_{AGG_BASE}", metric="user_hll", colors=colors([0.0]), agg_base=AGG_BASE, trend=False)

Calculate the Chi-value

Prepare Chi

This is adapted from notebook three of the original publication.

First, define the input parameter:

  • dof: the degrees of freedom, calculated as (number of variables − 1), with the variables here being observed and expected posts
  • chi_crit_val: given a dof of 1 and a significance level of 0.05, the critical value for rejecting the null hypothesis (H0) is 3.84 (see the formula sketch after this list)
  • chi_column: the chi calculation is based on the column selected via METRIC above (postcount_est here); the notebook can also be run for usercount or userdays
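
For reference, the signed chi statistic implemented in calc_norm() and chi_calc() below can be written as

$$
\chi_t = \frac{f \cdot O_t - E_t}{\sqrt{E_t}}, \qquad f = \frac{\sum_t E_t}{\sum_t O_t}
$$

where $O_t$ and $E_t$ are the observed and expected counts per aggregation period $t$, and $f$ normalises the observed series to the total volume of the expected series. Periods with $|\chi_t| >$ CHI_CRIT_VAL are flagged as significant.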
In [179]:
DOF = 1
CHI_CRIT_VAL = 3.84
CHI_COLUMN: str = f"{METRIC}count_est"
In [180]:
def calc_norm(
    df_expected: pd.DataFrame,
    df_observed: pd.DataFrame,
    chi_column: str = CHI_COLUMN):
    """Fetch the number of data points for the observed and 
    expected dataset by the relevant column
    and calculate the normalisation value
    """
    v_expected = df_expected[chi_column].sum()
    v_observed = df_observed[chi_column].sum()
    norm_val = (v_expected / v_observed)
    return norm_val
In [181]:
def chi_calc(x_observed: float, x_expected: float, x_normalized: float) -> float:
    """Apply chi calculation based on observed (normalized) and expected value"""
    value_observed_normalised = x_observed * x_normalized
    a = value_observed_normalised - x_expected
    b = math.sqrt(x_expected)
    # native division with division by zero protection
    chi_value = a / b if b else 0
    return chi_value
    
def apply_chi_calc(
        df: pd.DataFrame, norm_val: float,
        chi_column: str = CHI_COLUMN, chi_crit_val: float = CHI_CRIT_VAL):
    """Calculate chi-values based on two GeoDataFrames (expected and observed values)
    and return new grid with results"""
    # lambda: apply function chi_calc() to each item
    df['chi_value'] = df.apply(
        lambda x: chi_calc(
           x[chi_column],
           x[f'{chi_column}_expected'],
           norm_val),
        axis=1)
    # add significant column, default False
    df['significant'] = False
    # calculate significance for both negative and positive chi_values
    df.loc[np.abs(df['chi_value'])>chi_crit_val, 'significant'] = True

Visualize Chi for subqueries

Proportion of "Milvus milvus" observations vs. all iNaturalist observations

First load the worldwide dataset

In [182]:
src = Path.cwd().parents[0] / "00_data" / "milvus" / "observations-350501.csv"
In [183]:
load_inat_kwargs = {
    "filepath_or_buffer":src,
    "index_col":'datetime', 
    "parse_dates":{'datetime':["observed_on"]},
    "date_format":'%Y-%m-%d',
    "keep_date_col":'False',
    "usecols":["id", "observed_on", "user_id"]
}
df = pd.read_csv(**load_inat_kwargs)
In [184]:
df.drop(columns=['observed_on'], inplace=True)
In [185]:
df.head()
Out[185]:
id user_id
datetime
2010-05-22 7793 446
2011-01-01 9495 351
2011-01-01 9496 351
2011-04-17 14694 654
2011-10-05 50134 4130
In [186]:
val_col = "id"
agg_method = "count"
metric_label="observations"
if METRIC == "user":
    val_col = "user_id"
    agg_method = "nunique"
    metric_label="observers"
In [187]:
df_milvus = filter_fill_time(
    df, 2007, 2022, val_col=val_col, agg_base=AGG_BASE, agg_method=agg_method)
In [188]:
df_milvus.rename(columns={val_col: metric_label}, inplace=True)
In [189]:
src_ref="iNaturalist Milvus milvus"
In [190]:
plot_time(
        df_milvus, legend=metric_label.capitalize(),
        title=f'{src_ref.capitalize()} {metric_label} over time\n worldwide', val_col=metric_label,
        filename=f"temporal_iNaturalist_milvusmilvus_absolute", trend=False, agg_base=AGG_BASE)
In [191]:
df_milvus.sum()
Out[191]:
observations    15896
dtype: int64

Limit to Milvus milvus range

In [192]:
CRS_WGS = "epsg:4326" # WGS1984
CRS_PROJ = "esri:54009" # Mollweide
In [193]:
load_inat_kwargs["usecols"] = ["id", "user_id", "observed_on", "longitude", "latitude"]
df = pd.read_csv(**load_inat_kwargs)
In [194]:
load_inat_kwargs['usecols'] = None
In [195]:
df.dropna(subset=['longitude', 'latitude'], inplace=True)
In [196]:
df
Out[196]:
id observed_on user_id latitude longitude
datetime
2010-05-22 7793 2010-05-22 446 52.440923 -3.953872
2011-01-01 9495 2011-01-01 351 51.185595 -1.212616
2011-01-01 9496 2011-01-01 351 51.280226 -0.925083
2011-04-17 14694 2011-04-17 654 51.814146 -0.192910
2011-08-03 50348 2011-08-03 4130 50.406329 8.873863
... ... ... ... ... ...
2023-07-02 178281936 2023-07-02 5483063 47.661184 8.854844
2023-07-29 178325879 2023-07-29 1859910 51.047883 10.996382
2013-06-09 178326364 2013-06-09 6476964 51.454250 -0.884408
2023-08-12 178333330 2023-08-12 4228845 47.961332 16.769473
2023-08-13 178333356 2023-08-13 186190 45.129853 7.695633

20106 rows × 5 columns

Intersect with the spatial range of Milvus milvus that is available as KML from iNaturalist (loaded here from a GeoPackage).

In [197]:
import geopandas as gp
milvus_range = gp.read_file(
    OUTPUT/ 'Milvusmilvus_range.gpkg', layer='Milvus milvus')
In [198]:
gdf = gp.GeoDataFrame(
    df, geometry=gp.points_from_xy(df.longitude, df.latitude), crs=CRS_WGS)

Intersect, keep only observations within range.

In [199]:
gdf_overlay = gp.overlay(
    gdf, milvus_range,
    how='intersection')
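
The same point-in-polygon filter could also be expressed as a spatial join, which avoids computing the geometric overlay; a minimal alternative sketch, assuming GeoPandas ≥ 0.10 (predicate keyword):

# keep only points that fall within the Milvus milvus range polygons
gdf_within = gp.sjoin(
    gdf, milvus_range[["geometry"]],
    predicate="within", how="inner").drop(columns="index_right")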
In [200]:
ax = gdf_overlay.plot(markersize=.1)
ax.axis('off')
plt.show()

Calculate chi

In [201]:
gdf_overlay
Out[201]:
id observed_on user_id latitude longitude Name Description geometry
0 7793 2010-05-22 446 52.440923 -3.953872 POINT (-3.95387 52.44092)
1 9495 2011-01-01 351 51.185595 -1.212616 POINT (-1.21262 51.18559)
2 9496 2011-01-01 351 51.280226 -0.925083 POINT (-0.92508 51.28023)
3 14694 2011-04-17 654 51.814146 -0.192910 POINT (-0.19291 51.81415)
4 50348 2011-08-03 4130 50.406329 8.873863 POINT (8.87386 50.40633)
... ... ... ... ... ... ... ... ...
19067 178281936 2023-07-02 5483063 47.661184 8.854844 POINT (8.85484 47.66118)
19068 178325879 2023-07-29 1859910 51.047883 10.996382 POINT (10.99638 51.04788)
19069 178326364 2013-06-09 6476964 51.454250 -0.884408 POINT (-0.88441 51.45425)
19070 178333330 2023-08-12 4228845 47.961332 16.769473 POINT (16.76947 47.96133)
19071 178333356 2023-08-13 186190 45.129853 7.695633 POINT (7.69563 45.12985)

19072 rows × 8 columns

In [202]:
gdf_overlay['datetime'] = pd.to_datetime(gdf_overlay["observed_on"], format=load_inat_kwargs.get("date_format"))
gdf_overlay.set_index('datetime', inplace=True)
gdf_cleaned = tools.drop_cols_except(gdf_overlay, [val_col], inplace=False)
In [203]:
df_milvus_range = filter_fill_time(
    gdf_cleaned, 2007, 2022, val_col=val_col, agg_base=AGG_BASE, agg_method=agg_method)
In [204]:
df_milvus_range.rename(columns={val_col: metric_label}, inplace=True)
In [205]:
src_ref="iNaturalist Milvus milvus"
In [206]:
df_milvus_range.sum()
Out[206]:
observations    14988
dtype: int64

Conclusion: The majority of observations (14,988 of 15,896, about 94%) falls within the spatial range.

In [207]:
plot_time(
        df_milvus_range, legend=metric_label.capitalize(),
        title=f'{src_ref.capitalize()} {metric_label} over time\n in Europe', val_col=metric_label,
        # distinct filename, so the worldwide figure above is not overwritten
        filename=f"temporal_iNaturalist_milvusmilvus_absolute_europe", trend=False, agg_base=AGG_BASE)

Calculate chi: Proportion to all iNaturalist photographs

First, calculate chi based on all iNaturalist observations within Europe

In [208]:
df_expected = load_and_plot(
    INATURALIST_ALL_MILVUSRANGE, src_ref=f"inaturalist_{AGG_BASE}",
    agg_base=AGG_BASE, trend=False, return_df=True, metric=metric_col)
In [209]:
norm_val = calc_norm(
    df_expected.rename(columns={f'{METRIC}count_est':metric_label}, inplace=False),
    df_milvus_range, chi_column = metric_label)
print(norm_val)
383.41553242594074
In [210]:
df_expected.rename(columns={f'{METRIC}count_est':f'{metric_label}_expected'}, inplace=True)
In [211]:
merge_cols = [f'{METRIC}count_est']
df_expected_observed_inat = df_expected.merge(
    df_milvus_range[metric_label],
    left_index=True, right_index=True)
In [212]:
%%time
apply_chi_calc(
    df=df_expected_observed_inat,
    norm_val=norm_val, chi_column=metric_label)
CPU times: user 1.06 ms, sys: 600 µs, total: 1.66 ms
Wall time: 1.58 ms
In [213]:
plot_time(
    df_expected_observed_inat, legend="Signed Chi", val_col="chi_value",
    title=f'Chi for Red kite {metric_label} proportional to \nall iNaturalist {metric_label} in Europe over time', 
    filename=f"temporal_chi_inaturalist_{metric_label}_milvus_{AGG_BASE}_{metric_label}", trend=False, 
    seasonal=False, residual=False, agg_base=AGG_BASE)

We can observe a proportional increase of Red Kite observations between 2012 and 2019.

Calculate chi: Proportion of Milvus milvus vs. "Aves" taxonomic class observations

In [214]:
df_expected = load_and_plot(
    INATURALIST_ALL_AVES_MILVUSRANGE, src_ref=f"inaturalist_{AGG_BASE}_{metric_label}",
    agg_base=AGG_BASE, trend=False, return_df=True, metric=metric_col)
In [215]:
norm_val = calc_norm(
    df_expected.rename(columns={f'{METRIC}count_est':metric_label}, inplace=False),
    df_milvus_range, chi_column = metric_label)
print(norm_val)
44.15238857752869
In [216]:
df_expected.rename(columns={f'{METRIC}count_est':f'{metric_label}_expected'}, inplace=True)
In [217]:
merge_cols = [f'{METRIC}count_est']
df_expected_observed_inat = df_expected.merge(
    df_milvus_range[metric_label],
    left_index=True, right_index=True)
In [218]:
%%time
apply_chi_calc(
    df=df_expected_observed_inat,
    norm_val=norm_val, chi_column=metric_label)
CPU times: user 1.47 ms, sys: 0 ns, total: 1.47 ms
Wall time: 1.46 ms
In [219]:
df_expected_observed_inat
Out[219]:
observations_expected observations chi_value significant
datetime
2007-12-31 1358.0 9 -26.067872 True
2008-12-31 1300.0 18 -14.013308 True
2009-12-31 1950.0 18 -26.161420 True
2010-12-31 2464.0 37 -16.728112 True
2011-12-31 2184.0 60 9.953148 True
2012-12-31 2770.0 60 -2.296311 False
2013-12-31 4136.0 132 26.311141 True
2014-12-31 6645.0 199 26.268493 True
2015-12-31 10395.0 277 18.000058 True
2016-12-31 13774.0 379 25.218878 True
2017-12-31 22362.0 485 -6.340083 True
2018-12-31 33030.0 702 -11.197346 True
2019-12-31 66383.0 1311 -32.987588 True
2020-12-31 114893.0 2797 25.375454 True
2021-12-31 171350.0 3756 -13.319731 True
2022-12-31 206762.0 4748 6.319490 True
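
As a sanity check, individual rows of this table can be reproduced by hand from the chi_calc() logic, using the norm_val printed above; for example, the 2013 value:

import math

# 2013 row: observed Milvus milvus vs. expected Aves observations
observed, expected = 132, 4136.0
chi = (observed * 44.15238857752869 - expected) / math.sqrt(expected)
print(round(chi, 6))  # 26.311141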
In [220]:
plot_time(
    df_expected_observed_inat, legend="Signed Chi", val_col="chi_value",
    title=f'Chi for Red kite {metric_label} proportional to \nall bird {metric_label} in Europe over time', 
    filename=f"temporal_chi_inaturalist_{metric_label}_milvus_aves_{AGG_BASE}", trend=False, 
    seasonal=False, residual=False, agg_base=AGG_BASE)

Proportion of "Aves" tax class observations vs. all iNaturalist observations

  • both queries limited to Milvus milvus range in Europe

Expected: All iNaturalist

  • limited to Europe
In [143]:
df_expected = load_and_plot(
    INATURALIST_ALL_MILVUSRANGE, src_ref=f"inaturalist_{AGG_BASE}_{metric_label}",
    agg_base=AGG_BASE, trend=False, return_df=True, metric=metric_col)

Observed: All Aves taxonomic class observations

  • limited to Europe
In [144]:
df_observed = load_and_plot(
    INATURALIST_ALL_AVES_MILVUSRANGE, src_ref=f"inaturalist_{AGG_BASE}_{metric_label}",
    agg_base=AGG_BASE, trend=False, return_df=True, metric=metric_col)
In [145]:
norm_val = calc_norm(
    df_expected.rename(columns={f'{METRIC}count_est':metric_label}), df_observed.rename(columns={f'{METRIC}count_est':metric_label}), chi_column = metric_label)
print(norm_val)
8.683913708375897
In [146]:
df_observed.head()
Out[146]:
postcount_est
datetime
2007-12-31 1358.0
2008-12-31 1300.0
2009-12-31 1950.0
2010-12-31 2464.0
2011-12-31 2184.0
In [147]:
df_expected.rename(columns={f'{METRIC}count_est':f'{metric_label}_expected'}, inplace=True)
df_observed.rename(columns={f'{METRIC}count_est':metric_label}, inplace=True)
In [148]:
merge_cols = [f'{METRIC}count_est']
df_expected_observed_inat = df_expected.merge(
    df_observed[metric_label],
    left_index=True, right_index=True)
In [149]:
df_expected_observed_inat.head()
Out[149]:
observations_expected observations
datetime
2007-12-31 15728.0 1358.0
2008-12-31 16534.0 1300.0
2009-12-31 20439.0 1950.0
2010-12-31 21873.0 2464.0
2011-12-31 23299.0 2184.0
In [150]:
%%time
apply_chi_calc(
    df=df_expected_observed_inat,
    norm_val=norm_val, chi_column=metric_label)
CPU times: user 867 µs, sys: 508 µs, total: 1.37 ms
Wall time: 1.36 ms
In [151]:
plot_time(
    df_expected_observed_inat, legend="Signed Chi", val_col="chi_value",
    title=f'Chi for all bird {metric_label} proportional to \nall iNaturalist {metric_label} in Europe over time', 
    filename=f"temporal_chi_inaturalist_{metric_label}_milvus_{AGG_BASE}", trend=False, 
    seasonal=False, residual=False, agg_base=AGG_BASE)

Interestingly, there was also a proportional peak of Aves observations on iNaturalist between 2014 and 2019. This suggests that the Red Kite peak was indeed part of a larger peak of bird observations on iNaturalist in these years.

Create Release File

Create a release file that contains the ipynb notebooks, HTML, figures, SVG files, and the converted Python files.

Make sure that 7z is available (apt-get install p7zip-full)

In [225]:
!cd .. && git config --system --add safe.directory '*' \
    && RELEASE_VERSION=$(git describe --tags --abbrev=0) \
    && 7z a -tzip -mx=9 out/release_$RELEASE_VERSION.zip \
    md/* py/* out/*.csv resources/* out/svg/* \
    out/figures/* notebooks/*.ipynb \
    CHANGELOG.md README.md jupytext.toml nbconvert.tpl conf.json pyproject.toml \
    -x!py/__pycache__ -x!py/modules/__pycache__ -x!py/modules/.ipynb_checkpoints \
    -y > /dev/null
In [226]:
!RELEASE_VERSION=$(git describe --tags --abbrev=0) && ls -alh ../out/release_$RELEASE_VERSION.zip
-rw-r--r-- 1 root root 19M Jul 19 07:08 ../out/release_v1.2.2.zip

Create notebook HTML

In [1]:
!jupyter nbconvert --to html_toc \
    --output-dir=../resources/html/ ./10_milvus_chi.ipynb \
    --template=../nbconvert.tpl \
    --ExtractOutputPreprocessor.enabled=False >&- 2>&-

IOER RDC Jupyter Base Template v0.10.0