Horse collar data exploration¶

No description has been provided for this image

Binder IPYNB HTML

This notebook presents a systematic movement data exploration workflow. The workflow consists of five main steps:

  1. Establishing an overview by visualizing raw input data records
  2. Putting records in context by exploring information from consecutive movement data records (such as: time between records, speed, and direction)
  3. Extracting trajectories, locations & events by dividing the raw continuous tracks into individual trajectories, locations, and events
  4. Exploring patterns in trajectory and event data by looking at groups of the trajectories or events
  5. Analyzing outliers by looking at potential outliers and how they may challenge preconceived assumptions about the dataset characteristics

The workflow is demonstrated using horse collar tracking data provided by Prof. Lene Fischer (University of Copenhagen) and the Center for Technology & Environment of Guldborgsund Municiplaity in Denmark but should be generic enough to be applied to other tracking datasets.

In [1]:
import numpy as np
import pandas as pd
import geopandas as gpd
import movingpandas as mpd
import shapely as shp
import hvplot.pandas 
import matplotlib.pyplot as plt

from pyproj import CRS
from geopandas import GeoDataFrame, read_file
from shapely.geometry import Point, LineString, Polygon
from datetime import datetime, timedelta
from holoviews import opts, dim, Layout
from os.path import exists
from urllib.request import urlretrieve

import warnings
warnings.filterwarnings('ignore')

plot_defaults = {'linewidth':5, 'capstyle':'round', 'figsize':(9,3), 'legend':True}
opts.defaults(opts.Overlay(active_tools=['wheel_zoom'], frame_width=300, frame_height=400))
hvplot_defaults = {'tiles':None, 'cmap':'Viridis', 'colorbar':True}

mpd.show_versions()
MovingPandas 0.17.0

SYSTEM INFO
-----------
python     : 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:34:57) [MSC v.1936 64 bit (AMD64)]
executable : H:\miniconda3\envs\mpd-ex\python.exe
machine    : Windows-10-10.0.19045-SP0

GEOS, GDAL, PROJ INFO
---------------------
GEOS       : None
GEOS lib   : None
GDAL       : 3.7.0
GDAL data dir: None
PROJ       : 9.2.1
PROJ data dir: H:\miniconda3\pkgs\proj-9.0.0-h1cfcee9_1\Library\share\proj

PYTHON DEPENDENCIES
-------------------
geopandas  : 0.13.2
pandas     : 2.0.3
fiona      : 1.9.4
numpy      : 1.24.4
shapely    : 2.0.1
rtree      : 1.0.1
pyproj     : 3.6.0
matplotlib : 3.7.2
mapclassify: 2.5.0
geopy      : 2.3.0
holoviews  : 1.17.0
hvplot     : 0.8.3
geoviews   : 1.9.6
stonesoup  : 1.0
In [2]:
mpd.__version__
Out[2]:
'0.17.0'

Raw data import¶

In [3]:
df = read_file('../data/horse_collar.gpkg')
df['t'] = pd.to_datetime(df['timestamp'])
df = df.set_index('t').tz_localize(None)
print("This dataset contains {} records.\nThe first lines are:".format(len(df)))
df.head()
This dataset contains 17517 records.
The first lines are:
Out[3]:
No CollarID UTC_Date UTC_Time LMT_Date LMT_Time Origin SCTS_Date SCTS_Time Latitude [?] ... C/N_10 Sat_11 C/N_11 Mort. Status Temp [?C] Activity Easting Northing timestamp geometry
t
2018-11-14 12:01:08 299 30788 14-11-2018 12:01:08 14-11-2018 12:01:08 GSM 14-11-2018 14:32:31 547433308 ... NaN NaN NaN NaN 22.0 NaN 32.687.757.575 6.070.134.340 2018-11-14 12:01:08 POINT (687757.574 6070134.334)
2018-11-14 12:15:09 300 30788 14-11-2018 12:15:09 14-11-2018 12:15:09 GSM 14-11-2018 14:32:31 546768844 ... NaN NaN NaN NaN 22.0 NaN 32.687.671.089 6.062.727.428 2018-11-14 12:15:09 POINT (687671.088 6062727.428)
2018-11-14 12:30:08 301 30788 14-11-2018 12:30:08 14-11-2018 12:30:08 GSM 14-11-2018 14:32:31 546270178 ... NaN NaN NaN NaN 21.0 NaN 32.690.932.613 6.057.307.713 2018-11-14 12:30:08 POINT (690932.614 6057307.716)
2018-11-14 13:00:33 302 30788 14-11-2018 13:00:33 14-11-2018 13:00:33 GSM 14-11-2018 14:32:31 546258926 ... NaN NaN NaN NaN 17.0 NaN 32.690.669.041 6.057.171.253 2018-11-14 13:00:33 POINT (690669.038 6057171.248)
2018-11-14 13:30:09 303 30788 14-11-2018 13:30:09 14-11-2018 13:30:09 GSM 14-11-2018 14:32:31 546261709 ... NaN NaN NaN NaN 17.0 NaN 32.690.706.054 6.057.203.820 2018-11-14 13:30:09 POINT (690706.056 6057203.814)

5 rows × 48 columns

In [4]:
df.columns
Out[4]:
Index(['No', 'CollarID', 'UTC_Date', 'UTC_Time', 'LMT_Date', 'LMT_Time',
       'Origin', 'SCTS_Date', 'SCTS_Time', 'Latitude [?]', 'Longitude [?]',
       'lat', 'long', 'FixType', 'Main [V]', 'Beacon [V]', 'Sats', 'Sat',
       'C/N', 'Sat_1', 'C/N_1', 'Sat_2', 'C/N_2', 'Sat_3', 'C/N_3', 'Sat_4',
       'C/N_4', 'Sat_5', 'C/N_5', 'Sat_6', 'C/N_6', 'Sat_7', 'C/N_7', 'Sat_8',
       'C/N_8', 'Sat_9', 'C/N_9', 'Sat_10', 'C/N_10', 'Sat_11', 'C/N_11',
       'Mort. Status', 'Temp [?C]', 'Activity', 'Easting', 'Northing',
       'timestamp', 'geometry'],
      dtype='object')
In [5]:
df = df.drop(columns=['LMT_Date', 'LMT_Time',
       'Origin', 'SCTS_Date', 'SCTS_Time', 'Latitude [?]', 'Longitude [?]',
       'FixType', 'Main [V]', 'Beacon [V]', 'Sats', 'Sat',
       'C/N', 'Sat_1', 'C/N_1', 'Sat_2', 'C/N_2', 'Sat_3', 'C/N_3', 'Sat_4',
       'C/N_4', 'Sat_5', 'C/N_5', 'Sat_6', 'C/N_6', 'Sat_7', 'C/N_7', 'Sat_8',
       'C/N_8', 'Sat_9', 'C/N_9', 'Sat_10', 'C/N_10', 'Sat_11', 'C/N_11',
       'Easting', 'Northing',], axis=1)
In [6]:
df.head()
Out[6]:
No CollarID UTC_Date UTC_Time lat long Mort. Status Temp [?C] Activity timestamp geometry
t
2018-11-14 12:01:08 299 30788 14-11-2018 12:01:08 54.743331 11.916987 NaN 22.0 NaN 2018-11-14 12:01:08 POINT (687757.574 6070134.334)
2018-11-14 12:15:09 300 30788 14-11-2018 12:15:09 54.676884 11.910876 NaN 22.0 NaN 2018-11-14 12:15:09 POINT (687671.088 6062727.428)
2018-11-14 12:30:08 301 30788 14-11-2018 12:30:08 54.627018 11.957852 NaN 21.0 NaN 2018-11-14 12:30:08 POINT (690932.614 6057307.716)
2018-11-14 13:00:33 302 30788 14-11-2018 13:00:33 54.625893 11.953686 NaN 17.0 NaN 2018-11-14 13:00:33 POINT (690669.038 6057171.248)
2018-11-14 13:30:09 303 30788 14-11-2018 13:30:09 54.626171 11.954280 NaN 17.0 NaN 2018-11-14 13:30:09 POINT (690706.056 6057203.814)
In [7]:
collar_id = df['CollarID'].unique()[0]
print("There is only one collar with ID {}.".format(collar_id))
There is only one collar with ID 30788.
In [8]:
df['Activity'].unique()
Out[8]:
array([nan])
In [9]:
original_crs = df.crs
original_crs
Out[9]:
<Projected CRS: EPSG:25832>
Name: ETRS89 / UTM zone 32N
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: Europe between 6°E and 12°E: Austria; Belgium; Denmark - onshore and offshore; Germany - onshore and offshore; Norway including - onshore and offshore; Spain - offshore.
- bounds: (6.0, 38.76, 12.0, 84.33)
Coordinate Operation:
- name: UTM zone 32N
- method: Transverse Mercator
Datum: European Terrestrial Reference System 1989 ensemble
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

1. Establishing an overview¶

The first step in our proposed EDA workflow can be performed directly on raw input data since it does not require temporally ordered data. It is therefore suitable as a first exploratory step when dealing with new data.

Q1.1 Geographic extent: Is the geographical extent as expected and are there holes in the spatial coverage?¶

In [10]:
df.to_crs({'init': 'epsg:4326'}).hvplot(title='Geographic extent of the dataset', geo=True, tiles='OSM')
Out[10]:

The main area (the horste's pasture?) is located south of Nykobing Strandhuse.

However, we find also find two records on the road north west of the main area. Both points have been recorded on 2018-11-14 which is the first day of the dataset.

In [11]:
pd.DataFrame(df).sort_values('lat').tail(2)
Out[11]:
No CollarID UTC_Date UTC_Time lat long Mort. Status Temp [?C] Activity timestamp geometry
t
2018-11-14 12:15:09 300 30788 14-11-2018 12:15:09 54.676884 11.910876 NaN 22.0 NaN 2018-11-14 12:15:09 POINT (687671.088 6062727.428)
2018-11-14 12:01:08 299 30788 14-11-2018 12:01:08 54.743331 11.916987 NaN 22.0 NaN 2018-11-14 12:01:08 POINT (687757.574 6070134.334)

A potential hypothesis for the origin of these two records is that the horse (or the collar) was transported on 2018-11-14, taking the road from Nykobing Falster south to the pasture.

If we remove these first two records from the dataset, the remainder of the records are located in a small area:

In [12]:
df = df[2:].to_crs({'init': 'epsg:4326'})
( df.hvplot(title='OSM showing paths and fences', size=2, geo=True, tiles='OSM') +
  df.hvplot(title='Imagery showing land cover details', size=2, color='red', geo=True, tiles='EsriImagery') )
Out[12]: