Horse collar data exploration¶

This notebook presents a systematic movement data exploration workflow. The workflow consists of five main steps:
- Establishing an overview by visualizing raw input data records
- Putting records in context by exploring information from consecutive movement data records (such as: time between records, speed, and direction)
- Extracting trajectories, locations & events by dividing the raw continuous tracks into individual trajectories, locations, and events
- Exploring patterns in trajectory and event data by looking at groups of the trajectories or events
- Analyzing outliers by looking at potential outliers and how they may challenge preconceived assumptions about the dataset characteristics
The workflow is demonstrated using horse collar tracking data provided by Prof. Lene Fischer (University of Copenhagen) and the Center for Technology & Environment of Guldborgsund Municiplaity in Denmark but should be generic enough to be applied to other tracking datasets.
import numpy as np
import pandas as pd
import geopandas as gpd
import movingpandas as mpd
import shapely as shp
import hvplot.pandas
import matplotlib.pyplot as plt
from pyproj import CRS
from geopandas import GeoDataFrame, read_file
from shapely.geometry import Point, LineString, Polygon
from datetime import datetime, timedelta
from holoviews import opts, dim, Layout
from os.path import exists
from urllib.request import urlretrieve
import warnings
warnings.filterwarnings('ignore')
plot_defaults = {'linewidth':5, 'capstyle':'round', 'figsize':(9,3), 'legend':True}
opts.defaults(opts.Overlay(active_tools=['wheel_zoom'], frame_width=300, frame_height=400))
hvplot_defaults = {'tiles':None, 'cmap':'Viridis', 'colorbar':True}
mpd.show_versions()
MovingPandas 0.17.0 SYSTEM INFO ----------- python : 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:34:57) [MSC v.1936 64 bit (AMD64)] executable : H:\miniconda3\envs\mpd-ex\python.exe machine : Windows-10-10.0.19045-SP0 GEOS, GDAL, PROJ INFO --------------------- GEOS : None GEOS lib : None GDAL : 3.7.0 GDAL data dir: None PROJ : 9.2.1 PROJ data dir: H:\miniconda3\pkgs\proj-9.0.0-h1cfcee9_1\Library\share\proj PYTHON DEPENDENCIES ------------------- geopandas : 0.13.2 pandas : 2.0.3 fiona : 1.9.4 numpy : 1.24.4 shapely : 2.0.1 rtree : 1.0.1 pyproj : 3.6.0 matplotlib : 3.7.2 mapclassify: 2.5.0 geopy : 2.3.0 holoviews : 1.17.0 hvplot : 0.8.3 geoviews : 1.9.6 stonesoup : 1.0
mpd.__version__
'0.17.0'
Raw data import¶
df = read_file('../data/horse_collar.gpkg')
df['t'] = pd.to_datetime(df['timestamp'])
df = df.set_index('t').tz_localize(None)
print("This dataset contains {} records.\nThe first lines are:".format(len(df)))
df.head()
This dataset contains 17517 records. The first lines are:
No | CollarID | UTC_Date | UTC_Time | LMT_Date | LMT_Time | Origin | SCTS_Date | SCTS_Time | Latitude [?] | ... | C/N_10 | Sat_11 | C/N_11 | Mort. Status | Temp [?C] | Activity | Easting | Northing | timestamp | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
t | |||||||||||||||||||||
2018-11-14 12:01:08 | 299 | 30788 | 14-11-2018 | 12:01:08 | 14-11-2018 | 12:01:08 | GSM | 14-11-2018 | 14:32:31 | 547433308 | ... | NaN | NaN | NaN | NaN | 22.0 | NaN | 32.687.757.575 | 6.070.134.340 | 2018-11-14 12:01:08 | POINT (687757.574 6070134.334) |
2018-11-14 12:15:09 | 300 | 30788 | 14-11-2018 | 12:15:09 | 14-11-2018 | 12:15:09 | GSM | 14-11-2018 | 14:32:31 | 546768844 | ... | NaN | NaN | NaN | NaN | 22.0 | NaN | 32.687.671.089 | 6.062.727.428 | 2018-11-14 12:15:09 | POINT (687671.088 6062727.428) |
2018-11-14 12:30:08 | 301 | 30788 | 14-11-2018 | 12:30:08 | 14-11-2018 | 12:30:08 | GSM | 14-11-2018 | 14:32:31 | 546270178 | ... | NaN | NaN | NaN | NaN | 21.0 | NaN | 32.690.932.613 | 6.057.307.713 | 2018-11-14 12:30:08 | POINT (690932.614 6057307.716) |
2018-11-14 13:00:33 | 302 | 30788 | 14-11-2018 | 13:00:33 | 14-11-2018 | 13:00:33 | GSM | 14-11-2018 | 14:32:31 | 546258926 | ... | NaN | NaN | NaN | NaN | 17.0 | NaN | 32.690.669.041 | 6.057.171.253 | 2018-11-14 13:00:33 | POINT (690669.038 6057171.248) |
2018-11-14 13:30:09 | 303 | 30788 | 14-11-2018 | 13:30:09 | 14-11-2018 | 13:30:09 | GSM | 14-11-2018 | 14:32:31 | 546261709 | ... | NaN | NaN | NaN | NaN | 17.0 | NaN | 32.690.706.054 | 6.057.203.820 | 2018-11-14 13:30:09 | POINT (690706.056 6057203.814) |
5 rows × 48 columns
df.columns
Index(['No', 'CollarID', 'UTC_Date', 'UTC_Time', 'LMT_Date', 'LMT_Time', 'Origin', 'SCTS_Date', 'SCTS_Time', 'Latitude [?]', 'Longitude [?]', 'lat', 'long', 'FixType', 'Main [V]', 'Beacon [V]', 'Sats', 'Sat', 'C/N', 'Sat_1', 'C/N_1', 'Sat_2', 'C/N_2', 'Sat_3', 'C/N_3', 'Sat_4', 'C/N_4', 'Sat_5', 'C/N_5', 'Sat_6', 'C/N_6', 'Sat_7', 'C/N_7', 'Sat_8', 'C/N_8', 'Sat_9', 'C/N_9', 'Sat_10', 'C/N_10', 'Sat_11', 'C/N_11', 'Mort. Status', 'Temp [?C]', 'Activity', 'Easting', 'Northing', 'timestamp', 'geometry'], dtype='object')
df = df.drop(columns=['LMT_Date', 'LMT_Time',
'Origin', 'SCTS_Date', 'SCTS_Time', 'Latitude [?]', 'Longitude [?]',
'FixType', 'Main [V]', 'Beacon [V]', 'Sats', 'Sat',
'C/N', 'Sat_1', 'C/N_1', 'Sat_2', 'C/N_2', 'Sat_3', 'C/N_3', 'Sat_4',
'C/N_4', 'Sat_5', 'C/N_5', 'Sat_6', 'C/N_6', 'Sat_7', 'C/N_7', 'Sat_8',
'C/N_8', 'Sat_9', 'C/N_9', 'Sat_10', 'C/N_10', 'Sat_11', 'C/N_11',
'Easting', 'Northing',], axis=1)
df.head()
No | CollarID | UTC_Date | UTC_Time | lat | long | Mort. Status | Temp [?C] | Activity | timestamp | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|
t | |||||||||||
2018-11-14 12:01:08 | 299 | 30788 | 14-11-2018 | 12:01:08 | 54.743331 | 11.916987 | NaN | 22.0 | NaN | 2018-11-14 12:01:08 | POINT (687757.574 6070134.334) |
2018-11-14 12:15:09 | 300 | 30788 | 14-11-2018 | 12:15:09 | 54.676884 | 11.910876 | NaN | 22.0 | NaN | 2018-11-14 12:15:09 | POINT (687671.088 6062727.428) |
2018-11-14 12:30:08 | 301 | 30788 | 14-11-2018 | 12:30:08 | 54.627018 | 11.957852 | NaN | 21.0 | NaN | 2018-11-14 12:30:08 | POINT (690932.614 6057307.716) |
2018-11-14 13:00:33 | 302 | 30788 | 14-11-2018 | 13:00:33 | 54.625893 | 11.953686 | NaN | 17.0 | NaN | 2018-11-14 13:00:33 | POINT (690669.038 6057171.248) |
2018-11-14 13:30:09 | 303 | 30788 | 14-11-2018 | 13:30:09 | 54.626171 | 11.954280 | NaN | 17.0 | NaN | 2018-11-14 13:30:09 | POINT (690706.056 6057203.814) |
collar_id = df['CollarID'].unique()[0]
print("There is only one collar with ID {}.".format(collar_id))
There is only one collar with ID 30788.
df['Activity'].unique()
array([nan])
original_crs = df.crs
original_crs
<Projected CRS: EPSG:25832> Name: ETRS89 / UTM zone 32N Axis Info [cartesian]: - E[east]: Easting (metre) - N[north]: Northing (metre) Area of Use: - name: Europe between 6°E and 12°E: Austria; Belgium; Denmark - onshore and offshore; Germany - onshore and offshore; Norway including - onshore and offshore; Spain - offshore. - bounds: (6.0, 38.76, 12.0, 84.33) Coordinate Operation: - name: UTM zone 32N - method: Transverse Mercator Datum: European Terrestrial Reference System 1989 ensemble - Ellipsoid: GRS 1980 - Prime Meridian: Greenwich
1. Establishing an overview¶
The first step in our proposed EDA workflow can be performed directly on raw input data since it does not require temporally ordered data. It is therefore suitable as a first exploratory step when dealing with new data.
Q1.1 Geographic extent: Is the geographical extent as expected and are there holes in the spatial coverage?¶
df.to_crs({'init': 'epsg:4326'}).hvplot(title='Geographic extent of the dataset', geo=True, tiles='OSM')
The main area (the horste's pasture?) is located south of Nykobing Strandhuse.
However, we find also find two records on the road north west of the main area. Both points have been recorded on 2018-11-14 which is the first day of the dataset.
pd.DataFrame(df).sort_values('lat').tail(2)
No | CollarID | UTC_Date | UTC_Time | lat | long | Mort. Status | Temp [?C] | Activity | timestamp | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|
t | |||||||||||
2018-11-14 12:15:09 | 300 | 30788 | 14-11-2018 | 12:15:09 | 54.676884 | 11.910876 | NaN | 22.0 | NaN | 2018-11-14 12:15:09 | POINT (687671.088 6062727.428) |
2018-11-14 12:01:08 | 299 | 30788 | 14-11-2018 | 12:01:08 | 54.743331 | 11.916987 | NaN | 22.0 | NaN | 2018-11-14 12:01:08 | POINT (687757.574 6070134.334) |
A potential hypothesis for the origin of these two records is that the horse (or the collar) was transported on 2018-11-14, taking the road from Nykobing Falster south to the pasture.
If we remove these first two records from the dataset, the remainder of the records are located in a small area:
df = df[2:].to_crs({'init': 'epsg:4326'})
( df.hvplot(title='OSM showing paths and fences', size=2, geo=True, tiles='OSM') +
df.hvplot(title='Imagery showing land cover details', size=2, color='red', geo=True, tiles='EsriImagery') )