Resampling time series data with pandas

In this post, we’ll be going through an example of resampling time series data using pandas.

We’re going to be tracking a self-driving car at 15 minute periods over a year and creating weekly and yearly summaries.

Let’s start by importing some dependencies:

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.mpl_style', 'default')
%matplotlib inline

We’ll be tracking this self-driving car that travels at an average speed between 0 and 60 mph, all day long, all year long.

We have the average speed over the fifteen minute period in miles per hour, distance in miles and the cumulative distance travelled.

Our time series is set to be the index of a pandas DataFrame.

In [2]:

range = pd.date_range('2015-01-01', '2015-12-31', freq='15min')
df = pd.DataFrame(index = range)

# Average speed in miles per hour
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
# Distance in miles (speed * 0.5 hours)
df['distance'] = df['speed'] * 0.25 
# Cumulative distance travelled
df['cumulative_distance'] = df.distance.cumsum()

Let’s take a look at our data:

In [3]:

df.head()

Out[3]:

	speed	distance	cumulative_distance
2015-01-01 00:00:00	9	2.25	2.25
2015-01-01 00:15:00	24	6.00	8.25
2015-01-01 00:30:00	42	10.50	18.75
2015-01-01 00:45:00	22	5.50	24.25
2015-01-01 01:00:00	13	3.25	27.50

Now, let’s try and plot this data:

In [4]:

fig, ax1 = plt.subplots()

ax2 = ax1.twinx()
ax1.plot(df.index, df['speed'], 'g-')
ax2.plot(df.index, df['distance'], 'b-')

ax1.set_xlabel('Date')
ax1.set_ylabel('Speed', color='g')
ax2.set_ylabel('Distance', color='b')


plt.show()
plt.rcParams['figure.figsize'] = 12,5

Oh dear… Not very pretty, far too many data points.

Let’s start resampling, we’ll start with a weekly summary.

The resample method in pandas is similar to its groupby method as you are essentially grouping by a certain time span. You then specify a method of how you would like to resample.

So we’ll start with resampling the speed of our car:

df.speed.resample() will be used to resample the speed column of our DataFrame
The 'W' indicates we want to resample by week. At the bottom of this post is a summary of different time frames.
mean() is used to indicate we want the mean speed during this period.

With distance, we want the sum of the distances over the week to see how far the car travelled over the week, in that case we use sum().

With cumulative distance we just want to take the last value as it’s a running cumulative total, so in that case we use last().

In [5]:

weekly_summary = pd.DataFrame()
weekly_summary['speed'] = df.speed.resample('W').mean()
weekly_summary['distance'] = df.distance.resample('W').sum()
weekly_summary['cumulative_distance'] = df.cumulative_distance.resample('W').last()

#Select only whole weeks
weekly_summary = weekly_summary.truncate(before='2015-01-05', after='2015-12-27')
weekly_summary.head()

Out[5]:

	speed	distance	cumulative_distance
2015-01-11	29.549107	4964.25	7738.50
2015-01-18	29.938988	5029.75	12768.25
2015-01-25	28.837798	4844.75	17613.00
2015-02-01	29.653274	4981.75	22594.75
2015-02-08	29.197917	4905.25	27500.00

Now we have weekly summary data. Let’s have a look at our plots now.

In [6]:

fig, ax1 = plt.subplots()

ax2 = ax1.twinx()
ax1.plot(weekly_summary.index, weekly_summary['speed'], 'g-')
ax2.plot(weekly_summary.index, weekly_summary['distance'], 'b-')

ax1.set_xlabel('Date')
ax1.set_ylabel('Speed', color='g')
ax2.set_ylabel('Distance', color='b')

plt.show()
plt.rcParams['figure.figsize'] = 12,5

Much better

We can do the same thing for an annual summary:

In [7]:

annual_summary = pd.DataFrame()
# AS is year-start frequency
annual_summary['speed'] = df.speed.resample('AS').mean()
annual_summary['distance'] = df.speed.resample('AS').sum()
annual_summary['cumulative_distance'] = df.cumulative_distance.resample('AS').last()
annual_summary

Out[7]:

	speed	distance	cumulative_distance
2015-01-01	29.489884	1030524	257631.0

Upsampling data

How about if we wanted 5 minute data from our 15 minute data?

In this case we would want to forward fill our speed data, for this we can use ffil() or pad. Our distance and cumulative_distance column could then be recalculated on these values.

If we wanted to fill on the next value, rather than the previous value, we could use backward fill bfill().

In [8]:

five_minutely_data = pd.DataFrame()
five_minutely_data['speed'] = df.speed.resample('5min').ffill()
# 5 minutes is 1/12 hours
five_minutely_data['distance'] = five_minutely_data['speed'] * (1/float(12))
five_minutely_data['cumulative_distance'] = five_minutely_data.distance.cumsum()

In [9]:

five_minutely_data.head()

Out[9]:

	speed	distance	cumulative_distance
2015-01-01 00:00:00	9	0.75	0.75
2015-01-01 00:05:00	9	0.75	1.50
2015-01-01 00:10:00	9	0.75	2.25
2015-01-01 00:15:00	24	2.00	4.25
2015-01-01 00:20:00	24	2.00	6.25

Resampling options

pandas comes with many in-built options for resampling, and you can even define your own methods.

In terms of date ranges, the following is a table for common time period options when resampling a time series:

Alias	Description
B	Business day
D	Calendar day
W	Weekly
M	Month end
Q	Quarter end
A	Year end
BA	Business year end
AS	Year start
H	Hourly frequency
T, min	Minutely frequency
S	Secondly frequency
L, ms	Millisecond frequency
U, us	Microsecond frequency
N, ns	Nanosecond frequency

These are some of the common methods you might use for resampling:

Method	Description
bfill	Backward fill
count	Count of values
ffill	Forward fill
first	First valid data value
last	Last valid data value
max	Maximum data value
mean	Mean of values in time range
median	Median of values in time range
min	Minimum data value
nunique	Number of unique values
ohlc	Opening value, highest value, lowest value, closing value
pad	Same as forward fill
std	Standard deviation of values
sum	Sum of values
var	Variance of values