Bucketing Continuous Variables in pandas
In this post we look at bucketing (also known as binning) continuous data into discrete chunks to be used as ordinal categorical variables.
We’ll start by mocking up some data to use in our analysis.
We draw 1,000 random values each from a normal distribution and a chi-square distribution.
import pandas as pd
import numpy as np
np.random.seed(10)
df = pd.DataFrame({
    'normal': np.random.normal(10, 3, 1000),
    'chi': np.random.chisquare(4, 1000)
})
We can use the pandas function pd.cut() to cut our data into 8 equal-width buckets.
The result is a categorical Series with 8 interval categories.
pd.cut(df['normal'], 8).head()
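Since the goal is an ordinal variable, it can help to confirm that the result really is ordered. The following is a minimal sketch, assuming the same seed as above; the names `normal`, `binned` and `codes` are illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd

np.random.seed(10)
normal = pd.Series(np.random.normal(10, 3, 1000))

binned = pd.cut(normal, 8)

print(binned.cat.ordered)          # pd.cut returns an ordered categorical by default
print(len(binned.cat.categories))  # one interval category per bucket

# Integer codes follow the bucket order, lowest bucket first
codes = binned.cat.codes
print(codes.head())
```

The integer codes are one way to feed the buckets into a model that expects numeric ordinal input.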
We’ll now do the same for our second distribution.
pd.cut(df['chi'], 8).head()
Notice, however, that the buckets for the first distribution and the second distribution do not have the same start values or end values, and have different step sizes.
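To see the difference directly, we can ask pd.cut() to return the edges it computed via its retbins argument. This is a sketch under the same seed as above; the variable names (`normal_edges`, `chi_edges`) are made up for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(10)
df = pd.DataFrame({
    'normal': np.random.normal(10, 3, 1000),
    'chi': np.random.chisquare(4, 1000),
})

# retbins=True additionally returns the array of bucket edges
_, normal_edges = pd.cut(df['normal'], 8, retbins=True)
_, chi_edges = pd.cut(df['chi'], 8, retbins=True)

# 8 buckets means 9 edges per column; the edges are derived from each
# column's own min and max, so they differ between the two columns
print(normal_edges)
print(chi_edges)
```

Because the edges depend on each column's range, bucket counts from the two columns cannot be compared like for like.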
If we want, we can provide our own buckets by passing an array in as the second argument to the pd.cut() function, with the array consisting of bucket cut-offs.
Let’s create an array of 9 evenly spaced cut-offs, which defines 8 buckets, to use on both distributions:
custom_bucket_array = np.linspace(0, 20, 9)
custom_bucket_array
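As a quick sanity check on the cut-offs, the sketch below computes the distance between consecutive edges. The names `edges` and `widths` are illustrative.

```python
import numpy as np

edges = np.linspace(0, 20, 9)  # 9 edges: 0, 2.5, 5, ..., 20
widths = np.diff(edges)        # distance between consecutive edges

print(edges)
print(widths)  # every bucket is 2.5 wide
```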
Now when we cut our data, both columns get the same buckets, each 2.5 wide.
df['normal'] = pd.cut(df['normal'], custom_bucket_array)
df['chi'] = pd.cut(df['chi'], custom_bucket_array)
df.head()
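One thing to watch with fixed cut-offs: values that fall outside them are not assigned to any bucket and come back as NaN. By default the buckets are closed on the right, so a value exactly equal to the lowest edge is also left out unless include_lowest=True is passed. The sketch below uses a handful of hand-picked values (not the post's data) to show this.

```python
import numpy as np
import pandas as pd

edges = np.linspace(0, 20, 9)
values = pd.Series([-1.0, 0.0, 2.5, 2.6, 20.0, 25.0])

# -1.0 and 25.0 lie outside the edges; 0.0 sits on the open left
# side of the first bucket, so all three become NaN
binned = pd.cut(values, edges)
print(binned)
print(binned.isna().sum())

# include_lowest=True makes the first bucket include its left edge
binned_incl = pd.cut(values, edges, include_lowest=True)
print(binned_incl.isna().sum())
```

If NaN buckets matter for your analysis, widening the edges (for example with -np.inf and np.inf as the outer cut-offs) is one option, at the cost of unequal bucket widths.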
We can then plot the number of observations in each bucket, using the same buckets for both distributions.
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
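# observed=False keeps empty buckets, so both counts line up with `categories`
a = df.groupby('normal', observed=False).size()
b = df.groupby('chi', observed=False).size()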
categories = df['normal'].cat.categories
ind = np.arange(len(categories))
width = 0.35
plt.bar(ind, a, width, label='Normal')
plt.bar(ind + width, b, width, label='Chi Square')
plt.xticks(ind + width / 2, categories, rotation=90)
plt.legend(loc='best')
plt.show()
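If a table is more useful than a chart, the same counts can be collected into a single DataFrame. This is a sketch rather than part of the original walkthrough; it rebuilds the raw data from the same seed, and the names `raw`, `edges` and `counts` are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(10)
raw = pd.DataFrame({
    'normal': np.random.normal(10, 3, 1000),
    'chi': np.random.chisquare(4, 1000),
})
edges = np.linspace(0, 20, 9)

# value_counts(sort=False) on a categorical keeps the buckets in order
# and reports empty buckets with a count of zero
counts = pd.DataFrame({
    col: pd.cut(raw[col], edges).value_counts(sort=False)
    for col in raw.columns
})
print(counts)
```

Each row is one bucket and each column one distribution, so the two can be compared row by row.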