Mapping Categorical Data in pandas
Python itself, unlike R, has no built-in factor type for categorical data. Factors in R are stored as vectors of integer values with a label attached to each value. pandas offers an equivalent through its category dtype.
If our data is in a Series or DataFrame, we can convert the categories to numbers by calling the Series astype method with "category" and then reading the integer codes from .cat.codes.
Nominal Categories
Nominal categories are unordered, e.g. colours, sex, nationality.
In the example below we convert the vertebrates column of the df DataFrame into its individual categories.
By default the categories are ordered alphabetically, which is why in the example below Amphibian is represented by a zero.
import pandas as pd
df = pd.DataFrame({'vertebrates': ['Bird', 'Bird', 'Mammal', 'Fish', 'Amphibian', 'Reptile', 'Mammal']})
df.vertebrates.astype("category").cat.codes
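To check which label each code stands for, the categories can be read back from the same categorical Series. The following is a minimal sketch using the df defined above; the names cats and code_to_label are illustrative and not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({'vertebrates': ['Bird', 'Bird', 'Mammal', 'Fish', 'Amphibian', 'Reptile', 'Mammal']})

# Convert once and keep the categorical Series around
cats = df.vertebrates.astype("category")

# A label's position in .cat.categories is its integer code
code_to_label = dict(enumerate(cats.cat.categories))
print(code_to_label)
# {0: 'Amphibian', 1: 'Bird', 2: 'Fish', 3: 'Mammal', 4: 'Reptile'}

print(cats.cat.codes.tolist())
# [1, 1, 3, 2, 0, 4, 3]
```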
You can also pass the vertebrate types in explicitly, so that each code is fixed by the label's position in your list and you keep a record of which label matches which code.
Any value that does not appear in the list is represented by -1.
vertebrate_types = ['Mammal', 'Reptile', 'Bird', 'Amphibian', 'Fish']
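# Current pandas no longer accepts categories= as an astype keyword; pass a CategoricalDtype instead
df.vertebrates.astype(pd.CategoricalDtype(categories=vertebrate_types)).cat.codes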
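To illustrate the -1 code, here is a small sketch, assuming a recent pandas version where the category list is supplied through pd.CategoricalDtype. The shortened list known_types is a made-up example that deliberately leaves out 'Fish'.

```python
import pandas as pd

df = pd.DataFrame({'vertebrates': ['Bird', 'Bird', 'Mammal', 'Fish', 'Amphibian', 'Reptile', 'Mammal']})

# Hypothetical list that deliberately omits 'Fish'
known_types = ['Mammal', 'Reptile', 'Bird', 'Amphibian']
dtype = pd.CategoricalDtype(categories=known_types)

# Codes follow the position in known_types; 'Fish' has no position, so it becomes -1
codes = df.vertebrates.astype(dtype).cat.codes
print(codes.tolist())
# [2, 2, 0, -1, 3, 1, 0]
```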
However, there is no inherent relationship between these categories, so it doesn't necessarily make sense to store them as different numbers on the same scale.
If we want to split the categories out into separate boolean indicator columns, as is usual for models such as linear regression, we can use pd.get_dummies.
pd.get_dummies(df, columns=['vertebrates'])
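As a rough sketch of what this call returns, the snippet below inspects the resulting columns for the same df. The variable name dummies is illustrative. The dtype of the indicator columns differs between pandas versions, so the check is written to work with either integers or booleans.

```python
import pandas as pd

df = pd.DataFrame({'vertebrates': ['Bird', 'Bird', 'Mammal', 'Fish', 'Amphibian', 'Reptile', 'Mammal']})

# One indicator column per category, named <column>_<label>
dummies = pd.get_dummies(df, columns=['vertebrates'])
print(dummies.columns.tolist())

# Each row belongs to exactly one category, so every row sums to 1
print(dummies.sum(axis=1).eq(1).all())
```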
Ordinal Categories
Ordinal categories are ordered, e.g. school grades, price ranges, salary bands.
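For ordinal categorical data, you build a pd.CategoricalDtype with ordered=True and pass it to the astype method. Older pandas versions accepted categories and ordered directly as astype keyword arguments, but current versions no longer do.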
ordered_satisfaction = ['Very Unhappy', 'Unhappy', 'Neutral', 'Happy', 'Very Happy']
df = pd.DataFrame({'satisfaction':['Mad', 'Happy', 'Unhappy', 'Neutral']})
We can have the output categories as text, with NaN for any value that is not in the list (here 'Mad'):
df.satisfaction.astype(
    pd.CategoricalDtype(categories=ordered_satisfaction, ordered=True)
)
Or the output categories as numbers that map to the ordered categories. The number -1 is given to any value that is not in the list.
df.satisfaction.astype(
    pd.CategoricalDtype(categories=ordered_satisfaction, ordered=True)
).cat.codes
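One practical consequence of declaring the order is that comparisons respect it. The sketch below assumes a recent pandas version and reuses the satisfaction data from above; the names satisfaction, at_least_neutral and highest are illustrative.

```python
import pandas as pd

ordered_satisfaction = ['Very Unhappy', 'Unhappy', 'Neutral', 'Happy', 'Very Happy']
df = pd.DataFrame({'satisfaction': ['Mad', 'Happy', 'Unhappy', 'Neutral']})

dtype = pd.CategoricalDtype(categories=ordered_satisfaction, ordered=True)
satisfaction = df.satisfaction.astype(dtype)

# Comparison uses the declared order; the unmatched 'Mad' is NaN and compares as False
at_least_neutral = satisfaction >= 'Neutral'
print(at_least_neutral.tolist())
# [False, True, False, True]

# The highest observed level, ignoring the missing value
highest = satisfaction.dropna().max()
print(highest)
# Happy
```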