Mapping Categorical Data in pandas

In Python, unlike R, there is no built-in option to represent categorical data as factors. Factors in R are stored as vectors of integer values, each of which can carry a label.

If our data is in a pandas Series or DataFrame, we can convert these categories to numbers by calling the Series astype method with 'category' and then reading the integer codes from .cat.codes.

Nominal Categories

Nominal categories are unordered, e.g. colours, sex, nationality.

In the example below we convert the vertebrates column of the df DataFrame to a categorical and read off the integer code for each row.

By default the categories are sorted alphabetically, which is why in the example below Amphibian is represented by a zero.

In [1]:
import pandas as pd
In [2]:
df = pd.DataFrame({'vertebrates': ['Bird', 'Bird', 'Mammal', 'Fish', 'Amphibian', 'Reptile', 'Mammal']})

df.vertebrates.astype("category").cat.codes
Out[2]:
0    1
1    1
2    3
3    2
4    0
5    4
6    3
dtype: int8
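
The integer codes are only meaningful alongside their labels. As a small sketch (the variable names vertebrates_cat and code_to_label are illustrative and not part of the original notebook), the lookup from code to label can be read from the .cat.categories attribute:

```python
import pandas as pd

df = pd.DataFrame({'vertebrates': ['Bird', 'Bird', 'Mammal', 'Fish', 'Amphibian', 'Reptile', 'Mammal']})

# Convert once and keep the categorical Series so codes and labels stay together
vertebrates_cat = df.vertebrates.astype("category")

# The position of a label in .cat.categories is its integer code
code_to_label = dict(enumerate(vertebrates_cat.cat.categories))
print(code_to_label)
```

Here code 0 maps to Amphibian and code 4 to Reptile, matching the alphabetical default described above.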

You can also pass the list of vertebrate types in explicitly, so that you have a record of which label matches each code. The codes then follow the position of each label in that list.

Any value that does not appear in the list will be represented by -1.

In [3]:
vertebrate_types = ['Mammal', 'Reptile', 'Bird', 'Amphibian', 'Fish']

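# Recent pandas versions no longer accept categories= in astype; pass a CategoricalDtype instead
df.vertebrates.astype(pd.CategoricalDtype(categories=vertebrate_types)).cat.codes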
Out[3]:
0    2
1    2
2    0
3    4
4    3
5    1
6    0
dtype: int8
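
The -1 code for unlisted values can be checked with a small sketch. It assumes a recent pandas version, where the category list is supplied through pd.CategoricalDtype, and the 'Insect' entry is a made-up value added only for illustration:

```python
import pandas as pd

vertebrate_types = ['Mammal', 'Reptile', 'Bird', 'Amphibian', 'Fish']

# 'Insect' is a hypothetical value that is not in vertebrate_types
animals = pd.Series(['Bird', 'Insect', 'Mammal'])

dtype = pd.CategoricalDtype(categories=vertebrate_types)
codes = animals.astype(dtype).cat.codes
print(codes.tolist())
```

Bird and Mammal receive their list positions (2 and 0), while the unlisted value receives -1.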

However, there is no inherent relationship between these categories, so it doesn’t necessarily make sense to store them as different numbers on the same scale.

If we want to split the categories out into separate indicator columns, as models such as linear regression typically require, we can use pd.get_dummies.

In [4]:
# dtype=int gives 0/1 columns; recent pandas versions return booleans by default
pd.get_dummies(df, columns=['vertebrates'], dtype=int)
Out[4]:
   vertebrates_Amphibian  vertebrates_Bird  vertebrates_Fish  vertebrates_Mammal  vertebrates_Reptile
0                      0                 1                 0                   0                    0
1                      0                 1                 0                   0                    0
2                      0                 0                 0                   1                    0
3                      0                 0                 1                   0                    0
4                      1                 0                 0                   0                    0
5                      0                 0                 0                   0                    1
6                      0                 0                 0                   1                    0
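
For a linear model that includes an intercept, a full set of indicator columns is redundant, since any one column can be derived from the others. A possible sketch, assuming that redundancy matters for your model, uses the drop_first option of pd.get_dummies. Passing dtype=int is a choice made here to get 0/1 integers rather than booleans:

```python
import pandas as pd

df = pd.DataFrame({'vertebrates': ['Bird', 'Bird', 'Mammal', 'Fish', 'Amphibian', 'Reptile', 'Mammal']})

# drop_first=True omits the first category (Amphibian), which becomes the baseline
dummies = pd.get_dummies(df, columns=['vertebrates'], drop_first=True, dtype=int)
print(dummies.columns.tolist())

# The Amphibian row (index 4) is now represented by zeros in every remaining column
print(dummies.iloc[4].tolist())
```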

Ordinal Categories

Ordinal categories are ordered, e.g. school grades, price ranges, salary bands.

For ordinal categorical data, you also specify ordered=True when converting the column, together with the categories listed from lowest to highest.

In [5]:
ordered_satisfaction = ['Very Unhappy', 'Unhappy', 'Neutral', 'Happy', 'Very Happy']
df = pd.DataFrame({'satisfaction':['Mad', 'Happy', 'Unhappy', 'Neutral']})

We can have the output as text labels, with NaN for any value that is not in the category list (here, 'Mad'):

In [6]:
# Recent pandas versions take the categories and ordering through a CategoricalDtype
df.satisfaction.astype(
  pd.CategoricalDtype(categories=ordered_satisfaction, ordered=True)
)
Out[6]:
0        NaN
1      Happy
2    Unhappy
3    Neutral
Name: satisfaction, dtype: category
Categories (5, object): [Very Unhappy < Unhappy < Neutral < Happy < Very Happy]

Or we can have the output as integer codes that follow the order of the categories. Any value that is not in the category list is given the code -1.

In [7]:
# Recent pandas versions take the categories and ordering through a CategoricalDtype
df.satisfaction.astype(
  pd.CategoricalDtype(categories=ordered_satisfaction, ordered=True)
).cat.codes
Out[7]:
0   -1
1    3
2    1
3    2
dtype: int8
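
Declaring the order also lets pandas compare and sort the labels by rank rather than alphabetically. The following is a minimal sketch, assuming a recent pandas version with pd.CategoricalDtype; the names satisfaction, at_least_neutral and ranked are illustrative:

```python
import pandas as pd

ordered_satisfaction = ['Very Unhappy', 'Unhappy', 'Neutral', 'Happy', 'Very Happy']
df = pd.DataFrame({'satisfaction': ['Mad', 'Happy', 'Unhappy', 'Neutral']})

dtype = pd.CategoricalDtype(categories=ordered_satisfaction, ordered=True)
satisfaction = df.satisfaction.astype(dtype)

# Comparisons use the declared order; the unlisted 'Mad' became NaN and compares as False
at_least_neutral = satisfaction >= 'Neutral'
print(at_least_neutral.tolist())

# Sorting also follows the declared order (NaN dropped first for clarity)
ranked = satisfaction.dropna().sort_values().tolist()
print(ranked)
```

An unordered categorical would raise an error on the >= comparison, which is one practical reason to set ordered=True for ordinal data.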