Principal Component Analysis in Python
Principal component analysis (PCA) is an unsupervised statistical technique used for dimensionality reduction.
It transforms a set of possibly correlated features into a set of linearly uncorrelated ones called ‘principal components’.
In this post we’ll run PCA on the pokemon data set.
import pandas as pd
from sklearn.decomposition import PCA
pokemon = pd.read_csv('data/pokemon.csv')
print(pokemon.head())
PCA is a good starting point for complex data. It models a linear subspace of the data that captures the greatest variability. It does this by assessing the data’s covariance structure, using matrix calculations and eigenvectors to compute the best unique features to describe the samples.
The first step is to find the mean of the data, then search for the direction with the most variance. That direction is the first principal component vector, so it is added to a list. The next principal component is the orthogonal direction with the next highest variance, and so on.
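To make that procedure concrete, here is a minimal sketch in plain NumPy (my own illustration for this post; scikit-learn’s PCA actually uses an SVD under the hood, which is numerically more stable):
import numpy as np
def pca_sketch(X, n_components=2):
    """Toy PCA via eigendecomposition of the covariance matrix."""
    # Centre the data on its mean
    X_centred = X - X.mean(axis=0)
    # Covariance matrix of the features
    cov = np.cov(X_centred, rowvar=False)
    # Eigenvectors of the covariance matrix are the principal directions
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Keep the directions with the largest variance (eigenvalue)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    # Project the centred data onto those directions
    return X_centred @ eigenvectors[:, order]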
This has a lot of practical uses, including reducing the number of features you are working with for more processor-intensive applications, and noise reduction.
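As a quick illustration of the noise-reduction idea (using made-up data here, not the pokemon set), we can project onto a few components and map back with inverse_transform; the discarded components carry most of the noise:
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.RandomState(0)
# Made-up data: a rank-2 signal in 6 dimensions plus noise
signal = rng.randn(200, 2) @ rng.randn(2, 6)
X_noisy = signal + 0.1 * rng.randn(200, 6)
pca_denoise = PCA(n_components=2)
reduced = pca_denoise.fit_transform(X_noisy)
# Reconstruct in the original space; variance outside the two
# kept components (mostly noise here) is discarded
X_denoised = pca_denoise.inverse_transform(reduced)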
PCA is sensitive to the scale of the features but, luckily for us on this occasion, our features are all on a similar scale.
# Just take these features of interest
df = pokemon[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']]
df.describe()
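If the features had been on very different scales, a sensible first step (sketched here but not applied in the rest of this post) would be to standardise them, for example with scikit-learn’s StandardScaler:
from sklearn.preprocessing import StandardScaler
# Optional: rescale each feature to zero mean and unit variance.
# We skip this below because the pokemon stats share a similar scale.
df_scaled = StandardScaler().fit_transform(df)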
We will be reducing the six features above down to just two principal components.
from sklearn.decomposition import PCA
pca = PCA(n_components=2, svd_solver='full')
pca.fit(df)
T = pca.transform(df)
# Started with 6 dimensions
df.shape
# Left with 2 principal components
T.shape
df.head()
T
We can use the explained_variance_ratio_
attribute of our principal component analysis object to see how much of the variance is explained by each of our principal component vectors.
pca.explained_variance_ratio_
So just two principal components can explain almost 65% of the variance from these six features.
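If we wanted to choose the number of components less arbitrarily, one common approach (a sketch, reusing the df from above) is to fit with all components and inspect the cumulative explained variance:
import numpy as np
from sklearn.decomposition import PCA
# Fit with all 6 components to see the full variance profile
pca_full = PCA(svd_solver='full').fit(df)
print(np.cumsum(pca_full.explained_variance_ratio_))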
Interpreting Components
We can access the correlations between the components and the original variables using the components_
attribute of our PCA()
object.
Interpreting these relies on finding the most highly correlated components (for this example we’ll use a cut-off of 0.45).
components = pd.DataFrame(pca.components_, columns = df.columns, index=[1, 2])
components
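To pick those out programmatically, a small sketch (assuming the components DataFrame above and our 0.45 cut-off) could be:
# Show only the loadings whose absolute value clears the cut-off
print(components[components.abs() > 0.45].fillna(''))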
So for the first principal component, Attack and Sp. Atk are significant: pokemon with a high value for the first principal component have high Attack and Sp. Atk.
The second principal component increases with Defense and decreases with Speed, so pokemon with high values for the second principal component have high Defense but low Speed.
We can do some mathematics to find out which features are the most important:
import math
def get_important_features(transformed_features, components_, columns):
    """
    This function will return the most "important"
    features so we can determine which have the most
    effect on multi-dimensional scaling
    """
    num_columns = len(columns)
    # Scale the principal components by the max value in
    # the transformed set belonging to that component
    xvector = components_[0] * max(transformed_features[:, 0])
    yvector = components_[1] * max(transformed_features[:, 1])
    # Sort each column by its vector length. These are your *original*
    # columns, not the principal components.
    important_features = {columns[i]: math.sqrt(xvector[i]**2 + yvector[i]**2)
                          for i in range(num_columns)}
    important_features = sorted(zip(important_features.values(),
                                    important_features.keys()), reverse=True)
    print("Features by importance:\n", important_features)
get_important_features(T, pca.components_, df.columns.values)
We see that the most significant features for this PCA are Defense, Speed, Sp. Atk and Attack, as we saw when examining the components_
previously.
By plotting these lengths, we can see this visually:
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
def draw_vectors(transformed_features, components_, columns):
    """
    This function will project your *original* features
    onto your principal component feature-space, so that you can
    visualize how "important" each one was in the
    multi-dimensional scaling
    """
    num_columns = len(columns)
    # Scale the principal components by the max value in
    # the transformed set belonging to that component
    xvector = components_[0] * max(transformed_features[:, 0])
    yvector = components_[1] * max(transformed_features[:, 1])
    ax = plt.axes()
    for i in range(num_columns):
        # Use an arrow to project each original feature as a
        # labeled vector on your principal component axes
        plt.arrow(0, 0, xvector[i], yvector[i], color='b',
                  width=0.0005, head_width=0.02, alpha=0.75)
        plt.text(xvector[i] * 1.2, yvector[i] * 1.2, list(columns)[i],
                 color='b', alpha=0.75)
    return ax
ax = draw_vectors(T, pca.components_, df.columns.values)
T_df = pd.DataFrame(T)
T_df.columns = ['component1', 'component2']
T_df['color'] = 'y'
T_df.loc[T_df['component1'] > 125, 'color'] = 'g'
T_df.loc[T_df['component2'] > 125, 'color'] = 'r'
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.scatter(T_df['component1'], T_df['component2'], color=T_df['color'], alpha=0.5)
plt.show()
We can see from the plot that every feature loads positively on the first principal component, but Speed and Sp. Atk load negatively on the second. The vector lengths portray their magnitudes.
The pokemon in green have high values for the first principal component – they have high Attack and Sp. Atk.
The pokemon in red have high values for the second principal component – they have high Defense and low Speed.
# High Attack, High Sp. Atk, all of these pokemon are legendary
print(pokemon.loc[T_df[T_df['color'] == 'g'].index])
# High Defense, Low Speed
print(pokemon.loc[T_df[T_df['color'] == 'r'].index])