Principal Component Analysis in Python


Principal component analysis (PCA) is an unsupervised statistical technique used for dimensionality reduction.

It transforms possibly correlated features into a set of linearly uncorrelated ones called 'principal components'.

In this post we'll apply PCA to the Pokemon data set.

In [1]:
import pandas as pd
from sklearn.decomposition import PCA

pokemon = pd.read_csv('data/pokemon.csv')
In [2]:
print(pokemon.head())
   #                   Name Type 1  Type 2  Total  HP  Attack  Defense  \
0  1              Bulbasaur  Grass  Poison    318  45      49       49   
1  2                Ivysaur  Grass  Poison    405  60      62       63   
2  3               Venusaur  Grass  Poison    525  80      82       83   
3  3  VenusaurMega Venusaur  Grass  Poison    625  80     100      123   
4  4             Charmander   Fire     NaN    309  39      52       43   

   Sp. Atk  Sp. Def  Speed  Generation Legendary  
0       65       65     45           1     False  
1       80       80     60           1     False  
2      100      100     80           1     False  
3      122      120     80           1     False  
4       60       50     65           1     False  

PCA is a good starting point for complex data. It models a linear subspace of the data that captures the greatest variability. It does this by assessing the data's covariance structure and using its eigenvectors to compute the directions that best describe the samples.

The first step finds the mean of the data, then searches for the direction with the most variance. This direction is the first principal component vector, so it is added to a list. The next principal component is the orthogonal direction with the next highest variance, and so on.
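
To make this concrete, here is a minimal NumPy sketch of the procedure just described: centre the data, compute the covariance matrix, and project onto the eigenvectors with the largest eigenvalues. (scikit-learn's PCA uses an SVD internally rather than an explicit eigendecomposition, but the components agree up to sign.)

import numpy as np

def pca_sketch(X, n_components=2):
    # Step 1: centre the data on its mean
    X_centred = X - X.mean(axis=0)
    # Step 2: assess the covariance structure of the features
    cov = np.cov(X_centred, rowvar=False)
    # Step 3: eigenvectors of the covariance matrix give the
    # directions of greatest variance (eigh suits symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 4: sort directions by variance explained, largest first
    order = np.argsort(eigenvalues)[::-1][:n_components]
    # Step 5: project the centred data onto those directions
    return X_centred @ eigenvectors[:, order]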


This has many practical uses, including reducing the number of features you are working with for more processor-intensive applications, and noise reduction.
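
To sketch the noise-reduction idea: project onto a few components, then map back to the original space with inverse_transform. The discarded directions carry the least variance, which is often where noise lives. A minimal example on synthetic data:

import numpy as np
from sklearn.decomposition import PCA

# Toy data: 100 samples with 6 features
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 6))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)               # 6 features -> 2 components
X_denoised = pca.inverse_transform(X_reduced)  # back to 6 features
print(X.shape, X_reduced.shape, X_denoised.shape)  # (100, 6) (100, 2) (100, 6)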

PCA is sensitive to the scale of the features but, luckily for us on this occasion, our features are all on a similar scale.
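
If the features were on very different scales, you would standardise them before fitting; a minimal sketch of the usual pattern (not needed for this data set):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Scale each feature to zero mean and unit variance before PCA,
# so no feature dominates just because of its units
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2))
# usage: T = scaled_pca.fit_transform(df)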

In [3]:
# Just take these features of interest
df = pokemon[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']]

df.describe()
Out[3]:
               HP      Attack     Defense     Sp. Atk     Sp. Def       Speed
count  800.000000  800.000000  800.000000  800.000000  800.000000  800.000000
mean    69.258750   79.001250   73.842500   72.820000   71.902500   68.277500
std     25.534669   32.457366   31.183501   32.722294   27.828916   29.060474
min      1.000000    5.000000    5.000000   10.000000   20.000000    5.000000
25%     50.000000   55.000000   50.000000   49.750000   50.000000   45.000000
50%     65.000000   75.000000   70.000000   65.000000   70.000000   65.000000
75%     80.000000  100.000000   90.000000   95.000000   90.000000   90.000000
max    255.000000  190.000000  230.000000  194.000000  230.000000  180.000000

We will reduce the six features above down to just two principal components.

In [4]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2, svd_solver='full')
pca.fit(df)
Out[4]:
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
  svd_solver='full', tol=0.0, whiten=False)
In [5]:
T = pca.transform(df)
In [6]:
# Started with 6 dimensions
df.shape
Out[6]:
(800, 6)
In [7]:
# Left with 2 principal components
T.shape
Out[7]:
(800, 2)
In [8]:
df.head()
Out[8]:
   HP  Attack  Defense  Sp. Atk  Sp. Def  Speed
0  45      49       49       65       65     45
1  60      62       63       80       80     60
2  80      82       83      100      100     80
3  80     100      123      122      120     80
4  39      52       43       60       50     65
In [9]:
T
Out[9]:
array([[ -45.86072754,   -5.38443151],
       [ -11.15293667,   -5.80561951],
       [  36.94600862,   -5.23612965],
       ..., 
       [  75.99988475,  -27.27078641],
       [ 114.0967126 ,  -36.87056714],
       [  72.88355049,   15.15261625]])

We can use the explained_variance_ratio_ attribute of our fitted PCA object to see how much of the variance is explained by each of our principal component vectors.

In [10]:
pca.explained_variance_ratio_
Out[10]:
array([ 0.46096131,  0.18752145])

So just two principal components explain almost 65% of the variance in these six features.
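
As an aside, if you would rather pick the number of components from a target amount of explained variance, scikit-learn accepts a fraction for n_components when svd_solver='full'; a minimal sketch:

# Keep however many components are needed to explain at least
# 90% of the variance
pca_90 = PCA(n_components=0.9, svd_solver='full')
pca_90.fit(df)
print(pca_90.n_components_)  # number of components selected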

Interpreting Components

We can access the relationship between the components and the original variables using the components_ attribute of our PCA object.

Interpreting these relies on finding the features that are most highly weighted on each component (for this example we'll use a cut-off of 0.45).

In [11]:
components = pd.DataFrame(pca.components_, columns = df.columns, index=[1, 2])
components
Out[11]:
         HP    Attack   Defense   Sp. Atk   Sp. Def     Speed
1  0.300808  0.492892  0.380635  0.508981  0.394370  0.327263
2  0.042210  0.076545  0.695216 -0.383311  0.173894 -0.576079

For the first principal component, Attack and Sp. Atk are the most significant weights, so Pokemon with a high value for the first principal component have high Attack and Sp. Atk.

The second principal component increases with Defense and decreases with Speed, so Pokemon with high values of the second principal component will have a high value for Defense but a low value for Speed.
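
The same cut-off can be applied programmatically; a small sketch using the components DataFrame built above:

# For each principal component, keep the features whose absolute
# weight exceeds the 0.45 cut-off chosen earlier
for pc, row in components.iterrows():
    strong = row[row.abs() > 0.45]
    print('PC{}:'.format(pc), strong.round(2).to_dict())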

We can do some mathematics to find out which are the most important features:

In [12]:
import math

def get_important_features(transformed_features, components_, columns):
    """
    This function will return the most "important"
    features, measured by the length of each feature's
    projection onto the principal component axes
    """
    num_columns = len(columns)

    # Scale the principal components by the max value in
    # the transformed set belonging to that component
    xvector = components_[0] * max(transformed_features[:,0])
    yvector = components_[1] * max(transformed_features[:,1])

    # Sort each column by its length. These are your *original*
    # columns, not the principal components.
    important_features = { columns[i] : math.sqrt(xvector[i]**2 + yvector[i]**2) for i in range(num_columns) }
    important_features = sorted(zip(important_features.values(), important_features.keys()), reverse=True)
    print("Features by importance:\n", important_features)

get_important_features(T, pca.components_, df.columns.values)
Features by importance:
[(143.62419952151768, 'Defense'), (119.74350606922016, 'Speed'), (105.83113958361301, 'Sp. Atk'), (76.02281561178808, 'Attack'), (68.1790434253425, 'Sp. Def'), (46.24128335926672, 'HP')]

We see that the most significant features for this PCA are Defense, Speed, Sp. Atk and Attack, as we saw when examining components_ previously.

By plotting these lengths, we can see this visually:

In [13]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

def draw_vectors(transformed_features, components_, columns):
    """
    This function will project your *original* features
    onto your principal component feature-space, so that you can
    visualize how "important" each one is to the projection
    """

    num_columns = len(columns)

    # Scale the principal components by the max value in
    # the transformed set belonging to that component
    xvector = components_[0] * max(transformed_features[:,0])
    yvector = components_[1] * max(transformed_features[:,1])

    ax = plt.axes()

    for i in range(num_columns):
        # Use an arrow to project each original feature as a
        # labeled vector on your principal component axes
        plt.arrow(0, 0, xvector[i], yvector[i], color='b', width=0.0005, head_width=0.02, alpha=0.75)
        plt.text(xvector[i]*1.2, yvector[i]*1.2, list(columns)[i], color='b', alpha=0.75)

    return ax
In [14]:
ax = draw_vectors(T, pca.components_, df.columns.values)
T_df = pd.DataFrame(T)
T_df.columns = ['component1', 'component2']

T_df['color'] = 'y'
T_df.loc[T_df['component1'] > 125, 'color'] = 'g'
T_df.loc[T_df['component2'] > 125, 'color'] = 'r'

plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.scatter(T_df['component1'], T_df['component2'], color=T_df['color'], alpha=0.5)
plt.show()

We can see from the plot that all features have positive weights on the first principal component, but Speed and Sp. Atk are negative on the second. The lengths of the arrows portray their magnitudes.

The Pokemon in green have high values for the first principal component: they have high Attack and Sp. Atk.

The Pokemon in red have high values for the second principal component: they have high Defense and low Speed.

In [15]:
# High Attack, High Sp. Atk, all of these pokemon are legendary
print(pokemon.loc[T_df[T_df['color'] == 'g'].index])
       #                   Name   Type 1    Type 2  Total   HP  Attack  \
163  150    MewtwoMega Mewtwo X  Psychic  Fighting    780  106     190   
164  150    MewtwoMega Mewtwo Y  Psychic       NaN    780  106     150   
422  382    KyogrePrimal Kyogre    Water       NaN    770  100     150   
424  383  GroudonPrimal Groudon   Ground      Fire    770  100     180   
426  384  RayquazaMega Rayquaza   Dragon    Flying    780  105     180   

     Defense  Sp. Atk  Sp. Def  Speed  Generation Legendary  
163      100      154      100    130           1      True  
164       70      194      120    140           1      True  
422       90      180      160     90           3      True  
424      160      150       90     90           3      True  
426      100      180      100    115           3      True  
In [16]:
# High Defense, Low Speed
print(pokemon.loc[T_df[T_df['color'] == 'r'].index])
       #                 Name Type 1  Type 2  Total  HP  Attack  Defense  \
224  208  SteelixMega Steelix  Steel  Ground    610  75     125      230   
230  213              Shuckle    Bug    Rock    505  20      10      230   
333  306    AggronMega Aggron  Steel     NaN    630  70     140      230   

     Sp. Atk  Sp. Def  Speed  Generation Legendary  
224       55       95     30           2     False  
230       10      230      5           2     False  
333       60       80     50           3     False