Linear Regression in Python using scikit-learn

In this post, we’ll explore linear regression using scikit-learn in Python.

We will use the physical attributes of a car to predict its miles per gallon (mpg).

Linear regression produces a model in the form:

$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p $

The coefficients are estimated by minimising the residual sum of squares (RSS) over the $n$ training observations, given by the equations below:

$ RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $

$ RSS = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \dots - \hat{\beta}_p x_{ip})^2 $
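To make the RSS concrete, here is a minimal NumPy sketch on made-up toy data (not the car data set). The helper `rss` and the arrays `X` and `y` are hypothetical names introduced only for this illustration, and `np.linalg.lstsq` stands in for the least-squares solve.

```python
import numpy as np

# Toy data (made up for illustration): y = 1 + 2*x1 + 3*x2 exactly.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

def rss(beta, X, y):
    """Residual sum of squares for coefficients beta = [b0, b1, ..., bp]."""
    y_hat = beta[0] + X @ beta[1:]
    return float(np.sum((y - y_hat) ** 2))

# Least-squares fit: prepend a column of ones so beta_hat[0] is the intercept.
X_design = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(beta_hat)                               # approximately [1, 2, 3]
print(rss(beta_hat, X, y))                    # approximately 0
print(rss(np.array([0.0, 2.0, 3.0]), X, y))   # 5.0: each of the 5 residuals is 1
```

Any other choice of coefficients, such as dropping the intercept in the last line, gives a larger RSS than the least-squares solution.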

Scikit-learn provides a LinearRegression class for doing this job.

Let’s start by loading our data.

In [1]:

import pandas as pd

df = pd.read_csv('data/auto-mpg.csv')

print(df.head())

    mpg  cylinders  displacement horsepower  weight  acceleration  year  \
0  18.0          8         307.0        130    3504          12.0    70   
1  15.0          8         350.0        165    3693          11.5    70   
2  18.0          8         318.0        150    3436          11.0    70   
3  16.0          8         304.0        150    3433          12.0    70   
4  17.0          8         302.0        140    3449          10.5    70   

   origin                       name  
0       1  chevrolet chevelle malibu  
1       1          buick skylark 320  
2       1         plymouth satellite  
3       1              amc rebel sst  
4       1                ford torino

We don’t need the name column, so let’s remove it.

In [2]:

df = df.drop('name', axis=1)

Also note that the column "origin" records where the car came from. It is a nominal categorical variable (the codes 1, 2 and 3 have no natural order), so we need to create binary dummy variables for it (see this post for more details).

In [3]:

df['origin'] = df['origin'].replace({1: 'america', 2: 'europe', 3: 'asia'})

df = pd.get_dummies(df, columns=['origin'])

print(df.head())

    mpg  cylinders  displacement horsepower  weight  acceleration  year  \
0  18.0          8         307.0        130    3504          12.0    70   
1  15.0          8         350.0        165    3693          11.5    70   
2  18.0          8         318.0        150    3436          11.0    70   
3  16.0          8         304.0        150    3433          12.0    70   
4  17.0          8         302.0        140    3449          10.5    70   

   origin_america  origin_asia  origin_europe  
0               1            0              0  
1               1            0              0  
2               1            0              0  
3               1            0              0  
4               1            0              0
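As a small, self-contained sketch of the same encoding step, the snippet below builds a made-up three-row frame (`toy` is a hypothetical name) and encodes it. It uses `map` rather than `replace`, and passes `dtype=int` to ask for 0/1 columns, since depending on your pandas version `get_dummies` may otherwise return True/False values.

```python
import pandas as pd

# Toy frame (made-up rows) using the same 1/2/3 origin codes as above.
toy = pd.DataFrame({"mpg": [18.0, 26.0, 31.0], "origin": [1, 2, 3]})
toy["origin"] = toy["origin"].map({1: "america", 2: "europe", 3: "asia"})

# One binary column per category; dtype=int requests 0/1 instead of booleans.
toy = pd.get_dummies(toy, columns=["origin"], dtype=int)

print(toy.columns.tolist())
print(toy)
```

The new columns are named after the original column plus the category, in alphabetical order of the categories.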

There are some missing values for horsepower, denoted by question marks, so we’ll need to remove the affected rows.

In [4]:

import numpy as np

df = df.replace('?', np.nan)
df = df.dropna()
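One thing to watch for: because of the question marks, pandas may have read horsepower as text rather than numbers, and replacing '?' with NaN does not by itself change the column's type. The sketch below, on a made-up frame (`raw` and `clean` are hypothetical names), shows one way to get a numeric column, assuming `pd.to_numeric` with `errors="coerce"` is acceptable for your data.

```python
import pandas as pd

# Made-up rows: horsepower arrives as strings because of the "?" marker.
raw = pd.DataFrame({"horsepower": ["130", "?", "150"],
                    "weight": [3504, 2046, 3436]})

# Anything that cannot be parsed as a number (here "?") becomes NaN.
raw["horsepower"] = pd.to_numeric(raw["horsepower"], errors="coerce")

# Drop the rows that now contain NaN.
clean = raw.dropna()

print(clean.dtypes)
print(len(clean))  # 2 rows remain
```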

Split Data

Now we can split our data into a training and test set:

In [5]:

X = df.drop('mpg', axis=1)
y = df[['mpg']]

from sklearn.model_selection import train_test_split

# Split X and y into training and test sets (75% / 25%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
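To see what `test_size` and `random_state` do, here is a quick sketch on made-up arrays (`X_toy` and `y_toy` are hypothetical names): a quarter of the rows go to the test set, and fixing `random_state` makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 made-up observations with 2 features each.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1)
print(X_tr.shape, X_te.shape)  # (15, 2) (5, 2)

# Same random_state, same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1)
print(np.array_equal(X_te, X_te2))  # True
```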

Train Model

We train our LinearRegression model using the training set of data.

In [6]:

from sklearn.linear_model import LinearRegression

regression_model = LinearRegression()
regression_model.fit(X_train, y_train)

Out[6]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Now that our model is trained, we can view the coefficients of the model using regression_model.coef_. Because we passed y as a single-column DataFrame, this is a 2D array with one row of coefficients, which is why we index it with [0] below.

In [7]:

for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, regression_model.coef_[0][idx]))

The coefficient for cylinders is -0.2463375587
The coefficient for displacement is 0.0238703383071
The coefficient for horsepower is -0.00601723861777
The coefficient for weight is -0.0073364329439
The coefficient for acceleration is 0.218977781041
The coefficient for year is 0.785180107278
The coefficient for origin_america is -1.76249340922
The coefficient for origin_asia is 0.809626919086
The coefficient for origin_europe is 0.952866490134
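The extra [0] index comes from the shape of the target. A minimal sketch on made-up data (all names here are hypothetical) shows how the shape of `coef_` follows the shape of y:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: 4 observations, 2 features.
X_toy = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 1.0]])
y_1d = np.array([1.0, 2.0, 5.0, 4.0])   # shape (4,)
y_2d = y_1d.reshape(-1, 1)              # shape (4, 1), like df[['mpg']]

model_1d = LinearRegression().fit(X_toy, y_1d)
model_2d = LinearRegression().fit(X_toy, y_2d)

print(model_1d.coef_.shape)  # (2,)
print(model_2d.coef_.shape)  # (1, 2)
```

Both fits give the same coefficient values; only the array layout differs.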

Similarly, regression_model.intercept_ holds the intercept ($\beta_0$) as a one-element array, one entry per target column.

In [8]:

intercept = regression_model.intercept_[0]

print("The intercept for our model is {}".format(intercept))

The intercept for our model is -19.8091838488

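So, with $X_1, \dots, X_9$ denoting the features in the order listed above, we can write our linear model (coefficients rounded to two decimal places) as:

$ Y = -19.81 - 0.25 \times X_1 + 0.02 \times X_2 - 0.01 \times X_3 - 0.01 \times X_4 + 0.22 \times X_5 + 0.79 \times X_6 - 1.76 \times X_7 + 0.81 \times X_8 + 0.95 \times X_9 $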

Note that we haven’t done any feature scaling, so the features are not on the same scale and the size of each coefficient depends on the units of its feature. We therefore can’t read the relative importance of the features from these coefficients.
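To illustrate the point about scale, here is a sketch on synthetic data (not the car data set; every name and number is made up for the example). It standardises the features with scikit-learn's `StandardScaler` before fitting, which is one common way to put coefficients on a comparable footing, not something this post's model does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (made-up values).
rng = np.random.default_rng(0)
weight = rng.normal(3000.0, 800.0, size=200)
acceleration = rng.normal(15.0, 3.0, size=200)
X_toy = np.column_stack([weight, acceleration])
y_toy = 40.0 - 0.006 * weight + 0.2 * acceleration

# Raw fit: the weight coefficient looks tiny next to acceleration's.
raw = LinearRegression().fit(X_toy, y_toy).coef_

# Standardised fit: each coefficient is now "change in y per standard deviation".
X_scaled = StandardScaler().fit_transform(X_toy)
scaled = LinearRegression().fit(X_scaled, y_toy).coef_

print(raw)
print(scaled)
```

On the raw scale acceleration has the larger coefficient, but after standardising, weight has by far the larger effect per standard deviation.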

Scoring Model

A common method of measuring the accuracy of regression models is to use the $R^2$ statistic.

The $ R^2 $ statistic is defined as follows:

$ R^2 = 1 - \dfrac{RSS}{TSS} $

The RSS (residual sum of squares) measures the variability left unexplained after performing the regression.
The TSS (total sum of squares), $ \sum_{i=1}^{n} (y_i - \bar{y})^2 $, measures the total variability in Y.
Therefore the $R^2$ statistic measures the proportion of variability in Y that is explained by X using our model.

$R^2$ can be determined using our test set and the model’s score method.

In [9]:

regression_model.score(X_test, y_test)

Out[9]:

0.82852313164597702

So in our model, 82.85% of the variability in mpg in the test set can be explained using X.
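As a sanity check on the definition, the sketch below computes $R^2$ by hand from RSS and TSS on a few made-up numbers and compares it with scikit-learn's `r2_score`. The arrays are hypothetical and unrelated to the car data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up ground truth and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

rss = np.sum((y_true - y_pred) ** 2)          # 0.25 + 0.25 + 0 + 0.25 = 0.75
tss = np.sum((y_true - y_true.mean()) ** 2)   # 9 + 1 + 1 + 9 = 20
r2_manual = 1 - rss / tss

print(r2_manual)                  # 0.9625
print(r2_score(y_true, y_pred))   # same value
```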

We can also get the mean squared error using scikit-learn’s mean_squared_error function. It compares the ground truth for the test set (data not used for training) with the model’s predictions for that set:

In [10]:

from sklearn.metrics import mean_squared_error

y_predict = regression_model.predict(X_test)

regression_model_mse = mean_squared_error(y_test, y_predict)

regression_model_mse

Out[10]:

12.230963834602681

In [11]:

import math

math.sqrt(regression_model_mse)

Out[11]:

3.497279490490098

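This is the root mean squared error (RMSE), so our predictions on the test set are typically about 3.50 mpg away from the ground truth mpg. Because the errors are squared before averaging, large errors count for more than they would in a plain average of the absolute errors.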
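To see how the root of the mean squared error differs from a plain average of the absolute errors, here is a small sketch on made-up numbers. It uses `mean_absolute_error` from scikit-learn for comparison, which this post does not otherwise use.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values: three small errors (1, 1, 0) and one large one (6).
y_true = np.array([20.0, 25.0, 30.0, 35.0])
y_pred = np.array([21.0, 24.0, 30.0, 41.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sqrt((1 + 1 + 0 + 36) / 4)
mae = mean_absolute_error(y_true, y_pred)           # (1 + 1 + 0 + 6) / 4

print(rmse)  # about 3.08
print(mae)   # 2.0
```

The single large error pulls the root mean squared error well above the mean absolute error.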

Making Predictions

We can use our model to predict the miles per gallon for another, unseen car. Let’s give it a go on the following:

Cylinders – 4
Displacement – 121
Horsepower – 110
Weight – 2800
Acceleration – 15.4
Year – 81
Origin – Asia

In [12]:

regression_model.predict([[4, 121, 110, 2800, 15.4, 81, 0, 1, 0]])

Out[12]:

array([[ 28.6713418]])

These are the specifications of a Saab 900s, and the prediction is reasonably close to the car’s actual figure of 26 mpg.
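Passing a bare list works, but the values must be in exactly the same order as the training columns. A slightly safer pattern, sketched below on a made-up two-feature model (all names and numbers are hypothetical), is to build a one-row DataFrame and reorder it to match the training columns before predicting.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up training data where mpg = 50 - 0.01 * weight + 0.1 * year exactly.
train = pd.DataFrame({
    "weight": [2000, 2500, 3000, 3500],
    "year": [70, 75, 80, 82],
    "mpg": [37.0, 32.5, 28.0, 23.2],
})
X_toy = train[["weight", "year"]]
y_toy = train["mpg"]
model = LinearRegression().fit(X_toy, y_toy)

# New car given as named values, deliberately in a different order.
new_car = pd.DataFrame([{"year": 81, "weight": 2800}])
new_car = new_car[X_toy.columns]  # reorder to match the training columns

pred = model.predict(new_car)
print(pred)  # approximately [30.1]
```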