Linear Regression in Python using scikit-learn
In this post, we’ll be exploring Linear Regression using scikit-learn in Python.
We will use the physical attributes of a car to predict its miles per gallon (mpg).
Linear regression produces a model in the form:
$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p $
The coefficients are estimated by minimising the residual sum of squares (RSS) over the $n$ training observations, given by the equation below:
$ RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2 $
$ RSS = \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \dots - \hat{\beta}_p x_{ip})^2 $
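To make the RSS formula concrete, here is a minimal sketch that computes it with NumPy on a small made-up dataset (the numbers and the helper name `rss` are ours, purely for illustration). It fits the coefficients by least squares and checks that nudging one of them makes the RSS worse.

```python
import numpy as np

# Toy data (made up for illustration): 5 observations, 2 features.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([6.0, 6.5, 12.0, 12.5, 16.0])


def rss(beta, X, y):
    """Residual sum of squares for beta = [beta_0, beta_1, ..., beta_p]."""
    y_hat = beta[0] + X @ beta[1:]
    return float(np.sum((y - y_hat) ** 2))


# Prepend a column of ones so that beta_0 plays the role of the intercept,
# then solve the least-squares problem.
X_design = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

rss_best = rss(beta_hat, X, y)
# Moving any coefficient away from the least-squares solution increases the RSS.
rss_perturbed = rss(beta_hat + np.array([0.0, 0.1, 0.0]), X, y)
print(rss_best, rss_perturbed)
```

This is only meant to illustrate the objective; it is not a claim about how scikit-learn solves the problem internally.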
Scikit-learn provides a LinearRegression class for doing this job.
Let’s start by loading our data.
import pandas as pd
df = pd.read_csv('data/auto-mpg.csv')
print(df.head())
We don’t need the name column, so let’s remove it.
df = df.drop('name', axis=1)
Also note that the "origin" column records where the car came from. It is a nominal (unordered) categorical variable, so we need to create binary dummy variables for it (see this post for more details).
df['origin'] = df['origin'].replace({1: 'america', 2: 'europe', 3: 'asia'})
df = pd.get_dummies(df, columns=['origin'])
print(df.head())
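To see what the dummy encoding produces, here is a small sketch on a made-up three-row frame (the `toy` data is hypothetical, not the real dataset). The order of the resulting columns matters later, because a feature vector passed to the model has to list its values in the same order.

```python
import pandas as pd

# Hypothetical mini-frame with one car per origin code.
toy = pd.DataFrame({"mpg": [18.0, 26.0, 31.0],
                    "origin": [1, 2, 3]})
toy["origin"] = toy["origin"].replace({1: "america", 2: "europe", 3: "asia"})
toy = pd.get_dummies(toy, columns=["origin"])

# get_dummies orders the new columns by sorted category name.
dummy_columns = [c for c in toy.columns if c.startswith("origin_")]
print(dummy_columns)  # ['origin_america', 'origin_asia', 'origin_europe']
```

Each row has exactly one of the three indicator columns switched on.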
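There are some missing values for horsepower, denoted by question marks, so we’ll need to remove these rows.
import numpy as np
df = df.replace('?', np.nan)
df = df.dropna()
# horsepower was read as text because of the '?' entries, so convert it to numbers
df['horsepower'] = df['horsepower'].astype(float)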
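As an aside, the same clean-up can be sketched with `pd.to_numeric`, which turns anything unparseable into NaN in one step. The tiny frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical column mixing numeric strings with a '?' placeholder.
toy = pd.DataFrame({"horsepower": ["130", "?", "95"],
                    "mpg": [18.0, 25.0, 24.0]})
print(toy["horsepower"].dtype)  # object: the column was read as text

# errors="coerce" replaces unparseable entries such as '?' with NaN.
toy["horsepower"] = pd.to_numeric(toy["horsepower"], errors="coerce")
toy = toy.dropna()
print(toy["horsepower"].dtype)  # float64
```

Either approach should leave a purely numeric column, which is what the model needs.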
Split Data
Now we can split our data into a training and test set:
X = df.drop('mpg', axis=1)
y = df[['mpg']]
from sklearn.model_selection import train_test_split
# Split X and y into training (75%) and test (25%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
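If it helps to see what `test_size` and `random_state` do, here is a minimal sketch on made-up arrays (the `_toy` names are ours). A quarter of the rows end up in the test set, and reusing the same seed reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 made-up observations with 2 features each.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1)
print(X_tr.shape, X_te.shape)  # (15, 2) (5, 2)

# Same random_state, same split: useful for reproducible results.
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1)
```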
Train Model
We train our LinearRegression model using the training set of data.
from sklearn.linear_model import LinearRegression
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)
Now that our model is trained, we can view the coefficients of the model using regression_model.coef_, which here is a 2D NumPy array with one row of coefficients per target column (we only have one target, mpg).
for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, regression_model.coef_[0][idx]))
regression_model.intercept_ contains an array of intercepts ($\beta_0$ values), one per target column.
intercept = regression_model.intercept_[0]
print("The intercept for our model is {}".format(intercept))
So we can write our linear model as:
$ Y = -19.81 - 0.25 \times X_1 + 0.02 \times X_2 - 0.01 \times X_3 - 0.01 \times X_4 + 0.22 \times X_5 + 0.78 \times X_6 - 1.76 \times X_7 + 0.81 \times X_8 + 0.95 \times X_9 $
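A fitted linear model is nothing more than this formula. The sketch below, on synthetic data with made-up true coefficients, rebuilds the predictions by hand from `intercept_` and `coef_` and compares them with `predict`. Note that the target here is one-dimensional, so `coef_` is a 1D array rather than the 2D array we get above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 50 observations, 3 features, known (made-up) coefficients.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = 2.0 + X_toy @ np.array([1.5, -0.5, 3.0]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X_toy, y_toy)

# Y = beta_0 + beta_1 * X_1 + ... + beta_p * X_p, written as a matrix product.
manual = model.intercept_ + X_toy @ model.coef_
print(np.allclose(manual, model.predict(X_toy)))  # True
```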
Note that, because we’ve not done any feature scaling or dimensionality reduction, we can’t say anything about the relative importance of each of our features given these coefficients because the features are not of the same scale.
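To illustrate why scale matters, here is a rough sketch on synthetic data (everything in it is made up). Two features contribute about equally to the target per standard deviation, but one is measured on a scale a thousand times larger. The raw coefficients differ by orders of magnitude, while the coefficients fitted on standardised features are comparable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
small = rng.normal(scale=1.0, size=200)      # feature on a small scale
large = rng.normal(scale=1000.0, size=200)   # feature on a large scale
X_toy = np.column_stack([small, large])
# Each feature moves y by roughly 3 units per standard deviation.
y_toy = 3.0 * small + 0.003 * large + rng.normal(scale=0.1, size=200)

raw_coef = LinearRegression().fit(X_toy, y_toy).coef_

# Standardise each feature to zero mean and unit variance before fitting.
X_scaled = StandardScaler().fit_transform(X_toy)
scaled_coef = LinearRegression().fit(X_scaled, y_toy).coef_

print(raw_coef)     # very different magnitudes
print(scaled_coef)  # similar magnitudes
```

This is one possible way to get comparable coefficients; it is not something the model above does for us.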
Scoring Model
A common method of measuring the accuracy of regression models is to use the $R^2$ statistic.
The $ R^2 $ statistic is defined as follows:
$ R^2 = 1 - \dfrac{RSS}{TSS} $
- The RSS (residual sum of squares) measures the variability left unexplained after performing the regression
- The TSS (total sum of squares) measures the total variability in Y
- Therefore the $R^2$ statistic measures the proportion of variability in Y that is explained by X using our model
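The definition above can be checked by hand. This small sketch uses made-up true and predicted values, computes RSS and TSS directly, and compares the result with scikit-learn’s `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

rss = np.sum((y_true - y_pred) ** 2)         # unexplained variability: 1.5
tss = np.sum((y_true - y_true.mean()) ** 2)  # total variability: 20.0
r2_manual = 1 - rss / tss

print(r2_manual)                 # 0.925
print(r2_score(y_true, y_pred))  # same value
```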
$R^2$ can be determined using our test set and the model’s score method.
regression_model.score(X_test, y_test)
So in our model, 82.85% of the variability in Y can be explained using X.
We can also get the mean squared error using scikit-learn’s mean_squared_error function, which compares the predictions for the test set (data not used for training) with the ground truth for the test set:
from sklearn.metrics import mean_squared_error
y_predict = regression_model.predict(X_test)
regression_model_mse = mean_squared_error(y_test, y_predict)
regression_model_mse
import math
math.sqrt(regression_model_mse)
So the root mean squared error is about 3.50 mpg: our predictions on the test set are typically around 3.5 mpg away from the ground truth.
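Strictly speaking, the root mean squared error is not the same as the average absolute miss, because squaring gives large errors more weight. A quick sketch on made-up numbers shows the difference between the two.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up mpg values: three good predictions and one that is 6 mpg off.
y_true = np.array([20.0, 25.0, 30.0, 35.0])
y_pred = np.array([21.0, 24.0, 30.0, 41.0])

mse = mean_squared_error(y_true, y_pred)   # (1 + 1 + 0 + 36) / 4 = 9.5
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)  # (1 + 1 + 0 + 6) / 4 = 2.0

print(rmse, mae)  # the RMSE is pulled up by the single large error
```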
Making Predictions
We can use our model to predict the miles per gallon for another, unseen car. Let’s give it a go on the following:
- Cylinders – 4
- Displacement – 121
- Horsepower – 110
- Weight – 2800
- Acceleration – 15.4
- Year – 81
- Origin – Asia
regression_model.predict([[4, 121, 110, 2800, 15.4, 81, 0, 1, 0]])
The attributes above are those of a Saab 900s, and the prediction turns out to be quite close to this car’s actual 26 mpg.
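Passing a bare list means we have to remember the exact column order used for training. One way to make this less error-prone, sketched here on a made-up two-feature model (the `train` frame and its values are hypothetical), is to build the new observation as a DataFrame and reorder its columns to match the training data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data with just two features.
train = pd.DataFrame({
    "weight": [2000, 2500, 3000, 3500, 4000, 2200],
    "year": [70, 72, 75, 78, 80, 82],
    "mpg": [30.0, 27.0, 24.0, 21.5, 19.0, 33.0],
})
X_toy = train[["weight", "year"]]
y_toy = train["mpg"]
model = LinearRegression().fit(X_toy, y_toy)

# The new car is described by name, in whatever order is convenient...
new_car = pd.DataFrame([{"year": 81, "weight": 2800}])
# ...and then reordered to match the columns the model was trained on.
new_car = new_car[X_toy.columns]
pred_named = model.predict(new_car)

# Same prediction as listing the values in training order by hand.
pred_list = model.predict(pd.DataFrame([[2800, 81]], columns=X_toy.columns))
print(pred_named, pred_list)
```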