Splitting Data for Machine Learning with scikit-learn

scikit-learn provides a helpful function for partitioning data, train_test_split, which splits your data into a training set and a test set.

In this post we’ll show how it works.

We’ll create some fake data and then split it up into test and train.

Let’s imagine our data is modelled as follows:

$$ y_i = \begin{cases} 1 & \quad \text{if } x_{0,i} + x_{1,i} \leq 10 \\ 0 & \quad \text{otherwise} \end{cases} $$

In [1]:
import pandas as pd
import numpy as np

np.random.seed(10)
X = pd.DataFrame({
    'x_0': np.random.randint(1, 11, 20),
    'x_1': np.random.randint(1, 11, 20)
})
y = (X['x_0'] + X['x_1']).map(lambda x: 1 if x <= 10 else 0)

So we start with 20 samples, each with two features drawn uniformly from the integers 1 through 10; this is our feature set.
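As an aside, the map/lambda above can also be written as a vectorized comparison, which is the more idiomatic pandas style:

```python
import numpy as np
import pandas as pd

np.random.seed(10)
X = pd.DataFrame({
    'x_0': np.random.randint(1, 11, 20),
    'x_1': np.random.randint(1, 11, 20)
})

# Vectorized equivalent of the map/lambda: compare the row sums
# against the threshold, then cast the boolean Series to int.
y = (X['x_0'] + X['x_1'] <= 10).astype(int)
print(y.head())
```

Both versions produce the same labels.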

In [2]:
print(X.head())
   x_0  x_1
0   10    5
1    5    2
2    1    4
3    2    7
4   10    6

Our response is the class label (0 or 1).

In [3]:
print(y.head())
0    0
1    1
2    1
3    1
4    0
dtype: int64

From these, we want a training set and a test set, which is exactly what train_test_split gives us.

We provide the proportion of data to hold out as the test set (test_size), and optionally random_state, a seed that makes the split repeatable.

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
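Conceptually, the call shuffles the row indices with the given seed and then carves off test_size of them for the test set. Here's a rough numpy-only sketch of that idea (a simplification, not sklearn's actual implementation):

```python
import numpy as np

def rough_split(X, y, test_size=0.25, seed=1):
    # Shuffle all row indices with a fixed seed, then take the
    # first `test_size` fraction as the test set and the rest
    # as the training set.
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# 20 two-feature samples, labelled the same way as in the post
X = np.arange(40).reshape(20, 2)
y = (X.sum(axis=1) <= 10).astype(int)
X_train, X_test, y_train, y_test = rough_split(X, y)
print(len(X_train), len(X_test))  # 15 5
```

The key point is that every row lands in exactly one of the two sets.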
In [5]:
len(X_train)
Out[5]:
15
In [6]:
len(y_train)
Out[6]:
15
In [7]:
len(X_test)
Out[7]:
5
In [8]:
len(y_test)
Out[8]:
5
In [9]:
X_test
Out[9]:
    x_0  x_1
3     2    7
16    7    9
6     2   10
10    9   10
2     1    4
In [10]:
y_test
Out[10]:
3     1
16    0
6     0
10    0
2     1
dtype: int64
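One caveat worth knowing: on a small dataset like this, a purely random split can leave the test set with class proportions quite different from the full data. For classification problems, train_test_split accepts a stratify parameter that preserves the class proportions in both halves:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

np.random.seed(10)
X = pd.DataFrame({
    'x_0': np.random.randint(1, 11, 20),
    'x_1': np.random.randint(1, 11, 20)
})
y = (X['x_0'] + X['x_1']).map(lambda x: 1 if x <= 10 else 0)

# stratify=y keeps the 0/1 mix in the test set close to the
# mix in the full dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)
print(y_test.value_counts())
```

With only 20 samples the effect is modest, but on imbalanced data stratifying the split can matter a lot.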