Splitting Data for Machine Learning with scikit-learn
scikit-learn provides a helpful function for partitioning data, train_test_split, which splits your data into a training set and a test set.
In this post we’ll show how it works.
We’ll create some fake data and then split it into training and test sets.
Let’s imagine our data is modelled as follows:
$$ y_i = \begin{cases} 1 & \quad \text{if } x_0 + x_1 \leq 10 \\ 0 & \quad \text{otherwise} \end{cases} $$
import pandas as pd
import numpy as np

# Fix the random seed so the fake data is reproducible
np.random.seed(10)

# 20 samples with two integer features, each drawn uniformly from 1 to 10
X = pd.DataFrame({
    'x_0': np.random.randint(1, 11, 20),
    'x_1': np.random.randint(1, 11, 20)
})

# The label is 1 when the feature sum is at most 10, and 0 otherwise
y = (X['x_0'] + X['x_1']).map(lambda x: 1 if x <= 10 else 0)
So we start with 20 samples of random integers between 1 and 10; this is our feature set.
print(X.head())
Our response is the class label, either 0 or 1:
print(y.head())
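As a quick aside (an illustrative check, not part of the original walkthrough), we can look at how the two classes are balanced before splitting:

# Count how many samples fall into each class
print(y.value_counts())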
From these, we want to get a training set and a test set, which is where train_test_split comes in. We pass the proportion of the data to hold out as a test set, and we can also pass the parameter random_state, a seed that ensures repeatable results.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
# With 20 samples and test_size=0.25 we expect 15 training and 5 test samples
print(len(X_train))  # 15
print(len(y_train))  # 15
print(len(X_test))   # 5
print(len(y_test))   # 5

# The held-out features and their labels
print(X_test)
print(y_test)
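To see what random_state buys us, here is a small sketch (an illustrative check, not from the original post): splitting the same data twice with the same seed yields identical partitions.

from sklearn.model_selection import train_test_split

# Two splits with the same random_state produce exactly the same partition
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X, y, test_size=0.25, random_state=1)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X, y, test_size=0.25, random_state=1)

print(X_test_a.equals(X_test_b))  # True
print(y_test_a.equals(y_test_b))  # True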