Random Forests in Python using scikit-learn
In this post we’ll use the Parkinson’s data set, available from the UCI Machine Learning Repository, to predict Parkinson’s status from a set of potential predictors using random forests.
Decision trees are a great tool, but unless pruned effectively they tend to overfit the training data, which hinders their predictive power.
Random forests are an ensemble model built from many decision trees. Each tree sees only part of the data and part of the feature space, so individual trees specialise while the ensemble as a whole still covers every feature.
Each tree in the random forest is trained on a random sample of the data drawn with replacement, a technique known as bootstrap aggregation (bagging); the samples a given tree never sees are its ‘out-of-bag’ samples. Additionally, each tree considers only a random subset of the features at each node split (feature bagging), which lessens the influence of any single feature that is highly correlated with the response.
While an individual tree might be sensitive to outliers, the ensemble model will likely not be.
The model predicts new labels by taking a majority vote from each of its trees given a new observation.
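The out-of-bag idea above can be seen directly in scikit-learn: with oob_score=True, each tree is scored on the samples left out of its bootstrap draw, giving a free validation estimate. A minimal sketch on synthetic data (not the Parkinson’s set):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=1)

# oob_score=True evaluates each tree on the samples left out of its
# bootstrap sample; max_features='sqrt' (the default) is the feature
# bagging done at each split
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=1)
forest.fit(X_demo, y_demo)

# Accuracy estimated from out-of-bag samples, no separate test set needed
print(forest.oob_score_)
```

The out-of-bag score is often a reasonable stand-in for a held-out test score when data is scarce.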
We’ll start by loading up our data
import pandas as pd
df = pd.read_csv('data/parkinsons.data')
print(df.head())
We don’t want the name feature in our DataFrame, so we’ll drop it and split our data into features and labels.
# Drop the label column and the non-numeric name column
X = df.drop(['status', 'name'], axis=1)
y = df['status']
Now we can split the data into a training and test set of data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
We now create and train our model. The number of estimators (n_estimators) sets how many trees the forest contains, and the random_state is given for reproducibility.
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=30, max_depth=10, random_state=1)
random_forest.fit(X_train, y_train)
Now we evaluate our model on our test set.
from sklearn.metrics import accuracy_score
y_predict = random_forest.predict(X_test)
print(accuracy_score(y_test, y_predict))
from sklearn.metrics import confusion_matrix
print(pd.DataFrame(
    confusion_matrix(y_test, y_predict),
    columns=['Predicted Healthy', 'Predicted Parkinsons'],
    index=['True Healthy', 'True Parkinsons']
))
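The UCI Parkinson’s set is imbalanced (most recordings are from Parkinson’s patients), so accuracy alone can flatter the model; per-class precision and recall give a fuller picture. A sketch with made-up labels standing in for y_test and y_predict:

```python
from sklearn.metrics import classification_report

# Made-up labels for illustration; in the post these would be
# y_test and y_predict from the fitted random forest
y_true = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1]

# Precision, recall and F1 for each class
report = classification_report(y_true, y_pred,
                               target_names=['healthy', 'parkinsons'])
print(report)
```

A model that always predicts ‘parkinsons’ would score well on accuracy here but would show zero recall on the healthy class.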
Our random forest performs well on this small data set. We would need more data, and more domain knowledge, to evaluate this model thoroughly.
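One way to squeeze more evaluation out of a small data set is k-fold cross-validation, which averages over several train/test splits instead of trusting a single one. A minimal sketch on stand-in data (the real X and y would come from the CSV loaded above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data with roughly the shape of the Parkinson's set
# (195 recordings, 22 voice features); for illustration only
X_demo, y_demo = make_classification(n_samples=195, n_features=22, random_state=1)

model = RandomForestClassifier(n_estimators=30, max_depth=10, random_state=1)

# Five train/evaluate rounds, each holding out a different fifth of the data
scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(scores.mean(), scores.std())
```

The spread of the five scores gives a sense of how sensitive the model is to which samples land in the test fold.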