Random Forests in Python using scikit-learn
In this post we’ll be using the Parkinson’s data set available from UCI here to predict Parkinson’s status from potential predictors using Random Forests.
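Before modelling, the data needs to be loaded and split into predictors and the `status` label. As a minimal sketch, the inline two-row sample below (with illustrative values) stands in for the downloaded `parkinsons.data` file, which contains a `name` identifier, numeric voice measurements, and a binary `status` column:

```python
import io
import pandas as pd

# Two illustrative rows standing in for the downloaded parkinsons.data file;
# the real file has many more numeric voice-measure columns plus 'name' and 'status'.
sample = io.StringIO(
    "name,MDVP:Fo(Hz),MDVP:Fhi(Hz),status\n"
    "phon_R01_S01_1,119.992,157.302,1\n"
    "phon_R01_S07_1,197.076,206.896,0\n"
)

df = pd.read_csv(sample)

# 'status' is the label (1 = Parkinson's, 0 = healthy); drop the non-numeric
# 'name' identifier before modelling.
X = df.drop(columns=["name", "status"])
y = df["status"]
print(X.shape, y.tolist())
```

With the real file, the same `drop`/select step yields the full predictor matrix and label vector.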
Decision trees are a great tool, but unless pruned effectively they often overfit the training data, hindering their predictive performance.
Random forests are an ensemble model of many decision trees, in which each tree sees only a random slice of the data and features, so that no single tree dominates while the ensemble as a whole covers all the features.
Each tree in the random forest is trained on its own bootstrap sample of the data (drawn with replacement), a procedure known as bootstrap aggregation, or 'bagging'; the samples not drawn for a given tree are known as its 'out-of-bag' samples. Additionally, each tree performs feature bagging at each node split, considering only a random subset of the features, which lessens the influence of any single feature that is highly correlated with the response.
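Both ideas map directly onto parameters of scikit-learn's `RandomForestClassifier`. A minimal sketch, using synthetic data from `make_classification` as a stand-in for the Parkinson's predictors (an assumption; the real post uses the UCI data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the Parkinson's predictors (22 features, binary label).
X, y = make_classification(n_samples=200, n_features=22, random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,      # number of bootstrap-sampled trees
    max_features="sqrt",   # feature bagging: features considered at each split
    oob_score=True,        # score each tree on its out-of-bag samples
    random_state=0,
)
clf.fit(X, y)

# The out-of-bag score is a built-in estimate of generalisation accuracy,
# obtained without setting aside a separate test set.
print(clf.oob_score_)
```

Because every tree has its own out-of-bag samples, `oob_score_` gives a free estimate of test accuracy on data each tree never saw.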
While an individual tree might be sensitive to outliers, the ensemble model will likely not be.
Given a new observation, the model predicts its label by taking a majority vote across all of its trees.
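The majority vote can be reproduced by hand from the fitted sub-trees in `clf.estimators_`. A small sketch, again on synthetic stand-in data; with scikit-learn's default fully grown trees the leaves are pure, so the library's probability averaging coincides with a hard majority vote (an odd `n_estimators` avoids ties):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; 25 trees (odd) so a binary vote cannot tie.
X, y = make_classification(n_samples=200, n_features=22, random_state=0)
clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Collect every individual tree's prediction for each observation.
votes = np.array([tree.predict(X) for tree in clf.estimators_])

# Majority vote per observation: the class chosen by more than half the trees.
majority = (votes.mean(axis=0) > 0.5).astype(int)

# This manual vote matches the forest's own predictions.
print((majority == clf.predict(X)).all())
```

This is also why the ensemble shrugs off outliers: an outlier can sway the handful of trees whose bootstrap samples contain it, but it rarely sways the vote.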