Notes on Random Forests

I recently used k-dimensional trees to partition a very high-dimensional space and find the nearest neighbors for a given vector. I was curious what else these trees can be used for in machine learning. This led me to random forests, and I read a couple of blog posts to learn more about them. They seem like a really useful and robust way to approach classification problems, so I've jotted down a quick summary of decision trees and random forests. Note that this isn't a complete description of how these methods are defined mathematically; instead, these are quick refresher notes for working with these techniques. Without further ado, here they are!

Trees on my hike in Santa Cruz

Decision trees

  • they divide n-dimensional feature space into regions
  • we can use these trees for different types of prediction problems by finding which region a test feature lies in
    • regression: assign mean of all train feature labels in that region
    • classification: assign mode of all train feature labels in that region
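A minimal sketch of both prediction modes with scikit-learn (the data here is made up for illustration): a depth-limited tree partitions the feature space into regions, then predicts the mean (regression) or mode (classification) of the training labels in whichever region the test point falls.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))                 # 2-D feature space
y_reg = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)  # continuous labels
y_clf = (X[:, 0] * X[:, 1] > 0).astype(int)           # class labels

reg = DecisionTreeRegressor(max_depth=3).fit(X, y_reg)
clf = DecisionTreeClassifier(max_depth=3).fit(X, y_clf)

x_test = np.array([[1.0, -2.0]])
print(reg.predict(x_test))  # mean of the train labels in x_test's region
print(clf.predict(x_test))  # mode of the train labels in x_test's region
```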

Algorithm choices

  • regression: choose splits that minimize a cost function, here the sum of squared errors
  • classification: different cost functions can be chosen for minimization — error rate, Gini index, cross-entropy error
    • the latter two are better suited to optimization algorithms since they are smooth (easier to differentiate), unlike the error rate
  • too many splits can lead to overfitting (high variance): the tree fits the training data too closely (low bias)
  • can prevent by doing one of the following and then cross-validating:
    • setting minimum number of training examples in each region
    • setting maximum number of splits of the tree
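These cost functions can be compared directly. A small sketch (the class proportions are made up) computing the error rate, Gini index, and cross-entropy for a single region:

```python
import numpy as np

def impurities(p):
    """Impurity measures for a class-probability vector p in one region."""
    p = np.asarray(p, dtype=float)
    error_rate = 1.0 - p.max()                       # misclassification error
    gini = 1.0 - np.sum(p ** 2)                      # Gini index
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))  # cross-entropy
    return error_rate, gini, entropy

print(impurities([1.0, 0.0]))  # a pure region: all three measures vanish
print(impurities([0.5, 0.5]))  # a maximally mixed region is penalized most
```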
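Both complexity controls above map to scikit-learn parameters and can be tuned by cross-validation; a sketch with assumed toy data, where min_samples_leaf sets the minimum training examples per region and max_depth caps the number of splits:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)

# Cross-validate over tree-complexity settings to guard against overfitting.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8, None],
                "min_samples_leaf": [1, 5, 20]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)  # the complexity settings that generalize best
```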

Ensemble methods

  • bagging/bootstrap aggregation: sample a dataset of size N with replacement from the training set (so each dataset can contain multiple copies of a training example), repeat to get k such sets, then fit a tree to each and average over the k trees
  • random forests are an example of a bagged tree model, where again the bootstrapping + averaging is performed, but with a limit on how the trees split
    • this is done by choosing a set number of random features over which to split at each level
    • this decorrelates the trees, since different features are used to make decision splits in different trees
    • the number of random features can be optimized by using cross validation as long as we make sure that the features used have at least some predictive power
    • another thing to keep in mind: the trees + their predictions should be relatively uncorrelated with each other
    • since random forests consist of large number of relatively uncorrelated trees, this ensemble cancels out the error in individual trees and performs better than a single tree
  • boosting: fit shallow trees (to avoid overfitting) to the full data set, where each subsequent tree is fit to the errors of the ensemble so far; this process is repeated m times
    • regression: each new tree is fit to the residuals r = y_actual − F_prev, where y_actual is the actual label and F_prev is the ensemble's current prediction; the ensemble is then updated as F_new = F_prev + alpha * tree_prediction, where alpha is the learning rate
    • a small alpha leads to better fits as learning is slower
    • classification: each new tree is fit to the same labels but with iteration- and example-dependent weights (as in AdaBoost)
    • this emphasizes training examples that earlier trees misclassified, and can help in cases with class imbalance
  • bagged trees can be constructed in parallel while boosted trees must be built sequentially; thus bagging trains faster
  • boosted ensembles tend to have lower generalization error than bagged tree ensembles; thus boosting is often the better choice for higher accuracy
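The bagging-plus-feature-subsetting idea above can be sketched with scikit-learn's RandomForestClassifier (the data here is assumed): each tree sees a bootstrap sample, and max_features limits the random subset of features considered at each split, which is what decorrelates the trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = (X[:, 0] - X[:, 3] + 0.5 * rng.normal(size=400) > 0).astype(int)

# "sqrt" considers a random subset of features per split;
# None uses all features, giving more strongly correlated trees.
for max_features in ("sqrt", None):
    forest = RandomForestClassifier(
        n_estimators=200, max_features=max_features, random_state=0)
    score = cross_val_score(forest, X, y, cv=5).mean()
    print(max_features, round(score, 3))
```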
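The regression boosting update can also be hand-rolled in a few lines (toy data assumed): each shallow tree fits the current residuals r = y_actual − F_prev, and the ensemble prediction moves by alpha times the new tree's output.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

alpha, m = 0.1, 100
F = np.zeros_like(y)  # the ensemble's running prediction, starting at zero
trees = []
for _ in range(m):
    residuals = y - F                 # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F += alpha * tree.predict(X)      # slow learning: small alpha
    trees.append(tree)

print(np.mean((y - F) ** 2))  # training MSE shrinks as m grows
```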

Advantages of trees and random forests

  • trees handle non-linear decision boundaries naturally because of how they're constructed; there's no need to apply non-linear transformations to the input features before classification, which removes one source of overfitting
  • they are insensitive to feature scaling and fairly robust to noisy or missing data
  • there aren't many hyper-parameters that need to be tuned for trees or forests
  • with ensemble methods, we can also average over many trees to reduce variance and hence overfitting
The view from LACMA in Los Angeles