# Notes on Random Forests

I recently used k-dimensional trees to partition a very high-dimensional space and find the nearest neighbors of a given vector. I was curious about what else tree-based methods can do in machine learning, which led me to random forests, and I read a couple of blog posts to learn more about them. They seem like a really useful and robust way to approach classification problems, so I’ve jotted down a quick summary of decision trees and random forests. Note that this isn’t a complete description of how these methods are defined mathematically; instead, these are quick refresher notes for working with these techniques. Without further ado, here they are!

### Decision trees

• they divide n-dimensional feature space into regions
• we can use these trees for different types of prediction problems by finding which region a test point falls in
• regression: assign mean of all train feature labels in that region
• classification: assign mode of all train feature labels in that region
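A minimal sketch of the region-based predictions above, assuming a hypothetical one-feature “tree” with a single split at 0.5; the datasets and function names are illustrative, not from any library:

```python
from statistics import mean, mode

# Toy (feature, label) datasets for the two prediction problems:
train_reg = [(0.2, 1.0), (0.4, 1.2), (0.6, 3.0), (0.9, 3.4)]
train_clf = [(0.2, 'a'), (0.4, 'a'), (0.6, 'b'), (0.9, 'b')]

def region_labels(x, data, threshold=0.5):
    """Labels of all training points in the same region as x."""
    side = x < threshold
    return [y for xi, y in data if (xi < threshold) == side]

def predict_regression(x, data):
    return mean(region_labels(x, data))   # mean label in the region

def predict_classification(x, data):
    return mode(region_labels(x, data))   # most common label in the region
```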

#### Algorithm choices

• regression: choose splits that minimize the cost function, here the residual sum of squares (least-squares error)
• classification: different cost functions can be chosen for minimization — error rate, Gini index, cross-entropy error
• the latter two are better suited to optimization algorithms because they have smooth curvature (easier to take derivatives)
• too many splits can lead to overfitting (high variance) because the tree fits the training data too closely (low bias)
• can prevent by doing one of the following and then cross-validating:
• setting minimum number of training examples in each region
• setting maximum number of splits of the tree
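The classification cost functions above fit in a few lines; `best_split` is a hypothetical one-feature split search that minimizes the weighted Gini index of the two resulting regions (names and toy data are illustrative):

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini index: 1 - sum_k p_k^2; zero for a pure region."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def cross_entropy(labels):
    """Cross-entropy: -sum_k p_k * log2(p_k); also zero for a pure region."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(data):
    """Scan each feature value as a candidate threshold and return the one
    minimizing the weighted Gini index of the two regions it creates."""
    best_t, best_cost = None, float("inf")
    for t, _ in data:
        left = [y for x, y in data if x < t]
        right = [y for x, y in data if x >= t]
        if not left or not right:
            continue  # skip degenerate splits with an empty region
        cost = (len(left) * gini(left) + len(right) * gini(right)) / len(data)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

toy = [(0.1, 'a'), (0.2, 'a'), (0.8, 'b'), (0.9, 'b')]
```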

### Ensemble methods

• bagging/bootstrap aggregation: sample a dataset of size N with replacement from the training set — so each dataset can contain multiple copies of a training example — to get k such sets, then fit a tree to each and average over the k trees
• random forests are an example of a bagged tree model: the same bootstrapping + averaging is performed, but with a limit on how the trees split
• this is done by choosing a fixed number of random candidate features to split on at each node
• a good call because different features end up making the decision splits in different trees
• the number of random features can be tuned by cross-validation, as long as we make sure the features used have at least some predictive power
• another thing to keep in mind: the trees and their predictions should be relatively uncorrelated with each other
• since a random forest consists of a large number of relatively uncorrelated trees, the ensemble averages out the errors of individual trees and performs better than a single tree
• boosting: fit to the full data set, but using shallow trees (to avoid overfitting), where each subsequent tree is fit to the results of the previous ones; this process is repeated m times
• regression: each new tree h is fit to the residuals r = y_actual − y_old, where y_old is the current predicted label and y_actual is the actual label; the prediction is then updated as y_new = y_old + alpha * h, where alpha is the learning rate
• a small alpha leads to better fitting because learning is slower
• classification: each new tree is fit to the same labels but with iteration- and sample-dependent weights
• this emphasizes training examples that earlier trees got wrong, and can help with class imbalance
• bagged trees can be constructed in parallel while boosted trees are sequential; thus bagging trains faster
• boosted ensembles often achieve lower generalization error than bagged ones; thus boosting is often the better choice when accuracy matters most
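A toy sketch of bagging, under heavy assumptions: the “tree” here is a stand-in with one fixed split at 0.5 (a real random forest would grow a full decision tree and also restrict the candidate features at each split), and all names and data are illustrative:

```python
import random
from statistics import mean

random.seed(0)

def bootstrap(dataset):
    """Sample len(dataset) points with replacement: one bootstrap replicate."""
    return [random.choice(dataset) for _ in dataset]

def fit_stub_tree(sample):
    """Stand-in for a fitted tree: one fixed split at 0.5, predicting the
    mean label on each side (falling back to the overall mean if a side
    received no samples in this replicate)."""
    overall = mean(y for _, y in sample)
    left = [y for x, y in sample if x < 0.5] or [overall]
    right = [y for x, y in sample if x >= 0.5] or [overall]
    return lambda x: mean(left) if x < 0.5 else mean(right)

def bagged_predict(x, dataset, k=25):
    """Fit one stub tree per bootstrap replicate and average the k predictions."""
    trees = [fit_stub_tree(bootstrap(dataset)) for _ in range(k)]
    return mean(t(x) for t in trees)

data = [(0.2, 1.0), (0.4, 1.2), (0.6, 3.0), (0.9, 3.4)]
```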
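The boosting-for-regression update can also be sketched, with one loud simplification: each stage’s “tree” is just a constant (the mean residual), the weakest possible learner, so the prediction converges toward the label mean; `boost` and its parameters are hypothetical names, not a library API:

```python
from statistics import mean

def boost(y_actual, m=50, alpha=0.1):
    """At each of m stages, fit a 'tree' to the current residuals
    r = y_actual - y_old and add alpha times its prediction to the model."""
    pred = [0.0] * len(y_actual)
    for _ in range(m):
        residuals = [ya - p for ya, p in zip(y_actual, pred)]
        step = mean(residuals)             # this stage's constant "tree"
        pred = [p + alpha * step for p in pred]
    return pred
```

With a small alpha, each stage removes only a fraction of the remaining residual, which is why learning is slower but steadier.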

### Advantages of trees and random forests

• trees handle non-linear decision boundaries naturally because of how they’re constructed; we don’t need to apply non-linear transformations to the input features before classification, which means fewer feature-engineering choices to overfit with
• they scale to large numbers of features, don’t require feature scaling, and are relatively robust to noisy or missing data
• there aren’t many hyper-parameters that need to be tuned for trees or forests
• with ensemble methods, we can also average over multiple trees to reduce variance and hence overfitting
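The variance-reduction claim can be checked numerically: averaging k (roughly) uncorrelated noisy predictors shrinks the variance toward sigma^2 / k. This is a synthetic simulation with Gaussian noise standing in for tree predictions:

```python
import random
from statistics import pvariance, mean

random.seed(1)
k, n = 25, 2000

# One noisy predictor vs. an average of k independent noisy predictors:
single = [random.gauss(0.0, 1.0) for _ in range(n)]
averaged = [mean(random.gauss(0.0, 1.0) for _ in range(k)) for _ in range(n)]

var_single = pvariance(single)     # close to sigma^2 = 1
var_avg = pvariance(averaged)      # close to sigma^2 / k = 1/25
```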