I recently used k-dimensional trees to get decision boundaries in a very high-dimensional space and find the nearest neighbors for a given vector. I was curious to know what else these trees can be used for in machine learning, which led me to random forests, so I read a couple of blog posts to learn more about them. They seem like a really useful and robust way to approach classification problems, and I’ve jotted down a quick summary of decision trees and random forests. Note that this isn’t a complete description of how these methods are defined mathematically; instead, I’m writing quick refresher notes for working with these techniques. Without further ado, here they are!
k-dimensional trees
- they recursively divide the k-dimensional feature space into regions
- we can use these trees for different types of prediction problems by finding which region a test point lies in (see the sketch after this list)
- regression: assign the mean of the training labels in that region
- classification: assign the mode (majority class) of the training labels in that region
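As a minimal sketch of this idea (my own illustration, not from any of the posts I read): scikit-learn’s `KDTree` can be queried for the nearest training points around each test point, and the mean or mode of their labels gives a regression or classification prediction. The arrays and the choice of 10 neighbors below are placeholders.

```python
# Nearest-neighbor prediction backed by a k-d tree (illustrative placeholder data).
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))                        # 200 points in a 5-d feature space
y_reg = X_train[:, 0] + rng.normal(scale=0.1, size=200)    # continuous labels for regression
y_clf = (X_train[:, 0] > 0).astype(int)                    # binary labels for classification
X_test = rng.normal(size=(10, 5))

tree = KDTree(X_train)                 # build the k-d tree once
_, idx = tree.query(X_test, k=10)      # indices of the 10 nearest training points

y_reg_pred = y_reg[idx].mean(axis=1)   # regression: mean of the neighbors' labels
y_clf_pred = np.array([np.bincount(y_clf[i]).argmax() for i in idx])  # classification: mode
```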
Algorithm choices
- regression: choose splits that minimize the cost function, typically the sum of squared errors within each region
- classification: several cost functions can be minimized: the misclassification error rate, the Gini index, or the cross-entropy
- the latter two are usually preferred because they are smooth, differentiable functions of the class proportions, which makes them easier to optimize (sketched after this list)
- too many splits can lead to overfitting (high variance) because the tree fits the training data too closely (low bias)
- this can be prevented by doing one of the following and then cross-validating (see the cross-validation sketch below):
- setting a minimum number of training examples in each region
- setting a maximum number of splits (i.e. a maximum depth) for the tree
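To make the classification cost functions above concrete, here is a short sketch (my own, with a made-up proportion vector `p`) of the three criteria computed from a node’s class proportions; the smooth behavior of the Gini index and cross-entropy is what makes them nicer to optimize than the raw error rate.

```python
# Node impurity measures as functions of the class proportions p (illustrative).
import numpy as np

def misclassification_error(p):
    # error rate if we predict the majority class: 1 - max_k p_k
    return 1.0 - np.max(p)

def gini_index(p):
    # sum_k p_k * (1 - p_k)
    return np.sum(p * (1.0 - p))

def cross_entropy(p):
    # -sum_k p_k * log(p_k), skipping empty classes to avoid log(0)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p = np.array([0.9, 0.1])   # e.g. a node that is 90% class 0, 10% class 1
print(misclassification_error(p), gini_index(p), cross_entropy(p))
```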
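And here is a hedged sketch of the cross-validation point: scikit-learn exposes the minimum-examples-per-region and maximum-depth limits as `min_samples_leaf` and `max_depth`, and a grid search with cross-validation can pick them. The dataset and grid values below are placeholders, not recommendations.

```python
# Controlling tree size and cross-validating the limits (placeholder data and grid).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_grid = {
    "min_samples_leaf": [1, 5, 10, 20],   # minimum training examples per region
    "max_depth": [2, 4, 8, None],         # cap on how deep the splits go
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```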
Ensemble methods
- bagging (bootstrap aggregation): sample a dataset of size N with replacement from the training set, so each bootstrap sample can contain multiple copies of a training example; repeat to get k such datasets, fit a tree to each, and average over the k trees (see the bagging sketch after this list)
- random forests are an example of a bagged tree model: the same bootstrapping + averaging is performed, but with a restriction on how the trees split
- this is done by considering only a fixed number of randomly chosen features at each split
- this is a good call because different trees end up splitting on different features
- the number of random features can be tuned with cross-validation, as long as we make sure the features used have at least some predictive power
- another thing to keep in mind: the trees and their predictions should be relatively uncorrelated with each other
- since random forests consist of a large number of relatively uncorrelated trees, the ensemble averages out the errors of individual trees and performs better than a single tree
- boosting: fit to the full dataset, but using shallow trees (to avoid overfitting), where each subsequent tree is fit to the errors of the trees before it; this process is repeated m times (see the boosting sketch after this list)
- regression: each new tree is fit to the current residuals r = y_actual - F_old(x), where F_old(x) is the ensemble’s prediction so far and y_actual is the actual label; the new tree’s prediction is then added to the ensemble scaled by the learning rate alpha, i.e. F_new(x) = F_old(x) + alpha * f_new(x)
- a small alpha slows the learning down, which generally leads to better fits (at the cost of needing more trees)
- classification: each new tree is fit to the same labels, but with iteration- and sample-dependent weights
- this emphasizes training examples that earlier trees got wrong, which can help with under-sampled regions and class imbalance
- bagged trees can be constructed in parallel while boosted trees must be built sequentially, so bagging is faster to train
- boosted ensembles often achieve lower generalization error than bagged tree ensembles, so boosting tends to be the better choice when accuracy is the priority
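A simplified version of the bagging procedure, plus the random-forest variant, might look like the sketch below (placeholder data; k, max_features, and the other settings are made up for illustration): bootstrap k datasets, fit a tree to each, and average, with the forest additionally restricting the features considered at each split.

```python
# Bagging by hand, then the random-forest version (illustrative placeholder data).
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)
X_test = rng.normal(size=(20, 5))

k, N = 25, X.shape[0]
preds = []
for _ in range(k):
    idx = rng.integers(0, N, size=N)                 # sample N rows with replacement
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])
    preds.append(tree.predict(X_test))
y_bagged = np.mean(preds, axis=0)                    # average over the k trees

# A random forest does the same bootstrapping + averaging, but also limits the
# number of features considered at each split (max_features).
forest = RandomForestRegressor(n_estimators=25, max_features="sqrt", random_state=0)
y_forest = forest.fit(X, y).predict(X_test)
```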
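The regression-boosting update is easiest to see in code. Below is a bare-bones sketch (my own, with made-up data and made-up settings for m, alpha, and the tree depth) that fits each shallow tree to the current residuals and adds its prediction to the ensemble scaled by the learning rate, mirroring the F_new(x) = F_old(x) + alpha * f_new(x) update above.

```python
# Bare-bones boosting for regression: shallow trees fit to residuals (illustrative).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

m, alpha = 100, 0.1
F = np.zeros_like(y)                  # running ensemble prediction F(x) on the training set
trees = []
for _ in range(m):
    residuals = y - F                 # r = y_actual - F_old(x)
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F += alpha * tree.predict(X)      # F_new(x) = F_old(x) + alpha * f_new(x)
    trees.append(tree)

def predict(X_new):
    # sum the shrunken contributions of all m trees
    return sum(alpha * t.predict(X_new) for t in trees)
```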
Advantages of trees and random forests
- trees handle non-linear decision boundaries naturally because of how they partition the feature space; we don’t need to apply non-linear transformations to the input features for classification => less risk of overfitting
- they are relatively insensitive to the number of features and robust against noisy or missing data
- there aren’t many hyper-parameters that need to be tuned for trees or forests
- with ensemble methods, we can also average over multiple trees to reduce variance and hence overfitting