I recently used *k*-dimensional trees to get decision boundaries in a very high-dimensional space and to find the nearest neighbors of a given vector. Curious about what else these trees can be used for in machine learning, I ended up at random forests and read a couple of blog posts to learn more about them. They seem like a really useful and robust way to approach classification problems, so I've jotted down a quick summary of decision trees and random forests. Note that this isn't a complete mathematical description of these methods; these are quick refresher notes for working with them. Without further ado, here they are!

### *k*-dimensional trees

- they divide *n*-dimensional feature space into regions
- we can use these trees for different types of prediction problems by finding which region a test feature vector lies in:
  - regression: assign the mean of all training labels in that region
  - classification: assign the mode of all training labels in that region
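As a rough sketch of the region-based prediction above (the region layout and labels here are made up for illustration):

```python
from statistics import mean, mode

# hypothetical 1-D regions produced by a tree, each holding the training
# labels that fell inside it
regions = [
    {"bounds": (0.0, 0.5), "reg_labels": [1.2, 1.4, 1.0], "cls_labels": ["a", "a", "b"]},
    {"bounds": (0.5, 1.0), "reg_labels": [3.0, 2.8, 3.2], "cls_labels": ["b", "b", "b"]},
]

def find_region(x):
    for r in regions:
        lo, hi = r["bounds"]
        if lo <= x < hi:
            return r
    raise ValueError("x lies outside all regions")

def predict_regression(x):
    # regression: mean of the training labels in the region
    return mean(find_region(x)["reg_labels"])

def predict_classification(x):
    # classification: mode of the training labels in the region
    return mode(find_region(x)["cls_labels"])

print(predict_regression(0.3))      # mean of [1.2, 1.4, 1.0], about 1.2
print(predict_classification(0.7))  # "b"
```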

#### Algorithm choices

- regression: choose splits that minimize the cost function, defined as the sum of squared errors
- classification: several cost functions can be chosen for minimization: error rate, Gini index, cross-entropy
  - the latter two are better suited to optimization algorithms as they are smooth functions of the class proportions (easier to take derivatives)
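For concreteness, the Gini index and cross-entropy of a node can be computed from the class proportions like this (a minimal sketch):

```python
from math import log2

def gini(labels):
    # Gini index: 1 - sum over classes of p_k^2
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def cross_entropy(labels):
    # cross-entropy: -sum over classes of p_k * log2(p_k)
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))

print(gini([0, 0, 1, 1]))           # 0.5 for a maximally impure 50/50 node
print(gini([1, 1, 1, 1]))           # 0.0 for a pure node
print(cross_entropy([0, 0, 1, 1]))  # 1.0 bit for a 50/50 node
```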

- too many splits can lead to overfitting (high variance), because the tree fits the training data too well (low bias)
- this can be prevented by doing one of the following and then cross-validating:
  - setting a minimum number of training examples in each region
  - setting a maximum number of split levels for the tree
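These limits can be expressed as stopping conditions in the recursive tree-building step; a toy sketch where the median-based `best_split` stands in for a real cost-minimizing split:

```python
MIN_SAMPLES = 2  # minimum number of training examples allowed in a region
MAX_DEPTH = 3    # maximum number of split levels

def best_split(values):
    # toy stand-in: split the sorted values at the median
    values = sorted(values)
    mid = len(values) // 2
    return values[:mid], values[mid:]

def build_tree(values, depth=0):
    # stop splitting when either regularization limit would be violated
    if len(values) < 2 * MIN_SAMPLES or depth >= MAX_DEPTH:
        return {"leaf": True, "values": values}
    left, right = best_split(values)
    return {"leaf": False,
            "left": build_tree(left, depth + 1),
            "right": build_tree(right, depth + 1)}

def depth_of(tree):
    return 0 if tree["leaf"] else 1 + max(depth_of(tree["left"]), depth_of(tree["right"]))

tree = build_tree(list(range(16)))
print(depth_of(tree))  # never exceeds MAX_DEPTH
```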

### Ensemble methods

- bagging/bootstrap aggregation: sample a dataset of size *N* with replacement from the training set (so each dataset can contain multiple copies of a training example) to get *k* such sets, then fit a tree to each and average over the *k* trees
- random forests are an example of a bagged tree model, where again the bootstrapping + averaging is performed, but with a limit on how the trees split
  - this is done by choosing a set number of random features to consider for splitting at each level

- this works well because different features end up making the decision splits in different trees
- the number of random features can be optimized using cross-validation, as long as we make sure the features used have at least some predictive power
- another thing to keep in mind: the trees and their predictions should be relatively uncorrelated with each other
- since random forests consist of a large number of relatively uncorrelated trees, the ensemble averages out the errors of individual trees and performs better than a single tree
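The two sources of randomness described above, bootstrap sampling and per-split feature subsets, can be sketched like this (names and sizes are invented for illustration):

```python
import random

random.seed(0)  # for reproducibility

def bootstrap_datasets(training_set, k):
    # draw k datasets of size N with replacement; examples can repeat
    n = len(training_set)
    return [[random.choice(training_set) for _ in range(n)] for _ in range(k)]

def features_for_split(n_features, n_random):
    # at each split, a random forest considers only a random subset of features
    return random.sample(range(n_features), n_random)

datasets = bootstrap_datasets(list(range(10)), k=3)
split_features = features_for_split(n_features=8, n_random=3)
```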

- boosting: fit to the full dataset, but using shallow trees (to avoid overfitting), where each subsequent tree is fit to the results of the previous one; this process is repeated *m* times
  - regression: each new tree is fit to the residuals *y_actual − y_old*, where *y_old* is the current predicted label and *y_actual* is the actual label; the new tree's contribution is scaled by a learning rate *alpha*
    - a small *alpha* leads to good fitting, as learning is slower
  - classification: each new tree is fit to the same labels but with iteration- and sample-dependent weights
    - this emphasizes training examples in under-sampled areas, and can help with cases with class imbalance
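A toy version of the regression case, where each "tree" is just a one-split stump fit to the current residuals and its output is scaled by the learning rate (the data and fixed threshold are invented for illustration):

```python
# 1-D training data: inputs x and targets y
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
alpha = 0.3   # learning rate: small alpha means slower, steadier learning
m = 100       # number of boosting rounds

def fit_stump(inputs, targets, threshold=2.5):
    # a depth-1 "tree": predict the mean target on each side of the threshold
    left = [t for xi, t in zip(inputs, targets) if xi <= threshold]
    right = [t for xi, t in zip(inputs, targets) if xi > threshold]
    left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
    return lambda xi: left_mean if xi <= threshold else right_mean

pred = [0.0] * len(y)
for _ in range(m):
    residuals = [yi - pi for yi, pi in zip(y, pred)]    # what's left to explain
    stump = fit_stump(x, residuals)                     # fit new tree to residuals
    pred = [pi + alpha * stump(xi) for xi, pi in zip(x, pred)]

print(pred)  # approaches [4, 4, 8, 8]: the best a single fixed split can do
```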
- bagged trees can be constructed in parallel while boosted trees are sequential; thus bagging is faster
- boosted tree ensembles often achieve lower generalization error than bagged ones; thus boosting is the better choice when accuracy matters most

### Advantages of trees and random forests

- trees handle non-linear decision boundaries naturally because of how they're constructed; there's no need to apply a non-linear transformation to the input features for classification, which also means less opportunity to overfit
- they are insensitive to the number of features and robust against noisy or missing data
- there aren't many hyper-parameters that need to be tuned for trees or forests
- with ensemble methods, we can also average over multiple trees to reduce variance and hence overfitting