Ensemble algorithms
- A random forest is an ensemble algorithm.
- Ensemble methods combine multiple machine learning models to create a single, more powerful model.
- In machine learning competitions, ensemble algorithms are usually among the winners.
- The two most common ensemble algorithms are random forests and gradient boosted decision trees; both appear in the sketch below.
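A minimal sketch of fitting both, assuming scikit-learn (the notes name no library); the dataset and parameter values are illustrative:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy two-class dataset, purely for illustration
X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (RandomForestClassifier(n_estimators=100, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```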
Random forests
- A random forest is a collection of decision trees, each slightly different from the others.
- The idea: each individual tree may predict well but overfits part of the data. Averaging many trees that overfit in different ways yields a powerful model that overfits far less.
- The trees are made to differ from one another by injecting randomness in two ways (see the sketch after this list): (a) training each tree on a bootstrap sample, i.e., a random sample of the data drawn with replacement, and (b) considering only a random subset of the features at each split.
- If the number of features considered at each split equals the total number of features, there is no randomness in feature selection (only bootstrapping remains).
- If that number is very small, the trees differ more from one another and must be very deep to fit the data.
- For prediction, every tree makes a prediction. For regression, the results are averaged; for classification, a majority vote (or an average of predicted probabilities) determines the class.
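A minimal hand-rolled sketch of the whole idea, assuming scikit-learn's DecisionTreeClassifier as the base learner and binary 0/1 labels; names like n_trees are illustrative:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

n_trees = 25
trees = []
for _ in range(n_trees):
    # (a) bootstrap: sample rows with replacement
    idx = rng.randint(0, len(X), size=len(X))
    # (b) consider a random feature subset at each split (max_features)
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=rng)
    trees.append(tree.fit(X[idx], y[idx]))

# Classification: majority vote over the trees' predictions
# (for regression, the predictions would be averaged instead)
votes = np.stack([t.predict(X) for t in trees])    # shape (n_trees, n_samples)
majority = (votes.mean(axis=0) > 0.5).astype(int)  # 0/1 labels
print("training accuracy:", (majority == y).mean())
```

In practice you would use a ready-made implementation such as RandomForestClassifier or RandomForestRegressor rather than rolling your own.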
Pros
- Very powerful.
- Usually works well without parameter tuning.
- Does not require feature scaling.
- Works well on large datasets.
- Can be parallelized across CPU cores, since the trees are built independently (see the sketch below).
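Because the trees are independent, training can use all CPU cores; in scikit-learn (same assumption as above) this is controlled by the n_jobs parameter:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative data; the point is n_jobs=-1, which builds the
# independent trees on all available CPU cores in parallel.
rng = np.random.RandomState(0)
X = rng.rand(1000, 10)
y = X[:, 0] + rng.normal(scale=0.1, size=1000)

forest = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=0).fit(X, y)
```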
Cons
- Does not perform well on very high dimensional sparse data.
- Requires more memory than a single model and is slower to train and to predict.