Ensemble algorithms

  • A random forest is an ensemble algorithm.
  • Ensemble algorithms combine multiple machine learning models to create a more powerful one.
  • In machine learning competitions, ensemble methods are usually among the winners.
  • The two most common ensemble algorithms are random forests and gradient boosted decision trees (compared in the short example below).
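
Both families are available in scikit-learn's ensemble module. A quick, illustrative comparison might look like the following (the toy dataset and hyperparameters here are arbitrary choices for demonstration, not part of the notes above):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small synthetic two-class dataset, split into train and test parts.
X, y = make_moons(n_samples=200, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for model in (RandomForestClassifier(n_estimators=100, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```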

Random forests

  • A random forest is a collection of decision trees that differ slightly from one another.
  • The idea is that each tree overfits part of the data in its own way; averaging many such trees yields a powerful model that overfits much less.
  • The trees are made to differ from one another by injecting randomness into their construction.
  • The randomness comes from two sources: a) each tree is trained on a bootstrap sample (data points drawn randomly with replacement), and b) at each split, only a random subset of the features is considered. Both mechanisms are illustrated in the sketch after this list.
  • If the number of randomly chosen features equals the total number of features, there is no randomness in the feature selection.
  • If the number of random features is very small, the trees have to be very deep to fit the data well.
  • For prediction, every tree makes a prediction. For regression, the results are averaged; for classification, a majority vote over the trees determines the class.
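
Here is a minimal sketch of this procedure built from scikit-learn decision trees. The function names `build_forest` and `predict_forest` are illustrative, not library API:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, n_trees=10, max_features="sqrt", random_state=0):
    """Train n_trees decision trees, each on a bootstrap sample."""
    rng = np.random.RandomState(random_state)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # a) bootstrapping: draw n_samples indices with replacement
        idx = rng.randint(0, n_samples, n_samples)
        # b) feature randomness: max_features limits how many features
        #    each split is allowed to consider
        tree = DecisionTreeClassifier(max_features=max_features,
                                      random_state=rng.randint(2**31))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    """Majority vote over the trees (assumes non-negative integer labels)."""
    votes = np.stack([tree.predict(X) for tree in trees])  # (n_trees, n_samples)
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), axis=0, arr=votes)
```

In practice you would simply use sklearn.ensemble.RandomForestClassifier (or RandomForestRegressor for regression), which implements this scheme, with bootstrapping and per-split feature subsampling built in.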

Pros

  • Very powerful.
  • Usually works well without parameter tuning.
  • Does not require feature scaling.
  • Works well on large datasets.
  • Can be parallelized (see the snippet after this list).
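
Because the trees are independent of one another, they can be built on multiple CPU cores. In scikit-learn, for example, this is controlled by the n_jobs parameter:

```python
from sklearn.ensemble import RandomForestClassifier

# n_jobs=-1 uses all available CPU cores to build the trees in parallel.
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
```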

Cons

  • Does not perform well on very high-dimensional, sparse data (such as text).
  • Requires more memory than a single model and is slower to train and to predict.