
Random Forests

  • Authors: Leo Breiman

📜 Abstract

This paper presents a new method for classification and regression which consists of a collection of tree-structured classifiers. Each tree is constructed using a bootstrap sample from the data. To classify a new object from an input vector, input the vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest). This method, called random forests, has error rates that compare favorably to Adaboost, but are more robust with respect to noise. The method is quite robust to overfitting. A method for using proximities in the forest is given that can be used for locating outliers, clustering and scaling. An example is given of random forests in a classifier combination methodology. Copyright information: Institute of Mathematical Statistics, 2001.
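The voting procedure described in the abstract can be sketched with scikit-learn's `RandomForestClassifier`. This is an illustration only, not the paper's original code: the library postdates the paper, and its implementation averages per-tree class probabilities rather than taking hard votes, though the effect is the same ensemble decision.

```python
# Sketch of the bagging-and-voting scheme using scikit-learn (an assumption:
# sklearn's RandomForestClassifier follows the paper's design, but averages
# tree probability estimates instead of counting hard votes).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees is grown on a bootstrap sample of the training data.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# predict()/score() run each input down every tree and combine the trees'
# classifications into a single forest decision.
print(forest.score(X_test, y_test))
```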

✨ Summary

Random forests, introduced by Leo Breiman in 2001, is an ensemble learning method for classification and regression that builds many decision trees during training and outputs the majority-vote class (for classification) or the average prediction (for regression). The approach is robust to overfitting and noise, comparing favorably to other ensemble methods such as Adaboost. Each tree is grown on a bootstrap sample of the data, and a random subset of features is considered when splitting each node. This advancement in machine learning has had significant impact across fields including genomics, finance, and image processing, owing to its ability to handle large datasets with higher accuracy and lower variance than individual decision trees. It has also influenced algorithms for feature importance and methodologies for detecting outliers.
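The two randomizations in the methodology, bootstrap sampling per tree and random feature selection per split, can be made concrete in a minimal from-scratch sketch. This is a hypothetical illustration built on scikit-learn's `DecisionTreeClassifier` (whose `max_features` parameter handles the per-split feature subsampling), not the paper's own implementation.

```python
# Minimal sketch of the random forest recipe described above:
# (1) grow each tree on a bootstrap sample of the data, and
# (2) consider only a random subset of features at each split.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
n_samples = X.shape[0]

trees = []
for _ in range(25):
    # (1) Bootstrap: draw n rows with replacement.
    idx = rng.integers(0, n_samples, size=n_samples)
    # (2) Random feature selection at each split via max_features="sqrt".
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Each tree votes; the forest predicts the class with the most votes.
votes = np.stack([t.predict(X) for t in trees]).astype(int)  # (25, n_samples)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
train_accuracy = (majority == y).mean()
print(train_accuracy)
```

The bootstrap decorrelates the trees through the data they see, while the feature subsampling decorrelates them through the splits they are allowed to make; both are needed for the variance reduction the summary attributes to the method.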

Further extending its impact, the paper became highly cited and inspired a substantial body of follow-up research, demonstrating the widespread application and continued exploration of random forests and highlighting the method's versatility and enduring relevance in machine learning and data science.