Introduction to Random Forests

This text is a short extract of my research activities and can be seen as a brief introduction to the topic. If you have any questions or suggestions, don’t hesitate to contact me.

Random Forests are ensembles of separately trained binary decision trees that solve a problem together. For that, the predictions of all trees (their votes) are combined into a final vote. The goal of training these trees is to maximize the information gain and minimize the information entropy, in order to separate the data points optimally or to predict a continuous variable.

The decision tree concept was described by Leo Breiman et al. [1] in 1984. After that, more and more applications used an ensemble of randomly trained decision trees for machine learning. One of the first researchers was Schapire [2] in 1990, who used an ensemble of learners in his Boosting algorithm and found that an ensemble of (weak) learners achieves a significantly higher generalization. T. K. Ho was one of the first to use decision forests, implementing a handwriting recognition system [3] in 1995.

More complex applications followed, such as the shape recognition system by Amit and Geman in 1997 [4]. Since then, Random Forests have been used especially for computer vision tasks, most prominently in the medical imaging field. See [5] and [6] for prime examples of Random Forests in medical imaging applications.

Random Forests can solve the following classes of problems:

  • Classification: Predicting the class of a given data point
  • Regression: Predicting a continuous variable
  • Density estimation: Learning a probability density function
  • Manifold learning: Learning a set of manifolds with their corresponding variables

Furthermore, Random Forests can learn from minimal manual annotations, which leads to active learning; semi-supervised learning can also be realized with this concept. This document covers the classification and regression aspects of Random Forests. For literature on density estimation see [7] and [8]; information on manifold learning with Random Forests can be found in [9] and [10].

Independent of the area of application, every Random Forest can be described by the following key parameters (a small configuration sketch follows the list):

  • Forest size T (number of trees)
  • Tree depth D
  • Choice of weak learner model and corresponding training objective function (energy function)
  • Amount of injected randomness, controlled by the parameter p
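As a purely illustrative example, these parameters could be collected in a small configuration object. The following Python sketch is an assumption for this document; the names ForestConfig, n_trees, max_depth, weak_learner and rho do not come from any particular library.

```python
from dataclasses import dataclass

@dataclass
class ForestConfig:
    """Hypothetical container for the key Random Forest parameters."""
    n_trees: int = 100                   # forest size T
    max_depth: int = 10                  # maximum tree depth D
    weak_learner: str = "axis_aligned"   # choice of weak learner model
    rho: int = 50                        # injected randomness p: candidate parameters tried per node

config = ForestConfig(n_trees=25, max_depth=8)
print(config)
```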

To understand the procedure of Random Forests it is necessary to be familiar with the following concepts:

Weak learner model

Random Forests use weak classifiers to solve their tasks. A weak classifier is specialized on a sub-problem and is significantly faster than a more complex strong classifier; it is typically combined with other weak learners to tackle a complex problem. The main advantage of combining weak learners in an ensemble is the significantly better generalization performance: the ensemble often performs better on unseen data than a single strong learner. The Boosting algorithm, for example, generates a strong learner out of an ensemble of weak learners.

Training objective function/Energy function

This function is optimized, together with its specific parameters, for each split node (not for leaves) during the training process. Optimizing it means minimizing the information entropy H(S) and maximizing the information gain I. Maximizing the information gain can be seen as searching for the best separation of two datasets: the (weak) learner adjusts its parameters to separate the data more effectively.

Ensemble learning

This learning concept combines an ensemble of separately trained learners. During training, each learner is trained on a random subset of the training examples, which leads to significantly better generalization. The prediction of the ensemble can be calculated in several ways; the most common are averaging the predictions of all learners, either uniformly or weighted by a per-learner confidence. The Random Forest concept uses ensemble learning to compute its predictions.

Decision Tree

A decision tree is a hierarchical data structure that is often used to split a complex problem into sub-problems. The root node is the entry point of the tree and has successor nodes (also called child nodes), which in turn can have successors of their own. A node with no successors is called a leaf. Nearly all applications of Random Forests use binary decision trees, meaning that every node except the leaves has exactly two successors.

  • Root node: Entry point of the tree with two successors; apart from having no predecessor it acts like a regular node.
  • Regular node: Has one predecessor and two successors. Each regular node applies a split function (described later) with a binary return value to the data; the result decides whether the data is forwarded to the left or the right child. These nodes are also called split nodes and implement the weak learner model.
  • Leaf: After passing several regular nodes the data reaches a leaf. A leaf has no successors and provides a confidence of class affiliation or a predicted function value. Leaves implement the predictor model.

The main effort in setting up a well-working decision tree is choosing the split function of each node wisely. In some applications, for instance in computer vision, the split function has a fixed form, and its parameters as well as the tree structure are learned from training data. Note that n-ary decision trees also exist, but they are more complex and are not considered here.
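To make the structure concrete, here is a minimal Python sketch of a binary decision tree with split nodes and leaves; the class names, the simple single-feature threshold test and the example values are assumptions for illustration, not a prescribed implementation.

```python
class Leaf:
    """Terminal node: stores the predictor model (e.g. a class histogram)."""
    def __init__(self, class_histogram):
        self.class_histogram = class_histogram

class SplitNode:
    """Regular node: applies a binary split function and forwards the data."""
    def __init__(self, feature, threshold, left, right):
        self.feature = feature      # index chosen by the feature selector
        self.threshold = threshold  # threshold tau of the binary test
        self.left = left            # successor for test result "false"
        self.right = right          # successor for test result "true"

def route_to_leaf(node, v):
    """Forward data point v from the root down to a leaf."""
    while isinstance(node, SplitNode):
        node = node.right if v[node.feature] > node.threshold else node.left
    return node

# Tiny hand-built tree: split on feature 0 at threshold 0.5
tree = SplitNode(0, 0.5, Leaf({"A": 0.9, "B": 0.1}), Leaf({"A": 0.2, "B": 0.8}))
print(route_to_leaf(tree, [0.7]).class_histogram)   # -> {'A': 0.2, 'B': 0.8}
```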

Information Gain

To reach an exact separation of the classes of data points, the information gain I must be maximized. This can be achieved by minimizing the Shannon entropy H(S), which is defined as:

H(S) = − Σ_{c ∈ C} p(c) · log(p(c))    (#1)

where S denotes an example dataset, c is a class with c ∈ C, and p(c) is the probability mass function of the classes.

H(S) = ½ log( (2πe)^d · |Λ(S)| )    (#2)

Shannon entropy [11] describes the uncertainty of an information source modelled by a random variable; in other words, it can be seen as the unpredictability of that source. Equations #1 and #2 show two ways of calculating the information entropy: the entropy of discrete, categorical distributions is calculated with #1, while #2 gives the differential entropy of a continuous distribution, here for the case of a d-variate Gaussian with covariance matrix Λ(S). Equation #1 is used in equation #3 to calculate the information gain.

I = H(S) − Σ_{i ∈ {1,2}} ( |Si| / |S| ) · H(Si)    (#3)

where H(S) is the Shannon entropy of all data points reaching a specific node and H(Si) is the Shannon entropy of the data points forwarded to the i-th of the two children, i ∈ {1, 2}. Additionally, |S| is the number of training data points at this node and |Si| the number of data points that will be forwarded to child i. The goal of the training is to maximize the information gain.

During training, each node in the decision tree tries to find an optimal separation of the incoming data points. Information gain and entropy are the indicators of this process: minimizing the information entropy (unpredictability) H(S) and maximizing the information gain I (confidence) are the main aims of each node.
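As a small worked example of equations #1 and #3, the following Python snippet computes the Shannon entropy and the information gain of two candidate splits; the helper names entropy and information_gain are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) of a list of class labels (equation #1)."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def information_gain(parent, left, right):
    """Information gain I of splitting 'parent' into 'left' and 'right' (equation #3)."""
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

parent = ["A", "A", "A", "B", "B", "B"]
good   = (["A", "A", "A"], ["B", "B", "B"])   # perfect separation
bad    = (["A", "B", "A"], ["B", "A", "B"])   # poor separation

print(information_gain(parent, *good))  # 1.0 bit, maximal gain
print(information_gain(parent, *bad))   # ~0.08 bits, almost no gain
```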

Weak Learner Model

A weak learner model is often used in ensemble learning models such as the Random Forest. Boosting (and algorithms similar to Boosting) uses weak learners to build a strong classifier out of many weak classifiers. For more information about Boosting and the existence of weak learners see Mannor et al. [12].

The weak learner model in Random Forests with binary decision trees is defined by a binary test function that returns “true” or “false”, which corresponds to forwarding the test data to the right or left child. The weak learner function is selected during the conceptional stage, before training starts. In most Random Forests the decision trees are binary, which leads to a binary test function, because each split node needs a binary decision; in decision trees with n children, the split function must return an n-ary output. First, the relevant symbols are explained:

  • θ … Weak learner function or binary test that decides whether to forward the data to the left or the right child.
  • φ … Selection of features from the feature vector v with φ: ℝ^d → ℝ^d' and d ≫ d'. This means that φ uses only a subset of all features for the computation; the selection can vary from node to node.
  • ψ … Defines the geometric primitive that is used to separate the data. Possible primitives are an axis-aligned hyperplane or an oblique hyperplane.
  • τ … Contains the thresholds of the binary test. These thresholds decide whether the test returns “true” or “false”.

The binary test function is defined as:

h(v, θj) ∈ {0, 1}    (#4)

where θ is defined as θ = (φ, ψ, τ).

For linear data separation the following test function is defined:

h(v, θj) = [ τ1 > φ(v) · ψ > τ2 ]    (#5)

In #5 the brackets [·] denote the indicator function, which returns “true” (1) if the enclosed condition holds and “false” (0) otherwise; it can be seen as the separation operator. The inner product φ(v) · ψ evaluates the selected features of data point v against the hyperplane defined by ψ. The thresholds τ1 and τ2 then control the result: for a simple linear test one of them can be set to τ1 = ∞ or τ2 = −∞, which reduces the test to a single threshold. Both thresholds can be adapted during the training phase.
For non-linear data separation a more complex weak learner can be used, replacing the hyperplane with a surface that has more degrees of freedom:

h(v, θj) = [ τ1 > φ(v)ᵀ ψ φ(v) > τ2 ]    (#6)
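The following sketch illustrates the test functions of equations #5 and #6 in Python. It assumes that φ is represented by a list of feature indices and that a constant 1 is appended so that ψ can also encode an offset; these representational choices, like all names, are assumptions for illustration only.

```python
import numpy as np

def linear_test(v, phi, psi, tau1, tau2):
    """Linear weak learner of equation #5: returns 1 ('right') or 0 ('left').
    phi: indices of the selected features, psi: hyperplane parameters."""
    x = np.append(v[phi], 1.0)          # selected features, homogeneous coordinates
    return int(tau1 > x @ psi > tau2)

def quadratic_test(v, phi, Psi, tau1, tau2):
    """Non-linear weak learner of equation #6 with a quadratic form Psi."""
    x = np.append(v[phi], 1.0)
    return int(tau1 > x @ Psi @ x > tau2)

v = np.array([0.3, 1.2, -0.5])
print(linear_test(v, phi=[0, 2], psi=np.array([1.0, -1.0, 0.0]),
                  tau1=np.inf, tau2=0.0))   # 0.3 - (-0.5) = 0.8 > 0 -> returns 1
```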
The goal of the training process for node j is to optimize the parameters θj of the split function by maximizing the chosen objective function Ij, i.e. the information gain. The training objective is defined by:

θj* = argmax_{θj ∈ Υ} Ij    (#7)

using: Ij = I(Sj, SjR, SjL, θj).   (#8)

Equation #7 represents the goal of the training: the optimal parameters θj* are computed during training. Sj, SjR and SjL denote the training points before splitting and after splitting to the right and left child, respectively.

The maximum at each node can be found using an exhaustive search; the optimal value of τ can, for instance, be determined from integral histograms. In most Random Forests a single split function (weak learner function) is used in all nodes of all trees, but for special problems it can be necessary to use more than one type of split function.
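A simple sketch of such an exhaustive search for one axis-aligned feature dimension is shown below; the names and the brute-force enumeration of candidate thresholds are illustrative assumptions rather than the only possible implementation.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values()) if n else 0.0

def best_axis_aligned_split(X, y, feature):
    """Exhaustively search the threshold tau maximizing the information gain
    for one feature dimension (a sketch of the per-node optimization of #7)."""
    best_gain, best_tau = -1.0, None
    for tau in sorted(set(x[feature] for x in X)):
        left  = [yi for x, yi in zip(X, y) if x[feature] <= tau]
        right = [yi for x, yi in zip(X, y) if x[feature] > tau]
        if not left or not right:
            continue
        gain = entropy(y) - (len(left) / len(y)) * entropy(left) \
                          - (len(right) / len(y)) * entropy(right)
        if gain > best_gain:
            best_gain, best_tau = gain, tau
    return best_tau, best_gain

X = [[0.1], [0.4], [0.6], [0.9]]
y = ["A", "A", "B", "B"]
print(best_axis_aligned_split(X, y, feature=0))   # -> (0.4, 1.0)
```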

Leaf and Prediction

The leaf predicts the result for a given test data point. For this it stores some important information, as sketched after the list:

  • Classification: The leaf stores an empirical distribution over the classes of the data points that reached it. The predictor p(c|v) with c ∈ {ck} represents the probability that data point v is an element of class c.
  • Regression: The leaf returns a continuous variable or a point estimate, e.g. c* = argmax_c pt(c|v).
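A minimal sketch of the two predictor types might look as follows; the class names and the choice of the mean as regression point estimate are assumptions for illustration.

```python
from collections import Counter

class ClassificationLeaf:
    """Stores the empirical class distribution p(c|v) of the training points that reached it."""
    def __init__(self, labels):
        n = len(labels)
        self.posterior = {c: k / n for c, k in Counter(labels).items()}

    def predict(self):
        return max(self.posterior, key=self.posterior.get)   # MAP estimate c*

class RegressionLeaf:
    """Stores a point estimate (here: the mean) of the target values that reached it."""
    def __init__(self, targets):
        self.value = sum(targets) / len(targets)

leaf = ClassificationLeaf(["A", "A", "B"])
print(leaf.posterior, leaf.predict())   # -> {'A': 0.66..., 'B': 0.33...} A
```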

Training

Random Forests are usually trained in an offline training phase, but there are approaches that use Random Forests with online training [17]. To achieve high generalization quality, randomness must be injected into the training process. It can be injected in many ways, but the most common way is to provide a random subset of the training data to each tree. The randomness is only injected during the training phase, not during the test phase.
Sj denotes the set of training data that reaches node j, and SjR and SjL the sets of data forwarded to the right child R and left child L of node j. Equation #9 describes their relationship:

Sj = SjL ∪ SjR,  SjL ∩ SjR = ∅    (#9)

Furthermore, S0 is the training set of (randomly chosen) data points {v} injected at the root node 0 of each tree. This training data and the associated ground truth are forwarded through the split nodes with the goal of minimizing the energy function. At each split node the training data is separated by maximizing the information gain, which corresponds to searching for the best separation and its parameters. For that, the weak learner θ of each split node is used with the binary test function h(v, θj) to obtain θj* from equations #7 and #8. After that, the stopping criteria are checked and either a new split node or a leaf is created: a split node if the stopping criteria are not yet reached, otherwise a leaf, as described in the section “Leaf and Prediction”.

The training stops when one or more predefined stopping criteria are met. For stopping the training of each tree (and thereby the forest) the following criteria exist:

  • The maximum tree depth D is reached
  • Fewer than a defined number of training examples reached that node

The choice of a suitable stopping criterion can help to increase the generalization performance. Other optimizations such as tree pruning exist [13].
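Combining the previous pieces, the training of a single tree can be sketched as a recursive procedure that stops at the maximum depth D, at a minimum number of samples, or when a node is pure; the dictionary-based tree representation and the randomized candidate search are illustrative assumptions.

```python
import math, random
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values()) if n else 0.0

def train_node(X, y, depth, max_depth=5, min_samples=2, n_candidates=10):
    """Recursively grow one decision tree (illustrative sketch)."""
    # Stopping criteria: maximum depth, too few samples, or pure node -> create a leaf
    if depth >= max_depth or len(y) < min_samples or len(set(y)) == 1:
        return {"leaf": {c: k / len(y) for c, k in Counter(y).items()}}
    best = None
    for _ in range(n_candidates):                      # randomized candidate search
        f = random.randrange(len(X[0]))
        tau = random.choice(X)[f]
        left  = [i for i, x in enumerate(X) if x[f] <= tau]
        right = [i for i, x in enumerate(X) if x[f] > tau]
        if not left or not right:
            continue
        gain = entropy(y) - sum(len(s) / len(y) * entropy([y[i] for i in s]) for s in (left, right))
        if best is None or gain > best[0]:
            best = (gain, f, tau, left, right)
    if best is None:                                   # no valid split found -> leaf
        return {"leaf": {c: k / len(y) for c, k in Counter(y).items()}}
    _, f, tau, left, right = best
    return {"feature": f, "tau": tau,
            "left":  train_node([X[i] for i in left],  [y[i] for i in left],  depth + 1, max_depth, min_samples, n_candidates),
            "right": train_node([X[i] for i in right], [y[i] for i in right], depth + 1, max_depth, min_samples, n_candidates)}

X = [[0.1, 1.0], [0.2, 0.9], [0.8, 0.2], [0.9, 0.1]]
y = ["A", "A", "B", "B"]
tree = train_node(X, y, depth=0)
```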

Testing

During testing, each split node applies its specific split function to the test data that reaches it. This can be the same function for all nodes, varying only in its parameters; alternatively, a node can be assigned a function that solves a sub-problem (e.g. categorizing a photo as taken outside or inside). After the split nodes have done their work, the data arrives at a leaf, which acts as the predictor: it contains a probability of class affiliation (classification) or an estimated function value (regression). This procedure describes a single decision tree. In a Random Forest, many decision trees execute this procedure on the same input data; this is the benefit of a Random Forest, where more than one predictor estimates the result. All estimations of the decision trees are then combined into one estimation, which is the final result.
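Continuing the dictionary-based representation from the training sketch, testing one tree amounts to descending from the root to a leaf; the hand-written example tree below is purely illustrative.

```python
def predict_tree(node, v):
    """Route test point v through one trained tree and return the leaf posterior p(c|v)."""
    while "leaf" not in node:
        node = node["right"] if v[node["feature"]] > node["tau"] else node["left"]
    return node["leaf"]

# Tiny hand-written tree in the dictionary format of the training sketch
tree = {"feature": 0, "tau": 0.5,
        "left":  {"leaf": {"A": 1.0}},
        "right": {"leaf": {"B": 1.0}}}
print(predict_tree(tree, [0.85]))   # -> {'B': 1.0}
```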

Decision Forest Model

The model of a Random Forest combines a set of decision trees (in this document binary decision trees) into an ensemble of weak learners. The number of trees is static and is defined during the conceptional stage of development; the optimal number depends on the specific problem. It is important to inject the right amount of randomness and to obtain an expressive prediction from the forest. The sections below provide considerations on these two topics.

Randomness

Randomness is one of the most important components of Random Forests. It provides robustness against noisy data and improves generalization. Two options to inject randomness into the forest during training are random training dataset sampling [14] and randomized node optimization [15].
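The first option, random training dataset sampling, can be sketched as drawing a bootstrap sample (with replacement) for each tree, as in bagging [14]; the sampling scheme and the function name are assumptions for illustration.

```python
import random

def bootstrap_sample(X, y, rng=random):
    """Draw a random training set (with replacement) for one tree."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

X = [[0.1], [0.4], [0.6], [0.9]]
y = ["A", "A", "B", "B"]
X_t, y_t = bootstrap_sample(X, y)   # training subset injected at the root of one tree
```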

As mentioned in previous sections, the amount of randomness is controlled by the parameter p. It influences the trees as follows:

  • Large values of p correspond to little randomness and large tree correlation. This is not the optimal case, because the forest acts more like a single decision tree, e.g. p = |Υ|, where Υ represents the complete set of all possible combinations of parameters θ.
  • Small values of p correspond to high randomness, so the trees behave differently, e.g. p = 1.

To inject randomness at each node, equation #7 comes into play. It needs a modification so that it operates only on a subset of Υ:

θj* = argmax_{θj ∈ Υj} Ij    (#10)

using Υj ⊂ Υ.
The amount of randomness is controlled by the ratio |Υj| / |Υ|. For |Υ| ≠ ∞ the parameter p = |Υj| is introduced. As mentioned above, this parameter can take the values p = 1, …, |Υ|: for p = 1 the trees have the highest randomness in the system, whereas p = |Υ| provides the lowest degree of randomness.
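Randomized node optimization (equation #10) can then be sketched as drawing a random subset Υj of p candidate split parameters at each node and maximizing Ij only over this subset; the candidate representation as (feature index, threshold) pairs and the assumed feature range [0, 1] are illustrative assumptions.

```python
import random

def sample_candidate_parameters(n_features, p, rng=random):
    """Draw a random subset of p candidate split parameters (feature index, threshold)
    out of the full parameter space; p controls the amount of randomness.
    Thresholds are drawn from [0, 1], assuming normalized features."""
    return [(rng.randrange(n_features), rng.uniform(0.0, 1.0)) for _ in range(p)]

# p = 1: maximal randomness; large p: behaviour closer to a single deterministic tree
print(sample_candidate_parameters(n_features=5, p=3))
```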

The Ensemble Model

To obtain a prediction for a test data point v from a forest of T trees, indexed by t ∈ {1, …, T}, the predictions of all trees are combined using:

p(c|v) = (1/T) Σ_{t=1}^{T} pt(c|v)    (#11)

For classification, p(c|v) is the probability that data point v belongs to class c. For regression, the tree posteriors carry a confidence value for their estimates, and the final result is obtained by averaging all tree posteriors weighted by their confidences. Tree testing can be done in parallel using modern CPU or GPU computation [16].
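Equation #11 for the classification case, expressed in a few lines of Python; the example posteriors are made up for illustration.

```python
import numpy as np

# Posteriors p_t(c|v) of T = 3 trees for the classes (A, B)
tree_posteriors = np.array([[0.9, 0.1],
                            [0.7, 0.3],
                            [0.8, 0.2]])

forest_posterior = tree_posteriors.mean(axis=0)    # p(c|v) = 1/T * sum_t p_t(c|v)
print(forest_posterior)                            # -> [0.8 0.2]
print(["A", "B"][forest_posterior.argmax()])       # predicted class c*
```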

Sources

[1] L. Breiman, J. Friedman, C. J. Stone and R. A. Olshen, Classification and Regression Trees, Chapman and Hall/CRC, 1984.
[2] R. E. Schapire, The strength of weak learnability, Machine Learning, 5(2):197 – 227, 1990.
[3] T. K. Ho, Random decision forests, Intl Conf. on Document Analysis and Recognition, pages 278 – 282, 1995.
[4] Y. Amit and D. Geman, Shape quantization and recognition with randomized trees, Neural Computation, 9:1545 – 1588, 1997.
[5] A. Criminisi, J. Shotton, D. Robertson and E. Konukoglu, Regression forests for efficient anatomy detection and localization in CT studies, MICCAI workshop on Medical Computer Vision: Recognition Techniques and Applications in Medical Imaging, Beijing, 2010, Springer.
[6] A. Montillo, J. Shotton, J. Winn, J. E. Iglesias, D. Metaxas and A. Criminisi, Entangled decision forests and their application for semantic segmentation of CT images, Information Processing in Medical Imaging (IPMI), 2011.
[7] B. W. Silvermann, Density Estimation, Chapman and Hall, London, 1986.
[8] J. Shotton, M. Johnson and R. Cipolla, Semantic texton forests for image categorization and segmentation, IEEE CVPR, 2008.
[9] N. Duchateau, M. De Craene, G. Piella and A. F. Frangi, Characterizing pathological deviations from normality using constrained manifold learning, MICCAI, 2011.
[10] Q. Zhang, R. Souvenir and R. Pless, On manifold structure of cardiac MRI data: Application to segmentation, In IEEE Computer Vision and Pattern Recognition, Los Alamitos, CA, USA, 2006.
[11] C. E. Shannon, A Mathematical Theory of Communication, Bell System Technical Journal, 1948.
[12] S. Mannor, R. Meir, Y. Bengio and D. Schuurmans, On the Existence of Linear Weak Learners and Applications to Boosting, 2002.
[13] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning, 2001.
[14] L. Breiman, Random forests, Machine Learning, 45(1):5 – 32, 2001.
[15] T. K. Ho, The random subspace method for constructing decision forests, PAMI, 20(8):832 – 844, 1998.
[16] T. Sharp, Implementing decision trees and forests on a GPU, In ECCV, 2008.
[17] A. R. Saffari, C. Leistner and H. Bischof, Online Random Forests, In Proc. IEEE Online Learning for Computer Vision Workshop, 2009.