Decision trees are one of the most popular ML algorithm.
They can be used for regression and classification.
They categorize data in a similar way to human thinking.
Therefore, they are also easy to understand and interpret.
Succinctly, in a decision tree, each node represents a feature, each branch represents a decision, and leaves show predictions.
There are various algorithms for building trees, such as ID3, CART. They differ in which questions they ask for dividing the data.

Building a decision tree using CART algorithm

CART stands for Classification and Regression Trees.

Classification algorithm

Choose a root node for the tree. The root node receives all the training set.
Each node asks a yes/no question about one of the features. The response to this question splits the data into two subsets.
Each subset is then divided into more subsets until a node is “pure” with respect to the outcome.
When a node is “pure”, we call it a leaf. A leaf needs no more splitting.
We can build an effective tree if we know which questions to ask and when. To find the best question among a set of questions, we need to quantify how much each question purifies the outcomes.
A method for quantifying uncertainty at a node is the “Gini impurity” method.
A method for quantifying how much a question reduces the uncertainty is “Information gain”.

What questions can be asked and when

If it is a categorical variable, each value from each feature of the input data is a candidate question. Questions are in the form “is the feature equal to value?”
If it is a continuous variable, one approach for choosing the split points is to sort all the values in a variable, and choose the mid-points between adjacent values for asking questions.
Another approach for choosing the split points with a continuous variable is to choose bins based on certain quantiles of the values, e.g. 20%, 40%, 60%, and. 80%.
This paper describes approaches for splitting continuous variables.
Questions of a continuous variable will be then like “is the feature equal to or greater than value?
In response to each question, the data is divided into two sets, one for which the answer is yes and one for which the answer is no.
Best questions are those that reduce the uncertainty the most.

Gini impurity

Gini metric is between zero and one. Zero means there is no uncertainty. One means there is maximum uncertainty.
Impurity is the chance that we are wrong if we randomly assign a label to a sample from the set. If a subset of samples all have the same label, then impurity is zero because we cannot make a wrong guess.
Impurity can be calculate as \(G = 1- \sum_{i=1}^{n} p_i^2\), where \(p_i\) is the probability of outcome \(i\) in the dataset.

Information gain

Information gain tells us which question reduces the uncertainty the most.
To quantify the amount of uncertainty of information reduction, we do the following:
1. Calculate the Gini impurity metric for all possible questions at a node.
2. Calculate the uncertainty of the two child nodes resulting from each question.
3. Take a weighted average of the two uncertainties. A weighted average gives more weight to the larger set.
  - To calculate the weighted average, take a normal mean of each set and multiply it by the fraction of the data in that set. Finally, sum the two means.
4. Subtract the uncertainty of the child nodes from the uncertainty of the parent node. That will be our information gain.
Choose the question that produces the most gain.

Regression algorithm

To build regression trees, we need to determine two things. a) when a subset of samples are 100% pure, so that we stop branching, and b) which feature value to use for creating two data subsets (branching).
A cart regression algorithm differs with its classification algorithm in choosing the best question.
Instead of aiming for splitting data to have two purerst subsets of data, we split the data to have the smallest residuals sums of squares.
To determine when to stop splitting, we can set a MSE as the acceptable amount.

Prunning

With prunning, we check whether a smaller tree could have given us similar results.
A smaller/simpler tree is favored because it reduces the chance of over-fitting.
An approach towards prunning is avoiding splitting a branch if information gain is not significant.
Another approach is to limit the size of a tree (maximum depth). If all leaves are pure, this will result in a highly complex and possibly overfit tree.
Another approach is to limit the maximum number of leaves.
Another method is to require a minimum number of samples in a node before splitting it.
The above approaches are not mutually exclusive and can be used at the same time.

Pros

Easy visualization and comprehension.
Does not need feature scaling.
Fast.

Cons

Decision tree regressors cannot extrapolate.
They overfit even with prunning.

Example

This is an example of making and plotting a decision tree.

using ScikitLearn
@sk_import tree: DecisionTreeClassifier

# Load the data
@sk_import datasets: load_breast_cancer
all_data = load_breast_cancer()
X = all_data["data"];
y = all_data["target"];

# Create a tree
model = DecisionTreeClassifier(min_samples_split=2, random_state=0, min_samples_leaf=1, max_depth=4)
fit!(model, X, y)

model.feature_importances_

# Plot the tree
using PyCall
@sk_import tree: export_graphviz
export_graphviz(model, out_file="mytree", class_names=["Healthy", "Cancerous"], feature_names=all_data["feature_names"], leaves_parallel=true, impurity=false, rounded=true, filled=true, label="root", proportion=true)

# To plot, you can either install graphviz for python. Or use online websites to view your plot.
# Installing graphviz: Method 1 using Conda
using Conda
Conda.add("graphviz")

# Method 2 using the installed Python on your computer
pyimport("graphviz")
dot_graph = read("mytree", String);
j = graphviz.Source(dot_graph, "mytree", "./", "pdf", "dot")
j.render()

# The `export_graphviz` function creates a file named "mytree" in the same directory you ran it. 
# Copy the content of the file and paste them in this website to view your tree:
# https://graphviz.christine.website/ or https://edotor.net/

Exercise

Worldwide, breast cancer is the most common type of cancer in women and the second highest in terms of mortality rates. Diagnosis of breast cancer is performed when an abnormal lump is found (from self-examination or x-ray) or a tiny speck of calcium is seen (on an x-ray). After a suspicious lump is found, the doctor will conduct a diagnosis to determine whether it is cancerous and, if so, whether it has spread to other parts of the body.

This breast cancer dataset ¹ (download from here) was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg.

It contains physical properties of the lump, such as its size, and whether it is cancerous or not. Use a decision tree to diagnose the status of a lump given its physical features. Draw the tree to investigate the importance of the features. Write a description from the tree that which kinds tumors are most dangerous.

Kaggle: breast cancer dataset ↩

Tags:

Decision trees