7.1. Regression trees

We will start with regression trees, a type of decision tree model that can be used for regression problems.

Watch the 22-minute video below for a visual explanation of regression trees.

7.1.1. Regression trees for salary prediction

Get ready by importing the APIs needed from the respective libraries.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

%matplotlib inline

Set a random seed for reproducibility.

np.random.seed(2022)

We use the Hitters dataset (click to explore) to predict a baseball player’s Salary based on Years (the number of years that the player has played in the major leagues) and Hits (the number of hits that the player made in the previous year).

Load the dataset and remove the rows with missing values. Inspect the first few rows of the dataset.

hitters_url = "https://github.com/pykale/transparentML/raw/main/data/Hitters.csv"

hitters_df = pd.read_csv(hitters_url).dropna()
hitters_df.head()
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI CWalks League Division PutOuts Assists Errors Salary NewLeague
1 315 81 7 24 38 39 14 3449 835 69 321 414 375 N W 632 43 10 475.0 N
2 479 130 18 66 72 76 3 1624 457 63 224 266 263 A W 880 82 14 480.0 A
3 496 141 20 65 78 37 11 5628 1575 225 828 838 354 N E 200 11 3 500.0 N
4 321 87 10 39 42 30 2 396 101 12 48 46 33 N E 805 40 4 91.5 N
5 594 169 4 74 51 35 11 4408 1133 19 501 336 194 A W 282 421 25 750.0 A

Take a look at the structure of the dataset.

hitters_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 263 entries, 1 to 321
Data columns (total 20 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   AtBat      263 non-null    int64  
 1   Hits       263 non-null    int64  
 2   HmRun      263 non-null    int64  
 3   Runs       263 non-null    int64  
 4   RBI        263 non-null    int64  
 5   Walks      263 non-null    int64  
 6   Years      263 non-null    int64  
 7   CAtBat     263 non-null    int64  
 8   CHits      263 non-null    int64  
 9   CHmRun     263 non-null    int64  
 10  CRuns      263 non-null    int64  
 11  CRBI       263 non-null    int64  
 12  CWalks     263 non-null    int64  
 13  League     263 non-null    object 
 14  Division   263 non-null    object 
 15  PutOuts    263 non-null    int64  
 16  Assists    263 non-null    int64  
 17  Errors     263 non-null    int64  
 18  Salary     263 non-null    float64
 19  NewLeague  263 non-null    object 
dtypes: float64(1), int64(16), object(3)
memory usage: 43.1+ KB

Decision trees typically do not require feature scaling. However, take a look at the distribution of the Salary variable on the original scale and on the log scale in the histograms below. We can see that the distribution of Salary is skewed to the right, whereas the log-transformed distribution is more symmetric and closer to a bell shape, i.e. the normal/Gaussian distribution that is preferred in many machine learning models. Therefore, we will use the log-transform of Salary as the target variable in our regression tree model.

X = hitters_df[["Years", "Hits"]].values
y = np.log(hitters_df["Salary"].values)  # log transform the target variable

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))
ax1.hist(hitters_df.Salary.values, bins=20)
ax1.set_xlabel("Salary")
ax2.hist(y, bins=20)
ax2.set_xlabel("Log(Salary)");
[Figure: histograms of Salary (left) and log(Salary) (right)]

Let’s fit a regression tree to the data. We will use the DecisionTreeRegressor class from the sklearn.tree module. We will set the max_leaf_nodes hyperparameter to 3, which means that the tree will have at most 3 leaf nodes.

regr = DecisionTreeRegressor(max_leaf_nodes=3)
regr.fit(X, y)
DecisionTreeRegressor(max_leaf_nodes=3)

One attractive feature of a tree-based model is visualisation. Let us visualise the learnt tree using the plot_tree function from the sklearn.tree module. We will set the filled parameter to True to colour the nodes according to the predicted value (the larger the value, the darker the colour). We will set the feature_names parameter to ['Years', 'Hits'] to label the features.

plt.figure(figsize=(15, 10))  # customize according to the size of your tree
plot_tree(regr, filled=True, feature_names=["Years", "Hits"], fontsize=14)
plt.show()
[Figure: the fitted regression tree with three leaf nodes]

The above depicts a regression tree fitted to the data. The tree has three leaf nodes, and each leaf node corresponds to a region in the feature space. The predicted value for each region is the average value of the Salary variable in the training data that falls into that region. We can examine the system logic of this simple regression tree to reveal its system transparency.

System transparency

  • When Years \(=5\) and Hits \(=100\), the predicted Salary is \(\$1000\times e^{5.998}\approx \$403,000\), which can be read off the tree above by following the right branch from the top first and then the left branch at the next level, reaching the leaf node in the middle.

  • To produce an estimated Salary of \(\$1000\times e^{6.6}\approx \$735,095\), we find the leaf node whose predicted value is closest to \(6.6\), which is the rightmost leaf node, and follow the path from that leaf node back to the top to see that we should have Years \(>4.5\) and Hits \(>117.5\), e.g. Years \(=5\) and Hits \(=120\), for a Salary closest to \(\$735,095\). As we can see, the estimation from the tree above is quite coarse, and we can get a more accurate estimation by increasing the number of leaf nodes. The sketch below verifies both read-offs directly with the fitted tree.
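
As a quick check of the two worked examples above (a minimal sketch, assuming the fitted three-leaf tree regr and the arrays X and y defined earlier), we can ask the tree for its predictions directly and convert them back from the log scale.

# Predict log(Salary) for the two hypothetical players discussed above,
# then convert back to dollars (Salary is recorded in $1000s).
examples = np.array([[5, 100], [5, 120]])  # columns: Years, Hits
log_salary_pred = regr.predict(examples)
for (years, hits), log_sal in zip(examples, log_salary_pred):
    print(f"Years={years}, Hits={hits}: predicted salary ≈ ${1000 * np.exp(log_sal):,.0f}")

The first prediction should land on the middle leaf (about \(e^{5.998}\) thousand dollars) and the second on the rightmost leaf (about \(e^{6.740}\) thousand dollars), illustrating how coarse a three-leaf tree is.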

We can see that this tree consists of a series of splitting rules, starting from the top. Each rule divides the feature space into two regions. For example, the first splitting rule is Years <= 4.5: if the number of years that a player has played in the major leagues is less than or equal to 4.5, the player is assigned to the left child node, with a predicted salary of \(\$1000\times e^{5.107}\approx \$165,000\); otherwise, the player is assigned to the right child node. The second splitting rule is Hits <= 117.5: if the number of hits that the player made in the previous year is less than or equal to 117.5, the player is assigned to the left child node, with a predicted salary of \(\$1000\times e^{5.998}\approx \$403,000\); otherwise, the player is assigned to the right child node, with a predicted salary of \(\$1000\times e^{6.740}\approx \$845,000\).

Thus, the tree represents a piecewise constant function of the features: the predicted value for a new data point is the average value of the target variable (here, log Salary) over the training data points that fall into the same region as the new data point. Such a model is easy to interpret and has a nice graphical representation.
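
To make this connection explicit, here is a small check (a sketch, assuming the arrays X and y from above): the value shown in the middle leaf should simply be the mean of log Salary over the training points with Years > 4.5 and Hits <= 117.5.

# The leaf prediction is just the average of the training targets in that region.
in_region = (X[:, 0] > 4.5) & (X[:, 1] <= 117.5)
print(y[in_region].mean())  # should match, up to rounding, the middle leaf value above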

We can further visualise the tree by plotting the regions in the feature space.

hitters_df.plot("Years", "Hits", kind="scatter", color="orange", figsize=(7, 6))
plt.xlim(0, 25)
plt.ylim(ymin=-5)
plt.xticks([1, 4.5, 24])
plt.yticks([1, 117.5, 238])
plt.vlines(4.5, ymin=-5, ymax=250)
plt.hlines(117.5, xmin=4.5, xmax=25)
plt.annotate("R1", xy=(2, 117.5), fontsize="xx-large")
plt.annotate("R2", xy=(11, 60), fontsize="xx-large")
plt.annotate("R3", xy=(11, 170), fontsize="xx-large");
[Figure: the three regions R1, R2, and R3 in the (Years, Hits) feature space]

Decision trees are typically drawn upside down, with the root node at the top and the leaves at the bottom. The root node holds the first splitting rule, and the leaves are the terminal nodes, corresponding to regions R1, R2, and R3 in this example. The points along the tree where the feature space is divided into two regions are called internal nodes or splitting nodes, and the splitting rules are the conditions used to divide the feature space at these nodes. The segments connecting the nodes of the tree are called branches, and the splitting rules are also called decision rules.

7.1.2. How to build/learn a regression tree?

How do we build (i.e. learn) a regression tree? In other words, how do we find the \(J\) regions that best divide (partition) the feature space for prediction? Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into \(J\) regions. Instead, as in stepwise feature selection, we take a top-down, greedy approach known as recursive binary splitting (recursive partitioning).

7.1.2.1. Recursive binary splitting

In recursive binary splitting, we start with the entire feature space as the root node (a single region), and then successively split the feature space into two regions over \(J-1\) steps. In each step, we consider all \(D\) features \(x_1, \cdots, x_D\) and all possible values of the threshold \(T_d\) (cutpoint, or splitting point) for each feature \(x_d\), and choose the feature and threshold such that using them to split one of the existing regions gives the resulting tree the lowest cost (best performance). In other words, each step consists of \(D\) sub-steps: in sub-step \(d\), we find the optimal threshold \(T_d\) that minimises the cost function (splitting criterion) when splitting an existing region on feature \(x_d\); we then choose the feature \(x_j\), with its optimal threshold \(T_j\), that minimises the cost function overall. The splitting rule for this step is then \(x_j \leq T_j\).
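
Written out with the RSS cost (defined in the next subsection), splitting a candidate region \(R\) in a single greedy step amounts to solving

\[
\min_{d,\, T} \left[ \sum_{i:\, x_i \in R,\; x_{id} \leq T} \left(y_i - \bar{y}_{\mathrm{left}}\right)^2 + \sum_{i:\, x_i \in R,\; x_{id} > T} \left(y_i - \bar{y}_{\mathrm{right}}\right)^2 \right],
\]

where \(x_{id}\) is the value of feature \(x_d\) for the \(i\)th training point, and \(\bar{y}_{\mathrm{left}}\) and \(\bar{y}_{\mathrm{right}}\) are the mean responses of the training points on each side of the candidate split. The combination of region, feature, and threshold with the lowest cost across all existing regions is the one actually used.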

This recursive binary splitting method is greedy because, at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step. Like stepwise feature selection, it is therefore also susceptible to overfitting.

7.1.2.2. Cost function

The cost function for a regression tree can be any performance metric for regression, such as the residual sum of squares (RSS) or the mean squared error (MSE), as defined in simple linear regression (Equations (2.3) and (2.8), respectively).
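
To make the greedy search concrete, here is a minimal sketch of a single binary split using the RSS cost. This is for illustration only (it is not how scikit-learn implements splitting internally); best_split is a hypothetical helper, and the final line assumes the Hitters arrays X and y defined earlier.

def best_split(X, y):
    """Find the (feature, threshold) pair whose binary split minimises the
    total RSS of the two resulting child regions."""
    best_cost, best_feature, best_threshold = np.inf, None, None
    for d in range(X.shape[1]):
        # Candidate thresholds: midpoints between consecutive sorted unique values.
        values = np.unique(X[:, d])
        for t in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, d] <= t], y[X[:, d] > t]
            # RSS of a region = sum of squared deviations from the region mean.
            cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if cost < best_cost:
                best_cost, best_feature, best_threshold = cost, d, t
    return best_feature, best_threshold

# Applied to the whole Hitters training set, this should recover the root split
# of the tree above, i.e. feature 0 (Years) with a threshold of 4.5.
print(best_split(X, y))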

7.1.2.3. When to stop splitting?

In regression trees, instead of specifying the number of regions \(J\) in advance, we stop splitting using a stopping criterion. The most common stopping criterion is to stop splitting a region when it contains no more than a pre-specified number of data points \(R\) (e.g. 5). Another stopping criterion is to stop splitting when the cost function (or its change over the last \(K\) steps) is no larger than a pre-specified number \(\epsilon\) (e.g. 0.01). Stopping the growth of the tree early in this way is known as pre-pruning or early stopping; the alternative, post-pruning, grows a large tree first and then prunes it back.
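
In scikit-learn, such criteria correspond to hyperparameters of DecisionTreeRegressor such as min_samples_split, min_samples_leaf and min_impurity_decrease (with ccp_alpha available for cost-complexity post-pruning). A small sketch, assuming the Hitters arrays X and y from earlier:

# Stop splitting when a leaf would hold fewer than 5 training points,
# or when a split would reduce the weighted impurity by less than 0.01.
regr_stop = DecisionTreeRegressor(min_samples_leaf=5, min_impurity_decrease=0.01)
regr_stop.fit(X, y)
print(regr_stop.get_n_leaves())  # number of leaves actually grown under these criteria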

You may refer to the advantages and disadvantages of decision trees and tips on practical use in the scikit-learn documentation.

7.1.3. Regression trees for house value prediction

Next, let us study the Boston dataset (click to explore). In this dataset, we want to predict the median value of owner-occupied homes in $1000s in Boston, based on 13 features (predictors) such as crime rate, average number of rooms per dwelling, and proportion of non-retail business acres per town.

Load the dataset and inspect the first few rows of the dataset.

boston_url = "https://github.com/pykale/transparentML/raw/main/data/Boston.csv"

boston_df = pd.read_csv(boston_url)
boston_df.head()
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2

Take a look at the structure of the dataset.

boston_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   crim     506 non-null    float64
 1   zn       506 non-null    float64
 2   indus    506 non-null    float64
 3   chas     506 non-null    int64  
 4   nox      506 non-null    float64
 5   rm       506 non-null    float64
 6   age      506 non-null    float64
 7   dis      506 non-null    float64
 8   rad      506 non-null    int64  
 9   tax      506 non-null    int64  
 10  ptratio  506 non-null    float64
 11  black    506 non-null    float64
 12  lstat    506 non-null    float64
 13  medv     506 non-null    float64
dtypes: float64(11), int64(3)
memory usage: 55.5 KB

Split the dataset into training (50%) and test sets and fit a regression tree.

X2 = boston_df.drop("medv", axis=1)
y2 = boston_df.medv
train_size = 0.5
max_depth = 3

X_train, X_test, y_train, y_test = train_test_split(X2, y2, train_size=train_size)
regr_tree = DecisionTreeRegressor(max_depth=max_depth)
regr_tree.fit(X_train, y_train)
DecisionTreeRegressor(max_depth=3)

Visualise the tree.

plt.figure(figsize=(28, 10))  # customize according to the size of your tree
plot_tree(regr_tree, filled=True, feature_names=X2.columns, fontsize=14)
plt.show()
[Figure: the fitted regression tree of depth 3 on the Boston data]

Compute the test set MSE.

y_pred = regr_tree.predict(X_test)
print(mean_squared_error(y_test, y_pred))
29.741127328020855
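
Since medv is measured in $1000s, taking the square root of the test MSE puts the error on the same scale as the target (a quick sketch, assuming y_test and y_pred from above):

# Root mean squared error, converted to dollars for easier interpretation.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Test RMSE ≈ ${1000 * rmse:,.0f}")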

7.1.4. Exercises

1. All the following exercises use the Carseats dataset.

Load the Carseats dataset, convert the values of variables (predictors) from categorical data to numerical data, and inspect the first five rows.

# Write your code below to answer the question

Compare your answer with the reference solution below

import pandas as pd

carseat_url = "https://github.com/pykale/transparentML/raw/main/data/Carseats.csv"

carseat_df = pd.read_csv(carseat_url)
carseat_df["ShelveLoc"] = carseat_df["ShelveLoc"].factorize()[0]
carseat_df["Urban"] = carseat_df["Urban"].factorize()[0]
carseat_df["US"] = carseat_df["US"].factorize()[0]
carseat_df.head()
Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US
0 9.50 138 73 11 276 120 0 42 17 0 0
1 11.22 111 48 16 260 83 1 65 10 0 0
2 10.06 113 35 10 269 80 2 59 12 0 0
3 7.40 117 100 4 466 97 2 55 14 0 0
4 4.15 141 64 3 340 128 0 38 13 0 1

2. Create a training set using \(80\%\) of the data from Exercise 1 and use the remaining \(20\%\) for testing. Fit a DecisionTreeRegressor to the data to predict the Sales of the carseats. Calculate the MSE and \(R^2\) score of the model. Use \(2022\) as the random state value for both train_test_split and your DecisionTreeRegressor.

# Write your code below to answer the question

Compare your answer with the reference solution below

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

np.random.seed(2022)
X2 = carseat_df.drop("Sales", axis=1)
y2 = carseat_df.Sales
train_size = 0.8
max_depth = 3

X_train, X_test, y_train, y_test = train_test_split(
    X2, y2, train_size=train_size, random_state=2022
)
regr_tree = DecisionTreeRegressor(max_depth=max_depth, random_state=2022)
regr_tree.fit(X_train, y_train)

y_pred = regr_tree.predict(X_test)
print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred))
5.411861458676102
0.3428686250711688

3. Now, plot the tree of the trained model.

# Write your code below to answer the question

Compare your answer with the reference solution below

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(28, 10))  # customize according to the size of your tree
plot_tree(regr_tree, filled=True, feature_names=X2.columns, fontsize=14)
plt.show()
[Figure: the fitted regression tree for the Carseats exercise]