Boosted Decision Tree Tutorial

Elements of supervised learning

Model and parameter

The model in supervised learning usually refers to the mathematical structure by which the prediction $\hat{y}_i$ is made from the input $x_i$. A common example is a linear model, where the prediction is given as $\hat{y}_i = \sum_{j}\theta_{j}x_{ij}$, a linear combination of weighted input features.

The parameters are the undetermined part that we need to learn from data. In linear regression problems, the parameters are the coefficients $\theta$.

Objective function: training loss + regularization

The task of training the model amounts to finding the parameters $\theta$ that best fit the training data $x_i$ and labels $y_i$. In order to train the model, we need to define an objective function that measures how well the model fits the training data.

The objective function consists of two parts, a training loss and a regularization term: $$obj(\theta) = L(\theta) + \Omega(\theta)$$ where $L$ is the training loss function and $\Omega$ is the regularization term.

The training loss measures how predictive our model is with respect to the training data: $$L(\theta) = \sum_{i}(y_i - \hat{y}_i)^2$$

The regularization term controls the complexity of the model, which helps us to avoid overfitting.
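As a concrete sketch of this objective for a linear model: the squared-error loss plus an L2 penalty. The penalty form and its strength `lam` are illustrative assumptions here, not fixed by the text.

```python
import numpy as np

# Sketch: objective = training loss L(theta) + regularization Omega(theta)
# for a linear model y_hat = X @ theta. The L2 penalty and "lam" are
# illustrative choices, not prescribed by the tutorial.
def objective(theta, X, y, lam=1.0):
    y_hat = X @ theta                      # linear-model predictions
    train_loss = np.sum((y - y_hat) ** 2)  # L(theta): squared error
    reg = lam * np.sum(theta ** 2)         # Omega(theta): L2 penalty
    return train_loss + reg

# Tiny check: with theta = 0 the objective is just sum(y^2).
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
obj_zero = objective(np.zeros(2), X, y)
```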

Decision tree ensembles

The tree ensemble model consists of a set of classification and regression trees (CART). In CART, a real score is associated with each of the leaves, which gives us richer interpretations that go beyond classification.

Usually, a single tree is not strong enough to be used in practice. What is actually used is the ensemble model, which sums the predictions of multiple trees together. An important property of such an ensemble is that the trees try to complement each other. We can write the model in the form: $$\hat{y}_{i} = \sum\limits_{k = 1}^{K}f_{k}(x_{i}), f_{k}\in\mathcal{F}$$ where K is the number of trees, $f_{k}$ is a function in the functional space $\mathcal{F}$, and $\mathcal{F}$ is the set of all possible CARTs.
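The additive form $\hat{y}_i = \sum_k f_k(x_i)$ can be sketched with hand-written decision stumps standing in for the trees $f_k$; the thresholds and leaf scores below are made up purely for illustration.

```python
# Sketch: each "tree" f_k is a decision stump returning a leaf score,
# and the ensemble prediction is the sum over all K stumps. The
# thresholds and scores are illustrative, not from real training.
def stump(threshold, left_score, right_score):
    return lambda x: left_score if x < threshold else right_score

trees = [
    stump(2.0, +0.5, -0.5),   # f_1
    stump(4.0, +0.3, -0.3),   # f_2: tries to complement f_1
]

def ensemble_predict(x, trees):
    return sum(f(x) for f in trees)   # y_hat = sum_k f_k(x)

pred = ensemble_predict(1.0, trees)   # both stumps take the left leaf
```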

The objective function to be optimized is given by: $$obj(\theta) = \sum\limits_{i = 1}^{n}l(y_{i}, \hat{y}_{i}) + \sum\limits_{k = 1}^{K}\omega(f_{k})$$ where $\omega(f_{k})$ is the complexity of the tree $f_{k}$.

Tree boosting

How should we learn the trees? Define an objective function and optimize it!

Additive training

What are the parameters of trees? What we need to learn are those functions $f_i$, each containing the structure of the tree and the leaf scores.

Tutorial

The data

Load data

First, load in the data and look at it. We've taken a 10k event subsample of the Kaggle training data. Then we'll put it in the right format for xgboost.
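A minimal sketch of the loading step with pandas; the actual filename is not given in the text, so a tiny in-memory stand-in is read here. With the real 10k-event CSV you would pass its path to `pd.read_csv` instead. The column names shown are from the Kaggle (HiggsML) data format.

```python
import io
import pandas as pd

# Stand-in for the real file: with the actual subsample you would call
# pd.read_csv("<path to the 10k-event CSV>") instead.
csv_text = io.StringIO(
    "EventId,DER_mass_MMC,Weight,Label\n"
    "100000,138.47,0.0026,s\n"
    "100001,160.93,2.2335,b\n"
)
df = pd.read_csv(csv_text)
n_rows, n_cols = df.shape   # the real subsample should be (10000, 33)
```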

Let's see what the data looks like:

The data set has 10,000 events with 33 columns each. It looks like the first column is an identifier, and shouldn't be used as a feature. The last two columns, "Weight" and "Label", are the weights and labels from the simulation, and also shouldn't be used as features (this information is all contained in the documentation).

Now we can look at how many events are signal and background:
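One way to count the classes is `value_counts` on the "Label" column; the tiny DataFrame below stands in for the loaded data.

```python
import pandas as pd

# Sketch: count signal ('s') vs background ('b') events via the
# "Label" column; the toy DataFrame stands in for the real sample.
df = pd.DataFrame({"Label": ["s", "b", "b", "s", "b"]})
counts = df["Label"].value_counts()
n_background = int(counts["b"])
n_signal = int(counts["s"])
```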

Format data:

Now we should get the data into an XGBoost-friendly format. We can create DMatrix objects that will be used to train the BDT model. For now, we'll use all 30 of the features for training.

First, we'll slice up the data into training and testing sets. Here, we take 20% for the test set, which is an arbitrary choice.

In this file, all samples are independent and ordered randomly, so we can just grab a chunk. Check out scikit-learn Cross-validation for dividing up samples in a responsible way.
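Because the rows are already independent and randomly ordered, the 80/20 split can be a simple slice; the toy DataFrame below stands in for the full 10,000-event sample.

```python
import numpy as np
import pandas as pd

# Sketch: take the last 20% of rows as the test set. This is only safe
# because the rows are independent and randomly ordered already.
df = pd.DataFrame({"x": np.arange(100)})
n_test = len(df) // 5          # 20% for the test set
df_train = df.iloc[:-n_test]
df_test = df.iloc[-n_test:]
```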

We can also change the data type of the "Label" column to the pandas type "category" for easier use later.
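The dtype change is a one-liner with `astype`; the toy column stands in for the real "Label" column.

```python
import pandas as pd

# Sketch: convert "Label" to the pandas "category" dtype for easier
# selection later (e.g. via df["Label"].cat.categories).
df = pd.DataFrame({"Label": ["s", "b", "s"]})
df["Label"] = df["Label"].astype("category")
dtype_name = df["Label"].dtype.name
```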

Check to make sure we did it right:

The DMatrix object takes as arguments:

Check if we did it right:

Make the model

Set hyperparameters:

In general, the tunable parameters in XGBoost are the ones you would see in other gradient boosting libraries. Here, they fall into three categories:

  1. General parameters - Ex.: which booster to use, number of threads. I won't mess with any of these here.
  2. Booster parameters - Tune the actual boosting. Ex.: learning rate. These are the ones to optimize.
  3. Learning task parameters - Define the objective function and the evaluation metrics.

Here, we will use the defaults for most parameters and just set a few to see how it's done. The parameters are passed in as a dictionary or list of pairs.

Make the parameter dictionary:

First, we set the booster parameters. Again, we just chose a few here to experiment with. These are the parameters to tune to optimize your model. Generally, there is a trade-off between speed and accuracy.

  1. eta is the learning rate (step-size shrinkage). It scales the contribution of each new tree after each boosting iteration. The default is 0.3.
  2. max_depth is the maximum depth of any tree. The default is 6.
  3. subsample is the fraction of events used to train each new tree. These events are randomly sampled each iteration from the whole sample set. The default is 1 (use every event for each tree).
  4. colsample_bytree is the fraction of features available to train each new tree. These features are randomly sampled each iteration from the whole feature set. The default is 1.

Next, we set the learning objective to binary:logistic, since we have two classes that we want to score from 0 to 1. The eval_metric parameter sets what we want to monitor when doing cross validation. (We aren't doing cross validation in this example, but we really should be!) If you want to watch more than one metric, 'param' must be a list of pairs instead of a dict; otherwise, we would just keep overwriting the same key.

Last, we set the number of trees to 100. Usually, you would set this number high and choose a cut-off point based on cross validation. The number of trees is the same as the number of boosting iterations.
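The parameter choices above can be collected like this; the specific values for eta, subsample, and colsample_bytree below are illustrative assumptions, not prescribed by the text.

```python
# Sketch of the parameter set discussed above. The non-default values
# (eta=0.1, subsample=0.8, colsample_bytree=0.8) are illustrative.
param = {
    "eta": 0.1,                # learning rate (default 0.3)
    "max_depth": 6,            # maximum tree depth (default 6)
    "subsample": 0.8,          # fraction of events per tree (default 1)
    "colsample_bytree": 0.8,   # fraction of features per tree (default 1)
    "objective": "binary:logistic",
}
# To monitor several metrics, use a list of pairs so the key can repeat:
param = list(param.items()) + [("eval_metric", "logloss"),
                               ("eval_metric", "error")]
num_trees = 100                # number of boosting iterations
```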

Now train!

We now have a trained model. The next step is to look at its performance and try to improve the model if we need to. We can try to improve it by improving/adding features, adding more training data, using more boosting iterations, or tuning the hyperparameters (ideally in that order).

Evaluate

First, let's look at how it does on the test set:

These are the evaluation metrics that we stored in the parameter set.

It's pretty hard to interpret the performance of a classifier from a few numbers. So, let's look at the predictions for the entire test set.

It's also very informative to look at the importance of each feature. The "F score" is the number of times each feature is used to split the data over all of the trees (times the weight of that tree).

There is a built-in function in the XGBoost python API to easily plot this:

The feature that was used the most was DER_mass_MMC. (For this data the "DER" prefix is for derived variables, and "PRI" is for primitive, i.e. raw, variables.)

We can plot how this feature is distributed for the signal and background:
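A sketch of overlaying the feature's distribution for the two classes; the Gaussian toy samples below stand in for the real DER_mass_MMC values of signal and background events.

```python
import matplotlib
matplotlib.use("Agg")          # headless backend for scripting
import matplotlib.pyplot as plt
import numpy as np

# Sketch: overlaid, normalized histograms of one feature for signal and
# background. The Gaussian toy samples are illustrative stand-ins.
rng = np.random.default_rng(0)
sig = rng.normal(125.0, 10.0, size=1000)   # toy "signal" values
bkg = rng.normal(100.0, 25.0, size=1000)   # toy "background" values
fig, ax = plt.subplots()
ax.hist(sig, bins=50, alpha=0.5, label="signal", density=True)
ax.hist(bkg, bins=50, alpha=0.5, label="background", density=True)
ax.set_xlabel("DER_mass_MMC")
ax.legend()
n_patches = len(ax.patches)    # 50 bars per histogram
```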

There is not a lot of discriminating power in that variable. For fun, we can plot it with the next most important feature: