The model in supervised learning usually refers to the mathematical structure by which the prediction $\hat{y}_i$ is made from the input $x_i$. A common example is a linear model, where the prediction is given as $\hat{y}_i = \sum_{j}\theta_{j}x_{ij}$, a linear combination of weighted input features.
The parameters are the undetermined part that we need to learn from data. In linear regression problems, the parameters are the coefficients $\theta$.
The task of training the model amounts to finding the parameters $\theta$ that best fit the training data $x_i$ and labels $y_i$. In order to train the model, we need to define an objective function that measures how well the model fits the training data.
The objective function consists of two parts, a training loss and a regularization term: $$obj(\theta) = L(\theta) + \Omega(\theta)$$ where $L$ is the training loss function and $\Omega$ is the regularization term.
The training loss measures how predictive our model is with respect to the training data: $$L(\theta) = \sum_{i}(y_i - \hat{y}_i)^2$$
The regularization term controls the complexity of the model, which helps us to avoid overfitting.
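To make these pieces concrete, here is a minimal numpy sketch for a linear model with the squared-error loss above and an L2 penalty (one common choice for $\Omega$); the data and the regularization strength lam are made up for illustration:

import numpy as np

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # inputs x_i: 3 examples, 2 features
y = np.array([5.0, 11.0, 17.0])                     # labels y_i
theta = np.array([1.0, 2.0])                        # model parameters
lam = 0.1                                           # regularization strength (illustrative)

y_hat = X @ theta                 # linear model: y_hat_i = sum_j theta_j * x_ij
L = np.sum((y - y_hat) ** 2)      # training loss: squared error
Omega = lam * np.sum(theta ** 2)  # L2 regularization term
obj = L + Omega                   # obj(theta) = L(theta) + Omega(theta)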

The tree ensemble model consists of a set of classification and regression trees (CART). In CART, a real score is associated with each of the leaves, which gives us richer interpretations that go beyond classification.
Usually, a single tree is not strong enough to be used in practice. What is actually used is the ensemble model, which sums the predictions of multiple trees together. An important property of the ensemble is that the trees complement each other: each tree corrects part of the error left by the others. We can write the model in the form: $$\hat{y}_{i} = \sum\limits_{k = 1}^{K}f_{k}(x_{i}), \quad f_{k}\in\mathcal{F}$$ where $K$ is the number of trees, $f_{k}$ is a function in the functional space $\mathcal{F}$, and $\mathcal{F}$ is the set of all possible CARTs.
The objective function to be optimized is given by: $$obj(\theta) = \sum\limits_{i = 1}^{n}l(y_{i}, \hat{y}_{i}) + \sum\limits_{k = 1}^{K}\omega(f_{k})$$ where $l$ is the loss and $\omega(f_{k})$ is the complexity of the tree $f_{k}$.
What are the parameters of trees? What we need to learn are the functions $f_k$, each encoding the structure of a tree and its leaf scores. How should we learn them? As with any supervised model: define an objective function and optimize it!
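To make the additive model concrete, here is a toy sketch (purely illustrative, not how XGBoost represents trees internally): each tree is a function from a feature vector to a leaf score, and the ensemble prediction is the sum of the $K$ scores. The stumps and numbers are made up:

# toy "trees": depth-1 stumps, each mapping an input x to a leaf score
def stump(feature, threshold, left_score, right_score):
    return lambda x: left_score if x[feature] < threshold else right_score

trees = [stump(0, 0.5, -0.4, 0.7), stump(1, 1.2, 0.2, -0.1)]  # K = 2 toy trees
x = [0.3, 2.0]                                                # one input x_i
y_hat = sum(f_k(x) for f_k in trees)                          # y_hat_i = sum over k of f_k(x_i)
print(y_hat)  # -0.4 + (-0.1) = -0.5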
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pylab import rcParams
import xgboost as xgb
%matplotlib inline
First, load in the data and look at it. We've taken a 10k event subsample of the Kaggle training data. Then we'll put it in the right format for xgboost.
data = pd.read_csv('training_10k.csv')
Let's see what the data looks like:
print('Size of data: {}'.format(data.shape))
print('Number of events: {}'.format(data.shape[0]))
print('Number of columns: {}'.format(data.shape[1]))
print('\nList of features in dataset:')
for col in data.columns:
    print(col)
Size of data: (10000, 33)
Number of events: 10000
Number of columns: 33

List of features in dataset:
EventId
DER_mass_MMC
DER_mass_transverse_met_lep
DER_mass_vis
DER_pt_h
DER_deltaeta_jet_jet
DER_mass_jet_jet
DER_prodeta_jet_jet
DER_deltar_tau_lep
DER_pt_tot
DER_sum_pt
DER_pt_ratio_lep_tau
DER_met_phi_centrality
DER_lep_eta_centrality
PRI_tau_pt
PRI_tau_eta
PRI_tau_phi
PRI_lep_pt
PRI_lep_eta
PRI_lep_phi
PRI_met
PRI_met_phi
PRI_met_sumet
PRI_jet_num
PRI_jet_leading_pt
PRI_jet_leading_eta
PRI_jet_leading_phi
PRI_jet_subleading_pt
PRI_jet_subleading_eta
PRI_jet_subleading_phi
PRI_jet_all_pt
Weight
Label
The data set has 10,000 events with 33 columns each. The first column is an identifier and shouldn't be used as a feature. The last two columns, "Weight" and "Label", are the weights and labels from the simulation and also shouldn't be used as features (this is all described in the documentation).
Now we can look at how many events are signal and background:
# look at the column labels --- notice the last one is "Label", the first is "EventId", and there is also "Weight"
print('Number of signal events: {}'.format(len(data[data.Label == 's'])))
print('Number of background events: {}'.format(len(data[data.Label == 'b'])))
print('Fraction signal: {}'.format(len(data[data.Label == 's']) / float(len(data[data.Label == 's']) + len(data[data.Label == 'b']))))
Number of signal events: 3372
Number of background events: 6628
Fraction signal: 0.3372
Now we should get the data into an XGBoost-friendly format. We can create DMatrix objects that will be used to train the BDT model. For now, we'll use all 30 of the features for training.
First, we'll slice up the data into training and testing sets. Here, we take 20% for the test set; the exact fraction is arbitrary.
In this file, all samples are independent and ordered randomly, so we can just grab a chunk. Check out scikit-learn Cross-validation for dividing up samples in a responsible way.
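For reference, here is a minimal sketch of an equivalent shuffled split with scikit-learn (scikit-learn is not otherwise required in this tutorial, and the random_state value is arbitrary):

from sklearn.model_selection import train_test_split

# a reproducible, shuffled 80/20 split; equivalent to the simple slicing below
# when the rows are already randomly ordered
data_train, data_test = train_test_split(data, test_size = 0.2, random_state = 42)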
We can also change the data type of the "Label" column to the pandas type "category" for easier use later.
data['Label'] = data.Label.astype('category')
data_train = data[:8000]
data_test = data[8000:]
Check to make sure we did it right:
print('Number of training samples: {}'.format(len(data_train)))
print('Number of testing samples: {}'.format(len(data_test)))
print('\nNumber of signal events in training set: {}'.format(len(data_train[data_train.Label == 's'])))
print('Number of background events in training set: {}'.format(len(data_train[data_train.Label == 'b'])))
print('Fraction signal: {}'.format(len(data_train[data_train.Label == 's']) / float(len(data_train[data_train.Label == 's']) + len(data_train[data_train.Label == 'b']))))
Number of training samples: 8000
Number of testing samples: 2000

Number of signal events in training set: 2688
Number of background events in training set: 5312
Fraction signal: 0.336
The DMatrix object takes as arguments the feature data, the labels (here, the category codes of the "Label" column), the value to be treated as missing, and the feature names:
feature_names = data.columns[1:-2] # we skip the first and last two columns because they are the ID, weight, and label
train = xgb.DMatrix(data = data_train[feature_names], label = data_train.Label.cat.codes,
                    missing = -999.0, feature_names = feature_names)
test = xgb.DMatrix(data = data_test[feature_names], label = data_test.Label.cat.codes,
                   missing = -999.0, feature_names = feature_names)
Check if we did it right:
print('Number of training samples: {}'.format(train.num_row()))
print('Number of testing samples: {}'.format(test.num_row()))
print('\nNumber of signal events in training set: {}'.format(len(np.where(train.get_label())[0])))
Number of training samples: 8000
Number of testing samples: 2000

Number of signal events in training set: 2688
In general, the tunable parameters in XGBoost are the ones you would see in other gradient boosting libraries. Here, they fall into three categories: general parameters, booster parameters, and learning task parameters.
Here, we will use the defaults for most parameters and just set a few to see how it's done. The parameters are passed in as a dictionary or a list of pairs.
Make the parameter dictionary:
param = {}
# Booster parameters
param['eta'] = 0.1 # learning rate
param['max_depth'] = 10 # maximum depth of a tree
param['subsample'] = 0.8 # fraction of events to train tree on
param['colsample_bytree'] = 0.8 # fraction of features to train each tree on
# Learning task parameters
param['objective'] = 'binary:logistic' # objective function
param['eval_metric'] = 'error' # evaluation metric for cross validation
param = list(param.items()) + [('eval_metric', 'logloss')] + [('eval_metric', 'rmse')] # list of pairs so we can monitor several metrics
num_trees = 300 # number of trees to make
First, we set the booster parameters. Again, we just chose a few here to experiment with. These are the parameters to tune to optimize your model. Generally, there is a trade-off between speed and accuracy.
eta is the learning rate. It determines how much each new tree's contribution is scaled at each boosting iteration; smaller values make the boosting more conservative. The default is 0.3.

max_depth is the maximum depth of any tree. The default is 6.

subsample is the fraction of events used to train each new tree. These events are randomly sampled at each iteration from the whole sample set. The default is 1 (use every event for each tree).

colsample_bytree is the fraction of features available to train each new tree. These features are randomly sampled at each iteration from the whole feature set. The default is 1.

Next, we set the learning objective to binary:logistic, so we have two classes that we want to score from 0 to 1. The eval_metric parameters set what we want to monitor when doing cross validation. (We aren't doing cross validation in this example, but we really should be!) If you want to watch more than one metric, 'param' must be a list of pairs instead of a dict; otherwise, we would just keep resetting the same key.
Last, we set the number of trees to 300. Usually, you would set this number high and choose a cutoff point based on cross validation: the number of trees is the same as the number of boosting iterations.
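As a sketch of what that cross validation could look like (not run here; xgb.cv is part of the XGBoost Python API and accepts the same parameters, and early_stopping_rounds picks the cutoff automatically once the test metric stops improving):

cv_results = xgb.cv(param, train, num_boost_round = num_trees, nfold = 5,
                    early_stopping_rounds = 10, seed = 42)
print(cv_results.tail())  # per-round train/test metrics; the last row is the chosen cutoff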
booster = xgb.train(param, train, num_boost_round = num_trees)
We now have a trained model. The next step is to look at its performance and try to improve the model if we need to. We can try to improve it by improving/adding features, adding more training data, using more boosting iterations, or tuning the hyperparameters (ideally in that order).
First, let's look at how it does on the test set:
print(booster.eval(test))
[0] eval-error:0.172500 eval-logloss:0.428761 eval-rmse:0.354899
These are the evaluation metrics that we stored in the parameter set.
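We can also cross-check the error metric by hand; for binary:logistic, the 'error' metric counts predictions above 0.5 as signal:

preds = booster.predict(test)
labels = test.get_label()
error = np.mean((preds > 0.5) != labels)  # fraction misclassified at the 0.5 threshold
print('error: {:.6f}'.format(error))      # should match eval-error above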
It's pretty hard to interpret the performance of a classifier from a few numbers. So, let's look at the predictions for the entire test set.
predictions = booster.predict(test)
# plot all predictions (both signal and background)
plt.figure()
plt.hist(predictions, bins = np.linspace(0, 1, 50), histtype = 'step', color = 'darkgreen', label = 'All events')
# make the plot readable
plt.xlabel('Prediction from BDT', fontsize = 12)
plt.ylabel('Events', fontsize = 12)
plt.legend(frameon = False)
# plot signal and background separately
plt.figure()
plt.hist(predictions[test.get_label().astype(bool)], bins = np.linspace(0, 1, 50),
         histtype = 'step', color = 'midnightblue', label = 'signal')
plt.hist(predictions[~(test.get_label().astype(bool))], bins = np.linspace(0, 1, 50),
         histtype = 'step', color = 'firebrick', label = 'background')
# make the plot readable
plt.xlabel('Prediction from BDT', fontsize = 12)
plt.ylabel('Events', fontsize = 12)
plt.legend(frameon = False)
[Figures: BDT prediction distribution for all events, and separately for signal and background]
# choose score cuts:
cuts = np.linspace(0, 1, 500)
nsignal = np.zeros(len(cuts))
nbackground = np.zeros(len(cuts))
for i, cut in enumerate(cuts):
    nsignal[i] = len(np.where(predictions[test.get_label().astype(bool)] > cut)[0])
    nbackground[i] = len(np.where(predictions[~(test.get_label().astype(bool))] > cut)[0])
# plot efficiency vs. purity (a ROC-like curve)
plt.figure()
with np.errstate(invalid = 'ignore'):  # purity is 0/0 for cuts above every prediction
    purity = nsignal/(nsignal + nbackground)
plt.plot(nsignal/len(data_test[data_test.Label == 's']), purity, 'o-', color = 'blueviolet')
# make the plot readable
plt.xlabel('Efficiency', fontsize = 12)
plt.ylabel('Purity', fontsize = 12)
[Figure: efficiency vs. purity curve]
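As a cross-check, here is a sketch of the standard ROC curve and AUC via scikit-learn (again, scikit-learn is an extra dependency, not otherwise used here):

from sklearn.metrics import roc_curve, auc

fpr, tpr, _ = roc_curve(test.get_label(), predictions)  # false/true positive rates vs. threshold
print('AUC: {:.3f}'.format(auc(fpr, tpr)))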
It's also very informative to look at the importance of each feature. By default, the "F score" is the number of times each feature is used to split the data, summed over all of the trees.
There is a built-in function in the XGBoost python API to easily plot this:
xgb.plot_importance(booster, grid = False)
[Figure: feature importance ranked by F score]
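The plot above uses the split-count importance. Other measures are available from the booster itself via get_score (part of the XGBoost Python API); for example, 'gain' weights each split by the loss reduction it achieved:

scores = booster.get_score(importance_type = 'gain')
for feat, score in sorted(scores.items(), key = lambda kv: -kv[1])[:5]:
    print('{}: {:.1f}'.format(feat, score))  # top five features by average gain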
#rcParams['figure.figsize'] = 500,400
xgb.plot_tree(booster, num_trees = 3)
fig = plt.gcf()
fig.set_size_inches(300, 300)
fig.savefig('tree.png')
The feature that was used the most was DER_mass_MMC. (For this data set, the "DER" prefix marks derived variables and "PRI" marks primitive, i.e. raw, variables.)
We can plot how this feature is distributed for the signal and background:
plt.figure()
plt.hist(data_train.DER_mass_MMC[data_train.Label == 's'], bins = np.linspace(0, 800, 50),
         histtype = 'step', color = 'midnightblue', label = 'signal')
plt.hist(data_train.DER_mass_MMC[data_train.Label == 'b'], bins = np.linspace(0, 800, 50),
         histtype = 'step', color = 'firebrick', label = 'background')
plt.xlabel('DER_mass_MMC', fontsize = 12)
plt.ylabel('Events', fontsize = 12)
plt.legend(frameon = False)
[Figure: DER_mass_MMC distribution for signal and background]
There is a lot of discriminating power in that variable: the signal peaks while the background is broad. For fun, we can plot it against the next most important feature:
plt.figure()
plt.plot(data_train.DER_mass_MMC[data_train.Label == 'b'], data_train.DER_mass_transverse_met_lep[data_train.Label == 'b'],
         'o', markersize = 2, color = 'firebrick', markeredgewidth = 0, alpha = 0.8, label = 'background')
plt.plot(data_train.DER_mass_MMC[data_train.Label == 's'], data_train.DER_mass_transverse_met_lep[data_train.Label == 's'],
         'o', markersize = 2, color = 'mediumblue', markeredgewidth = 0, alpha = 0.8, label = 'signal')
plt.xlim(0, 400)
plt.ylim(0, 200)
plt.xlabel('DER_mass_MMC', fontsize = 12)
plt.ylabel('DER_mass_transverse_met_lep', fontsize = 12)
plt.legend(frameon = False, numpoints = 1, markerscale = 2)
[Figure: DER_mass_transverse_met_lep vs. DER_mass_MMC scatter for signal and background]