About this post
This post explains practical knowledge for using XGBoost. For a more conceptual introduction to what XGBoost does under the hood, please visit the companion post ‘Concepts of boosting algorithms in machine learning’.
Introduction
Extreme Gradient Boosting (XGBoost) is a powerful implementation of the tree-based gradient boosting algorithm with a popular Python interface. This note explores the core features of XGBoost, focusing on its regularised learning objective, gradient tree boosting, the learning rate parameter, row and column subsampling, and node splitting algorithms. By reading this note, you can understand the essence of XGBoost and how it differs from other implementations.
The official website for XGBoost can be found here. Its corresponding paper can be found here.
Core features
While XGBoost introduces a number of key improvements, this post focuses on the two most important ones: gradient tree boosting with regularisation and its node splitting algorithms.
Gradient tree boosting with regularisation
As in standard gradient tree boosting, a weak tree is fitted at each boosting iteration and then added to the final predictor.
Specifically, when growing a weak tree, the optimal split at each node is the one that gives the maximum loss reduction:

$$
\mathcal{L}_{\text{split}} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G_I^2}{H_I + \lambda}\right] - \gamma,
$$

where

$$
G_k = \sum_{j \in I_k} g_j, \qquad H_k = \sum_{j \in I_k} h_j,
$$

$k$ represents the left child ($L$), the right child ($R$) or the unsplit node ($I$), so that $G_I = G_L + G_R$ and $H_I = H_L + H_R$,
$g_j$ and $h_j$ represent the gradient and hessian of the loss for data point $j$ respectively, and
$\gamma$ and $\lambda$ represent the regularisation hyperparameters that control the size of the weak tree.
This formula is derived by minimising a second-order Taylor approximation of the regularised loss function. You can find the exact derivation in the original paper.
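To make the formula concrete, here is a minimal sketch, not XGBoost's actual implementation, of how the gain could be computed from the aggregated gradient and hessian statistics of a candidate split; all names are illustrative.

# Minimal sketch of the split gain formula above; illustrative only,
# not XGBoost's internal implementation.
def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Loss reduction of splitting a node into left/right children.

    g_*, h_* are the sums of gradients and hessians over the data points
    falling into each child; lam and gamma are the regularisation
    hyperparameters (lambda and gamma).
    """
    def score(g, h):
        # Regularised "quality" of a node: G^2 / (H + lambda)
        return g * g / (h + lam)

    g_parent = g_left + g_right          # statistics of the unsplit node
    h_parent = h_left + h_right
    return 0.5 * (score(g_left, h_left)
                  + score(g_right, h_right)
                  - score(g_parent, h_parent)) - gamma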
Node splitting
To find the best split for a given node during tree fitting, a split finding algorithm is needed to determine how to partition the data points in that node into left and right children.
In particular, XGBoost supports exact, approximate and histogram-based split finding algorithms.
Exact algorithm
This algorithm enumerates over all features, and for each feature it iterates over all values to find the split that maximises the loss reduction defined in the previous section. This is computationally intensive. To speed it up, XGBoost first sorts the data points by feature value and then accumulates the gradient and hessian statistics in a single pass.
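As a rough sketch of this idea (omitting XGBoost's engineering optimisations such as its sorted block layout), one could scan a single feature's sorted values while accumulating running gradient and hessian sums; all names below are illustrative.

import numpy as np

def best_split_exact(feature_values, grads, hessians, lam=1.0, gamma=0.0):
    """Illustrative exact split search for a single feature.

    Sorts the data points by feature value, then sweeps left to right
    while accumulating gradient/hessian sums, so that every candidate
    threshold is scored in O(1).
    """
    feature_values = np.asarray(feature_values, dtype=float)
    grads = np.asarray(grads, dtype=float)
    hessians = np.asarray(hessians, dtype=float)

    def score(g, h):
        # Regularised "quality" of a node: G^2 / (H + lambda)
        return g * g / (h + lam)

    order = np.argsort(feature_values)
    g_total, h_total = grads.sum(), hessians.sum()

    best_gain, best_threshold = -np.inf, None
    g_left = h_left = 0.0
    for i in order[:-1]:                 # the last sorted point cannot start a right child
        g_left += grads[i]
        h_left += hessians[i]
        gain = 0.5 * (score(g_left, h_left)
                      + score(g_total - g_left, h_total - h_left)
                      - score(g_total, h_total)) - gamma
        if gain > best_gain:
            # Threshold = largest feature value sent to the left child
            best_gain, best_threshold = gain, feature_values[i]
    return best_threshold, best_gain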
Approximate algorithm
Instead of considering all possible splits for all features, the approximate algorithm iterates over only a subset of candidate splits for each feature, proposed from percentiles of the feature distribution.
Candidate splits can be proposed before or during tree fitting. The former, also known as the global proposal, constructs candidate splits once before any node is split. The latter, also known as the local proposal, constructs fresh candidate splits for each node split. XGBoost supports both methods.
Histogram-based algorithm
This candidate splitting method is similar to the one used by LightGBM. Refer to the practical guide on LightGBM in the same series to learn more about this algorithm.
Missing data
When the training data contain missing values, XGBoost handles them by learning a default direction at each split: missing values are sent to whichever side (left or right) gives the larger loss reduction.
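A minimal sketch of this behaviour: missing values can simply be passed as NaN, and XGBoost learns the default direction for them at every split (the tiny dataset below is purely illustrative).

import numpy as np
import xgboost as xgb

# Toy data with missing values encoded as NaN; illustrative only.
X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 4.0],
              [5.0, 6.0]])
y = np.array([0, 1, 0, 1])

# XGBoost treats NaN as missing by default and learns, for every split,
# which side (left or right) missing values should follow.
dtrain = xgb.DMatrix(X, label=y, missing=np.nan)
model = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=5)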
Hyperparameters
While there are many hyperparameters to tune in XGBoost, this section explains the most important ones that come up repeatedly in data science and machine learning problems.
booster
This argument selects the boosting algorithm to use: gbtree, dart or gblinear. Both gbtree and dart use trees as weak learners, while gblinear uses linear models. The difference between gbtree and dart is that dart applies random dropout during training, masking existing weak learners with a certain probability, to further reduce overfitting.
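As an illustration, a dart booster with dropout could be configured as follows; rate_drop is the dart dropout probability, and all values here are arbitrary placeholders rather than recommendations.

import numpy as np
import xgboost as xgb

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "booster": "dart",        # tree booster with dropout
    "rate_drop": 0.1,         # probability of dropping existing trees each iteration
    "objective": "reg:squarederror",
}
model = xgb.train(params, dtrain, num_boost_round=50)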
eta
This argument corresponds to the learning rate, also known as the shrinkage rate. Please refer to our post Concepts of boosting algorithms in machine learning if needed.
gamma, lambda
These two arguments correspond to the two regularisation hyperparameters ($\gamma$ and $\lambda$) that control the size of the weak tree, as explained in the previous section of this post.
max_depth
This argument refers to the maximum depth that a weak tree can have at each boosting iteration. The higher this value, the more complex each weak tree can become, and the more likely the final model is to overfit.
max_leaves
This argument refers to the maximum number of leaves that a weak tree can have in each boosting iteration.
subsample
This argument refers to the fraction of rows randomly sampled for training each weak tree at each boosting iteration. This sampling can help reduce overfitting.
colsample_bytree, colsample_bylevel or colsample_bynode
These three arguments control column sampling for weak tree training. The sampling can happen per tree (bytree), per tree depth level (bylevel) or per node (bynode).
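To show how the hyperparameters above fit together, here is a minimal sketch using the scikit-learn style interface, where eta is exposed as learning_rate and lambda as reg_lambda; the values are arbitrary placeholders rather than recommendations.

import xgboost as xgb

# Arbitrary values for illustration only; tune them for your own problem.
model = xgb.XGBRegressor(
    n_estimators=200,
    learning_rate=0.1,        # eta
    gamma=1.0,                # minimum loss reduction required to split a node
    reg_lambda=1.0,           # L2 regularisation (lambda)
    max_depth=6,              # maximum depth of each weak tree
    subsample=0.8,            # fraction of rows sampled per boosting iteration
    colsample_bytree=0.8,     # fraction of columns sampled per tree
)
# model.fit(X_train, y_train) would then train the model as usual.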
sampling_method
This argument specifies how training data points are sampled when row subsampling is enabled. XGBoost supports:
uniform sampling — each data point has an equal probability of being selected
gradient-based sampling — each data point is sampled with probability proportional to the magnitude of its gradient statistics (only supported with the GPU histogram tree method)
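As a hedged configuration sketch, gradient-based sampling could be requested as follows; per the XGBoost documentation it is only supported with the GPU histogram tree method, so this assumes a recent XGBoost version and a CUDA-capable device.

# Configuration sketch only; requires a CUDA device and the histogram tree method.
params = {
    "tree_method": "hist",
    "device": "cuda",
    "sampling_method": "gradient_based",
    "subsample": 0.5,                 # sampling only applies when subsample < 1
    "objective": "reg:squarederror",
}
# params would then be passed to xgb.train(params, dtrain, ...) as usual.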
tree_method
This argument specifies the node splitting algorithm to be used. Node splitting algorithms supported by XGBoost are explained in the previous section Node splitting.
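For example, switching between the three split finding algorithms described earlier is just a matter of changing this parameter; the synthetic data below is purely illustrative.

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 10)), rng.normal(size=500)
dtrain = xgb.DMatrix(X, label=y)

# Train the same model with each of the three split finding algorithms.
for method in ("exact", "approx", "hist"):
    params = {"tree_method": method, "objective": "reg:squarederror"}
    booster = xgb.train(params, dtrain, num_boost_round=20)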