About this post
This post explains practical knowledge for using XGBoost. For a more conceptual introduction to what XGBoost does under the hood, please visit the companion post ‘Concepts of boosting algorithms in machine learning’.
Introduction
Extreme Gradient Boosting (XGBoost) is a powerful implementation of the tree-based gradient boosting algorithm with a popular Python interface. This note explores the core features of XGBoost, focusing on its regularised learning objective, gradient tree boosting, the learning rate parameter, row and column subsampling, and node splitting algorithms. By reading this note, you can understand the essence of XGBoost and how it differs from other implementations.
The official website for XGBoost can be found here. Its corresponding paper can be found here.
Core features
While XGBoost introduces a number of key improvements, this post focuses on the two most important ones: gradient tree boosting with regularisation and its node splitting algorithms.
Gradient tree boosting with regularisation
As in standard gradient tree boosting, a weak tree is fitted at each boosting iteration and then added to the final predictor.
Specifically, when growing a weak tree, the optimal split at each node is the one that gives the maximum loss reduction:

$$
\mathcal{L}_{\text{split}} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G_I^2}{H_I + \lambda}\right] - \gamma,
$$

where

$$
G_k = \sum_{j \in I_k} g_j, \qquad H_k = \sum_{j \in I_k} h_j,
$$

$k$ represents the left child ($L$), the right child ($R$) or the unsplit node ($I$), so that $G_I = G_L + G_R$ and $H_I = H_L + H_R$,
$g_j$ and $h_j$ represent the gradient and hessian of the loss for data point $j$ respectively, and
$\gamma$ and $\lambda$ represent the regularisation hyperparameters that control the size of the weak tree.
This formula is derived by minimising a second-order Taylor approximation of the regularised loss function. You can find the exact derivation in the original paper.
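To make the formula concrete, here is a minimal sketch, not XGBoost's actual implementation, of how the gain could be computed from the aggregated gradient and hessian statistics of a candidate split; all names are illustrative.

# Minimal sketch of the split gain formula above; illustrative only,
# not XGBoost's internal implementation.
def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Loss reduction of splitting a node into left/right children.

    g_*, h_* are the sums of gradients and hessians over the data points
    falling into each child; lam and gamma are the regularisation
    hyperparameters (lambda and gamma).
    """
    def score(g, h):
        # Regularised "quality" of a node: G^2 / (H + lambda)
        return g * g / (h + lam)

    g_parent = g_left + g_right          # statistics of the unsplit node
    h_parent = h_left + h_right
    return 0.5 * (score(g_left, h_left)
                  + score(g_right, h_right)
                  - score(g_parent, h_parent)) - gamma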
Node splitting
To find the best split for a given node during tree fitting, a split finding algorithm is needed to determine how to partition the data points in that node into left and right children.
In particular, XGBoost supports exact, approximate and histogram-based split finding algorithms.
Exact algorithm
This algorithm enumerates over all features, and for each feature it iterates over all values to find the split that maximises the loss reduction defined in the previous section. This is computationally intensive. To speed it up, XGBoost first sorts the data points by feature value and then accumulates the gradient and hessian statistics in a single pass.
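As a rough sketch of this idea (omitting XGBoost's engineering optimisations such as its sorted block layout), one could scan a single feature's sorted values while accumulating running gradient and hessian sums; all names below are illustrative.

import numpy as np

def best_split_exact(feature_values, grads, hessians, lam=1.0, gamma=0.0):
    """Illustrative exact split search for a single feature.

    Sorts the data points by feature value, then sweeps left to right
    while accumulating gradient/hessian sums, so that every candidate
    threshold is scored in O(1).
    """
    feature_values = np.asarray(feature_values, dtype=float)
    grads = np.asarray(grads, dtype=float)
    hessians = np.asarray(hessians, dtype=float)

    def score(g, h):
        # Regularised "quality" of a node: G^2 / (H + lambda)
        return g * g / (h + lam)

    order = np.argsort(feature_values)
    g_total, h_total = grads.sum(), hessians.sum()

    best_gain, best_threshold = -np.inf, None
    g_left = h_left = 0.0
    for i in order[:-1]:                 # the last sorted point cannot start a right child
        g_left += grads[i]
        h_left += hessians[i]
        gain = 0.5 * (score(g_left, h_left)
                      + score(g_total - g_left, h_total - h_left)
                      - score(g_total, h_total)) - gamma
        if gain > best_gain:
            # Threshold = largest feature value sent to the left child
            best_gain, best_threshold = gain, feature_values[i]
    return best_threshold, best_gain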
Approximate algorithm
Instead of considering all possible splits for all features, the approximate algorithm iterates over only a subset of candidate splits for each feature, proposed from percentiles of the feature distribution.
Candidate splits can be proposed before or during tree fitting. The former, also known as the global proposal, constructs candidate splits once before any node is split. The latter, also known as the local proposal, constructs fresh candidate splits for each node split. XGBoost supports both methods.
Histogram-based algorithm
This candidate splitting method is similar to the one used by LightGBM. Refer to the practical guide on LightGBM in the same series to learn more about this algorithm.
Missing data
When the training data contain missing values, XGBoost handles them by learning a default direction at each split: missing values are sent to whichever side (left or right) gives the larger loss reduction.
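A minimal sketch of this behaviour: missing values can simply be passed as NaN, and XGBoost learns the default direction for them at every split (the tiny dataset below is purely illustrative).

import numpy as np
import xgboost as xgb

# Toy data with missing values encoded as NaN; illustrative only.
X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 4.0],
              [5.0, 6.0]])
y = np.array([0, 1, 0, 1])

# XGBoost treats NaN as missing by default and learns, for every split,
# which side (left or right) missing values should follow.
dtrain = xgb.DMatrix(X, label=y, missing=np.nan)
model = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=5)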
Hyperparameters
While there are many hyperparameters to tune in XGBoost, this section explains the most important ones that come up repeatedly in data science and machine learning problems.
booster
This argument selects the boosting algorithm to use: gbtree, dart or gblinear. Both gbtree and dart use trees as weak learners, while gblinear uses linear models. The difference between gbtree and dart is that dart applies random dropout during training, masking existing weak learners with a certain probability, to further reduce overfitting.
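As an illustration, a dart booster with dropout could be configured as follows; rate_drop is the dart dropout probability, and all values here are arbitrary placeholders rather than recommendations.

import numpy as np
import xgboost as xgb

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "booster": "dart",        # tree booster with dropout
    "rate_drop": 0.1,         # probability of dropping existing trees each iteration
    "objective": "reg:squarederror",
}
model = xgb.train(params, dtrain, num_boost_round=50)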
eta
This argument corresponds to the learning rate, also known as the shrinkage rate. Please refer to our post Concepts of boosting algorithms in machine learning if needed.
gamma, lambda
These two arguments correspond to the two regularisation hyperparameters ($\gamma$ and $\lambda$) that control the size of the weak tree, as explained in the previous section of this post.
max_depth
This argument refers to the maximum depth that a weak tree can have at each boosting iteration. The higher this value, the more complex each weak tree can become, and the more likely the final model is to overfit.
max_leaves
This argument refers to the maximum number of leaves that a weak tree can have in each boosting iteration.
subsample
This argument refers to the fraction of rows randomly sampled for training each weak tree at each boosting iteration. This sampling can help reduce overfitting.
colsample_bytree, colsample_bylevel or colsample_bynode
These three arguments control column sampling for weak tree training. The sampling can happen per tree (bytree), per tree depth level (bylevel) or per node (bynode).
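To show how the hyperparameters above fit together, here is a minimal sketch using the scikit-learn style interface, where eta is exposed as learning_rate and lambda as reg_lambda; the values are arbitrary placeholders rather than recommendations.

import xgboost as xgb

# Arbitrary values for illustration only; tune them for your own problem.
model = xgb.XGBRegressor(
    n_estimators=200,
    learning_rate=0.1,        # eta
    gamma=1.0,                # minimum loss reduction required to split a node
    reg_lambda=1.0,           # L2 regularisation (lambda)
    max_depth=6,              # maximum depth of each weak tree
    subsample=0.8,            # fraction of rows sampled per boosting iteration
    colsample_bytree=0.8,     # fraction of columns sampled per tree
)
# model.fit(X_train, y_train) would then train the model as usual.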
sampling_method
This argument specifies how training data points are sampled when row subsampling is enabled. XGBoost supports:
uniform sampling — each data point has an equal probability of being selected
gradient-based sampling — each data point is sampled with probability proportional to the magnitude of its gradient statistics (only supported with the GPU histogram tree method)
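As a hedged configuration sketch, gradient-based sampling could be requested as follows; per the XGBoost documentation it is only supported with the GPU histogram tree method, so this assumes a recent XGBoost version and a CUDA-capable device.

# Configuration sketch only; requires a CUDA device and the histogram tree method.
params = {
    "tree_method": "hist",
    "device": "cuda",
    "sampling_method": "gradient_based",
    "subsample": 0.5,                 # sampling only applies when subsample < 1
    "objective": "reg:squarederror",
}
# params would then be passed to xgb.train(params, dtrain, ...) as usual.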
tree_method
This argument specifies the node splitting algorithm to be used. Node splitting algorithms supported by XGBoost are explained in the previous section Node splitting.
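For example, switching between the three split finding algorithms described earlier is just a matter of changing this parameter; the synthetic data below is purely illustrative.

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 10)), rng.normal(size=500)
dtrain = xgb.DMatrix(X, label=y)

# Train the same model with each of the three split finding algorithms.
for method in ("exact", "approx", "hist"):
    params = {"tree_method": method, "objective": "reg:squarederror"}
    booster = xgb.train(params, dtrain, num_boost_round=20)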