About this post
This post explains practical knowledge for using LightGBM.
For a more conceptual introduction to boosting, please visit another post 'Concepts of boosting algorithms in machine learning'.
For an introduction to XGBoost, please visit another post 'Practical guides on XGBoost'.
Introduction
Like XGBoost, LightGBM is a powerful implementation of tree-based gradient boosting.
This article explores the core features of LightGBM, focusing on gradient-based sampling and exclusive feature bundling, as well as practical considerations when working with this open-source package. By reading this article, you can understand the essence of LightGBM and how it differs from other implementations.
The official website for LightGBM can be found here. Its corresponding paper can be found here.
Core features
Newton tree boosting
LightGBM uses Newton tree boosting!
Like XGBoost, LightGBM uses Newton descent at each boosting iteration, although the corresponding paper for LightGBM does not mention the usage of Newton descent. From the implementation on GitHub (search for the static double function GetLeafGain), it can be seen that the leaf gain at a node split is evaluated as

$$\text{LeafGain} = \frac{G^2}{H + \lambda},$$

where $G$ and $H$ are the sums of the gradients and Hessians over the data points in the leaf and $\lambda$ is a regularisation parameter; apart from the regularisation terms, this is the same formula as implemented in XGBoost.
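To make the formula concrete, here is a minimal Python sketch of how such a leaf gain and the resulting split gain could be computed from summed gradients and Hessians. This is an illustration only, not LightGBM's actual C++ implementation, and the function names are my own:

```python
def leaf_gain(grad_sum, hess_sum, lambda_l2=0.0):
    # Leaf gain ~ G^2 / (H + lambda); LightGBM's extra L1/smoothing terms
    # are omitted here for simplicity.
    return grad_sum ** 2 / (hess_sum + lambda_l2)

def split_gain(grad_left, hess_left, grad_right, hess_right, lambda_l2=0.0):
    # A split is scored by how much the two children improve on the parent leaf.
    parent = leaf_gain(grad_left + grad_right, hess_left + hess_right, lambda_l2)
    return (leaf_gain(grad_left, hess_left, lambda_l2)
            + leaf_gain(grad_right, hess_right, lambda_l2)
            - parent)

# Toy numbers: gradient/Hessian sums over the data points in each child.
print(split_gain(-4.0, 10.0, 3.0, 8.0, lambda_l2=1.0))
```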
Please refer to the article ‘Concepts of boosting algorithms in machine learning’ for an introduction to gradient tree boosting and Newton boosting.
Gradient-based one-side sampling (GOSS)
GOSS is a way to speed up training time by sampling data based on gradient values!
Specifically, GOSS first sorts the data points by the absolute value of their gradients. The top a × 100% points are selected as part of the training data. In addition, GOSS randomly samples b × 100% of instances from the remaining data (the points with smaller gradients). To compensate for the change in distribution, GOSS reweights the sampled small-gradient points by a constant (1 − a) / b when calculating the information gain.
This method reduces computation time by shrinking the number of data points used during tree growing, without changing the original data distribution by much.
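To illustrate the idea, here is a minimal NumPy sketch of the sampling step. It is a simplification of LightGBM's internal implementation, and the function and variable names are my own:

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1):
    """Return sampled indices and per-instance weights, following the GOSS idea."""
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))   # sort by |gradient|, descending
    n_top = int(a * n)
    n_rand = int(b * n)

    top_idx = order[:n_top]                  # keep all large-gradient points
    rest = order[n_top:]
    rand_idx = np.random.choice(rest, size=n_rand, replace=False)

    idx = np.concatenate([top_idx, rand_idx])
    weights = np.ones(len(idx))
    weights[n_top:] = (1 - a) / b            # reweight the small-gradient sample
    return idx, weights

grads = np.random.randn(1000)
idx, w = goss_sample(grads, a=0.2, b=0.1)
print(len(idx), w[:3], w[-3:])
```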
Node splitting for categorical features
Node splitting on categorical variables is often handled better in LightGBM than with typical methods like one-hot encoding!
A common approach to encode categorical features is one-hot encoding. In practice this is often suboptimal for tree learning algorithms, especially for categorical features with high cardinality.
LightGBM instead handles categorical features natively. At a node split, a categorical feature with k categories is split by partitioning its categories into two subsets; there are 2^(k−1) − 1 possible non-trivial partitions. Rather than enumerating all of them, LightGBM sorts the categories by their accumulated gradient statistics and searches for the best split over this sorted order.
This is in many cases a better approach than one-hot encoding: at each split all categories are considered together, while with one-hot encoding only one category can be split off at a time.
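In the Python API, native categorical handling can be used by declaring the categorical columns, for example via the pandas 'category' dtype or the categorical_feature argument. The toy data and column names below are made up for illustration:

```python
import lightgbm as lgb
import numpy as np
import pandas as pd

# Toy data with one high-cardinality categorical column (hypothetical example).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "city": pd.Categorical(rng.choice([f"city_{i}" for i in range(50)], size=1000)),
    "x1": rng.normal(size=1000),
})
y = rng.normal(size=1000)

# Columns with the pandas 'category' dtype are picked up automatically;
# they can also be listed explicitly via categorical_feature.
train_set = lgb.Dataset(df, label=y, categorical_feature=["city"])
booster = lgb.train({"objective": "regression", "verbose": -1}, train_set, num_boost_round=50)
```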
Exclusive feature bundling
This method reduces the number of sparse features during tree growing by grouping mutually exclusive sparse features (features that rarely take non-zero values on the same row, such as one-hot encoded columns) into bundles. Merging such exclusive features into a single feature produces what is called an exclusive feature bundle.
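The idea can be illustrated with a toy sketch: two features that are never non-zero on the same row can be merged into a single feature by offsetting the value range of the second one. This is a simplification of the histogram-offset bookkeeping LightGBM actually performs, and the function name is my own:

```python
import numpy as np

def bundle_two_exclusive(f1, f2, offset):
    """Merge two mutually exclusive sparse features into one bundled feature.

    Assumes f1 and f2 are never non-zero on the same row; values of f2 are
    shifted by `offset` so the two features remain distinguishable.
    """
    assert not np.any((f1 != 0) & (f2 != 0)), "features are not mutually exclusive"
    bundled = f1.astype(float).copy()
    mask = f2 != 0
    bundled[mask] = f2[mask] + offset
    return bundled

# e.g. two one-hot style columns that never fire together
f1 = np.array([1, 0, 0, 2, 0])
f2 = np.array([0, 3, 0, 0, 1])
print(bundle_two_exclusive(f1, f2, offset=10))   # [ 1. 13.  0.  2. 11.]
```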
Interesting features
Piecewise linear regression at each leaf
Instead of using the average value in each leaf for predictions, LightGBM can fit a linear regression within each leaf and use these linear models to generate predictions. This feature can be enabled with the linear_tree argument.
Since Dec 2020 (merge request here), it is possible to use piecewise linear regression in LightGBM. This allows practitioners to use linear regressions at each leaf for predictions (instead of aggregated statistics of the target labels).
The implementation differs slightly from the paper 'Gradient boosting with piece-wise linear regression trees' (link). In LightGBM, node splits are determined in the same way as without linear models; only after the tree structure has been determined is a linear regression model fitted in each leaf. In the original paper, the split search itself involves fitting linear regression models, which is much more computationally intensive.
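As a quick sketch, it can be switched on through the parameter dictionary in the Python API; the synthetic data and other parameter values below are chosen purely for illustration:

```python
import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=500)

params = {
    "objective": "regression",
    "linear_tree": True,       # fit a linear regression in each leaf
    "learning_rate": 0.1,
    "num_leaves": 15,
    "verbose": -1,
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
pred = booster.predict(X[:5])
```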
Tackling overfitting with LightGBM parameters
A common problem with gradient boosting is overfitting. With LightGBM, one can reduce the variance of the model with the following parameters (see the sketch after this list for illustrative values):
- Decrease max_bin
- Decrease num_leaves
- Decrease max_depth
- Increase lambda_l1, lambda_l2
- Increase min_data_in_leaf
- Increase min_sum_hessian_in_leaf
- Enable data or feature bagging with bagging_fraction, bagging_freq or feature_fraction
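As a starting point, a variance-reducing parameter dictionary might look like the sketch below; the exact values are illustrative only and should be tuned per dataset, e.g. via cross-validation:

```python
# Illustrative values only; tune per dataset.
conservative_params = {
    "objective": "regression",
    "max_bin": 127,                  # decreased from the default 255
    "num_leaves": 15,                # decreased from the default 31
    "max_depth": 6,
    "lambda_l1": 1.0,
    "lambda_l2": 1.0,
    "min_data_in_leaf": 50,
    "min_sum_hessian_in_leaf": 10.0,
    "bagging_fraction": 0.8,
    "bagging_freq": 1,               # bagging_fraction only applies when this is > 0
    "feature_fraction": 0.8,
}
```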
Hyperparameters
While there are many hyperparameters to tune in LightGBM, this section explains the most important ones, which come up frequently in data science and machine learning problems.
boosting
Similar to the booster argument in XGBoost, this argument selects the boosting algorithm to use: gbdt, rf or dart.
learning_rate
This argument corresponds to the learning rate (shrinkage rate). Please refer to our post 'Concepts of boosting algorithms in machine learning' if needed.
num_iterations
This argument controls the total number of boosting iterations to be carried out.
max_depth
This argument sets the maximum depth a weak tree can have at each boosting iteration. The higher this value, the more complex each weak tree can become, and the more likely the final model is to overfit.
bagging_fraction
This argument is the fraction of rows randomly sampled for training each weak tree at a boosting iteration. This sampling can help reduce overfitting; note that it only takes effect when bagging_freq is set to a positive value.
linear_tree
This argument controls whether to fit piecewise linear regressions at each leaf. Please refer to the previous section for an introduction to the method.
lambda_l1, lambda_l2
These two arguments are the L1 and L2 regularisation hyperparameters; they penalise large leaf values and thereby constrain the complexity of each weak tree.
drop_rate
This argument is only relevant when boosting is set to dart. It is the dropout rate, i.e. the fraction of previously built trees dropped at a boosting iteration.
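Putting these hyperparameters together, a typical training call with the Python API might look like the following sketch; the synthetic data and parameter values are chosen purely for illustration:

```python
import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=2000) > 0).astype(int)
X_tr, X_val, y_tr, y_val = X[:1600], X[1600:], y[:1600], y[1600:]

params = {
    "objective": "binary",
    "boosting": "gbdt",           # or "rf" / "dart"
    "learning_rate": 0.05,
    "max_depth": 6,
    "num_leaves": 31,
    "bagging_fraction": 0.8,
    "bagging_freq": 1,
    "lambda_l1": 0.1,
    "lambda_l2": 0.1,
    "verbose": -1,
}
train_set = lgb.Dataset(X_tr, label=y_tr)
valid_set = lgb.Dataset(X_val, label=y_val, reference=train_set)

booster = lgb.train(
    params,
    train_set,
    num_boost_round=500,          # i.e. num_iterations
    valid_sets=[valid_set],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
pred = booster.predict(X_val)
```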
Summary
In this article we covered the node-splitting algorithm in LightGBM, the two methods that enable its training speed-ups (GOSS and exclusive feature bundling), the piecewise linear regression feature, and a list of important parameters to pay attention to during training. We will continue writing more deep-dive content on this topic, so stay tuned!