Practical guides on bootstrapping
How to use bootstrapping effectively in data science domains? What are the common considerations?
What is bootstrapping?
The idea of bootstrapping is that statistical inference about a population from sample data can be approximated in two steps:
resampling the sample data
statistical inference from resampled data
In the following sections, we will talk about the various types of resampling and statistical inference methods used in bootstrapping.
Practical considerations when using bootstrapping
Statistical inference involves drawing statistical conclusions from a finite data sample and extrapolating these conclusions to the whole population.
In reality, however, even drawing statistical conclusions from a finite data sample can be tricky in the following ways:
We do not know a priori the correct probability density function (pdf) for our chosen test statistic.
Even if we know the correct pdf for our test statistic, it may not be in a nice analytical form.
Bootstrapping is particularly advantageous in scenarios where either an analytical expression for the sampling distribution is unavailable or the application of asymptotic theory (e.g., central limit theorem) is uncertain.
If we have a large data sample, its empirical distribution can in fact approximate the true underlying pdf.
It is also worth noting that the quality of inference from the resampled data can be assessed, because we can compare these inference results with their ‘true’ population (the original sample data). This provides a proxy for the quality of inference we would like to achieve for the actual population.
Resampling methods — Case bootstrapping
Case bootstrapping is probably the simplest resampling method. It involves drawing samples with replacement from the original data set:
Assuming the original data set has n observations
k resampled data sets with n observations each (also called bootstrap samples) are formed by sampling with replacement, where k is typically between 50 and 1000.
The test statistic of interest is calculated for each resampled dataset.
In the end, a distribution of the test statistic of interest (often called the bootstrap distribution) of size k is generated. With one of the statistical inference methods explained below, this distribution can be used to perform hypothesis testing or confidence interval estimation.
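To make this concrete, here is a minimal sketch in Python with NumPy; the helper name `case_bootstrap` and the example data are our own illustration, not from any particular library:

```python
import numpy as np

def case_bootstrap(data, stat_fn, k=1000, seed=0):
    """Bootstrap distribution of stat_fn over k resamples with replacement."""
    rng = np.random.default_rng(seed)
    n = len(data)
    stats = np.empty(k)
    for b in range(k):
        # Draw n observations with replacement from the original data.
        resample = rng.choice(data, size=n, replace=True)
        stats[b] = stat_fn(resample)
    return stats

# Example: bootstrap distribution of the sample mean.
data = np.random.default_rng(42).normal(loc=5.0, scale=2.0, size=200)
boot_means = case_bootstrap(data, np.mean)
print(boot_means.mean(), boot_means.std(ddof=1))  # centre and spread (std. error)
```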
Resampling methods — Bayesian bootstrapping
This method was first proposed by Rubin in this 1981 paper.
Instead of actually resampling the original data set, Bayesian bootstrapping creates bootstrap samples by reweighting the original data set:
Assuming the original data set has n observations
Generate a list of n−1 random numbers drawn uniformly from the interval [0, 1]; call this list w. Sort it, then prepend 0 and append 1, so that w has n+1 entries with \(w_{0} = 0\) and \(w_{n} = 1\).
Reweight each data point i by the following formula:
\(g_{i} = w_{i} - w_{i-1}, \quad i = 1, 2, \dots, n\)
The test statistic of interest is calculated for this reweighed data set.
Repeat this process k times, where k is typically between 50 and 1000 (as in case bootstrapping)
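A minimal sketch of this procedure in Python, again assuming NumPy; the helper name `bayesian_bootstrap` and the weighted-mean statistic are illustrative choices:

```python
import numpy as np

def bayesian_bootstrap(data, stat_fn, k=1000, seed=0):
    """Bootstrap distribution of a weighted statistic via uniform-gap weights."""
    rng = np.random.default_rng(seed)
    n = len(data)
    stats = np.empty(k)
    for b in range(k):
        # n - 1 sorted uniforms padded with 0 and 1; successive gaps give
        # n weights that sum to 1 (equivalently, Dirichlet(1, ..., 1) weights).
        u = np.sort(rng.uniform(size=n - 1))
        w = np.diff(np.concatenate(([0.0], u, [1.0])))
        stats[b] = stat_fn(data, w)
    return stats

# Example: weighted mean as the test statistic.
data = np.random.default_rng(1).exponential(scale=3.0, size=100)
boot_means = bayesian_bootstrap(data, lambda x, w: np.sum(w * x))
print(boot_means.mean(), boot_means.std(ddof=1))
```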
Resampling methods — Poisson bootstrapping
This method is explained in, for example, this 2012 paper from Google.
The idea of this method is based on the intuition that case bootstrapping is equivalent to assigning a weight to each data point, where the vector of weights is drawn from a multinomial distribution:

\((W_{1}, W_{2}, \dots, W_{n}) \sim \mathrm{Multinomial}\left(n;\ \tfrac{1}{n}, \tfrac{1}{n}, \dots, \tfrac{1}{n}\right)\)

so that each individual weight is marginally \(\mathrm{Binomial}(n, \tfrac{1}{n})\). It is worth noting that in the limit when n goes to infinity, we have:

\(\mathrm{Binomial}\left(n, \tfrac{1}{n}\right) \to \mathrm{Poisson}(1)\)
This means that for large data sets, bootstrapping with weights drawn iid from Poisson(1) gives a similar result to case bootstrapping (sampling with replacement).
The advantage of Poisson bootstrapping is that the total number of data points does not need to be known in advance during bootstrapping. Hence this method can be easily parallelised in distributed computing environments such as Spark.
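A sketch of the single-machine version for the weighted mean, assuming NumPy; in a genuinely distributed setting (e.g. Spark) the Poisson weights would be drawn per record on each worker:

```python
import numpy as np

def poisson_bootstrap_mean(data, k=1000, seed=0):
    """Bootstrap distribution of the weighted mean under Poisson(1) weights."""
    rng = np.random.default_rng(seed)
    stats = np.empty(k)
    for b in range(k):
        # Each observation receives an iid Poisson(1) count as its weight;
        # in a streaming/distributed setting these can be drawn per record
        # without knowing the total sample size up front.
        w = rng.poisson(lam=1.0, size=len(data))
        stats[b] = np.sum(w * data) / np.sum(w)
    return stats

data = np.random.default_rng(7).normal(loc=0.0, scale=1.0, size=10_000)
print(poisson_bootstrap_mean(data).std(ddof=1))  # approx. std. error of the mean
```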
Statistical inference — Percentile method
The basic method is very simple. After obtaining the bootstrap distribution of the test statistic, one can simply compute its empirical quantiles and use these as the confidence interval:

\([\hat{\theta}^{*}_{\alpha/2},\ \hat{\theta}^{*}_{1-\alpha/2}]\)

where α is the significance level (e.g. α = 0.05 for a 95% confidence interval) and \(\hat{\theta}^{*}_{q}\) denotes the q-th empirical quantile of the bootstrap distribution. In addition, there is another method which uses the reverse percentile as the confidence interval estimation, reflecting the quantiles around the point estimate \(\hat{\theta}\), i.e.

\([2\hat{\theta} - \hat{\theta}^{*}_{1-\alpha/2},\ 2\hat{\theta} - \hat{\theta}^{*}_{\alpha/2}]\)
It is worth noting that these percentile methods are recommended only when the bootstrap distribution does not have a long tail and is mostly symmetric.
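Both interval estimates can be computed directly from the bootstrap distribution. A sketch, with `boot_stats` standing in for the array of bootstrapped test statistics:

```python
import numpy as np

def percentile_ci(boot_stats, alpha=0.05):
    """Basic percentile interval: empirical quantiles of the bootstrap distribution."""
    lo, hi = np.quantile(boot_stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

def reverse_percentile_ci(boot_stats, theta_hat, alpha=0.05):
    """Reverse percentile interval: quantiles reflected around the point estimate."""
    lo, hi = np.quantile(boot_stats, [alpha / 2, 1 - alpha / 2])
    return 2 * theta_hat - hi, 2 * theta_hat - lo

# Example with a bootstrap distribution of the mean:
rng = np.random.default_rng(0)
data = rng.normal(size=500)
boot = np.array([np.mean(rng.choice(data, size=500, replace=True))
                 for _ in range(1000)])
print(percentile_ci(boot), reverse_percentile_ci(boot, np.mean(data)))
```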
Statistical inference — Studentised method
In general, the studentised method is more accurate, but it comes with a larger computational cost, as it involves two-step (nested) bootstrapping. This is how it works:
Construct a set of bootstrap samples:
\(x^{(b)}_{1}, x^{(b)}_{2}, \dots, x^{(b)}_{n}, \quad b = 1, 2, \dots, B\)
Calculate the test statistic \(\theta^{(b)}\) for each bootstrap sample, as in the other methods
For each bootstrap sample, i.e. b = 1, 2, 3, … , B,
Construct another set of bootstrap samples:
\(x^{(b,m)}_{1}, x^{(b,m)}_{2}, \dots, x^{(b,m)}_{n}, \quad m = 1, 2, \dots, M\)
Calculate the test statistic for each of these new bootstrap samples
Calculate the standard deviation \(s^{(b)}\) based on these M test statistics
Calculate the t-statistic for each bootstrap sample:
\(t^{(b)} = \frac{\theta^{(b)} - \hat{\theta}}{s^{(b)}}\)
Construct the α/2 and 1−α/2 quantiles, \(q_{\alpha/2}\) and \(q_{1-\alpha/2}\), from the bootstrapped t-statistic distribution
The confidence interval can be calculated as:
\([\hat{\theta} - s\,q_{1-\alpha/2},\ \hat{\theta} - s\,q_{\alpha/2}]\)
where s is the standard error of \(\hat{\theta}\) estimated from the outer bootstrap samples.
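Putting the steps together, here is a sketch of the nested procedure for a generic statistic; the function name `studentised_ci` and the choices of B and M are illustrative:

```python
import numpy as np

def studentised_ci(data, stat_fn, B=500, M=50, alpha=0.05, seed=0):
    """Studentised (bootstrap-t) confidence interval via nested bootstrapping."""
    rng = np.random.default_rng(seed)
    n = len(data)
    theta_hat = stat_fn(data)
    thetas = np.empty(B)
    t_stats = np.empty(B)
    for b in range(B):
        sample = rng.choice(data, size=n, replace=True)
        thetas[b] = stat_fn(sample)
        # Inner bootstrap: estimate the standard error s^(b) of theta^(b).
        inner = np.array([stat_fn(rng.choice(sample, size=n, replace=True))
                          for _ in range(M)])
        t_stats[b] = (thetas[b] - theta_hat) / inner.std(ddof=1)
    s = thetas.std(ddof=1)  # standard error of theta_hat from the outer bootstrap
    q_lo, q_hi = np.quantile(t_stats, [alpha / 2, 1 - alpha / 2])
    return theta_hat - s * q_hi, theta_hat - s * q_lo

data = np.random.default_rng(3).lognormal(size=80)
print(studentised_ci(data, np.mean, B=200, M=25))
```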
Key takeaways
Advantages of bootstrapping:
It is quite general and applicable to a wide range of statistical inference problems
It is conceptually simple and can be used to estimate statistical quantities, such as standard errors and confidence intervals, even for very complex test statistics
It can also incorporate different sampling methods easily
Disadvantages of bootstrapping:
It can be computationally intensive
It is subject to the quality of the original data sample
As an empirical method, its assumptions and validity are harder to examine
Summary
This article summarises a few common methods in bootstrapping, a statistical approach for estimating sampling distributions. We explored various resampling methods, such as case bootstrapping, Bayesian bootstrapping and Poisson bootstrapping. We also discussed a few methods for performing confidence interval estimation given a set of bootstrap samples.