Concepts of multi-armed bandits
An essential introduction to key concepts of multi-armed bandits: epsilon-greedy, UCB, Bayesian UCB, and Thompson sampling
Why multi-armed bandit?
Compared to typical machine learning, (non-contextual) multi-armed bandits, also known as MAB, have several practical advantages:
Light model serving — Due to its simplicity, a MAB model typically does not require a separate model call to obtain predictions. In addition, a (non-contextual) MAB does not require features, so no resources are needed to build a feature engineering pipeline or design a model architecture, which makes the model much easier to serve.
Easy maintenance — Because it has few dependencies on other parts of the pipeline (such as a feature store or a model prediction call), a MAB solution is typically much easier to maintain.
Handling of data drift — Customer behaviours change over time. MAB is typically used as an online learning algorithm, making it a natural solution for handling data drift.
Stochastic Bandit
Let’s start with one of the simplest forms of bandits.
Problem statement
Given a fixed number of rounds T and K arms:
For each round:
A policy (also called a model or an algorithm) selects one arm, hence performing an action.
A reward is then sampled from the reward distribution of the chosen arm.
The model uses this reward to update itself.
The process repeats T times.
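To make this loop concrete, here is a minimal simulation sketch in Python, assuming Bernoulli reward distributions; the names BernoulliBandit and random_policy are illustrative only, and the random policy is just a placeholder for the algorithms introduced later.

```python
import numpy as np

class BernoulliBandit:
    """A K-armed stochastic bandit with Bernoulli reward distributions."""
    def __init__(self, success_probs, seed=0):
        self.success_probs = np.asarray(success_probs)  # one probability per arm
        self.rng = np.random.default_rng(seed)

    def pull(self, arm):
        # Sample a reward (0 or 1) from the chosen arm's distribution.
        return self.rng.binomial(1, self.success_probs[arm])

def random_policy(bandit, T=1000, seed=1):
    """Baseline policy: pick an arm uniformly at random each round."""
    rng = np.random.default_rng(seed)
    K = len(bandit.success_probs)
    total_reward = 0
    for t in range(T):
        arm = rng.integers(K)        # select an arm (action)
        reward = bandit.pull(arm)    # observe the reward of the chosen arm only
        total_reward += reward       # a real policy would update itself here
    return total_reward

# Example: 3 arms whose success probabilities are unknown to the policy.
total = random_policy(BernoulliBandit([0.1, 0.5, 0.7]), T=1000)
```

The algorithms below follow exactly this loop; they only differ in how the arm is chosen and how the observed reward updates the policy.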
Assumptions
In addition to the formulation, there are two major assumptions in this simplest form of MAB:
Assumption 1: We can only observe the reward of the chosen arm at each round.
Assumption 2: The reward distributions of the arms are independent of each other. When an arm is chosen, the observed reward is sampled from the reward distribution of that arm; the rewards in each round are hence i.i.d.
Care has to be taken for Assumption 2. For data science problems in which data drift is an issue, it may be necessary to:
Use other MAB algorithms that can approximately account for drift in the data or reward distributions, for example contextual bandits with time-dependent or seasonal features. For an introduction to contextual bandits, please visit our articles here.
Maintain a certain amount of exploration, for example with the epsilon-greedy algorithm. The concepts of exploitation and exploration are described in the next section.
Exploitation and exploration
In multi-armed bandits (and in reinforcement learning more broadly), exploration and exploitation represent two opposing strategies.
Given a bandit problem, exploitation represents choosing the best arm based on current knowledge of the system (which may be incomplete or misleading), while exploration means trying out suboptimal arms to collect extra information about the system.
It is important to note that finding the balance between exploitation and exploration is one of the most crucial points in multi-armed bandit problems, and various algorithms have been designed to strike this balance. In the following sections we will introduce a few of them.
Epsilon-greedy algorithm
Definition
Given a MAB problem with T rounds, K arms and a fixed value of epsilon, the epsilon-greedy algorithm is as follows:
For each round t = 1, 2, 3, …, T,
Draw a random real number r uniformly from the interval [0, 1].
To choose an arm, we do:
If r is less than epsilon, choose an arm uniformly at random.
Otherwise, choose the arm with the highest average observed reward.
Observe the reward of the chosen action and update the average reward of the chosen arm accordingly.
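As a rough illustration, the algorithm could be implemented as follows, reusing the hypothetical BernoulliBandit environment from the earlier sketch; the function name epsilon_greedy and its defaults are assumptions for this example.

```python
import numpy as np

def epsilon_greedy(bandit, K, T, epsilon=0.1, seed=0):
    """Minimal epsilon-greedy sketch: explore with probability epsilon."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(K)   # number of times each arm has been chosen
    means = np.zeros(K)    # running average reward of each arm
    for t in range(T):
        r = rng.random()                    # r ~ Uniform[0, 1]
        if r < epsilon:
            arm = int(rng.integers(K))      # explore: pick an arm at random
        else:
            arm = int(np.argmax(means))     # exploit: best average so far
        reward = bandit.pull(arm)           # observe reward of the chosen arm
        counts[arm] += 1
        # Incremental update of the running average reward.
        means[arm] += (reward - means[arm]) / counts[arm]
    return means, counts
```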
Key takeaway
Data drift: For a constant epsilon, this algorithm implies that there is always a certain degree of exploration. In case of data drift, the data collected from this constant exploration can help detect the drift and mitigate some of its bias. However, in a situation with no data drift, this constant exploration could be suboptimal.
Non-adaptive exploration: The algorithm uses a non-adaptive exploration schedule. Alternatively, we can adjust epsilon according to the number of rounds played, the number of arms or other parameters, so that exploration shrinks as more data is collected (a simple decaying schedule is sketched below). There are also MAB algorithms whose exploration adapts to the history of observed rewards, such as UCB (explained below).
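For illustration only, one simple (hypothetical) decaying schedule lets epsilon shrink with the round number t:

```python
def decaying_epsilon(t, K, c=1.0):
    """One possible epsilon schedule that decays with the round number t (t >= 1)."""
    return min(1.0, c * K / t)
```

Early rounds then explore heavily, while later rounds mostly exploit the accumulated reward estimates.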
UCB algorithm
Upper confidence bound (UCB)
Before introducing the UCB algorithm, it is necessary to define UCB (upper confidence bound):
The UCB for a given arm a at a given round t is defined as:
\(UCB_a(t) = \hat{\mu}_a(t) + \sqrt{\frac{2\ln t}{n_a(t)}}\)
where \(\hat{\mu}_a(t)\) is the average of all observed rewards for arm a up to round t, and \(n_a(t)\) is the number of times that arm a has been selected up to round t.
Note that the second term in the expression is an estimate of the confidence interval of the expected reward for arm a up to round t. The derivation of this confidence interval is based on the Hoeffding inequality, which is explained in the Appendix.
We are now in a position to define the UCB algorithm.
Definition
Given a MAB problem with T rounds and K arms, the UCB algorithm is as follows:
For each round t = 1, 2, 3, …, T,
Pick the arm with the highest upper confidence bound (UCB), observe its reward, and update the average reward and selection count of that arm.
The intuition behind the UCB algorithm is that there are two scenarios in which we would like to select an arm and observe its reward:
The average reward of this arm is higher than that of the other arms. (Exploitation)
We have not selected this arm enough times to be confident about its average reward. (Exploration)
The first case will give a high value in the first summand in the UCB formula, while the second case will mean a high value in the second summand. Therefore, picking the arm with the highest UCB represents one natural way to balance exploitation and exploration.
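Below is a minimal sketch of this idea (the classic UCB1 variant with the Hoeffding-based bonus), again assuming the hypothetical BernoulliBandit environment from earlier; playing each arm once at the start keeps the bound well defined.

```python
import numpy as np

def ucb1(bandit, K, T):
    """Minimal UCB1 sketch: pick the arm with the highest upper confidence bound."""
    counts = np.zeros(K)
    means = np.zeros(K)
    # Initialisation: play each arm once so that every n_a > 0.
    for arm in range(K):
        counts[arm] = 1
        means[arm] = bandit.pull(arm)
    for t in range(K, T):
        # Exploitation term (average reward) + Hoeffding-based exploration bonus.
        bonus = np.sqrt(2 * np.log(t) / counts)
        arm = int(np.argmax(means + bonus))
        reward = bandit.pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return means, counts
```

Arms that have been pulled only a few times keep a large bonus, so they are revisited until their estimates are sufficiently certain.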
Key takeaway
The confidence bound in the UCB algorithm is derived without any prior information on the reward distributions, so it tends to be conservative.
When more is known a priori about the reward distributions (for example, if they follow a specific probability density function such as the normal distribution), we can replace the Hoeffding-based bound with a tighter confidence bound derived from that pdf.
The UCB algorithm is one of the simplest algorithms that takes into account the uncertainty of the average reward estimate for each arm. It uses the UCB as a heuristic to balance exploitation and exploration, and there are other UCB variants which further expand on this idea.
Bayesian UCB algorithm
Bayesian UCB was first proposed in this paper.
Unlike UCB, which uses a specific Hoeffding-based formula to estimate confidence bounds, Bayesian UCB provides a more general mathematical framework that allows custom priors and performs a Bayesian update of the model parameters at each round.
For a more general introduction to Bayesian statistics, please visit this article in our data science section.
Definition
For each round t = 1, 2, 3, …, T,
Compute the (1 - 1/t) quantile for each arm as below, where Q is the quantile function (its first argument is the confidence level) and \(\lambda^{t-1}_j\) is the posterior distribution of the mean reward of arm j after round t-1.
\(q_{j}(t) = Q(1-\frac{1}{t},\lambda^{t-1}_j)\)
Choose the arm with the highest quantile value.
Observe the reward and perform a Bayesian update (posterior estimation) of \(\lambda\) for the chosen arm.
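As an illustrative sketch only, assume Bernoulli rewards with a Beta(1, 1) prior on each arm's mean reward, so that the posterior is also a Beta distribution; the implementation below is our assumption, not the reference implementation from the paper.

```python
import numpy as np
from scipy.stats import beta

def bayesian_ucb(bandit, K, T):
    """Minimal Bayesian UCB sketch with Beta-Bernoulli conjugate updates."""
    alpha = np.ones(K)   # Beta posterior parameters: alpha = 1 + successes
    beta_ = np.ones(K)   # beta = 1 + failures
    for t in range(1, T + 1):
        # (1 - 1/t) quantile of each arm's posterior over the mean reward.
        quantiles = beta.ppf(1 - 1.0 / t, alpha, beta_)
        arm = int(np.argmax(quantiles))
        reward = bandit.pull(arm)        # Bernoulli reward: 0 or 1
        # Conjugate Bayesian update of the chosen arm's posterior.
        alpha[arm] += reward
        beta_[arm] += 1 - reward
    return alpha, beta_
```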
Key takeaway
The main advantage of this algorithm is that it is as simple a policy as UCB, yet generic enough to incorporate various Bayesian statistical models. The algorithm also has a proper Bayesian interpretation.
In addition, if the reward distributions belong to the one-parameter exponential family (such as the normal, Bernoulli or gamma distributions), conjugate priors are available, which greatly simplifies the estimation of the posterior distributions.
Thompson sampling
Thompson sampling is one of the most commonly used bandit algorithms. It was proposed by Thompson in 1933 in this paper.
Like Bayesian UCB, it is also a Bayesian bandit algorithm. Let’s look at what this algorithm does.
Definition
For each round t = 1, 2, 3, …, T,
For each arm, draw a sample of the mean reward from its posterior distribution.
Choose the arm with the highest sampled mean reward
Observe the resulting reward from this chosen arm
Perform a Bayesian update with this extra data point (chosen arm, received reward).
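Under the same illustrative Beta-Bernoulli assumptions as the Bayesian UCB sketch above, a minimal Thompson sampling implementation replaces the posterior quantile with a posterior sample:

```python
import numpy as np

def thompson_sampling(bandit, K, T, seed=0):
    """Minimal Thompson sampling sketch with Beta-Bernoulli conjugate updates."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(K)   # Beta posterior parameters: alpha = 1 + successes
    beta = np.ones(K)    # beta = 1 + failures
    for t in range(T):
        # Draw one sample of the mean reward from each arm's posterior.
        samples = rng.beta(alpha, beta)
        arm = int(np.argmax(samples))    # play the arm with the highest sample
        reward = bandit.pull(arm)        # Bernoulli reward: 0 or 1
        alpha[arm] += reward             # Bayesian update for the chosen arm
        beta[arm] += 1 - reward
    return alpha, beta
```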
Key takeaways
Both Thompson sampling and Bayesian UCB are Bayesian bandit algorithms, yet there are notable similarities and differences between the two.
Similarities
At each round, after receiving a new pair of data point (chosen arm, received reward), the posterior distribution is updated and used as the prior for the next round.
Both algorithms explore because the posterior distributions are not deterministic: the more uncertain the posterior distribution of an arm, the more exploration of that arm is induced in both algorithms.
Both algorithms require a specification on initial prior distributions. A careful choice of prior can improve performance and convergence rate.
Differences
Bayesian UCB tends to be more optimistic in the beginning due to the (1 - 1/t) quantile schedule.
Bayesian UCB uses an aggregated statistic, namely the (1 - 1/t) quantile of the posterior distribution, while Thompson sampling uses only a single sampled value from the posterior distribution.
Summary
In this article we have covered the basic concepts of stochastic multi-armed bandits and a few commonly used MAB algorithms, including the epsilon-greedy algorithm, UCB, Bayesian UCB and Thompson sampling. We have discussed the basic ideas behind each algorithm and some practical considerations when using them.
However, there is a lot more to cover. Please stay tuned for our in-depth guides on each algorithm (and more)!