What is Curiosity? Mathematically?

Francesco Sacco

Mobile note: this post is meant to be read on a computer. Several figures are interactive and use wide layouts, hover states, sliders, and precise clicks, so they may not work as intended on a phone or small tablet.

This is one of the most important questions in the entire field of Artificial Intelligence.
The whole discipline of dataset curation will go away once a good mathematical framework for curiosity emerges.

Hopefully, this blog post will be the first in a series that works through increasingly complex environments, explaining and deriving the mathematical foundations of curiosity along the way.

What is the $K$-armed bandit problem?

The $K$-armed bandit problem is perhaps one of the worst-named pieces of math. However, it has important applications in reinforcement learning. Let's see how it works.

You walk into a room with $K$ slot machines. Each one pays out a noisy reward drawn from its own Gaussian distribution — you don't know the means, you don't know the variances, and you have a finite number of pulls. Your job is to earn as much money (a.k.a. "reward") as possible.

Find the best strategy by hand!

The true reward distributions are hidden at the start. Each pull drops one reward into a bucket; enough evidence makes each arm's empirical violin appear. Try to win as much reward as possible!

Figure 1: Evidence from the same five hidden arms used throughout the page. The agent sees only individual rewards; repeated pulls fill buckets until the observed rewards form a histogram-like violin.

Every pull is a trade-off: you can exploit the arm that currently looks best, or explore a less-tested arm to learn whether it is better than it seems.

So, how much should we exploit and how much should we explore?

We will see that the answer to this existential question follows easily from the math once we ask ourselves:

Given the information available, what is the best action I can take?

Setup: noisy arms, hidden average rewards

Each arm $k\in\{1,\dots,K\}$ has a true expected reward $\mu_k$ and an intrinsic noise level $\sigma_k$, both unknown to us. Pulling arm $k$ produces a reward $$ r_{k,i}\mid\mu_k,\sigma_k^2 \;\sim\; \mathcal N(\mu_k,\sigma_k^2). $$ Figure 2 shows the five arms freshly sampled from the environment used in the study. The thick bar marks the true average reward $\mu_k$; the violin shows the spread of rewards you'd actually see.

Figure 2: Five sampled arms. Each violin is the reward distribution $\mathcal N(\mu_k,\sigma_k^2)$ for one arm; the horizontal bar is $\mu_k$. The agent never sees the violins — only individual reward draws from the arms it pulls.
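To make this setup concrete, here is a minimal Python sketch of such an environment (the ranges for $\mu_k$ and $\sigma_k$ are made-up illustrations, not the values behind the figures):

import numpy as np

rng = np.random.default_rng(0)

K = 5
mu = rng.uniform(-2.0, 2.0, size=K)    # true average rewards (hidden from the agent)
sigma = rng.uniform(0.5, 2.0, size=K)  # true noise levels (also hidden)

def pull(k):
    # Pull arm k: one noisy reward r ~ N(mu_k, sigma_k^2)
    return rng.normal(mu[k], sigma[k])

The agent only ever sees the return values of pull; everything that follows is about reverse-engineering mu and sigma from those draws.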

Reverse-engineering the arms

We do not directly know the true mean and standard deviation of each arm, but after pulling an arm we can collect a dataset: $D_k = \{r_{k,1},\dots,r_{k,n}\}$ of observed rewards. The bigger the dataset, the better our estimate of the true mean and standard deviation. But how do we do this quantitatively?
Just as the Gaussian model gives us the likelihood $p(D_k \mid \mu_k,\sigma_k^2)$, Bayes' rule turns that evidence around into a posterior belief over the unknown parameters: $$ p(\mu_k,\sigma_k^2 \mid D_k) $$

If we focus just on the arm's true average reward, marginalizing out the unknown noise level, that posterior becomes a Student-$t$ distribution: $$ \mu_k \mid D_k \sim t $$ (its exact parameters are derived in the details below). This is the distribution shown below in Figure 3. Each colored curve represents what the model currently believes about one arm's expected reward, not just the next noisy reward from that arm.

Each time you pull an arm, the available information increases. You can then use that information to improve your estimate of the ground truth mean.

Posterior update details

Bayes' rule says that the posterior is proportional to the likelihood times the prior: $$ p(\mu_k,\sigma_k^2 \mid D_k) \;\propto\; p(D_k \mid \mu_k,\sigma_k^2)\,p(\mu_k,\sigma_k^2). $$

For a Gaussian reward model with unknown mean and variance, we choose the conjugate Normal-Inverse-Gamma prior: $$ \sigma_k^2 \sim \mathrm{InvGamma}(\alpha_0,\beta_0), \qquad \mu_k\mid\sigma_k^2 \sim \mathcal N\!\left(\mu_0,\tfrac{\sigma_k^2}{\kappa_0}\right). $$ The prior term contains the four hyperparameters $(\mu_0,\kappa_0,\alpha_0,\beta_0)$, which encode what we believe before pulling anything.

After observing rewards $D_k=\{r_{k,1},\dots,r_{k,n_k}\}$ with sum $s_k$ and sum of squares $q_k$, conjugacy lets us multiply likelihood and prior and stay in the same family: $$ \kappa_{k,n} = \kappa_0 + n_k, \qquad \mu_{k,n} = \frac{\kappa_0\mu_0 + s_k}{\kappa_0 + n_k}, $$ $$ \alpha_{k,n} = \alpha_0 + \tfrac{n_k}{2}, \qquad \beta_{k,n} = \beta_0 + \tfrac12\!\left(q_k + \kappa_0\mu_0^2 - \kappa_{k,n}\mu_{k,n}^2\right). $$

Marginalizing $\sigma_k^2$ out of the Normal-Inverse-Gamma posterior gives the Student-$t$ posterior over the true average reward: $$ \mu_k\mid D_k \;\sim\; t_{\nu_{k,n}}\!\left(\mu_{k,n},\,\tau_{k,n}\right), \quad \nu_{k,n}=2\alpha_{k,n}, \quad \tau_{k,n}=\sqrt{\tfrac{\beta_{k,n}}{\alpha_{k,n}\kappa_{k,n}}}. $$
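As a sanity check, these updates fit in a few lines of Python (a sketch using a placeholder prior $(\mu_0,\kappa_0,\alpha_0,\beta_0)=(0,1,1,1)$, not necessarily the one used in the figures):

import numpy as np
from scipy import stats

def posterior_params(rewards, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    # Normal-Inverse-Gamma conjugate update for one arm's observed rewards
    r = np.asarray(rewards, dtype=float)
    n, s, q = len(r), r.sum(), (r**2).sum()
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + s) / kappa_n
    alpha_n = alpha0 + n / 2
    beta_n = beta0 + 0.5 * (q + kappa0 * mu0**2 - kappa_n * mu_n**2)
    return mu_n, kappa_n, alpha_n, beta_n

def mu_posterior(rewards, **prior):
    # Student-t marginal posterior over the arm's true average reward
    mu_n, kappa_n, alpha_n, beta_n = posterior_params(rewards, **prior)
    nu = 2 * alpha_n
    tau = np.sqrt(beta_n / (alpha_n * kappa_n))
    return stats.t(df=nu, loc=mu_n, scale=tau)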

See how the model estimates the true average reward of each arm $\mu^\star$

Click the buttons on the right to see what happens.

Figure 3: The vertical lines marked with $\mu^\star$ represent the true average reward of each arm.
As you sample each arm, the predicted distribution $p(\mu | D)$ homes in on the true average reward $\mu^\star$.

The right question to ask

Given everything I've seen so far, what is the probability that arm $k$ is the truly best arm?

We can translate that question into the posteriors we just built. For each arm $k$, ask how likely it is that its true average reward is larger than every other arm's true average reward: $$ \mathbb P(k^\star = k\mid D) \;=\; \mathbb P\big(\mu_k\ge \mu_j\;\forall j\ne k\,\big|\,D\big). $$

The easiest way to estimate this is by Monte Carlo. There is also another way: since the posteriors factorize across arms (they do, because each arm's data only informs that arm), we can collapse it into a one-dimensional integral: $$ \mathbb P(k^\star = k\mid D) \;=\; \int_{-\infty}^{\infty} f_k(x)\!\prod_{j\ne k} F_j(x)\,dx, $$ where $f_k$ is arm $k$'s posterior density and $F_j$ are the other arms' CDFs. In words: weight every candidate value $x$ by how plausible it is for arm $k$, times the probability that every other arm is below it. The recipe is the same one Thompson wrote down in 1933 (Algorithm 1):

On its own, this is just a one-step decision rule. If we put it inside the bandit loop, where every pull adds one more observation to $D$, it becomes an adaptive algorithm: uncertainty drives exploration early on, then the posterior concentrates and the policy naturally settles on the arms that keep looking best (Algorithm 2).

def thompson_sampling($D$):
    for each arm $k$:
        draw $\tilde\mu_k \sim p(\mu_k\mid D_k)$
    return $\arg\max_k \tilde\mu_k$

Algorithm 1

def bayes_bandit($D$):
    for $t = 1, 2, \ldots$:
        $a_t$ = thompson_sampling($D$)
        pull arm $a_t$ & observe reward $r_t$
        append $(a_t, r_t)$ to $D$
    return $p(\mu_k \mid D)$

Algorithm 2
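In runnable Python, the two algorithms above might look like this (a sketch that reuses the hypothetical pull and mu_posterior helpers from earlier):

import numpy as np

def thompson_sampling(D):
    # Draw one plausible mean per arm from its posterior; back the winner
    samples = [mu_posterior(D[k]).rvs() for k in range(K)]
    return int(np.argmax(samples))

def bayes_bandit(T=1000):
    # The bandit loop: every pull feeds the posterior it will sample from next
    D = [[] for _ in range(K)]      # one list of observed rewards per arm
    for t in range(T):
        a = thompson_sampling(D)    # Algorithm 1: sample, then arg-max
        r = pull(a)                 # interact with the environment
        D[a].append(r)              # the evidence grows by one reward
    return D

With an empty dataset, mu_posterior falls back to the prior, so untried arms still get a fair shot at winning a draw.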

Figure 4: A Monte Carlo estimator of $\mathbb P(k^\star=k\mid D)$. Each draw samples one $\tilde\mu_k$ per arm and credits whichever arm came out highest. In the strip, the highlighted draw is the sampled winner during Thompson sampling, or the chosen lever after a manual Figure 3 pull.

As you can see, this algorithm works pretty well, and it's oddly satisfying to watch!
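The Monte Carlo estimator behind Figure 4 is equally short (again assuming the helpers sketched above):

import numpy as np

def p_best(D, n_draws=10_000):
    # Estimate P(k* = k | D): the fraction of sampled worlds each arm wins
    draws = np.column_stack([mu_posterior(D[k]).rvs(n_draws) for k in range(K)])
    winners = draws.argmax(axis=1)   # index of the winner in each sampled world
    return np.bincount(winners, minlength=K) / n_draws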

What this teaches us about curiosity

One odd thing jumps out when watching this algorithm run: sometimes it pulls an arm whose expected mean is lower. Why?

Figure 5: The known arm has a high but narrow posterior. The uncertain arm has a lower mean, but its wider posterior leaves enough right-tail probability that it can still sometimes look best.

Because the algorithm is not comparing only the means; it is comparing uncertain beliefs.

The arm we have pulled many times has a narrow posterior: we have a pretty good idea of how good it is. An arm we have barely tried has a wider posterior: its average might be lower, but there is still a real chance that it is better than it currently looks.

That chance lives in the right tail. Every now and then, Thompson Sampling draws a sample from that tail, and in that sampled world the uncertain arm looks like the best option. So the algorithm tries it.
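To put a rough number on that tail, take two illustrative posteriors in the spirit of Figure 5 (these values are made up, not the figure's):

from scipy import stats

known = stats.norm(loc=1.0, scale=0.1)      # well-explored arm: high and narrow
uncertain = stats.norm(loc=0.7, scale=0.5)  # barely-tried arm: lower but wide

# For independent Gaussian posteriors the difference is Gaussian too,
# so P(uncertain > known) has a closed form.
mu_d = uncertain.mean() - known.mean()
sd_d = (uncertain.var() + known.var()) ** 0.5
print(1 - stats.norm(mu_d, sd_d).cdf(0.0))  # ~0.28

Despite a clearly lower mean, the uncertain arm beats the known one in roughly one sampled world out of four, so Thompson Sampling keeps trying it at about that rate.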

This is the key point: Thompson Sampling is not exploring because we added a separate exploration rule. It explores because uncertainty changes what “best” means.
This leads us to the conclusion:

Curiosity is what optimal action looks like when your beliefs are uncertain.

OK, we are basically done. Now let's compare it with other methods.

Comparison with known methods

Comparison is the thief of joy
- Not a reviewer

To benchmark this method, I replicated the 10-armed testbed parameter study with the methods used in Chapter 2 of Sutton & Barto.

Each algorithm has a hyperparameter that balances exploration and exploitation. To keep things short, I'm going to give you a brief overview of each method in the cards below.

UCB ($c$): add an optimism bonus $c\sqrt{\log t / N_k(t)}$ to each empirical mean; pull the arg-max.

Bayesian P(best) ($\beta_0$): the method of this blog post; sweep the same $\beta_0$ prior-scale slider from Figure 3.

Optimistic greedy ($Q_0$): initialize all $Q$-values high so every arm gets pulled at least a few times.

Gradient bandit ($\alpha$): learn soft action preferences using reward advantages; sample from the softmax.

$\varepsilon$-greedy ($\varepsilon$): exploit the empirical best most of the time, explore uniformly with probability $\varepsilon$.

Hover a curve to see the parameter value and average reward at that point.

Figure 6: Average reward over the first 1000 steps as a function of each algorithm's tuned parameter (log-spaced). Each curve sweeps a single hyperparameter; the highlighted point marks each algorithm's best setting. UCB peaks at $c=1$, with Bayesian $\beta_0\!=\!2$ a close second.

UCB wins by $0.0118$, which is $\sim 0.76\%$ better than the Bayesian method.
The fact that the result is so similar is not that surprising, given that the UCB policy looks Bayesian-ish. $$ A_t \;=\; \arg\max_k\!\bigg[\hat\mu_k + c\sqrt{\tfrac{\log t}{N_k(t)}}\bigg]. $$ Both methods are trying to spend samples where uncertainty could still matter. UCB does this explicitly, by adding an uncertainty bonus to the empirical mean. The Bayesian method does it implicitly, by sampling from the posterior and asking which arm wins in that sampled world.
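For reference, here is what that UCB rule looks like as a minimal Python sketch (empirical means and pull counts are assumed to be tracked by the caller):

import numpy as np

def ucb_action(means, counts, t, c=1.0):
    # Arg-max of empirical mean plus the optimism bonus c * sqrt(log t / N_k)
    counts = np.asarray(counts, dtype=float)
    bonus = c * np.sqrt(np.log(t) / np.maximum(counts, 1.0))
    bonus[counts == 0] = np.inf      # untried arms are pulled first
    return int(np.argmax(np.asarray(means, dtype=float) + bonus))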

Relationship to other theories

This is not meant to be a new theory of bandits. The point of this blog post is to give an intuition for known methods: curiosity can be understood as the behavior that falls out when an agent acts optimally under uncertain beliefs. The sampling-the-winner idea is closely related to Thompson sampling and posterior probability matching, going back to W. R. Thompson's 1933 paper and modern treatments such as Russo et al.'s tutorial.

The slightly different emphasis here is the best-arm-identification view: instead of asking only which arm has the highest current estimate, we ask for the posterior probability that each arm is truly the best. That framing has also been studied directly in the Bayesian best-arm-identification literature. I find it useful because it makes the "curiosity" story precise: an arm is worth trying when its uncertainty still leaves meaningful probability mass on it being the best.

Let's just jump straight to the conclusion!

Conclusion

In this blog post, we discovered that curiosity is not a bonus term, a mood, or a hand-written urge to try random things:

Curiosity is what optimal action looks like when your beliefs are uncertain

The agent is not exploring because we told it to explore. It explores because uncertainty changes what the “best action” means.

Here we've explored this idea in the context of the $K$-armed bandit problem, but I believe the concept extends well beyond this setting.
The next environment I am working on is a chess engine built around the same idea.

In a bandit, curiosity decides which arm deserves another pull. In chess, it decides which branch of the game tree deserves another look: not just the move that currently looks best, but the move whose uncertainty could still change the answer.

I hope you found this interesting. If you want to help out in any way, reach out! 🐝

Acknowledgements

I want to thank Matteo Peluso for the feedback.

This blog post is a write-up of a weekend project. You can find the code for all experiments, along with insights into my development process, in the corresponding GitHub repository.

Contact

Email: francesco215@live.it · Discord: sacco215 · Code: github.com/Francesco215/Bayes-bandit

Citation

For attribution in academic contexts, please cite this work as

Sacco, "What is Curiosity? Mathematically?", 2026. doi: 10.5281/zenodo.20038430

BibTeX:

@article{sacco2026Curiosity,
  author = {Sacco, Francesco},
  title  = {The Bayes Bandit},
  year   = {2026},
  doi    = {10.5281/zenodo.20038430},
  url    = {https://francesco215.github.io/Bayes-bandit/}
}