What is Curiosity? Mathematically?

Francesco Sacco

Mobile note: this post is meant to be read on a computer. Several figures are interactive and use wide layouts, hover states, sliders, and precise clicks, so they may not work as intended on a phone or small tablet.

This is one of the most important questions in the entire field of Artificial Intelligence.
The whole discipline of dataset curation will go away once a good mathematical framework for curiosity emerges.

Hopefully, this blog post will be the first in a series that works through increasingly complex environments, explaining and deriving the mathematical foundations of curiosity along the way.

What is the $K$-armed bandit problem?

The $K$-armed bandit problem is perhaps one of the worst-named pieces of math. However, it has important applications in reinforcement learning. Let's see how it works.

You walk into a room with $K$ slot machines. Each one pays out a noisy reward drawn from its own Gaussian distribution — you don't know the means, you don't know the variances, and you have a finite number of pulls. Your job is to earn as much money (a.k.a. "reward") as possible.

Find the best strategy by hand!

The true reward distributions are hidden at the start. Each pull drops one reward into a bucket; enough evidence makes each arm's empirical violin appear. Try to win as much reward as possible!

Figure 1: Evidence from the same five hidden arms used throughout the page. The agent sees only individual rewards; repeated pulls fill buckets until the observed rewards form a histogram-like violin.

Every pull is a trade-off: you can exploit the arm that currently looks best, or explore a less-tested arm to learn whether it is better than it seems.

So, how much should we exploit and how much should we explore?

We will see that the answer to this existential question follows easily from the math once we ask ourselves:

Given the information available, what is the best action I can take?

Setup: noisy arms, hidden average rewards

Each arm $k\in\{1,\dots,K\}$ has a true expected reward $\mu_k$ and an intrinsic noise level $\sigma_k$, both unknown to us. Pulling arm $k$ produces a reward $$ r_{k,i}\mid\mu_k,\sigma_k^2 \;\sim\; \mathcal N(\mu_k,\sigma_k^2). $$ Figure 2 shows the five arms freshly sampled from the environment used in the study. The thick bar marks the true average reward $\mu_k$; the violin shows the spread of rewards you'd actually see.

Figure 2: Five sampled arms. Each violin is the reward distribution $\mathcal N(\mu_k,\sigma_k^2)$ for one arm; the horizontal bar is $\mu_k$. The agent never sees the violins — only individual reward draws from the arms it pulls.
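To make this setup concrete, here is a minimal Python sketch of such an environment (the ranges for $\mu_k$ and $\sigma_k$ are made-up illustrations, not the values behind the figures):

import numpy as np

rng = np.random.default_rng(0)

K = 5
mu = rng.uniform(-2.0, 2.0, size=K)    # true average rewards (hidden from the agent)
sigma = rng.uniform(0.5, 2.0, size=K)  # true noise levels (also hidden)

def pull(k):
    # Pull arm k: one noisy reward r ~ N(mu_k, sigma_k^2)
    return rng.normal(mu[k], sigma[k])

The agent only ever sees the return values of pull; everything that follows is about reverse-engineering mu and sigma from those draws.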

Reverse-engineering the arms

We do not directly know the true mean and standard deviation of each arm, but after pulling an arm we can collect a dataset: $D_k = \{r_{k,1},\dots,r_{k,n}\}$ of observed rewards. The bigger the dataset, the better our estimate of the true mean and standard deviation. But how do we do this quantitatively?
Just as the Gaussian model gives us the likelihood $p(D_k \mid \mu_k,\sigma_k^2)$, Bayes' rule turns that evidence around into a posterior belief over the unknown parameters: $$ p(\mu_k,\sigma_k^2 \mid D_k) $$

If we focus just on the arm's true average reward, marginalizing out the unknown noise level, that posterior becomes a Student-$t$ distribution: $$ \mu_k \mid D_k \sim t $$ (its exact parameters are derived in the details below). This is the distribution shown below in Figure 3. Each colored curve represents what the model currently believes about one arm's expected reward, not just the next noisy reward from that arm.

Each time you pull an arm, the available information increases. You can then use that information to improve your estimate of the ground truth mean.

Posterior update details

Bayes' rule says that the posterior is proportional to the likelihood times the prior: $$ p(\mu_k,\sigma_k^2 \mid D_k) \;\propto\; p(D_k \mid \mu_k,\sigma_k^2)\,p(\mu_k,\sigma_k^2). $$

For a Gaussian reward model with unknown mean and variance, we choose the conjugate Normal-Inverse-Gamma prior: $$ \sigma_k^2 \sim \mathrm{InvGamma}(\alpha_0,\beta_0), \qquad \mu_k\mid\sigma_k^2 \sim \mathcal N\!\left(\mu_0,\tfrac{\sigma_k^2}{\kappa_0}\right). $$ The prior term contains the four hyperparameters $(\mu_0,\kappa_0,\alpha_0,\beta_0)$, which encode what we believe before pulling anything.

After observing rewards $D_k=\{r_{k,1},\dots,r_{k,n_k}\}$ with sum $s_k$ and sum of squares $q_k$, conjugacy lets us multiply likelihood and prior and stay in the same family: $$ \kappa_{k,n} = \kappa_0 + n_k, \qquad \mu_{k,n} = \frac{\kappa_0\mu_0 + s_k}{\kappa_0 + n_k}, $$ $$ \alpha_{k,n} = \alpha_0 + \tfrac{n_k}{2}, \qquad \beta_{k,n} = \beta_0 + \tfrac12\!\left(q_k + \kappa_0\mu_0^2 - \kappa_{k,n}\mu_{k,n}^2\right). $$

Marginalizing $\sigma_k^2$ out of the Normal-Inverse-Gamma posterior gives the Student-$t$ posterior over the true average reward: $$ \mu_k\mid D_k \;\sim\; t_{\nu_{k,n}}\!\left(\mu_{k,n},\,\tau_{k,n}\right), \quad \nu_{k,n}=2\alpha_{k,n}, \quad \tau_{k,n}=\sqrt{\tfrac{\beta_{k,n}}{\alpha_{k,n}\kappa_{k,n}}}. $$
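As a sanity check, these updates fit in a few lines of Python (a sketch using a placeholder prior $(\mu_0,\kappa_0,\alpha_0,\beta_0)=(0,1,1,1)$, not necessarily the one used in the figures):

import numpy as np
from scipy import stats

def posterior_params(rewards, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    # Normal-Inverse-Gamma conjugate update for one arm's observed rewards
    r = np.asarray(rewards, dtype=float)
    n, s, q = len(r), r.sum(), (r**2).sum()
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + s) / kappa_n
    alpha_n = alpha0 + n / 2
    beta_n = beta0 + 0.5 * (q + kappa0 * mu0**2 - kappa_n * mu_n**2)
    return mu_n, kappa_n, alpha_n, beta_n

def mu_posterior(rewards, **prior):
    # Student-t marginal posterior over the arm's true average reward
    mu_n, kappa_n, alpha_n, beta_n = posterior_params(rewards, **prior)
    nu = 2 * alpha_n
    tau = np.sqrt(beta_n / (alpha_n * kappa_n))
    return stats.t(df=nu, loc=mu_n, scale=tau)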

See how the model estimates the true average reward of each arm $\mu^\star$

Click the buttons on the right to see what happens.

Figure 3: The vertical lines marked with $\mu^\star$ represent the true average reward of each arm.
As you sample each arm, the predicted distribution $p(\mu | D)$ homes in on the true average reward $\mu^\star$.

The right question to ask

Given everything I've seen so far, what is the probability that arm $k$ is the truly best arm?

We can translate that question into the posteriors we just built. For each arm $k$, ask how likely it is that its true average reward is larger than every other arm's true average reward: $$ \mathbb P(k^\star = k\mid D) \;=\; \mathbb P\big(\mu_k\ge \mu_j\;\forall j\ne k\,\big|\,D\big). $$

The easiest way to estimate this is by Monte Carlo. There is also another way: since the posteriors factorize across arms (they do, because each arm's data only informs that arm), we can collapse it into a one-dimensional integral: $$ \mathbb P(k^\star = k\mid D) \;=\; \int_{-\infty}^{\infty} f_k(x)\!\prod_{j\ne k} F_j(x)\,dx, $$ where $f_k$ is arm $k$'s posterior density and $F_j$ are the other arms' CDFs. In words: weight every candidate value $x$ by how plausible it is for arm $k$, times the probability that every other arm is below it. The recipe is the same one Thompson wrote down in 1933 (Algorithm 1):

On its own, this is just a one-step decision rule. If we put it inside the bandit loop, where every pull adds one more observation to $D$, it becomes an adaptive algorithm: uncertainty drives exploration early on, then the posterior concentrates and the policy naturally settles on the arms that keep looking best (Algorithm 2).

def thompson_sampling($D$):
    for each arm $k$:
        draw $\tilde\mu_k \sim p(\mu_k\mid D_k)$
    return $\arg\max_k \tilde\mu_k$

Algorithm 1

def bayes_bandit($D$):
    for $t = 1, 2, \ldots$:
        $a_t$ = thompson_sampling($D$)
        pull arm $a_t$ & observe reward $r_t$
        append $(a_t, r_t)$ to $D$
    return $p(\mu_k \mid D)$

Algorithm 2
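In runnable Python, the two algorithms above might look like this (a sketch that reuses the hypothetical pull and mu_posterior helpers from earlier):

import numpy as np

def thompson_sampling(D):
    # Draw one plausible mean per arm from its posterior; back the winner
    samples = [mu_posterior(D[k]).rvs() for k in range(K)]
    return int(np.argmax(samples))

def bayes_bandit(T=1000):
    # The bandit loop: every pull feeds the posterior it will sample from next
    D = [[] for _ in range(K)]      # one list of observed rewards per arm
    for t in range(T):
        a = thompson_sampling(D)    # Algorithm 1: sample, then arg-max
        r = pull(a)                 # interact with the environment
        D[a].append(r)              # the evidence grows by one reward
    return D

With an empty dataset, mu_posterior falls back to the prior, so untried arms still get a fair shot at winning a draw.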

Figure 4: A Monte Carlo estimator of $\mathbb P(k^\star=k\mid D)$. Each draw samples one $\tilde\mu_k$ per arm and credits whichever arm came out highest. In the strip, the highlighted draw is the sampled winner during Thompson sampling, or the chosen lever after a manual Figure 3 pull.

As you can see, this algorithm works pretty well, and it's oddly satisfying to watch!
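The Monte Carlo estimator behind Figure 4 is equally short (again assuming the helpers sketched above):

import numpy as np

def p_best(D, n_draws=10_000):
    # Estimate P(k* = k | D): the fraction of sampled worlds each arm wins
    draws = np.column_stack([mu_posterior(D[k]).rvs(n_draws) for k in range(K)])
    winners = draws.argmax(axis=1)   # index of the winner in each sampled world
    return np.bincount(winners, minlength=K) / n_draws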

What this teaches us about curiosity

One odd thing jumps out when watching this algorithm run: sometimes it pulls an arm whose expected mean is lower. Why?

Figure 5: The known arm has a high but narrow posterior. The uncertain arm has a lower mean, but its wider posterior leaves enough right-tail probability that it can still sometimes look best.

Because the algorithm is not comparing only the means; it is comparing uncertain beliefs.

The arm we have pulled many times has a narrow posterior: we have a pretty good idea of how good it is. An arm we have barely tried has a wider posterior: its average might be lower, but there is still a real chance that it is better than it currently looks.

That chance lives in the right tail. Every now and then, Thompson Sampling draws a sample from that tail, and in that sampled world the uncertain arm looks like the best option. So the algorithm tries it.
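To put a rough number on that tail, take two illustrative posteriors in the spirit of Figure 5 (these values are made up, not the figure's):

from scipy import stats

known = stats.norm(loc=1.0, scale=0.1)      # well-explored arm: high and narrow
uncertain = stats.norm(loc=0.7, scale=0.5)  # barely-tried arm: lower but wide

# For independent Gaussian posteriors the difference is Gaussian too,
# so P(uncertain > known) has a closed form.
mu_d = uncertain.mean() - known.mean()
sd_d = (uncertain.var() + known.var()) ** 0.5
print(1 - stats.norm(mu_d, sd_d).cdf(0.0))  # ~0.28

Despite a clearly lower mean, the uncertain arm beats the known one in roughly one sampled world out of four, so Thompson Sampling keeps trying it at about that rate.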

This is the key point: Thompson Sampling is not exploring because we added a separate exploration rule. It explores because uncertainty changes what “best” means.
This leads us to the conclusion:

Curiosity is what optimal action looks like when your beliefs are uncertain.

OK, we are basically done. Now let's compare it with other methods.

Comparison with known methods

Comparison is the thief of joy
- Not a reviewer

To benchmark this method, I replicated the 10-armed testbed parameter study with the methods used in Chapter 2 of Sutton & Barto.

Each algorithm has a hyperparameter that balances exploration and exploitation. To keep things short, I'm going to give you a brief overview of each method in the cards below.

UCB ($c$): add an optimism bonus $c\sqrt{\log t / N_k(t)}$ to each empirical mean; pull the arg-max.

Bayesian P(best) ($\beta_0$): the method of this blog post; sweep the same $\beta_0$ prior-scale slider from Figure 3.

Optimistic greedy ($Q_0$): initialize all $Q$-values high so every arm gets pulled at least a few times.

Gradient bandit ($\alpha$): learn soft action preferences using reward advantages; sample from the softmax.

$\varepsilon$-greedy ($\varepsilon$): exploit the empirical best most of the time, explore uniformly with probability $\varepsilon$.

Hover a curve to see the parameter value and average reward at that point.

Figure 6: Average reward over the first 1000 steps as a function of each algorithm's tuned parameter (log-spaced). Each curve sweeps a single hyperparameter; the highlighted point marks each algorithm's best setting. UCB peaks at $c=1$, with Bayesian $\beta_0\!=\!2$ a close second.

UCB wins by $0.0118$, which is $\sim 0.76\%$ better than the Bayesian method.
The fact that the result is so similar is not that surprising, given that the UCB policy looks Bayesian-ish. $$ A_t \;=\; \arg\max_k\!\bigg[\hat\mu_k + c\sqrt{\tfrac{\log t}{N_k(t)}}\bigg]. $$ Both methods are trying to spend samples where uncertainty could still matter. UCB does this explicitly, by adding an uncertainty bonus to the empirical mean. The Bayesian method does it implicitly, by sampling from the posterior and asking which arm wins in that sampled world.
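For reference, here is what that UCB rule looks like as a minimal Python sketch (empirical means and pull counts are assumed to be tracked by the caller):

import numpy as np

def ucb_action(means, counts, t, c=1.0):
    # Arg-max of empirical mean plus the optimism bonus c * sqrt(log t / N_k)
    counts = np.asarray(counts, dtype=float)
    bonus = c * np.sqrt(np.log(t) / np.maximum(counts, 1.0))
    bonus[counts == 0] = np.inf      # untried arms are pulled first
    return int(np.argmax(np.asarray(means, dtype=float) + bonus))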

Relationship to other theories

This is not meant to be a new theory of bandits. The point of this blog post is to give an intuition for known methods: curiosity can be understood as the behavior that falls out when an agent acts optimally under uncertain beliefs. The sampling-the-winner idea is closely related to Thompson sampling and posterior probability matching, going back to W. R. Thompson's 1933 paper and modern treatments such as Russo et al.'s tutorial.

The slightly different emphasis here is the best-arm-identification view: instead of asking only which arm has the highest current estimate, we ask for the posterior probability that each arm is truly the best. That framing has also been studied directly in the Bayesian best-arm-identification literature. I find it useful because it makes the "curiosity" story precise: an arm is worth trying when its uncertainty still leaves meaningful probability mass on it being the best.

Let's just jump straight to the conclusion!

Conclusion

In this blog post, we discovered that curiosity is not a bonus term, a mood, or a hand-written urge to try random things:

Curiosity is what optimal action looks like when your beliefs are uncertain

The agent is not exploring because we told it to explore. It explores because uncertainty changes what the “best action” means.

Here we've explored this idea in the context of the $K$-armed bandit problem, but I believe the concept extends well beyond this setting.
The next environment I am working on is a chess engine built around the same idea.

In a bandit, curiosity decides which arm deserves another pull. In chess, it decides which branch of the game tree deserves another look: not just the move that currently looks best, but the move whose uncertainty could still change the answer.

I hope you found this interesting. If you want to help out in any way, reach out! 🐝

Acknowledgements

I want to thank Matteo Peluso for the feedback.

This blog post is a write-up of a weekend project. You can find the code for all experiments, along with insights into my development process, in the corresponding GitHub repository.

Contact

Email: francesco215@live.it · Discord: sacco215 · Code: github.com/Francesco215/Bayes-bandit

Citation

For attribution in academic contexts, please cite this work as

Sacco, "What is Curiosity? Mathematically?", 2026. doi: 10.5281/zenodo.20038430

BibTeX:

@article{sacco2026Curiosity,
  author = {Sacco, Francesco},
  title  = {The Bayes Bandit},
  year   = {2026},
  doi    = {10.5281/zenodo.20038430},
  url    = {https://francesco215.github.io/Bayes-bandit/}
}