Data Science

In supervised learning, one searches for a relationship between data \(X\) and outcome \(Y\)

\[f(X) \rightarrow Y\]

The data \(\left\{X_i, Y_i \right\}\) is given to us.

  • Bad News: Data is always finite, while more data is always better (or at least not worse).

Reinforcement Learning

  • Good News: Very often we can get more data.

  • Bad News 1: Our actions \(\left\{A_i \right\}\) are part of the data gathering process (experimental design) and thereby part of the data. \[f(X, A) \rightarrow Y\]
  • Bad News 2: Gathering data will cost you!
  • Bad News 3: The relationship is often probabilistic
  • A big part of reinforcement learning is about how to choose the next action \(A\) without losing too much.

Case in Point

At my tube station there is usually a train at 8.10 and one at 8.12. Sometimes the 8.12 is a completely empty train, but sometimes it gets held up and is extra full instead.

What do I do?

Multi-Armed Bandit Problem

There is a set of actions \(\left\{ a \right\}\), and each action \(a_i\), given the circumstances \(x_i\), yields a reward \(r_i\) (typically binary) with probability

\[P\left(r_i | a_i,x_i\right)\]

How do I choose \(a_i\) so that I maximise my reward \(r_i\) in the long run, without wasting too much effort on bad choices? Exploration versus exploitation.
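To make the setup concrete, here is a minimal Python sketch of such a bandit, assuming binary rewards and ignoring the context \(x_i\); the class name `BernoulliBandit` is an illustrative choice, not part of the original material.

```python
import numpy as np

class BernoulliBandit:
    """K-armed bandit: pulling arm i pays reward 1 with probability probs[i], else 0."""

    def __init__(self, probs, seed=0):
        self.probs = np.asarray(probs)            # true payout probabilities, unknown to the agent
        self.rng = np.random.default_rng(seed)

    def pull(self, arm):
        """Take action a_i = arm and return the binary reward r_i."""
        return int(self.rng.random() < self.probs[arm])

# Example: three arms with different payout probabilities
bandit = BernoulliBandit([0.3, 0.5, 0.45])
r = bandit.pull(1)                                # a single binary reward from arm 1
```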

Some more notation

There exists some model with parameters \(\theta^*\) that is the best model there is \[P\left(r_i | a_i,x_i, \theta^*\right)\] We try to learn \(\theta^*\) via a sequence of updates to the model \(\theta\).

  • If we were to know \(\theta^*\), we could find the best actions for each situation \[a^* = \mathrm{argmax}_a P\left(r_i | a,x_i, \theta^*\right)\]
  • We want to minimize the expected regret \[\left<\sum_{i=0}^T \left( r_i\left(a_i^*\right)- r_i\left(a_i\right)\right)\right>_{\theta^*} = R(T)\]
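As a small illustration of this definition, the sketch below computes the expected (pseudo-)regret of a sequence of choices against the true payout probabilities; the helper name `expected_regret` is an assumption for this example.

```python
import numpy as np

def expected_regret(probs, chosen_arms):
    """Sum over time of the gap between the best arm's mean payout
    and the mean payout of the arm actually chosen."""
    probs = np.asarray(probs)
    chosen_arms = np.asarray(chosen_arms)
    return float(np.sum(probs.max() - probs[chosen_arms]))

# Example: with arms [0.3, 0.5, 0.45], choosing arm 0 twice and arm 1 once costs 2 * 0.2
print(expected_regret([0.3, 0.5, 0.45], [0, 0, 1]))   # ≈ 0.4
```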

Simple Options

  • Greedy: Always choose the action that looks best given past experience
  • Random: Worse than greedy
  • \(\epsilon\)-greedy: Always choose the best, except for a fraction \(\epsilon\) of the time, when you choose completely at random. Very popular because it is simple, it works, and \[\frac{R(T)}{T} \rightarrow 0\]
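A minimal sketch of the \(\epsilon\)-greedy rule, assuming running counts of successes and pulls per arm (the function name `eps_greedy_choice` is illustrative):

```python
import numpy as np

def eps_greedy_choice(successes, pulls, eps, rng):
    """With probability eps pick a uniformly random arm, otherwise the empirically best one."""
    if rng.random() < eps:
        return int(rng.integers(len(pulls)))               # explore
    means = np.asarray(successes) / np.maximum(pulls, 1)   # empirical mean reward per arm
    return int(np.argmax(means))                           # exploit
```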

Not so simple Options

Upper Confidence Bound (UCB): Choose the arm whose estimated reward has the largest upper confidence bound.

[Figure: choose arm B, the arm with the highest upper confidence bound]

UCB

Regret scales asymptotically as \(R(T) \sim O(\log T)\), which is optimal
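A sketch of the standard UCB1 rule consistent with this description, under the same success/pull counts as before; the exploration bonus \(\sqrt{2 \ln t / n_a}\) is the usual choice, not something specified in the slides.

```python
import numpy as np

def ucb1_choice(successes, pulls, t):
    """Choose the arm whose upper confidence bound (mean + bonus) is largest."""
    pulls = np.asarray(pulls, dtype=float)
    if np.any(pulls == 0):
        return int(np.argmin(pulls))                  # play every arm once before trusting the bounds
    means = np.asarray(successes) / pulls
    bonus = np.sqrt(2.0 * np.log(t) / pulls)          # shrinks as an arm is pulled more often
    return int(np.argmax(means + bonus))
```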

Thompson Sampling

UCB is asymptotically optimal, but asymptotic results can be somewhat misleading.

Thompson Sampling: Choose each arm with probability proportional to the chance that it is the best arm

Sample \(\boldsymbol{\theta} \sim P(\boldsymbol{\theta})\), then choose \(\mathrm{argmax}_a\, \theta_a\) (efficient)
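A minimal sketch of Beta-Bernoulli Thompson sampling as just described, using the flat \(\mathrm{Beta}(1, 1)\) prior that appears in the recap below; the function names are illustrative.

```python
import numpy as np

def thompson_choice(successes, failures, rng):
    """Draw theta_a from each arm's Beta posterior and play the arm with the largest draw."""
    samples = rng.beta(1 + np.asarray(successes), 1 + np.asarray(failures))
    return int(np.argmax(samples))

def thompson_update(successes, failures, arm, reward):
    """Conjugate update: a success increments a, a failure increments b."""
    if reward:
        successes[arm] += 1
    else:
        failures[arm] += 1
```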

Simulations

\(K\) arms, best arm pays out with probability \(0.5\), the others with probability \(0.5-\epsilon\)

Small \(\epsilon\) means best choice is harder to find, but regret should grow more slowly
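A sketch of this testbed, reusing the hypothetical `BernoulliBandit`, `thompson_choice`, and `thompson_update` helpers from the earlier sketches:

```python
import numpy as np

K, eps, T = 10, 0.05, 10_000
probs = np.full(K, 0.5 - eps)
probs[0] = 0.5                                    # one best arm at 0.5, the rest at 0.5 - eps
rng = np.random.default_rng(0)
bandit = BernoulliBandit(probs, seed=1)

successes, failures = np.zeros(K), np.zeros(K)
regret = 0.0
for t in range(T):
    arm = thompson_choice(successes, failures, rng)
    r = bandit.pull(arm)
    thompson_update(successes, failures, arm, r)
    regret += probs.max() - probs[arm]            # expected regret accumulates the per-step gap
```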

Quick Recap: Beta distribution

  • Beta distribution \[ \mathrm{Beta}(x; a, b) = \frac{1}{B(a, b)}\, x^{a-1} (1-x)^{b-1} \]
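The Beta distribution matters here because it is conjugate to binary rewards: starting from a \(\mathrm{Beta}(a, b)\) prior on an arm's payout probability, observing \(s\) successes and \(f\) failures gives the posterior

\[ \mathrm{Beta}(x; a, b) \;\longrightarrow\; \mathrm{Beta}(x; a + s, b + f), \]

which is exactly the update traced through on the next slides.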

Updates

Flat prior (\(a = 1\), \(b = 1\))

Updates

1 success out of 3 results (\(a = 2\), \(b = 3\))

Updates

20 successes out of 60 results (\(a = 21\), \(b = 41\))

Results

Optimistic Thompson Sampling

When the draw from the Beta distribution ends up below the posterior mean, just take the mean instead (slightly greedier)
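A one-line variation on the hypothetical `thompson_choice` sketch above that implements this:

```python
import numpy as np

def optimistic_thompson_choice(successes, failures, rng):
    """Thompson sampling, but any draw below its arm's posterior mean is replaced by the mean."""
    a = 1 + np.asarray(successes)
    b = 1 + np.asarray(failures)
    samples = rng.beta(a, b)
    means = a / (a + b)                            # posterior mean of each arm
    return int(np.argmax(np.maximum(samples, means)))
```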

Overcount/Discount

Overcounting or discounting past results makes the sampling greedier or more explorative: \(a \rightarrow a/\alpha,\; b \rightarrow b/\alpha\)
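As a sketch, this amounts to rescaling the posterior counts before sampling: \(\alpha > 1\) shrinks the evidence and widens the posterior (more exploration), while \(\alpha < 1\) does the opposite (greedier). The helper name is illustrative.

```python
def discount(successes, failures, alpha):
    """Rescale past evidence: alpha > 1 -> more explorative, alpha < 1 -> greedier."""
    return successes / alpha, failures / alpha

# Example: halve the effective sample size before the next Thompson draw
# successes, failures = discount(successes, failures, alpha=2.0)
```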

Trading Regret for Variance

Robustness to Feedback Delay

Using real(ish) data

A model is being trained to predict preferred articles

Take Home Messages

  • Mathematical Rigor can be misleading
  • Experiments can inspire mathematics (Thompson sampling was shortly afterwards proven to be asymptotically optimal)
  • Purely observational data can sometimes give information about intervention effects (causal inference)
  • Thompson sampling works and is robust for multi-armed-bandit-like problems