Data Science

In supervised learning, one searches for a relationship between data \(X\) and outcome \(Y\)

\[f(X) \rightarrow Y\]

The data \(\left\{X_i, Y_i \right\}\) is given to us.

  • Bad News: Data is always finite, while more data is always better (or at least not worse).

Reinforcement Learning

  • Good News: Very often we can get more data.

  • Bad News 1: Our actions \(\left\{A_i \right\}\) are part of the data gathering process (experimental design) and thereby part of the data. \[f(X, A) \rightarrow Y\]
  • Bad News 2: Gathering data will cost you!
  • Bad News 3: The relationship is often probabilistic
  • A big part of reinforcement learning is about how to choose the next action \(A\) without losing too much.

Case in Point

At my tube station there is usually a train at 8.10 and one at 8.12. Sometimes the 8.12 is a completely empty train, but sometimes it gets held up and is extra full instead.

What do I do?

Multi-Armed Bandit Problem

There is a set of actions \(\left\{ a \right\}\), and each action \(a_i\), given the circumstances \(x_i\), yields a reward \(r_i\) (typically binary) with probability

\[P\left(r_i | a_i,x_i\right)\]

How do I choose \(a_i\) so that I maximise my reward \(r_i\) in the long run, without wasting too much effort on bad choices? Exploration versus exploitation.
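To make the setup concrete, here is a minimal Python sketch of such a bandit, assuming binary rewards and ignoring the context \(x_i\); the class name `BernoulliBandit` is an illustrative choice, not part of the original material.

```python
import numpy as np

class BernoulliBandit:
    """K-armed bandit: pulling arm i pays reward 1 with probability probs[i], else 0."""

    def __init__(self, probs, seed=0):
        self.probs = np.asarray(probs)            # true payout probabilities, unknown to the agent
        self.rng = np.random.default_rng(seed)

    def pull(self, arm):
        """Take action a_i = arm and return the binary reward r_i."""
        return int(self.rng.random() < self.probs[arm])

# Example: three arms with different payout probabilities
bandit = BernoulliBandit([0.3, 0.5, 0.45])
r = bandit.pull(1)                                # a single binary reward from arm 1
```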

Some more notation

There exists some model with parameters \(\theta^*\) that is the best model there is \[P\left(r_i | a_i,x_i, \theta^*\right)\] We try to learn \(\theta^*\) via a sequence of updates to the model \(\theta\).

  • If we were to know \(\theta^*\), we could find the best actions for each situation \[a^* = \mathrm{argmax}_a P\left(r_i | a,x_i, \theta^*\right)\]
  • We want to minimize the expected regret \[\left<\sum_{i=0}^T \left( r_i\left(a_i^*\right)- r_i\left(a_i\right)\right)\right>_{\theta^*} = R(T)\]
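As a small illustration of this definition, the sketch below computes the expected (pseudo-)regret of a sequence of choices against the true payout probabilities; the helper name `expected_regret` is an assumption for this example.

```python
import numpy as np

def expected_regret(probs, chosen_arms):
    """Sum over time of the gap between the best arm's mean payout
    and the mean payout of the arm actually chosen."""
    probs = np.asarray(probs)
    chosen_arms = np.asarray(chosen_arms)
    return float(np.sum(probs.max() - probs[chosen_arms]))

# Example: with arms [0.3, 0.5, 0.45], choosing arm 0 twice and arm 1 once costs 2 * 0.2
print(expected_regret([0.3, 0.5, 0.45], [0, 0, 1]))   # ≈ 0.4
```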

Simple Options

  • Greedy: Always choose the action that looks best given past experience
  • Random: Worse than greedy
  • \(\epsilon\)-greedy: Always choose the best, except for a fraction \(\epsilon\) of the time, when you choose completely at random. Very popular because it is simple, it works, and \[\frac{R(T)}{T} \rightarrow 0\]
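A minimal sketch of the \(\epsilon\)-greedy rule, assuming running counts of successes and pulls per arm (the function name `eps_greedy_choice` is illustrative):

```python
import numpy as np

def eps_greedy_choice(successes, pulls, eps, rng):
    """With probability eps pick a uniformly random arm, otherwise the empirically best one."""
    if rng.random() < eps:
        return int(rng.integers(len(pulls)))               # explore
    means = np.asarray(successes) / np.maximum(pulls, 1)   # empirical mean reward per arm
    return int(np.argmax(means))                           # exploit
```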

Not so simple Options

Upper Confidence Bound (UCB): Choose the arm whose estimated reward has the largest upper confidence bound.

[Figure: choose arm B, the arm with the highest upper confidence bound]

UCB

Regret scales asymptotically as \(R(T) \sim O(\log T)\), which is optimal
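A sketch of the standard UCB1 rule consistent with this description, under the same success/pull counts as before; the exploration bonus \(\sqrt{2 \ln t / n_a}\) is the usual choice, not something specified in the slides.

```python
import numpy as np

def ucb1_choice(successes, pulls, t):
    """Choose the arm whose upper confidence bound (mean + bonus) is largest."""
    pulls = np.asarray(pulls, dtype=float)
    if np.any(pulls == 0):
        return int(np.argmin(pulls))                  # play every arm once before trusting the bounds
    means = np.asarray(successes) / pulls
    bonus = np.sqrt(2.0 * np.log(t) / pulls)          # shrinks as an arm is pulled more often
    return int(np.argmax(means + bonus))
```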

Thompson Sampling

UCB is asymptotically optimal, but asymptotic results can be somewhat misleading.

Thompson Sampling: Choose each arm with probability proportional to the chance that it is the best arm

Sample \(\boldsymbol{\theta} \sim P(\boldsymbol{\theta})\), then choose \(\mathrm{argmax}_a\, \theta_a\) (efficient)
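A minimal sketch of Beta-Bernoulli Thompson sampling as just described, using the flat \(\mathrm{Beta}(1, 1)\) prior that appears in the recap below; the function names are illustrative.

```python
import numpy as np

def thompson_choice(successes, failures, rng):
    """Draw theta_a from each arm's Beta posterior and play the arm with the largest draw."""
    samples = rng.beta(1 + np.asarray(successes), 1 + np.asarray(failures))
    return int(np.argmax(samples))

def thompson_update(successes, failures, arm, reward):
    """Conjugate update: a success increments a, a failure increments b."""
    if reward:
        successes[arm] += 1
    else:
        failures[arm] += 1
```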

Simulations

\(K\) arms, best arm pays out with probability \(0.5\), the others with probability \(0.5-\epsilon\)

Small \(\epsilon\) means best choice is harder to find, but regret should grow more slowly
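A sketch of this testbed, reusing the hypothetical `BernoulliBandit`, `thompson_choice`, and `thompson_update` helpers from the earlier sketches:

```python
import numpy as np

K, eps, T = 10, 0.05, 10_000
probs = np.full(K, 0.5 - eps)
probs[0] = 0.5                                    # one best arm at 0.5, the rest at 0.5 - eps
rng = np.random.default_rng(0)
bandit = BernoulliBandit(probs, seed=1)

successes, failures = np.zeros(K), np.zeros(K)
regret = 0.0
for t in range(T):
    arm = thompson_choice(successes, failures, rng)
    r = bandit.pull(arm)
    thompson_update(successes, failures, arm, r)
    regret += probs.max() - probs[arm]            # expected regret accumulates the per-step gap
```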

Quick Recap: Beta distribution

  • Beta distribution \[ \mathrm{Beta}(x; a, b) = \frac{1}{B(a, b)}\, x^{a-1} (1-x)^{b-1} \]
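The Beta distribution matters here because it is conjugate to binary rewards: starting from a \(\mathrm{Beta}(a, b)\) prior on an arm's payout probability, observing \(s\) successes and \(f\) failures gives the posterior

\[ \mathrm{Beta}(x; a, b) \;\longrightarrow\; \mathrm{Beta}(x; a + s, b + f), \]

which is exactly the update traced through on the next slides.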

Updates

Flat prior (\(a = 1\), \(b = 1\))

Updates

1 success out of 3 results (\(a = 2\), \(b = 3\))

Updates

20 successes out of 60 results (\(a = 21\), \(b = 41\))

Results

Optimistic Thompson Sampling

When the draw from the Beta distribution ends up below the posterior mean, just take the mean instead (slightly greedier)
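A one-line variation on the hypothetical `thompson_choice` sketch above that implements this:

```python
import numpy as np

def optimistic_thompson_choice(successes, failures, rng):
    """Thompson sampling, but any draw below its arm's posterior mean is replaced by the mean."""
    a = 1 + np.asarray(successes)
    b = 1 + np.asarray(failures)
    samples = rng.beta(a, b)
    means = a / (a + b)                            # posterior mean of each arm
    return int(np.argmax(np.maximum(samples, means)))
```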

Overcount/Discount

Overcounting or discounting past results makes the sampling greedier or more explorative: \(a \rightarrow a/\alpha,\; b \rightarrow b/\alpha\)
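As a sketch, this amounts to rescaling the posterior counts before sampling: \(\alpha > 1\) shrinks the evidence and widens the posterior (more exploration), while \(\alpha < 1\) does the opposite (greedier). The helper name is illustrative.

```python
def discount(successes, failures, alpha):
    """Rescale past evidence: alpha > 1 -> more explorative, alpha < 1 -> greedier."""
    return successes / alpha, failures / alpha

# Example: halve the effective sample size before the next Thompson draw
# successes, failures = discount(successes, failures, alpha=2.0)
```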

Trading Regret for Variance

Robustness to Feedback Delay

Using real(ish) data

A model is being trained to predict preferred articles

Take Home Messages

  • Mathematical Rigor can be misleading
  • Experiments can inspire mathematics (Thompson sampling was shortly afterwards proven to be asymptotically optimal)
  • Purely observational data can sometimes give information about intervention effects (causal inference)
  • Thompson sampling works and is robust for multi-armed-bandit-like problems