Exploration
- Multi-armed bandit (MAB) = MDP with a memoryless environment and a single state. It provides a reduced setting for analyzing exploration vs. exploitation.
- Generalization to \(|\mathcal S|>1\) states = contextual bandit = MDP with memoryless environment.
- Lai-Robbins lower bound (theorem 5.1): regret grows at least as \(\log t\). The proof is an exemplary application of large deviations theory.
- Optimism dominates pessimism in the face of uncertainty:
- UCB algorithm and regret bound 5.3.
- LCB algorithm is inherently greedy.
- Bayesian setup: Bayesian regret and Thompson sampling, which implements probability matching.
Multi-armed bandit
Definition 5.6 (multi-armed bandit) A multi-armed bandit consists of a known set of \(m\) actions and, for each action \(a\), an unknown yet stationary reward distribution \(\mathcal R^a(r)=\mathbb P(r\mid a)\). At each step, the agent selects an action \(a_t\in \mathcal A\) and the environment generates a reward. The goal is to maximize the cumulative reward \(\sum_{\tau=1}^t r_\tau\).
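For concreteness, the definition can be sketched as a tiny simulator; the `BernoulliBandit` class and its arm probabilities are illustrative choices, not part of the text:

```python
import random

class BernoulliBandit:
    """A minimal MAB: m arms, each with a fixed success probability unknown to the agent."""
    def __init__(self, probs, seed=0):
        self.probs = list(probs)      # R^a = Bernoulli(probs[a]), stationary
        self.rng = random.Random(seed)

    @property
    def m(self):
        return len(self.probs)

    def pull(self, a):
        # Memoryless environment: the reward depends only on the chosen action.
        return 1.0 if self.rng.random() < self.probs[a] else 0.0

bandit = BernoulliBandit([0.2, 0.5, 0.7])
rewards = [bandit.pull(2) for _ in range(1000)]
mean_reward = sum(rewards) / len(rewards)   # close to Q(2) = 0.7
```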
Definition 5.7 (regret) Consider the following definitions; note that regret is defined via statistical averages, and that minimizing regret = maximizing expected cumulative reward.
- Action-value \(Q(a)=\mathbb E[r\mid a]\) denotes the mean reward for action \(a\).
- Optimal value \(V^* = Q(a^*) = \max_a Q(a)\)
- Regret is the opportunity loss for one step \(I_t = \mathbb E[V^* - Q(a_t)]\).
- Cumulative regret \(L_t = \sum_{\tau \leq t} I_\tau\).
Some auxiliary definitions:
- Average payoff \(\mu_a = \mathbb E_{\mathcal R^a}(r)\).
- Gap \(\Delta_a = \mu_{a^*} - \mu_a>0\) for each sub-optimal arm.
- Pull counts \(N_a(t)\). Under this definition, we have cumulative regret \[ L_t = \sum_a \Delta_a \cdot \mathbb E\left[N_a(t)\right] \]
Lai-Robbins lower bound
Remark (proof strategy). Proof for the following theorem is adapted from appliedprobability.blog. It is an exemplary proof from large deviations theory:
- Fixing a suboptimal arm \(a\) and bound sequence \(n_t\), we wish to prove \(P(N_a(t) \leq n_t) = o(1)\)
- Construct an alternate measure \(P'\) under which \(E_t=\{N_a(t) \leq n_t\}\) is \(P\)-likely yet \(P'\)-unlikely.
- Key reduction: let \(C_t\) denote the event that empirical KL is \(\epsilon\)-close to true KL, then
\[
P(E_t) = P(E_t \cap \overline{C_t}) + P(E_t \cap C_t)
\]
- The first term is \(o(1)\) by WLLN.
- The second term can be converted to \(P'(E_t\cap C_t)\) via the Radon-Nikodym derivative, which is controlled on the event \(C_t\).
Theorem 5.1 (Lai-Robbins lower bound) For any algorithm whose number of suboptimal pulls is \(o(t^{\eta})\) for some \(\eta\in (0, 1)\), we have a logarithmic lower bound on regret: \[\begin{align} L_t \geq (1-\eta) \log t \sum_{a\neq a^*} \dfrac{\Delta_a}{D(\mathcal R^a \| \mathcal R^{a^*})} \end{align}\]
Proof: Fixing any suboptimal arm \(a\), construct a new bandit instance in which \(a\) is optimal, i.e. \(\tilde {\mathcal R}^a\) satisfies \(\mathbb E[\tilde {\mathcal R}^a] > \mu_{a^*}\). Let the original measure be denoted \(P\) and the new measure \(P'\). Fix a bound sequence \(n_t\); elementary probability yields \[\begin{align} \mathbb E_P[N_a(t)] &\geq n_t \left(1 - P[N_a(t) \leq n_t]\right) \end{align}\] We seek to choose \(n_t\) as large as possible while keeping \(P[N_a(t) \leq n_t]=o(1)\). Upper-bound this probability by splitting it into two cases:
- The first case covers the event that the empirical log-likelihood ratio between \(P\) and \(P'\) is not \(\epsilon\)-close to the KL divergence; its probability is some \(\delta_t = o(1)\) by the weak law of large numbers.
- The second case isolates the event where the empirical log-likelihood ratio is \(\epsilon\)-close to the KL divergence; there a change of measure to \(P'\) applies.
Note that, conditioning on actions, the bandit environment tensorizes; let \(D(r_k)=\log \dfrac{P(r_k)}{P'(r_k)}\) denote the log-likelihood ratio of the \(k\)-th reward from arm \(a\), so that \(e^{\sum_k D(r_k)}\) is the Radon-Nikodym derivative \(dP/dP'\) restricted to those rewards and \(\mathbb E_P[D(r_k)] = D(P\|P')\), where \(D(P\|P')\) abbreviates the per-sample divergence \(D(\mathcal R^a \| \tilde{\mathcal R}^a)\). Then \[\begin{align} P[N_a(t) \leq n_t] &= P\left[ N_a(t) \leq n_t \text{ and } \left| D(P\|P') - \dfrac 1 {n_t} \sum_{k=1}^{n_t} D(r_k) \right| > \epsilon \right] \\ &\quad + P\left[ N_a(t) \leq n_t \text{ and } \left| D(P\|P') - \dfrac 1 {n_t} \sum_{k=1}^{n_t} D(r_k) \right| \leq \epsilon \right] \\ &\leq P\left[ \left| D(P\|P') - \dfrac 1 {n_t} \sum_{k=1}^{n_t} D(r_k) \right| > \epsilon \right] \\ &\quad + \mathbb E_{P'}\left[ \underbrace{\exp \left( \sum_{k=1}^{n_t} D(r_k) \right)}_{\leq\, \exp\left(n_t [D(P\|P')+\epsilon]\right)} \mathbf 1\left[ N_a(t) \leq n_t \right] \right] \\ &\leq \delta_t + \exp \left( n_t [D(P\|P') + \epsilon] \right) P'[N_a(t) \leq n_t] \end{align}\]
It remains to bound \(P'[N_a(t) \leq n_t]\) by Markov’s inequality and choose \(n_t\): since by hypothesis the algorithm pulls suboptimal arms at rate \(o(t^\eta)\) for \(\eta\in (0, 1)\), and under \(P'\) every arm other than \(a\) is suboptimal, we have \[ P'[N_a(t) \leq n_t] = P'(t - N_a(t) \geq t - n_t) \leq \dfrac{\mathbb E_{P'}[t - N_a(t)]}{t-n_t} = o(t^{\eta - 1}) \] Substituting, we obtain \[\begin{align} P[N_a(t) \leq n_t] &\leq \delta_t + \exp \left( n_t [D(P\|P') + \epsilon] \right) o(t^{\eta - 1}) \end{align}\] The largest choice of \(n_t\) which keeps the whole bound \(o(1)\) is \[ n_t = \dfrac{1-\eta}{D(P\|P') + \epsilon} \log t, \quad \text{whence} \quad \mathbb E_P[N_a(t)] \geq n_t (1 - o(1)) \] Take \(\epsilon\to 0\); recall that \(D(P\|P')\) here is the per-sample divergence \(D(\mathcal R^a \|\tilde {\mathcal R}^a)\). Taking \(\tilde {\mathcal R}^a\to \mathcal R^{a^*}\) and substituting \(\mathbb E_P[N_a(t)]\) into the regret decomposition \(L_t = \sum_a \Delta_a \mathbb E[N_a(t)]\) yields the desired bound.
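For Bernoulli arms the constant in the theorem is easy to evaluate; the snippet below (arm means are illustrative) computes \(\sum_{a\neq a^*}\Delta_a / D(\mathcal R^a\|\mathcal R^{a^*})\):

```python
import math

def kl_bernoulli(p, q):
    """D(Bernoulli(p) || Bernoulli(q)) in nats."""
    def term(x, y):
        return 0.0 if x == 0.0 else x * math.log(x / y)
    return term(p, q) + term(1.0 - p, 1.0 - q)

mu = [0.2, 0.5, 0.7]            # arm means; the arm with mean 0.7 is optimal
mu_star = max(mu)
# Lai-Robbins constant: regret of any uniformly good algorithm is at least ~ c * log t.
c = sum((mu_star - m) / kl_bernoulli(m, mu_star) for m in mu if m != mu_star)
```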
Upper confidence bound method
The bandit setup highlights the tradeoff between exploration and exploitation. Note that \(\epsilon\)-greedy methods with fixed \(\epsilon\) incur linear regret:
- If \(\epsilon>0\), each step incurs expected regret at least \(\dfrac{\epsilon}{m} \min_{a\neq a^*} \Delta_a\), so cumulative regret grows linearly in \(t\).
- If \(\epsilon=0\), we might be pulling a suboptimal arm greedily forever.
- In general, the decay schedule of \(\epsilon\) is problem-dependent to achieve sublinear regret.
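The linear-regret claim for fixed \(\epsilon\) can be observed in simulation; `eps_greedy_regret` and the Bernoulli arm means below are my own illustrative choices, a sketch rather than the chapter's code:

```python
import random

def eps_greedy_regret(eps, t_max, mu, seed=0):
    """Run epsilon-greedy on a Bernoulli bandit; return cumulative expected regret."""
    rng = random.Random(seed)
    m = len(mu)
    counts, sums = [0] * m, [0.0] * m
    regret = 0.0
    for step in range(t_max):
        if step < m:
            a = step                           # pull each arm once to initialize
        elif rng.random() < eps:
            a = rng.randrange(m)               # explore uniformly
        else:
            a = max(range(m), key=lambda i: sums[i] / counts[i])  # exploit
        reward = 1.0 if rng.random() < mu[a] else 0.0
        counts[a] += 1
        sums[a] += reward
        regret += max(mu) - mu[a]              # expected per-step regret
    return regret

mu = [0.2, 0.5, 0.7]
r_5k = eps_greedy_regret(0.2, 5000, mu)
r_10k = eps_greedy_regret(0.2, 10000, mu)
# Doubling the horizon roughly doubles regret: growth is linear in t.
```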
Theorem 5.2 (Hoeffding's inequality) Given i.i.d. random variables \(X_j \in [0, 1]\) with sample mean \(\bar X_n\), we have \[ P\left[\mathbb E[X] > \bar X_n + u\right] \leq \exp(-2nu^2) \] Rearranging and using symmetry yields a bound on the absolute deviation \[ P\left[\left|\mathbb E[X] - \bar X_n\right| > \sqrt{\dfrac{\log(2/\delta)}{2n}}\right] \leq \delta \]
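The bound can be sanity-checked empirically; the parameters below (a Bernoulli mean of 0.3, \(n=100\), \(\delta=0.1\)) are arbitrary illustrative choices:

```python
import math
import random

rng = random.Random(0)
n, delta, trials = 100, 0.1, 2000
radius = math.sqrt(math.log(2 / delta) / (2 * n))   # Hoeffding confidence radius
p = 0.3                                             # true mean E[X]
violations = 0
for _ in range(trials):
    xbar = sum(rng.random() < p for _ in range(n)) / n
    if abs(p - xbar) > radius:
        violations += 1
violation_rate = violations / trials   # should come out well below delta
```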
Definition 5.8 (UCB algorithm) Given a MAB (multi-arm bandit) with \(m\) arms:
- Sample once from each arm.
- Let \(n_a\) denote the number of times arm \(a\) has been pulled, and \(\hat Q(a)\) the sample mean reward of arm \(a\); choose \[ a_t = \operatorname*{arg\,max}_{a} \hat Q(a) + \sqrt{\dfrac{\log(2/\delta_t)}{2n_a}} \equiv \operatorname*{arg\,max}_{a} \mathrm{UCB}(a, t), \quad \delta_t = 1/t^2 \]
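A minimal sketch of the algorithm on Bernoulli arms (the `ucb_run` helper and the arm means are my own illustrative choices):

```python
import math
import random

def ucb_run(mu, t_max, seed=0):
    """UCB as in Definition 5.8 with delta_t = 1/t^2; returns (regret, pull counts)."""
    rng = random.Random(seed)
    m = len(mu)
    counts, sums = [0] * m, [0.0] * m
    regret = 0.0
    for t in range(1, t_max + 1):
        if t <= m:
            a = t - 1                          # sample once from each arm
        else:
            delta_t = 1.0 / t ** 2
            bonus = lambda i: math.sqrt(math.log(2.0 / delta_t) / (2 * counts[i]))
            a = max(range(m), key=lambda i: sums[i] / counts[i] + bonus(i))
        reward = 1.0 if rng.random() < mu[a] else 0.0
        counts[a] += 1
        sums[a] += reward
        regret += max(mu) - mu[a]
    return regret, counts

regret, counts = ucb_run([0.2, 0.5, 0.7], 10000)
# Suboptimal arms get O(log t / Delta_a^2) pulls, so the optimal arm dominates.
```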
Theorem 5.3 (UCB regret bound) Using the UCB algorithm as in 5.8, the regret satisfies \[ L_t = \sum_{\Delta_a \neq 0} \Delta_a \mathbb E[n_a(t)] \leq O\left( \log t \sum_{\Delta_a\neq 0} \dfrac 1 {\Delta_a} \right) \]
Proof sketch: we analyze how \(\delta_t, n_a, Q(a), \hat Q(a), Q(a^*)\) relate depending on whether the confidence bounds hold.
For a suboptimal arm \(a\) to be pulled at time \(t\), we must have \(\mathrm{UCB}(a, t) \geq \mathrm{UCB}(a^*, t)\). We break the scenario into two cases:
- High-probability case: the Hoeffding bound for \(a\) holds on both sides, and \(\mathrm{UCB}(a^*, t) \geq Q(a^*)\). Then \[ \hat Q(a) - \sqrt{\dfrac{\log(2/\delta_t)}{2n_a}} \leq Q(a) \leq \hat Q(a) + \sqrt{\dfrac{\log(2/\delta_t)}{2n_a }} \] Additionally, since \(a\) is pulled at time \(t\), its upper confidence bound must exceed that of the optimal arm, so \[\begin{align} Q(a^*) \leq \mathrm{UCB}(a^*, t) \leq \mathrm{UCB}(a, t) = \hat Q(a) + \sqrt{\dfrac{\log(2/\delta_t)}{2n_a }} \leq Q(a) + 2 \sqrt{\dfrac{\log(2/\delta_t)}{2n_a }} \end{align}\] Rearrange to yield \[ Q(a^*) - Q(a) = \Delta_a \leq 2 \sqrt{\dfrac{\log(2/\delta_t)}{2n_a }} \implies n_a \leq \dfrac{2}{\Delta_a^2} \log \dfrac{2}{\delta_t} \]
- Low-probability case: the bound for \(a\) fails when \(a\) is chosen at time \(t\), or \(\mathrm{UCB}(a^*, t) \leq Q(a^*)\); in this case there are no guarantees on the number of times \(a\) could have been pulled, so the worst case is \(t\).
- The probability that the bound for \(a\) fails when \(a\) is chosen at time \(t\) is at most \(\delta_t\).
- For the second component, \(P(\mathrm{UCB}(a^*, t) \leq Q(a^*))\) is upper-bounded by the probability that the bound for \(a^*\) fails at some time step in \(1, \dots, t\), which is at most \(\sum_{\tau=1}^t \delta_\tau\). With \(\delta_t = 1/t^2\) these failure probabilities are summable, so the high-probability case dominates, giving \(\mathbb E[n_a(t)] = O(\log t / \Delta_a^2)\) and hence the claimed bound.
Remark (LCB algorithm). An algorithm which selects the action with the greatest lower confidence bound is inherently greedy. When \(Q(a^*)>Q(a)\) yet \(\mathrm{LCB}(a)>\mathrm{LCB}(a^*)\) (e.g. after unlucky initial samples of \(a^*\)), the algorithm keeps pulling \(a\), whose confidence interval only tightens, while \(\mathrm{LCB}(a^*)\) is never updated; the algorithm never explores \(a^*\).
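A deterministic snapshot makes the trap concrete; the counts and sample means below are fabricated to represent an unlucky first sample of the optimal arm:

```python
import math

# Arm 0 is truly optimal, but its single sample happened to return 0.
n = {0: 1, 1: 20}                 # pull counts
q_hat = {0: 0.0, 1: 0.5}          # sample means
delta = 0.01
radius = {a: math.sqrt(math.log(2 / delta) / (2 * n[a])) for a in n}

lcb = {a: q_hat[a] - radius[a] for a in n}
ucb = {a: q_hat[a] + radius[a] for a in n}
lcb_pick = max(lcb, key=lcb.get)  # LCB rule keeps exploiting arm 1 ...
ucb_pick = max(ucb, key=ucb.get)  # ... while optimism still revisits arm 0.
# Pulling arm 1 only tightens its interval, so lcb[0] is never updated: arm 0 starves.
```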
Bayesian bandit
While frequentist inference assumes that there is a fixed distribution parameter, Bayesian inference assumes that there is a prior over parameters. Regret is computed similarly as
\[ L^{\mathrm{Bayes}}_t = \mathop{\mathbb{E}}\limits_{\substack{\theta\sim p(\theta) \\ a_\tau\sim \pi}} \left[ \sum_{\tau=1}^t Q(a^*) - Q(a_\tau) \,\bigg|\, \theta \right] \]
Definition 5.9 (probability matching) In a Bayesian setting, a probability matching algorithm selects action \(a\) with the probability that \(a\) is the optimal action, given the current history \[ \pi(a\mid h_t) = P\left[ Q(a) \geq Q(a')\ \ \forall a'\neq a \mid h_t \right] \]
- Note that \(Q(a), Q(a')\) here are scalars once the reward-distribution parameters are fixed.
- The probability is taken over the posterior distribution of those parameters \(\{\theta_a\}\), conditioned on the history \(h_t\).
Definition 5.10 (Thompson sampling) Given priors \(p(\theta_a)\) over the parameters of each reward distribution \(\mathcal R_a\), repeat at each iteration:
- For each arm, sample a reward-distribution parameter from its posterior \(\hat \theta_a \sim p(\theta_a \mid h_t)\).
- Compute the sampled distribution’s mean \(\hat Q(a) = \mathbb E[r \mid a; \hat \theta_a]\).
- Select \(a = \operatorname*{arg\,max}_{a} \hat Q(a)\) and observe the reward.
- Update the posterior for arm \(a\).
Note that probability matching is inherently optimistic in the face of uncertainty: an arm with a wide posterior has a substantial probability of being sampled as the best. Also note that we sample a reward-distribution parameter and analytically compute the mean reward given that parameter, rather than sampling rewards themselves.
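The steps above specialize nicely to Bernoulli rewards with Beta priors, where the posterior update is conjugate; `thompson_run` and the arm means are illustrative assumptions:

```python
import random

def thompson_run(mu, t_max, seed=0):
    """Beta-Bernoulli Thompson sampling with Beta(1, 1) priors on each arm."""
    rng = random.Random(seed)
    m = len(mu)
    ab = [[1.0, 1.0] for _ in range(m)]   # posterior Beta(alpha_a, beta_a) per arm
    regret = 0.0
    for _ in range(t_max):
        # Sample theta_a from each posterior; for Bernoulli, theta_a IS the mean Q(a).
        theta = [rng.betavariate(alpha, beta) for alpha, beta in ab]
        a = max(range(m), key=lambda i: theta[i])
        reward = 1.0 if rng.random() < mu[a] else 0.0
        ab[a][0] += reward                # conjugate update for the pulled arm only
        ab[a][1] += 1.0 - reward
        regret += max(mu) - mu[a]
    return regret, ab

regret, ab = thompson_run([0.2, 0.5, 0.7], 10000)
pulls = [alpha + beta - 2.0 for alpha, beta in ab]   # pulls per arm
```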