2 Sample-based MDP solutions

This section explores RL solutions when the environment dynamics \(P(\cdot\mid s, a)\) and rewards are not known. This is also known as the model-free regime.

  1. We overcome the model-free constraint by using sampling (and consistent estimations) to replace expectations.
    • TD learning is the direct extension of iteration-convergence methods.
  2. Policy evaluation + greedy improvement = solving the control problem (recall policy iteration). Policy evaluation methods:
    1. Monte-Carlo: high variance, no Markov conditions; converges to MSE-optimal estimates.
    2. Temporal Difference (TD): low variance, need Markov conditions; converges asymptotically (when run against static data) to DP solution on empirical distribution of data.
    3. TD in asymptotic batched setting = analytic solution of policy evaluation under empirical distribution.
  3. Difference between on-policy v. off-policy (definition 2.4), and online v. offline methods (definition 2.5.
    • Action-value iteration is offline since we do not need to sample-approximate any \(\mathop{\mathbb{E}}\limits_{a\sim \pi(s)}\).
    • Policy-iteration is online since it involves policy evaluation.
  4. The three methods explored here are sampling approximations of the previous chapter’s methods:
    • Monte-Carlo policy iteration = (policy evaluation using MC) + policy iteration: online & on-policy.
    • Temporal-Difference policy iteration (SARSA) = (policy evaluation using TD) + policy iteration: online & on-policy.
    • \(Q\)-learning 2.9 = action-value iteration.
  5. \(Q\)-learning is fundamentally a action-value iteration engine; online versions (2.1) merely serve to obtain more relevant samples.
  6. Deep \(Q\)-learning = (\(Q\)-function approximation) + (sampling approximation) + (incremental online iteration to obtain relevant samples) + (action-value iteration).
    • Target network & experience replay buffers.

Model-free policy evaluation

We first consider the policy evaluation problem \(\pi \mapsto Q^\pi\). We are more interested in \(Q^\pi\) since \(\mathrm{argmax}\) directly yields a greedy policy; in contrast, there’s no way to build a greedy policy from \(V^\pi\) in the model-free regime.

Definition 2.1 (evaluation criteria) We want our evaluation (estimation) protocol to be:

  1. Consistent: with enough data, estimate converges to the true value. \[ \lim_{n\to \infty} \mathbb P(|\hat \theta_n - \theta|>\epsilon_{\forall}) = 0 \]
  2. Computationally feasible: updating estimates is easy.
  3. Empirically estimate: MSE, bias, and variance.

Recall that the return is defined as \[ G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \]

Definition 2.2 (Monte-Carlo Policy Evaluation) Given the following assumptions:

  1. All trajectories are finite.
  2. States and dynamics need not be Markov.

Compute \(V(s)\) as follows:

  1. Sample episode \(j = ((s_{jk}, a_{jk}, r_{jk}))_k\).
  2. For the first occurence of each \(s\) in each episode, accumulate the empirical reward \(G_{jt}\mid s_{jt}=s\). Return the average.

Alternative to the first-visit method above, we can also compute the every-visit MC: e.g. for the first trajectory \(s_0, s_1, s_0, \dots\), we’ll accumulate two samples \(G_0, G_2\) for \(V(s_0)\). We may also consider incremental MC by choosing a LR and set \[ V^\pi(s_{jt}) \leftarrow V^\pi(s_{jt}) + \alpha[G_{jt} - V^\pi(s_{jt})] \]

Remark. Monte-Carlo properties:

  1. First-visit MC is unbiased and consistent.
  2. Every-visit MC is biased but consistent, and usually smaller MSE (better sample efficiency).
  3. Incremental MC converges for \(\sum \alpha_j = \infty, \quad \sum \alpha_j^2 < \infty\).
  4. Limitations of MC: (1) high-variance, and (2) requires finite horizon: trajectories must end before updates can be computed.

Recall the consistency relations \[\begin{align} V^\pi_h(s) &= % \mathop{\mathbb{E}}\limits_{\substack{a\in \pi(\cdot\mid s) \\ s'\in P(\cdot\mid s, a)}}[r(s, a) + \gamma\, V^\pi_{h+1}(s')] \\ Q^\pi_h(s, a) &= r(s, a) + \gamma % \mathop{\mathbb{E}}\limits_{\substack{s'\in P(\cdot\mid s, a) \\ a'\in \pi(\cdot\mid s')}} Q^\pi_{h+1}(s', a') \end{align}\] This suggests that we should nudge \(V^\pi(s) \leftarrow \mathcal J^\pi_{V^\pi}(s)\); expectations can be replaced by sampling.

Definition 2.3 (TD(0) algorithm) Given \(\pi\), estimate \(V^\pi\) as follows:

  1. Initialize \(V^\pi =0\).
  2. Repeatedly sample tuple \((s_t, a_t, r_t, s_{t+1})\) and compute \[ V^\pi(s_t) \leftarrow V^\pi + \alpha([r_t + \gamma V^\pi(s_{t+1})] - V^\pi(s_t)) \] One can also imagine “unrolling” the consistency relations twice (or more) before replacing expectations with sampling. This is the idea behind TD(\(\lambda\)) learning. More unrolls nudge the algorithm closer to MC sampling, which has lower bias but higher variance. The same can be done for \(Q^\pi\) by sampling \((s_t, a_t, r_t, s_{t+1}\); the expectation over \(a'\) can be computed analytically since we have access to \(\pi\).
  1. TD is a combination of Monte-Carlo (one-step) and dynamic programming methods!
  2. TD can be used for episodic or infinite-horizon settings.
  3. Markov condition is necessary since this underpins the validity of the Bellman consistency relations.
  4. Generally lower variance.
  5. Converges for \(\alpha_j\) subject to the same conditions in MC.
  6. If \(\alpha=1\), then TD in MDPs with stochastic choices of actions might oscillate forever (\(\alpha=1\) results in really bad.

Let us consider the batch learning scenario, where we have collected a finite number of samples and run the policy evaluation algorithms to convergence.

  • Note that running TD twice on the same trajectory still generally updates the policy.
  • The two methods will have different behaviors!

Example 2.1 (Sutton & Barto Ex 6.4) Consider two states \(A, B\) with \(\gamma=1\). Suppose we have collected the following \(8\) episodes:

  1. \(A, 0, B, 0\).
  2. \(B, 1\) colleted \(6\) times.
  3. \(B, 0\).

Then both TD and MC converge to \(V(B)=0.75\). However, \(V_{\mathrm{MC}}(A)=0\) while \(V_{\mathrm{TD}}(A)=0.75\).

Theorem 2.1 (asymptotic batch MC and TD) In a batched setting, MC converges to values with minimum squared error, while TD(0) converges to DP policy for the MDP with MLE model estimates.

To see why, recall that TD with convergent \(\alpha\) is equivalent to iterated convergence using the consistency operator. Then running TD on batched data is equivalent to iterated policy evaluation on the empirical (MLE) distribution of the environment. In the example above, the DP structure “baked in” to the TD algorithm empowered more data-efficient results.

Model free control

Definition 2.4 (on (off)-policy algorithms) A learning algorithm is on policy if its update rule depends on samples using the current policy.

Definition 2.5 (online (offline) algorithms) An online learning algorithm requires interaction with the environment during learning, while an offline algorithm operates with a static dataset.

  1. Online (offline) asks the question: does this interact with the environment during learning?
  2. On-policy asks the question: must update depend on data computed using the latest policy?

In the previous section, we have seen how Monte-Carlo and TD can help evaluate policies; we have also seen how policy iteration 1.10 implies that policy evaluation + policy improvement (acting greedily w.r.t. evaluation) iterates to solve the control problem .

Definition 2.6 (Monte-Carlo policy iteration) MCPI combines \(Q\)-evaluation using MC with policy iteration; it is online, on-policy. Algorithm:

  1. Initialize \(Q=0\) and \(\pi\) acting \(\epsilon\)-greedily w.r.t. \(Q\).
  2. Repeat until convergence:
    • Sample trajectory \(\tau\sim \rho^\pi\) and extract \(\{(s_t, a_t, G_t)\}\) from trajectory.
    • Update \(Q(s_t, a_t) \leftarrow \bar \alpha Q(s_t, a_t) + \alpha G_t\) (these two steps can be repeated before proceeding).
    • Update \(\pi \leftarrow \epsilon\text{-greedy}(\pi^{Q})\).

GLIE Monte-Carlo PI is guaranteed to converge to the optimal policy.

Definition 2.7 (GLIE property) A learning strategy is Greedy in the Limit of Infinite Exploration (GLIE)if

  1. All state-action pairs are visited an infinite number of times.
  2. Behavior policy converges to greedy policy.

A simple \(\epsilon\)_scheduling GLIE policy sets \(\epsilon_k = 1/k\).

Definition 2.8 (temporal-difference policy iteration) (policy eval using TD) + (policy iteration) is also known as SARSA. It is an online, on-policy algorithm:

  1. Initialize \(Q=0\) and \(\epsilon\)-greedy \(\pi\).
  2. Repeat until convergence:
    • Observe \((s_t, a_t, r_t, s_{t+1}, a_{t+1})\).
    • Update \(Q(s_t, a_t) = \bar \alpha Q(s_t, a_t) + \alpha \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) \right]\).
      • Note that alternatively, we only need \(s_{t+1}\) and replace \(\gamma Q(s_{t+1}, a_{t+1}) \mapsto \gamma \mathop{\mathbb{E}}\limits_{a_{t+1}\sim \pi(s_{t+1})} Q(s_{t+1}, a_{t+1})\).
    • The previous two steps can be repeated multiple times before proceeding.
    • Update \(\pi \leftarrow \epsilon\text{-greedy}(\pi^{Q})\).

SARSA converges optimally assuming \(\pi_t\) is GLIE (e.g. if \(\epsilon_t \propto 1/t\)) and if \(\alpha_t\) is Robbins-Munro (\(\|\alpha_t\|_2 < \infty = \|\alpha_t\|_1\)).

Definition 2.9 (Q-learning) \(Q\)-learning is the sampling variant of action-value iteration (definition 1.9). In its original form, it is offline, off-policy.

  1. Initialize \(Q=0\).
  2. Given a static list of trajectories, repeat until convergence:
    • For each \((s, a, r, s')\):
    • \(Q(s, a) \leftarrow \bar \alpha Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') \right]\).
  3. Repeat \(Q\)-greedy policy.

Note that \(Q\)-learning in the asymptotic batch-setting limit is conceptually equivalent to solving MDP control on the empirical environment distribution.

Proposition 2.1 (variants of Q-learning) Although \(Q\)-learning (sampling action-value iteration) is offline, to obtain more effective samples, a more prominent online version combines value iteration with policy iteration; this version is online, off-policy:

  1. Initialize \(Q=0\) and random policy \(\pi\).
  2. Repeat until convergence:
    • Generate trajectories using \(\pi\).
    • Repeat for a certain number of steps (or until convergence):
      • For each \((s, a, r, s')\) in trajectory, update \(Q(s, a)\).
    • Update \(\pi\) to be an \(\epsilon\)-greedy policy w.r.t. \(Q\).

DQN (Deep Q Networks) build on the online version above with:

  1. Function approximation of \(Q\) using neural networks.
  2. To mitigate the non-i.i.d. data, use experience replay buffers.
  3. To mitigate nonstationary target in the Bellman optimality operator, update the \(Q\)-target slowly to improve stability.
    • Intuition: while the Bellman optimality operator is a contraction, using functional approximation might break this property and lead to oscillation / divergence.