1 Markov Decision Processes
This section explores solutions to finite- and infinite-horizon MDPs with finite state and action spaces, assuming their dynamics and rewards are fully known.
- MDP definition 1.1 and trajectory measure \(\rho^\pi\) 1.2.
- Definition of value \(V(s)\) and action-value \(Q(s, a)\) functions 1.3; their inter-relation yields recursive definitions 1.1, which are succinctly encoded in the Bellman consistency operators 1.4.
- Note that \(\mathcal J^\pi_v\) samples \(a\), then \(s'\) dependent upon \(a\). On the other hand, \(\tilde {\mathcal J}^\pi_q\) samples \(s'\), then \(a'\) dependent upon \(s'\).
- This difference allows \(Q\)-function relations (\(s'\) first) to be conveniently sampled from the environment, justifying its prominence in fitted-DP methods. In a sense, \(Q(s, a)\) natively encodes “one more layer” of the environment dynamics.
- The policy-evaluation problem \(\pi \mapsto V^\pi\) is solved by backward induction (finite horizon) or iterated convergence (infinite horizon); the latter case results from the contraction properties of consistency operators (theorem 1.3).
- Optimality relations for \(V^*, Q^*\) (theorem 1.1), which are encoded in optimality operators 1.6; optimality operators are also shown to be contractions 1.3.
- Greedy policy (definition 1.7). It is also shown (theorem 1.2) that \(V^{\pi^{Q^*}} = V^*\): acting greedily w.r.t. the optimal values accumulates value \(V^*\).
- Note that \(\pi^q\) can be computed without knowledge of environment dynamics.
- Finite-horizon MDPs are solved by backward induction using the optimality relations (definition 1.9).
- Infinite-horizon MDP control problems are solved by iterative approaches:
- Value iteration (proposition 1.2): iterate \(v_t\to V^*\) using the optimality operator, then act greedily w.r.t. \(v_T\).
- Acting greedily w.r.t. a candidate value evaluation (which might not be a consistent value function) does not guarantee the payoff: \(\gamma\)-dependent bound 1.4.
- Policy iteration (theorem 1.5): iterate \(\pi \mapsto \pi^{V^\pi}\), i.e. compute value of current policy, then update policy with one more step of foresight by making locally optimal choices according to the computed value.
- Intuitively, each iteration gives “one more step of foresight” to the associated quantities.
- VI and PI highlight two fundamentally different approaches to solving the control problem:
- VI: recurse the optimality operator, then propose a greedy policy.
- PI: evaluate the current policy, then act greedily w.r.t. the evaluation (fundamentally online).
Preliminaries
Definition 1.1 (finite-horizon Markov decision process (MDP)) A finite MDP consists of:
- A time horizon \(H\in \mathbb N\cup \{+\infty\}\) which specifies the number of interactions.
- State space \(\mathcal S\). A single state is denoted \(s\).
- Action space \(\mathcal A\).
- An initial distribution \(\mu \in \Delta(\mathcal S)\).
- State transition \(P:\mathcal S\times \mathcal A\to \Delta(\mathcal S)\).
- A reward function \(r:\mathcal S\times \mathcal A\to \mathbb R\). We consider deterministic reward functions.
Definition 1.2 (policy, trajectory) A policy is a mapping from states to distributions over actions: \[ \pi:\mathcal S\to \Delta(\mathcal A) \] A trajectory is a tuple of interactions: \[ \tau = (s_0, a_0, r_0, \dots, s_{H-1}, a_{H-1}, r_{H-1}) \] A policy \(\pi\) induces a measure \(\rho^\pi\) on trajectories given by \[\begin{align} \log \rho^\pi(\tau) &= \log \left[\mathbb P(s_0, a_0, r_0) \prod_{t=1}^{H-1} \mathbb P(s_t, a_t, r_t | s_{t-1}, a_{t-1})\right] \\ &= \log \mu(s_0) + \log \pi(a_0|s_0) + \sum_{t=1}^{H-1} \left[\log P(s_t|s_{t-1}, a_{t-1}) + \log \pi(a_t|s_t)\right] \end{align}\]
To accommodate \(H=\infty\), we can also introduce a discount factor \(\gamma\in [0, 1]\) (with \(\gamma < 1\) required when \(H=\infty\)). The core goal of an RL algorithm is to find the policy \[ \pi^* = \operatorname*{arg\,max}_{\pi }\mathop{\mathbb{E}}\limits_{\rho^\pi}[G(\tau)], \quad G(\tau) = \sum_{t=0}^{H-1} \gamma^t r_t \] Note that the return \(G(\tau)\) is a \(\tau\)-dependent random variable.
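To make the trajectory measure concrete, here is a minimal sampling sketch for a tabular MDP. The array layout is an assumed convention (hypothetical names: `mu` of shape `(S,)`, transitions `P` of shape `(S, A, S)`, rewards `r` of shape `(S, A)`, and a row-stochastic policy `pi` of shape `(S, A)`), not anything prescribed by the definitions above.

```python
import numpy as np

def sample_trajectory(mu, P, r, pi, H, rng=None):
    """Sample tau = (s_0, a_0, r_0, ..., s_{H-1}, a_{H-1}, r_{H-1}) from rho^pi."""
    rng = np.random.default_rng() if rng is None else rng
    S, A = r.shape
    tau = []
    s = rng.choice(S, p=mu)              # s_0 ~ mu
    for _ in range(H):
        a = rng.choice(A, p=pi[s])       # a_t ~ pi(. | s_t)
        tau.append((s, a, r[s, a]))
        s = rng.choice(S, p=P[s, a])     # s_{t+1} ~ P(. | s_t, a_t)
    return tau

def discounted_return(tau, gamma):
    """G(tau) = sum_t gamma^t r_t."""
    return sum(gamma**t * r_t for t, (_, _, r_t) in enumerate(tau))
```

The same layout is reused in the sketches throughout this section.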
Value functions
Given a policy, we need to evaluate its utility.
Definition 1.3 ((action-) value function) Given policy \(\pi\), its value function \(V^\pi:\mathcal S\to \mathbb R\) is defined by \[ V^\pi_h(s) = \mathop{\mathbb{E}}\limits_{\tau\sim \rho^\pi}[R_h(\tau)\mid s_h=s] \] here \(R_h(\tau)=\sum_{t=h}^{H-1}\gamma^{t-h} r_t\) is the truncated (discounted) reward-to-go. Similarly, the action-value function is \[ Q^\pi_h(s, a) = \mathop{\mathbb{E}}\limits_{\tau\sim \rho^\pi}[R_h(\tau)\mid s_h=s, a_h=a] \] Note that the finite horizon makes \(V^\pi_h\neq V^\pi_{h-1}\) in general; the \(h\)-dependence can be dropped when \(H=\infty\).
Proposition 1.1 \(V\) and \(Q\) are related by \(Q_h\to V_h, V_{h+1}\to Q_h\) \[\begin{align} V^\pi_h(s) &= \mathop{\mathbb{E}}\limits_{a\sim \pi(\cdot\mid s)}Q^\pi_h(s, a) \\ Q^\pi_h(s, a) &= r(s, a) + \gamma \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, a)}[V^\pi_{h+1}(s')] \end{align}\] Substituting again yields the recursive relations \(V_{h+1}\to V_h, Q_{h+1}\to Q_h\): \[\begin{align} V^\pi_h(s) &= \mathop{\mathbb{E}}\limits_{\substack{a\sim \pi(\cdot\mid s) \\ s'\sim P(\cdot\mid s, a)}}[r(s, a) + \gamma\, V^\pi_{h+1}(s')] \\ Q^\pi_h(s, a) &= r(s, a) + \gamma \mathop{\mathbb{E}}\limits_{\substack{s'\sim P(\cdot\mid s, a) \\ a'\sim \pi(\cdot\mid s')}} Q^\pi_{h+1}(s', a') \end{align}\] These are known as the Bellman consistency relations.
Definition 1.4 (Bellman consistency operators) We can rewrite the consistency relations in terms of consistency operators. Define the (value-) consistency operator \(v\mapsto \mathcal J^\pi_v\) of type \(\mathcal J^\pi_{(-)}: (\mathcal S\to \mathbb R)\to (\mathcal S\to \mathbb R)\): it consumes a state evaluation \(v\) and produces a new state evaluation. The action-value consistency operator \(\tilde{\mathcal J}_{(-)}^\pi\) is defined similarly: \[\begin{align} \mathcal J^\pi_v(s) &= \mathop{\mathbb{E}}\limits_{\substack{a\sim \pi(\cdot\mid s) \\ s'\sim P(\cdot\mid s, a)}}[r(s, a) + \gamma\, v(s')] \\ \tilde{\mathcal J}_{q}^\pi(s, a) &= r(s, a) + \gamma \mathop{\mathbb{E}}\limits_{\substack{s'\sim P(\cdot\mid s, a) \\ a'\sim \pi(\cdot\mid s')}} q(s', a') \end{align}\] Note that \(\mathcal J^\pi_{(-)}\) requires the environment dynamics \(P(\cdot\mid s, a)\) to be known. The consistency operators simplify the recursive definitions: \[ V^\pi_h = \mathcal J^\pi_{V^\pi_{h+1}},\quad Q^\pi_h = \tilde {\mathcal J}^\pi_{Q^\pi_{h+1}} \]
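Under the same assumed tabular layout, both consistency operators are a few lines of linear algebra. Note how \(\mathcal J^\pi\) averages over \(a\) and then \(s'\), while \(\tilde{\mathcal J}^\pi\) averages over \(s'\) and then \(a'\), mirroring the sampling orders discussed in the summary.

```python
import numpy as np

def J_pi(v, P, r, pi, gamma):
    """(J^pi_v)(s) = E_{a ~ pi(.|s), s' ~ P(.|s,a)}[r(s,a) + gamma v(s')]."""
    q = r + gamma * P @ v          # (S, A): r(s,a) + gamma E_{s'} v(s')
    return (pi * q).sum(axis=1)    # (S,):   average over a ~ pi(.|s)

def J_tilde_pi(q, P, r, pi, gamma):
    """(J~^pi_q)(s,a) = r(s,a) + gamma E_{s' ~ P(.|s,a), a' ~ pi(.|s')} q(s',a')."""
    v = (pi * q).sum(axis=1)       # (S,):   E_{a' ~ pi(.|s')} q(s', a'), per s'
    return r + gamma * P @ v       # (S, A): then average over s' ~ P(.|s,a)
```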
Optimality
Definition 1.5 (optimal policy) A policy \(\pi^*\) is optimal if \[ \mathop{\mathbb{E}}\limits_{s_0\sim \mu} V^{\pi^*}_0(s_0) \geq \mathop{\mathbb{E}}\limits_{s_0\sim \mu} V^{\pi}_0(s_0), \quad \forall \pi \in \Pi \] The optimal (action-) value function is defined analogously \[\begin{align} V^*_h(s) = \max_\pi V^\pi_h(s), \quad Q^*_h(s, a) = \max_\pi Q^\pi_h(s, a) \end{align}\]
Theorem 1.1 (optimality relations) \(V^*\) and \(Q^*\) again satisfy the recursive relations \[\begin{align} V^*_h(s) &= \max_a Q^*_h(s, a) \\ &= \max_a \left[ r(s, a) + \gamma \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, a)} V^*_{h+1}(s') \right] \\ Q^*_h(s, a) &= r(s, a) + \gamma \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, a)} V^*_{h+1}(s') \\ &= r(s, a) + \gamma \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, a)} \max_{a'} Q^*_{h+1}(s', a') \end{align}\]
Proof
It suffices to prove \(V^*_h(s) = \max_a Q^*_h(s, a)\) since the rest of the results follow from the \(V\)-\(Q\) relations. Split \(\pi\) into \(\pi_h\) (the policy at step \(h\)) and \(\pi_{h+}\) (the policy thereafter), and note that \(Q^\pi_h\) depends only on \(\pi_{h+}\). Substituting the definition of \(Q^*\) yields \[\begin{align} V^*_h(s) &= \max_\pi V^\pi_h(s) = \max_{\pi_h} \max_{\pi_{h+}} \mathop{\mathbb{E}}\limits_{a\sim \pi_h} Q^{\pi_{h+}}_{h}(s, a) \\ &= \max_{\pi_h} \mathop{\mathbb{E}}\limits_{a\sim \pi_h} \left[ \max_{\pi_{h+}} Q^{\pi_{h+}}_{h}(s, a) \right] = \max_a Q^*_h(s, a) \end{align}\] Exchanging \(\max_{\pi_{h+}}\) with the expectation is valid because a single \(\pi_{h+}\) maximizes \(Q^{\pi_{h+}}_h(s, a)\) for every \((s, a)\) simultaneously; the final step holds because the maximizing distribution \(\pi_h\) concentrates on the maximizing action.

Definition 1.6 (optimality operators) Analogous to the consistency operators, define the \(V\)-optimality operator \(\mathcal J^*_{(-)}\) and \(Q\)-optimality operator \(\tilde{\mathcal J}^*_{(-)}\) so that \[\begin{align} \mathcal J^*_v(s) &= \max_a \left[ r(s, a) + \gamma \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, a)} v(s') \right] \\ \tilde {\mathcal J}^*_q(s, a) &= r(s, a) + \gamma \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, a)} \max_{a'} q(s', a') \end{align}\]
Remark. Note that the optimality equations have a very heavy dynamic programming flavor: solving \(V^*\) (or \(Q^*\)) at time \(h+1\) yields solutions for time \(h\).
Definition 1.7 (greedy policy)
Given an action-value candidate \(q:\mathcal S\times \mathcal A\to \mathbb R\), define the greedy policy \(\pi^q(a\mid s) = 1_{a=\operatorname*{arg\,max}_{a'} q(s, a')}\). Note that \(q\) doesn’t even have to satisfy the consistency equations.

Theorem 1.2 (\(Q^*\)-greedy policy is optimal) Given optimal \(Q^*\), the greedy policy \(\pi^{Q^*}\) is an optimal policy. This is nontrivial because greediness is a local behavior. In other words, \[ V^{\pi^{Q^*}} = V^* \]
Proof
We prove this for the finite-horizon case. Optimality is trivial at the terminal step \(h=H-1\). Fix any \(\pi\); inductively, assuming \(V^{\pi^{Q^*}}_{h+1} = V^*_{h+1}\), \[\begin{align} V^\pi_h(s) = \mathop{\mathbb{E}}\limits_{\tau \sim \rho^\pi} \left[\sum_{t=h}^{H-1} \gamma^{t-h}\, r_t \,\middle|\, s_h = s\right] &= \mathop{\mathbb{E}}\limits_{\substack{a_h\sim \pi_h(s_h) \\ s'\sim P(\cdot\mid s_h, a_h)}}\left[ r(s_h, a_h) + \gamma V^\pi_{h+1}(s') \right] \\ &\leq \mathop{\mathbb{E}}\limits_{\substack{a_h\sim \pi_h(s_h) \\ s'\sim P(\cdot\mid s_h, a_h)}}\left[ r(s_h, a_h) + \gamma V^{\pi^{Q^*}}_{h+1}(s') \right] \\ &= \mathop{\mathbb{E}}\limits_{a_h\sim \pi_h(s_h)}Q^*_h(s, a_h) \leq \max_a Q^{*}_h(s, a) = V^{\pi^{Q^*}}_h(s) \end{align}\] The last equality holds because \(\pi^{Q^*}\) picks the maximizing action and, by the inductive hypothesis, realizes the continuation value \(V^*_{h+1}\). In short: expand to rewrite using \(V_{h+1}\), apply the inductive inequality, then regroup.

Definition 1.8 (contraction) We equip the space of \(v\) and \(q\) functions with the \(\sup\)-norm \(\|\cdot\|_\infty\) given by \[ \|v\|_\infty = \sup_s |v(s)|, \quad \|q\|_\infty = \sup_{s, a} |q(s, a)| \] An operator \(\mathcal J\) is a \(\gamma\)-contraction, \(\gamma\in (0, 1)\), if for any \(u, v\), we have \[ \|\mathcal J(v) - \mathcal J(u)\| \leq \gamma \|v-u\| \]
Theorem 1.3 (contraction properties) For any policy \(\pi\), the consistency operators \(\mathcal J^\pi_{(-)}\) and \(\mathcal{\tilde J}^\pi_{(-)}\) are \(\gamma\)-contractions. The optimality operators \(\mathcal J^*_{(-)}\) and \(\mathcal{\tilde J}^*_{(-)}\) are also \(\gamma\)-contractions.
Proof
Recall the consistency operator definition 1.4; apply Jensen’s inequality and the definition of the \(\sup\)-norm: \[\begin{align} |\tilde {\mathcal J}^\pi_q(s, a) - \tilde {\mathcal J}^\pi_{q'}(s, a)| &= \gamma \left| \mathop{\mathbb{E}}\limits_{\substack{s'\sim P(\cdot\mid s, a) \\ a'\sim \pi(\cdot\mid s')}} \left[q(s', a') - q'(s', a')\right] \right| \\ &\leq \gamma \mathop{\mathbb{E}}\limits_{\substack{s'\sim P(\cdot\mid s, a) \\ a'\sim \pi(\cdot\mid s')}} \left| q(s', a') - q'(s', a') \right| \leq \gamma \|q-q'\| \end{align}\] The argument for \(V\) follows similarly. For the optimality operators, recall definition 1.6 and apply lemma 1.1 (fixing \(s'\)): \[\begin{align} \left| \tilde {\mathcal J}^*_q(s, a) - \tilde {\mathcal J}^*_{q'}(s, a)\right| &= \gamma \left|\mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, a)} \left[ \max_{a_0} q(s', a_0) - \max_{a_1} q'(s', a_1) \right]\right| \\ &\leq \gamma \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, a)} \left| \max_{a_0} q(s', a_0) - \max_{a_1} q'(s', a_1) \right| \leq \gamma \|q-q'\| \end{align}\]

Lemma 1.1 \(|\max_i a_i - \max_j b_j| \leq \max_j |a_j - b_j|\).
Proof
Let \(\max_j a_j = a_{j^*}\), then \[ \max_i a_i - \max_j b_j = a_{j^*} - \max_j b_j \leq a_{j^*} - b_{j^*} \leq \max_j |a_j - b_j| \] Swap \(a\leftrightarrow b\) to obtain the desired inequality.

Policy evaluation
Policy evaluation answers the following question:
- Given a policy \(\pi\), what are \(V^\pi\) and \(Q^\pi\)?
Remark (finite-horizon policy evaluation). In the finite horizon case, apply the consistency operators inductively to the trivial base case (terminal timestep). Algorithm:
- For \(t=H-1\dots 0\) do:
- For each state \(s\), compute \(\mathcal J^\pi_v(s)\) (definition 1.4) by iterating over \(a\in \mathcal A\) and \(s'\in \mathcal S\).
The algorithmic complexity is \(O(H\cdot |\mathcal A|\cdot |\mathcal S|^2)\). The action-value function is computed similarly.
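A sketch of this backward induction under the assumed tabular layout; for brevity it evaluates a stationary policy, though the loop extends directly to time-dependent \(\pi_h\). The vectorized `P @ V[h + 1]` hides the \(O(|\mathcal S|^2\cdot|\mathcal A|)\) inner loops.

```python
import numpy as np

def evaluate_finite_horizon(P, r, pi, H, gamma=1.0):
    """Backward induction: V^pi_h = J^pi applied to V^pi_{h+1}, base case V^pi_H = 0."""
    V = np.zeros((H + 1, r.shape[0]))
    for h in range(H - 1, -1, -1):
        q = r + gamma * P @ V[h + 1]    # Q^pi_h(s, a)
        V[h] = (pi * q).sum(axis=1)     # V^pi_h(s) = E_{a ~ pi(.|s)} Q^pi_h(s, a)
    return V[:H]
```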
Remark (infinite-horizon policy evaluation). For the infinite-horizon case, we use the fixed-point relation \(Q^\pi = \tilde{\mathcal J}^\pi_{Q^\pi}\). Since the consistency operator is a contraction, fixed-point iteration converges at rate \(\gamma\). Algorithm: initialize \(q_0(s, a)=0\) and iteratively apply \(q_{t+1} = \tilde{\mathcal J}^\pi_{q_t}\).
Note that \(\|q_0 - Q^\pi\|\leq R/(1-\gamma)\), where \(R\) is the uniform reward bound. To achieve \(\epsilon\)-error, it suffices to take \[\begin{align} T(\epsilon) \geq \dfrac 1 {1-\gamma} \log \dfrac{R}{\epsilon(1-\gamma)}, \quad\text{since then}\quad \|q_{T(\epsilon)} - Q^\pi\| \leq \gamma^{T(\epsilon)} \|q_0 - Q^\pi\| \leq \gamma^{T(\epsilon)}\, \dfrac{R}{1-\gamma} \leq \epsilon \end{align}\]
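A sketch of the corresponding fixed-point iteration, run for the \(T(\epsilon)\) derived above; the uniform reward bound `R` is simply read off the reward table.

```python
import numpy as np

def evaluate_infinite_horizon(P, r, pi, gamma, eps):
    """Iterate q <- J~^pi q from q_0 = 0 for T(eps) steps."""
    R = np.abs(r).max()                 # uniform reward bound
    T = max(int(np.ceil(np.log(R / (eps * (1 - gamma))) / (1 - gamma))), 1)
    q = np.zeros(r.shape)
    for _ in range(T):
        v = (pi * q).sum(axis=1)        # E_{a' ~ pi(.|s')} q(s', a')
        q = r + gamma * P @ v           # q_{t+1} = J~^pi applied to q_t
    return q
```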
Optimal solutions
The ultimate goal of RL is to answer the following question:
- Given a MDP, how to compute the optimal policy?
Definition 1.9 (finite-horizon (action-)value iteration) By theorem 1.2, computing \(Q^*\) suffices to yield the optimal policy. The value-iteration method uses the optimality recursion (definition 1.6) to compute \(Q^*\):
When computing \(V^*\), the algorithmic complexity is \(O(H\cdot |\mathcal S|^2 \cdot |\mathcal A|)\). Computing \(Q^*\) proceeds as follows:
- Initialize \(Q^*_{H-1}(s, a)=r(s, a)\); this is the base case.
- For \(t=H-2, \dots, 0\) do:
- For each \(s, a\), compute \(Q^*_t(s, a) = \tilde {\mathcal J}^*_{Q^*_{t+1}}(s, a)\) by iterating over \(s'\in \mathcal S\) (weighted by \(P(s'\mid s, a)\)) and \(a'\in \mathcal A\).
- Output the \(Q^*\)-greedy policy.
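A sketch of this procedure under the assumed tabular layout; the returned `greedy[t, s]` array encodes the deterministic policy \(\pi^{Q^*}_t\).

```python
import numpy as np

def finite_horizon_value_iteration(P, r, H, gamma=1.0):
    """Backward induction with the Q-optimality operator; returns Q*_t and pi^{Q*}_t."""
    Q = np.zeros((H,) + r.shape)
    Q[H - 1] = r                                     # base case Q*_{H-1} = r
    for t in range(H - 2, -1, -1):
        Q[t] = r + gamma * P @ Q[t + 1].max(axis=1)  # J~* applied to Q*_{t+1}
    greedy = Q.argmax(axis=2)                        # greedy[t, s] = argmax_a Q*_t(s, a)
    return Q, greedy
```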
Proposition 1.2 (infinite-horizon value iteration) For infinite MDPs, value iteration (VI) proceeds as follows:
- Initialize \(v_0 = 0\).
- Repeat until convergence:
- \(v_{t+1} = \mathcal J^*_{v_t}\).
- Output the \(v_T\)-greedy policy \(\pi^{v_T}\).
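A sketch of VI with a sup-norm stopping rule (the tolerance `tol` is an assumed knob, not part of the proposition). Note that extracting the greedy policy from the state-value \(v_T\) requires the dynamics `P` for the one-step lookahead.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """Iterate v <- J* v until the sup-norm update falls below tol, then act greedily."""
    v = np.zeros(r.shape[0])
    while True:
        q = r + gamma * P @ v                 # one-step lookahead on v
        v_next = q.max(axis=1)                # (J* v)(s) = max_a [...]
        if np.abs(v_next - v).max() < tol:
            return v_next, q.argmax(axis=1)   # v_T and the v_T-greedy policy
        v = v_next
```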
Two remarks:
- \(v_T\) might not be a consistent Bellman value function.
- Moreover, the value function of the \(v_T\)-greedy policy is, in general, different from \(v_T\).
- We only have equality \(V^{\pi^{V^*}} = V^*\) in the optimal case.
We can derive an explicit bound on how well the greedy policy “reconstructs” its value function:
Theorem 1.4 (greedy policy value bound) Let \(\pi^v\) be the greedy policy w.r.t. the one-step lookahead on \(v:\mathcal S\to \mathbb R\) (note that \(v\) need not satisfy consistency relations), then \[ \|V^{\pi^v} - V^* \| \leq \dfrac{2\gamma}{1-\gamma} \|v-V^*\| \]
Proof
Assume deterministic policies and denote \(\hat \pi = \pi^v,\ \hat V = V^{\hat \pi}\): \[\begin{align} V^*(s) - \hat V(s) &= Q^*(s, \pi^*(s)) - Q^{\hat \pi}(s, \hat \pi(s)) \\ &= \left[ Q^*(s, \pi^*(s)) - Q^*(s, \hat \pi(s)) \right] + \left[ Q^*(s, \hat \pi(s)) - Q^{\hat \pi}(s, \hat \pi(s)) \right] \end{align}\] Let \(q(s, a) = r(s, a) + \gamma\, \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, a)} v(s')\) so that \(\hat \pi(a\mid s) = 1_{a=\operatorname*{arg\,max}_{a'} q(s, a')}\). By the greedy construction of \(\hat \pi\), \[ q(s, \hat \pi(s)) \geq q(s, \pi^*(s)) \implies q(s, \hat \pi(s)) - q(s, \pi^*(s)) \geq 0 \] Add this to the first square bracket to obtain \[\begin{align} Q^*(s, \pi^*(s)) - Q^*(s, \hat \pi(s)) &\leq \left[ Q^*(s, \pi^*(s)) - q(s, \pi^*(s)) \right] + \left[ q(s, \hat \pi(s)) - Q^*(s, \hat \pi(s)) \right] \\ &\leq 2\gamma \|v-V^*\| \end{align}\] The last inequality comes from expanding \(q\) and \(Q^*\): each bracket reduces to \(\gamma\,\mathbb E_{s'}[V^*(s') - v(s')]\) up to sign. The second square bracket is bounded similarly: \[\begin{align} Q^*(s, \hat \pi(s)) - Q^{\hat \pi}(s, \hat \pi(s)) &= \gamma\, \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, \hat \pi(s))} \left[V^*(s') - \hat V(s')\right] \\ &\leq \gamma \|V^* - \hat V\| \end{align}\] Combining the bounds yields \[\begin{align} \|V^* - \hat V\| &\leq 2\gamma \|v-V^*\| + \gamma \|V^* - \hat V\| \end{align}\] and rearranging proves the inequality, as claimed.

We can alternatively iterate the policy.
Definition 1.10 (infinite-horizon policy iteration) Consider the following algorithm:
- Initialize a uniform policy \(\pi_0\).
- Repeat until convergence:
- \(\pi_{t+1} = \pi^{V^{\pi_t}}\).
- In human words: compute the value function of \(\pi_t\) and act greedily w.r.t. the value function.
The intuition behind this construction: for the operator \(\mathcal G:\pi \mapsto \pi^{V^\pi}\), the optimal greedy policy \(\pi^{V^*}\) is a fixed point.
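A sketch of policy iteration for deterministic policies, using exact evaluation by a linear solve (this anticipates the closed-form remark at the end of this section) followed by the greedy update.

```python
import numpy as np

def policy_iteration(P, r, gamma):
    """Iterate pi <- greedy(V^pi) until the policy is stable."""
    S = r.shape[0]
    pi = np.zeros(S, dtype=int)                # deterministic policy: pi[s] = action index
    while True:
        P_pi = P[np.arange(S), pi]             # (S, S): P(s' | s, pi(s))
        r_pi = r[np.arange(S), pi]             # (S,):   r(s, pi(s))
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # exact evaluation of pi
        pi_next = (r + gamma * P @ V).argmax(axis=1)          # greedy improvement
        if np.array_equal(pi_next, pi):
            return pi, V
        pi = pi_next
```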
Theorem 1.5 (policy iteration converges optimally) Policy iteration converges to the optimal policy at the rate \[ \| V^{\mathcal G(\pi)} - V^* \| \leq \gamma \|V^\pi - V^*\| \]
Proof
The proof is complete once we show \(V^{\mathcal G(\pi)} \geq \mathcal J^*_{V^\pi} \geq V^\pi\): together with \(V^* = \mathcal J^*_{V^*}\) and the contraction property of the Bellman optimality operator (theorem 1.3), this gives \[ \| V^* - V^{\mathcal G(\pi)}\| \leq \| V^* - \mathcal J^*_{V^\pi} \| \leq \gamma \|V^* - V^\pi\| \] First note that \(\mathcal J^*_{V^\pi} \geq \mathcal J^\pi_{V^\pi} = V^\pi\), since the max over actions dominates the average over \(\pi\). Next, write \(\mathcal G(\pi)(s)\) for the greedy action; it is exactly the action maximizing the one-step lookahead on \(V^\pi\), i.e. \[\begin{align} \mathcal J^*_{V^\pi}(s) &= \max_a \left[ r(s, a) + \gamma \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, a)} V^\pi(s') \right] \\ &= r(s, \mathcal G(\pi)(s)) + \gamma \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, \mathcal G(\pi)(s))} V^\pi(s') \end{align}\] Comparing this with the consistency relation for \(V^{\mathcal G(\pi)}\) yields \[\begin{align} V^{\mathcal G(\pi)}(s) - \mathcal J^*_{V^\pi}(s) &= \gamma \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, \mathcal G(\pi)(s))}\left[ V^{\mathcal G(\pi)}(s') - V^\pi(s') \right] \end{align}\] Since \(\mathcal J^*_{V^\pi} \geq V^\pi\), we have \(V^{\mathcal G(\pi)}(s) - V^\pi(s) \geq V^{\mathcal G(\pi)}(s) - \mathcal J^*_{V^\pi}(s)\); applying the displayed identity recursively shows \(V^{\mathcal G(\pi)} \geq V^\pi\), and hence \(V^{\mathcal G(\pi)} - \mathcal J^*_{V^\pi} \geq 0\). In short, the proof idea is:
- We need to show that \(V^{\mathcal G(\pi)} \geq \mathcal J^*_{V^\pi}\) to apply the convergence theorem.
- We note that \(V^{\mathcal G(\pi)} - \mathcal J^*_{V^\pi}\) is an expectation of \(V^{\mathcal G(\pi)} - V^\pi\), so it suffices to show that \(V^{\mathcal G(\pi)} \geq V^\pi\).
- To show the desired relation above, apply \(V^\pi \leq \mathcal J^*_{V^\pi}\) recursively.
Remark. Note that the value function of a deterministic policy can be computed in closed form for an infinite-horizon, finite-space MDP with deterministic rewards. Consider quantities of types \[ P^\pi\in \mathbb R^{|\mathcal S|\times |\mathcal S|}, \quad r^\pi, \mu\in \mathbb R^{|\mathcal S|} \] Here \(\langle s|P^\pi|s'\rangle= P(s'\mid s, \pi(s))\) and \(r^\pi(s) = r(s, \pi(s))\). In this notation, the Bellman consistency conditions can be solved in closed form: \[ V^\pi = r^\pi + \gamma\, P^\pi V^\pi \implies V^\pi = (I - \gamma\, P^\pi)^{-1} r^\pi \]
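Spelled out in code (same assumed layout; this is exactly the evaluation step used in the policy-iteration sketch above):

```python
import numpy as np

def evaluate_closed_form(P, r, pi, gamma):
    """Solve (I - gamma P^pi) V = r^pi for a deterministic policy (action indices)."""
    S = r.shape[0]
    P_pi = P[np.arange(S), pi]    # <s|P^pi|s'> = P(s' | s, pi(s))
    r_pi = r[np.arange(S), pi]    # r^pi(s) = r(s, pi(s))
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```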
Remark (VI vs PI). Note that policy iteration (PI) cannot take more than \(|\mathcal A|^{|\mathcal S|}\) iterations, since there are only that many deterministic policies. Value iteration, on the other hand, might take more: distinct value estimates need not induce distinct greedy policies.
- Example: consider a single-state MDP with \(r(s, a)=1\), \(\gamma=0.9\), and \(v_0=0\). After one iteration \(v_1(s)=1\), while \(V^*(s)=10\): VI still has to close the gap even though every policy is already optimal.