1 Markov Decision Processes
This section explores solutions to finite- and infinite-horizon MDPs with finite state and action spaces, assuming their dynamics and rewards are fully known.
- MDP definition 1.1 and trajectory measure \(\rho^\pi\) 1.2.
- Definition of value \(V(s)\) and action-value \(Q(s, a)\) functions 1.3; their inter-relation yields recursive definitions 1.1, which are succinctly encoded in the Bellman consistency operators 1.4.
- Note that \(\mathcal J^\pi_v\) samples \(a\), then \(s'\) dependent upon \(a\). On the other hand, \(\tilde {\mathcal J}^\pi_q\) samples \(s'\), then \(a'\) dependent upon \(s'\).
- This difference allows \(Q\)-function relations (\(s'\) first) to be conveniently sampled from the environment, justifying its prominence in fitted-DP methods. In a sense, \(Q(s, a)\) natively encodes “one more layer” of the environment dynamics.
- The policy-evaluation problem \(\pi \mapsto V^\pi\) is solved by backward induction (finite horizon) or iterated convergence (infinite horizon); the latter case results from the contraction properties of consistency operators (theorem 1.3).
- Optimality relations for \(V^*, Q^*\) (theorem 1.1), which are encoded in optimality operators 1.6; optimality operators are also shown to be contractions 1.3.
- Greedy policy (definition 1.7). It is also shown (theorem 1.2) that \(V^{\pi^{Q^*}} = V^*\): acting greedily w.r.t. the optimal values accumulates value \(V^*\).
- Note that \(\pi^q\) can be computed without knowledge of environment dynamics.
- Finite-horizon MDPs are solved by backward induction using the optimality relations (definition 1.9).
- Infinite-horizon MDP control problems are solved by iterative approaches:
- Value iteration (proposition 1.2): iterate \(v_t\to V^*\) using the optimality operator, then act greedily w.r.t. \(v_T\).
- Acting greedily w.r.t. a candidate value evaluation (which might not be a consistent value function) does not guarantee the payoff: \(\gamma\)-dependent bound 1.4.
- Policy iteration (theorem 1.5): iterate \(\pi \mapsto \pi^{V^\pi}\), i.e. compute value of current policy, then update policy with one more step of foresight by making locally optimal choices according to the computed value.
- Intuitively, each iteration gives “one more step of foresight” to the associated quantities.
- VI and PI highlight two fundamentally different approaches to solving the control problem:
- VI: recurse the optimality operator, then propose a greedy policy.
- PI: evaluate the current policy, then act greedily w.r.t. the evaluation (fundamentally online).
Preliminaries
Definition 1.1 (finite-horizon Markov decision process (MDP)) A finite MDP consists of:
- A time horizon \(H\in \mathbb N\cup \{+\infty\}\) which specifies the number of interactions.
- State space \(\mathcal S\). A single state is denoted \(s\).
- Action space \(\mathcal A\).
- An initial distribution \(\mu \in \Delta(\mathcal S)\).
- State transition \(P:\mathcal S\times \mathcal A\to \Delta(\mathcal S)\).
- A reward function \(r:\mathcal S\times \mathcal A\to \mathbb R\). We consider deterministic reward functions.
Definition 1.2 (policy, trajectory) A policy is a mapping from states to distributions over actions: \[ \pi:\mathcal S\to \Delta(\mathcal A) \] A trajectory is a tuple of interactions: \[ \tau = (s_0, a_0, r_0, \dots, s_{H-1}, a_{H-1}, r_{H-1}) \] A policy \(\pi\) induces a measure \(\rho^\pi\) on trajectories given by \[\begin{align} \log \rho^\pi(\tau) &= \log \left[\mathbb P(s_0, a_0, r_0) \prod_{t=1}^{H-1} \mathbb P(s_t, a_t, r_t | s_{t-1}, a_{t-1})\right] \\ &= \log \mu(s_0) + \log \pi(a_0|s_0) + \sum_{t=1}^{H-1} \left[\log P(s_t|s_{t-1}, a_{t-1}) + \log \pi(a_t|s_t)\right] \end{align}\]
To accommodate \(H=\infty\), we can also introduce a discount factor \(\gamma\in [0, 1]\) (with \(\gamma < 1\) required when \(H=\infty\)). The core goal of an RL algorithm is to find the policy \[ \pi^* = \operatorname*{arg\,max}_{\pi }\mathop{\mathbb{E}}\limits_{\rho^\pi}[G(\tau)], \quad G(\tau) = \sum_{t=0}^{H-1} \gamma^t r_t \] Note that the return \(G(\tau)\) is a \(\tau\)-dependent random variable.
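To make the trajectory measure concrete, here is a minimal sampling sketch for a tabular MDP. The array layout is an assumed convention (hypothetical names: `mu` of shape `(S,)`, transitions `P` of shape `(S, A, S)`, rewards `r` of shape `(S, A)`, and a row-stochastic policy `pi` of shape `(S, A)`), not anything prescribed by the definitions above.

```python
import numpy as np

def sample_trajectory(mu, P, r, pi, H, rng=None):
    """Sample tau = (s_0, a_0, r_0, ..., s_{H-1}, a_{H-1}, r_{H-1}) from rho^pi."""
    rng = np.random.default_rng() if rng is None else rng
    S, A = r.shape
    tau = []
    s = rng.choice(S, p=mu)              # s_0 ~ mu
    for _ in range(H):
        a = rng.choice(A, p=pi[s])       # a_t ~ pi(. | s_t)
        tau.append((s, a, r[s, a]))
        s = rng.choice(S, p=P[s, a])     # s_{t+1} ~ P(. | s_t, a_t)
    return tau

def discounted_return(tau, gamma):
    """G(tau) = sum_t gamma^t r_t."""
    return sum(gamma**t * r_t for t, (_, _, r_t) in enumerate(tau))
```

The same layout is reused in the sketches throughout this section.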
Value functions
Given a policy, we need to evaluate its utility.
Definition 1.3 ((action-) value function) Given policy \(\pi\), its value function \(V^\pi:\mathcal S\to \mathbb R\) is defined by \[ V^\pi_h(s) = \mathop{\mathbb{E}}\limits_{\tau\sim \rho^\pi}[R_h(\tau)\mid s_h=s] \] here \(R_h(\tau)=\sum_{t=h}^{H-1}\gamma^{t-h} r_t\) is the truncated (discounted) reward-to-go. Similarly, the action-value function is \[ Q^\pi_h(s, a) = \mathop{\mathbb{E}}\limits_{\tau\sim \rho^\pi}[R_h(\tau)\mid s_h=s, a_h=a] \] Note that the finite horizon makes \(V^\pi_h\neq V^\pi_{h-1}\) in general; the \(h\)-dependence can be dropped when \(H=\infty\).
Proposition 1.1 \(V\) and \(Q\) are related by \(Q_h\to V_h, V_{h+1}\to Q_h\) \[\begin{align} V^\pi_h(s) &= \mathop{\mathbb{E}}\limits_{a\sim \pi(\cdot\mid s)}Q^\pi_h(s, a) \\ Q^\pi_h(s, a) &= r(s, a) + \gamma \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, a)}[V^\pi_{h+1}(s')] \end{align}\] Substituting again yields the recursive relations \(V_{h+1}\to V_h, Q_{h+1}\to Q_h\): \[\begin{align} V^\pi_h(s) &= \mathop{\mathbb{E}}\limits_{\substack{a\sim \pi(\cdot\mid s) \\ s'\sim P(\cdot\mid s, a)}}[r(s, a) + \gamma\, V^\pi_{h+1}(s')] \\ Q^\pi_h(s, a) &= r(s, a) + \gamma \mathop{\mathbb{E}}\limits_{\substack{s'\sim P(\cdot\mid s, a) \\ a'\sim \pi(\cdot\mid s')}} Q^\pi_{h+1}(s', a') \end{align}\] These are known as the Bellman consistency relations.
Definition 1.4 (Bellman consistency operators) We can rewrite the consistency relations in terms of consistency operators. Define the (value-) consistency operator \(v\mapsto \mathcal J^\pi_v\) of type \(\mathcal J^\pi_{(-)}: (\mathcal S\to \mathbb R)\to (\mathcal S\to \mathbb R)\): it consumes a state evaluation \(v\) and produces a new state evaluation. The action-value consistency operator \(\tilde{\mathcal J}_{(-)}^\pi\) is defined similarly: \[\begin{align} \mathcal J^\pi_v(s) &= \mathop{\mathbb{E}}\limits_{\substack{a\sim \pi(\cdot\mid s) \\ s'\sim P(\cdot\mid s, a)}}[r(s, a) + \gamma\, v(s')] \\ \tilde{\mathcal J}_{q}^\pi(s, a) &= r(s, a) + \gamma \mathop{\mathbb{E}}\limits_{\substack{s'\sim P(\cdot\mid s, a) \\ a'\sim \pi(\cdot\mid s')}} q(s', a') \end{align}\] Note that \(\mathcal J^\pi_{(-)}\) requires the environment dynamics \(P(\cdot\mid s, a)\) to be known. The consistency operators simplify the recursive definitions: \[ V^\pi_h = \mathcal J^\pi_{V^\pi_{h+1}},\quad Q^\pi_h = \tilde {\mathcal J}^\pi_{Q^\pi_{h+1}} \]
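Under the same assumed tabular layout, both consistency operators are a few lines of linear algebra. Note how \(\mathcal J^\pi\) averages over \(a\) and then \(s'\), while \(\tilde{\mathcal J}^\pi\) averages over \(s'\) and then \(a'\), mirroring the sampling orders discussed in the summary.

```python
import numpy as np

def J_pi(v, P, r, pi, gamma):
    """(J^pi_v)(s) = E_{a ~ pi(.|s), s' ~ P(.|s,a)}[r(s,a) + gamma v(s')]."""
    q = r + gamma * P @ v          # (S, A): r(s,a) + gamma E_{s'} v(s')
    return (pi * q).sum(axis=1)    # (S,):   average over a ~ pi(.|s)

def J_tilde_pi(q, P, r, pi, gamma):
    """(J~^pi_q)(s,a) = r(s,a) + gamma E_{s' ~ P(.|s,a), a' ~ pi(.|s')} q(s',a')."""
    v = (pi * q).sum(axis=1)       # (S,):   E_{a' ~ pi(.|s')} q(s', a'), per s'
    return r + gamma * P @ v       # (S, A): then average over s' ~ P(.|s,a)
```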
Optimality
Definition 1.5 (optimal policy) A policy \(\pi^*\) is optimal if \[ \mathop{\mathbb{E}}\limits_{s_0\sim \mu} V^{\pi^*}_0(s_0) \geq \mathop{\mathbb{E}}\limits_{s_0\sim \mu} V^{\pi}_0(s_0), \quad \forall \pi \in \Pi \] The optimal (action-) value function is defined analogously \[\begin{align} V^*_h(s) = \max_\pi V^\pi_h(s), \quad Q^*_h(s, a) = \max_\pi Q^\pi_h(s, a) \end{align}\]
Theorem 1.1 (optimality relations) \(V^*\) and \(Q^*\) again satisfy the recursive relations \[\begin{align} V^*_h(s) &= \max_a Q^*_h(s, a) \\ &= \max_a \left[ r(s, a) + \gamma \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, a)} V^*_{h+1}(s') \right] \\ Q^*_h(s, a) &= r(s, a) + \gamma \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, a)} V^*_{h+1}(s') \\ &= r(s, a) + \gamma \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, a)} \max_{a'} Q^*_{h+1}(s', a') \end{align}\]
Proof
It suffices to prove \(V^*_h(s) = \max_a Q^*_h(s, a)\) since the rest of the results follow from the \(V\)-\(Q\) relations. Split \(\pi\) into \(\pi_h\) (the policy at step \(h\)) and \(\pi_{h+}\) (the policy thereafter), and note that \(Q^\pi_h\) depends only on \(\pi_{h+}\). Substituting the definition of \(Q^*\) yields \[\begin{align} V^*_h(s) &= \max_\pi V^\pi_h(s) = \max_{\pi_h} \max_{\pi_{h+}} \mathop{\mathbb{E}}\limits_{a\sim \pi_h} Q^{\pi_{h+}}_{h}(s, a) \\ &= \max_{\pi_h} \mathop{\mathbb{E}}\limits_{a\sim \pi_h} \left[ \max_{\pi_{h+}} Q^{\pi_{h+}}_{h}(s, a) \right] = \max_a Q^*_h(s, a) \end{align}\] Exchanging \(\max_{\pi_{h+}}\) with the expectation is valid because a single \(\pi_{h+}\) maximizes \(Q^{\pi_{h+}}_h(s, a)\) for every \((s, a)\) simultaneously; the final step holds because the maximizing distribution \(\pi_h\) concentrates on the maximizing action.

Definition 1.6 (optimality operators) Analogous to the consistency operators, define the \(V\)-optimality operator \(\mathcal J^*_{(-)}\) and \(Q\)-optimality operator \(\tilde{\mathcal J}^*_{(-)}\) so that \[\begin{align} \mathcal J^*_v(s) &= \max_a \left[ r(s, a) + \gamma \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, a)} v(s') \right] \\ \tilde {\mathcal J}^*_q(s, a) &= r(s, a) + \gamma \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, a)} \max_{a'} q(s', a') \end{align}\]
Remark. Note that the optimality equations have a very heavy dynamic programming flavor: solving \(V^*\) (or \(Q^*\)) at time \(h+1\) yields solutions for time \(h\).
Definition 1.7 (greedy policy)
Given an action-value candidate \(q:\mathcal S\times \mathcal A\to \mathbb R\), define the greedy policy \(\pi^q(a\mid s) = 1_{a=\operatorname*{arg\,max}_{a'} q(s, a')}\). Note that \(q\) doesn’t even have to satisfy the consistency equations.

Theorem 1.2 (\(Q^*\)-greedy policy is optimal) Given optimal \(Q^*\), the greedy policy \(\pi^{Q^*}\) is an optimal policy. This is nontrivial because greediness is a local behavior. In other words, \[ V^{\pi^{Q^*}} = V^* \]
Proof
We prove this for the finite-horizon case. Optimality is trivial at the terminal step \(h=H-1\). Fix any \(\pi\); inductively, assuming \(V^{\pi^{Q^*}}_{h+1} = V^*_{h+1}\), \[\begin{align} V^\pi_h(s) = \mathop{\mathbb{E}}\limits_{\tau \sim \rho^\pi} \left[\sum_{t=h}^{H-1} \gamma^{t-h}\, r_t \,\middle|\, s_h = s\right] &= \mathop{\mathbb{E}}\limits_{\substack{a_h\sim \pi_h(s_h) \\ s'\sim P(\cdot\mid s_h, a_h)}}\left[ r(s_h, a_h) + \gamma V^\pi_{h+1}(s') \right] \\ &\leq \mathop{\mathbb{E}}\limits_{\substack{a_h\sim \pi_h(s_h) \\ s'\sim P(\cdot\mid s_h, a_h)}}\left[ r(s_h, a_h) + \gamma V^{\pi^{Q^*}}_{h+1}(s') \right] \\ &= \mathop{\mathbb{E}}\limits_{a_h\sim \pi_h(s_h)}Q^*_h(s, a_h) \leq \max_a Q^{*}_h(s, a) = V^{\pi^{Q^*}}_h(s) \end{align}\] The last equality holds because \(\pi^{Q^*}\) picks the maximizing action and, by the inductive hypothesis, realizes the continuation value \(V^*_{h+1}\). In short: expand to rewrite using \(V_{h+1}\), apply the inductive inequality, then regroup.

Definition 1.8 (contraction) We equip the space of \(v\) and \(q\) functions with the \(\sup\)-norm \(\|\cdot\|_\infty\) given by \[ \|v\|_\infty = \sup_s |v(s)|, \quad \|q\|_\infty = \sup_{s, a} |q(s, a)| \] An operator \(\mathcal J\) is a \(\gamma\)-contraction, \(\gamma\in (0, 1)\), if for any \(u, v\), we have \[ \|\mathcal J(v) - \mathcal J(u)\| \leq \gamma \|v-u\| \]
Theorem 1.3 (contraction properties) For any policy \(\pi\), the consistency operators \(\mathcal J^\pi_{(-)}\) and \(\mathcal{\tilde J}^\pi_{(-)}\) are \(\gamma\)-contractions. The optimality operators \(\mathcal J^*_{(-)}\) and \(\mathcal{\tilde J}^*_{(-)}\) are also \(\gamma\)-contractions.
Proof
Recall the consistency operator definition 1.4; apply Jensen’s inequality and the definition of the \(\sup\)-norm: \[\begin{align} |\tilde {\mathcal J}^\pi_q(s, a) - \tilde {\mathcal J}^\pi_{q'}(s, a)| &= \gamma \left| \mathop{\mathbb{E}}\limits_{\substack{s'\sim P(\cdot\mid s, a) \\ a'\sim \pi(\cdot\mid s')}} \left[q(s', a') - q'(s', a')\right] \right| \\ &\leq \gamma \mathop{\mathbb{E}}\limits_{\substack{s'\sim P(\cdot\mid s, a) \\ a'\sim \pi(\cdot\mid s')}} \left| q(s', a') - q'(s', a') \right| \leq \gamma \|q-q'\| \end{align}\] The argument for \(V\) follows similarly. For the optimality operators, recall definition 1.6 and apply lemma 1.1 (fixing \(s'\)): \[\begin{align} \left| \tilde {\mathcal J}^*_q(s, a) - \tilde {\mathcal J}^*_{q'}(s, a)\right| &= \gamma \left|\mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, a)} \left[ \max_{a_0} q(s', a_0) - \max_{a_1} q'(s', a_1) \right]\right| \\ &\leq \gamma \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, a)} \left| \max_{a_0} q(s', a_0) - \max_{a_1} q'(s', a_1) \right| \leq \gamma \|q-q'\| \end{align}\]

Lemma 1.1 \(|\max_i a_i - \max_j b_j| \leq \max_j |a_j - b_j|\).
Proof
Let \(\max_j a_j = a_{j^*}\), then \[ \max_i a_i - \max_j b_j = a_{j^*} - \max_j b_j \leq a_{j^*} - b_{j^*} \leq \max_j |a_j - b_j| \] Swap \(a\leftrightarrow b\) to obtain the desired inequality.

Policy evaluation
Policy evaluation answers the following question:
- Given a policy \(\pi\), what are \(V^\pi\) and \(Q^\pi\)?
Remark (finite-horizon policy evaluation). In the finite horizon case, apply the consistency operators inductively to the trivial base case (terminal timestep). Algorithm:
- For \(t=H-1\dots 0\) do:
- For each state \(s\), compute \(\mathcal J^\pi_v(s)\) (definition 1.4) by iterating over \(a\in \mathcal A\) and \(s'\in \mathcal S\).
The algorithmic complexity is \(O(H\cdot |\mathcal A|\cdot |\mathcal S|^2)\). The action-value function is computed similarly.
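A sketch of this backward induction under the assumed tabular layout; for brevity it evaluates a stationary policy, though the loop extends directly to time-dependent \(\pi_h\). The vectorized `P @ V[h + 1]` hides the \(O(|\mathcal S|^2\cdot|\mathcal A|)\) inner loops.

```python
import numpy as np

def evaluate_finite_horizon(P, r, pi, H, gamma=1.0):
    """Backward induction: V^pi_h = J^pi applied to V^pi_{h+1}, base case V^pi_H = 0."""
    V = np.zeros((H + 1, r.shape[0]))
    for h in range(H - 1, -1, -1):
        q = r + gamma * P @ V[h + 1]    # Q^pi_h(s, a)
        V[h] = (pi * q).sum(axis=1)     # V^pi_h(s) = E_{a ~ pi(.|s)} Q^pi_h(s, a)
    return V[:H]
```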
Remark (infinite-horizon policy evaluation). For the infinite-horizon case, we use the fixed-point relation \(Q^\pi = \tilde{\mathcal J}^\pi_{Q^\pi}\). Since the consistency operator is a contraction, fixed-point iteration converges at rate \(\gamma\). Algorithm: initialize \(q_0(s, a)=0\) and iteratively apply \(q_{t+1} = \tilde{\mathcal J}^\pi_{q_t}\).
Note that \(\|q_0 - Q^\pi\|\leq R/(1-\gamma)\), where \(R\) is the uniform reward bound. To achieve \(\epsilon\)-error, it suffices to take \[\begin{align} T(\epsilon) \geq \dfrac 1 {1-\gamma} \log \dfrac{R}{\epsilon(1-\gamma)}, \quad\text{since then}\quad \|q_{T(\epsilon)} - Q^\pi\| \leq \gamma^{T(\epsilon)} \|q_0 - Q^\pi\| \leq \gamma^{T(\epsilon)}\, \dfrac{R}{1-\gamma} \leq \epsilon \end{align}\]
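A sketch of the corresponding fixed-point iteration, run for the \(T(\epsilon)\) derived above; the uniform reward bound `R` is simply read off the reward table.

```python
import numpy as np

def evaluate_infinite_horizon(P, r, pi, gamma, eps):
    """Iterate q <- J~^pi q from q_0 = 0 for T(eps) steps."""
    R = np.abs(r).max()                 # uniform reward bound
    T = max(int(np.ceil(np.log(R / (eps * (1 - gamma))) / (1 - gamma))), 1)
    q = np.zeros(r.shape)
    for _ in range(T):
        v = (pi * q).sum(axis=1)        # E_{a' ~ pi(.|s')} q(s', a')
        q = r + gamma * P @ v           # q_{t+1} = J~^pi applied to q_t
    return q
```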
Optimal solutions
The ultimate goal of RL is to answer the following question:
- Given a MDP, how to compute the optimal policy?
Definition 1.9 (finite-horizon (action-)value iteration) By theorem 1.2, computing \(Q^*\) suffices to yield the optimal policy. The value-iteration method uses the optimality recursion (definition 1.6) to compute \(Q^*\):
When computing \(V^*\), the algorithmic complexity is \(O(H\cdot |\mathcal S|^2 \cdot |\mathcal A|)\). Computing \(Q^*\) proceeds as follows:
- Initialize \(Q^*_{H-1}(s, a)=r(s, a)\); this is the base case.
- For \(t=H-2, \dots, 0\) do:
- For each \(s, a\), compute \(Q^*_t(s, a) = \tilde {\mathcal J}^*_{Q^*_{t+1}}(s, a)\) by iterating over \(s'\in \mathcal S\) (weighted by \(P(s'\mid s, a)\)) and \(a'\in \mathcal A\).
- Output the \(Q^*\)-greedy policy.
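A sketch of this procedure under the assumed tabular layout; the returned `greedy[t, s]` array encodes the deterministic policy \(\pi^{Q^*}_t\).

```python
import numpy as np

def finite_horizon_value_iteration(P, r, H, gamma=1.0):
    """Backward induction with the Q-optimality operator; returns Q*_t and pi^{Q*}_t."""
    Q = np.zeros((H,) + r.shape)
    Q[H - 1] = r                                     # base case Q*_{H-1} = r
    for t in range(H - 2, -1, -1):
        Q[t] = r + gamma * P @ Q[t + 1].max(axis=1)  # J~* applied to Q*_{t+1}
    greedy = Q.argmax(axis=2)                        # greedy[t, s] = argmax_a Q*_t(s, a)
    return Q, greedy
```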
Proposition 1.2 (infinite-horizon value iteration) For infinite MDPs, value iteration (VI) proceeds as follows:
- Initialize \(v_0 = 0\).
- Repeat until convergence:
- \(v_{t+1} = \mathcal J^*_{v_t}\).
- Output the \(v_T\)-greedy policy \(\pi^{v_T}\).
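A sketch of VI with a sup-norm stopping rule (the tolerance `tol` is an assumed knob, not part of the proposition). Note that extracting the greedy policy from the state-value \(v_T\) requires the dynamics `P` for the one-step lookahead.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """Iterate v <- J* v until the sup-norm update falls below tol, then act greedily."""
    v = np.zeros(r.shape[0])
    while True:
        q = r + gamma * P @ v                 # one-step lookahead on v
        v_next = q.max(axis=1)                # (J* v)(s) = max_a [...]
        if np.abs(v_next - v).max() < tol:
            return v_next, q.argmax(axis=1)   # v_T and the v_T-greedy policy
        v = v_next
```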
Two remarks:
- \(v_T\) might not be a consistent Bellman value function.
- Moreover, the value function of the \(v_T\)-greedy policy is, in general, different from \(v_T\).
- We only have equality \(V^{\pi^{V^*}} = V^*\) in the optimal case.
We can derive an explicit bound on how well the greedy policy “reconstructs” its value function:
Theorem 1.4 (greedy policy value bound) Let \(\pi^v\) be the greedy policy w.r.t. the one-step lookahead on \(v:\mathcal S\to \mathbb R\) (note that \(v\) need not satisfy consistency relations), then \[ \|V^{\pi^v} - V^* \| \leq \dfrac{2\gamma}{1-\gamma} \|v-V^*\| \]
Proof
Assume deterministic policies and denote \(\hat \pi = \pi^v,\ \hat V = V^{\hat \pi}\): \[\begin{align} V^*(s) - \hat V(s) &= Q^*(s, \pi^*(s)) - Q^{\hat \pi}(s, \hat \pi(s)) \\ &= \left[ Q^*(s, \pi^*(s)) - Q^*(s, \hat \pi(s)) \right] + \left[ Q^*(s, \hat \pi(s)) - Q^{\hat \pi}(s, \hat \pi(s)) \right] \end{align}\] Let \(q(s, a) = r(s, a) + \gamma\, \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, a)} v(s')\) so that \(\hat \pi(a\mid s) = 1_{a=\operatorname*{arg\,max}_{a'} q(s, a')}\). By the greedy construction of \(\hat \pi\), \[ q(s, \hat \pi(s)) \geq q(s, \pi^*(s)) \implies q(s, \hat \pi(s)) - q(s, \pi^*(s)) \geq 0 \] Add this to the first square bracket to obtain \[\begin{align} Q^*(s, \pi^*(s)) - Q^*(s, \hat \pi(s)) &\leq \left[ Q^*(s, \pi^*(s)) - q(s, \pi^*(s)) \right] + \left[ q(s, \hat \pi(s)) - Q^*(s, \hat \pi(s)) \right] \\ &\leq 2\gamma \|v-V^*\| \end{align}\] The last inequality comes from expanding \(q\) and \(Q^*\): each bracket reduces to \(\gamma\,\mathbb E_{s'}[V^*(s') - v(s')]\) up to sign. The second square bracket is bounded similarly: \[\begin{align} Q^*(s, \hat \pi(s)) - Q^{\hat \pi}(s, \hat \pi(s)) &= \gamma\, \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, \hat \pi(s))} \left[V^*(s') - \hat V(s')\right] \\ &\leq \gamma \|V^* - \hat V\| \end{align}\] Combining the bounds yields \[\begin{align} \|V^* - \hat V\| &\leq 2\gamma \|v-V^*\| + \gamma \|V^* - \hat V\| \end{align}\] and rearranging proves the inequality, as claimed.

We can alternatively iterate the policy.
Definition 1.10 (infinite-horizon policy iteration) Consider the following algorithm:
- Initialize a uniform policy \(\pi_0\).
- Repeat until convergence:
- \(\pi_{t+1} = \pi^{V^{\pi_t}}\).
- In human words: compute the value function of \(\pi_t\) and act greedily w.r.t. the value function.
The intuition behind this construction: for the operator \(\mathcal G:\pi \mapsto \pi^{V^\pi}\), the optimal greedy policy \(\pi^{V^*}\) is a fixed point.
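A sketch of policy iteration for deterministic policies, using exact evaluation by a linear solve (this anticipates the closed-form remark at the end of this section) followed by the greedy update.

```python
import numpy as np

def policy_iteration(P, r, gamma):
    """Iterate pi <- greedy(V^pi) until the policy is stable."""
    S = r.shape[0]
    pi = np.zeros(S, dtype=int)                # deterministic policy: pi[s] = action index
    while True:
        P_pi = P[np.arange(S), pi]             # (S, S): P(s' | s, pi(s))
        r_pi = r[np.arange(S), pi]             # (S,):   r(s, pi(s))
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # exact evaluation of pi
        pi_next = (r + gamma * P @ V).argmax(axis=1)          # greedy improvement
        if np.array_equal(pi_next, pi):
            return pi, V
        pi = pi_next
```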
Theorem 1.5 (policy iteration converges optimally) Policy iteration converges to the optimal policy at the rate \[ \| V^{\mathcal G(\pi)} - V^* \| \leq \gamma \|V^\pi - V^*\| \]
Proof
The proof is complete once we show \(V^{\mathcal G(\pi)} \geq \mathcal J^*_{V^\pi} \geq V^\pi\): together with \(V^* = \mathcal J^*_{V^*}\) and the contraction property of the Bellman optimality operator (theorem 1.3), this gives \[ \| V^* - V^{\mathcal G(\pi)}\| \leq \| V^* - \mathcal J^*_{V^\pi} \| \leq \gamma \|V^* - V^\pi\| \] First note that \(\mathcal J^*_{V^\pi} \geq \mathcal J^\pi_{V^\pi} = V^\pi\), since the max over actions dominates the average over \(\pi\). Next, write \(\mathcal G(\pi)(s)\) for the greedy action; it is exactly the action maximizing the one-step lookahead on \(V^\pi\), i.e. \[\begin{align} \mathcal J^*_{V^\pi}(s) &= \max_a \left[ r(s, a) + \gamma \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, a)} V^\pi(s') \right] \\ &= r(s, \mathcal G(\pi)(s)) + \gamma \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, \mathcal G(\pi)(s))} V^\pi(s') \end{align}\] Comparing this with the consistency relation for \(V^{\mathcal G(\pi)}\) yields \[\begin{align} V^{\mathcal G(\pi)}(s) - \mathcal J^*_{V^\pi}(s) &= \gamma \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, \mathcal G(\pi)(s))}\left[ V^{\mathcal G(\pi)}(s') - V^\pi(s') \right] \end{align}\] Since \(\mathcal J^*_{V^\pi} \geq V^\pi\), we have \(V^{\mathcal G(\pi)}(s) - V^\pi(s) \geq V^{\mathcal G(\pi)}(s) - \mathcal J^*_{V^\pi}(s)\); applying the displayed identity recursively shows \(V^{\mathcal G(\pi)} \geq V^\pi\), and hence \(V^{\mathcal G(\pi)} - \mathcal J^*_{V^\pi} \geq 0\). In short, the proof idea is:
- We need to show that \(V^{\mathcal G(\pi)} \geq \mathcal J^*_{V^\pi}\) to apply the convergence theorem.
- We note that \(V^{\mathcal G(\pi)} - \mathcal J^*_{V^\pi}\) is an expectation of \(V^{\mathcal G(\pi)} - V^\pi\), so it suffices to show that \(V^{\mathcal G(\pi)} \geq V^\pi\).
- To show the desired relation above, apply \(V^\pi \leq \mathcal J^*_{V^\pi}\) recursively.
Remark. Note that the value function of a deterministic policy can be computed in closed form for an infinite-horizon, finite-space MDP with deterministic rewards. Consider quantities of types \[ P^\pi\in \mathbb R^{|\mathcal S|\times |\mathcal S|}, \quad r^\pi, \mu\in \mathbb R^{|\mathcal S|} \] Here \(\langle s|P^\pi|s'\rangle= P(s'\mid s, \pi(s))\) and \(r^\pi(s) = r(s, \pi(s))\). In this notation, the Bellman consistency conditions can be solved in closed form: \[ V^\pi = r^\pi + \gamma\, P^\pi V^\pi \implies V^\pi = (I - \gamma\, P^\pi)^{-1} r^\pi \]
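Spelled out in code (same assumed layout; this is exactly the evaluation step used in the policy-iteration sketch above):

```python
import numpy as np

def evaluate_closed_form(P, r, pi, gamma):
    """Solve (I - gamma P^pi) V = r^pi for a deterministic policy (action indices)."""
    S = r.shape[0]
    P_pi = P[np.arange(S), pi]    # <s|P^pi|s'> = P(s' | s, pi(s))
    r_pi = r[np.arange(S), pi]    # r^pi(s) = r(s, pi(s))
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```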
Remark (VI vs PI). Note that policy iteration (PI) cannot take more than \(|\mathcal A|^{|\mathcal S|}\) iterations, since there are only that many deterministic policies. Value iteration, on the other hand, might take more: distinct value estimates need not induce distinct greedy policies.
- Example: consider a single-state MDP with \(r(s, a)=1\), \(\gamma=0.9\), and \(v_0=0\). After one iteration \(v_1(s)=1\), while \(V^*(s)=10\): VI still has to close the gap even though every policy is already optimal.