3 Policy Methods
In this section, we parameterize the policy differentiably and improve it by gradient ascent. While value methods solve control by computing (action-)value functions, policy methods improve the policy directly.
- The Policy Gradient (PG) theorem 3.1 establishes the general gradient form: the score function \(\nabla_\theta \log \pi_\theta(a_t\mid s_t)\) multiplied by a reinforcement signal \(R(\tau)\).
- Two problems with vanilla PG: high variance of signal and low sample efficiency from online learning
- Method to reduce variance:
- Baseline theorem 3.2 establishes that adjusting by a state and history-dependent function does not introduce bias.
- Baseline cannot depend on \(a_t\); can be \(\theta\)-dependent; should be frozen during \(\theta\)-gradient calculation.
- Remaining rewards, \(Q(s, a)\), and advantage \(A=Q-V\) are all valid signals (corollary 3.1).
- Generalized advantage estimation: interpolation between bootstrapping and sampling.
- Using off-policy data: intuitively, similar policies should be able to share data.
- Relative policy performance theorem 3.4: the performance of \(\pi'\) is lower bounded by (performance of \(\pi\)) + (a surrogate loss \(\mathcal L_\pi(\pi')\)) − (a multiple of the square root of the KL divergence).
- Surrogate loss should be maximized (a confusing etymology).
- Surrogate loss gradient degenerates to online PG gradient when \(\pi'=\pi\).
- Theorem proof: performance difference lemma 3.3 + info-theory arguments (variational characterization of TV + Pinsker inequality).
- Implication: off-policy PG = optimizing surrogate-loss subject to small KL deviation
- Approximately enforce small-KL constraint by clipping: PPO-Clip 3.8.
- Full off-policy PG algorithm 3.7. Moving parts include:
- Estimating advantage: use MC or (how much) TD? Using another value-based approximation model is referred to as using a critic.
- Enforcing KL-constraints: clipping, adaptive-KL, or other methods.
- Balancing offline v. online: how many actor update steps prior to collecting a new batch of data?
The core problem of RL is sampling efficiency; key obstacles include:
- Estimation efficacy: when estimating quantities such as policy-gradients, \(V\), or \(Q\), need to consider bias-variance tradeoff.
- On-policy data reuse: it takes nontrivial effort for data to be reusable for on-policy methods.
Vanilla policy gradient
- Policy-based RL has better convergence properties, learns stochastic policies, and is often effective in high-dimensional or continuous action spaces.
- On the other hand, convergence is local and evaluating a policy is often high-variance.
Definition 3.1 (policy value) Fixing an MDP and parameterization \(\pi_\theta\), the policy value is written \[ J(\theta) = \mathop{\mathbb{E}}\limits_{\tau\sim \rho^{\pi_\theta}}[R(\tau)] = \mathop{\mathbb{E}}\limits_{\tau\sim \rho^{\pi_\theta}}\left[ \sum_{t=0}^{H-1} r_t(\tau)\right] \]
The theorem below is foundational in policy methods. In particular, it does not require \(R(\tau)=\sum \gamma^t r_t\) to be differentiable. Note that the \(\theta\)-dependence is implicit through the trajectory measure \(\rho^{\pi_\theta}\).
Theorem 3.1 (Policy-Gradient) \(\nabla_\theta J(\theta) = \mathop{\mathbb{E}}\limits_{\tau \sim \rho^{\pi_\theta}}\left[R(\tau) \sum_{t=0}^{H-1} \nabla \log \pi_\theta(a_t\mid s_t)\right]\).
Proof (log-derivative trick)
Note that \(\nabla_\theta R(\tau)=0\) since the reward function \(R\) itself is independent of \(\theta\), then \[\begin{align} \nabla_\theta J(\theta) &= \nabla_\theta \int R(\tau) \rho^{\pi_\theta}(\tau) \, d\tau \\ &= \int R(\tau) \nabla_\theta \rho^{\pi_\theta}(\tau)\, d\tau \\ &= \int R(\tau) \rho^{\pi_\theta}(\tau) \dfrac{\nabla_\theta \rho^{\pi_\theta}(\tau)}{\rho^{\pi_\theta}(\tau)}\, d\tau \\ &= \mathop{\mathbb{E}}\limits_{\tau}\left[R(\tau) \nabla_\theta \log \rho^{\pi_\theta}(\tau)\right] \end{align}\] where we used \(\dfrac 1 {g(\theta)}\nabla_\theta g(\theta) = \nabla_\theta \log g(\theta)\). Proceeding to compute \(\nabla_\theta \log \rho^{\pi_\theta}(\tau)\), the environment-dynamics log-probs are \(\theta\)-independent, so \[\begin{align} \nabla_\theta \log \rho^{\pi_\theta}(\tau) &= \nabla_\theta \sum_{t=0}^{H-1} \left[\log \pi_\theta(a_t\mid s_t) + \log P(s_{t+1}\mid s_t, a_t) + \text{ reward terms}\right] \\ &= \sum_{t=0}^{H-1} \nabla_\theta \log \pi_\theta(a_t\mid s_t) \end{align}\]
- The \(\nabla_\theta \log \pi_\theta(a_t\mid s_t)\) is the parameter-space “direction” which increases the policy’s propensity to choose \(a_t\) in state \(s_t\), inversely weighted by the likelihood of the policy selecting that action.
- This normalization is necessary to offset the sampling weight in \(\mathbb E\).
- The direction \(\nabla_\theta \log \pi_\theta(a_t\mid s_t)\) is also known as the score function.
- Interpretation: the normalized policy directions \(\nabla \log \pi_\theta(a_t\mid s_t)\) should be “reinforced” by the accumulated reward on the trajectory \(R(\tau)\).
- The trajectory reward \(R(\tau)\) is extremely high-variance.
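The PG identity can be sanity-checked on a two-armed bandit with a softmax policy, comparing the exact expectation of the score-function estimator against a finite-difference gradient of \(J\). A minimal sketch (all names illustrative, not from the text):

```python
# Hypothetical sketch: verify the policy-gradient identity on a 2-armed bandit.
import numpy as np

rewards = np.array([1.0, 3.0])          # deterministic reward per arm

def policy(theta):
    """Softmax policy over two arms, parameterized by a single logit theta."""
    logits = np.array([theta, 0.0])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def J(theta):
    """Policy value: expected reward under pi_theta."""
    return float(policy(theta) @ rewards)

def pg_estimate(theta):
    """Exact expectation of the score-function estimator R * d/dtheta log pi(a)."""
    pi = policy(theta)
    # For a softmax over logits [theta, 0]: d/dtheta log pi(0) = 1 - pi[0],
    # and d/dtheta log pi(1) = -pi[0].
    score = np.array([1 - pi[0], -pi[0]])
    return float(np.sum(pi * rewards * score))

theta = 0.3
fd = (J(theta + 1e-5) - J(theta - 1e-5)) / 2e-5   # finite-difference gradient
assert abs(fd - pg_estimate(theta)) < 1e-6        # matches the PG theorem
```

Here the expectation over trajectories collapses to a sum over the two arms, so the estimator's mean can be computed exactly rather than sampled.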
Intuition tells us that the reinforce signal for the action at time \(t\) should not depend on past rewards \(r_0, \dots, r_{t-1}\). This is in fact the case. We prove a more general result in the next subsection.
Reducing variance
Baseline
To reduce variance, we prove that the gradient direction of \(\pi_\theta(a_t\mid s_t)\) can be adjusted by a \((s_0, \dots, s_{t-1}, a_{t-1}, r_{t-1}, s_t)\)-dependent baseline \(B_t\) without introducing bias, so that:
- \(B_t=\) accumulated past rewards: the remaining rewards \(R(\tau)-B_t\) form a valid signal.
- Taking expectation of remaining rewards: \(Q^\pi(s_t, a_t)\) is a valid signal.
- \(B_t\) additionally includes \(V(s_t)\): the advantage function \(A^\pi=Q^\pi-V^\pi\) is a valid signal.
Theorem 3.2 (baseline trick) Fixing \(t\), given any deterministic function \(B_t(\sigma_t)\) where \(\sigma_t(\tau)=(s_0, a_0, r_0, \dots, s_{t-1}, a_{t-1}, r_{t-1}, s_t)\) is a slice of trajectory, we have \[ \mathop{\mathbb{E}}\limits_{\tau \sim \rho^{\pi_\theta}} \left[ B_t(\sigma_t(\tau)) \nabla_\theta \log \pi_\theta(a_t\mid s_t) \right] = 0 \] Note that inside the expectation, \(\sigma_t(\tau)\) will still be a random slice of the trajectory.
Important proof: condition upon \(s_0, \dots, s_t\)
Let \(\mathcal F_t=\sigma(s_0, a_0, r_0, \dots, s_t)\) denote the history up to (and including) \(s_t\). The key property here is that the baseline \(B_t(\sigma_t)\) is a constant w.r.t. this conditioning: \[ \mathop{\mathbb{E}}\limits_{\tau \sim \rho^{\pi_\theta}} \left[ B_t(\sigma_t) \nabla_\theta \log \pi_\theta(a_t\mid s_t) \mid \mathcal F_t \right] = B_t(\sigma_t) \mathop{\mathbb{E}}\limits_{\tau \sim \rho^{\pi_\theta}} \left[ \nabla_\theta \log \pi_\theta(a_t\mid s_t) \mid \mathcal F_t \right] \] The remaining conditional expectation vanishes since \(\sum_{a_t}\pi_\theta(a_t\mid s_t)=1\): \[\begin{align} \mathop{\mathbb{E}}\limits_{\tau \sim \rho^{\pi_\theta}} \left[\nabla_\theta \log \pi_\theta(a_t\mid s_t)\mid \mathcal F_t\right] &= \sum_{a_t} \pi_\theta(a_t\mid s_t)\, \nabla_\theta \log \pi_\theta(a_t\mid s_t) \\ &= \nabla_\theta \sum_{a_t} \pi_\theta(a_t\mid s_t) = \nabla_\theta 1 = 0 \end{align}\] Applying the tower law completes the proof: \[ \mathop{\mathbb{E}}\limits_{\tau \sim \rho^{\pi_\theta}}\left[ B_t(\sigma_t)\, \nabla_\theta \log \pi_\theta(a_t\mid s_t) \right] = \mathop{\mathbb{E}}\limits_{\tau \sim \rho^{\pi_\theta}}\left[ B_t(\sigma_t)\, \mathop{\mathbb{E}}\limits_{\tau \sim \rho^{\pi_\theta}} \left[\nabla_\theta \log \pi_\theta(a_t\mid s_t)\mid \mathcal F_t\right]_{=0} \right] = 0 \] Note that the baseline for the score function of the action at time \(t\) can depend on \(s_t\) but not on \(a_t\) or \(r_t\): the available variables \(\sigma_t(\tau)\) are chosen precisely so that \(B_t(\sigma_t)\) is a constant w.r.t. the conditioning \(\mathcal F_t\) while the score function remains a random variable.
Definition 3.2 (advantage function, remaining rewards) Given a policy \(\pi\), the advantage function is defined as \[ A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s) \] The remaining reward of a trajectory at time \(t\) is defined as \[ G_t(\tau) = \sum_{t'=t}^{H-1} \gamma^{t'-t} r_{t'}(\tau) \]
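The remaining rewards \(G_t\) admit a simple reverse-scan computation; a sketch (names illustrative):

```python
# Sketch of definition 3.2: remaining (reward-to-go) rewards G_t computed by a
# reverse scan, given a list of per-step rewards and discount gamma.
def remaining_rewards(rs, gamma):
    G, out = 0.0, []
    for r in reversed(rs):
        G = r + gamma * G      # G_t = r_t + gamma * G_{t+1}
        out.append(G)
    return out[::-1]

# e.g. rewards [1, 1, 1] with gamma = 0.5 give G = [1.75, 1.5, 1.0]
```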
Corollary 3.1 (valid PG signals with lower variance) Using remaining rewards, \(Q\)-function, or advantage function to reinforce action at time \(t\) all result in an unbiased estimator with lower variance: \[\begin{align} \nabla_\theta J(\theta) &= \sum_{t=0}^{H-1} \mathop{\mathbb{E}}\limits_{\tau \sim \rho^{\pi_\theta}}\left[G_t(\tau)\, \nabla_\theta \log \pi_\theta(a_t\mid s_t)\right] \\ &= \sum_{t=0}^{H-1} \mathop{\mathbb{E}}\limits_{\tau \sim \rho^{\pi_\theta}}\left[Q^\pi(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t\mid s_t)\right] \\ &= \sum_{t=0}^{H-1} \mathop{\mathbb{E}}\limits_{\tau \sim \rho^{\pi_\theta}}\left[A^\pi(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t\mid s_t)\right] \end{align}\]
Proof
The first equation follows from the baseline theorem 3.2 by setting \(B_t(\sigma_t)\) to be the accumulated rewards. We furnish another slick proof here:
Assume for simplicity \(\gamma=1\). Let \(J_t(\theta)=\mathop{\mathbb{E}}\limits_{\tau}[r_t(\tau)]\); applying theorem 3.1 to the single reward term yields \[\begin{align} \nabla_\theta J_t(\theta) &= \mathop{\mathbb{E}}\limits_{\tau_{:t+1}}\left[r_t(\tau) \nabla_\theta \log \rho^{\pi_\theta}_{:t+1}(\tau_{:t+1})\right] \end{align}\] Note that here \(r_t\) only depends on \(\tau_{:t+1}\) (the trajectory up to and including time \(t\)). Since we’ve only sampled up to time \(t\), \(\nabla_\theta \log \rho^{\pi_\theta}_{:t+1}(\tau_{:t+1})=\sum_{t'=0}^{t} \nabla_\theta \log \pi_\theta(a_{t'}\mid s_{t'})\). Rearranging sums and summing over all \(t\) completes the proof.
To prove the validity of using \(Q\), invoke tower law w.r.t \(\mathcal F_t=\sigma(s_0, \dots, s_t, a_t)\). Note that \(\mathbb E[G_t(\tau)\mid \mathcal F_t] = \mathbb E[r_{t}+\dots\mid s_t, a_t] = Q(s_t, a_t)\): \[\begin{align} \nabla_\theta J(\theta) &= \sum_{t=0}^{H-1} \mathop{\mathbb{E}}\limits_{\tau \sim \rho^{\pi_\theta}}\left[\mathop{\mathbb{E}}\limits_{}[G_t(\tau)\mid \mathcal F_t]\, \nabla_\theta \log \pi_\theta(a_t\mid s_t)\right] \\ &= \sum_{t=0}^{H-1} \mathop{\mathbb{E}}\limits_{\tau \sim \rho^{\pi_\theta}}\left[Q(s_t, a_t) \nabla_\theta \log \pi_\theta(a_t\mid s_t)\right] \end{align}\]
Applying the baseline theorem to the result above proves the validity of the advantage function.
Definition 3.3 (REINFORCE algorithm) Using corollary 3.1 above, consider the online, on-policy algorithm:
- Initialize stochastic policy \(\pi_\theta\).
- Repeat until convergence:
- Sample episode \(\{s_1, a_1, r_1, \dots, s_H, a_H, r_H\}\sim \rho^{\pi_\theta}\):
- Compute \(G_1, \dots, G_H\); note that these are treated as constants w.r.t. \(\theta\).
- For \(t=1\dots H\) update \(\theta\mapsto \theta + \alpha\, G_t \nabla_\theta \log \pi_\theta(a_t\mid s_t)\).
- Return \(\pi_\theta\).
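The loop above can be sketched for a tabular softmax policy, where \(\theta\) is an \(|S|\times|A|\) logit table (a minimal illustrative sketch, not the text's notation):

```python
# Minimal sketch of one REINFORCE update (definition 3.3) for a tabular
# softmax policy; `theta` is an |S| x |A| table of logits.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce_update(theta, episode, alpha=0.1, gamma=1.0):
    """episode: list of (s, a, r) tuples; returns updated logits."""
    theta = theta.copy()
    # remaining rewards G_t, treated as constants w.r.t. theta
    G, Gs = 0.0, []
    for _, _, r in reversed(episode):
        G = r + gamma * G
        Gs.append(G)
    Gs = Gs[::-1]
    for (s, a, _), G_t in zip(episode, Gs):
        pi = softmax(theta[s])
        grad_log = -pi            # d log pi(a|s) / d theta[s, b] = 1{b=a} - pi(b)
        grad_log[a] += 1.0
        theta[s] += alpha * G_t * grad_log
    return theta
```

For example, a one-step episode `[(0, 0, 1.0)]` starting from zero logits nudges the chosen action's logit up and the other's down by equal amounts.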
Definition 3.4 (Monte-Carlo PG) Using the methods above, consider the following control algorithm:
- Initialize \(\pi_\theta, B_\phi\).
- Repeat until convergence:
- Collect trajectories \((\tau_j)\) on-policy.
- Compute \(G_{jt}\) for each trajectory and timestep.
- Fit the baseline by minimizing \(\sum_{j, t} |B_\phi(s_{jt}) - G_{jt}|^2\) (or use TD methods).
- Update policy \(\theta \leftarrow \theta + \alpha\, \sum_{j, t} [G_{jt} - B_\phi(s_{jt})]\nabla_\theta \log \pi_\theta(a_{jt}\mid s_{jt})\).
Remark. The algorithm above is equivalent to PG with (MC returns estimate + MC value baseline estimate). Other alternatives include:
- Estimate value baseline using TD.
- Bootstrap \(G_{jt}\) using TD instead of using MC.
In these variations which introduce bootstrap, the policy model \(\pi_\theta\) is known as the actor, and the value model \(B_\phi\) the critic.
Also note that the baseline estimate can be \(\theta\)-dependent, since the baseline trick does not require \(B_t\) to be \(\theta\)-independent; it must, however, be frozen during the \(\theta\)-gradient computation.
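That subtracting a baseline preserves the mean while shrinking variance can be seen exactly on a toy two-armed softmax bandit, computing the estimator's moments by enumeration (a sketch with illustrative names; the baseline here is \(\mathbb E[R]\)):

```python
# Sketch: on a 2-armed bandit, subtracting a baseline b leaves the PG
# estimator's mean unchanged but can shrink its variance.
import numpy as np

rewards = np.array([1.0, 3.0])
theta = 0.3
logits = np.array([theta, 0.0])
pi = np.exp(logits) / np.exp(logits).sum()
score = np.array([1 - pi[0], -pi[0]])     # d/dtheta log pi(a) for each arm

def moments(b):
    """Exact mean and variance of (R - b) * score under a ~ pi."""
    x = (rewards - b) * score             # estimator value per sampled arm
    mean = np.sum(pi * x)
    var = np.sum(pi * x**2) - mean**2
    return float(mean), float(var)

m0, v0 = moments(0.0)                     # no baseline
mb, vb = moments(float(pi @ rewards))     # baseline = E[R]
assert abs(m0 - mb) < 1e-9                # unbiased: same mean
assert vb < v0                            # variance is reduced
```

The mean is preserved exactly because \(\mathbb E[\text{score}] = 0\), which is the content of theorem 3.2.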
Generalized Advantage Estimation
Another way of reducing variance is to bootstrap the reinforcement signal. Consider the \(N\)-step advantage estimators:
\[\begin{align} \hat A_t^{(1)} &= r_t + \gamma \hat V(s_{t+1}) - \hat V(s_t) \\ \hat A_t^{(2)} &= r_t + \gamma r_{t+1} + \gamma^2 \hat V(s_{t+2}) - \hat V(s_t) \\ \hat A_t^{(\infty)} &= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots - \hat V(s_t) \\ \end{align}\]
- Note again that the reinforcement signal (advantage / Q-value / total reward) is frozen w.r.t. \(\nabla_\theta\).
More steps imply (closer to MC) + (higher variance) + (lower bias). In either case, however, the advantage estimate is consistent so long as \(\hat V\to V\). Define the random variable \[ \delta_t^V = r_t + \gamma \hat V(s_{t+1}) - \hat V(s_t) \] It can be proven inductively that \[\begin{align} \hat A^{(k)}_t = \sum_{j=0}^{k-1} \gamma^j r_{t+j} + \gamma^k \hat V(s_{t+k}) - \hat V(s_t) = \sum_{j=0}^{k-1} \gamma^j \delta^V_{t+j} \end{align}\]
Definition 3.5 (Generalized Advantage Estimator) GAE is an exponentially-weighted average of \(k\)-step advantage estimators which happen to have a very simple, tractable form. \[\begin{align} \hat A_t^{\mathrm{GAE}(\gamma, \lambda)} &= (1-\lambda) \left(\hat A_t^{(1)} + \lambda \hat A_t^{(2)} + \lambda^2 \hat A_t^{(3)} + \dots\right) = \dots \\ &= \sum_{l=0}^\infty (\lambda \gamma)^l \delta^V_{t+l} \end{align}\] Note that \(\lambda=0\) reduces to TD while \(\lambda=1\) corresponds to MC.
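The GAE sum is usually computed with the standard backward recursion \(\hat A_t = \delta_t + \gamma\lambda \hat A_{t+1}\); a sketch for a finite episode, assuming \(\hat V(s_T)=0\) at the terminal state (names illustrative):

```python
# Sketch of GAE (definition 3.5) via the backward recursion
# A_t = delta_t + gamma * lambda * A_{t+1}, over a finite episode.
def gae(rs, vs, gamma, lam):
    """rs: rewards r_0..r_{T-1}; vs: value estimates V(s_0)..V(s_T)."""
    adv, A = [0.0] * len(rs), 0.0
    for t in reversed(range(len(rs))):
        delta = rs[t] + gamma * vs[t + 1] - vs[t]   # one-step TD error
        A = delta + gamma * lam * A
        adv[t] = A
    return adv
```

As a check on the limiting cases: with \(\lambda=0\) each entry is the one-step TD advantage \(\delta_t\), while with \(\lambda=1\) each entry equals the MC return minus \(\hat V(s_t)\).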
Off-policy, conservative updates
The central improvements to PG address problems stemming from (1) the high variance of rewards, and (2) the online property: we only ever update the policy based on trajectories collected using the current policy.
- The on-policy update rule results in low data efficiency.
- Parameterization: the norm in \(\theta\)-space might be drastically different from that in \(\pi_\theta\) space.
- For example, when logits saturate in logistic / softmax parameterizations.
- Performance collapse: if \(\theta\)-update results in bad policy, gradient samples and updates might be stuck.
As a result, we want improvements to be close to original policy in KL; this mitigates performance collapse and approximately allows us to use off-policy data.
One solution to (2)
Natural gradients: perform a coordinate change using the Hessian of the \(\pi_\theta\)-space KL w.r.t. \(\theta\), which is the Fisher information matrix; this makes the update invariant to the parameterization.
However, the Fisher information matrix is prohibitively expensive to compute in general. A more practical solution is to constrain the new policy to be close to the original in KL.
Theorem 3.3 (performance difference lemma (PDL)) Given policies \(\pi, \pi'\), define the induced state distribution \[ d^{\pi'}(s) = (1-\gamma)\sum_{t=0}^\infty \gamma^t \rho^{\pi'}(s_t=s) \quad \text{($H=\infty$)} \] The policy performance difference can be written as \[\begin{align} J(\pi') - J(\pi) &= \sum_{t=0}^{H-1} \mathop{\mathbb{E}}\limits_{\tau \sim \rho^{\pi'}} \left[\gamma^t A^\pi(s_t, a_t)\right] \\ &= \dfrac{1}{1-\gamma} \mathop{\mathbb{E}}\limits_{\substack{a\sim \pi' \\ s\sim d^{\pi'}}} \left[A^\pi(s, a)\right] \quad \text{when $H=\infty$} \end{align}\]
Telescope proof: expand \(A^\pi\) purely in terms of \(V^\pi\)
\[\begin{align} A^\pi(s_t, a_t) &= Q^\pi(s_t, a_t) - V^\pi(s_t) = r(s_t, a_t) + \gamma \mathop{\mathbb{E}}\limits_{s_{t+1}\sim P(s_t, a_t)}[V^\pi(s_{t+1})] - V^\pi(s_t) \end{align}\] Substituting this into the sum yields \[\begin{align} \sum_{t=0}^{H-1} \mathop{\mathbb{E}}\limits_{\tau \sim \rho^{\pi'}} \left[\gamma^t A^\pi(s_t, a_t)\right] &= \mathop{\mathbb{E}}\limits_{\tau \sim \rho^{\pi'}} \sum_{t=0}^{H-1} \gamma^t \left[ r(s_t, a_t) + \gamma\, V^\pi(s_{t+1}) - V^\pi(s_t) \right] \\ &= \left(\mathop{\mathbb{E}}\limits_{\tau \sim \rho^{\pi'}} \sum_{t=0}^{H-1} \gamma^t r_t \right) + \gamma^H V^\pi(s_H)_{=0} - \mathop{\mathbb{E}}\limits_{s_0}\left[V^\pi(s_0)\right] \\ &= J(\pi') - J(\pi) \end{align}\]
- For the finite-horizon case, PDL states that \(J(\pi') - J(\pi)=\) sum of \(\pi\)’s advantage w.r.t the current state at each time step.
- The key equation is \(A^\pi(s_t, a_t) = r(s_t, a_t) + \left[\gamma \mathop{\mathbb{E}}\limits_{\text{env}}[V^\pi(s_{t+1})] - V^\pi(s_t)\right]\). Under \(\pi'\), the first term contributes \(J(\pi')\), while the second term contributes \(J(\pi)\) under telescoping.
Definition 3.6 (surrogate loss) Denote the estimated performance of \(\pi'\) using \(\pi\)-trajectories by \(\mathcal L_\pi(\pi')\): \[\begin{align} \mathcal L_\pi(\pi') &= \dfrac{1}{1-\gamma} \mathop{\mathbb{E}}\limits_{\substack{a\sim \pi \\ s\sim d^{\pi}}} \left[\dfrac{\pi'(a\mid s)}{\pi(a\mid s)}A^\pi(s, a)\right] \end{align}\] The only difference from the analytic expression in theorem 3.3 is \(d^{\pi'}\mapsto d^\pi\). This is also known as the surrogate loss for off-policy PG.
Theorem 3.4 (relative policy performance bounds) The performance difference estimate in theorem 3.3 is accurate up to expected KL of the policies (Achiam et al. 2017). \[ \left|\left[J(\pi') - J(\pi)\right] - \mathcal L_\pi(\pi')\right| \leq C\sqrt{\mathop{\mathbb{E}}\limits_{s\sim d^\pi} D_{\mathrm{KL}}\left(\pi'(\cdot\mid s)\| \pi(\cdot\mid s)\right)} \] Rearranging and substituting \(\mathcal L_\pi(\pi')\) yields the lower performance bound \[\begin{align} J(\pi') &\geq J(\pi) + \dfrac{1}{1-\gamma} \mathop{\mathbb{E}}\limits_{\substack{a\sim \pi \\ s\sim d^{\pi}}} \left[\dfrac{\pi'(a\mid s)}{\pi(a\mid s)}A^\pi(s, a)\right] - C\sqrt{\mathop{\mathbb{E}}\limits_{s\sim d^\pi} D_{\mathrm{KL}}\left(\pi'(\cdot\mid s)\| \pi(\cdot\mid s)\right)} \end{align}\]
Proof sketch
Fixing \(\pi\), define \[ g(s) = \dfrac{1}{1-\gamma} \mathop{\mathbb{E}}\limits_{a\sim \pi} \left[\dfrac{\pi'(a\mid s)}{\pi(a\mid s)}A^\pi(s, a)\right] \] so that \(\mathcal L_\pi(\pi') = \mathop{\mathbb{E}}\limits_{s\sim d^\pi} g(s)\) and \(J(\pi')-J(\pi) = \mathop{\mathbb{E}}\limits_{s\sim d^{\pi'}}g(s)\). Next, recall the variational characterization of total-variation distance (Polyanskiy and Wu 2025): let \(\mathcal F = \{f:\mathcal X\to \mathbb R, \|f\|_\infty \leq 1\}\), then \[ \mathrm{TV}(P, Q) = \dfrac 1 2 \sup_{f\in \mathcal F} \left[\mathbb{E}_P f(X) - \mathbb{E}_Q f(X)\right] \] Using a suitably chosen constant \(D\) so that \(g/D\in \mathcal F\), we have \[ \left|\left[J(\pi') - J(\pi)\right] - \mathcal L_\pi(\pi')\right| \leq 2D \, \mathrm{TV}(d^\pi, d^{\pi'}) \] The state-distribution deviation \(\mathrm{TV}(d^\pi, d^{\pi'})\) is in turn controlled by the expected policy deviation \(\mathop{\mathbb{E}}\limits_{s\sim d^\pi}\mathrm{TV}(\pi(\cdot\mid s), \pi'(\cdot\mid s))\), and the KL bound follows from Pinsker’s inequality \(D_{\mathrm{KL}}(P\|Q)\geq 2\, \mathrm{TV}(P, Q)^2\).
Corollary 3.2 (off-policy PG) Taking a step back, we have proven that a lower bound on \(J(\pi')\) can be obtained from samples from \(\pi\): \[ \text{Maximizing } J(\pi'_\theta) \text{ w.r.t. $\theta$} \impliedby \text{ maximizing } \mathcal L_\pi(\pi'_\theta) \text{ w.r.t. $\theta$} \] when \(D_{\mathrm{KL}}\left(\pi'(\cdot\mid s)\| \pi(\cdot\mid s)\right)\) is small.
- Note that the parameter-dependence of the old policy \(\pi\) is irrelevant since \(\pi'\) uses new parameters.
- Further note that the surrogate gradient at \(\pi'=\pi\) coincides with the on-policy PG gradient \(\nabla_\theta J(\pi_\theta)\), so we can just use the surrogate loss: \[\begin{align} \nabla_\theta \mathop{\mathbb{E}}\limits_{\pi_\theta} \left[ \dfrac{\pi_\theta(a\mid s)}{\pi_\theta(a\mid s)_{\text{frozen}}} A^{\pi_\theta}(s, a)_{\text{frozen}} \right] = \mathop{\mathbb{E}}\limits_{\pi_\theta} \left[ A^{\pi_\theta}(s, a)_{\text{frozen}} \nabla_\theta \log \pi_\theta(a\mid s) \right] \end{align}\]
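This degeneration of the surrogate gradient to the on-policy PG gradient can be checked numerically on a toy softmax bandit (a sketch; `A_frozen`, `surrogate`, and `pg` are illustrative names, with the advantage estimate held frozen):

```python
# Sketch: at pi' = pi, the surrogate-loss gradient reduces to the on-policy PG
# gradient, checked by finite differences on a 2-armed softmax bandit.
import numpy as np

A_frozen = np.array([0.7, -0.2])    # frozen advantage estimate per arm

def policy(theta):
    e = np.exp(np.array([theta, 0.0]))
    return e / e.sum()

theta0 = 0.4
pi_frozen = policy(theta0)          # old (frozen) policy

def surrogate(theta):
    # E_{a ~ pi_frozen}[ pi_theta(a) / pi_frozen(a) * A_frozen(a) ]
    return float(np.sum(pi_frozen * policy(theta) / pi_frozen * A_frozen))

def pg(theta):
    # E_{a ~ pi_theta}[ A_frozen(a) * d/dtheta log pi_theta(a) ]
    pi = policy(theta)
    score = np.array([1 - pi[0], -pi[0]])
    return float(np.sum(pi * A_frozen * score))

fd = (surrogate(theta0 + 1e-5) - surrogate(theta0 - 1e-5)) / 2e-5
assert abs(fd - pg(theta0)) < 1e-6
```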
Definition 3.7 (off-policy PG with surrogate loss) Consider the following algorithm: initialize \(\pi_\phi\) and repeat until convergence:
- Initialize \(\theta = \phi\).
- Collect \(N\) trajectories \(\tau_j\); also collect \(\pi_\phi(a_{jt}\mid s_{jt})\) for each move.
- Estimate advantages \(A^{\pi_\phi}(s_{jt}, a_{jt})\approx Q^{\pi_\phi}(s_{jt}, a_{jt}) - V^{\pi_\phi}(s_{jt})\).
- Advantages are treated as constants in gradient computation.
- Repeat until convergence:
- Update \(\theta\) with gradient \(\displaystyle \dfrac{1}{(1-\gamma)N} \nabla_\theta\sum_{jt} [(1-\gamma)\gamma^t]\dfrac{\pi_{\theta}(a_{jt}\mid s_{jt})}{\pi_\phi(a_{jt}\mid s_{jt})}A^{\pi_\phi}(s_{jt}, a_{jt})\).
- Optimization is performed subject to small KL.
- For finite-horizon tasks, replace \(1/(1-\gamma)\mapsto H\) and \([(1-\gamma)\gamma^t]\mapsto 1/H\).
- Update \(\phi \leftarrow \theta\).
Proceeding to solve the optimization problem subject to small \(D_{\mathrm{KL}}\left(\pi'(\cdot\mid s)\| \pi(\cdot\mid s)\right)\): recall that \[ D(P\|Q) = \mathop{\mathbb{E}}\limits_{P}\log \dfrac{dP}{dQ} \] so small KL is assured if \(\pi'(a|s)/\pi(a|s)\approx 1\). One way to enforce this is to withhold the reinforcing feedback for increasing (resp. decreasing) \(\pi'(a|s)\) once the ratio \(\pi'(a|s)/\pi(a|s)\) exceeds (resp. falls below) \(1\) by some margin. This results in PPO-Clip.
Definition 3.8 (PPO-Clip) Choose hyperparameter \(\epsilon<1\) (usually \(0.1\)–\(0.3\)); same as algorithm 3.7 except with \[ r^\theta_{jt} A^{\pi_\phi}(s_{jt}, a_{jt}) \mapsto \min \left[ r^\theta_{jt} A_{jt}, \mathrm{clip}(r^\theta_{jt}, 1\pm \epsilon) A_{jt} \right], \quad r^\theta_{jt} = \dfrac{\pi_{\theta}(a_{jt}\mid s_{jt})}{\pi_\phi(a_{jt}\mid s_{jt})} \]
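The per-sample clipped objective can be sketched directly (`ppo_clip_objective`, `ratio`, `adv` are illustrative names):

```python
# Sketch of the PPO-Clip per-sample objective (definition 3.8); `ratio` is
# pi_theta(a|s) / pi_phi(a|s) and `adv` is the frozen advantage estimate.
def ppo_clip_objective(ratio, adv, eps=0.2):
    clipped = min(max(ratio, 1 - eps), 1 + eps)   # clip(ratio, 1 - eps, 1 + eps)
    return min(ratio * adv, clipped * adv)
```

For positive advantage, the objective stops growing once the ratio exceeds \(1+\epsilon\), removing the incentive to push \(\pi'(a|s)\) further up; for negative advantage, the clipped branch likewise removes the incentive to push the ratio below \(1-\epsilon\).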
V-trace
In canonical policy-gradient with baselines, we’re training two models:
- A policy model \(\pi_\theta\); here, we can use the surrogate loss and clipping to ensure unbiased training under off-policy data.
- A value-model \(V_\phi\); we don’t know how to train this effectively using off-policy data just yet.
The IMPALA paper proposed directly applying importance sampling correction to the observed values and rewards, then training the policy in an on-policy manner on these corrected trajectories. Recall that given reference policy \(\pi\), we have
\[\begin{align} J(\pi') &= \sum_{t=0}^{H-1} \mathop{\mathbb{E}}\limits_{\substack{a\sim \pi' \\ s\sim d^{\pi'}}}\gamma^t \left[r(a_t, s_t)\right] \\ &= \sum_{t=0}^{H-1} \mathop{\mathbb{E}}\limits_{\substack{a\sim \pi \\ s\sim d^{\pi'}}}\left[\dfrac{\pi'(a_t\mid s_t)}{\pi(a_t\mid s_t)}r(a_t, s_t)\right] \\ \end{align}\]
When \(\pi\approx \pi'\), the state distributions \(d^\pi, d^{\pi'}\) are close, so we can train on an off-policy trajectory as if we had observed importance-sampling-corrected rewards, so long as the policies are similar. Also note that earlier steps of a rollout suffer less distribution mismatch than later ones.
Definition 3.9 (V-trace algorithm) Given a value function \(V\) and hyperparameters \(\bar \rho, \bar c\), define the \(n\)-step \(V\)-trace target for the current approximation as \[\begin{align} v_s &= V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \left(\prod_{i=s}^{t-1} c_i\right) \delta_t ,\quad c_t = \min \left[\bar c, \dfrac{\pi'(a_t\mid x_t)}{\pi(a_t\mid x_t)}\right] \\ \delta_t &= \rho_t \left[r_t + \gamma V(x_{t+1}) - V(x_t)\right], \quad \rho_t = \min \left[\bar \rho, \dfrac{\pi'(a_t\mid x_t)}{\pi(a_t\mid x_t)}\right] \end{align}\] Note that the trajectory \((a_t, x_t)\) is collected using the reference policy \(\pi\), while we’re approximating a value function for \(\pi'\). Unpacking the notation:
- When \(\pi=\pi'\), \(\delta_t\) is just the one-step temporal difference, and \(v_s\) is the n-step Bellman backup target.
- The term \(\rho_t\) introduces a clipped correction to the TD correction, preventing large variance at the expense of biasing the \(v\)-estimate towards the reference \(\pi\).
- It can be shown that the value function corresponding to the clipped TD-correction is some interpolation between \(\pi\) and \(\pi'\).
- The term \(c_t\) weighs the different TD corrections, down-weighing later \(\delta_t\)’s.
The final V-trace algorithm proceeds as follows:
- Collect trajectories using reference policy \(\pi\), recording \((a_t, s_t, \pi(a_t\mid s_t), r_t)\).
- When updating \(\pi'\) and its value estimate \(V\), first update \(V\) according to V-trace targets \(v\).
- Compute the corrected advantage \(A_s=r_s + \gamma v_{s+1} - V(x_s)\) and update policy parameters in direction of the gradient \[ \rho_s \nabla \log \pi(a_s\mid x_s)A_s \] Note that the (clipped) importance sampling is applied here as well.
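The V-trace targets above can be sketched as follows (an illustrative, unoptimized version; it assumes a finite segment whose bootstrap value \(V(x_{s+n})\) is the last entry of `values`):

```python
# Sketch of the n-step V-trace target (definition 3.9). Names follow the
# IMPALA notation; all identifiers here are illustrative.
def vtrace_targets(rs, values, ratios, gamma, rho_bar=1.0, c_bar=1.0):
    """rs: r_0..r_{n-1}; values: V(x_0)..V(x_n); ratios: pi'/pi per step."""
    n = len(rs)
    rhos = [min(rho_bar, r) for r in ratios]
    cs = [min(c_bar, r) for r in ratios]
    # clipped TD errors delta_t = rho_t * (r_t + gamma * V(x_{t+1}) - V(x_t))
    deltas = [rhos[t] * (rs[t] + gamma * values[t + 1] - values[t])
              for t in range(n)]
    targets = []
    for s in range(n):
        v, prod_c = values[s], 1.0
        for t in range(s, n):
            if t > s:
                prod_c *= cs[t - 1]       # running product of c_s..c_{t-1}
            v += gamma ** (t - s) * prod_c * deltas[t]
        targets.append(v)
    return targets
```

As a consistency check: with on-policy data (all ratios \(=1\)) and \(\bar\rho=\bar c=1\), the deltas telescope and \(v_s\) reduces to the ordinary \(n\)-step Bellman backup target \(\sum_{t=s}^{n-1}\gamma^{t-s} r_t + \gamma^{n-s} V(x_n)\).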