4 Inverse RL
In this section, we relax the assumption that the reward model is known. The task of inverse RL is to obtain a good policy from expert demonstrations.
- Naive approach: behavior cloning, i.e. direct supervised learning on the expert’s behavior. Problems:
- i.i.d. assumption violated due to state-distribution shifts.
- Can never out-perform the expert.
- Alternative approach: infer the reward, then optimize w.r.t. reward.
- Bottleneck becomes (being able to best infer reward from preferences) + (optimizing given reward).
- Capable of out-performing expert, given accurate reward modeling (!!!)
- Reward shaping theorem 4.1: the potential-based transformations \(r(s, a, s')\mapsto r(s, a, s') + \gamma \Phi(s') - \Phi(s)\) form the only symmetry class that holds under all possible dynamics and baseline rewards.
- Main idea: the transformation induces \(V(s) \mapsto V(s) - \Phi(s)\), so the optimal policy remains invariant.
- Fixing \(P\) and the baseline \(r\), there can be more symmetries.
- Classical approach 4.5 alternates adversarially between:
- Maximizing the reward gap between the expert’s reward and that of the current policy.
- Maximizing the policy’s reward w.r.t. the current reward estimate.
- Another approach to the reward identifiability problem is to use the principle of maximum entropy (algorithm 4.7). Assuming knowledge of the dynamics model \(P\), every policy \(\pi_\theta\) induces a distribution \(\rho^{\pi_\theta}\) over trajectories. Each reward model \(r_\phi\) induces a maximal-entropy policy \(\pi^*(r_\phi)\) compatible with the empirical reward (which turns out to be Boltzmann), which in turn induces an optimal distribution \(\rho^*_{r_\phi}\) over trajectories.
- We look for a reward model whose entropy-maximizing \(\pi_\phi\) induces a trajectory distribution satisfying (theorem 4.2):
- Reward-compatible with the empirical expert trajectories.
- Has maximal entropy across \(r_\phi\).
- Note that there are two layers of max-entropy (across distributions, constrained; and across reward functions).
- We break this optimization into optimizing \(r_\phi\) over (maximum entropy given \(r_\phi\), and reward-compatible with empirical data); the inner optimizer is a Boltzmann distribution.
- It turns out that this optimization is equivalent to looking for a reward model under whose Boltzmann distribution the empirical expert trajectories have maximal likelihood: key theorem 4.2.
- Interestingly, the gradient looks exactly like maximizing the separation between the empirically expected reward and the expected reward under the Boltzmann distribution, echoing the classical IRL construction 4.5.
- Critical assumption: the parameterization \(r_\phi\) has linear degree of freedom.
Zeroth-order approaches
Definition 4.1 (problem setup) In the imitation setup, we have access to:
- State and action spaces, transition model.
- No reward model \(R\).
- Set of one or more teacher’s demonstrations \((s_{jt}, a_{jt})\).
Interesting tasks include:
- Behavior cloning: how to reproduce the teacher’s behavior?
- Inverse RL: how to recover \(R\)?
- Apprenticeship learning via inverse RL: use \(R\) to generate a good policy.
Definition 4.2 (learning from demonstrations) Given demonstration trajectories \((s_{jt}, a_{jt})\), train a policy with supervised learning.
One problem with behavior cloning: compounding errors. Supervised learning assumes \(s_t\sim D_{\pi^*}\) i.i.d., while erroneous policies induce a state-distribution shift \(s_t\sim D_{\pi_\theta}\) at test time.
A simple remedy is DAgger, which iteratively queries the expert for the correct actions on the states visited by the learned policy.
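The DAgger loop can be sketched on a toy chain MDP (a minimal sketch; the environment, constants, and helper names below are illustrative assumptions, not from the source):

```python
import numpy as np

# Minimal DAgger sketch on a toy chain MDP (illustrative setup).
# States 0..N_STATES-1; action 1 moves right, action 0 moves left;
# the expert always moves right.
N_STATES, HORIZON, N_ITERS = 5, 10, 5

def expert_action(s):
    return 1  # expert label: always move toward the right end

def rollout(policy, start=0):
    s, visited = start, []
    for _ in range(HORIZON):
        visited.append(s)
        s = min(N_STATES - 1, s + 1) if policy(s) == 1 else max(0, s - 1)
    return visited

def train(dataset):
    # "Supervised learning" stand-in: per-state majority vote of labels.
    counts = np.zeros((N_STATES, 2))
    for s, a in dataset:
        counts[s, a] += 1
    return lambda s: int(counts[s].argmax())

# Behavior cloning initialization: supervise on the expert's own states.
dataset = [(s, expert_action(s)) for s in rollout(expert_action)]
policy = train(dataset)
for _ in range(N_ITERS):
    # DAgger step: roll out the *current* policy, then have the expert
    # label the states it actually visits -- this corrects the
    # state-distribution shift that plain behavior cloning suffers from.
    dataset += [(s, expert_action(s)) for s in rollout(policy)]
    policy = train(dataset)
```

The key difference from behavior cloning is the labeling distribution: the expert labels states sampled from \(D_{\pi_\theta}\), not from \(D_{\pi^*}\).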
Reward shaping
One immediate problem with learning the reward model is that the mapping \((R\to \pi^*)\) is not unique. One solution to this problem is provided in (Ng, Harada, and Russell 1999). In full generality, we consider additive transformations \(F(s, a, s')\) of the reward function.
Definition 4.3 (potential-based shaping function) A reward shaping function \(F:S\times A\times S\to \mathbb R\) is a potential-based shaping function if there exists a real-valued function \(\Phi:S\to \mathbb R\) such that \(\forall s\in S-\{s_0\}\), \[ F(s, a, s') = \gamma \Phi(s') - \Phi(s) \] where \(S-\{s_0\}=S\) if \(\gamma<1\).
Two remarks are in order about the following theorem:
- It includes scalar transformations \(r\mapsto \alpha\, r\) as a special case.
- It uniquely identifies the symmetry group for \(r\mapsto r+F\) only because the transition \(P\) and baseline reward are allowed to vary arbitrarily. Fixing the transition \(P\) and baseline \(r\) a priori, there may be a larger class of symmetries.
Theorem 4.1 (reward shaping theorem) The reward transformation \(r\mapsto r+F\) preserves the optimal policy for all transitions \(P\) and baseline reward \(r\) iff \(F\) is a potential-based shaping function. In other words:
- Sufficiency: if \(F\) is potential-based, then every optimal policy under \(r\) is an optimal policy under \(r'=r+F\).
- Necessity: if \(F\) is not potential-based, then there exist a transition model \(P\) and baseline reward \(r\) such that no optimal policy under \(r'\) is optimal under \(r\).
Under this transformation, the value functions transform as \[ Q(s, a)\mapsto Q(s, a) - \Phi(s), \quad V(s) \mapsto V(s) - \Phi(s) \]
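As a sanity check, the theorem can be verified numerically on a small random MDP (an illustrative sketch; the setup below is assumed, not from the source): value iteration under \(r\) and under \(r+F\) should produce \(Q^*\) tables differing by \(\Phi(s)\), with identical greedy policies.

```python
import numpy as np

# Numerical sanity check of the reward shaping theorem on a small
# random MDP (illustrative setup): shaping with
# F = gamma * Phi(s') - Phi(s) shifts Q* by -Phi(s) and leaves the
# greedy (optimal) policy unchanged.
rng = np.random.default_rng(0)
nS, nA, gamma = 6, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))        # P[s, a, s']
r = rng.normal(size=(nS, nA, nS))                    # baseline r(s, a, s')
Phi = rng.normal(size=nS)                            # arbitrary potential
F = gamma * Phi[None, None, :] - Phi[:, None, None]  # F(s, a, s')

def q_value_iteration(reward, iters=2000):
    Q = np.zeros((nS, nA))
    for _ in range(iters):
        V = Q.max(axis=1)  # V*(s') = max_a' Q*(s', a')
        Q = np.einsum('san,san->sa', P, reward + gamma * V[None, None, :])
    return Q

Q_base = q_value_iteration(r)
Q_shaped = q_value_iteration(r + F)
assert np.allclose(Q_shaped, Q_base - Phi[:, None], atol=1e-6)
assert (Q_shaped.argmax(axis=1) == Q_base.argmax(axis=1)).all()
```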
Sufficiency proof idea: exhibit a transformation \(Q^*\mapsto Q^* - \Phi\) whose shift is independent of the policy.
Let \(M, M'\) denote the MDPs under \(r\) and \(r'=r+F\) respectively. Recall for \(M\) the Bellman optimality equation: \[ Q^*_M(s, a) = \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, a)}\left[ r(s, a, s') + \gamma \max_{a'\in A} Q^*_M(s', a') \right] \] Subtract \(\Phi(s)\) from both sides: \[\begin{align} Q^*_M(s, a) - \Phi(s) = \mathop{\mathbb{E}}\limits_{s'\sim P(\cdot\mid s, a)}\left[ r(s, a, s') + \gamma\, \Phi(s') - \Phi(s) + \gamma \max_{a'\in A} \left[ Q^*_M(s', a') - \Phi(s') \right] \right] \end{align}\] But this is exactly the Bellman optimality equation for \(M'\), with solution \[ Q^*_{M'}(s, a) = Q^*_M(s, a) - \Phi(s) \] Then any optimal policy for \(M\), satisfying \[ \pi^* = \operatorname*{arg\,max}_{\pi} V^\pi_M(s_0) = \operatorname*{arg\,max}_{\pi} \left[ V^\pi_M(s_0) - \Phi(s_0) \right] = \operatorname*{arg\,max}_{\pi} V^\pi_{M'}(s_0), \] is also optimal for \(M'\).
Classical inverse RL
Definition 4.4 (linear features) Assume a feature function \(x:\mathcal S\times \mathcal A\to \mathbb R^n\) such that the reward is linear in features: \[ r(s, a) = w^T x(s, a), \quad w\in \mathbb R^n\text{ and } \|w\|_\infty \leq 1 \] Fixing the features a priori, the goal of reward learning is to identify the weight vector \(w\) from a set of demonstrations.
Proposition 4.1 (optimal-policy learning ≈ feature matching) Given features \(x\) satisfying definition 4.4 and a policy \(\pi\), define the expected discounted feature vector \(\mu_\pi \in \mathbb R^n\) by \[ \mu_\pi = \mathop{\mathbb{E}}\limits_{\pi} \left[ \sum_{t=0}^\infty \gamma^t x(s_t, a_t)\mid s_0 \right] \] Assuming that \(r=w^Tx\), then \[ \|\mu_\pi - \mu_{\pi^*}\|_1\leq \epsilon \implies V^*(s_0) - V^\pi(s_0) \leq \epsilon \]
Unrolling the linear reward function, \(V^\pi\) can be rewritten as \[ V^\pi(s_0) = \mathop{\mathbb{E}}\limits_{\pi} \left[ \sum_{t=0}^\infty \gamma^t w^T x(s_t, a_t)\mid s_0 \right] = w^T \mu_\pi \] Using Hölder’s inequality with \(\|w\|_\infty \leq 1\), we obtain \[ \|\mu_\pi - \mu_{\pi^*}\|_1\leq \epsilon \implies |w^T\mu_\pi - w^T \mu_{\pi^*}| \leq \epsilon \]
Definition 4.5 (classical IRL algorithm) Assuming \(r=w^Tx\) for features \(x\) given a priori:
- Compute the expert’s discounted mean features \(\mu_{\pi^*}\) from the demonstrations (proposition 4.1).
- Since the optimal policy satisfies \(w^T \mu_{\pi^*} \geq w^T \mu_{\pi}\), initialize \(\pi\) and repeat until convergence:
- Optimize \(w\mapsto \operatorname*{arg\,max}_{\|w\|_\infty \leq 1 } w^T \mu_{\pi^*} - w^T \mu_\pi\).
- Iterate \(\pi \mapsto \operatorname*{arg\,max}_{\pi} w^T \mu_\pi\).
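The loop above can be sketched on a random finite MDP (the environment, features, and the synthetic “expert” below are illustrative assumptions); note that under \(\|w\|_\infty \le 1\) the \(w\)-step has the closed form \(w = \operatorname{sign}(\mu_{\pi^*} - \mu_\pi)\):

```python
import numpy as np

# Sketch of the classical IRL loop on a random finite MDP with given
# features x(s, a); the environment and the synthetic "expert" are
# illustrative stand-ins for real demonstrations.
rng = np.random.default_rng(1)
nS, nA, nF, gamma = 5, 2, 4, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s']
x = rng.normal(size=(nS, nA, nF))              # features x(s, a)

def feature_expectations(policy, iters=500):
    # mu(s) = x(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) mu(s');
    # mu_pi of proposition 4.1 is this fixed point evaluated at s_0 = 0.
    mu = np.zeros((nS, nF))
    for _ in range(iters):
        mu = x[np.arange(nS), policy] + gamma * np.einsum(
            'sn,nf->sf', P[np.arange(nS), policy], mu)
    return mu[0]

def best_policy(w, iters=500):
    # Value iteration under the reward r(s, a) = w^T x(s, a).
    Q = np.zeros((nS, nA))
    for _ in range(iters):
        Q = x @ w + gamma * P @ Q.max(axis=1)
    return Q.argmax(axis=1)

expert = best_policy(rng.normal(size=nF))  # stand-in for demonstrations
mu_star = feature_expectations(expert)
pi = rng.integers(nA, size=nS)
for _ in range(20):
    w = np.sign(mu_star - feature_expectations(pi))  # maximize reward gap
    pi = best_policy(w)                              # best-respond to w
```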
Max-entropy IRL
We develop the inverse RL framework below assuming the dynamics is known. See (Finn, Levine, and Abbeel 2016) for a fitted version.
Definition 4.6 (principle of max entropy in IRL) Assuming access to the following components:
- Environment dynamics \(P\).
- Expert empirical distribution \(\hat \rho = \frac 1 N \sum_j 1_{\tau_j}\).
Given a reward model \(r_\phi\), the max-entropy trajectory distribution compatible with \(r_\phi\) and the expert trajectories is specified by
\[
\rho^*_{r_\phi} = \operatorname*{arg\,max}_\rho H(\rho), \quad H(\rho) = -\sum_\tau \rho(\tau) \log \rho(\tau)
\]
subject to the following constraints:
- Normalization: \(\sum_\tau \rho(\tau) = 1\).
- Reward equivalence: let \(\hat r_\phi = \dfrac 1 N \sum_j r_\phi(\tau_j) = \mathop{\mathbb{E}}\limits_{\hat \rho} r_\phi(\tau) \in \mathbb R\) denote the expert’s reward under this reward model; then \[ \mathop{\mathbb{E}}\limits_{\tau \sim \rho}r_\phi(\tau) = \sum_\tau \rho(\tau) \, r_\phi(\tau) = \hat r_\phi \]
Theorem 4.2 (max-entropy reduction)
Let \(B_{r_\phi, \lambda}\) denote the Boltzmann distribution \[ B_{r_\phi, \lambda}(\tau) = \dfrac{e^{\lambda \, r_\phi(\tau)}}{Z(r_\phi, \lambda)}, \quad Z(r_\phi, \lambda) = \sum_\tau e^{\lambda r_\phi(\tau)}, \quad r_\phi(\tau)\in \mathbb R, \quad \lambda \in \mathbb R \] Further assume that \(r_\phi\) has a linear degree of freedom, i.e. \[ \max_{r_\phi} \max_\lambda f(\lambda r_\phi) = \max_{r_\phi} f(r_\phi) \] Then the \(r_\phi\) maximizing entropy over all distributions \(\rho\) that are \(r_\phi\)-reward-compatible with \(\hat \rho\) is equal to the one maximizing the likelihood of \(\hat \rho\) under the Boltzmann distribution parameterized by \(r_\phi\):
\[\begin{align} \operatorname*{arg\,max}_{r_\phi}\left[ \max_{\text{distribution }\rho} H(\rho) \quad \text{s.t. } \mathop{\mathbb{E}}\limits_{\rho} r_\phi(\tau) = \bar r(r_\phi) \right] = \operatorname*{arg\,max}_{r_\phi} -D(\hat \rho \| B_{r_\phi, \lambda=1}) \end{align}\]
The gradient of the RHS w.r.t. \(\phi\) is \[\begin{align} \nabla_\phi \dfrac 1 N \sum \log B_{r_\phi, \lambda=1}(\tau_j) &= \nabla_\phi \left[ -\log Z_{r_\phi, \lambda=1} + \dfrac 1 N \sum r_\phi(\tau_j) \right] \\ &= \mathop{\mathbb{E}}\limits_{\hat \rho}\left[\nabla_\phi r_\phi(\tau)\right] - \mathop{\mathbb{E}}\limits_{B_{r_\phi, \lambda=1}} \left[\nabla_\phi r_\phi(\tau)\right] \end{align}\]
Proof idea: invoke the Boltzmann lemma below for the LHS. On the RHS, expand and apply the linear degree of freedom to write \(\operatorname*{arg\,max}_{r_\phi}=\operatorname*{arg\,max}_{r_\phi}\max_\lambda\); the inner \(\max_\lambda\) reduces to the average-reward constraint on \(\lambda\).
Lemma 4.1 (Boltzmann solution) The constrained maximum \[ \left[\max_{\text{distribution }\rho} H(\rho) \quad \text{s.t. } \mathop{\mathbb{E}}\limits_{\rho} r_\phi(\tau) = \bar r(r_\phi)\right]\,\, = \log Z_{r_\phi, \lambda} - \lambda \bar r(r_\phi) \] The distribution achieving the maximum is the Boltzmann distribution \(\rho^* = B_{r_\phi, \lambda}\), where the free parameter \(\lambda\) satisfies \(\mathop{\mathbb{E}}\limits_{B_{r_\phi, \lambda}} r_\phi(\tau) = \bar r(r_\phi)\).
Proof
Introducing two Lagrange multipliers \(\lambda, \lambda_p\), where the equations \(\partial_{\lambda }\mathcal L=\partial_{\lambda_p} \mathcal L = 0\) enforce the two constraints, respectively: \[\begin{align} \mathcal L &= - \sum_\tau \rho(\tau) \log \rho(\tau) - \lambda \left(\bar r(r_\phi) - \sum_\tau \rho(\tau) r_\phi(\tau) \right) + \lambda_p \sum_\tau \rho(\tau) \\ \partial_{\rho(\tau)} \mathcal L &= -\log \rho(\tau) - 1 + \lambda r_\phi(\tau) + \lambda_p = 0 \\ \log \rho(\tau) &= \lambda r_\phi(\tau) + \lambda_p - 1 \implies \rho^*(\tau) \propto e^{\lambda r_\phi(\tau)} \end{align}\] Proceeding to calculate the entropy of this distribution: \[\begin{align} H(\rho^*) &= \sum_\tau \rho^*(\tau) \log \dfrac{Z_{r_\phi, \lambda}}{e^{\lambda\, r_\phi(\tau)}} = \sum_\tau \rho^*(\tau) \left[ \log Z_{r_\phi, \lambda} - \lambda r_\phi(\tau) \right] \\ &= \log Z_{r_\phi, \lambda} - \lambda \mathop{\mathbb{E}}\limits_{\rho^*} r_\phi(\tau) = \log Z_{r_\phi, \lambda} - \lambda \bar r(r_\phi) \end{align}\]
Salient points:
- When \(r_\phi\) does not have linear degree of freedom, maximizing likelihood of the empirical distribution under \(r_\phi\)-parameterized Boltzmann distribution is not equivalent to maximizing the maximum reward-compatible distribution entropy.
- Given a Boltzmann distribution over trajectories, the maximum-entropy optimal policy is \(\pi_\phi(a\mid s) \propto \exp Q^*(s, a)\), where \(Q^*\) solves the MDP with reward \(r_\phi\).
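Lemma 4.1 can be checked numerically on a finite set of trajectories (an illustrative, self-contained sketch; the rewards and the target mean below are made up): solve for \(\lambda\) by bisection so that the Boltzmann mean reward matches \(\bar r\), then compare the achieved entropy against \(\log Z - \lambda \bar r\).

```python
import numpy as np

# Numerical check of the Boltzmann lemma on a finite "trajectory" set
# (illustrative): find the lambda whose Boltzmann distribution has
# mean reward r_bar, then verify H(rho*) = log Z - lambda * r_bar.
rng = np.random.default_rng(2)
r = rng.normal(size=8)            # r_phi(tau) for 8 trajectories
r_bar = (r.mean() + r.max()) / 2  # a feasible target mean reward

def boltzmann(lam):
    p = np.exp(lam * r)
    return p / p.sum()

# E_B[r] is increasing in lambda (its derivative is Var_B[r] > 0),
# so the constraint can be solved by bisection.
lo, hi = -50.0, 50.0
for _ in range(200):
    lam = (lo + hi) / 2
    lo, hi = (lam, hi) if boltzmann(lam) @ r < r_bar else (lo, lam)

p = boltzmann(lam)
H = -(p * np.log(p)).sum()
logZ = np.log(np.exp(lam * r).sum())
assert abs(p @ r - r_bar) < 1e-6             # constraint holds
assert abs(H - (logZ - lam * r_bar)) < 1e-6  # entropy matches the lemma
```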
Definition 4.7 (max-entropy IRL algorithm) Given a dynamics model \(P\) and expert demonstrations \(\{\hat \tau_j\}\) with empirical distribution \(\hat \rho\), initialize a reward model \(r_\phi\) (with linear degree of freedom) and repeat until convergence:
- Given \(r_\phi\), use value iteration to compute \(Q^*(s, a)\). The max-entropy optimal policy is then \(\pi_\phi(a\mid s) \propto \exp Q^*(s, a)\).
- Update \(\phi\) in the direction of \(-\nabla_\phi D(\hat \rho \|B_{r_\phi, \lambda=1}) = \mathop{\mathbb{E}}\limits_{\hat \rho}\left[\nabla_\phi r_\phi(\tau)\right] - \mathop{\mathbb{E}}\limits_{B_{r_\phi, \lambda=1}} \left[\nabla_\phi r_\phi(\tau)\right]\).
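The loop can be sketched for a finite-horizon MDP with a linear state reward \(r_\phi(s) = \phi^T x(s)\) (everything below, including the environment, features, and the synthetic “expert,” is an illustrative assumption; soft value iteration stands in for the value-iteration step):

```python
import numpy as np

# Sketch of max-entropy IRL on a finite-horizon MDP with a linear
# state reward r_phi(s) = phi^T x(s) (illustrative setup). The
# gradient is: expert feature expectation minus the model's.
rng = np.random.default_rng(3)
nS, nA, nF, T = 4, 2, 3, 5
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # dynamics P[s, a, s']
x = rng.normal(size=(nS, nF))                  # state features x(s)

def soft_policy(phi):
    # Soft value iteration: V(s) = log sum_a exp Q(s, a); the
    # max-entropy policy is pi(a|s) proportional to exp Q(s, a).
    r, V, pis = x @ phi, np.zeros(nS), []
    for _ in range(T):
        Q = r[:, None] + P @ V
        V = np.log(np.exp(Q).sum(axis=1))
        pis.append(np.exp(Q - V[:, None]))
    return pis[::-1]  # per-step policies in time-forward order

def expected_features(pis, d0):
    # Forward pass: accumulate E[sum_t x(s_t)] under the policy,
    # propagating the state visitation distribution d.
    d, mu = d0, np.zeros(nF)
    for pi in pis:
        mu += d @ x
        d = np.einsum('s,sa,san->n', d, pi, P)
    return mu

d0 = np.ones(nS) / nS
expert_mu = expected_features(soft_policy(rng.normal(size=nF)), d0)
phi = np.zeros(nF)
for _ in range(200):
    # Gradient ascent on the log-likelihood of the expert trajectories.
    grad = expert_mu - expected_features(soft_policy(phi), d0)
    phi += 0.02 * grad
```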