5 RL for LLMs
This section explores reward modeling and optimization methods that are widely applicable to LLMs. The theme is the joint task of modeling rewards and optimizing against the learned reward:
- Bradley-Terry model 5.1: model the logits of preferences.
- RLHF: use preferences to train a reward model \(r(s, a)\), then use PPO to optimize \(r\) under a KL constraint.
- Reward is single-shot: \((s, a)=\) (prompt, response).
- Assuming the reward model is accurately elicited and generalizes, the RLHF-trained model can outperform the demonstrations.
- DPO: reward modeling + PPO reward-maximization can be reduced to a single step:
- A similar idea appears in (Rajeswaran et al. 2019): nested optimization can be simplified when the inner loop has a suitable analytic solution, which is often possible when the inner objective is proximity-constrained.
- Only works with single-shot rewards, i.e., under the contextual bandit assumption.
- DPO = reward modeling + PPO, assuming PPO reaches the optimal KL-constrained policy.
Preference-based reward modeling
Definition 5.1 (Bradley-Terry model) Consider \(K\) actions \(b_1, \dots, b_K\). Assume that pairwise preferences are made noisily according to \[ P(b_j \succ b_k) = \dfrac{e^{r(b_j)}}{e^{r(b_k)} + e^{r(b_j)}} = \sigma[r(b_j) - r(b_k)] \] In short, preferences are encoded by additive logits given by \(r\); note that this model is transitive.
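As a minimal sketch (PyTorch-based; the helper names are illustrative assumptions), the Bradley-Terry probability and the negative log-likelihood used to fit \(r\) from observed pairwise preferences:

```python
import torch
import torch.nn.functional as F

def bt_preference_prob(r_j: torch.Tensor, r_k: torch.Tensor) -> torch.Tensor:
    # P(b_j > b_k) = sigma(r(b_j) - r(b_k)) under the Bradley-Terry model.
    return torch.sigmoid(r_j - r_k)

def bt_nll(r_winner: torch.Tensor, r_loser: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood of the observed preferences; minimizing this over
    # the parameters of r fits the reward model to the preference data.
    return -F.logsigmoid(r_winner - r_loser).mean()
```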
Definition 5.2 (winners) A choice \(b_j\) is:
- Condorcet winner if \(P(b_j \succ b_k) > 0.5\) for all \(k \neq j\).
- Copeland winner if it has the highest number of pairwise victories.
- Borda winner if it maximizes the expected score against a uniformly random opponent, \(\frac{1}{K-1}\sum_{k \neq j} P(b_j \succ b_k)\) (the expectation of a Heaviside win indicator); see the sketch after this list.
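A small numpy sketch (hypothetical helper, assuming a full matrix of pairwise win probabilities \(P[j, k] = P(b_j \succ b_k)\) is available) that identifies the three winners:

```python
import numpy as np

def winners(P: np.ndarray):
    # P[j, k] = P(b_j > b_k) for j != k; the diagonal is ignored.
    K = P.shape[0]
    off_diag = ~np.eye(K, dtype=bool)

    # Copeland: count pairwise victories (win probability above 0.5).
    victories = ((P > 0.5) & off_diag).sum(axis=1)
    copeland = int(victories.argmax())

    # Condorcet: an arm that beats every other arm; it may not exist.
    condorcet = copeland if victories[copeland] == K - 1 else None

    # Borda: expected win indicator against a uniformly random opponent.
    borda_scores = np.where(off_diag, P, 0.0).sum(axis=1) / (K - 1)
    borda = int(borda_scores.argmax())

    return condorcet, copeland, borda
```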
Definition 5.3 (preference model for trajectories) Assume that the probability that trajectory \(\tau^{(1)}\) is preferred over \(\tau^{(2)}\) is \[ P(\tau^{(1)} \succ \tau^{(2)}) = \sigma \left( \sum_t \left[ r^{(1)}_t - r^{(2)}_t \right] \right) \] where \(r^{(i)}_t = r(s^{(i)}_t, a^{(i)}_t)\). The reward model is trained by maximizing the likelihood of the observed preferences under the preference model.
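As a sketch, the trajectory-level likelihood is the same Bradley-Terry log-sigmoid applied to the difference of summed per-step rewards (shapes and names are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def traj_preference_nll(r1_steps: torch.Tensor, r2_steps: torch.Tensor) -> torch.Tensor:
    # r1_steps, r2_steps: per-timestep rewards r(s_t, a_t), shape (T,), for the
    # preferred and dispreferred trajectories. Returns -log P(tau1 > tau2).
    logit = r1_steps.sum() - r2_steps.sum()
    return -F.logsigmoid(logit)
```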
Definition 5.4 (RLHF) The RLHF pipeline consists of four parts:
- Pretraining a large language model.
- Supervised Fine-Tuning: collect expert demonstrations (prompt, response) and fine-tune the model to obtain \(\pi_{\mathrm{ref}}\).
- Reward model training: sample multiple model outputs for each prompt and train a reward model \(r_\phi(s, a)\) on labelers' rankings.
- PPO: optimize the KL-penalized reward w.r.t. \(\pi_\theta\); the KL term keeps \(\pi_\theta\) close to \(\pi_{\mathrm{ref}}\) and prevents over-optimization of the learned reward: \[ r(s, a) = r_\phi(s, a) - \beta \log \dfrac{\pi_\theta(a\mid s)}{\pi_{\mathrm{ref}}(a\mid s)} \]
Note that in canonical RLHF, the state-action pair is sequence-level, not token-level. This reduces the problem to a contextual bandit rather than a full-MDP setting. Even so, RL is useful here because we optimize against samples drawn from the model's own sequence distribution rather than only from fixed demonstrations.
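A sketch of the KL-shaped, sequence-level reward fed to PPO; `logp_theta` and `logp_ref` are assumed to be log-probabilities of the full response given the prompt under the current policy and the frozen reference model:

```python
import torch

def kl_shaped_reward(r_phi: torch.Tensor,
                     logp_theta: torch.Tensor,
                     logp_ref: torch.Tensor,
                     beta: float) -> torch.Tensor:
    # r(s, a) = r_phi(s, a) - beta * log[pi_theta(a|s) / pi_ref(a|s)],
    # computed per (prompt, response) pair in a batch.
    return r_phi - beta * (logp_theta - logp_ref)
```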
Direct Preference Optimization
Recall the RLHF objective. In the contextual bandit setting, the objective maximizes \[\begin{align} \mathop{\mathbb{E}}\limits_{\substack{s\sim \mathcal D \\ a\sim \pi_\theta(\cdot \mid s)}} \left[ r_\phi(s, a) - \beta D(\pi_\theta(\cdot \mid s)\| \pi_{\mathrm{ref}}(\cdot \mid s)) \right] \end{align}\] This is equivalent to identifying, for each context (prompt) \(s\), \[ \pi^*(\cdot\mid s) = \operatorname*{arg\,max}_{\pi(\cdot\mid s)} \mathop{\mathbb{E}}\limits_{a\sim \pi(\cdot \mid s)}\left[ \dfrac 1 \beta r_\phi(s, a) \right] - D(\pi(\cdot \mid s)\| \pi_{\mathrm{ref}}(\cdot \mid s)) \] The key idea of DPO is that preference-based reward modeling + reward optimization = direct preference optimization, by leveraging a closed-form relation between the reward and the optimal policy.
Fixing \(s\), recall the Gibbs variational principle: \[\begin{align} \log \mathbb E_Q e^{f(X)} = \sup_P \left\{ \mathbb E_P f(X) - D(P\|Q) \right\} \end{align}\] where the unique maximizer is the tilted distribution \(P=Q^f\), \[ Q^f(dx) = e^{f(x) - \psi_f} Q(dx), \quad \psi_f = \log \mathbb E_Q e^{f(X)} \] Assuming regularity conditions on \(f\), substitute \(X\mapsto a\), \(f(a)=\frac 1 \beta r_\phi(s, a)\), \(P\mapsto \pi(\cdot\mid s)\), \(Q\mapsto \pi_{\mathrm{ref}}(\cdot \mid s)\) to obtain the closed-form solution \[ \pi^*(a\mid s) = \dfrac{1}{Z(s)} \exp \left[\frac 1 \beta r_\phi(s, a)\right] \pi_{\mathrm{ref}}(a\mid s), \quad Z(s) = \sum_a \pi_{\mathrm{ref}}(a\mid s) \exp \left[\frac 1 \beta r_\phi(s, a)\right] \] Rearranging expresses the reward function in terms of the policy; note that \(Z(s)\) is still reward-dependent, so this is an implicit equation: \[ r_\phi(s, a) = \beta \left[ \log \dfrac{\pi^*(a\mid s)}{\pi_{\mathrm{ref}}(a\mid s)} + \log Z(s) \right] \] Under the Bradley-Terry model 5.1, given a preference dataset \(\{(s_j, a^1_j) \succ (s_j, a^2_j)\}\), the negative log-likelihood loss for the reward model can be rewritten directly in terms of the optimal policy, since \(\log Z(s_j)\) cancels in the difference: \[\begin{align} \mathcal L &= -\dfrac 1 N \sum_j \log \sigma \left[ r_\phi(s_j, a^1_j) - r_\phi(s_j, a^2_j) \right] \\ &= -\dfrac 1 N \sum_j \log \sigma\left[ \beta \log \dfrac{\pi^*(a^1_j\mid s_j)}{\pi_{\mathrm{ref}}(a^1_j\mid s_j)} - \beta \log \dfrac{\pi^*(a^2_j\mid s_j)}{\pi_{\mathrm{ref}}(a^2_j\mid s_j)} \right] \end{align}\]
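A small numerical check of this derivation for a fixed prompt with a finite action set (toy sizes and names are assumptions): the tilted policy maximizes the per-prompt KL-regularized objective, and the implicit-reward identity recovers \(r_\phi\) exactly once \(\log Z(s)\) is added back:

```python
import numpy as np

rng = np.random.default_rng(0)
A, beta = 5, 0.5
reward = rng.normal(size=A)           # r_phi(s, a) for a fixed prompt s
pi_ref = rng.dirichlet(np.ones(A))    # reference policy pi_ref(.|s)

# Closed-form maximizer: pi*(a|s) proportional to pi_ref(a|s) * exp(r_phi(s, a) / beta).
w = pi_ref * np.exp(reward / beta)
Z = w.sum()
pi_star = w / Z

def objective(pi: np.ndarray) -> float:
    # E_pi[r_phi / beta] - KL(pi || pi_ref), the per-prompt objective.
    return float((pi * reward / beta).sum() - (pi * np.log(pi / pi_ref)).sum())

# The tilted policy dominates, e.g., the reference policy itself.
assert objective(pi_star) >= objective(pi_ref)

# Implicit reward identity: r_phi = beta * (log(pi*/pi_ref) + log Z).
assert np.allclose(reward, beta * (np.log(pi_star / pi_ref) + np.log(Z)))
```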
Definition 5.5 (DPO algorithm) The DPO pipeline consists of three parts:
- Pretraining a large language model.
- Supervised Fine-Tuning: collect expert demonstrations (prompt, response) and fine-tune the model to obtain \(\pi_{\mathrm{ref}}\).
- DPO: collect a preference dataset \(\{(s_j, a^1_j) \succ (s_j, a^2_j)\}\) and optimize the DPO objective directly by ascending the preference log-likelihood, with \(\pi_\theta\) in the role of \(\pi^*\) (see the sketch after this list): \[ \theta \mapsto \theta + \alpha \nabla_\theta \dfrac 1 N \sum_j \log \sigma\left[ \beta \log \dfrac{\pi_\theta(a^1_j\mid s_j)}{\pi_{\mathrm{ref}}(a^1_j\mid s_j)} - \beta \log \dfrac{\pi_\theta(a^2_j\mid s_j)}{\pi_{\mathrm{ref}}(a^2_j\mid s_j)}\right] \]
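A minimal PyTorch sketch of the DPO objective above, assuming sequence-level log-probabilities of the preferred (`_w`) and dispreferred (`_l`) responses have already been computed under \(\pi_\theta\) and the frozen \(\pi_{\mathrm{ref}}\); names are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_theta_w: torch.Tensor, logp_theta_l: torch.Tensor,
             logp_ref_w: torch.Tensor, logp_ref_l: torch.Tensor,
             beta: float) -> torch.Tensor:
    # Negative preference log-likelihood; minimizing it is equivalent to the
    # gradient-ascent update on the log-likelihood written above.
    logits = beta * ((logp_theta_w - logp_ref_w) - (logp_theta_l - logp_ref_l))
    return -F.logsigmoid(logits).mean()
```

In practice the sequence log-probabilities are obtained by summing token log-probs of each response, and only \(\pi_\theta\) receives gradients; \(\pi_{\mathrm{ref}}\) stays frozen.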