# Reinforcement Learning
Self-learning notes on reinforcement learning at an introductory graduate level. Covers classical MDP theory as well as the modern toolkit for relaxed assumptions and learning-based methods.
Sources include:
- Harvard CS1840's textbook, *An Introduction to Reinforcement Learning*, by Alexander Cai.
- Stanford CS234 lecture notes and lectures by Professor Emma Brunskill.
# Chapter highlights {-}
- Markov Decision Processes explores closed-form solutions to finite- and infinite-horizon MDPs with known dynamics and rewards.
  - Recursive and optimality operators; contraction properties.
  - The control problem reduces to iterating (policy evaluation + greedy action), i.e. policy iteration.
- Sample-based MDP solutions explores MDP solutions with unknown dynamics, known rewards.
- Estimate procedures:
    - Monte Carlo: high variance, no bias, no Markov assumptions; converges to MSE-optimal estimates.
    - Temporal difference: low variance, high bias, requires Markov assumptions; converges to the DP solution on the empirical distribution.
- Monte-Carlo policy iteration: MC + PI, online on-policy.
  - SARSA: TD policy iteration, online on-policy.
- Q-learning: action-value iteration; can use TD (canonical) or MC to approximate action-value function, then act greedily.
- Policy Methods explores an orthogonal approach by directly optimizing policies. The foundational policy gradient theorem is improved in two ways:
- Reducing variance: baseline theorem
    - The best baseline is the state-value function, so good policy methods depend on good value-function models.
- Using off-policy data: relative policy difference theorem.
    - A surrogate policy loss corrects for off-policy data, contingent on policy proximity; this extends naturally to the PPO-Clip algorithm.
- V-trace corrects the observed values by (clipped) importance sampling; this corrects both value and policy losses.
- Inverse RL explores MDP solutions with unknown rewards, replacing them with estimates from expert demonstration; orthogonal to dynamics estimation methods.
  - Main problem: the identifiability problem of deducing rewards from expert policies; a partial answer is the reward shaping theorem.
  - The foundational result is max-entropy IRL, which has a clean closed-form optimization.
- RL for LLMs: eliciting rewards from preferences; direct preference optimization.
- Exploration and bandits: regret framework for analyzing decision optimality. Single-state memoryless MDP serves as the foundational block for MDP exploration methods.
  - Lai-Robbins lower bound: a fundamental lower bound on how much trial and error is needed.
- Upper-confidence bound bandits dominate lower-confidence bound ones. Be an optimist in life!
- UCB regret bound.
- Given priors (in a Bayesian setting), use Thompson sampling.
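As a concrete companion to the Markov Decision Processes highlight, here is a minimal tabular value iteration sketch on a hypothetical two-state, two-action MDP (all transition and reward numbers are invented for illustration). The Bellman optimality operator is a $\gamma$-contraction in the sup norm, so the loop converges geometrically:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers invented for illustration).
# P[s, a, s'] = transition probability; R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def value_iteration(P, R, gamma, tol=1e-8):
    V = np.zeros(P.shape[0])
    while True:
        # Bellman optimality operator: Q[s, a] = R[s, a] + gamma * E[V(s')].
        Q = R + gamma * P @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            # Optimal values and the Q*-greedy policy.
            return V_new, Q.argmax(axis=1)
        V = V_new

V_star, pi_star = value_iteration(P, R, gamma)
```

Acting greedily with respect to the converged $Q$ recovers an optimal policy, which is the reduction behind policy iteration.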
# Salient themes {-}
- Methodological approach: start from exact solutions under strong assumptions (finite MDPs with known dynamics); RL is a diverse toolkit for relaxing them.
- Large state-action space: use deep model function approximation.
  - Unknown dynamics: use sampling; interpolate between temporal difference and Monte Carlo.
- Unknown rewards: reward modeling (e.g. using max-entropy).
- Exploration: entropy bonus, bandit-inspired optimism.
- Two foundational cornerstones in RL: the (action-)value function and the policy. The former rests on DP reductions based on Markov assumptions, giving models more "bite" when the assumptions hold.
- Accurate action-value function implies good policy: Q-learning.
- Policy gradient training can benefit from good value-function estimates.
- Applicable themes in life:
  - Policy-gradient theorem: weighting individual actions by rollout rewards is unbiased despite seeming greedy.
- Improve by bootstrapping or baseline.
- Correct for off-policy by importance sampling.
  - Be an optimist under uncertainty! Better to be battered by and learn from reality than to miss out.
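The variance-reduction theme (policy gradient plus baseline) can be sketched as REINFORCE with a running-mean baseline on a hypothetical single-state task with three actions (the reward means below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-step task: 3 actions with noisy rewards (invented numbers).
true_means = np.array([0.1, 0.5, 0.9])

theta = np.zeros(3)      # logits of a softmax policy
baseline = 0.0           # running mean of rewards: the variance-reducing baseline
lr, beta = 0.1, 0.05

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)
    r = true_means[a] + 0.1 * rng.normal()
    # Policy-gradient theorem with baseline:
    # grad log pi(a) * (r - b) is an unbiased gradient estimate for any b.
    grad_log = -probs
    grad_log[a] += 1.0
    theta += lr * grad_log * (r - baseline)
    baseline += beta * (r - baseline)   # track mean reward
```

Replacing the scalar running mean with a learned state-value baseline recovers the actor-critic pattern noted above.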
# Cool theorems {-}
- Markov Decision Processes: contraction properties; optimality of $Q^*$-greedy policies; convergence bounds for VI and PI.
- Sample-based MDP solutions: N/A; just definitions.
- Policy methods: PG theorem, baseline theorem, relative policy difference theorem.
  - The proof of the relative policy difference theorem is a canonical application of large-deviation theory and $f$-divergences.
- Inverse RL: reward shaping theorem, max-entropy reduction theorem.
- Proof theme: Boltzmann methods, and interchange between single and nested optimizations.
- RL for LLMs: DPO reduction; variational characterization of KL and using analytic solutions to inner optimizations.
- Bandits and exploration: Lai-Robbins lower bound and UCB regret bound.
  - Proof theme: split the event into "typical" and "atypical" subevents according to the likelihood ratio; the typical subevent yields a direct bound, and the atypical subevent is bounded via the Radon-Nikodym derivative and KL properties.
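The optimism principle behind the UCB regret bound can be sketched as UCB1 on a hypothetical Bernoulli bandit (arm means invented for illustration); each arm's index is its empirical mean plus a bonus that shrinks as the arm is pulled:

```python
import math
import random

random.seed(0)

# Hypothetical Bernoulli bandit (arm means invented for illustration).
means = [0.2, 0.5, 0.8]
K, T = len(means), 5000

counts = [0] * K    # pulls per arm
sums = [0.0] * K    # cumulative reward per arm

for t in range(1, T + 1):
    if t <= K:
        a = t - 1   # pull each arm once to initialize
    else:
        # UCB1 index: empirical mean + optimism bonus sqrt(2 ln t / n_a).
        a = max(range(K),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]))
    r = 1.0 if random.random() < means[a] else 0.0
    counts[a] += 1
    sums[a] += r
```

The bonus $\sqrt{2\ln t / n_a}$ is what drives the UCB regret bound: suboptimal arms are pulled only $O(\log T)$ times.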
