Reinforcement Learning
2025-07-27
Preface
These notes accompany the summer 2025 self-learning for Reinforcement Learning. Reference resources include:
- Harvard CS1840’s textbook An Introduction to Reinforcement Learning by Alexander Cai.
- Stanford CS234 lecture notes and lectures by Professor Emma Brunskill.
Chapter highlights
- Markov Decision Processes explores closed-form solutions to finite and infinite-horizon MDPs with known dynamics and rewards.
- Recursive and optimality operators; contraction properties.
- The control problem reduces to iterating (policy evaluation + greedy improvement), i.e. policy iteration.
- Sample-based MDP solutions explores MDP solutions with unknown dynamics, known rewards.
- Estimate procedures:
- Monte Carlo: high variance, no bias; no Markov assumptions; converges to MSE-optimal estimates.
- Temporal difference: low variance, high bias; needs Markov conditions; converges to the DP solution on the empirical distribution.
- Monte-Carlo policy iteration: MC + PI, online on-policy.
- SARSA: TD policy iteration, online on-policy.
- Q-learning: action-value iteration; can use TD (canonical) or MC to approximate action-value function, then act greedily.
- Policy Methods explores an orthogonal approach by directly optimizing policies. The foundational policy gradient theorem 3.1 is improved in two ways: variance reduction via a baseline (theorem 3.2), and trust-region-style updates via the relative policy difference theorem 3.7.
- Inverse RL explores MDP solutions with unknown rewards, replacing them with estimates from expert demonstration; orthogonal to dynamics estimation methods.
- Main problem: the identifiability problem of deducing rewards from expert policies; a partial answer is the reward shaping theorem 4.1.
- Foundational result is max-entropy IRL which has clean closed-form optimization.
- RL for LLMs: eliciting rewards from preferences; direct preference optimization.
- Exploration and bandits: the regret framework for analyzing decision optimality. The single-state, memoryless MDP (the bandit setting) serves as the foundational block for MDP exploration methods.
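The value-iteration story in the Markov Decision Processes chapter can be sketched concretely. Below is a minimal sketch on a hypothetical 2-state, 2-action MDP (all dynamics and rewards here are made up for illustration), applying the Bellman optimality operator until its \(\gamma\)-contraction drives the values to the fixed point, then reading off the greedy policy:

```python
import numpy as np

gamma = 0.9
# P[s, a, s']: transition probabilities (hypothetical numbers).
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # from state 0
    [[0.5, 0.5], [0.0, 1.0]],   # from state 1
])
# R[s, a]: expected immediate reward (hypothetical numbers).
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])

def bellman_optimality(V):
    # (T V)(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s, a, s') V(s') ]
    return (R + gamma * P @ V).max(axis=1)

V = np.zeros(2)
for _ in range(200):
    # Each application shrinks ||V - V*||_inf by a factor of gamma.
    V = bellman_optimality(V)

# Greedy policy with respect to the (near-)optimal action values.
greedy = (R + gamma * P @ V).argmax(axis=1)
```

Here state 1 with action 1 is an absorbing loop paying reward 2, so its optimal value is \(2/(1-\gamma) = 20\); the contraction guarantees convergence to this fixed point regardless of the initialization.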
Salient themes:
- Methodological approach: exact solutions in strong models (finite MDPs with known dynamics). RL consists of a diverse toolkit for tackling relaxed assumptions.
- Large state-action space: use deep model function approximation.
- Unknown dynamics: use sampling; interpolate between temporal-difference and Monte Carlo.
- Unknown rewards: reward modeling (e.g. using max-entropy).
- Exploration: entropy bonus, bandit-inspired optimism.
- Two foundational cornerstones in RL: (action-)value function and policy. The first is contingent upon DP reductions based on Markov assumptions, giving models more “bite” when assumptions hold.
- Accurate action-value function implies good policy: Q-learning.
- Policy gradient training can benefit from good value-function estimates.
- Applicable themes in life:
- Policy-gradient theorem: weighing individual actions by rollout rewards is unbiased despite seeming greedy.
- Improve by bootstrapping or baseline.
- Correct for off-policy by importance sampling.
- Be an optimist under uncertainty! Better to be battered by reality and learn from it than to miss out.
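The baseline theme above can be checked numerically. A minimal sketch, on a hypothetical one-step two-armed bandit with a softmax policy (logits and rewards made up): the score-function estimate \(\nabla \log \pi(a)\,(r(a) - b)\) stays unbiased for any constant baseline \(b\), since \(\mathbb{E}[\nabla \log \pi] = 0\), while a good baseline shrinks its variance:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1])      # policy logits (hypothetical)
rewards = np.array([1.0, 0.0])     # deterministic reward per arm (hypothetical)

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_log_pi(a, theta):
    # Gradient of log softmax: e_a - pi(theta).
    g = -pi(theta)
    g[a] += 1.0
    return g

p = pi(theta)
# Exact policy gradient: sum_a pi(a) r(a) grad log pi(a).
exact = sum(p[a] * rewards[a] * grad_log_pi(a, theta) for a in range(2))

def samples(n, baseline):
    # Per-sample score-function estimates with a constant baseline.
    acts = rng.choice(2, size=n, p=p)
    return np.array([grad_log_pi(a, theta) * (rewards[a] - baseline) for a in acts])

s_plain = samples(50_000, baseline=0.0)
s_base = samples(50_000, baseline=p @ rewards)  # baseline = E[r]
```

Both sample means agree with the exact gradient up to Monte Carlo noise, but the per-sample variance with the mean-reward baseline is substantially smaller.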
Cool theorems
- Markov Decision Processes: contraction properties 1.3, optimality of \(Q^*\)-greedy policies 1.2; convergence bounds for VI and PI.
- Sample-based MDP solutions: N/A; definitions only.
- Policy methods: PG theorem 3.1, baseline theorem 3.2, relative policy difference theorem 3.7.
- The proof of the relative policy difference theorem is a canonical example of applying large-deviation theory and \(f\)-divergences.
- Inverse RL: reward shaping theorem 4.1, max-entropy reduction theorem 4.2.
- Proof theme: Boltzmann methods, and interchange between single and nested optimizations.
- RL for LLMs: DPO reduction; variational characterization of KL and using analytic solutions to inner optimizations.
- Bandits and exploration: Lai-Robbins lower bound 5.1 and UCB regret bound 5.3.
- Proof theme: split the event into “typical” and “atypical” subevents according to the likelihood ratio; typical subevents admit a direct bound, while atypical subevents are bounded via the Radon-Nikodym derivative and KL properties.
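The optimism-under-uncertainty theme behind the UCB regret bound can be made concrete. A minimal UCB1 sketch on a hypothetical two-armed Bernoulli bandit (arm means made up): adding the confidence radius \(\sqrt{2 \ln t / n_a}\) to each empirical mean makes under-explored arms look good, yet pulls concentrate on the better arm and suboptimal pulls grow only logarithmically:

```python
import math
import random

random.seed(0)
means = [0.9, 0.1]   # hypothetical Bernoulli arm means

counts = [0, 0]      # pulls per arm
sums = [0.0, 0.0]    # total reward per arm

def pull(a):
    return 1.0 if random.random() < means[a] else 0.0

T = 2000
for t in range(1, T + 1):
    if t <= 2:
        a = t - 1    # initialize: play each arm once
    else:
        # UCB1 index: empirical mean + sqrt(2 ln t / n_a).
        a = max(range(2),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]))
    counts[a] += 1
    sums[a] += pull(a)
```

With a gap of \(\Delta = 0.8\) between the arms, the UCB analysis predicts on the order of \((8 \ln T)/\Delta^2 \approx 100\) suboptimal pulls over \(T = 2000\) rounds, so nearly all pulls go to arm 0.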