Preface

These notes accompany a summer 2025 self-study of Reinforcement Learning. Reference resources include:

Chapter highlights

  1. Markov Decision Processes explores closed-form solutions to finite and infinite-horizon MDPs with known dynamics and rewards.
    1. Bellman evaluation and optimality operators; their contraction properties.
    2. The control problem reduces to iterating policy evaluation + greedy improvement, i.e. policy iteration.
  2. Sample-based MDP solutions explores MDP solutions with unknown dynamics, known rewards.
    • Estimation procedures:
      • Monte-Carlo: high variance, no bias; requires no Markov assumption; converges to MSE-optimal estimates.
      • Temporal difference: low variance, high bias; requires Markov conditions; converges to the DP solution on the empirical distribution.
    • Monte-Carlo policy iteration: MC + PI, online on-policy.
    • SARSA: TD policy iteration, online on-policy.
    • Q-learning: action-value iteration; can use TD (canonical) or MC to approximate action-value function, then act greedily.
  3. Policy Methods explores an orthogonal approach: directly optimizing policies. The foundational policy gradient theorem 3.1 is improved in two ways:
    • Reducing variance: baseline theorem 3.2.
      • The best baseline is the state-value function, so good policy methods depend on good value-function models.
    • Using off-policy data: relative policy difference theorem 3.3.
      • The surrogate policy loss corrects for off-policy data, contingent on policy proximity 3.7; it extends naturally to the PPO-Clip algorithm 3.8.
      • V-trace corrects the observed values by (clipped) importance sampling; this corrects both value and policy losses.
  4. Inverse RL explores MDP solutions with unknown rewards, replacing them with estimates from expert demonstration; orthogonal to dynamics estimation methods.
    • Main problem: the identifiability problem of deducing rewards from expert policies; a partial answer is the reward shaping theorem 4.1.
    • The foundational result is max-entropy IRL, which admits a clean closed-form optimization.
  5. RL for LLMs: eliciting rewards from preferences; direct preference optimization.
  6. Exploration and bandits: regret framework for analyzing decision optimality. Single-state memoryless MDP serves as the foundational block for MDP exploration methods.
    • Lai-Robbins lower bound 5.1: a fundamental lower bound on how much trial and error is needed.
    • Upper-confidence bound bandits dominate lower-confidence bound ones. Be an optimist in life!
    • UCB regret bound 5.3.
    • Given priors (in a Bayesian setting), use Thompson sampling.
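A minimal sketch of chapter 1's machinery: value iteration on a hypothetical two-state, two-action MDP (all transition and reward numbers are invented for illustration). Because the Bellman optimality operator is a \(\gamma\)-contraction, repeated application converges to \(V^*\), and the greedy policy with respect to \(V^*\) is optimal.

```python
import numpy as np

# Toy MDP, invented for illustration:
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def bellman_optimality(V):
    """One application of the Bellman optimality operator T*."""
    Q = R + gamma * P @ V      # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
    return Q.max(axis=1)

V = np.zeros(2)
for _ in range(500):           # T* is a gamma-contraction, so iteration converges
    V = bellman_optimality(V)

policy = (R + gamma * P @ V).argmax(axis=1)   # greedy policy w.r.t. V*
```

Policy iteration would instead alternate exact policy evaluation with the same greedy step; both solvers reach the same fixed point here.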
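For chapter 2's setting of unknown dynamics, here is a sketch of canonical (TD-based) Q-learning on the same kind of invented two-state MDP, where the agent only *samples* transitions rather than reading off `P`. The hyperparameters (`alpha`, `eps`, step count) are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical MDP used only as a sampler; the agent never reads P directly.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma, alpha, eps = 0.9, 0.1, 0.1

Q = np.zeros((2, 2))
s = 0
for _ in range(50_000):
    # Epsilon-greedy behavior policy (online, with exploration).
    a = int(rng.integers(2)) if rng.random() < eps else int(Q[s].argmax())
    s_next = int(rng.choice(2, p=P[s, a]))       # sample the unknown dynamics
    # TD target bootstraps on max_a' Q(s', a'): the canonical Q-learning update.
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

policy = Q.argmax(axis=1)   # act greedily w.r.t. the learned action-values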

Salient themes:

  1. Methodological approach: exact solutions in strong models (finite MDPs with known dynamics). RL consists of a diverse toolkit for tackling relaxed assumptions.
    • Large state-action space: use deep model function approximation.
    • Unknown dynamics: use sampling; interpolate between temporal-difference and Monte-Carlo.
    • Unknown rewards: reward modeling (e.g. using max-entropy).
    • Exploration: entropy bonus, bandit-inspired optimism.
  2. Two foundational cornerstones in RL: (action-)value function and policy. The first is contingent upon DP reductions based on Markov assumptions, giving models more “bite” when assumptions hold.
    • Accurate action-value function implies good policy: Q-learning.
    • Policy gradient training can benefit from good value-function estimates.
  3. Applicable themes in life:
    • Policy-gradient theorem: weighting individual actions by whole-rollout rewards gives an unbiased gradient estimate, despite seeming naive.
      • Improve by bootstrapping or baseline.
      • Correct for off-policy by importance sampling.
    • Be an optimist under uncertainty! Better to be battered by and learn from reality than to miss out.
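The policy-gradient theme above can be sketched numerically: a one-step softmax policy over two actions (a hypothetical setup, with an invented reward of 1 for the better arm plus unit Gaussian noise). Both the plain REINFORCE estimator and the baselined one are unbiased for the same gradient; the baseline only shrinks variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_log_pi(theta, a):
    """Score function for a sigmoid policy: d/dtheta log pi(a | theta)."""
    p1 = 1.0 / (1.0 + np.exp(-theta))   # pi(a = 1)
    return a - p1                        # equals 1 - p1 if a = 1, else -p1

def pg_estimates(theta, baseline, n=10_000):
    p1 = 1.0 / (1.0 + np.exp(-theta))
    a = (rng.random(n) < p1).astype(float)                 # sample actions
    r = np.where(a == 1, 1.0, 0.0) + rng.normal(0, 1, n)   # noisy reward, arm 1 better
    g = grad_log_pi(theta, a) * (r - baseline)             # REINFORCE: (r - b) * score
    return g.mean(), g.var()

mean_nb, var_nb = pg_estimates(0.0, baseline=0.0)   # no baseline
mean_b, var_b = pg_estimates(0.0, baseline=0.5)     # b = E[r] under pi, near-optimal here
# Both means estimate the same true gradient (0.25 at theta = 0);
# the baselined estimator has strictly lower variance.
```

Reweighting `g` by an importance ratio `pi_target(a) / pi_behavior(a)` would give the off-policy correction from theorem 3.3's line of argument.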

Cool theorems

  • Markov Decision Processes: contraction properties 1.3, optimality of \(Q^*\)-greedy policies 1.2. Convergence bounds for VI and PI.
  • Sample-based MDP solutions: N/A, just definitions.
  • Policy methods: PG theorem 3.1, baseline theorem 3.2, relative policy difference theorem 3.7.
    • The proof of the relative policy difference theorem is a canonical example of applying large-deviation theory and \(f\)-divergences.
  • Inverse RL: reward shaping theorem 4.1, max-entropy reduction theorem 4.2.
    • Proof theme: Boltzmann methods, and interchange between single and nested optimizations.
  • RL for LLMs: DPO reduction; variational characterization of KL and using analytic solutions to inner optimizations.
  • Bandits and exploration: Lai-Robbins lower bound 5.1 and UCB regret bound 5.3.
    • Proof theme: split the event into “typical” and “atypical” subevents according to the likelihood ratio; typical subevents yield a direct bound, while atypical subevents are bounded via Radon-Nikodym derivatives and KL properties.
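To make the UCB regret bound concrete, here is a UCB1 sketch on a hypothetical three-armed Bernoulli bandit (the arm means and horizon are invented for illustration). The optimism bonus \(\sqrt{2\log t / n_a}\) shrinks for well-sampled arms, so suboptimal arms are pulled only \(O(\log T)\) times, matching the logarithmic regret of theorem 5.3 and the Lai-Robbins rate.

```python
import numpy as np

rng = np.random.default_rng(0)

means = [0.3, 0.5, 0.7]        # hypothetical Bernoulli arm means; arm 2 is best
K, T = len(means), 20_000

counts = np.zeros(K)           # n_a: pulls per arm
sums = np.zeros(K)             # running reward totals per arm

# Initialize by pulling each arm once (UCB index is undefined at n_a = 0).
for a in range(K):
    counts[a] += 1
    sums[a] += float(rng.random() < means[a])

for t in range(K, T):
    # UCB1 index: empirical mean + optimism bonus sqrt(2 log t / n_a).
    ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
    a = int(ucb.argmax())      # be an optimist: play the most promising arm
    counts[a] += 1
    sums[a] += float(rng.random() < means[a])

# Pseudo-regret: pulls of each arm weighted by its gap to the best arm.
pseudo_regret = sum(c * (max(means) - m) for c, m in zip(counts, means))
```

A Bayesian variant would replace the index with a draw from each arm's posterior (Thompson sampling), as noted in chapter 6.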