Preface

These notes accompany a summer 2025 self-study of Reinforcement Learning. Reference resources include:

Chapter highlights

  1. Markov Decision Processes explores closed-form solutions to finite and infinite-horizon MDPs with known dynamics and rewards.
    1. Bellman evaluation and optimality operators; their contraction properties.
    2. The control problem reduces to iterating policy evaluation + greedy improvement, i.e. policy iteration.
  2. Sample-based MDP solutions explores MDP solutions with unknown dynamics, known rewards.
    • Estimation procedures:
      • Monte-Carlo: high variance, no bias; requires no Markov assumption; converges to MSE-optimal estimates.
      • Temporal difference: low variance, high bias; requires Markov conditions; converges to the DP solution on the empirical distribution.
    • Monte-Carlo policy iteration: MC + PI, online on-policy.
    • SARSA: TD policy iteration, online on-policy.
    • Q-learning: action-value iteration; can use TD (canonical) or MC to approximate action-value function, then act greedily.
  3. Policy Methods explores an orthogonal approach: directly optimizing policies. The foundational policy gradient theorem 3.1 is improved in two ways:
    • Reducing variance: baseline theorem 3.2.
      • The best baseline is the state-value function, so good policy methods depend on good value-function models.
    • Using off-policy data: relative policy difference theorem 3.3.
      • The surrogate policy loss corrects for off-policy data, contingent on policy proximity 3.7; it extends naturally to the PPO-Clip algorithm 3.8.
      • V-trace corrects the observed values by (clipped) importance sampling; this corrects both value and policy losses.
  4. Inverse RL explores MDP solutions with unknown rewards, replacing them with estimates from expert demonstration; orthogonal to dynamics estimation methods.
    • Main problem: the identifiability problem of deducing rewards from expert policies; a partial answer is the reward shaping theorem 4.1.
    • The foundational result is max-entropy IRL, which admits a clean closed-form optimization.
  5. RL for LLMs: eliciting rewards from preferences; direct preference optimization.
  6. Exploration and bandits: regret framework for analyzing decision optimality. Single-state memoryless MDP serves as the foundational block for MDP exploration methods.
    • Lai-Robbins lower bound 5.1: a fundamental lower bound on how much trial and error is needed.
    • Upper-confidence bound bandits dominate lower-confidence bound ones. Be an optimist in life!
    • UCB regret bound 5.3.
    • Given priors (in a Bayesian setting), use Thompson sampling.
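A minimal sketch of chapter 1's machinery: value iteration on a hypothetical two-state, two-action MDP (all transition and reward numbers are invented for illustration). Because the Bellman optimality operator is a \(\gamma\)-contraction, repeated application converges to \(V^*\), and the greedy policy with respect to \(V^*\) is optimal.

```python
import numpy as np

# Toy MDP, invented for illustration:
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def bellman_optimality(V):
    """One application of the Bellman optimality operator T*."""
    Q = R + gamma * P @ V      # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
    return Q.max(axis=1)

V = np.zeros(2)
for _ in range(500):           # T* is a gamma-contraction, so iteration converges
    V = bellman_optimality(V)

policy = (R + gamma * P @ V).argmax(axis=1)   # greedy policy w.r.t. V*
```

Policy iteration would instead alternate exact policy evaluation with the same greedy step; both solvers reach the same fixed point here.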
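For chapter 2's setting of unknown dynamics, here is a sketch of canonical (TD-based) Q-learning on the same kind of invented two-state MDP, where the agent only *samples* transitions rather than reading off `P`. The hyperparameters (`alpha`, `eps`, step count) are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical MDP used only as a sampler; the agent never reads P directly.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma, alpha, eps = 0.9, 0.1, 0.1

Q = np.zeros((2, 2))
s = 0
for _ in range(50_000):
    # Epsilon-greedy behavior policy (online, with exploration).
    a = int(rng.integers(2)) if rng.random() < eps else int(Q[s].argmax())
    s_next = int(rng.choice(2, p=P[s, a]))       # sample the unknown dynamics
    # TD target bootstraps on max_a' Q(s', a'): the canonical Q-learning update.
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

policy = Q.argmax(axis=1)   # act greedily w.r.t. the learned action-values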

Salient themes:

  1. Methodological approach: exact solutions in strong models (finite MDPs with known dynamics). RL consists of a diverse toolkit for tackling relaxed assumptions.
    • Large state-action space: use deep model function approximation.
    • Unknown dynamics: use sampling; interpolate between temporal-difference and Monte-Carlo.
    • Unknown rewards: reward modeling (e.g. using max-entropy).
    • Exploration: entropy bonus, bandit-inspired optimism.
  2. Two foundational cornerstones in RL: (action-)value function and policy. The first is contingent upon DP reductions based on Markov assumptions, giving models more “bite” when assumptions hold.
    • Accurate action-value function implies good policy: Q-learning.
    • Policy gradient training can benefit from good value-function estimates.
  3. Applicable themes in life:
    • Policy-gradient theorem: weighting individual actions by whole-rollout rewards gives an unbiased gradient estimate, despite seeming naive.
      • Improve by bootstrapping or baseline.
      • Correct for off-policy by importance sampling.
    • Be an optimist under uncertainty! Better to be battered by and learn from reality than to miss out.
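The policy-gradient theme above can be sketched numerically: a one-step softmax policy over two actions (a hypothetical setup, with an invented reward of 1 for the better arm plus unit Gaussian noise). Both the plain REINFORCE estimator and the baselined one are unbiased for the same gradient; the baseline only shrinks variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_log_pi(theta, a):
    """Score function for a sigmoid policy: d/dtheta log pi(a | theta)."""
    p1 = 1.0 / (1.0 + np.exp(-theta))   # pi(a = 1)
    return a - p1                        # equals 1 - p1 if a = 1, else -p1

def pg_estimates(theta, baseline, n=10_000):
    p1 = 1.0 / (1.0 + np.exp(-theta))
    a = (rng.random(n) < p1).astype(float)                 # sample actions
    r = np.where(a == 1, 1.0, 0.0) + rng.normal(0, 1, n)   # noisy reward, arm 1 better
    g = grad_log_pi(theta, a) * (r - baseline)             # REINFORCE: (r - b) * score
    return g.mean(), g.var()

mean_nb, var_nb = pg_estimates(0.0, baseline=0.0)   # no baseline
mean_b, var_b = pg_estimates(0.0, baseline=0.5)     # b = E[r] under pi, near-optimal here
# Both means estimate the same true gradient (0.25 at theta = 0);
# the baselined estimator has strictly lower variance.
```

Reweighting `g` by an importance ratio `pi_target(a) / pi_behavior(a)` would give the off-policy correction from theorem 3.3's line of argument.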

Cool theorems

  • Markov Decision Processes: contraction properties 1.3, optimality of \(Q^*\)-greedy policies 1.2. Convergence bounds for VI and PI.
  • Sample-based MDP solutions: N/A, just definitions.
  • Policy methods: PG theorem 3.1, baseline theorem 3.2, relative policy difference theorem 3.7.
    • The proof of the relative policy difference theorem is a canonical example of applying large-deviation theory and \(f\)-divergences.
  • Inverse RL: reward shaping theorem 4.1, max-entropy reduction theorem 4.2.
    • Proof theme: Boltzmann methods, and interchange between single and nested optimizations.
  • RL for LLMs: DPO reduction; variational characterization of KL and using analytic solutions to inner optimizations.
  • Bandits and exploration: Lai-Robbins lower bound 5.1 and UCB regret bound 5.3.
    • Proof theme: split the event into “typical” and “atypical” subevents according to the likelihood ratio; typical subevents yield a direct bound, while atypical subevents are bounded via Radon-Nikodym derivatives and KL properties.
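To make the UCB regret bound concrete, here is a UCB1 sketch on a hypothetical three-armed Bernoulli bandit (the arm means and horizon are invented for illustration). The optimism bonus \(\sqrt{2\log t / n_a}\) shrinks for well-sampled arms, so suboptimal arms are pulled only \(O(\log T)\) times, matching the logarithmic regret of theorem 5.3 and the Lai-Robbins rate.

```python
import numpy as np

rng = np.random.default_rng(0)

means = [0.3, 0.5, 0.7]        # hypothetical Bernoulli arm means; arm 2 is best
K, T = len(means), 20_000

counts = np.zeros(K)           # n_a: pulls per arm
sums = np.zeros(K)             # running reward totals per arm

# Initialize by pulling each arm once (UCB index is undefined at n_a = 0).
for a in range(K):
    counts[a] += 1
    sums[a] += float(rng.random() < means[a])

for t in range(K, T):
    # UCB1 index: empirical mean + optimism bonus sqrt(2 log t / n_a).
    ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
    a = int(ucb.argmax())      # be an optimist: play the most promising arm
    counts[a] += 1
    sums[a] += float(rng.random() < means[a])

# Pseudo-regret: pulls of each arm weighted by its gap to the best arm.
pseudo_regret = sum(c * (max(means) - m) for c, m in zip(counts, means))
```

A Bayesian variant would replace the index with a draw from each arm's posterior (Thompson sampling), as noted in chapter 6.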