Reinforcement Learning

Self-study notes on reinforcement learning at an introductory graduate level. Covers classical MDP theory as well as the modern toolkit for relaxed assumptions and learning-based methods.

Sources include:

Chapter highlights {-}

  1. Markov Decision Processes explores closed-form solutions to finite- and infinite-horizon MDPs with known dynamics and rewards.
    • Recursive and optimality (Bellman) operators; contraction properties.
    • The control problem reduces to iterating (policy evaluation + greedy improvement), i.e. policy iteration.
  2. Sample-based MDP solutions explores MDP solutions with unknown dynamics, known rewards.
    • Estimate procedures:
      • Monte-Carlo: high variance, no bias; no Markov assumptions; converges to MSE-optimal estimates.
      • Temporal difference: low variance, high bias; requires Markov conditions; converges to the DP solution on the empirical distribution.
    • Monte-Carlo policy iteration: MC + PI, online on-policy.
    • SARSA: TD policy iteration, online on-policy.
    • Q-learning: action-value iteration; can use TD (canonical) or MC to approximate action-value function, then act greedily.
  3. Policy Methods explores an orthogonal approach by directly optimizing policies. The foundational policy gradient theorem is improved in two ways: variance reduction (baselines and bootstrapping) and off-policy correction via importance sampling.
  4. Inverse RL explores MDP solutions with unknown rewards, replacing them with estimates from expert demonstration; orthogonal to dynamics estimation methods.
    • Main problem: the identifiability problem of deducing rewards from expert policies; a partial answer is the reward shaping theorem.
    • Foundational result is max-entropy IRL which has clean closed-form optimization.
  5. RL for LLMs: eliciting rewards from preferences; direct preference optimization.
  6. Exploration and bandits: regret framework for analyzing decision optimality. Single-state memoryless MDP serves as the foundational block for MDP exploration methods.
    • Lai-Robbins lower bound: a fundamental lower bound on how much trial and error is needed.
    • Upper-confidence bound bandits dominate lower-confidence bound ones. Be an optimist in life!
    • UCB regret bound.
    • Given priors (in a Bayesian setting), use Thompson sampling.
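The contraction property in item 1 is easy to see in code: repeatedly applying the Bellman optimality backup converges to the unique fixed point V*, from which the greedy policy is read off. A minimal sketch, where the 2-state, 2-action MDP (transitions, rewards, discount) is entirely made up for illustration:

```python
import numpy as np

# Toy MDP: P[s, a, s'] transition probabilities, R[s, a] expected rewards.
# All numbers are illustrative, not from the notes.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
])
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup: V(s) <- max_a [R(s,a) + gamma * E_{s'} V(s')].
    # The operator is a gamma-contraction in sup norm, so this converges.
    Q = R + gamma * P @ V          # shape (2, 2): action-values under current V
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)          # act greedily w.r.t. the converged values
```

The same loop with a fixed policy's action (instead of the max) is policy evaluation, and alternating evaluation with the greedy step above is exactly policy iteration.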
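Item 2's Q-learning never sees the dynamics: it bootstraps the action-value function from sampled transitions via the TD target r + gamma * max_a' Q(s', a'), while behaving epsilon-greedily. A minimal tabular sketch on a hypothetical 1-D chain environment (chain length, rewards, and hyperparameters are illustrative assumptions):

```python
import random

N = 5                      # chain states 0..4; entering state 4 yields reward 1
GAMMA, ALPHA, EPS = 0.95, 0.1, 0.2

def step(s, a):
    """a = 0 moves left, a = 1 moves right; the episode ends at the right end."""
    s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
    r = 1.0 if s2 == N - 1 else 0.0
    return s2, r, s2 == N - 1

Q = [[0.0, 0.0] for _ in range(N)]
random.seed(0)
for _ in range(2000):
    s, done = 0, False
    while not done:
        # epsilon-greedy behaviour policy; the TD target below uses the greedy
        # max, which is what makes Q-learning off-policy.
        a = random.randrange(2) if random.random() < EPS else max(0, 1, key=lambda x: Q[s][x])
        s2, r, done = step(s, a)
        target = r if done else r + GAMMA * max(Q[s2])
        Q[s][a] += ALPHA * (target - Q[s][a])   # TD update toward the target
        s = s2

greedy = [max(0, 1, key=lambda a: Q[s][a]) for s in range(N - 1)]
```

Replacing the max in the target with the action actually taken next turns this into SARSA, the on-policy variant.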
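Item 6's optimism principle is concrete in the UCB1 algorithm: each round, play the arm maximizing its empirical mean plus an exploration bonus that shrinks as the arm is tried. A minimal sketch on a 3-armed Bernoulli bandit (arm means are invented for illustration):

```python
import math
import random

random.seed(1)
means = [0.2, 0.5, 0.8]        # assumed true success probabilities
K, T = len(means), 5000
counts = [0] * K               # pulls per arm
values = [0.0] * K             # empirical mean reward per arm

for t in range(1, T + 1):
    if t <= K:
        a = t - 1              # play each arm once to initialize
    else:
        # UCB1 index: empirical mean + sqrt(2 ln t / n_a). The bonus is an
        # optimistic confidence width: rarely-pulled arms look attractive.
        a = max(range(K), key=lambda i: values[i] + math.sqrt(2 * math.log(t) / counts[i]))
    r = 1.0 if random.random() < means[a] else 0.0
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]   # incremental mean update
```

Suboptimal arms end up pulled only O(log T) times, matching the Lai-Robbins rate up to constants.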

Salient themes:

  1. Methodological approach: exact solutions in strong models (finite MDPs with known dynamics). RL consists of a diverse toolkit for tackling relaxed assumptions.
    • Large state-action space: use function approximation (e.g. deep models).
    • Unknown dynamics: use sampling; interpolate between temporal-difference and Monte-Carlo.
    • Unknown rewards: reward modeling (e.g. using max-entropy).
    • Exploration: entropy bonus, bandit-inspired optimism.
  2. Two foundational cornerstones in RL: (action-)value function and policy. The first is contingent upon DP reductions based on Markov assumptions, giving models more “bite” when assumptions hold.
    • Accurate action-value function implies good policy: Q-learning.
    • Policy gradient training can benefit from good value-function estimates.
  3. Applicable themes in life:
    • Policy-gradient theorem: weighting individual actions by whole-rollout rewards gives an unbiased gradient, naive as it seems.
      • Improve by bootstrapping or baseline.
      • Correct for off-policy by importance sampling.
    • Be an optimist under uncertainty! Better to be battered by and learn from reality than to miss out.
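The baseline trick in theme 3 can be sketched on a one-step problem (two actions with made-up mean rewards): subtracting a running-mean baseline from the return leaves the softmax policy gradient unbiased while reducing its variance.

```python
import math
import random

random.seed(0)
theta = [0.0, 0.0]             # softmax logits over the two actions
means = [0.2, 0.8]             # assumed expected reward per action (illustrative)
baseline, lr, beta = 0.0, 0.1, 0.9

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

for _ in range(2000):
    p = softmax(theta)
    a = 0 if random.random() < p[0] else 1     # sample an action from the policy
    r = 1.0 if random.random() < means[a] else 0.0
    baseline = beta * baseline + (1 - beta) * r  # running-mean baseline
    adv = r - baseline                           # advantage estimate
    # REINFORCE update: grad log pi(a) for a softmax policy is one-hot(a) - p.
    for i in range(2):
        grad = (1.0 if i == a else 0.0) - p[i]
        theta[i] += lr * adv * grad

p = softmax(theta)
```

Because the baseline does not depend on the sampled action, its contribution to the expected gradient is zero, which is why subtracting it preserves unbiasedness.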

Cool theorems {-}