Preface

Course notes accompanying the Fall 2024 iteration of Information Theory: From Coding to Learning (Polyanskiy and Wu 2025). These notes mostly follow the textbook.

Recurring themes

Fundamentally, information and entropy capture how our uncertainty about a quantity changes after observations. Some recurring themes in the book are:

  1. Fundamental properties of information measures (e.g. entropy, mutual information, divergence, Fisher information) are:
    • Nonnegativity.
    • Monotonicity: the joint provides more information than the marginals. For especially well-behaved measures (entropy, KL divergence, Fisher information) we expect an additive decomposition, i.e. a chain rule (displayed, together with the DPI, after this list).
    • Data-processing inequality.
  2. Divergence as a fundamental concept in information theory.
    1. Quantifies the difference between distributions; the right choice of quantification depends on the problem.
    2. Divergence gives rise to mutual information; monotonicity yields DPI.
  3. Convexity of \(f\) for \(f\)-divergence is mathematically equivalent to our information intuitions: data-processing inequality, monotonicity, etc.
    1. KL divergence (and Rényi divergence) is special because its \(\log\) form admits an additive decomposition of the joint divergence.
  4. Convexity of information measures corresponds to variational characterizations, which are extremely useful because:
    1. They provide bounds: choose the varied quantity to be our friend!
    2. They provide tractable variational approximations via numerical optimization methods (e.g. VAEs, GANs).
    3. Combined with tensorization and convexity properties, such results yield powerful asymptotics for high-dimensional problems.
  5. Mutual information as information radius, or center of gravity; capacity as a minimax saddle point (see the display after this list).
  6. Divergence measures are extremely useful for bounding events: bound the probability of the event under a tractable distribution, bound the divergence between the tractable distribution and the given one, and thereby bound the probability of the event under the given distribution.
  7. Operationally, entropy \(H\) is the answer to compression, mutual information \(I\) is the answer to universal compression, and divergence \(D\) is the answer to hypothesis testing.
  8. Sanov's theorem establishes a fundamental bridge between probability and divergence (see the display after this list).
  9. When the noise is not i.i.d., without loss of generality we can whiten it by compression.
  10. The nontrivial results in information theory provide asymptotic solutions to integer programming questions (e.g. channel coding, rate distortion) using information quantities, where convexity yields tensorization and tractable solutions.
    • Information problems are not amortizable by going from scalars to vectors, but by crossing back over the operational-information bridge we obtain amortizable operational results that take advantage of vectorization.
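
For concreteness, in the case of KL divergence the chain rule and data-processing inequality of theme 1 take the following standard forms (the second assumes \(P_Y\) and \(Q_Y\) are obtained by passing \(P_X\) and \(Q_X\) through the same channel \(P_{Y|X}\)):

\[
D(P_{XY} \,\|\, Q_{XY}) = D(P_X \,\|\, Q_X) + D(P_{Y|X} \,\|\, Q_{Y|X} \mid P_X),
\qquad
D(P_Y \,\|\, Q_Y) \le D(P_X \,\|\, Q_X).
\]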
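
The information-radius view of theme 5 can be written as a minimization over output distributions, and capacity as the corresponding saddle point (stated here for finite alphabets; the minimizing \(Q_Y\) is the capacity-achieving output distribution, the center of gravity):

\[
I(X;Y) = \min_{Q_Y} D(P_{Y|X} \,\|\, Q_Y \mid P_X),
\qquad
C = \max_{P_X} \min_{Q_Y} D(P_{Y|X} \,\|\, Q_Y \mid P_X) = \min_{Q_Y} \max_{x} D(P_{Y|X=x} \,\|\, Q_Y).
\]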
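
Theme 8 refers to Sanov's theorem: for \(n\) i.i.d. samples from \(P\) with empirical distribution \(\hat P_n\) and a sufficiently regular set of distributions \(E\), to first order in the exponent,

\[
\mathrm{Pr}\bigl[\hat P_n \in E\bigr] = \exp\Bigl(-n \inf_{Q \in E} D(Q \,\|\, P) + o(n)\Bigr).
\]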

Nontrivial results in this book include:

  1. Golden formula: variational characterization of mutual information (displayed after this list).
  2. Saddle point characterization of divergence.
  3. Donsker-Varadhan (theorem 5.5): variational characterization of KL divergence (displayed after this list), extended to \(f\)-divergences in theorem 8.9.
  4. Harremoës-Vajda (theorem 8.8): one theorem to rule them all for the \(f\)-divergence joint range.
  5. Most \(f\)-divergences are locally \(\chi^2\)-like: as \(\lambda \to 0^+\), \(D_f(\lambda P + \bar \lambda Q \| Q) = \frac{f''(1)}{2}\lambda^2 \chi^2(P\|Q) + o(\lambda^2)\), i.e. the divergence decays quadratically (theorem 8.11).
  6. Kraft-McMillan (theorem 10.3): constructive bijection between code lengths satisfying Kraft's inequality (displayed after this list) and prefix-free codes.
  7. Duality between Bayes and minimax risk (proposition 9.1): at the heart of every minimax theorem is the hyperplane separation theorem.
  8. van Trees inequality (theorem 9.5): using the variational \(\chi^2\) formula, we obtain an information bound for general (not just unbiased) estimators.
  9. Large-sample mean deviation reduces to information projection (theorem 11.9).
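
For reference, the golden formula of result 1: for any auxiliary output distribution \(Q_Y\) with \(D(P_Y \| Q_Y) < \infty\),

\[
I(X;Y) = D(P_{Y|X} \,\|\, Q_Y \mid P_X) - D(P_Y \,\|\, Q_Y),
\]

so minimizing over \(Q_Y\) recovers the variational characterization \(I(X;Y) = \min_{Q_Y} D(P_{Y|X} \,\|\, Q_Y \mid P_X)\).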
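
The Donsker-Varadhan characterization of result 3, in its standard form (the supremum ranges over, say, bounded measurable functions \(f\)):

\[
D(P \,\|\, Q) = \sup_{f} \; \mathbb{E}_P[f(X)] - \log \mathbb{E}_Q\bigl[e^{f(X)}\bigr].
\]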
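
The inequality behind result 6: a binary prefix-free (indeed, any uniquely decodable) code with codeword lengths \(\ell_1, \dots, \ell_m\) must satisfy

\[
\sum_{i=1}^{m} 2^{-\ell_i} \le 1,
\]

and conversely any lengths satisfying this inequality are realized by an explicitly constructed prefix-free code.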

Insightful proof techniques

  1. Asymptotic analysis proofs: rewrite a difference as an integral, then apply the monotone convergence theorem.
  2. Unique extremality proofs: reduce the desired inequality to the information inequality (theorem 3.1) or a nonnegative information quantity (e.g. entropy).
    • Corollary 3.1, theorem 3.3, lemma 10.2; the latter two characterize the Gaussian and geometric distributions.
    • Occasionally we need to introduce a tilted distribution based on both inputs; see Donsker-Varadhan (5.5).
  3. Bounds on conditional quantities: consider the joint quantity and decompose it in two different ways.
    • KL-DPI theorem 3.5.
  4. Meta-converse: instead of studying \(P_{X\hat X}\) for arbitrary \(\hat X\), introduce a tractable \(Q_{X\hat X}\) to bound an event, then use the DPI (or divergence inequalities like Donsker-Varadhan) to bound the same event under \(P_{X\hat X}\) in terms of some divergence quantity.
    • Fano's inequality 4.3 (displayed after this list).
  5. Convexity proofs: introduce an additional parameter \(\theta \sim \mathrm{Ber}(\lambda)\) to rewrite \(\lambda f(x)+\bar \lambda f(y)\) as a conditional quantity, then use the chain-rule decomposition (sketched after this list).
  6. Counting bounds: to bound \(|\mathcal C|\), draw an element uniformly from \(\mathcal C\), then upper-bound its joint entropy via the marginal bound, Han's inequality, or the chain rule (see the display after this list).
  7. Typicality methods (compression): break the region into a tractable region of large probability and a difficult region of low probability.
    • \(\mathrm{Pr}[A]= \mathrm{Pr}[A, B] + \mathrm{Pr}[A, B^c] \leq \mathrm{Pr}[B] + \mathrm{Pr}[A, B^c]\), then bound \(\mathrm{Pr}[A, B^c]\) through atypicality of \(B^c\): source coding distribution theorem 10.2.
  8. Break a single \(\sup\) into nested ones.
    • Solve the inner \(\sup\) analytically after introducing a redundant linear (in fact, any parameterized bijective) transform: useful characterizations of \(\chi^2\) (8.5) and KL (proposition 8.12).
    • Alternate between the inner and outer closed-form solutions: the EM algorithm.
  9. Large-sample asymptotic proofs: relate the quantity to an information measure, then use its tensorization property and convexity to reduce to the single-letter case (theorem 11.9).
  10. The frameworks for understanding channel coding, lossless compression, and rate distortion (lossy compression) are very similar.
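
A minimal instance of technique 5, proving joint convexity of \((P, Q) \mapsto D(P \| Q)\): let \(\theta \sim \mathrm{Ber}(\lambda)\) select which pair is active, so that decomposing the joint divergence both ways (with the same law for \(\theta\) on both sides) gives

\[
\lambda D(P_1 \,\|\, Q_1) + \bar\lambda D(P_0 \,\|\, Q_0)
= D(P_{X|\theta} \,\|\, Q_{X|\theta} \mid P_\theta)
\ge D(P_X \,\|\, Q_X)
= D(\lambda P_1 + \bar\lambda P_0 \,\|\, \lambda Q_1 + \bar\lambda Q_0).
\]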
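
A one-line instance of technique 6: draw \(U = (U_1, \dots, U_n)\) uniformly from \(\mathcal C \subseteq \mathcal X^n\); then

\[
\log |\mathcal C| = H(U) = H(U_1, \dots, U_n) \le \sum_{i=1}^{n} H(U_i),
\]

so any per-coordinate entropy bound becomes a bound on \(|\mathcal C|\).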
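
One standard form of Fano's inequality from technique 4, for \(X\) taking values in a finite alphabet \(\mathcal X\), with \(P_e = \mathrm{Pr}[\hat X \neq X]\) and \(h_b\) the binary entropy function:

\[
H(X \mid \hat X) \le h_b(P_e) + P_e \log\bigl(|\mathcal X| - 1\bigr).
\]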

Bibliography

Polyanskiy, Yury, and Yihong Wu. 2025. Information Theory: From Coding to Learning. Cambridge University Press.