6.7480 Notes
2025-08-15
Preface
Course notes accompanying Fall 2024’s iteration of Information Theory: from Coding to Learning (Polyanskiy and Wu 2025). Mostly follows the textbook.
Recurring themes
Fundamentally, information and entropy capture how our uncertainty about a quantity changes after observations. Some recurring themes in the book are:
- Fundamental properties of information measures (e.g. entropy, mutual information, divergence,
Fisher information) are:
- Nonnegativity.
- Monotonicity: joint provides more information than marginals. For especially well-behaved ones (entropy, KL, Fisher information) we expect additive decomposition, i.e. chain rule.
- Data-processing inequality.
- Divergence as a fundamental concept in information theory.
- Quantifies difference between distributions; the choice of this quantification depends on our problem.
- Divergence gives rise to mutual information; monotonicity yields DPI.
- Convexity of \(f\) for \(f\)-divergence is mathematically equivalent
to our information intuitions: data-processing inequality, monotonicity, etc.
- KL-divergence (and Rényi) is special because its \(\log\) form admits additive decomposition of joint divergence.
- Convexity of information measures corresponds to variational characterizations, which are extremely useful because:
- They provide bounds: choose the varied quantity to be our friend!
- They provide tractable variational approximations via numerical optimization methods (e.g. VAE, GAN).
- Combined with tensorization and convexity properties, such results yield powerful asymptotics for high-dimensional problems.
- Mutual information as information radius, or center of gravity; capacity as a minimax saddle point.
- Divergence measures are extremely useful for bounding events: bound the event for a tractable distribution, bound the divergence between the tractable and given distribution, and we can bound the event for the given distribution.
- Operationally, entropy \(H\) is the answer to compression, mutual information \(I\) is the answer to universal compression, and divergence \(D\) is the answer to hypothesis testing.
- The Sanov principle establishes a fundamental bridge between probability and divergence.
- When the noise is not i.i.d., we can without loss of generality whiten it by compression.
- The nontrivial results in information theory provide asymptotic solutions to integer programming questions (e.g. channel coding, rate distortion) using information quantities, where convexity yields tensorization and tractable solutions.
- Information problems are not amortizable merely by going from scalars to vectors; rather, by undoing the operation-information bridge, we obtain amortizable operational results that take advantage of vectorization.
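As a concrete sanity check on these themes, the following sketch (assuming NumPy; the joint distribution is made up for illustration) verifies nonnegativity, the chain rule, and mutual information as the divergence \(D(P_{XY} \| P_X \otimes P_Y)\) on a small joint distribution:

```python
import numpy as np

def H(p):
    """Shannon entropy in bits of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def D(p, q):
    """KL divergence D(p || q) in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# A small joint distribution P_XY (rows: X, columns: Y), chosen arbitrarily.
P = np.array([[0.30, 0.10],
              [0.20, 0.40]])
Px, Py = P.sum(axis=1), P.sum(axis=0)

# Mutual information arises from divergence: I(X;Y) = D(P_XY || P_X x P_Y).
I = D(P.ravel(), np.outer(Px, Py).ravel())

# Chain rule (additive decomposition): H(X,Y) = H(X) + H(Y|X).
H_joint = H(P.ravel())
H_cond = H_joint - H(Px)           # H(Y|X) via the chain rule
assert I >= 0                      # nonnegativity
assert np.isclose(I, H(Py) - H_cond)  # I(X;Y) = H(Y) - H(Y|X)
assert H(Py) >= H_cond             # conditioning reduces entropy
```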
Nontrivial results in this book include:
- Golden formula: variational characterization of mutual information.
- Saddle point characterization of divergence.
- Donsker-Varadhan (theorem 5.5): variational characterization of KL, extended to \(f\)-divergences in theorem 8.9.
- Harremoës–Vajda (theorem 8.8): one theorem to rule them all for the \(f\)-divergence joint range.
- Most \(f\)-divergences are locally \(\chi^2\)-like: \(D_f(\lambda P + \bar \lambda Q \| Q)\) decays quadratically as \(\lambda \to 0^+\) (theorem 8.11).
- Kraft-McMillan (theorem 10.3): constructive bijection between inequality-satisfying code lengths and prefix-free codes.
- Duality between Bayes and minimax risk (proposition 9.1): at the heart of every minimax theorem is the hyperplane separation theorem.
- van Trees inequality (theorem 9.5): using the variational \(\chi^2\) formula, obtain an information bound for general (not just unbiased) estimators.
- Large-sample mean deviation reduces to information projection (theorem 11.9).
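Donsker–Varadhan can be checked numerically. A minimal sketch for discrete distributions (the distributions below are made up for illustration): the supremum in \(D(P\|Q) = \sup_f \mathbb{E}_P[f] - \log \mathbb{E}_Q[e^f]\) is attained at \(f^* = \log(dP/dQ)\), and every other \(f\) gives a valid lower bound.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])  # P, arbitrary for illustration
q = np.array([0.2, 0.3, 0.5])  # Q, arbitrary for illustration

kl = np.sum(p * np.log(p / q))  # D(P || Q) in nats

def dv_objective(f):
    """Donsker-Varadhan objective E_P[f] - log E_Q[e^f] for a test function f."""
    return np.dot(p, f) - np.log(np.dot(q, np.exp(f)))

# The supremum is attained at f* = log(dP/dQ).
f_star = np.log(p / q)
assert np.isclose(dv_objective(f_star), kl)

# Any other f is a lower bound: "choose the varied quantity to be our friend".
rng = np.random.default_rng(0)
for _ in range(100):
    f = rng.normal(size=3)
    assert dv_objective(f) <= kl + 1e-12
```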
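The constructive direction of Kraft–McMillan can likewise be sketched: given lengths satisfying \(\sum_i 2^{-l_i} \le 1\), assign codewords in order of increasing length as the binary expansions of the running Kraft sum (a sketch of the standard construction; the helper name is my own):

```python
def kraft_code(lengths):
    """Build a binary prefix-free code from lengths with sum(2^-l) <= 1."""
    assert sum(2 ** -l for l in lengths) <= 1, "Kraft inequality violated"
    L = max(lengths)
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    code = [None] * len(lengths)
    acc = 0  # running Kraft sum, stored as an integer multiple of 2^{-L}
    for i in order:
        l = lengths[i]
        # Codeword = first l bits of the binary expansion of the running sum.
        code[i] = format(acc >> (L - l), f"0{l}b")
        acc += 1 << (L - l)
    return code

codes = kraft_code([1, 2, 3, 3])  # sum 2^-l = 1, so the code is complete
# yields ['0', '10', '110', '111'], and no codeword prefixes another
assert all(i == j or not a.startswith(b)
           for i, a in enumerate(codes) for j, b in enumerate(codes))
```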
Insightful proof techniques
- Asymptotic analysis proofs: rewrite a difference as an integral, then apply the monotone convergence theorem.
- Unique extremality proofs: reduce the desired inequality to the information inequality (theorem 3.1) or a nonnegative information quantity (e.g. entropy).
- Bounds on conditional quantities: consider the joint quantity and decompose it in two different ways.
- KL-DPI theorem 3.5.
- Meta-converse: instead of studying \(P_{X\hat X}\) for
arbitrary \(\hat X\), introduce a tractable \(Q_{X\hat X}\),
to bound an event, then use DPI (or divergence inequalities like Donsker-Varadhan)
to bound the same event for \(P_{X\hat X}\) together with some
divergence quantity.
- Fano’s inequality 4.3.
- Convexity proofs: introduce an additional parameter
\(\theta \sim \mathrm{Ber}(\lambda)\) to rewrite
\(\lambda f(x)+\bar \lambda f(y)\) as a conditional quantity,
then use decomposition.
- See section on convexity.
- Counting bounds: to bound \(|\mathcal C|\), draw an element uniformly from \(\mathcal C\), then upper-bound the joint-entropy by marginal bound, Han’s inequality, or chain rule.
- Typicality methods (compression): break region into tractable region of
large probability and difficult region with low probability.
- \(\mathrm{Pr}[A]= \mathrm{Pr}[A, B] + \mathrm{Pr}[A, B^c] \leq \mathrm{Pr}[B] + \mathrm{Pr}[A, B^c]\), then bound \(\mathrm{Pr}[A, B^c]\) through atypicality of \(B^c\): source coding distribution theorem 10.2.
- Break a single \(\sup\) into nested ones.
- Large-sample asymptotic proofs: relate the problem to an information measure, then use its tensorization and convexity properties to reduce to the single-letter case (theorem 11.9).
- The frameworks for understanding channel coding (lossy and lossless) as well as rate distortion are very similar.
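As an instance of the Bernoulli-parameter trick above, concavity of entropy (for distributions \(P, Q\) on a common alphabet):

```latex
Let $\theta \sim \mathrm{Ber}(\lambda)$ with $X \mid \theta = 1 \sim P$ and
$X \mid \theta = 0 \sim Q$, so that marginally $X \sim \lambda P + \bar\lambda Q$.
Then
\[
\lambda H(P) + \bar\lambda H(Q) = H(X \mid \theta) \le H(X)
  = H(\lambda P + \bar\lambda Q),
\]
since conditioning reduces entropy; concavity of $H$ follows.
```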
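And the counting-bound technique in miniature: for \(X\) drawn uniformly from \(\mathcal C \subseteq \{0,1\}^n\), we have \(\log_2 |\mathcal C| = H(X) \le \sum_i H(X_i)\). A sketch with a made-up code \(\mathcal C\):

```python
import math
from collections import Counter

def H_bits(counts):
    """Entropy in bits of the empirical distribution given by counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

# C: an arbitrary set of binary strings (here the even-parity code, n = 3).
C = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
n = len(C[0])

# For X ~ Uniform(C): log2|C| = H(X), bounded by the sum of marginal entropies.
log_size = math.log2(len(C))
marginals = [Counter(x[i] for x in C) for i in range(n)]
marginal_sum = sum(H_bits(list(m.values())) for m in marginals)

assert log_size <= marginal_sum + 1e-12  # log|C| = H(X) <= sum_i H(X_i)
```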
Bibliography
Polyanskiy, Yury, and Yihong Wu. 2025. Information Theory: From Coding to Learning. Cambridge University Press.