12 Lecture notes

Oct 2: Fisher information, classical minimax estimation

Agenda:

  1. \(\chi^2\) variational characterization.
  2. \(1\)-parameter families; minimax rate.
  3. HCR, and Fisher information.
  4. Cramér-Rao; van Trees inequality.

Key takeaways:

  1. \(R_n^*\) is the minimax rate of parameter estimation.
  2. Le Cam-Hájek theory: \(R_n^* = \dfrac{1+o(1)}{n \min_\theta I_F(\theta)}\).
  3. To obtain more well-behaved inequalities, expand a single extremum into a nested one (e.g. introduce a scalar multiple), then solve the inner optimization in closed form.
  4. All \(f\)-divergences are locally quadratic in parametric families, with Hessian given by Fisher information.
  5. Cramér-Rao is an application of the variational characterization of \(\chi^2\).

Main content

Recalling the variational characterization: \[ D_f(P\|Q) = \sup_g \mathbb E_P\, g - \mathbb E_Q\, f^*\circ g \] where the convex conjugate is given by \[ f^*(h) = \sup_{t\in \mathbb R} th - f(t), \quad f(t) = \sup_{h\in \mathbb R} th - f^*(h) \] Recall that Donsker-Varadhan is not linear in \(\mathbb E_Q\), but there is a standard trick to rewrite the expectation.

We now apply the variational characterization to \(\chi^2\):

Proposition 12.1 (variational characterization of χ²) \(\chi^2(P \|Q) = \sup_{h:\mathcal X\to \mathbb R} \mathbb E_P\, h(X) - \mathbb E_Q \left[ h(X) + \dfrac{h(X)^2}{4} \right]\)

Expanding this out, the first term is very useful, but the second is not. The first step is to break a single extremum into two nested ones: \[\begin{align} \chi^2(P \|Q) &= \sup_h \left[ \mathbb E_P h - \mathbb E_Q h \right] - \dfrac 1 4 \mathbb E_Q h^2 \\ &= \sup_g \sup_{\lambda\in \mathbb R} \lambda \left( \mathbb E_P g - \mathbb E_Q g \right) - \dfrac{\lambda^2}{4} \mathbb E_Q g^2 = \sup_g \dfrac{(\mathbb E_P g - \mathbb E_Q g)^2}{\mathbb E_Q g^2} \end{align}\] As a consequence, we obtain \[ \left( \mathbb E_P g - \mathbb E_Q g \right)^2 \leq \chi^2(P \|Q)\, \mathbb E_Q g^2 \] In fact, the left-hand side is invariant under \(g\mapsto g + c\), so optimizing over the shift \(c\) yields \[ \left( \mathbb E_P g - \mathbb E_Q g \right)^2 \leq \chi^2(P \|Q)\, \mathrm{Var}_Q\, g \]
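
As a quick numerical sanity check, here is a minimal sketch (the pair \(P, Q\) below is a hypothetical example): the variational lower bound \((\mathbb E_P g - \mathbb E_Q g)^2/\mathrm{Var}_Q\, g\) is tight at \(g = dP/dQ\).

```python
import numpy as np

# Hypothetical P, Q on a 4-letter alphabet.
p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])

chi2 = np.sum(p**2 / q) - 1.0  # chi^2(P||Q) = sum p^2/q - 1

def bound(g):
    """(E_P g - E_Q g)^2 / Var_Q(g), a lower bound on chi^2 for any g."""
    num = (np.dot(p, g) - np.dot(q, g)) ** 2
    var = np.dot(q, g**2) - np.dot(q, g) ** 2
    return num / var

print("chi^2             :", chi2)
print("bound at g = p/q  :", bound(p / q))  # tight: the optimizer is dP/dQ
print("bound at random g :", bound(np.random.default_rng(0).normal(size=4)))
```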

Exercise: \(\forall g>0\), we have \(\mathbb E_P g \leq 2\mathbb E_Q g + 2_? H^2(P, Q)\).

One-parameter families; minimax rates

Statisticians care about sub-manifolds of the probability simplex.

For one-parameter families, we typically have \[ P^\theta(dx) = p^\theta(x) \mu(dx) \] For discrete, \(\mu\) is counting; for real-valued, \(\mu\) is Lebesgue.

Parameter estimation: given \(X_1, \cdots, X_n \sim P^\theta\); find function \(\hat \theta(X_1, \cdots, X_n)\) such that \(\hat \theta \approx \theta\); here \(\approx\) is defined w.r.t. a risk function \[ R(\hat \theta, \theta) = \mathbb E_{X^n} (\theta - \hat \theta(X^n))^2 \] Note that the first argument is a function, while the second is a parameter coordinate.

Example: \(\hat \theta_{\mathrm{MLE}} = \dfrac 1 n \sum X_j\) for Bernoulli parameter estimation; we can compute \[ R(\hat \theta_{\mathrm{MLE}}, \theta) = \dfrac{\theta \bar \theta}{n} \] Recall that this is the MLE and is unbiased. To get rid of the \(\theta\)-dependence, we can consider the worst-case risk \[ R^{\mathrm{max}}(\hat \theta) = \sup_\theta R(\hat \theta, \theta) \]

Consider the naive estimator \(\hat \theta_1 = 1/2\) (this has bias but no variance) and the interpolation \[ \hat \theta_\lambda = \lambda \hat \theta_1 + \bar \lambda \hat \theta_{\mathrm{MLE}} \] One can compute risk = bias² + variance and choose the optimal \(\lambda_n^*= \dfrac 1 {1 + \sqrt n}\); this equalizes the risk over \(\theta\).

Theorem The optimal shrinkage estimator \(\hat \theta_{\lambda_n^*}\) achieves the minimax risk \[ \inf_{\hat \theta} R^{\mathrm{max}}(\hat \theta) = \dfrac 1 {4(1 + \sqrt n)^2} \] Key idea: obtain the MLE; identify the points of worst performance (minimum Fisher information), then bias towards them.
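
A quick numeric illustration of the theorem, a minimal sketch with \(n = 100\) (an arbitrary choice): the shrinkage risk is constant in \(\theta\) and equals the minimax value, while the MLE is worse near \(\theta = 1/2\).

```python
import numpy as np

n = 100
lam = 1.0 / (1.0 + np.sqrt(n))  # optimal shrinkage weight lambda_n^*

thetas = np.linspace(0.0, 1.0, 11)
risk_mle = thetas * (1 - thetas) / n  # variance of the unbiased MLE
# theta_lam = lam * (1/2) + (1 - lam) * MLE, risk = bias^2 + variance:
risk_shrink = lam**2 * (0.5 - thetas) ** 2 \
    + (1 - lam) ** 2 * thetas * (1 - thetas) / n

print("minimax value 1/(4(1+sqrt(n))^2) =", 1 / (4 * (1 + np.sqrt(n)) ** 2))
for th, r0, r1 in zip(thetas, risk_mle, risk_shrink):
    print(f"theta={th:.1f}: MLE risk={r0:.6f}, shrinkage risk={r1:.6f}")
```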

The optimal minimax risk profile (minimax rate) is \[ R_n^* = \inf_{\hat \theta} \sup_\theta R(\hat \theta, \theta) \]

HCR inequality; Fisher information

Proposition 12.2 \(\forall \hat \theta\) and \(\theta, \theta_0\), we have \[ \sup_{\theta'} R(\hat\theta, \theta') \geq \mathrm{Var}_{P^{\theta_0}}\, \hat \theta \geq \dfrac{ \left(\mathbb E_{P^\theta} \hat \theta - \mathbb E_{P^{\theta_0}} \hat \theta\right)^2 }{ \chi^2(P^{\theta \otimes n}\| P^{\theta_0 \otimes n}) } \] Taking \(\inf_{\hat\theta}\) of the right-hand side lower-bounds \(R_n^*\).

Proof: The first inequality follows from bias-variance decomposition. For the second, take \(g(X^n) = \hat \theta(X^n)\) in the variational characterization of \(\chi^2\).

To bring Fisher information into the picture, consider \[ \chi^2 (P^\theta \| P^{\theta_0}) = \sum_x \dfrac{P^\theta(x)^2}{P^{\theta_0}(x)} - 1 \] Take \(\theta = \theta_0 + \epsilon\) for small \(\epsilon\) and Taylor-expand: \[\begin{align} P^\theta(x) = P^{\theta_0}(x) + \epsilon\, \partial_{\theta }P^\theta(x) \big|_{\theta = \theta_0} + \cdots \end{align}\] Substitute this into the expression for \(\chi^2\) to obtain \[ \chi^2 = \sum_x \dfrac{\left[ P^{\theta_0}(x) + \epsilon\, \partial_{\theta }P^\theta(x) \right]^2}{P^{\theta_0}(x)} - 1 + o(\epsilon^2) \] Expanding the square, the linear term is \(0\) (pull \(\partial_{\theta}\) out of the sum): \[ \sum_x \partial_{\theta }P^{\theta}(x) \big|_{\theta = \theta_0} = \partial_\theta \sum_x P^{\theta}(x) = 0 \] The zeroth-order term sums to \(1\), cancelling the \(-1\), yielding the local quadratic behavior of \(\chi^2\): \[ \chi^2 = \epsilon^2 \sum_x \dfrac{\left[\partial_{\theta }P^{\theta}(x)\big|_{\theta=\theta_0}\right]^2} {P^{\theta_0}(x)} + o(\epsilon^2) \]

Recall that the Fisher information is given by \[ \mathcal J_F(\theta; \{P^\theta\}_{\theta \in \Theta}) = \int \dfrac{\left[\partial_{\theta }P^\theta(x)\right]^2}{P^\theta(x)} \, \mu(dx) \] so the Fisher information is just the coefficient of the quadratic term in the local expansion of \(\chi^2\).
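
A numeric check of this local expansion for the Bernoulli family, where \(\mathcal J_F(\theta) = \frac{1}{\theta(1-\theta)}\) (a sketch; \(\theta_0 = 0.3\) is arbitrary):

```python
import numpy as np

theta0 = 0.3
J = 1.0 / (theta0 * (1 - theta0))  # Fisher information of Ber(theta) at theta0

def chi2(t, t0):  # chi^2(Ber(t) || Ber(t0))
    return (t - t0) ** 2 / t0 + (t - t0) ** 2 / (1 - t0)

def kl(t, t0):    # D(Ber(t) || Ber(t0)), in nats
    return t * np.log(t / t0) + (1 - t) * np.log((1 - t) / (1 - t0))

for eps in [0.1, 0.01, 0.001]:
    print(f"eps={eps:5}: chi2/(eps^2 J) = {chi2(theta0+eps, theta0)/(eps**2*J):.4f}, "
          f"2 KL/(eps^2 J) = {2*kl(theta0+eps, theta0)/(eps**2*J):.4f}")
# chi^2 equals eps^2 J_F exactly for Bernoulli; KL approaches (eps^2/2) J_F.
```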

There are two notions of distance here: parameter distance and statistical distance. Fisher information tells us, when we nudge \(\theta \mapsto \theta + \epsilon\), how large the statistical distance between \(P^\theta\) and \(P^{\theta + \epsilon}\) becomes.

Moreover, Fisher information is additive under tensorization: KL is additive under tensorization, so apply the locality principle.

Theorems about Fisher information require a lot more assumptions.

Oct 7: Data compression

Theorem 12.1 (local expansion of χ²) For \(\theta_0 = 0\), when the following conditions hold:

  1. \(\theta \in [0, \tau)\).
  2. There exists \(\dot p^\theta(x)\) satisfying \[ P^\theta(x) = P^0(x) + \int_0^\theta \dot p^t(x)\, dt \]
  3. The Fisher information is defined everywhere on an open interval about \(\theta_0\): \[ \int dx\, \sup_{0\leq t < \tau} \dfrac{\dot p^t(x)^2}{p^0(x)} < \infty \]

Then \(\chi^2(P^{\theta_0 + \delta} \| P^{\theta_0}) = \delta^2 \mathcal J_F(\theta_0) + o(\delta^2)\).

The quadratic local behavior of \(\chi^2\) requires not only finite Fisher information at \(\theta_0\); it also requires \(\mathcal J_F\) to be finite almost everywhere on a nontrivial interval.

In summary, \(\chi^2\) is nice for the “good” cases where \(\mathcal J\) is finite a.e. However, for some irregular cases it’s best to use \(H^2(P^{\theta_0 + \delta}, P^{\theta_0})\), whose local behavior is always determined by the local Fisher information.

A location family is a one-parameter family for which the parameter controls the displacement of a fixed density \(\nu\), e.g. \(\mathcal N(\theta, 1)\); for this family, \(\mathcal J(\theta) = \mathcal J(0)\), which we can abbreviate as \[ \mathcal J(\nu) = \int \dfrac{(\nu')^2}{\nu} \, dx \] Note that Fisher information \(\mathcal J_F\) is originally defined for a one-parameter family; with the notation \(\mathcal J(\nu)\), we’re assuming the one-parameter family to be the location family of the given density.

Theorem 12.2 (van Trees inequality) Under regularity assumptions on \(P^\theta\), for every estimator \(\hat \theta(X)\) and prior \(\pi\) \[ \mathbb E_{\theta \sim \pi} \mathbb E_{X\sim P^\theta} \left[ [\theta - \hat \theta(X)]^2 \right] \geq \dfrac 1 {\mathcal J(\pi) + \mathbb E_{\theta \sim \pi} \mathcal J_F(\theta)} \]

Corollary 12.1 For \(n\) i.i.d. observations, for every \(\pi\) \[ R_n^* \geq \dfrac 1 {\mathcal J(\pi) + n\mathbb E_{\theta \sim \pi} \mathcal J_F(\theta)} \]

van Trees is important because it provides a lower bound on minimax risk based on local quantities \(\mathcal J_F(\theta)\).
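
As an illustration, here is a sketch in the Gaussian location model (my choice of example, not worked out in the lecture): for \(X_i\sim\mathcal N(\theta, 1)\) with prior \(\pi = \mathcal N(0, s^2)\) we have \(\mathcal J_F\equiv 1\) and \(\mathcal J(\pi) = 1/s^2\), and the van Trees bound is in fact achieved by the posterior mean.

```python
import numpy as np

n, s2 = 10, 0.5
van_trees = 1.0 / (1.0 / s2 + n)  # 1 / (J(pi) + n * E J_F)

# Monte Carlo Bayes risk of the posterior mean (the Bayes estimator):
rng = np.random.default_rng(1)
trials = 200_000
theta = rng.normal(0.0, np.sqrt(s2), size=trials)
xbar = theta + rng.normal(0.0, np.sqrt(1.0 / n), size=trials)
theta_hat = xbar * n / (n + 1.0 / s2)  # posterior mean for the N(0, s2) prior
print("van Trees bound:", van_trees)
print("empirical risk :", np.mean((theta - theta_hat) ** 2))  # matches the bound
```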

Two takeaways for information theorists:

  1. In the Bayesian setting, apply information theory techniques to the joint \(P_{\theta, X}\).
  2. In local neighborhoods, \(\chi^2\) satisfies an additive chain rule (since it’s close to KL).

van Trees replaces \(\mathcal J_F\) at a point with the prior-average of \(\mathcal J_F\), and it applies to every estimator (not just unbiased ones).

Oct 9: Data compression II

Overview:

  1. Review
  2. Distribution of the compression length
  3. Arithmetic encoder
  4. Stationary, ergodic data sources
  5. Lempel-Ziv (adaptive, universal compression)

Main takeaways:

  1. AEP.
  2. Optimal compression is possible using (1) arithmetic encoding and (2) optimal next-token prediction.

Information theory is about bounding hard, intractable operational quantities with mathematically analyzable quantities.

Review

Assume \(P_X\) is known and sorted: \(P_X(1)\geq P_X(2)\geq \cdots\). The optimum compressor \(f^*\) assigns the \(x\)-th most likely symbol a string of length \[ l(f^*(x)) = \lfloor \log_2 x\rfloor \] We also proved that the expected optimum compression length satisfies \[ \mathbb E\, l(f^*(X)) \approx H(X) \] In fact, it is upper-bounded by \(H(X)\). But this is not an algorithm!

  1. Sorting the distribution is intractable.
  2. It’s variable-length but not prefix-free! We’re assuming one-shot compression (the comma is free). However, for prefix-free codes, the optimal expected length is lower-bounded by \(H(X)\).
  3. Proof idea: typicality; break the space into a typical (tractable) region and an atypical region of vanishing probability.
  4. Asymptotic equipartition property: i.i.d. implies the law of large numbers (in log-space).

The compression length is distributed (up to constants) like the entropy density. This is great because the entropy density tensorizes.

Proposition 12.3 (distribution of compression length) Define the random variable \(L^* = l(f^*(X))\). Denote the entropy density \(i_X(a) = -\log P_X(a)\); the optimal compressor compresses symbol \(a\) to roughly this length: \[ \mathrm{Pr}[i_X(X) \leq k] \leq \mathrm{Pr}[L^* \leq k] \leq \mathrm{Pr}[i_X(X) \leq k + \tau] + 2^{1-\tau} \] This holds for every \(k\in \mathbb Z_+, \tau>0\).

Proof: The left bound is easy: \[ L^*(x) = \lfloor \log_2 x\rfloor \leq \log_2 x \leq \log_2 \dfrac 1 {P_X(x)} = i_X(x), \] since the \(x\)-th largest probability satisfies \(P_X(x)\leq 1/x\). For the right bound, decompose the probability \[\begin{align} \mathrm{Pr}[L^* \leq k] &= \mathrm{Pr}[L^* \leq k, i_X(X) \leq k+\tau] + \mathrm{Pr}[L^* \leq k, i_X(X) > k+\tau] \\ &\leq \mathrm{Pr}[i_X(X) \leq k+\tau] + (\cdots) \end{align}\] The second term is bounded by the number of strings with \(L^*\leq k\), at most \(2^{k+1}\), times the maximum probability each such string can have, \(2^{-k-\tau}\), yielding \(2^{1-\tau}\).

Corollary: if, for some sequence of r.v.'s, the normalized entropy density converges in distribution to \(U\), then the normalized optimal compression length (for asymptotically large block length) also converges to \(U\).

Another corollary: if \(S_j\sim P_S\) i.i.d., then \[ \dfrac 1 n i_{S^n}(S^n) = -\dfrac 1 n \log P_{S^n}(S^n) = -\dfrac 1 n \sum_{j=1}^n \log P_S(S_j) \to H(S) \] This implies that the normalized expected optimal compression length for an i.i.d. source converges to \(H(S)\) in the limit of asymptotically large block lengths.

This is a nontrivial result, because the optimal compressor is a very freely-specified object, but we are able to bound its behavior very neatly.

In particular, recall that the optimal compressor maps the highest-probability atoms to the shortest strings (starting from the empty string), but we see from the asymptotic Gaussian distribution of the compression length that these atoms carry almost negligible mass. This is a demonstration of the asymptotic equipartition property, which states that for i.i.d. sources, the overwhelming majority of sequences have roughly the same probability \(e^{-nH(X)}\).
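
A quick simulation of the AEP (a sketch; the ternary source below is a hypothetical example): the normalized entropy density concentrates at \(H(S)\).

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])        # hypothetical i.i.d. source on {0,1,2}
H = -np.sum(p * np.log2(p))          # entropy in bits
rng = np.random.default_rng(0)

for n in [100, 10_000, 1_000_000]:
    s = rng.choice(3, size=n, p=p)
    density = -np.mean(np.log2(p[s]))  # (1/n) * i_{S^n}(S^n)
    print(f"n={n:>9}: -(1/n) log2 P(S^n) = {density:.4f}   (H = {H:.4f})")
```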

Arithmetic encoder

Suppose we are given a general process \(\{X_j\}\) and wish to compress \(X_1^n\).

First consider an i.i.d. process. Order the alphabet and recursively partition the interval \([0, 1]\) so that each message's interval has length equal to its probability. The trick is to find the largest dyadic interval (from the recursive binary partition) that fits inside the interval of the message. For example: \[ [0, 1]\to \emptyset, \quad [6\cdot 2^{-3}, 7\cdot 2^{-3}]\to 110 \] In the limit as the codestring length goes to infinity, the distribution of the binary expansion becomes uniform.

Fact (compression length of the arithmetic encoder for i.i.d. sources): \[ \log_2 \dfrac 1 {P_{X^n}(x^n)} \leq l(f_{\mathrm{ae}}(x^n)) \leq \log_2 \dfrac 1 {P_{X^n}(x^n)} + 2 \] The arithmetic encoder is additionally sequential: it does not need to consume the full string before it starts outputting the compression; the same holds for the decompressor.
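
A minimal sketch of the interval computation for an i.i.d. \(\mathrm{Ber}(p)\) string (using infinite-precision floats for clarity; real encoders renormalize): the final interval width equals \(P_{X^n}(x^n)\), and a dyadic interval of depth \(\lceil \log_2(1/w)\rceil + 1\) always fits inside an interval of width \(w\).

```python
import numpy as np

def ae_interval(bits, p):
    """[low, low + width): the arithmetic-coding interval of a Ber(p) string."""
    low, width = 0.0, 1.0
    for b in bits:
        if b == 0:
            width *= (1 - p)           # keep the lower sub-interval
        else:
            low += (1 - p) * width     # skip past the '0' sub-interval
            width *= p
    return low, width

p = 0.2
rng = np.random.default_rng(0)
bits = (rng.random(60) < p).astype(int)
low, width = ae_interval(bits, p)

neg_log_p = -np.log2(width)                 # width = P_{X^n}(x^n) exactly
code_len = int(np.ceil(neg_log_p)) + 1      # a dyadic interval of this depth fits
print(f"-log2 P(x^n) = {neg_log_p:.2f}, code length <= {code_len} bits")
```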

To implement the arithmetic encoder for general non-i.i.d. distributions, simply partition each subsequent interval according to the conditionals \(P_{X_t | X^{t-1}}\).

This means that, if we can sequentially predict the conditionals \(P_{X_t |X^{t-1}}\) very well (e.g. a next-token-prediction LLM), then we can compress a non-i.i.d. distribution close to optimally by combining the predictor with the arithmetic encoder.

Theorem 12.3 (Shannon-McMillan-Breiman) The AEP holds w.r.t. the entropy rate if \(\{X_j\}\) is a stationary (this ensures existence of the entropy rate) ergodic process.

Shannon’s proof: every stationary ergodic process can be arbitrarily well approximated by an \(m\)-th order Markov chain as \(m\to \infty\).

Lempel-Ziv

Central question: how to compress well without \(P_{X^n}\)?

Lemma 12.1 (Kac's lemma) Given a stationary ergodic process \((X_j)_{j\in \mathbb Z}\), define \(L = \inf\{t>0: X_{-t}=X_0\}\); then \[ \mathbb E[L | X_0=u] = \mathrm{Pr}[X_0=u]^{-1} \]

Proof: consider the probability that we don’t see \(u\) when we look back for \(k\) steps, using stationarity: \[\begin{align} \mathrm{Pr}[L>k, X_0=u] &= \mathrm{Pr}[X_0=u, X_{-1}\neq u, \cdots, X_{-k}\neq u] \\ &= \mathrm{Pr}[X_k=u, X_{k-1}\neq u, \cdots, X_0\neq u] = \mathrm{Pr}[E_k] \end{align}\] (todo) Another key point is that for stationary ergodic processes, \(\mathrm{Pr}[\bigcup E_k] = 1\).

Using Kac’s lemma, we can form an unbiased estimate of \(\mathrm{Pr}[X_0=u]\) by looking back.

Do probability mixtures satisfy all of the local Fisher regularity conditions?

Oct 28: Binary hypothesis testing

Agenda:

  1. Finish universal compression.
  2. Define binary hypothesis testing; define \(\mathcal R(P, Q)\).
  3. Stein’s lemma.

More on universal compression

Recall that universal compression aims to compress \(X^n \sim P_{X^n}\) down to the entropy, where \(P_{X^n}\) is unknown and drawn from a class \(\Pi\).

The fundamental limit of compression is the redundancy of class \(\Pi\), defined by \[ R_n^*(\Pi) = \inf_{Q_{X^n}} \sup_{P_{X^n}\in \Pi} \mathbb E_{X^n\sim P_{X^n}} \log \dfrac 1 {Q_{X^n}(X^n)} - H(X^n) \] Note the order of \(\inf\) and \(\sup\)! Else the “redundancy” would be trivial.

It turns out that the uniform mixture of all Bernoullis, \[ Q_{X^n} = \int_0^1 d\theta\, \mathrm{Ber}(\theta)^{\otimes n}, \] satisfies \(D(\mathrm{Ber}(\theta)^{\otimes n} \| Q_{X^n}) \leq \log n\) approximately, for all \(\theta\). To see this, for non-extremal \(\theta, \theta'\) we have \[ d(\theta \| \theta') \approx \dfrac 1 2 (\theta - \theta')^2 \] The distribution \(Q_{X^n}\) is the universal (or maximally ignorant) probability assignment.

Universal data compression (probability assignment) \(\approx\) choosing the best prior.

The most important theorem: redundancy is equal to the following capacity: \[ R_n^*(\Pi) = \sup_{\pi \in \mathcal P(\Pi)} I(\theta; X^n) \]

Jeffreys (objective) prior: given smoothness conditions for a family \(\Theta\), we have \[\begin{align} R_n^*(\Theta) &= \dfrac d 2 \log \dfrac{n}{2\pi e} + \log \int_{\Theta} \sqrt{\det \mathcal J_F(\theta)}\, d\theta + o(1) \\ \pi_n^* &\to \pi_\infty^* = \dfrac 1 Z \sqrt{\det \mathcal J_F(\theta)}\, d\theta \quad \text{(weakly)} \end{align}\] The objective prior is invariant to reparameterizations. Importantly, for Bernoullis (\(k=2\)) the Jeffreys prior is \[ \pi^*(\theta) = \dfrac 1 Z \dfrac 1 {\sqrt{\theta(1 - \theta)}}, \quad \text{i.e. } \mathrm{Beta}(1/2, 1/2); \text{ for general } k, \ \mathrm{Dirichlet}(1/2, \dots, 1/2). \] The sequential predictor \(Q_{X_t|X_1^{t-1}}\) turns out to be the add-half estimator (the KT code).
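
A sketch of the add-half (KT) sequential assignment on bits, checking that the codelength overhead above the empirical entropy is roughly \(\frac12\log_2 n\), matching the \(\frac d2 \log n\) redundancy with \(d=1\) (the values \(\theta = 0.3\), \(n = 10^4\) are arbitrary):

```python
import numpy as np

def kt_codelength(bits):
    """Codelength (bits) of the Krichevsky-Trofimov sequential assignment."""
    ones, total, length = 0, 0, 0.0
    for b in bits:
        p1 = (ones + 0.5) / (total + 1.0)  # add-half predictive probability
        length += -np.log2(p1 if b else 1.0 - p1)
        ones += b
        total += 1
    return length

rng = np.random.default_rng(0)
theta, n = 0.3, 10_000
bits = (rng.random(n) < theta).astype(int)

f = bits.mean()
h_emp = -f * np.log2(f) - (1 - f) * np.log2(1 - f)
print("KT codelength          :", round(kt_codelength(bits), 1))
print("n h(freq) + 0.5 log2 n :", round(n * h_emp + 0.5 * np.log2(n), 1))
```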

Connection between universal compression and sequential prediction: recalling the setup of sequential prediction, at each step the predictor is penalized by the negative log-likelihood (whose minimizer is the true distribution). From learning theory, the natural metric for this problem (the complexity of a class \(\Pi\)) is the regret \[ \sup_{P\in \Pi} \mathbb E\sum_{t=1}^n \log \dfrac 1 {\tilde P_t(X_t; X_1^{t-1})} - \log \dfrac 1 {P_{X_t|X_1^{t-1}}(X_t|X_1^{t-1})} \] If some estimator achieves \(o(n)\) for this quantity, then the class is learnable, since the estimator is asymptotically as good as the oracle. It turns out that \[ \inf_{\tilde P} \text{regret} = R_n^*(\Pi). \] Relation to in-context learning: the training phase learns a representation of the different generating procedures (e.g. the Jeffreys prior), and the context selects the posterior that specializes to the suitable data-generating process.

Every regret-minimizing estimator is an in-context learner. This is related to a theorem for density estimation (the Yang-Barron bound): given \(X_1, \cdots, X_n\sim P\in \Pi\) i.i.d., we wish to find \(\tilde P\) which estimates \(P\) well.

Theorem 12.4 For every i.i.d. class \(\Pi\) as above, there exists an estimator \(\tilde P\) (itself random, as a function of the data) such that \[ \mathbb E\,\mathrm{TV}^2(P, \tilde P) \leq \mathbb E\, D(P\|\tilde P) \leq \dfrac{R_{n+1}^*(\Pi)}{n+1} \]

This is a very insightful theorem since the right-hand side quantity can be upper-bounded by the redundancy theorem. Universal data compression (information) reduces to the problem of universal probability assignment, which is simultaneously the core of density estimation (statistics) and sequential prediction (learning).

Minimum description length principle: the more likely class of data-generating processes is the one whose universal compressor compresses the data down to a smaller length; this is equivalent to simple-vs-simple hypothesis testing after collapsing each class down to its universal prior.

Binary hypothesis testing

We are interested in simple vs. simple hypothesis testing, i.e. distinguishing between \(\theta_1\neq \theta_2\in \Theta\) (Neyman-Pearson); equivalently, distinguishing between \(P_{X^n}\) and \(Q_{X^n}\). Our goal is to design a binary estimator \(Z=Z(X^n)\in \{0, 1\}\), with \(0\) denoting \(X^n\sim P\) and \(1\) denoting \(X^n\sim Q\) (the first serious treatment originated with radars detecting incoming planes). Two associated quantities:

  1. \(\alpha = P[Z=0]\): probability of success under \(P\).
  2. \(\beta = Q[Z=0]\): probability of failure under \(Q\).

Theorem 12.5 (Stein's lemma) For \(X^n \sim P^{\otimes n}\) vs. \(Q^{\otimes n}\): among all tests with \(\alpha \geq 1 - \epsilon\), the smallest possible \(\beta = e^{-nD(P\|Q) + o_\epsilon(n)}\).

Several comments:

  1. This theorem is not symmetric w.r.t. \(P\leftrightarrow Q\).
  2. The dependence upon \(\epsilon\) is captured in \(o_\epsilon(n)\).

To reason about binary hypothesis testing, define the joint range over all (randomized) tests \(Z\): \[ \mathcal R(P, Q) = \{(P[Z=0], Q[Z=0])\} \]

Proposition 12.4 \(\mathcal R\subset [0, 1]^2\) is closed, convex, symmetric w.r.t. \((1/2, 1/2)\), and \[ \mathcal R = \overline{\mathrm{co}\, \mathcal R_{\mathrm{det}}(P, Q) } \] where \(\mathcal R_{\mathrm{det}}\) denotes the set of deterministic joint ranges.

As a corollary, the problem can be simplified by noting that \[ \mathcal R_{\mathrm{det}} = \{(P[E], Q[E]): E\subset \mathcal X\} \] For the Bernoulli case, \(E\) takes values in \(\{\emptyset, \{0\}, \{1\}, \mathcal X\}\). The first and last cases correspond to \((0, 0)\) and \((1, 1)\).

Perspective on the proof of the Neyman-Pearson lemma: to understand \(\mathcal R_{\mathrm{det}}\), it suffices to understand its extremal points, which correspond to the problem \[ \sup_Z \underbrace{P[Z=0]}_{=\alpha} - e^\gamma\, \underbrace{Q[Z=0]}_{=\beta} \] The parameterization \(e^\gamma\) simply ensures positivity of the slope. We “scan” over the supporting hyperplanes with varying offset. The upshot is that BHT boils down to computing \(P[\log \frac{dP}{dQ} > \gamma]\) and \(Q[\log \frac{dP}{dQ} < \gamma]\) for a range of \(\gamma\)’s.

The “fatter” this joint range is, the more distinguishable the two distributions are; identical distributions correspond to the diagonal \(\alpha=\beta\), which has zero area. This motivates the following result:

Proposition 12.5 \(d(\alpha \| \beta) \leq D(P\|Q)\) for \((\alpha, \beta) \in \mathcal R\). This results from applying the DPI to the estimator \(Z\).
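
A numeric check of the proposition (a sketch; \(P = \mathrm{Ber}(0.8)\), \(Q = \mathrm{Ber}(0.3)\) are arbitrary choices), sweeping randomized mixtures of the event \(E = \{X=1\}\) with the full space: the DPI holds, with equality at \(E = \{X=1\}\), which recovers \(X\) exactly.

```python
import math

def d(a, b):  # binary KL divergence d(a||b), in nats
    def term(x, y):
        return 0.0 if x == 0 else x * math.log(x / y)
    return term(a, b) + term(1 - a, 1 - b)

p, q = 0.8, 0.3
D = d(p, q)
for lam in [0.0, 0.25, 0.5, 0.75, 1.0]:
    alpha = lam * p + (1 - lam)   # mix the event {X=1} with the full space
    beta = lam * q + (1 - lam)
    print(f"alpha={alpha:.2f}, beta={beta:.2f}: "
          f"d(alpha||beta)={d(alpha, beta):.4f} <= D(P||Q)={D:.4f}")
```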

Nov 4. Channel coding

Agenda:

  1. Finish Sanov’s theorem, large deviation theory, and I-projection.
  2. Define error-correction codes; example using BSC.
  3. Weak converse bound.

BHT, large deviations theory

To properly understand hypothesis testing, we need to understand the large-deviations question: to compute the error exponents \(E_0, E_1\), we need to compute \(E(\gamma)\) in \[ \mathrm{Pr}[\bar T < \gamma] \sim e^{-nE(\gamma)} \] where \(\bar T\) is the normalized log-likelihood ratio.

Theorem 12.6 (Sanov principle) Under regularity conditions on \(\mathcal E\), we have \(\mathrm{Pr}[\hat P_n\in \mathcal E] = e^{-nE + o(n)}\), where \[ E = \inf_{R\in \mathcal E} D(R\|P) \]

Extremely important proof below!

Lemma 12.2 If \(\mathcal X\) is finite and \(\mathcal E\) is convex with nonempty interior, then Sanov holds.

First consider the upper bound \(\mathrm{Pr}[\hat P_n\in \mathcal E] \leq e^{-nE}\). With \(P_{X^n} = P^{\otimes n}\), define \(\tilde P_{X^n} = P_{X^n | E_n}\), the conditional distribution given the event \(E_n = \{\hat P_n\in\mathcal E\}\). Note that the KL divergence \[ D(\tilde P_{X^n} \| P_{X^n}) = \mathbb E_{\tilde P_{X^n}} \log \dfrac{P_{X^n|E_n}(X^n)}{P_{X^n}(X^n)} = \log \dfrac 1 {P_{X^n}(E_n)} \] This is the extension of Theorem 11.9. This implies a lot of bounds! Chernoff, Bernstein, Bennett, Okamoto, etc. In particular, \[ \forall k \geq np, \quad \mathrm{Pr}[\mathrm{Bin}(n, p)\geq k] \leq \exp \left[-n\, d(k/n\|p)\right] \]
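
A quick check of the binomial tail bound (a sketch; the values of \(n, p, k\) are arbitrary):

```python
import math

def d(a, b):  # binary KL divergence in nats
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

n, p = 100, 0.3
for k in [40, 50, 60]:
    tail = sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))
    bound = math.exp(-n * d(k / n, p))
    print(f"k={k}: Pr[Bin(n,p) >= k] = {tail:.3e} <= {bound:.3e}")
```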

Idea to proving lower bounds: prove for the easy case, then lift to the general case using information measures.

To prove the lower bound: if \(P\in \mathrm{int}(\mathcal E)\), then the law of large numbers yields \(\mathrm{Pr}[\hat P_n\in \mathcal E]\to 1\). Otherwise, take \(R\in \mathrm{int}(\mathcal E)\); then \(R^{\otimes n}[\hat P_n\in \mathcal E]\to 1\). Apply the DPI to the indicator \(1_{E_n}\).

I-projections

Take \(X\in \mathbb R^d\) with \(X\sim P\); define the information projection by \[ I(\gamma) = \inf_{R:\ \mathbb E_R[X]=\gamma} D(R\|P) \] For \(\lambda \in \mathbb R^d\), the tilted distribution is \(P_\lambda(dx) = e^{\lambda^T x - \psi(\lambda)}\, P(dx)\), where \(\psi(\lambda) = \log \mathbb E_P e^{\lambda^T X}\).

Proposition 12.6 (tilting is optimal if exists) Suppose there exists \(\lambda \in \mathbb R^d\) such that \(\mathbb E_{X\sim P_\lambda}[X]=\gamma\), then for every \(R\) satisfying \(\mathbb E_R[X]=\gamma\), we have \[ D(R\|P) = D(R\|P_\lambda) + D(P_\lambda \| P) \geq D(P_\lambda \| P). \]

Proof: write \(D(R\|P) = \mathbb E_R \left[\log \dfrac{dR}{dP_\lambda} + \log \dfrac{dP_\lambda}{dP}\right]\) and regroup. Recognize \(\mathbb E_R \log \frac{dP_\lambda}{dP} = \mathbb E_R[\lambda^T X - \psi(\lambda)] = \lambda^T \gamma - \psi(\lambda) = D(P_\lambda\|P)\), using \(\mathbb E_R[X] = \mathbb E_{P_\lambda}[X] = \gamma\).

Theorem 12.7 (I-projection on a hyperplane)

  1. If \(\exists R: \mathbb E_R[X]=\gamma\) and \(D(R\|P)<\infty\), then \(\gamma\in \mathrm{csupp}(P)\).
  2. \(\exists P_\lambda\) with \(\mathbb E_{P_\lambda}[X]=\gamma \iff \gamma\in \mathrm{int}(\mathrm{csupp}\, P)\).

Here the convex support of \(P\) is the intersection of all sets \(S\) such that \(P[S]=1\) and \(S\) is closed and convex; to visualize this, imagine the mixture of two disjointly supported distributions.

In other words, the information projection is solvable iff \(\gamma\) is in the convex support, and it is solvable by tilting iff \(\gamma\) is in the interior.

Theorem 12.8 (tradeoff in error exponents) \(E_1(E_0) = \inf_{D(R\|P)<E_0} D(R\|Q)\).

Relation to Wald’s SPRT: slightly modify the problem formulation by committing to an expected instead of exact number of samples. For composite vs. composite tests, Hellinger computations become important.

Hypothesis testing is still a thriving topic! Instead of specifying the exact generating distributions \(P, Q\), only samples are provided (likelihood-free inference).

Channel coding

We started with information measures; we then found that compression is governed by entropy, and that the optimal error decay in hypothesis testing is given by divergence. The problem of channel coding is solved by mutual information.

Given a channel \(P_{Y|X}:\mathcal X\to \mathcal Y\), we wish to construct \(f:[M]\to \mathcal X\) and \(g:\mathcal Y\to [M]\cup \{e\}\) such that the maximal error rate (across all \(j\)) \(\leq \epsilon\). For generality, allow randomized \(f, g\). \[ W\xrightarrow f X\xrightarrow{P_{Y|X}} Y\xrightarrow g \hat W \] This is a maximal probability of error code, in comparison to average probability of error codes.

  1. If \(f\) is given, the design of \(g\) is essentially \(M\)-ary hypothesis testing. For \(M>2\) there is no uniquely optimal criterion. However, if we postulate \(W\sim \mathrm{Unif}[M]\), then the optimal solution is given by the maximum-likelihood decoder.

Note that the maximum-likelihood decoder tessellates the decoder-input space according to the 1-nearest-neighbor rule.

In information theory, almost all of the work is designing a great distribution.

Nov 6, Channel coding II

  1. \(M^*(\epsilon)\).
  2. Define a channel (again); fundamental limit of finite-blocklength coding, and capacity.
  3. Weak converse.

Recall our definition of a channel last time:

  1. Channel \(P_{Y|X}: \mathcal X\to \mathcal Y\).
  2. Encoder \([M]\to \mathcal X\), decoder \(\mathcal Y\to [M]\cup \{e\}\).

Graphically, this can be represented as \[ W\xrightarrow f X \xrightarrow {P_{Y|X}} Y\xrightarrow g \hat W \]

We need to put some assumptions on \(W\), so let’s take \(W\sim \mathrm{Unif}[M]\). Overwhelmingly, the encoder \(P_{X|W}\) will be deterministic, and most of the time \(P_{\hat W|Y}\) is deterministic; one does sometimes need to randomize when minimizing the minimax risk, however.

Definition 12.1 (error criteria) Consider two types of the probability of error reminiscent of the case for risk in statistical estimation:

  1. Average probability of error: \(P_{e, \mathrm{avg}} = \mathrm{Pr}[\hat W\neq W]\).
  2. Maximum probability of error: \(P_{e, \mathrm{max}} = \max_{j\in [M]} \mathrm{Pr}[\hat W\neq j|W=j]\).

We say that \((f, g)\) is an \((M, \epsilon)\)-code if \(P_{e, \mathrm{avg}}\leq \epsilon\). It is an \((M, \epsilon)_{\mathrm{max}}\) code if \(P_{e, \mathrm{max}}\leq \epsilon\).

Definition 12.2 (single-shot fundamental limit) Fixing a channel, \(M^*(\epsilon)\) is the maximum \(M\) such that there exists an \((M, \epsilon)\)-code. The max-error limit \(M^*_{\mathrm{max}}(\epsilon)\) is defined similarly.

Here \(M^*(\epsilon)\) is related to the number of messages one can reliably separate. In general, this is a computationally intractable quantity, but we will be able to bound it quite tightly.

Recall from last time that, given \(f\), the optimal \(g\) minimizing \(P_{e, \mathrm{avg}}\) is the ML decoder. This follows from ML coinciding with MAP under the uniform prior.

  1. If \(f, g\) are deterministic, then \(f(j)=c_j\in \mathcal X\) is a codeword. An encoder can equivalently be viewed as a collection of codewords.
  2. The pre-images \(g^{-1}(j)=D_j\subset \mathcal Y\) are the decoding regions.

When \(\mathcal X=\mathcal Y\), one should think of codewords as little singleton “points” in space, while the decoding regions partition the space in a way that best separates the codewords, so as to be robust to noise perturbations.

Putting two codewords in space is equivalent to binary hypothesis testing; if we require \(\epsilon\to 0\), then \(M^*(\epsilon)\) shrinks down toward the zero-error quantity \(M^*(0)\).

Theorem 12.9 (Fano's bound) \(\log M^*_{\mathrm{max}}(\epsilon) \leq \log M^*(\epsilon) \leq \dfrac{\sup_X I(X; Y) + \log 2}{1-\epsilon}\).

Proof: Again, to prove impossibility bounds, begin with the easy case. Recall \(W\to X\to Y\to \hat W\); consider an auxiliary channel \(Q_{Y|X}=Q_Y\) which does not look at its input. Then \[ Q[W=\hat W] = \dfrac 1 M \sum_j Q[\hat W = j] = \dfrac 1 M \] so under the \(Q\)-channel, \(1-\epsilon_Q = \dfrac 1 M\). Apply the DPI to lift the statement about the \(Q\)-channel to the \(P\)-channel: process \((W, X, Y, \hat W)\) with \(1_{W=\hat W}\) to obtain \(\mathrm{Ber}(1/M)\) and \(\mathrm{Ber}(1-\epsilon)\). Then \[ D(P_{WXY\hat W} \| Q_{WXY\hat W}) \geq d(1-\epsilon \| 1/M) \] Noting the Markov structure, apply the chain rule to obtain \[\begin{align} D(P_{WXY\hat W} \| Q_{WXY\hat W}) &= \underbrace{D(P_W\|Q_W)}_{=0} + \underbrace{D(P_{X|W} \| Q_{X|W} | P_W)}_{=0} + D(P_{Y|X} \| Q_Y | P_X) + \underbrace{D(P_{\hat W|Y} \| Q_{\hat W|Y} | P_Y)}_{=0} \end{align}\] The zeroed terms are due to us only replacing the \(X\to Y\) part of the Markov chain. Then, \(\forall Q_Y\), we obtain \[ (1-\epsilon)\log M \leq D(P_{Y|X} \| Q_Y | P_X) + \log 2 \] Take the \(\inf\) over \(Q_Y\) and apply the golden formula to obtain mutual information. Note that \(P_X\) (the pushforward of \(\mathrm{Unif}[M]\) under \(f\)) is usually hard to compute, so take \(\sup_{P_X}\) to obtain the desired inequality.

In information theory, a second definition of a channel is a triplet consisting of (1) (single-letter) alphabet \(\mathcal A\), (2) output alphabet \(\mathcal B\), with (3) a family of Markov kernels \(P_{B^n|A^n}\) for each \(n\geq 1\). In this sense a channel is interpreted as a sequence of Markov kernels, and \(n\) is the block length.

Two examples of additive channels:

  1. Gaussian channel: \(\mathcal A=\mathcal B=\mathbb R\), and \(B^n = A^n + Z^n\) with \(Z^n \sim \mathcal N(0, \sigma^2 I_n)\).
  2. Binary symmetric channel \(\mathrm{BSC}_\delta\): \(\mathcal A=\mathcal B=\{0, 1\}\), and \(P_{B^n | A^n}(b^n|a^n) = \delta^d(1-\delta)^{n-d}\), where \(d=d_H(a^n, b^n)\) is the Hamming distance between the two strings. Equivalently, \(B^n = A^n + Z^n \mod 2\) with \(Z_j\sim \mathrm{Ber}(\delta)\) i.i.d.

More generally, a discrete stationary memoryless channel (DMC) is just a tensorized channel which acts i.i.d. on the input components.

Definition 12.3 (fundamental limit of channel coding; capacity) For any channel, an \((M, \epsilon)\) code for \(P_{B^n|A^n}\) is called an \((n, M, \epsilon)\) code. The capacity \(C\) of the channel is defined as \[\begin{align} C_\epsilon &= \liminf_{n\to \infty} \dfrac 1 n \log M^*(n, \epsilon), \quad C = \lim_{\epsilon \to 0^+} C_\epsilon. \end{align}\]

Proposition 12.7 (almost no difference between max v.s. average) \(\forall \tau \in (0, 1)\) we have \[ (1-\tau)M^*(n, \tau \epsilon) \leq M^*_{\mathrm{max}}(n, \epsilon) \leq M^*(n, \epsilon) \]

Proof: The second inequality is trivial. For the first, take an \((n, M, \tilde \epsilon)\) code and define \(\lambda_j = \mathrm{Pr}[\hat W\neq j|W=j]\); then \[ \dfrac 1 M \sum_j \lambda_j \leq \tilde \epsilon \implies \#\{j : \lambda_j>\tilde \epsilon/\tau\} < \tau M \] For every average-error code, we can apply the idea behind Markov’s inequality to select a max-error sub-code.

Corollary 12.2 \(C_\epsilon \geq C_\epsilon^{\mathrm{max}} \geq \lim_{\delta \to 0^+} C_{\epsilon - \delta} \implies C^{\mathrm{max}} = C\).

One perspective on \(C\) is that it is the maximal rate of asymptotically reliable communication. In other words, it is the supremum over all rates \(R\) such that there exists a sequence of \(\delta_n, \epsilon_n\to 0\), \((n, e^{n(R-\delta_n)}, \epsilon_n)\)-codes.

Proof: exercise??!

Definition 12.4 (information capacity) The information capacity \(C^{(i)}\) of a channel is \[ C^{(i)} = \liminf_{n\to \infty} \dfrac 1 n \sup_{P_{A^n}} I(A^n; B^n) \] Compared to the actual capacity (integer programming problem), this is easy!

Proposition 12.8 (capacity of BSC) For \(\mathrm{BSC}_\delta\), we obtain \(C^{(i)} = \log 2 - h(\delta)\).

Proof: upper-bound by tensorization of mutual information.

Theorem 12.10 (relation between information and actual capacity) Fano’s bound implies
\(C_\epsilon \leq \dfrac{C^{(i)}}{1-\epsilon} \implies C\leq C^{(i)}\).

For the BSC channel, the optimal decoder \(g_{\mathrm{ML}}\) just looks for the codeword which is the fewest number of bit flips away from the received string. Approaching this from a different perspective, the average number of flips in transmitting \(c_1\in \{0, 1\}^n\) is \(n\delta\), and the concentration becomes sharp as \(n\to \infty\). This motivates the bounded-distance decoder \[ g_{\mathrm{BD}}(y) = \begin{cases} j & \text{if $c_j$ is the unique codeword in $B_r(y)$} \\ e & \text{otherwise} \end{cases} \] Shannon had the idea to construct random codewords \(c_1, \dots, c_M\) uniformly from the Hamming cube. Then \[\begin{align} \mathbb E\,\epsilon_2(c_1, \dots, c_M) &\leq \sum_{j=2}^M \mathrm{Pr}[d_H(c_j, Y) \leq r | X=c_1] = (M-1)\, \mathrm{Pr}[d_H(c_2, Y) \leq r] \\ &= \dfrac{M-1}{2^n} | B_r(0)| \end{align}\] Note crucially that \(c_j \perp\!\!\!\perp Y\) for \(j\geq 2\). The second error event \(\epsilon_2\) (another codeword lands in the ball) vanishes if \(M\ll 2^{nC^{(i)}}\). This implies, for the BSC, \(C = C^{(i)} = \log 2 - h(\delta)\).
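
A numeric illustration of the ball-volume count driving the random-coding argument (a sketch; \(\delta = 0.11\), \(n = 1000\) arbitrary): \(\frac 1n\log_2 |B_r| \approx h(r/n)\), so \(\mathbb E\,\epsilon_2\to 0\) whenever the rate is below \(1 - h(\delta)\) bits.

```python
import math

def h2(x):  # binary entropy in bits
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

delta, n = 0.11, 1000
r = int(1.05 * delta * n)  # decode radius slightly above the typical n*delta

log_ball = math.log2(sum(math.comb(n, j) for j in range(r + 1)))
print(f"(1/n) log2 |B_r(0)| = {log_ball / n:.3f}   (h2(r/n) = {h2(r / n):.3f})")
print(f"capacity 1 - h2(delta) = {1 - h2(delta):.3f} bits/channel use")
# E[eps_2] <= (M-1)|B_r|/2^n -> 0 iff (1/n) log2 M < 1 - h2(r/n).
```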

Nov 18, Channel Coding III

Agenda:

  1. Finish DT bound.
  2. DT bound with input constraints.
  3. General channels with input constraints.
  4. Capacity of Gaussian channels.

TODO for notes:

  1. Equivalence between maximum and average error.
  2. DT bound.

Other questions:

  1. More intuition for the strong v.s. weak converse.
  2. Explain \(\liminf\), also why is \(C_0\) a combinatorial problem, different from \(C_{\to 0}\).
  3. Wasserstein distance.

Pset questions:

  1. Pointwise maximization.
  2. Existence of the extension.
  3. Idea for applying the meta-converse.

Recall the information density: given a joint distribution \(P_{XY}\), the information density is \[ i(x; y) = \log \dfrac{dP_{XY}}{d(P_X\times P_Y)}(x, y) = \log \dfrac{dP_{Y|X=x}}{dP_Y}(y) \]

For the BSC, the channel only gives half the quantities required for computing the information density (namely \(P_{Y^n|X^n}\)); we still need the input distribution. Take \(P_{X^n}\) to be uniform on the Hamming cube; then \[ i(X^n; Y^n) = \log \dfrac{P_{Y^n|X^n}}{P_{Y^n}} = \log \dfrac{\delta^{d_H} \bar \delta^{\,n-d_H}}{2^{-n}} = n\log (2\bar \delta) - d_H \log \dfrac{\bar \delta}{\delta} \] This makes explicit the connection that the ML decoder is equivalent to minimizing the Hamming distance \(d_H\).

  • For memoryless channels, the information density tensorizes.
  • This allows us to apply LLN (or large-deviations theory).

We can generalize beyond stationary memoryless channels, since information stability (tensorization and concentration about the average) is sufficient to prove a capacity theorem.

Theorem 12.11 (random coding DT bound) \(\forall P_{Y|X}\) and \(P_X, M\geq 1\), there exists an \((M, \epsilon)\)-code such that \[ \epsilon \leq \mathbb E\exp\left( -\max\left[0,\, i(X; Y) - \log \dfrac{M-1}{2}\right] \right) \]

Proof: Treating the codebook as a random variable, a decoding error of message \(j\) can happen if:

  1. \(i(C_j; Y) \leq \gamma\): the info density of the true codeword is not enough. In BSC, this is probability that Hamming distance between true message and received message is larger than the threshold.
  2. There exists another codeword, preceding \(j\), whose information density with \(Y\) exceeds \(\gamma\). Since we are taking the average over random codebooks, this term becomes tractable. In the BSC, this corresponds to some other codeword lying in the thresholded ball.

Proceeding to bound the error, \[ P_e \leq \sum_{x, y} \left[ P_{XY}(x, y)\, 1 \left\{ P_{XY} \leq e^\gamma P_{\bar X Y} \right\} + \dfrac{M-1}{2} P_{\bar XY}(x, y)\, 1\left\{ P_{XY} > e^\gamma P_{\bar XY} \right\} \right] \] where \(P_{\bar XY} = P_X \otimes P_Y\). To minimize this, the optimal selection is \(\gamma = \log \dfrac{M-1}{2}\); this construction of events selects the minimum of the two contributions pointwise.

Perspective on information theory: when proving results, complexity does not matter. When results are significant, someone will find a computationally feasible way : )

Coding with input constraints

The capacity of the AWGN channel without any input constraint is \(\infty\), since we can space the inputs far enough apart that the added noise causes negligible overlap.

The power constraint \(\|C\|^2 \leq nP\) yields the finite capacity \(C=\dfrac 1 2 \log(1+P)\).

We wish to find a codebook \(c_1, \dots, c_M\) such that \(P_e \leq \epsilon\) and \(\|c_j\|^2 \leq nP\). In short, we have to place our codewords in a ball of radius \(\sqrt{nP}\). Each codeword gets smeared into a ball of radius \(\|\mathcal N(0, I_n)\| \approx \sqrt n\). The output will land with high probability within the ball of radius \[ \sqrt{\mathbb E\|C + \mathcal N(0, I_n)\|^2} \approx \sqrt{n(1+P)} \] How many balls of radius \(\sqrt n\) can we pack inside \(\sqrt{n(1+P)}\, B^n\)? The volume ratio gives the upper bound \((1+P)^{n/2}\).

Definition 12.5 (channel with constraints) Given a channel and a cost \(c:\mathcal X\to \mathbb R_+\), an encoder-decoder pair \((f, g)\) is an \((n, M, \epsilon, P)\) code if \[ \dfrac 1 M \sum_j \mathrm{Pr}[g(Y)\neq j|X=f(j)] \leq \epsilon, \quad \forall j:\ c_n(f(j)) \leq nP \] Note that we are amortizing the error w.r.t. the blocklength and the randomness of the encoder, but the cost constraint must be obeyed uniformly by every codeword.

For AWGN, \(c(x)=x^2\). Further, for memoryless channels, the constrained capacity is \[ C^{(I)}(P) = \sup_{\mathbb Ec(X_1)\leq P} I(X_1; Y_1) \]

Proposition 12.9 (properties of capacity-cost function)

  1. \(P\mapsto C^{(I)}(P)\) is concave, increasing, and continuous on interior.
  2. \(C(P)\leq C^{(I)}(P)\).

Proof: Concavity follows by averaging the capacity-achieving distributions. The second part follows from applying the weak converse: take an \((n, M, \epsilon, P)\)-code and run the Fano argument as before.

Nov 20. Quantization

Agenda:

  1. Finish capacity of Gaussian channel with cost
  2. Quantization: scalar case
  3. Definition of vector case

Channel coding with constraint:

\[ C^{(I)}(P) = \liminf_{n\to \infty} \dfrac 1 n \sup_{ \mathbb E\left[\frac 1 n \sum_t c(X_t)\right] \leq P } I(X^n; Y^n) \] By (1) the concavity of the capacity-cost function and (2) tensorization of MI, we can single-letterize both the extremized quantity and the constraint.

Due to the concavity of the capacity-cost function, maximum communication per unit cost is achieved at vanishing rate, since \(C^{(I)}(P) / P\) is maximized as \(P\to 0\).

Theorem 12.12 (Fano's inequality under constraint) \(C(P) \leq C^{(I)}(P)\).

The information stability criterion \[ \dfrac 1 n i(X^n; Y^n)\xrightarrow{\mathbb P} C^{(I)}(P) \] is the analogue of the law of large numbers for non-stationary channels.

For AWGN, we have \(C^{(I)}(P) = \dfrac 1 2 \log(1+P)\), and recalling our previous discussion, \[ \sup_P \dfrac{\frac 1 2 \log(1+P)}{P} = \dfrac{\log e}{2} \dfrac{\mathrm{bits}}{\mathrm{Joule}} \] (equivalently, the famous \(-1.59\) dB minimum energy per bit), a fundamental limit in deep-space communication.

Water-filling

Consider a non-stationary memoryless Gaussian channel \[ Y_t = X_t + Z_t, \quad Z_t \sim \mathcal N(0, \sigma_t^2) \] Using tensorization, we obtain \[ I(X^n; Y^n) \leq \sum_t I(X_t; X_t + Z_t) \] with the inequality saturated by independent Gaussians. Fixing the power allocation \(P_t\) for each letter, we obtain \[ I(X^n; Y^n) \leq \dfrac 1 2 \sum_{t=1}^n \log \left( 1 + \dfrac{P_t}{\sigma_t^2} \right) \text{ subject to } \sum_t P_t \leq n P \] This is a convex problem. Using Lagrange multipliers, we obtain the water-filling solution \[ P_t = |T - \sigma_t^2|_+, \quad \text{where the water level } T \text{ is chosen so that } \sum_t P_t = nP. \]
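
A sketch of the water-filling computation by bisection on the water level \(T\) (the noise profile below is a hypothetical example):

```python
import numpy as np

sigma2 = np.array([0.5, 1.0, 2.0, 4.0])  # hypothetical per-letter noise powers
P = 2.0                                  # power budget per letter

def used(T):  # total power allocated at water level T
    return np.sum(np.maximum(T - sigma2, 0.0))

lo, hi = 0.0, sigma2.max() + len(sigma2) * P
for _ in range(100):  # bisect: used(T) is increasing in T
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if used(mid) < len(sigma2) * P else (lo, mid)

T = (lo + hi) / 2
P_t = np.maximum(T - sigma2, 0.0)
print("water level T :", T)
print("allocations   :", P_t)  # the noisiest channel may get nothing
print("capacity/use  :", 0.5 * np.mean(np.log(1 + P_t / sigma2)), "nats")
```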

Consider a non-white Gaussian noise channel \(Y_t = X_t + Z_t\), where \(Z_1, \dots, Z_n, \dots\) is a zero-mean stationary Gaussian process, which means:

  1. \(\mathbb EZ_t^2 = \sigma^2, \forall t\).
  2. \(\mathbb E[Z_t Z_{t+k}] = \psi(k)\) is a function of \(k\) only. The auto-correlation can be represented as a Fourier transform \[ \psi(k) = \dfrac 1 {2\pi} \int_{-\pi}^\pi \cos(k\omega) f_z(\omega)\, d\omega \] In this case \(f_z\) is the power spectral density (PSD) of noise.

To compute the capacity, note that we cannot apply tensorization anymore since the channel is not memoryless. The main idea is to diagonalize the covariance matrix by rotating the input. Effectively, we whiten the channel noise by jointly transforming the input and channel.

It turns out that the eigenvalues of the covariance matrix \((\Sigma_n)_{kl} = \psi(k - l)\) have very nice structure, since the covariance matrix is a Toeplitz matrix.

In short, we obtain \[ C(P) = \dfrac 1 2 \int_{-\pi}^\pi \log_+ \dfrac{T}{f_z(\omega)}\, d\omega, \quad \int_{-\pi}^\pi |T - f_z(\omega)|_+\, d\omega = P \]

Metric entropy

Consider a Hamming game:

  1. Player A writes down 100 bits distributed i.i.d.
  2. Player B looks at the 100 bits, then writes down 50 bits. The original 100 bits are erased.
  3. Player C looks at the written 50 bits, then tries to reconstruct the 100 bits.

Approaches to solving this problem:

  1. Record 50 bits; this yields 25% error (half the bits are exact, the other half are pure guesses).
  2. Compress the 100 bits.
  3. The optimal answer is 11% error.

The key idea is that, due to the asymptotic equipartition property, we can restrict attention to typical sequences. The main idea is to cover the Hamming cube \(\{0, 1\}^{100}\) using \(2^{50}\) Hamming balls with radius as small as possible. The decoder then looks at the indexed ball and outputs its center.
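
Numerically, with rate \(R = 50/100\) the best achievable bit-error fraction \(d\) solves \(h(d) = 1 - R\) (a minimal sketch):

```python
import math

def h2(x):  # binary entropy in bits
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

R = 0.5
lo, hi = 1e-9, 0.5
for _ in range(200):  # bisection: h2 is increasing on (0, 1/2)
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if h2(mid) < 1 - R else (lo, mid)
print(f"h(d) = 1 - R  =>  d ~= {(lo + hi) / 2:.4f}")  # ~0.11
```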

Nov 25: Rate-distortion theorem

Agenda:

  1. Scalar quantization.
  2. Define \(R(D)\).
  3. Random coding bound.
  4. (Next time): asymptotics.

Scalar quantization

Two changes of perspective in transmitting continuous signals:

  1. PCM (A. Reeves): Binning of continuous numbers.
  2. Sampling (Nyquist): without loss of generality we can consider discrete signals, i.e. discretizing the time axis.

Quantization is the replacement of \(\mathbb R\)-valued objects with discrete ones. Given a signal \(X_1^n = [X_1, \dots, X_n]\in \mathbb R^n\), map it coordinate-wise to \(f_S(X_1^n)=\hat X_1^n \in \{c_1, \dots, c_M\}^n\subset \mathbb R^n\). We wish to minimize \[ \mathbb E|X - \hat X|^2 \]

  1. First idea: split into quantiles, this maximizes the entropy of the codeword. However, this turns out not to be the best idea: why?
  2. Second idea: uniform quantization; just split the compact domain into equal-length intervals.

For uniform quantization, in the limit \(M\to \infty\) the conditional distribution of \(X\) given \(X\in \Delta_j\) is roughly uniform. Let \(|\Delta_j| = \dfrac{2A}{M}, M=2^R\); we obtain \[\begin{align} \mathbb E|X - \hat X|^2 &= \sum_j \mathrm{Pr}[X\in \Delta_j]\, \mathbb E\left[|X - \hat X|^2 \,\middle|\, X\in \Delta_j\right] \\ &= \sum_{j=1}^M \mathrm{Pr}[X\in \Delta_j] \dfrac{A^2} {3M^2} = \dfrac{A^2}{3M^2} = \dfrac{A^2}{3} 2^{-2R} \end{align}\] An improvement of 1 bit in quantization rate leads to a 4-fold drop (6 dB) in MSE.
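
A Monte Carlo check of this formula (a sketch; for \(X\sim \mathrm{Unif}[-A, A]\) the conditional-uniform heuristic is exact):

```python
import numpy as np

A, R = 1.0, 6
M = 2 ** R
rng = np.random.default_rng(0)
x = rng.uniform(-A, A, size=1_000_000)

cell = np.floor((x + A) / (2 * A / M))   # cell index in 0..M-1
xhat = -A + (cell + 0.5) * (2 * A / M)   # reproduce the cell midpoint
print("empirical MSE  :", np.mean((x - xhat) ** 2))
print("A^2/3 * 2^(-2R):", A**2 / 3 * 2.0 ** (-2 * R))
```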

Panter-Dite ’51 asked the question: what is the best possible way to arrange \(c_j\)’s? Again, the discrete combinatorial problem is hard: even for \(M=4\) and Gaussian the solution is unknown.

  • Fixing the codepoints \(c_j\), the optimal partition is easy: just map each point to the nearest \(c_j\)!
  • Fixing the cells \(\Delta_j\), the optimal \(c_j\) are the centroids, i.e. the conditional means \(c_j = \mathbb E[X | X\in \Delta_j]\).
    • Note that, after fixing the \(\Delta_j\)’s and computing the \(c_j\)’s, the quantization intervals \(\Delta_j\) are no longer optimal. This leads to Lloyd-Max (see the sketch below): start with \(c_j\)’s, compute \(\Delta_j\)’s, then recompute \(c_j\)’s, and repeat, each time recomputing the \(\Delta_j\)’s so that their boundaries are equidistant from neighboring \(c_j\)’s.
    • Lloyd-Max yields a (not-necessarily optimal) CVT (Centroidal Voronoi Tessellation); the optimal quantizer is always a CVT.
  • One key idea is to consider the point density \(\lambda\geq 0\) with \(\int \lambda = 1\).
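
A minimal Lloyd-Max sketch for a standard Gaussian source with \(M = 4\) (sample-based; the numerically known optimal levels are about \(\pm 0.453, \pm 1.510\) with MSE \(\approx 0.1175\)):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.normal(size=200_000))  # samples standing in for p_X
c = np.linspace(-2.0, 2.0, 4)          # M = 4 initial codepoints

for _ in range(100):
    edges = (c[:-1] + c[1:]) / 2                          # nearest-neighbor cells
    idx = np.searchsorted(edges, x)                       # assign samples to cells
    c = np.array([x[idx == j].mean() for j in range(4)])  # move points to centroids

idx = np.searchsorted((c[:-1] + c[1:]) / 2, x)
print("codepoints:", np.round(c, 3))
print("MSE       :", np.mean((x - c[idx]) ** 2))
```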

In the continuum limit, fixing a point density \(\lambda(x)\), the codepoints \(c_j\) are determined by \[\begin{align} \int_{-\infty}^{c_j} \lambda(x)\, dx &= \dfrac j M \\ |\Delta_j| \cdot \lambda(c_j) &\approx \dfrac 1 M \implies |\Delta_j| = \dfrac 1 {M \lambda(c_j)} \\ \mathbb E|X - \hat X|^2 &=\sum_j \mathrm{Pr}[X\in \Delta_j]\, \mathbb E\left[|X - c_j|^2 \,\middle|\, X\in \Delta_j\right] \\ &\approx \sum_j \mathrm{Pr}[X \in \Delta_j] \dfrac{|\Delta_j|^2}{12} = \sum_j \mathrm{Pr}[X \in \Delta_j] \dfrac 1 {12M^2} \dfrac 1 {\lambda^2(c_j)} \\ &\approx \dfrac 1 {12M^2} \int p_X(x) \dfrac 1 {\lambda^2(x)}\, dx \end{align}\] This variational problem can be solved to yield \(\lambda^*(x) = c \cdot p_X(x)^{1/3}\); this generalizes to non-quadratic losses! In other words, to minimize the MSE, the point density should be proportional to \(p_X^{1/3}\). It turns out that \[ D_{\mathrm{scalar}}(R) = \dfrac{\pi \sqrt 3}{2} \sigma^2 2^{-2R}, \quad R\gg 1 \] Quantization for quantum systems?

Vector quantization

By vector quantization, we mean quantizing blocks of i.i.d. \(X_j\) jointly. For i.i.d. Gaussians, we gain a factor of \(1.65\) in RMSE, or 0.72 bits/sample (i.e. \(R\mapsto R+0.72\) kills the constant factor): \[ D(R) = \sigma^2 2^{-2R}, \quad \forall R>0 \]

Given \(X^n\sim P_X\) i.i.d. on \(\mathcal A\) with reconstruction alphabet \(\hat{\mathcal A}\), the distortion metric is \(d(x, \hat x):\mathcal A\times \hat {\mathcal A}\to \bar{\mathbb R}\). The i.i.d. assumption is not as important, but the separability of the distortion metric is: \[ d(x^n, \hat x^n) = \dfrac 1 n \sum_{t=1}^n d(x_t, \hat x_t) \] Define an \((n, M, D)\)-code (or vector quantizer) to be a pair \((f, g)\) with \[ X^n\xrightarrow f W\in [M] \xrightarrow g \hat X^n\in \hat{\mathcal A}^n \] such that \(\mathbb E[d(X^n, \hat X^n)] \leq D\). Every scalar quantizer can be tensorized into a vector quantizer. Similarly, define the fundamental limit \[ M^*(n, D) = \inf\{M: \exists\, (n, M, D)\text{-code}\} \] Note that \(\log M^*\) is subadditive: \[ \log M^*(n_1+n_2, D)\leq \log M^*(n_1, D) + \log M^*(n_2, D) \] The rate-distortion function is defined as the normalized limit \[ R(D) = \lim_{n\to \infty} \dfrac 1 n \log M^*(n, D) \] For the non-separable metric \(d_n(X^n, \hat X^n) = 1\{X^n\neq\hat X^n\}\) we recover lossless compression. The information version of the rate-distortion function is \[\begin{align} R^{(I)}(D) &= \lim_{n\to \infty} \dfrac 1 n \inf_{P_{\hat X^n|X^n}:\ \mathbb Ed_n(X^n, \hat X^n) \leq D} I(X^n; \hat X^n) \end{align}\] For an i.i.d. source and separable distortion, applying tensorization yields the single-letter expression \[ R^{(I)}(D) = \inf_{\mathbb Ed(X_1, \hat X_1)\leq D} I(X_1; \hat X_1) \] Moreover, the map \(D\mapsto R^{(I)}(D)\) is convex (this follows by mixing the rate-achieving channels and applying convexity of MI).

Theorem 12.13 For every \((n, M, D)\)-code, \(\log M \geq n R^{(I)}(D)\).

By definition, \(I(X^n; \hat X^n) \geq n R^{(I)}(D)\); on the other hand, the MI is upper-bounded by the entropy of the bottleneck, \(\log M\).

The achievability bound (analogous to Shannon’s noisy coding theorem) is obtained by a similar random-coding technique.

Dec 4: Density estimation

Agenda:

  1. Review of metric entropy.
  2. \(\log N \propto 1/\epsilon\).
  3. Density estimation.
  4. Yang-Barron method.
  5. Lower bound.

Rule of thumb: the minimax rate of a class satisfies \[ R_n^* \approx \inf_\epsilon \left\{ \epsilon^2 + \dfrac 1 n \log N(\epsilon) \right\} \]

Review of metric entropy

Given \(\Theta\subset V\) a metric space, define the covering number \(N(\epsilon; \Theta)\) to be the minimum number of \(\epsilon\)-balls which cover \(\Theta\); its logarithm is the metric entropy. This characterizes how “compact” the space \(\Theta\) is.

Similarly, the packing number \(M(\epsilon; \Theta)\) is the maximum number of pairwise \(\epsilon\)-separated points contained in \(\Theta\). We have shown that \[ M(\epsilon) \geq N(\epsilon) \geq M(2\epsilon) \] We also considered the class \(\mathcal F_\beta^d(C)\) of pdf's such that \(f^{(\lfloor \beta\rfloor)}\) is \((\beta-\lfloor \beta\rfloor)\)-Hölder. An \(\alpha\)-Hölder function satisfies \(|g(x) - g(y)| \leq C |x-y|^\alpha\).

Foundational theorem: for \(L_{p}\)-distances, \(p\in [1, \infty]\), we have \(\log N(\epsilon, \mathcal F_\beta^d) \asymp (1/\epsilon)^{d/\beta}\).

One immediate implication is the non-existence result related to Hilbert’s 13th problem for \(1\)-Lipschitz functions.

Consider samples \(X_1, \dots, X_n\) drawn i.i.d. from \(f\in \mathcal F_\beta^d\) on \([0, 1]^d\), i.e. the collection of \(\lfloor \beta\rfloor\)-times differentiable densities whose top derivative is \((\beta - \lfloor \beta\rfloor)\)-Hölder.

Define the minimax rate to be \[ R_{n, \mathrm{dist}}^*(\mathcal F_\beta^d) = \inf_{\hat f_n:X^n\to \mathcal F_\beta^d} \sup_{f\in \mathcal F_\beta^d} \mathbb E\, \mathrm{dist}(\hat f_n, f) \] Note that \(L_{p}\) for \(p\neq 1\) is non-statistical, in that a homogeneous rescaling of the whole space changes the distance. The foundational result in nonparametric statistics is \[ R^*_{n, L_p} \asymp n^{-\beta/(2\beta + d)} \ \ (p<\infty), \qquad R^*_{n, L_\infty} \asymp \left(\dfrac{\log n}{n}\right)^{\beta/(2\beta+d)} \] Today we will prove the case \(p=2\). Note that as \(\beta\to \infty\), the rate approaches \(\sqrt{1/n}\), the parametric rate!

Proceeding to the proof, define the contracted class \[ \mathcal F_\beta' = \dfrac 1 2 U + \dfrac 1 2 \mathcal F_\beta \] where \(U\) is the uniform density.

We first work with a contracted class with common support.

Lemma 12.3 \(R_{n, L_p}^*(\mathcal F_\beta')\) is the same as that of \(\mathcal F_\beta\) up to a factor of \(2\).

Proof: One direction is trivial: \(\mathcal F_\beta'\subset \mathcal F_\beta \implies R_n^*(\mathcal F_\beta') \leq R_n^*(\mathcal F_\beta)\). Next, given data from \(\mathcal F_\beta\), use \(R_{n, L_p}^*(\mathcal F_\beta')\) to select an estimator for the mixture with \(U\) (which belongs in \(\mathcal F_\beta'\)) and apply Markov’s inequality.

Usually, the bottleneck in proving entropic characterizations is proving \(H^2 \approx D\). Yang-Barron is one of the first applications of information theory to obtain a statistical upper-bound.

Lemma 12.4 \(\forall f, g\in \mathcal F_\beta'\), up to global constants the \(H^2, KL, \chi^2, L_2^2\) are equivalent.

Proof:

  1. First note that \(f, g\geq 1/2\), and there exists an upper bound \(C_{\beta, d}\) such that \(f, g\leq C_{\beta, d}\).
  2. This immediately yields \(\chi^2 \asymp L_2^2\) up to universal constants: \[ \chi^2(f\|g) = \int \dfrac{(f-g)^2}{g} \approx \int (f-g)^2 \] The same applies to Hellinger via \((\sqrt x-1)^2 \geq c(x-1)^2\) on bounded domains, so \(H^2 \geq c\cdot L_2^2\).

Proceeding to prove the Yang-Barron method, we first use the online-to-batch conversion: given \(X_j\) sampled i.i.d from \(P_{\theta\in \Theta}\) and an estimator \(\hat \theta_n(X_1, \dots, X_n)\).

  1. The batch minimax risk \(R_n^*(\Theta) = \inf_{\hat \theta_n} \sup_{\theta\in \Theta} \mathbb E\, l(\theta, \hat \theta_n)\).
  2. The online (cumulative) risk \[ C_n^*(\Theta) = \inf_{\hat \theta_1, \dots, \hat \theta_n} \sup_{\theta \in \Theta} \sum_{t=1}^n \mathbb E\, l(\theta, \hat \theta_t) \]

Proposition 12.10 \((n+1)R_n^*(\Theta) \leq C_{n+1}(\Theta)\). Moreover, if \(C_n \approx n^{\alpha>0}\), then \(R_n^*\approx n^{\alpha - 1}\).

The main constructive step in Yang-Barron is to take \(\theta\leftrightarrow f\), and \(\hat \theta_t \leftrightarrow \hat f_t(\cdot; X_1, \dots, X_{t-1})\). Take \(l(\theta, \hat \theta_t) = D(f\|\hat f_t)\).

  • The batch loss is exactly the KL risk, which (on \(\mathcal F_\beta'\)) is equivalent to the \(L_2\) risk.
  • What about the cumulative loss? \[\begin{align} C_n(\mathcal F) &= \inf_{\hat f_0, \dots, \hat f_{n-1}} \sup_f \sum_{t=0}^{n-1} D(f\|\hat f_t) \\ &= \inf \sup D(f^{\otimes n} \| \hat f_0\times \dots \times \hat f_{n-1}) \end{align}\] But the choice of \(\hat f_0, \dots, \hat f_{n-1}\) is equivalent to the choice of a joint distribution \(Q_{X^n}\). The cumulative minimax risk in KL for density estimation is therefore \[ C_n(\mathcal F) = \inf_{Q_{X^n}} \sup_{f\in \mathcal F} D(f^{\otimes n} \| Q_{X^n}) = \sup_{\pi \in \mathcal P(\mathcal F)} I(f; X^n) \] Note that we can upper-bound this capacity using the metric entropy.

The final Yang-Barron estimator is an online exponential-weights mixture.

A recap of the proof sketch:

  1. Reduce \(\mathcal F_\beta\to \mathcal F_\beta'\) to a manageable class with common support.
  2. Demonstrate that on \(\mathcal F_\beta'\), KL and \(L^2\) are equivalent up to universal constants.
  3. Apply the online-to-batch reduction and recognize the cumulative risk in terms of metric entropy.

Proceeding to prove the lower bound, given a family \(\Theta\), let \(M_H(\epsilon, \Theta)\) be the packing number under Hellinger. The capacity satisfies \[ C_n(\Theta) \geq \min \left[\dfrac{\log e}{2} n \epsilon^2, \log M_H(\epsilon)\right] - \log 2 \] To prove this, take \(\theta_1, \dots, \theta_M\) such that \(H^2(P_{\theta_j}, P_{\theta_k}) \geq \epsilon^2\). Recall that there exists a hypothesis test for \(P_{\theta_j}^{\otimes n}\) against \(P_{\theta_k}^{\otimes n}\) such that (using the TV-\(H^2\) comparison and tensorization of \(H^2\)) \[ P_e \leq \exp \left[ -n H^2(P_{\theta_j}, P_{\theta_k})/2 \right] \] To conduct the \(M\)-ary hypothesis test, take the decoded value \(\hat \theta\) to be the winner of all pairwise tests (else declare error); then \[ \mathrm{Pr}[\theta \neq \hat \theta] \leq M e^{-n\epsilon^2/2} \] Suppose \(\mathrm{Pr}[\theta \neq \hat \theta]<1/2\); then use Fano’s inequality to lower-bound the mutual information: \[ C_n \geq I(\theta; X^n) \geq I(\theta; \hat \theta) \geq \dfrac 1 2 \log M - \log 2 \]

Dec 9. Strong DPI

Agenda:

  1. Wrap-up relation between statistics and information theory.
  2. Definition of SDPI.
  3. Less noisy preorder.
  4. Spiked Wigner model.

Minimax stats

Recall that we wish to estimate \(\theta\in \Theta\) and observe \(X\sim P_\theta\). The metric is \(\mathbb El(\theta, \hat \theta)\) to be minimized.

Density estimation falls under this example: \(\theta\) is a pdf, the observation consists of i.i.d. samples, and the metric is KL, \(H^2\), or TV.

Lower bound: If there exists an \(\epsilon\)-estimator \(\hat \theta\) for \(\theta\), then \(I(\theta; \hat \theta)\leq I(\theta; X)\) implies that the metric entropy of \(\Theta\) is \(\leq\) the capacity of the \(\theta\to X\) channel. This is called the mutual information method (MIM).

  • The metric entropy of \(\Theta\) can be bounded by a packing of \(\Theta\) in the \(l\)-metric.
  • The capacity of \(\theta\to X\) can be bounded by a covering in KL.

The difficulty of learning (statistically estimating) high-dimensional statistical classes may be captured by metric entropy. The metric entropy gives us a rough estimate of “how hard” it is to estimate a class.

Combinatorial statistics

Recall the DPI graph \((P_X, Q_X)\xrightarrow {P_{Y|X}} (P_Y, Q_Y)\) and \(D(P_X\|Q_X) \geq D(P_Y \| Q_Y)\). Strong DPI aims to prove the existence of \(\eta<1\) such that \[ \eta\, D(P_X\|Q_X) \geq D(P_Y\|Q_Y) \]

Definition 12.6 (SDPI constant) Given \(P_{Y|X}\), define the (input-independent) SDPI constant \[ \eta_f(P_{Y|X}) = \sup \dfrac{D_f(Q_Y\|P_Y)}{D_f(Q_X\|P_X)} \] the supremum is taken over all \(Q_X, P_X\) such that the expression makes sense. The input-dependent SDPI constant \(\eta_f(P_X, P_{Y|X})\) is the supremum of the same ratio over \(Q_X\).

Intuition: \(\eta_{\mathrm{KL}}\) is basically the modified log-Sobolev constant, and \(\eta_{\chi^2}\) is the spectral gap for Markov chains. We’ll be mostly focused on the input-independent SDPI constant.

Proposition 12.11 The SDPI constant for \(f\)-divergence and \(f\)-MI are equivalent.

How do we show that the SDPI constant is strictly less than 1?

Proposition 12.12 (properties of SDPI)

  1. \(\eta_{\mathrm{TV}} = \mathrm{diam}_{\mathrm{TV}} = \sup_{x, x'} \mathrm{TV}(P_{Y|X=x}, P_{Y|X=x'})\).
  2. \(\eta_f \leq \eta_{\mathrm{TV}}\).
  3. \(\eta_{\mathrm{KL}} = \eta_{\chi^2} = \eta_{H^2} = \eta_f\) for every \(f\) that is twice-differentiable and operator-convex.
  4. \(\eta_{\chi^2} = \sup_{P_X, f, g} \rho^2(f(X), g(Y))\) i.e. maximum correlation coefficient squared.
  5. For \(\mathcal X\) is binary, let \(P_{Y|X=0} = P_0\) and \(P_{Y|X=1}=P_1\), then \[ \eta_{\mathrm{KL}} = \sup_{0<\beta<1} LC_\beta(P_0\| P_1) \]
  6. Up to a coefficient of \(1/2\), \(\eta_{\mathrm{KL}}\) is always given by Hellinger. \[ \dfrac 1 2 H^2(P_0, P_1)\leq \eta_{\mathrm{KL}} \leq H^2(P_0, P_1) \]
  7. For any \(\mathcal X\), the SDPI constant \(\eta_{\mathrm{KL}}\) is equal to that of the worst binary subchannel (Bernoulli on two-atom support).
  8. Corollary: \(\dfrac 1 2 \mathrm{diam}_{H^2} \leq \eta_{\mathrm{KL}}(P_{Y|X}) \leq \mathrm{diam}_{H^2}\).

Proposition 12.13 \(\eta_{KL}(\mathrm{BSC}_\delta) = (1-2\delta)^2\).

To prove this, use \(\eta_{\mathrm{KL}} = 1 - (1 - \frac 1 2 H^2(\mathrm{Ber}_\delta, \mathrm{Ber}_{\bar \delta}))^2\). Achievability is proven by taking inputs \(\mathrm{Ber}(\alpha)\) with \(\alpha\to 1/2\) and comparing MI.

Definition 12.7 (less noisy comparison) \(P_{Y|X}\leq_{ln} P_{Z|X}\) if \(\forall P_{U, X}\) in the chain \(U\to X\to (Y, Z)\), we have \(I(U; Y) \leq I(U; Z)\).

We also have the following result: \(\eta_{\mathrm{KL}}(P_{Y|X}) \leq 1 - \tau\iff P_{Y|X} \leq_{ln} \mathrm{Erasure}_\tau\).

Theorem 12.14 (tensorization) The less-noisy definition tensorizes: \[ P_{Y_j|X_j} \leq_{ln} P_{Z_j|X_j} \implies \prod P_{Y_j|X_j} \leq_{ln} \prod P_{Z_j|X_j} \] i.e. given \(I(U; Y_j)\leq I(U; Z_j)\), we wish to show \(I(U; Y_1, Y_2)\leq I(U; Z_1, Z_2)\).

To show this, use the chain rule: \[\begin{align} I(U; Y_1, Y_2) &= I(U; Y_1) + I(U; Y_2|Y_1) \leq I(U; Y_1) + I(U; Z_2|Y_1) = I(U; Y_1, Z_2) \end{align}\] and then repeat the argument in the first coordinate to reach \(I(U; Z_1, Z_2)\).

Dec 11: SDPI, Distributed Estimation

Agenda:

  1. Review SDPI.
  2. Finish Spiked Wigner.
  3. Correlation estimation (Gap Hamming problem in TCS) and distributed mean estimation (machine learning).

Recall the S-DPI constant \[ \dfrac 1 2 \mathrm{diam}_{H^2}\leq \eta_{\mathrm{KL}}(P_{Y|X}) \leq \mathrm{diam}_{H^2}(P_{Y|X}) \] Also recall the less-noisy definition:

  • For all \(P_{UX}\), we have \(I(U; Y)\leq I(U; Z)\); then \(P_{Y|X}\prec_{ln}P_{Z|X}\), i.e. \(Z\) is less noisy than \(Y\).
  • Consider another “natural” definition: for all \(P_X\), we have \(I(X; Y) \leq I(X; Z)\); this says that \(P_{Y|X}\prec_{mc}P_{Z|X}\), i.e. \(Z|X\) is more capable than \(Y|X\).

Why is less noisy better than more capable? First, it’s stronger; and it additionally ensures that not only is the whole input \(X\) “better preserved,” but also any subobject of \(X\). Properties of the less-noisy comparison:

  1. Tensorization: \(P_{Y|X}\preceq_{ln}P_{Z|X} \iff \forall n: P_{Y|X}^{\otimes n}\preceq_{ln}P_{Z|X}^{\otimes n}\).
  2. \(\eta_{KL}(P_{Y|X})\leq 1-\tau \iff P_{Y|X} \preceq_{ln} EC_\tau\).

Spiked Wigner

Consider an ambient Wigner noise matrix \(W_{jk}=W_{kj} \sim \mathcal N(0, 1)\) and a “signal” \(X_j\sim \mathrm{Unif}\{\pm 1\}\); we observe \[ Y = \sqrt{\dfrac\lambda n} XX^T + W \] and form an estimator \(\hat X(Y)\).

  1. If \(\lambda \leq 1\), then \(\forall \hat X: \dfrac 1 n \mathbb E|\hat X^TX|\to 0\), i.e. if the signal is small, then asymptotically we cannot even learn about the planted signal.
  2. If \(\lambda>1\), then for \(\hat X\) the top eigenvector of \(Y\), the correlation is bounded away from \(0\): \(\dfrac 1 n \mathbb E|\hat X^TX|\geq \epsilon_0(\lambda)>0\).
    • This follows from random matrix theory.

Recall inference on graphs: consider a graph with vertex labels \(X_j\sim \mathrm{Unif}\{\pm 1\}\). For every edge \(e=(u, v)\), we have access to a channel \(P_{Y_e|X_e}\), where \(X_e = X_uX_v\). For the spiked Wigner problem, we have a complete graph and \[ Y_e = \sqrt{\dfrac\lambda n} X_uX_v + \mathcal N(0, 1) \] Consider instead the erasure channel \(P_{Z_e|X_e} = EC_\tau(X_uX_v)\). How do we compute \(I(X_1; X_2|Z_E)\)?

  • If there is an open path (i.e. without erasure) between \(X_1, X_2\), then we know \(X_1X_2\) with certainty.

Unrolling the definition yields \[ I(X_1; X_2|Z_E) = \mathrm{Pr}[\exists\text{ open path in } ER(1-\tau, n)] \] Theorem (Erdős-Rényi random graph model): for \(\lambda<1\), all connected components of \(ER(\lambda/n, n)\) are of size \(O(\log n)\); in this regime, the probability that \(X_1, X_2\) are connected goes as \(\log n / n \to 0\).

For \(\lambda>1\), there exists a giant connected component of size \(\Omega(n)\).

Back to spiked Wigner: we compute the \(\eta_{KL}\) of \((X_1, X_2)\mapsto X_1X_2\mapsto Y=\sqrt{\lambda/n}\, X_1X_2 + W\) and apply the less-noisy reduction. Retracing the steps:

  1. Starting from random matrix theory, we recognized the reduction to graph inference.
  2. Graph inference is easily bounded for the \(EC_\tau\) channel using ER theory.
  3. Reduction to the relatively easy \(EC_\tau\) is done using SDPI theory.

Correlation estimation

Given i.i.d. pairs \((X_j, Y_j)\) drawn from \(P_{XY}\) with uniform marginal on \(\pm 1\) and correlation \(\rho\); equivalently, \(X\xrightarrow{\mathrm{BSC}_\delta} Y\) with \(1-2\delta = \rho\).

Suppose we wish to estimate \(\rho\) between two different parties holding \(X, Y\) respectively, using \(B\) bits of communication. Claim: \[ \mathbb E(\rho - \hat \rho)^2 = \dfrac 1 {2\log 2} \cdot \dfrac{1+o(1)}{B} \]

  • Contraction coefficient tensorizes (invariant) assuming independent inputs.
  • Removing this constraint, the SDPI coefficient tends to \(1\) due to the existence of error-correcting codes.

Yury’s great quote: popularity is not quality; McDonald's produces billions of dollars of gross sales, but the Michelin restaurants are not upset about it.