Oct 7: Data compression
Theorem 12.1 (local expansion of \(\chi^2\)) For \(\theta_0 = 0\), when the following conditions hold:
- \(\theta \in [0, \tau)\).
- There exists \(\dot p^\theta(x)\) satisfying
\[
p^\theta(x) = p^0(x) + \int_0^\theta \dot p^t(x)\, dt
\]
- The Fisher information is defined everywhere
in an open interval about \(\theta_0\):
\[
\int dx\, \sup_{0\leq t < \tau}
\dfrac{\dot p^t(x)^2}{p^0(x)} < \infty
\]
Then \(\chi^2(P^{\theta_0 + \delta} \| P^{\theta_0}) = \delta^2 \mathcal J_F(\theta_0) + o(\delta^2)\).
The quadratic local behavior of \(\chi^2\) requires
not only finite Fisher information at \(\theta_0\);
it also requires \(\mathcal J_F\) to be finite almost everywhere
on a nontrivial interval.
In summary, \(\chi^2\) is nice for the “good” cases
where \(\mathcal J\) is finite a.e. However, for some
irregular cases it’s best to use
\(H^2(P^{\theta_0 + \delta}, P^{\theta_0})\), whose local
behavior is always determined by the local Fisher information.
A location family is a one-parameter family for which
the parameter controls the displacement of a fixed density \(\nu\),
e.g. \(\mathcal N(\theta, 1)\); for this family,
\(\mathcal J(\theta) = \mathcal J(0)\), which we can abbreviate as
\[
\mathcal J(\nu) = \int \dfrac{(\nu')^2}{\nu} \, dx
\]
Note that Fisher information \(\mathcal J_F\) is originally defined for
a one-parameter family; with the notation \(\mathcal J(\nu)\),
we’re assuming the one-parameter family to be the location family
of the given density.
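As a numerical sanity check (a minimal sketch; the quadrature grid and finite-difference step are arbitrary choices), \(\mathcal J(\nu)\) for the standard Gaussian should come out to 1, matching \(\mathcal J_F\) of \(\mathcal N(\theta, 1)\):

```python
import math

# J(nu) = ∫ (nu')^2 / nu dx for the standard Gaussian density nu.
# Since nu'(x) = -x * nu(x), the true value is E[X^2] = 1.
def nu(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def d_nu(x, h=1e-5):
    # central finite-difference approximation of nu'
    return (nu(x + h) - nu(x - h)) / (2 * h)

a, b, N = -10.0, 10.0, 20000        # the tails beyond |x|=10 are negligible
dx = (b - a) / N
J = sum(d_nu(a + k * dx) ** 2 / nu(a + k * dx) for k in range(N + 1)) * dx
print(round(J, 4))  # ≈ 1.0
```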
Theorem 12.2 (van Trees inequality) Under regularity assumptions on \(P^\theta\),
for every estimator \(\hat \theta(X)\) and prior
\(\pi\)
\[
\mathbb E_{\theta \sim \pi} \mathbb E_{X\sim P^\theta} \left[
[\theta - \hat \theta(X)]^2
\right] \geq \dfrac 1 {\mathcal J(\pi) +
\mathbb E_{\theta \sim \pi} \mathcal J_F(\theta)}
\]
Corollary 12.1 For \(n\) i.i.d. observations, for every \(\pi\)
\[
R_n^* \geq \dfrac 1 {\mathcal J(\pi) +
n\mathbb E_{\theta \sim \pi} \mathcal J_F(\theta)}
\]
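The corollary can be checked in the conjugate Gaussian case, where the posterior-mean estimator attains the van Trees bound with equality; a Monte Carlo sketch (all parameter choices here are arbitrary):

```python
import random, statistics

random.seed(0)
# X_i ~ N(theta, sigma^2) i.i.d., prior theta ~ N(0, s^2).
# van Trees: Bayes risk >= 1 / (J(pi) + n * J_F), tight for the posterior mean.
sigma, s, n, trials = 1.0, 2.0, 10, 20000
J_F = 1 / sigma**2          # Fisher information of N(theta, sigma^2)
J_pi = 1 / s**2             # location Fisher information of the prior N(0, s^2)
bound = 1 / (J_pi + n * J_F)

sq_errs = []
for _ in range(trials):
    theta = random.gauss(0, s)
    xbar = statistics.fmean(random.gauss(theta, sigma) for _ in range(n))
    theta_hat = (n / sigma**2) * xbar / (J_pi + n * J_F)  # posterior mean
    sq_errs.append((theta - theta_hat) ** 2)

risk = statistics.fmean(sq_errs)
print(bound, risk)  # the two agree up to Monte Carlo noise
```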
van Trees is important because it provides
a lower bound on minimax risk based on
local quantities \(\mathcal J_F(\theta)\).
Two takeaways for information theorists:
- In the Bayesian setting, apply information-theoretic techniques
to the joint \(P_{\theta, X}\).
- In local neighborhoods, \(\chi^2\) satisfies an additive
chain rule (since it’s close to KL).
Van Trees replaces \(\mathcal J_F\) with the expectation of
\(\mathcal J_F\) under the prior, and it applies to every estimator (not only unbiased ones).
Oct 9: Data compression II
Overview:
- Review
- Distribution of fixed length
- Arithmetic encoder
- Stationary, ergodic data sources
- Lempel-Ziv (adaptive, universal compression)
Main takeaways:
- AEP.
- Optimal compression is possible using (1) arithmetic encoding
and (2) optimal next-token prediction.
Information theory is about bounding
hard, intractable operational quantities with mathematically
analyzable quantities.
Review
Assume \(P_X\) known and ordered \(P_X(1)\geq P_X(2)\geq \cdots\).
The optimum compressor \(f^*\) assigns to the \(x\)-th most likely symbol a string of length
\[
l(f^*(x)) = \lfloor \log_2 x\rfloor
\]
We also proved that the expected
optimum compression length satisfies
\[
\mathbb El(f^*(X)) \approx H(X)
\]
In fact, this is upper-bounded by \(H(X)\).
But this is not an algorithm!
- Sorting the distribution is intractable.
- It’s variable-length but not prefix-free!
We’re assuming one-shot compression
(the comma is free). However, for prefix-free codes,
the optimal expected length is lower-bounded by \(H(X)\).
- Proof idea: typicality; break region into typical (tractable)
and atypical regions with vanishing probability.
- Asymptotic equipartition property: i.i.d. implies
law of large numbers (in log-space).
The compression length is distributed (up to constants)
as the entropy density. This is great because the entropy density tensorizes.
Proposition 12.3 (distribution of compression length) Define random variable \(L^* = l(f^*(X))\).
Denote the entropy density \(i_X(a) = -\log P_X(a)\);
the optimal compressor should compress symbol \(a\) to roughly
this length:
\[
\mathrm{Pr}[i(X) \leq k] \leq \mathrm{Pr}[L^* \leq k] \leq \mathrm{Pr}[i(X) \leq k + \tau]
+ 2^{1-\tau}
\]
This holds for every \(k\in \mathbb Z_+, \tau>0\).
Proof: The left bound is easy: since the rank of \(x\) in the sorted order satisfies \(\mathrm{rank}(x) \leq 1/P_X(x)\),
\[
L^*(x) = \lfloor \log_2 \mathrm{rank}(x)\rfloor
\leq \log_2 \dfrac 1 {P_X(x)} = i_X(x)
\]
To bound the second term, decompose the probability
\[\begin{align}
\mathrm{Pr}[L^* \leq k]
&= \mathrm{Pr}[L^* \leq k, i(X) \leq k+\tau] + \mathrm{Pr}[L^* \leq k, i(X) > k+\tau] \\
&\leq \mathrm{Pr}[i_X(X) \leq k+\tau] + (\cdots)
\end{align}\]
The second term is bounded by the number of strings of
length at most \(k\) (at most \(2^{k+1}\)) times the maximum probability
each such string can carry (\(2^{-k-\tau}\)), yielding \(2^{1-\tau}\).
Corollary: if, for some sequence of r.v., the normalized entropy
rate converges in distribution to \(U\), then the normalized optimal
compression length (for asymptotically large block length)
also converges to \(U\).
Another corollary: if \(S_j\sim P_S\) i.i.d., then
\[
\dfrac 1 n i_{S_1^n}(S_1^n) = -\dfrac 1 n \log P_{S^n}(S^n)
= -\dfrac 1 n \sum_{j=1}^n \log P_S(S_j) \to H(S)
\]
This implies that the expectation of the optimal compression length
for i.i.d. source \(X\) converges to \(H(X)\) in the limit of asymptotically
large block lengths.
This is a nontrivial result, because the optimal compressor
is a very freely-specified object, but we are able to bound its
behavior very neatly.
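The concentration of the normalized entropy density can be seen in a quick simulation (a sketch with a made-up three-symbol source):

```python
import math, random

random.seed(1)
# AEP sketch: for an i.i.d. source, -(1/n) log P(S^n) concentrates at H(S).
probs = {"a": 0.5, "b": 0.3, "c": 0.2}
H = -sum(p * math.log2(p) for p in probs.values())   # ≈ 1.485 bits

def sample_entropy_density(n):
    s = random.choices(list(probs), weights=list(probs.values()), k=n)
    return -sum(math.log2(probs[x]) for x in s) / n

for n in (10, 100, 10000):
    print(n, round(sample_entropy_density(n), 3))
# the printed values approach H ≈ 1.485 as n grows
```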
In particular, recall that the optimal compressor maps the
highest-probability atoms to the shortest strings (the top one to the
empty string), but we see from the
asymptotic Gaussian distribution of the compression length that
they carry almost negligible probability. This is a demonstration of the
asymptotic equipartition property, which states that
for i.i.d. sources, the overwhelming majority of sequences have
roughly the same probability \(2^{-nH(X)}\).
Arithmetic encoder
Given a general \(\{X_j\}\)-process, we wish to compress \(X_1^n\).
First consider the i.i.d. case. Order the alphabet and
recursively partition the interval \([0, 1]\) so that
each interval has length equal to its probability.
The trick is to find the largest dyadic interval (recursive
binary partition) that fits inside the interval of the message.
For example:
\[
[0, 1]\to \emptyset, \quad [6\cdot 2^{-3}, 7\cdot 2^{-3}]\to 110
\]
In the limit that the codestring goes to infinity,
the distribution of binary expansion will be uniform.
Fact for the compression length of arithmetic encoder for i.i.d. terms:
\[
\log_2 \dfrac 1 {P_{X^n}(x^n)} \leq l(f_{\mathrm{ae}}(x^n))
\leq \log_2 \dfrac 1 {P_{X^n}(x^n)} + 2
\]
The arithmetic encoder is additionally sequential: it does not
need to consume the full string to start outputting compression;
the same holds for the decompressor.
Implementing the arithmetic encoder for general non-i.i.d.
distributions just replaces the subsequent interval splits by the conditionals
\(P_{X_n | X^{n-1}}\).
This means that, if we can sequentially predict the conditional
\(P_{X_n | X^{n-1}}\) very well (e.g. a next-token-prediction LLM),
then we can close-to-optimal compress a non-i.i.d. distribution
by combining this with the arithmetic encoder.
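A toy arithmetic encoder for the i.i.d. case can be sketched directly from the interval description above (exact arithmetic via `Fraction`; `encode` is a hypothetical helper, not a standard API):

```python
from fractions import Fraction
from math import ceil, log2

# Toy arithmetic encoder (i.i.d. model): shrink [0,1) to the message's
# interval, then emit the largest dyadic interval contained inside it.
def encode(msg, probs):
    cum, c = {}, Fraction(0)
    for sym, p in probs.items():        # cumulative interval boundaries
        cum[sym] = c
        c += p
    lo, width = Fraction(0), Fraction(1)
    for sym in msg:
        lo += width * cum[sym]
        width *= probs[sym]
    hi = lo + width
    k = 0
    while True:                         # find [m/2^k, (m+1)/2^k) ⊆ [lo, hi)
        m = ceil(lo * 2**k)
        if Fraction(m + 1, 2**k) <= hi:
            return format(m, f"0{k}b") if k else ""
        k += 1

probs = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
code = encode("abac", probs)
p_msg = Fraction(1)
for s in "abac":
    p_msg *= probs[s]
n_bits = -log2(float(p_msg))
print(code, len(code), n_bits)  # code length ≤ -log2 P(msg) + 2
```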
Theorem 12.3 (Shannon–McMillan–Breiman) The AEP holds w.r.t. the entropy rate
if \(\{X_j\}\) is a stationary (ensures
existence of the entropy rate) ergodic process.
Shannon’s proof:
every stationary ergodic process can be approximated arbitrarily
well by an \(m\)-th-order Markov chain as \(m\to \infty\).
Lempel-Ziv
Central question: how to compress well without \(P_{X^n}\)?
Lemma 12.1 (Kac's lemma) Given a stationary ergodic process \((X_j)_{j\in \mathbb Z}\).
Define \(L = \inf\{t>0: X_{-t}=X_0\}\), then
\[
\mathbb E[L | X_0=u] = \mathrm{Pr}[X_0=u]^{-1}
\]
Proof: consider the probability that we don’t see \(u\)
when we look back for \(k\) steps, using stationarity:
\[\begin{align}
\mathrm{Pr}[L>k, X_0=u]
&= \mathrm{Pr}[X_0=u, X_{-1}\neq u, \cdots, X_{-k}\neq u] \\
&= \mathrm{Pr}[X_k=u, X_{k-1}\neq u, \cdots, X_0\neq u]
= \mathrm{Pr}[E_k]
\end{align}\]
TODO: Another key point is that for stationary ergodic
processes, \(\mathrm{Pr}[\bigcup E_k] = 1\).
Using Kac’s lemma, we can do an unbiased estimation of \(\mathrm{Pr}[X_0=u]\)
by looking back.
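Kac's lemma can be checked by simulation on a simple stationary ergodic process, e.g. a two-state Markov chain started from its stationary law (a sketch; the transition probabilities are arbitrary):

```python
import random

random.seed(2)
# Kac's lemma on a stationary two-state Markov chain:
# the mean return time to state u equals 1 / Pr[X_0 = u].
p01, p10 = 0.3, 0.1                      # transitions 0->1 and 1->0
pi1 = p01 / (p01 + p10)                  # stationary Pr[X_0 = 1] = 0.75

N = 200_000
x = 1 if random.random() < pi1 else 0    # start from the stationary law
last_seen, gaps = None, []
for t in range(N):
    if x == 1:
        if last_seen is not None:
            gaps.append(t - last_seen)   # return time to state 1
        last_seen = t
    flip = random.random()
    x = (1 if flip < p01 else 0) if x == 0 else (0 if flip < p10 else 1)

print(sum(gaps) / len(gaps), 1 / pi1)    # both ≈ 4/3
```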
Do probability mixtures satisfy all of the local Fisher
regularity conditions?
Oct 28: Binary hypothesis testing
Agenda:
- Finish universal compression.
- Define binary hypothesis testing; define \(R(P, Q)\).
- Stein’s lemma.
More on universal compression
Recall that universal compression wants to compress
\(P_{X^n}\) to entropy, where \(P_{X^n}\) is unknown
and drawn from a class \(\Pi\).
The fundamental limit of compression is
the redundancy of class \(\Pi\), defined by
\[
R_n^*(\Pi) = \inf_{Q_{X^n}}
\sup_{P_{X^n}\in \Pi} \mathbb E_{X^n\sim P_{X^n}}
\log \dfrac 1 {Q_{X^n}(X^n)} - H(X^n)
\]
Note the order of \(\inf\) and \(\sup\)! Else
the “redundancy” would be trivial.
It turns out that the uniform mixture over all Bernoullis,
\[
Q_{X^n} = \int_0^1 d\theta\, \mathrm{Ber}(\theta)^{\otimes n},
\]
satisfies \(D(\mathrm{Ber}(\theta)^{\otimes n} \| Q_{X^n}) \leq \log n\) for all \(\theta\).
To see this, for non-extremal \(\theta, \theta'\) we have
\[
d(\theta \| \theta') \approx \dfrac {(\theta - \theta')^2}{2\theta'(1 - \theta')}
\]
The distribution \(Q_{X^n}\) is the universal probability assignment
(or maximally ignorant prior).
Universal data compression (probability assignment) \(\approx\)
choosing the best prior.
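The divergence bound can be verified exactly for moderate \(n\): integrating out \(\theta\) gives \(Q(x^n) = \frac{1}{(n+1)\binom{n}{k}}\) for a sequence with \(k\) ones, so the divergence reduces to a finite sum over \(k\) (a sketch in natural log; note \(\log P/Q \leq \log(n+1)\) pointwise):

```python
from math import comb, log

# D(Ber(theta)^n || Q) where Q(x^n) = 1 / ((n+1) * C(n, k)) for k ones.
def kl_ber_mixture(theta, n):
    d = 0.0
    for k in range(n + 1):
        p_k = comb(n, k) * theta**k * (1 - theta) ** (n - k)   # P[k ones]
        log_ratio = (k * log(theta) + (n - k) * log(1 - theta)
                     + log(n + 1) + log(comb(n, k)))           # log P/Q per sequence
        d += p_k * log_ratio
    return d

n = 100
for theta in (0.1, 0.3, 0.5):
    print(theta, round(kl_ber_mixture(theta, n), 3), round(log(n), 3))
# each divergence is ≈ (1/2) log n + O(1), comfortably below log n
```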
The most important theorem: redundancy is equal to the following capacity:
\[
R_n^*(\Pi) = \sup_{\pi \in \mathcal P(\Pi)} I(\theta; X^n)
\]
Jeffreys (objective) prior: given smoothness conditions for a family \(\Theta\),
we have
\[\begin{align}
R_n^*(\Theta)
&= \dfrac d 2 \log \dfrac{n}{2\pi e} + \log \int_{\Theta}
\sqrt{\det \mathcal J_F(\theta)}\, d\theta + o(1) \\
\pi_n^* &\to \pi_\infty^* \text{ weakly, where }
\pi_\infty^*(d\theta) = \dfrac 1 Z \sqrt{\det \mathcal J_F(\theta)}\, d\theta
\end{align}\]
The objective prior is invariant to
reparameterizations. Importantly, for Bernoullis the Jeffreys prior is
\[\begin{align}
\pi^*(\theta) = \dfrac 1 {\pi\sqrt{\theta(1 - \theta)}}\quad k=2, \text{ else }
\mathrm{Dirichlet}(1/2, \dots, 1/2).
\end{align}\]
The marginal estimator \(Q_{X_t|X_1^{t-1}}\) turns out to correspond
to the add-half estimator (KT code).
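The add-half correspondence can be checked numerically: the predictive probability under the Jeffreys \(\mathrm{Beta}(1/2, 1/2)\) mixture should equal \((k + 1/2)/(t + 1)\) (a crude quadrature sketch; `predictive` is a hypothetical helper):

```python
from math import sqrt

# Predictive Q(X_{t+1} = 1 | k ones in t flips) under the Jeffreys mixture:
# integrate theta against the posterior ∝ theta^k (1-theta)^(t-k) / sqrt(theta(1-theta)).
def predictive(k, t, grid=100_000):
    num = den = 0.0
    for i in range(1, grid):
        th = i / grid
        w = th**k * (1 - th) ** (t - k) / sqrt(th * (1 - th))
        num += th * w
        den += w
    return num / den

k, t = 3, 10
print(predictive(k, t), (k + 0.5) / (t + 1))  # both ≈ 0.31818 (add-half rule)
```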
Connection between universal compression and sequential prediction:
Recalling the setup of sequential prediction, for each step the predictor
is penalized by the negative log likelihood (with minimizer given by the true
distribution). From learning theory, the natural metric for this problem
(of a complexity of a class \(\Pi\)) is
\[
\sup_{P_X\in \Pi} \mathbb E\sum_{t=1}^n
\log \dfrac 1 {\tilde P_t(X_t; X_1^{t-1})}
- \log \dfrac 1 {P_{X_t|X_1^{t-1}}(X_t|X_1^{t-1})}
\]
If some estimator achieves \(o(n)\) for the preceding quantity,
then the class is learnable since the estimator is asymptotically as
good as the oracle. It turns out that
\[
\inf_{\tilde P} \text{regret} = R_n^*(\Pi).
\]
Relation to in-context learning: training phase allows for representation
for different generating procedures (e.g. Jeffreys prior),
and context chooses the posterior which specializes to the suitable
data-generating process.
Every regret-minimizing estimator is an in-context learner.
This is related to a theorem for density estimation (related to the
Yang–Barron bound):
given \(X_1, \cdots X_n\sim P\in \Pi\) i.i.d, we wish to find \(\tilde P\)
which will estimate \(P\) very well.
Theorem 12.4 For every i.i.d. class \(\Pi\) as above, there exists an estimator such that
\[
\mathbb E\,\mathrm{TV}(P, \tilde P)^2 \leq \mathbb E\, D(P\|\tilde P)
\leq \dfrac{R_{n+1}^*(\Pi)}{n+1} \quad (\tilde P\text{ is random})
\]
This is a very insightful theorem since the right-hand side quantity
can be upper-bounded by the redundancy theorem.
Universal data compression (information)
reduces to the problem of universal probability assignment,
which is simultaneously the core of density estimation (statistics)
and sequential prediction (learning).
Minimum description length principle: the more likely
class of data-generating processes is the one whose universal
compressor compresses the data down to a shorter length; this
is equivalent to point-vs-point hypothesis testing after collapsing
each class down to its universal prior.
Binary hypothesis testing
We are interested in simple vs. simple hypothesis testing, i.e.
distinguishing between \(\theta_1\neq \theta_2\in \Theta\) (Neyman-Pearson);
equivalently, distinguishing between \(P_{X^n}\) and \(Q_{X^n}\).
Our goal is to design a binary estimator \(Z=Z(X^n)\in \{0, 1\}\), with
\(0\) denoting \(X^n\sim P\) and \(1\) denoting \(X^n\sim Q\)
(the first serious treatment originated with radars detecting incoming planes).
Two associated quantities:
- \(\alpha = P[Z=0]\): probability of success under \(P\).
- \(\beta = Q[Z=0]\): probability of failure under \(Q\).
Theorem 12.5 (Stein's lemma) Assume \(X^n \sim P^{\otimes n}\) or \(Q^{\otimes n}\). Among all tests with \(\alpha \geq 1 - \epsilon\), the smallest possible
\(\beta = e^{-nD(P\|Q) + o_\epsilon(n)}\).
Several comments:
- This theorem is not symmetric w.r.t. \(P\leftrightarrow Q\).
- The dependence upon \(\epsilon\) is captured in \(o_\epsilon(n)\).
To reason about binary hypothesis testing, define
\[
\mathcal R(P, Q) = \{(P[Z=0], Q[Z=0]) : \text{all randomized tests } Z\}
\]
Proposition 12.4 \(\mathcal R\subset [0, 1]^2\) is closed, convex, symmetric
w.r.t. \((1/2, 1/2)\), and
\[
\mathcal R = \overline{\mathrm{co}\,
\mathcal R_{\mathrm{det}}(P, Q)
}
\]
where \(\mathcal R_{\mathrm{det}}\) denotes the set of deterministic
joint ranges.
As a corollary, the problem can be simplified by noting that
\[
\mathcal R_{\mathrm{det}} = \{(P[E], Q[E]) : E\subseteq \mathcal X\}
\]
For the Bernoulli case, \(E\) takes values in \(\{\emptyset, \{0\}, \{1\}, \mathcal X\}\).
The first and last cases correspond to \((0, 0)\) and \((1, 1)\).
Perspective on the proof of the NP-lemma:
to understand \(\mathcal R_{\mathrm{det}}\), it suffices to understand
its extremal points, which corresponds to the problem
\[
\sup_Z \underbrace{P[Z=0]}_{=\alpha} - e^\gamma\, \underbrace{Q[Z=0]}_{=\beta}
\]
The parameterization \(e^\gamma\) simply assures positivity.
We “scan” over the supporting hyperplanes with a certain offset.
The upshot is that BHT boils down to computing \(P[\log P/Q > \gamma]\)
and \(Q[\log P/Q < \gamma]\) for a range of \(\gamma\)’s.
The “fatter” this joint range is, the more distinguishable the
two distributions are; identical distributions correspond to the
line \(\alpha=\beta\) with zero measure. This motivates the
following results:
Proposition 12.5 \(d(\alpha \| \beta) \leq D(P\|Q)\) for \((\alpha, \beta) \in \mathcal R\).
This results from applying DPI to the estimator \(Z\).
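The DPI bound can be checked exhaustively on a small alphabet by scanning all deterministic tests \(E\) (a sketch; the distributions \(P, Q\) below are arbitrary):

```python
from itertools import combinations
from math import log

# For every deterministic test E ⊆ X, the binary divergence
# d(P[E] || Q[E]) is at most D(P || Q) (DPI applied to 1_E).
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.2, 0.6]

def d_bin(a, b):
    def term(x, y):
        return 0.0 if x == 0 else x * log(x / y)
    return term(a, b) + term(1 - a, 1 - b)

D = sum(p * log(p / q) for p, q in zip(P, Q))
worst = max(
    d_bin(sum(P[i] for i in E), sum(Q[i] for i in E))
    for r in range(len(P) + 1)
    for E in combinations(range(len(P)), r)
)
print(round(worst, 4), round(D, 4))  # worst-case d(alpha||beta) ≤ D(P||Q)
```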
Nov 4. Channel coding
Agenda:
- Finish Sanov’s theorem, large deviation theory, and I-projection.
- Define error-correction codes; example using BSC.
- Weak converse bound.
BHT, large deviations theory
To properly understand hypothesis testing, we need to
understand the large-deviations question:
to compute \(E_0, E_1\), we need to compute \(E\) in
\[
\mathrm{Pr}[\bar T < \gamma] \sim e^{-nE(\gamma)}
\]
Theorem 12.6 (Sanov principle) under regularity conditions on \(\mathcal E\), we have
\(\mathrm{Pr}[\hat P_n\in \mathcal E] = e^{-nE + o(n)}\), where
\[
E = \inf_{R\in \mathcal E} D(R\|P)
\]
Extremely important proof below!
Lemma 12.2 If \(\mathcal X\) is finite and \(\mathcal E\) is convex with
nonempty interior, then Sanov holds.
First consider an upper bound: \(\mathrm{Pr}[\hat P_n\in \mathcal E] \leq e^{-nE}\).
With \(P_{X^n} = P^{\otimes n}\), define \(\tilde P_{X^n} = P_{X^n | E_n}\), where \(E_n = \{\hat P_n\in \mathcal E\}\).
Note that the KL divergence
\[
D(\tilde P_{X^n} \| P_{X^n})
= \mathbb E_{\tilde P_{X^n}} \log \dfrac{P_{X^n|E_n}(X^n)}{P_{X^n}(X^n)}
= \log \dfrac 1 {P_{X^n}(E_n)}
\]
This is the extension of Theorem 11.9. This implies a lot of bounds!
Chernoff, Bernstein, Bennett, Okamoto, etc. In particular,
\[
\forall k \geq np, \quad \mathrm{Pr}[\mathrm{Bin}(n, p)\geq k] \leq \exp \left[-n d(k/n\|p)\right]
\]
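The binomial tail bound is easy to check directly (a sketch; thresholds are taken above the mean \(np\)):

```python
from math import comb, exp, log

# Check Pr[Bin(n, p) >= k] <= exp(-n * d(k/n || p)) for k >= n*p.
def d_bin(a, b):
    out = 0.0
    for x, y in ((a, b), (1 - a, 1 - b)):
        if x > 0:
            out += x * log(x / y)
    return out

n, p = 50, 0.3
for k in range(16, n + 1, 8):        # thresholds above the mean n*p = 15
    tail = sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))
    bound = exp(-n * d_bin(k / n, p))
    assert tail <= bound             # exact tail vs. Chernoff bound
    print(k, tail, bound)
```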
Idea for proving lower bounds: prove the easy case,
then lift to the general case using information measures.
To prove the lower bound: if \(P\in \mathrm{int}(\mathcal E)\),
then the law of large numbers yields \(\mathrm{Pr}[\hat P_n\in \mathcal E]\to 1\).
Otherwise, take \(R\in \mathrm{int}(\mathcal E)\), then
\(R^{\otimes n}[\hat P_n\in \mathcal E]\to 1\). Apply DPI on the indicator \(1_{E_n}\).
I-projections
Take \(X\in \mathbb R^d\) with \(X\sim P\); define the information projection by
\[
I(\gamma) = \inf D(R\|P)
\]
where the infimum is over all \(R\) with \(\mathbb E_R[X]=\gamma\).
Proposition 12.6 (tilting is optimal if exists) Suppose there exists \(\lambda \in \mathbb R^d\) such that
\(\mathbb E_{X\sim P_\lambda}[X]=\gamma\), then
for every \(R\) satisfying \(\mathbb E_R[X]=\gamma\), we have
\[
D(R\|P) = D(R\|P_\lambda) + D(P_\lambda \| P) \geq D(P_\lambda \| P).
\]
Proof: write \(D(R\|P) = \mathbb E_R \log \left(\dfrac{dR}{dP_\lambda}\,\dfrac{dP_\lambda}{dP}\right)\) and regroup
the cross-term. Recognize \(\mathbb E_R \log \dfrac{dP_\lambda}{dP} = \mathbb E_R[\lambda^T X] - \psi(\lambda)\), which depends on \(R\) only through its mean \(\gamma\).
Theorem 12.7 (I-projection on a hyperplane)
- If \(\exists R: \mathbb E_R[X]=\gamma\) and \(D(R\|P)<\infty\),
then \(\gamma\in \mathrm{csupp}(P)\).
- \(\exists P_\lambda\) with \(\mathbb E_{P_\lambda}[X]=\gamma \iff \gamma\in \mathrm{int}(\mathrm{csupp}\, P)\).
Here the convex support of \(P\) is the intersection of all sets \(S\)
such that \(P[S]=1\) and \(S\) is closed and convex; to visualize this, imagine
the mixture of two disjointly supported distributions.
In other words, the information projection is solvable iff \(\gamma\) is in
the convex support, and it is solvable by tilting if \(\gamma\) is in the interior.
Theorem 12.8 (tradeoff in error exponents) \(E_1(E_0) = \inf_{R:\, D(R\|P)\leq E_0} D(R\|Q)\).
Relation to Wald’s SPRT: slightly modify the problem
formulation by committing to an expected (instead of exact)
number of samples. For composite vs. composite tests,
Hellinger computations become important.
Hypothesis testing is still a thriving topic!
Instead of specifying the exact generating distributions
\(P, Q\), only samples are provided (likelihood-free inference).
Channel coding
We started with information measures, then found that
compression is described by entropy; similarly, the optimal error decay
in hypothesis testing is governed by divergence. The problem of channel coding
is solved by mutual information.
Given a channel \(P_{Y|X}:\mathcal X\to \mathcal Y\), we wish to
construct \(f:[M]\to \mathcal X\) and \(g:\mathcal Y\to [M]\cup \{e\}\)
such that the maximal error rate (across all \(j\)) \(\leq \epsilon\).
For generality, allow randomized \(f, g\).
\[
W\xrightarrow f X\xrightarrow{P_{Y|X}} Y\xrightarrow g \hat W
\]
This is a maximal probability of error code, in comparison to
average probability of error codes.
- If \(f\) is given, the design of \(g\) is essentially \(M\)-ary hypothesis testing.
For \(M>2\) there is no uniquely optimal criterion.
However, if we postulate \(W\sim \mathrm{Unif}[M]\), then
the optimal solution is given by the maximum-likelihood decoder.
Note that the maximum-likelihood decoder tessellates the
decoder-input space according to 1-nearest neighbor.
In information theory, almost all of the work is designing a great distribution.
Nov 6, Channel coding II
- \(M^*(\epsilon)\).
- Define a channel (again); fundamental limit of
finite-blocklength coding, and capacity.
- Weak converse.
Recall our definition of a channel last time:
- Channel \(P_{Y|X}: \mathcal X\to \mathcal Y\).
- Encoder \([M]\to \mathcal X\), decoder \(\mathcal Y\to [M]\cup \{e\}\).
Graphically, this can be represented as
\[
W\xrightarrow f X
\xrightarrow {P_{Y|X}}
Y\xrightarrow g \hat W
\]
We need to put some assumptions on \(W\),
so let’s take \(W\sim \mathrm{Unif}[M]\).
Overwhelmingly, the encoder
\(P_{X|W}\) will be deterministic, and
most of the time \(P_{\hat W|Y}\) is deterministic;
one does sometimes need to randomize when minimizing the
minimax risk, however.
Definition 12.1 (error criteria) Consider two types of the probability of error reminiscent
of the case for risk in statistical estimation:
- Average probability of error: \(P_{e, \mathrm{avg}} = \mathrm{Pr}[\hat W\neq W]\).
- Maximum probability of error:
\(P_{e, \mathrm{max}} = \max_{j\in [M]} \mathrm{Pr}[\hat W\neq j|W=j]\).
We say that \((f, g)\) is an \((M, \epsilon)\)-code
if \(P_{e, \mathrm{avg}}\leq \epsilon\). It is an
\((M, \epsilon)_{\mathrm{max}}\) code if \(P_{e, \mathrm{max}}\leq \epsilon\).
Definition 12.2 (single-shot fundamental limit) Fixing a channel, \(M^*(\epsilon)\) is the maximum alphabet size
such that there exists a \((M, \epsilon)\)-code.
The max-error limit \(M^*_{\mathrm{max}}\) is defined similarly.
Here \(M^*(\epsilon)\) is related to the number of messages one can
reliably separate. In general, this is a computationally
intractable quantity, but we will be able to bound it quite tightly.
Recall from last time that, given \(f\), the optimal \(g\)
which minimizes \(P_{e, \mathrm{avg}}\) is the MLE. This follows
from the MLE coinciding with MAP under the uniform prior.
- If \(f, g\) are deterministic, then \(f(j)=c_j\in \mathcal X\)
is a codeword. An encoder can equivalently be viewed
as a collection of codewords.
- The pre-images \(g^{-1}(j)=D_j\subset \mathcal Y\) are
the decoding regions.
When \(\mathcal X=\mathcal Y\), one should think of codewords as
little singleton “points” in space, while the decoding
regions partition the space so as to best separate the
codewords, making decoding robust to noise perturbations.
Putting two codewords in space is equivalent to binary
hypothesis testing; as we require smaller \(\epsilon\),
\(M^*(\epsilon)\) shrinks.
Theorem 12.9 (Fano's bound) \(\log M^*_{\mathrm{max}}(\epsilon) \leq \log M^*(\epsilon) \leq \dfrac{\sup_X I(X; Y) + \log 2}{1-\epsilon}\).
Proof: Again, to prove impossibility bounds
begin with the easy case.
Recall \(W\to X\to Y\to \hat W\), consider
\(Q_{Y|X}=Q_Y\) which does not look at its input, then
\[
Q[W=\hat W] = \dfrac 1 M \sum_j Q[\hat W = j] = \dfrac 1 M
\]
Under the \(Q\)-channel, then \(1-\epsilon_Q = \dfrac 1 M\).
Apply DPI to lift the statement about \(Q\)-channel to
\(P\)-channel, process \((W, X, Y, \hat W)\) with \(1_{W=\hat W}\)
to obtain \(\mathrm{Ber}(1/M)\) and \(\mathrm{Ber}(1-\epsilon)\). Then
\[
D(P_{WXY\hat W} \| Q_{WXY\hat W})
\geq d(1-\epsilon \| 1/M)
\]
Noting the Markov structure, apply the chain rule to obtain
\[\begin{align}
D(P_{WXY\hat W} \| Q_{WXY\hat W})
&= D(P_W\|Q_W)_{=0} + D(P_{X|W} \| Q_{X|W} | P_W)_{=0}
+ D(P_{Y|X} \| Q_Y | P_X) + D(P_{\hat W|Y} \| Q_{\hat W|Y} | P_Y)_{=0}
\end{align}\]
The zeroed terms are due to us only replacing the \(X\to Y\) part
of the Markov chain, then \(\forall Q\), we obtain
\[
(1-\epsilon)\log M \leq D(P_{Y|X} \| Q_Y | P_X) + \log 2
\]
Take \(\inf\) over \(Q_Y\) and apply the golden formula to
obtain mutual information. Note that \(P_X\) (the pushforward of \(\mathrm{Unif}[M]\) under \(f\))
is usually hard to compute, so take \(\sup_X\) to obtain the
desired inequality.
In information theory, a second definition of a
channel is a triplet consisting of
(1) (single-letter) alphabet \(\mathcal A\),
(2) output alphabet \(\mathcal B\), with (3)
a family of Markov kernels \(P_{B^n|A^n}\) for each \(n\geq 1\).
In this sense a channel is interpreted as
a sequence of Markov kernels, and \(n\)
is the block length.
Two examples of additive channels:
1. Gaussian channel: \(\mathcal A=\mathcal B=\mathbb R\), and
\(P_{B^n|A^n=a^n} = \mathcal N(a^n, \sigma^2 I_n)\).
2. Binary symmetric channel \(\mathrm{BSC}_\delta\):
\(\mathcal A=\mathcal B=\{0, 1\}\), and
\(P_{B^n | A^n}(b^n|a^n) = \delta^d(1-\delta)^{n-d}\),
where \(d=d_M(a^n, b^n)\) is the Hamming distance between
the two strings.
- Equivalently, \(B^n = A^n + Z^n\mod 2, \quad Z_j\sim \mathrm{Ber}(\delta)\) i.i.d.
More generally, a discrete stationary memoryless channel (DMC)
is just a tensorized channel which acts i.i.d. on the input components.
Definition 12.3 (fundamental limit of channel coding, capacity) For any channel, an \((M, \epsilon)\) code
for \(P_{B^n|A^n}\) is called a \((M, n, \epsilon)\) code.
The capacity \(C\) of the channel is defined as
\[\begin{align}
C_\epsilon &= \liminf_{n\to \infty} \dfrac 1 n \log M^*(n, \epsilon), \quad
C = \lim_{\epsilon \to 0^+} C_\epsilon.
\end{align}\]
Proposition 12.7 (almost no difference between max v.s. average) \(\forall \tau \in (0, 1)\) we have
\[
(1-\tau)M^*(n, \tau \epsilon) \leq M^*_{\mathrm{max}}(n, \epsilon)
\leq M^*(n, \epsilon)
\]
Proof: The second inequality is trivial. For the first,
take an \((n, M, \tilde \epsilon)\) code and define
\(\lambda_j = \mathrm{Pr}[\hat W\neq j|W=j]\), then
\[
\dfrac 1 M \sum_j \lambda_j \leq \tilde \epsilon \implies
\text{ number of $j$ s.t. } \lambda_j>\tilde \epsilon/\tau
\text{ is less than } \tau M
\]
For every average-prob error code, we can apply the same idea
behind Markov’s inequality to select a max-probability sub-code.
Corollary 12.2 \(C_\epsilon \geq C_\epsilon^{\mathrm{max}} \geq \lim_{\delta \to 0^+} C_{\epsilon - \delta} \implies C^{\mathrm{max}} = C\).
One perspective on \(C\) is that it is the maximal rate of
asymptotically reliable communication. In other words,
it is the supremum over all rates \(R\) such that there exists a
sequence of \(\delta_n, \epsilon_n\to 0\),
\((n, e^{n(R-\delta_n)}, \epsilon_n)\)-codes.
Proof: exercise??!
Definition 12.4 (information capacity) The information capacity \(C^{(i)}\) of a channel is
\[
C^{(i)} = \liminf_{n\to \infty} \dfrac 1 n
\sup_{P_{A^n}} I(A^n; B^n)
\]
Compared to the actual capacity
(integer programming problem), this is easy!
Proposition 12.8 (capacity of BSC) For \(\mathrm{BSC}_\delta\), we obtain \(C^{(i)} = \log 2 - h(\delta)\).
Proof: upper-bound by tensorization of mutual information.
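The single-letter value can also be confirmed by brute force over \(\mathrm{Ber}(p)\) inputs, using \(I(X;Y) = H(Y) - H(Y|X)\) (a sketch in bits; `mi_bsc` and `h2` are hypothetical helper names):

```python
from math import log2

# Grid-search sup over Ber(p) inputs of I(X;Y) for BSC_delta;
# the maximum, at p = 1/2, should match 1 - h2(delta) bits.
def h2(x):
    return 0.0 if x in (0.0, 1.0) else -x * log2(x) - (1 - x) * log2(1 - x)

def mi_bsc(p, delta):
    q = p * (1 - delta) + (1 - p) * delta    # Pr[Y = 1]
    return h2(q) - h2(delta)                 # I(X;Y) = H(Y) - H(Y|X)

delta = 0.11
best = max(mi_bsc(p / 1000, delta) for p in range(1001))
print(best, 1 - h2(delta))  # both ≈ 0.5 bits
```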
Theorem 12.10 (relation between information and actual capacity) Fano’s bound implies
\(C_\epsilon \leq \dfrac{C^{(i)}}{1-\epsilon} \implies C\leq C^{(i)}\).
For the BSC channel, the optimal decoder \(g_{\mathrm{MLE}}\)
just looks for the
codeword which is the fewest number of flips away from the input.
Approaching this from a different perspective, the average
number of flips for transmitting \(c_1\in \{0, 1\}^n\) is \(n\delta\);
concentration becomes sharp as \(n\to \infty\). This motivates
the construction of the bounded-distance decoder
\[
g_{\mathrm{BD}}(y) = \begin{cases}
j & \text{if $c_j$ is the unique codeword in $B_r(y)$} \\
e & \text{otherwise}
\end{cases}
\]
Shannon had the idea to construct random codewords
\(c_1, \dots, c_M\) uniformly from the Hamming cube. Then
\[\begin{align}
\mathbb E\epsilon_2(c_1, \dots, c_M)
&\leq \sum_{j=2}^M \mathrm{Pr}[d_M(c_j, Y) \leq r | X=c_1]
= (M-1) \mathrm{Pr}[d_M(c_2, Y) \leq r] \\
&= \dfrac{M-1}{2^n} | B_r(0)|
\end{align}\]
Note crucially that \(C_j \perp\!\!\!\perp Y\). The second
error rate \(\epsilon_2\) vanishes if \(M\ll 2^{nC^{(i)}}\).
This implies, for BSC, \(C = C^{(i)} = \log 2 - h(\delta)\).
Nov 18, Channel Coding III
Agenda:
- Finish DT bound.
- DT bound with input constraints.
- General channels with input constraints.
- Capacity of Gaussian channels.
TODO for notes:
- Equivalence between maximum and average error.
- DT bound.
Other questions:
- More intuition for the strong v.s. weak converse.
- Explain \(\liminf\), also why is \(C_0\) a combinatorial problem,
different from \(C_{\to 0}\).
- Wasserstein distance.
Pset questions:
- Pointwise maximization.
- Existence of the extension.
- Idea for applying the meta-converse.
Recall the information density:
given a joint distribution \(P_{XY}\), the information
density is \(i(x; y) = \log \dfrac{P_{Y|X}(y|x)}{P_Y(y)}\).
For the BSC, the channel only gives half the required
quantities for computing the information density (namely \(P_{Y^n|X^n}\)).
We still need the input distribution. Take \(P_{X^n}\) to be uniform
on the Hamming cube, then
\[
i(X^n; Y^n) = \log \dfrac{P_{Y^n|X^n}}{P_{Y^n}}
= \log \dfrac{\delta^{d_M} \bar \delta^{n-d_M}}{2^{-n}}
= n\log (2\bar \delta) - d_M \log \dfrac{\bar \delta}{\delta}
\]
This makes explicit that the ML decoder is equivalent
to minimizing the Hamming distance \(d_M\).
- For memoryless channels, the information density tensorizes.
- This allows us to apply LLN (or large-deviations theory).
We can generalize beyond stationary memoryless channels,
since information stability (tensorization and concentration about the mean)
is sufficient to prove capacity.
Theorem 12.11 (random coding DT bound) \(\forall P_{Y|X}\) and \(P_X, M\geq 1\), there exists \((M, \epsilon)\)-code
such that
\[
\epsilon \leq \mathbb E\exp\left\{-\left[i(X; Y) - \log \dfrac{M-1}{2}\right]_+\right\}
\]
Proof: Treating the codebook as a random variable, a decoding error of message \(j\)
can happen if:
- \(i(C_j; Y) \leq \gamma\): the info density of the true codeword is not enough.
In BSC, this is probability that Hamming distance between true message and
received message is larger than the threshold.
- There exists another codeword which exceeds \(\gamma\) prior to \(j\). Since we
are taking the average over random codebooks, this terms becomes tractable.
In BSC, this corresponds to some other codewords existing in the thresholded ball.
Proceeding to bound the first term,
\[
P_e \leq \sum_{x, y} \left[ P_{XY}(x, y)\, 1 \left\{
P_{XY} \leq e^\gamma P_{\bar X Y}
\right\} + \dfrac{M-1}{2}\, P_{\bar XY}(x, y)\, 1\left\{
P_{XY} > e^\gamma P_{\bar XY}
\right\} \right]
\]
To minimize this, the optimal selection is \(\gamma = \log \dfrac{M-1}{2}\);
this construction of events selects the minimum of the two contributions.
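For the BSC with uniform inputs, the DT bound reduces to a one-dimensional sum over the Hamming distance \(d\sim\mathrm{Bin}(n,\delta)\), so it can be evaluated exactly (a sketch; the rates and \(\delta\) below are arbitrary choices):

```python
from math import comb, exp, log

# DT bound for BSC_delta with uniform inputs, using
# i = n*log(2*(1-delta)) - d*log((1-delta)/delta), d ~ Bin(n, delta).
def dt_bound(n, delta, M):
    thr = log((M - 1) / 2)
    eps = 0.0
    for d in range(n + 1):
        p_d = comb(n, d) * delta**d * (1 - delta) ** (n - d)
        i = n * log(2 * (1 - delta)) - d * log((1 - delta) / delta)
        eps += p_d * exp(-max(0.0, i - thr))
    return eps

n, delta = 500, 0.11          # capacity ≈ 0.5 bit per channel use
for R in (0.3, 0.4, 0.45):
    M = 2 ** round(R * n)     # rate-R code size
    print(R, dt_bound(n, delta, M))
# the bound shrinks as the rate backs away from capacity
```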
Perspective on information theory: when proving results, complexity does not
matter. When results are significant, someone will find a computationally feasible way : )
Coding with input constraints
The capacity of the AWGN channel without any input constraint is \(\infty\),
since we can space inputs far enough apart that
the added noise causes negligible overlap.
The constraint \(\|C\|^2 \leq nP\) yields finite capacity \(C=\dfrac 1 2 \log(1+P)\).
We wish to find a codebook \(c_1, \dots, c_M\) such that \(P_e(C) \leq \epsilon\),
and that \(\|c_j\|^2 \leq nP\). In short, we have to place our codewords
in a sphere of radius \(\sqrt{nP}\). Each codeword gets smeared over
a noise ball of radius \(\|\mathcal N(0, I_n)\| \approx \sqrt n\).
The output will land with high probability within the sphere with radius
\[
\mathbb E\|C + \mathcal N(0, I_n)\|^2 \approx n(1+P)
\]
How many balls \(\sqrt n B\) can we pack inside \(\sqrt{n(1+P)}B\)?
An upper bound is \((1+P)^{n/2}\).
Definition 12.5 (channel with constraints) Given a channel and a cost \(c:\mathcal X\to \mathbb R_+\),
an encoder-decoder \((f, g)\) is an \((n, M, \epsilon, P)\) code if
\[
\dfrac 1 M \sum_j \mathrm{Pr}[g(Y)\neq j|X=f(j)] \leq \epsilon, \quad
\forall j, c_n(f(j)) \leq nP
\]
Note that we are averaging the error over messages and any randomness of the encoder,
but the cost constraint must be obeyed by every codeword.
For AWGN, \(c(x)=x^2\). Further, for memoryless channels, the constrained
capacity is
\[
C^{(I)}(P) = \sup_{\mathbb Ec(X_1)\leq P} I(X_1; Y_1)
\]
Proposition 12.9 (properties of capacity-cost function)
- \(P\mapsto C^{(I)}(P)\) is concave, increasing, and continuous on interior.
- \(C(P)\leq C^{(I)}(P)\).
Proof: Concavity follows by averaging the capacity-achieving distributions.
The second part follows from applying the weak converse.
Take an \((n, M, \epsilon, P)\)-code and repeat the weak-converse (Fano) argument under the cost constraint.
Nov 20. Quantization
Agenda:
- Finish capacity of Gaussian channel with cost
- Quantization: scalar case
- Definition of vector case
Channel coding with constraint:
\[
C^{(I)}(P) = \liminf_{n\to \infty} \dfrac 1 n \sup_{
\mathbb E\left[\frac 1 n \sum_t c(X_t)\right] \leq P
} I(X^n; Y^n)
\]
By (1) the concavity of the capacity-cost function, and
(2) tensorization of MI, we can tensorize both the extremized
quantity as well as the condition.
Due to the concavity of the capacity-cost function,
maximum communication per unit cost is achieved by having vanishing rate
since \(C^{(I)}(P) / P\) is maximized as \(P\to 0\).
Theorem 12.12 (Fano's inequality under constraint) \(C(P) \leq C^{(I)}(P)\).
The information stability criterion
\[
\dfrac 1 n i(X^n; Y^n)\xrightarrow{\mathbb P} C^{(I)}(P)
\]
is the analogue of the law of large numbers needed
for non-stationary channels.
For AWGN, we have \(C^{(I)}(P) = \dfrac 1 2 \log(1+P)\),
and recalling our previous discussion,
\[
\sup_P \dfrac{\frac 1 2 \log(1+P)}{P} = \dfrac{\log e}{2} \,\dfrac{\mathrm{bit}}{\mathrm{Joule}}
\]
(equivalently, a minimum energy per bit of \(\ln 2 \approx -1.59\) dB)
is a fundamental limit in deep-space communication.
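Numerically, \(C(P)/P\) is indeed maximized in the \(P\to 0\) limit; a quick check (function name is mine):

```python
import math

def bits_per_unit_cost(P: float) -> float:
    # AWGN capacity per unit cost: C(P)/P with C(P) = 0.5*log2(1+P) bits/channel use.
    return 0.5 * math.log2(1 + P) / P

# Monotonically improves as P decreases; the P -> 0 limit is log2(e)/2 ~ 0.7213.
vals = [bits_per_unit_cost(P) for P in (10.0, 1.0, 0.1, 1e-4)]
assert vals == sorted(vals)
limit = math.log2(math.e) / 2
assert abs(bits_per_unit_cost(1e-8) - limit) < 1e-6
# The corresponding minimum energy per bit is ln 2, i.e. about -1.59 dB:
eb_min_db = 10 * math.log10(math.log(2))
assert abs(eb_min_db + 1.59) < 0.01
```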
Water-filling
Consider a non-stationary memoryless Gaussian channel
\[
Y_t = X_t + Z_t, \quad Z_t \sim \mathcal N(0, \sigma_t^2)
\]
Using tensorization, we obtain
\[
I(X^n; Y^n) \leq \sum_t I(X_t; X_t + Z_t)
\]
with inequality saturated by independent Gaussians.
Fixing the power allocation \(P_t\) for each letter, we obtain
\[
I(X^n; Y^n) \leq \dfrac 1 2 \sum_{t=1}^n \log \left(
1 + \dfrac{P_t}{\sigma_t^2}
\right) \text{ subject to } \sum P_t \leq n P
\]
This is a convex problem. Using Lagrange multipliers,
we obtain the water-filling solution \(P_t = |T - \sigma_t^2|_+\),
where the water level \(T\) is chosen so that \(\sum_t P_t = nP\).
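The water-filling allocation \(P_t = |T - \sigma_t^2|_+\) is easy to compute by bisecting on the water level \(T\); a sketch, with names of my own choosing:

```python
def water_filling(sigma2, total_power, tol=1e-10):
    """Powers P_t = max(T - sigma_t^2, 0) with water level T set so sum(P_t) = total_power."""
    lo, hi = 0.0, max(sigma2) + total_power  # the water level T lies in this range
    while hi - lo > tol:
        T = (lo + hi) / 2
        if sum(max(T - s, 0.0) for s in sigma2) < total_power:
            lo = T
        else:
            hi = T
    T = (lo + hi) / 2
    return [max(T - s, 0.0) for s in sigma2]

sigma2 = [0.5, 1.0, 2.0, 4.0]
P = water_filling(sigma2, total_power=3.0)
assert abs(sum(P) - 3.0) < 1e-6
# Noisier sub-channels receive less power; bad enough ones receive none:
assert P[0] > P[1] > P[2] and P[3] == 0.0
```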
Consider a non-white Gaussian noise channel
\(Y_t = X_t + Z_t\), where \(Z_1, \dots, Z_n, \dots\) is a
zero-mean stationary Gaussian process, which means:
- \(\mathbb EZ_t^2 = \sigma^2, \forall t\).
- \(\mathbb E[Z_t Z_{t+k}] = \psi(k)\) is a function of \(k\) only.
The auto-correlation can be represented as a Fourier transform
\[
\psi(k) = \dfrac 1 {2\pi} \int_{-\pi}^\pi \cos(k\omega) f_z(\omega)\, d\omega
\]
In this case \(f_z\) is the power spectral density (PSD) of noise.
To compute the capacity, note that we cannot apply tensorization
anymore since the channel is not memoryless. The main idea is to
diagonalize the covariance matrix by rotating the input. Effectively,
we whiten the channel noise by jointly transforming the input and channel.
It turns out that the eigenvalues of the covariance matrix
\((\Sigma_n)_{kl} = \psi(k-l)\) have very nice structure, since
the covariance matrix is a Toeplitz matrix.
In short, we obtain
\[
C(P) = \dfrac 1 {4\pi} \int_{-\pi}^\pi \log_+ \dfrac{T}{f_z(\omega)}\, d\omega, \quad
\dfrac 1 {2\pi}\int_{-\pi}^\pi |T - f_z(\omega)|_+\, d\omega = P
\]
Metric entropy
Consider a Hamming game:
- Player A writes down 100 i.i.d. uniform bits.
- Player B looks at the 100 bits, then writes down 50 bits.
The original 100 bits are erased.
- Player C looks at the written 50 bits, then tries to reconstruct the 100 bits.
Approaches to solving this problem:
- Record 50 bits, this yields 50% error.
- Compress the 100 bits.
- The optimal answer is 11% error.
The key idea is that, due to the asymptotic equipartition property, we can
restrict attention to typical sequences. The main idea is to cover the
Hamming cube \(\{0, 1\}^{100}\) using \(2^{50}\) Hamming balls with
radius as small as possible. The decoder then looks up the indexed ball
and outputs its center.
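The "11%" answer can be recovered from the rate-distortion function of a fair bit, \(R(D) = 1 - h(D)\) (proved later in these notes): with 50 bits per 100 source bits, the best distortion solves \(1 - h(D) = 1/2\). A quick bisection assuming that formula:

```python
import math

def h(p: float) -> float:
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Solve 1 - h(D) = 0.5 on (0, 1/2); h is increasing there, so bisect.
lo, hi = 1e-9, 0.5
while hi - lo > 1e-12:
    mid = (lo + hi) / 2
    if 1 - h(mid) > 0.5:
        lo = mid  # mid too small: rate still above 1/2
    else:
        hi = mid
print(round(lo, 3))  # 0.11 -- the ~11% error answer
```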
Nov 25: Rate-distortion theorem
Agenda:
- Scalar quantization.
- Define \(R(D)\).
- Random coding bound.
- (Next time): asymptotics.
Scalar quantization
Two changes of perspective in transmitting continuous signals:
- PCM (A. Reeves): Binning of continuous numbers.
- Sampling (Nyquist): without loss of generality
we can consider discrete signals, i.e. discretizing the
time axis.
Quantization is the replacement of \(\mathbb R\)-valued things with discrete things.
Given a signal \(X_1^n = [X_1, \dots, X_n]\in \mathbb R^n\), a scalar quantizer maps each
coordinate to one of \(M\) reproduction points: \(f_S(X_1^n)=\hat X_1^n \in \{c_1, \dots, c_M\}^n\subset \mathbb R^n\). We wish to minimize
\[
\mathbb E|X - \hat X|^2
\]
- First idea: split into quantiles; this maximizes the entropy of the codeword.
However, this turns out not to be the best idea: why?
- Second idea: uniform quantization, i.e. split the compact domain
into equal-length intervals.
For uniform quantization on \([-A, A]\), the limit \(M\to \infty\) makes the conditional
distribution of \(X\) given \(X\in \Delta_j\) roughly uniform. With \(|\Delta_j| = \dfrac{2A}{M}, M=2^R\), we obtain
\[\begin{align}
\mathbb E|X - \hat X|^2
&= \sum_j \mathrm{Pr}[X\in \Delta_j] \mathbb E_{|X\in \Delta_j} |X - \hat X|^2 \\
&= \sum_{j=1}^M \mathrm{Pr}[X\in \Delta_j] \dfrac{A^2} {3M^2}
= \dfrac{A^2}{3M^2} = \dfrac{A^2}{3} 2^{-2R}
\end{align}\]
An extra bit of quantization rate yields a 4-fold drop (6 dB) in the mean-squared error.
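A numerical check of the \(\frac{A^2}{3}2^{-2R}\) law for a midpoint uniform quantizer on \(\mathrm{Unif}[-A, A]\) (a sketch; function and parameter names are mine):

```python
import math

def uniform_quantizer_mse(A: float, R: int, grid: int = 100_000) -> float:
    """MSE of the M = 2^R level midpoint quantizer for X ~ Unif[-A, A] (Riemann sum)."""
    M = 2 ** R
    width = 2 * A / M
    dx = 2 * A / grid
    total = 0.0
    for i in range(grid):
        x = -A + (i + 0.5) * dx
        j = min(int((x + A) / width), M - 1)  # cell index containing x
        c = -A + (j + 0.5) * width            # midpoint reproduction level
        total += (x - c) ** 2 * dx / (2 * A)  # weight by the uniform density 1/(2A)
    return total

A, R = 1.0, 4
mse = uniform_quantizer_mse(A, R)
assert math.isclose(mse, A**2 / 3 * 4.0**(-R), rel_tol=1e-3)
# One extra bit divides the MSE by 4, i.e. a 6 dB gain:
assert math.isclose(uniform_quantizer_mse(A, R + 1) / mse, 0.25, rel_tol=1e-2)
```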
Panter-Dite ’51 asked: what is the best possible way to arrange the
reproduction points \(c_j\)? The exact discrete problem is hard:
even for \(M=4\) and a Gaussian source no closed-form solution is known.
- Fixing the reproduction points \(c_j\), the optimal partition is easy: the nearest-neighbor rule.
- Fixing the cells \(\Delta_j\), the optimal \(c_j\) are the centroids, i.e. conditional means
\(c_j = \mathbb E[X \mid X\in \Delta_j]\).
- Note that after recomputing the \(c_j\)'s, the original cells \(\Delta_j\)
are no longer optimal. This leads to Lloyd-Max:
start with \(c_j\)'s, compute the \(\Delta_j\)'s whose boundaries are the midpoints
between adjacent \(c_j\)'s, recompute the \(c_j\)'s as centroids, and repeat.
- Lloyd-Max converges to a CVT (Centroidal Voronoi Tessellation), not necessarily
the optimal one; however, the optimal quantizer is always a CVT.
- One key idea is to consider the point density \(\lambda\geq 0\) with
\(\int \lambda = 1\).
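The Lloyd-Max iteration described above can be sketched on a discretized density (the function and parameters are my own; for \(\mathcal N(0,1)\) and \(M=4\) it should approach the known optimal levels \(\approx \pm 0.4528, \pm 1.510\)):

```python
import math

def lloyd_max(pdf, lo, hi, M, iters=200, grid=4000):
    """Alternate midpoint-boundary / centroid steps on a density discretized to `grid` atoms."""
    dx = (hi - lo) / grid
    xs = [lo + (i + 0.5) * dx for i in range(grid)]
    ws = [pdf(x) * dx for x in xs]
    c = [lo + (j + 0.5) * (hi - lo) / M for j in range(M)]  # initial levels
    for _ in range(iters):
        b = [(c[j] + c[j + 1]) / 2 for j in range(M - 1)]   # boundaries = midpoints
        num, den = [0.0] * M, [0.0] * M
        for x, w in zip(xs, ws):
            j = sum(x > t for t in b)                       # nearest-level cell index
            num[j] += w * x
            den[j] += w
        c = [num[j] / den[j] for j in range(M)]             # centroids = conditional means
    return c

phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
levels = lloyd_max(phi, -6.0, 6.0, 4)
assert abs(levels[2] - 0.4528) < 0.02 and abs(levels[3] - 1.510) < 0.02
```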
In the continuum limit, fixing a point density \(\lambda(x)\),
let's place the \(c_j\)'s according to \(\lambda\) via the following equations:
\[\begin{align}
\int_{-\infty}^{c_j} \lambda(x)\, dx
&= \dfrac j M \\
|\Delta_j| \cdot \lambda(c_j)
&\approx \dfrac 1 M \implies |\Delta_j| \approx \dfrac 1 {M \lambda(c_j)} \\
\mathbb E|X - \hat X|^2
&=\sum_j \mathrm{Pr}[X\in \Delta_j]\, \mathbb E_{|X\in \Delta_j} |X - c_j|^2 \\
&\approx \sum_j \mathrm{Pr}[X \in \Delta_j] \dfrac{|\Delta_j|^2}{12}
= \sum_j \mathrm{Pr}[X \in \Delta_j] \dfrac 1 {12M^2} \dfrac 1 {\lambda^2(c_j)} \\
&\approx \dfrac 1 {12M^2} \int p_X(x) \dfrac 1 {\lambda^2(x)}\, dx
\end{align}\]
This variational problem can be solved to yield \(\lambda^*(x) = c \cdot p(x)^{1/3}\).
This generalizes to non-quadratic losses!
In other words, to minimize the mean-squared error, the point density should be
proportional to \(p_X^{1/3}\), i.e. the quantile (entropy-maximizing) construction
should be applied to the normalized \(p_X^{1/3}\) rather than to \(p_X\) itself. It turns out that
\[
D_{\mathrm{scalar}}(R) = \dfrac{\pi \sqrt 3}{2} \sigma^2 2^{-2R}, \quad R\gg 1
\]
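The \(\frac{\pi\sqrt 3}{2}\) constant can be checked directly: plugging the optimal \(\lambda^*\propto p^{1/3}\) back into the distortion integral gives \(D \approx \frac 1 {12M^2}\left(\int p^{1/3}\right)^3\), and for \(p = \mathcal N(0,1)\) this constant should equal \(\sqrt 3\,\pi/2\):

```python
import math

# High-resolution distortion with the optimal point density lambda* ∝ p^(1/3):
#   D ≈ (1/12) (∫ p^(1/3) dx)^3 M^(-2).
# For the standard normal, (1/12)(∫ p^(1/3))^3 should be sqrt(3)*pi/2 ≈ 2.72.
dx = 1e-3
xs = (-20 + (i + 0.5) * dx for i in range(40_000))
integral = sum((math.exp(-x * x / 2) / math.sqrt(2 * math.pi)) ** (1 / 3) for x in xs) * dx
constant = integral ** 3 / 12
assert math.isclose(constant, math.sqrt(3) * math.pi / 2, rel_tol=1e-4)
print(round(constant, 3))  # 2.721
```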
Quantization for quantum systems?
Vector quantization
By vector quantization, we mean jointly quantizing i.i.d. \(X_j\)'s. For i.i.d. Gaussians,
this gains a factor \(\approx 1.65\) in RMSE over optimal scalar quantization, or 0.72 bit/sample
(i.e. \(R\mapsto R+0.72\) removes the \(\pi\sqrt 3/2\) constant):
\[
D(R) = \sigma^2 2^{-2R}, \quad \forall R>0
\]
Given \(X^n\sim P_X\) i.i.d. on \(\mathcal A\) with reconstruction alphabet \(\hat{\mathcal A}\),
define a distortion metric \(d(x, \hat x):\mathcal A\times \hat {\mathcal A}\to \bar{\mathbb R}\).
The i.i.d. assumption is not as important, but the separability of the distortion metric is:
\[
d(x^n, \hat x^n) = \dfrac 1 n \sum_{t=1}^n d(x_t, \hat x_t)
\]
Define an \((n, M, D)\)-code (or vector quantizer) to be the pair \((f, g)\) with
\[
X^n\xrightarrow f W\in [M] \xrightarrow g \hat X^n\in \hat{\mathcal A}
\]
such that \(\mathbb E[d(X^n, \hat X^n)] \leq D\). Every scalar quantizer can be tensorized
into a vector quantizer. Similarly, define the fundamental limit
\[
M^*(n, D) = \inf\{M : \exists (n, M, D)\text{-code}\}
\]
Note that \(\log M^*(n, D)\) is subadditive:
\[
\log M^*(n_1+n_2, D)\leq \log M^*(n_1, D) + \log M^*(n_2, D)
\]
The rate-distortion function is defined as the normalized
limit (which exists by Fekete's subadditivity lemma):
\[
R(D) = \lim_{n\to \infty} \dfrac 1 n \log M^*(n, D)
\]
For the non-separable metric \(d_n(X^n, \hat X^n) = 1_{\{X^n\neq \hat X^n\}}\)
we recover lossless compression. The information version of the rate-distortion function is
\[\begin{align}
R^{(I)}(D)
&= \lim_{n\to \infty} \dfrac 1 n
\inf_{P_{\hat X^n|X^n} | \mathbb Ed_n(X^n, \hat X^n) \leq D}
I(X^n; \hat X^n)
\end{align}\]
For i.i.d. source and separable distortion, applying tensorization yields
\[
R^{(I)}(D) = \inf_{\mathbb Ed(X_1, \hat X_1)\leq D} I(X_1; \hat X_1)
\]
Moreover, the map \(D\mapsto R^{(I)}(D)\) is convex (follows by mixing the
rate-achieving channels and applying convexity of MI).
Theorem 12.13 For every \((n, M, D)\)-code, \(\log M \geq n R^{(I)}(D)\).
Proof: By definition, \(I(X^n; \hat X^n) \geq n R^{(I)}(D)\);
on the other hand, the MI is upper-bounded by the
entropy of the bottleneck: \(I(X^n; \hat X^n)\leq H(W)\leq \log M\).
The matching achievability bound (analogous to Shannon's noisy channel coding theorem)
is obtained by a similar random coding technique.
Dec 4: Density estimation
Agenda:
- Review of metric entropy.
- \(\log N \propto 1/\epsilon\).
- Density estimation.
- Yang-Barron method.
- Lower bound.
Rule of thumb: the minimax rate of a class
\[
R_n^* \approx \inf_\epsilon \left\{\epsilon^2 + \dfrac 1 n \log N(\epsilon)\right\}
\]
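For instance, plugging \(\log N(\epsilon) = (1/\epsilon)^{d/\beta}\) into this rule of thumb and balancing the two terms recovers the nonparametric rate \(n^{-2\beta/(2\beta+d)}\); a numeric check of the exponent (names are mine):

```python
import math

def rule_of_thumb_rate(n: float, beta: float, d: float) -> float:
    """Minimize eps^2 + (1/n)(1/eps)^(d/beta) over eps by ternary search (f is convex)."""
    f = lambda e: e ** 2 + (1 / n) * e ** (-d / beta)
    lo, hi = 1e-9, 1.0
    for _ in range(300):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return f((lo + hi) / 2)

beta, d = 2.0, 1.0
# Estimated decay exponent between n = 1e6 and n = 1e8:
slope = -math.log(rule_of_thumb_rate(1e8, beta, d) / rule_of_thumb_rate(1e6, beta, d)) / math.log(100)
assert abs(slope - 2 * beta / (2 * beta + d)) < 0.01  # 2*beta/(2*beta+d) = 0.8
```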
Review of metric entropy
Given \(\Theta\subset V\), a subset of a metric space, define the covering number
\(N(\epsilon; \Theta)\) as the minimum number of \(\epsilon\)-balls
which cover \(\Theta\); its logarithm \(\log N\) is the metric entropy.
This characterizes how “compact” the space \(\Theta\) is.
Similarly, the packing number \(M(\epsilon; \Theta)\) is the maximum number
of pairwise \(\epsilon\)-separated points contained in \(\Theta\). We have shown that
\[
M(\epsilon) \geq N(\epsilon) \geq M(2\epsilon)
\]
We also considered the class \(\mathcal F_\beta^d(C)\) of
pdf's whose derivative \(f^{(\lfloor \beta\rfloor)}\) is \((\beta-\lfloor \beta\rfloor)\)-Hölder.
An \(\alpha\)-Hölder function satisfies \(|g(x) - g(y)| \leq C |x-y|^\alpha\).
Foundational theorem: for \(L_p\)-distances, \(p\in [1, \infty]\),
\(\log N(\epsilon, \mathcal F_\beta^d) \asymp (1/\epsilon)^{d/\beta}\).
One immediate implication is the proof (of non-existence) for
Hilbert's 13th problem restricted to \(1\)-Lipschitz functions.
Consider samples \(X_1, \dots, X_n\) drawn i.i.d. from
\(f\in \mathcal F_\beta^d\) on \([0, 1]^d\), i.e. the collection of
\(\lfloor \beta\rfloor\)-times differentiable densities whose top derivative is \((\beta - \lfloor \beta\rfloor)\)-Hölder.
Define the minimax rate to be
\[
R_{n, \mathrm{dist}}^*(\mathcal F_\beta^d)
= \inf_{\hat f_n:X^n\to \mathcal F_\beta^d} \sup_{f\in \mathcal F_\beta^d}
\mathbb E\, \mathrm{dist}(\hat f_n, f)
\]
Note that the \(L_{p\neq 1}\) distances are not statistically natural,
in that a homogeneous rescaling of the space changes the distance.
The foundational result in nonparametric statistics is as follows,
\[
R^*_{n, L_p} \asymp n^{-\beta/(2\beta + d)}, \qquad
R^*_{n, L_\infty} \asymp \left(\dfrac{\log n}{n}\right)^{\beta/(2\beta+d)}
\]
Today we will prove the case \(p=2\). Note that as \(\beta\to \infty\),
the rate approaches \(n^{-1/2}\), the parametric rate!
Proceeding to the proof, define the contracted class
\[
\mathcal F_\beta' = \dfrac 1 2 U + \dfrac 1 2 \mathcal F_\beta,
\quad U = \mathrm{Unif}([0, 1]^d)
\]
We first work with a contracted class with common support.
Lemma 12.3 \(R_{n, L_p}^*(\mathcal F_\beta')\) is the same as that
of \(\mathcal F_\beta\) up to a factor of \(2\).
Proof: One direction is trivial:
\(\mathcal F_\beta'\subset \mathcal F_\beta \implies R_n^*(\mathcal F_\beta') \leq R_n^*(\mathcal F_\beta)\). Next, given data from \(\mathcal F_\beta\),
use the estimator achieving \(R_{n, L_p}^*(\mathcal F_\beta')\) on
the mixture with \(U\) (which belongs to \(\mathcal F_\beta'\)) and apply
Markov's inequality.
Usually, the bottleneck in proving entropic characterizations is proving
\(H^2 \approx D\). Yang-Barron is one of the first applications
of information theory to obtain a statistical upper bound.
Lemma 12.4 \(\forall f, g\in \mathcal F_\beta'\), up to global constants
the \(H^2, KL, \chi^2, L_2^2\) are equivalent.
Proof:
- First note that \(f, g\geq 1/2\), and there exists an upper bound \(C_{\beta, d}\)
such that \(f, g\leq C_{\beta, d}\).
- This immediately yields \(\chi^2 \asymp L_2^2\) up to universal constants:
\[
\chi^2(f\|g) = \int \dfrac{(f-g)^2}{g} \approx \int (f-g)^2
\]
The same applies to Hellinger with \((\sqrt x-1)^2 \geq c(x-1)^2\), so
\(H^2 \geq c\cdot L_2^2\).
Proceeding to prove the Yang-Barron method, we first use the
online-to-batch conversion: let \(X_j\) be sampled i.i.d. from \(P_\theta, \theta\in \Theta\),
and consider estimators \(\hat \theta_n(X_1, \dots, X_n)\).
- The batch minimax risk \(R_n^*(\Theta) = \inf_{\hat \theta_n} \sup_{\theta\in \Theta} \mathbb E\, l(\theta, \hat \theta_n)\).
- The online (cumulative) risk
\[
C_n^*(\Theta) = \inf_{\hat \theta_1, \dots, \hat \theta_n}
\sup_{\theta \in \Theta} \sum_{t=1}^n \mathbb E\, l(\theta, \hat \theta_t)
\]
Proposition 12.10 \((n+1)R_n^*(\Theta) \leq C_{n+1}(\Theta)\). Moreover,
if \(C_n \approx n^{\alpha>0}\), then \(R_n^*\approx n^{\alpha - 1}\).
The main constructive step in Yang-Barron is to take \(\theta\leftrightarrow f\),
and \(\hat \theta_t \leftrightarrow \hat f_t(\cdot; X_1, \dots, X_{t-1})\).
Take \(l(\theta, \hat \theta_t) = D(f\|\hat f_t)\).
- The batch risk in KL is equivalent (on \(\mathcal F_\beta'\)) to the \(L_2^2\) risk.
- What about the cumulative loss?
\[\begin{align}
C_n(\mathcal F)
&= \inf_{\hat f_0, \dots, \hat f_{n-1}} \sup_f \sum_{t=0}^{n-1} D(f\|\hat f_t) \\
&= \inf \sup D(f^{\otimes n} \| \hat f_0\times \dots \times \hat f_{n-1})
\end{align}\]
But choosing \(\hat f_0, \dots, \hat f_{n-1}\) sequentially is, by the chain rule, the same as choosing an arbitrary joint distribution \(Q_{X^n}\).
The cumulative minimax risk in KL for density estimation is
\[
C_n(\mathcal F) = \inf_{Q_{X^n}} \sup_{f\in \mathcal F} D(f^{\otimes n} \| Q_{X^n})
= \sup_{\pi \in \mathcal P(\mathcal F)} I(f; X^n)
\]
Note that we can upper-bound the capacity using the metric entropy.
The final Yang-Barron estimator is an online Bayes (exponentially weighted) mixture over an \(\epsilon\)-net of the class.
A recap of the proof sketch:
- Reduce \(\mathcal F_\beta\to \mathcal F_\beta'\)
to a manageable class with common support.
- Demonstrate that on \(\mathcal F_\beta'\), KL and \(L^2\)
are equivalent up to universal constants.
- Apply the online-to-batch reduction and recognize the
cumulative risk in terms of metric entropy.
Proceeding to prove the lower bound, given a family \(\Theta\),
let \(M_H(\epsilon, \Theta)\) be the packing number under Hellinger.
The capacity
\[
C_n(\Theta) \geq \min \left[\dfrac{\log e}{2} n \epsilon^2,
\log M_H(\epsilon)\right] - \log 2
\]
To prove the lower bound, take \(\theta_1, \dots, \theta_M\)
such that \(H^2(P_{\theta_j}, P_{\theta_k}) \geq \epsilon^2\).
Recall that there exists a hypothesis test for
\(P_{\theta_j}^{\otimes n}\) against \(P_{\theta_k}^{\otimes n}\) such that
(using the TV-\(H^2\) comparison and tensorization of \(H^2\)):
\[
P_e \leq \exp \left[
-n H^2(P_{\theta_j}, P_{\theta_k})/2
\right]
\]
To conduct \(M\)-ary hypothesis testing, take the decoded value \(\hat \theta\)
to be the winner of all pairwise tests (declaring an error otherwise); then
\[
\mathrm{Pr}[\theta \neq \hat \theta] \leq M e^{-n\epsilon^2/2}
\]
Suppose \(\mathrm{Pr}[\theta \neq \hat \theta]<1/2\),
then use Fano’s inequality to lower-bound the mutual information
\[
C_n \geq I(\theta; X^n) \geq I(\theta; \hat \theta) \geq
\dfrac 1 2 \log M - \log 2
\]
Dec 11: SDPI, Distributed Estimation
Agenda:
- Review SDPI.
- Finish Spiked Wigner.
- Correlation estimation (Gap Hamming problem in TCS)
and distributed mean estimation (machine learning).
Recall the SDPI constant bounds
\[
\dfrac 1 2 \mathrm{diam}_{H^2}(P_{Y|X})\leq \eta_{\mathrm{KL}}(P_{Y|X})
\leq \mathrm{diam}_{H^2}(P_{Y|X})
\]
Also recall the less-noisy definition:
- If for all \(P_{UX}\) we have \(I(U; Y)\leq I(U; Z)\), then \(P_{Y|X}\preceq_{ln}P_{Z|X}\), i.e.
\(Z\) is less noisy than \(Y\).
- Consider another “natural” definition: for all \(P_X\), we have \(I(X; Y) \leq I(X; Z)\);
this says \(P_{Y|X}\preceq_{mc}P_{Z|X}\), i.e. \(Z\) is more capable than \(Y\).
Why is less noisy better than more capable? First, it's stronger, and it additionally
ensures that not only is the whole input \(X\) “better preserved,” but also any
subobject \(U\) of \(X\). Properties of the less-noisy comparison:
- Tensorization: \(P_{Y|X}\preceq_{ln}P_{Z|X} \iff \forall n: P_{Y|X}^{\otimes n}\preceq_{ln}P_{Z|X}^{\otimes n}\).
- \(\eta_{KL}(P_{Y|X})\leq 1-\tau \iff P_{Y|X} \preceq_{ln} EC_\tau\).
Spiked Wigner
Consider a symmetric noise matrix \(W_{jk}=W_{kj} \sim \mathcal N(0, 1)\) and
a “signal” \(X_j\sim \mathrm{Unif}\{\pm 1\}\); we observe
\[
Y = \sqrt{\dfrac\lambda n} XX^T + W
\]
For any estimator \(\hat X(Y)\):
- If \(\lambda \leq 1\), then \(\forall \hat X: \dfrac 1 n \mathbb E|\hat X^TX|\to 0\),
i.e. if the signal is weak, then asymptotically we cannot learn anything about the planted signal.
- If \(\lambda>1\), then for \(\hat X\) the (suitably normalized) top eigenvector of \(Y\), the correlation is bounded
away from \(0\): \(\dfrac 1 n \mathbb E|\hat X^TX|\geq \epsilon_0(\lambda)>0\).
- This follows from random matrix theory.
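This transition is easy to see in simulation (pure-Python power iteration; sizes and names are my own, and the overlap below is with the unit top eigenvector rather than a \(\pm 1\) rounding):

```python
import math, random

random.seed(0)

def spiked_wigner_overlap(n: int, lam: float, iters: int = 150) -> float:
    """Overlap |<v, x>|/sqrt(n) of the top eigenvector v of Y with the spike x."""
    x = [random.choice((-1.0, 1.0)) for _ in range(n)]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            W[i][j] = W[j][i] = random.gauss(0.0, 1.0)  # symmetric Gaussian noise
    s = math.sqrt(lam / n)
    Y = [[s * x[i] * x[j] + W[i][j] for j in range(n)] for i in range(n)]
    v = [random.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):  # power iteration toward the top eigenvector
        v = [sum(row[j] * v[j] for j in range(n)) for row in Y]
        norm = math.sqrt(sum(t * t for t in v))
        v = [t / norm for t in v]
    return abs(sum(vi * xi for vi, xi in zip(v, x))) / math.sqrt(n)

hi_corr = spiked_wigner_overlap(200, 4.0)   # above threshold: predicted ~ sqrt(1 - 1/4) ~ 0.87
lo_corr = spiked_wigner_overlap(200, 0.25)  # below threshold: vanishing overlap
assert hi_corr > 0.5 and lo_corr < 0.4
```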
Recall inference on graphs: consider a graph with vertices \(X_j\sim \mathrm{Unif}\{\pm 1\}\).
For every edge \(e=(u, v)\), we have access to a channel \(P_{Y_e|X_e}\).
For the spiked wigner problem, we have a complete graph and
\[
Y_e = \sqrt{\dfrac\lambda n} X_uX_v + \mathcal N(0, 1)
\]
Consider the channel \(P_{Z_e|X_e} = EC_\tau(X_uX_v)\).
How to compute \(I(X_1; X_2|Z_E)\)?
- If there is an open path (i.e. without erasure) between \(X_1, X_2\),
then we know \(X_1X_2\) with certainty.
Unrolling the definition yields
\[
I(X_1; X_2|Z_E) = \mathrm{Pr}[\exists\text{open path in } ER(1-\tau, n)]
\]
Theorem (Erdős-Rényi random graph model):
for \(\lambda<1\), all connected components of \(ER(\lambda/n, n)\) have size \(O(\log n)\); in this regime,
the probability that \(X_1, X_2\) are connected goes as \(\log n / n \to 0\).
For \(\lambda>1\), there exists a giant connected component of size \(\Omega(n)\).
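The two regimes are visible in a quick simulation (a sketch with a simple union-find; names are mine):

```python
import random

random.seed(1)

def largest_component(n: int, lam: float) -> int:
    """Size of the largest connected component of ER(lam/n, n)."""
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    p = lam / n
    for u in range(n):
        for v in range(u + 1, n):
            if random.random() < p:
                parent[find(u)] = find(v)
    sizes = {}
    for u in range(n):
        r = find(u)
        sizes[r] = sizes.get(r, 0) + 1
    return max(sizes.values())

n = 2000
small = largest_component(n, 0.5)  # subcritical: components of size O(log n)
giant = largest_component(n, 2.0)  # supercritical: a giant component of size Theta(n)
assert small < 0.05 * n and giant > 0.5 * n
```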
Back to spiked Wigner, we compute the \(\eta_{KL}\) of the edge channel
\((X_1, X_2)\mapsto X_1X_2\mapsto Y=\sqrt{\lambda/n}\, X_1X_2 + W\), and
apply the less-noisy reduction. Retracing the steps:
- Start with random matrix theory,
we recognized the reduction to graph inference.
- Graph inference is easily bounded for the \(EC_\tau\) channel using ER theory.
- Reduction to the relatively easy \(EC_\tau\) is done using SDPI theory.
Correlation estimation
Given i.i.d. pairs \((X_j, Y_j)\) drawn from \(P_{XY}\) with uniform marginals on \(\{\pm 1\}\)
and correlation \(\rho\); equivalently,
\(X\xrightarrow{\mathrm{BSC}_\delta} Y\) with \(1-2\delta = \rho\).
Suppose we wish to estimate \(\rho\) between two parties
holding \(X^n, Y^n\) respectively, communicating \(B\) bits. Claim:
\[
\mathbb E(\rho - \hat \rho)^2 = \dfrac 1 {2\log 2} \cdot \dfrac{1+o(1)}{B}
\]
- The contraction coefficient tensorizes (is invariant) assuming independent inputs.
- Removing this constraint, the per-letter SDPI coefficient tends to \(1\) due to the
existence of error-correcting codes.
Yury’s great quote: popularity is not quality;
McDonald's produces billions of dollars in gross sales,
but the Michelin restaurants are not upset about it.