4 Mutual Information
Key takeaways:
- The graph direction in Markov models is somewhat redundant: \[ X\to Y\to Z \iff X\leftarrow Y\to Z\iff Z\to Y\to X \] The key property is \(X\perp\!\!\!\perp Z \mid Y\).
- In light of how divergence-DPI implies mutual-information-DPI (proposition 4.2), given a divergence \(\mathcal D\) satisfying DPI, we can define a mutual-information-like quantity \[ I_{\mathcal D}(X; Y) = \mathbb E_{x\sim P_X} \left[\mathcal D(P_{Y|X=x} \| P_Y)\right] = \mathcal D(P_{Y|X} \| P_Y | P_X) \]
- The key factorization result for mutual information is the chain rule (theorem 4.2).
- Mutual information is the weighted divergence from the marginal to the conditionals (definition 4.1).
- General proof technique (see theorem 4.3): to derive bounds on general statements, derive one for a tractable statement, then apply divergence inequalities (DPI, Donsker-Varadhan, etc.).
Definition, properties
Definition 4.1 (Mutual information) The mutual information between \(X, Y\) is (Fano’s definition) \[ I(X; Y) \equiv D(P_{X, Y}\|P_XP_Y) = D(P_{Y|X} \| P_Y | P_X) \] Shannon’s definition is not general enough for infinite cases, even when the alphabet is discrete. \[ I(X; Y) = H(X) - H(X|Y) \] The conditional mutual information is \[ I(X; Y|Z) = \mathbb E_{z\sim P_Z} I(X; Y|Z=z) \]
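For finite alphabets, Fano's divergence form of definition 4.1 can be computed directly from a joint pmf. A minimal sketch in Python (the `mutual_information` helper and the example joint are illustrative, not from the notes; logs are natural, so values are in nats):

```python
import math

def mutual_information(joint):
    """I(X; Y) = D(P_XY || P_X P_Y) for a joint pmf given as {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    # Sum p(x, y) log[ p(x, y) / (p(x) p(y)) ] over the support.
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Y = X with X ~ Ber(1/2): I(X; Y) = H(X) = log 2.
print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))  # ≈ 0.6931
```

An independent joint, e.g. the uniform pmf on \(\{0,1\}^2\), gives \(I(X; Y) = 0\), matching the positivity property below.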
Theorem 4.1 (properties of mutual information) Mutual information satisfies:
- Conditional expression: \(I(X; Y) = D(P_{XY}\|P_XP_Y) = D(P_{Y|X} \| P_Y\,|\, P_X)\).
- Symmetry: \(I(X; Y) = I(Y; X)\).
- Positivity: \(I(X; Y)\geq 0\) with equality saturated iff \(X\perp\!\!\!\perp Y\).
- Transformations destroy information: \(I(f(X); Y) \leq I(X; Y)\) with equality if \(f\) is injective.
- Data create information: \(I(X, Y; Z) \geq I(X; Z)\) with equality iff \(Z\perp\!\!\!\perp Y \mid X\); in particular, when \(Y\) is a deterministic transform of \(X\).
Proof: The conditional expression follows from the decomposition in theorem 3.4. For symmetry, consider the swap kernel \(K\) on the joint which maps \((x, y)\mapsto (y, x)\); applying the KL-DPI 3.5 to \(K\circ P_{XY} = P_{YX}\) and \(K\circ P_XP_Y = P_YP_X\) yields \[ D(P_{YX} \| P_YP_X) \leq D(P_{XY} \| P_XP_Y) \] and applying the swap once more reverses the inequality, so \(I(X; Y) = I(Y; X)\). Positivity is established by the information inequality 3.1. For claim (4), consider the kernel \(K(X, Y) = (f(X), Y)\) and apply the KL-DPI \[ D(P_{f(X)Y} \| P_{f(X)}P_Y) \leq D(P_{XY} \| P_XP_Y) \iff I(f(X); Y) \leq I(X; Y) \] For injective \(f\), apply the same argument to \(f^{-1}\) to obtain the reverse inequality. For the last claim, apply (4) to the projection \((X, Y)\mapsto X\).
Proposition 4.1 (Mutual information and entropy)
- \(I(X; X) = \begin{cases} H(X) & X \text{ discrete,} \\ \infty & \text{otherwise.}\end{cases}\)
- Given discrete \(X\), \(H(X) = I(X; Y) + H(X|Y)\).
- Given both \(X, Y\) discrete, \(H(X)+H(Y) = I(X; Y) + H(X, Y)\).
Proof: (2) and (3) follow from direct computation. Given \(X\) discrete, we have \[ I(X; X) = D(P_{XX} \| P_XP_X) = \mathbb E_{X\sim P_X} \log \dfrac{P_X(X)}{P_X(X)^2} = H(X) \] using that the diagonal joint \(P_{XX}\) is supported on \(\{(x, x)\}\) with mass \(P_X(x)\). For a non-discrete \(X\), consider a kernel \(K_m\) which projects \(X\) onto the univariate \(\mathrm{Unif}(0, 1)\) distribution then takes the first \(m\) bits of the binary expansion, so that \(K_m(X)\) is uniform on \(2^m\) values; then \[ I(X; X) \geq I(K_m(X); X) = H(K_m(X)) - \underbrace{H(K_m(X)|X)}_{=0} = m\log 2 \] and this holds for all \(m\).
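The discrete case of proposition 4.1 is easy to verify numerically: the diagonal joint \(P_{XX}\) makes the divergence collapse to the entropy. A short sketch (the pmf `p` is an arbitrary illustrative choice):

```python
import math

def entropy(p):
    """H(X) in nats for a pmf given as a list of probabilities."""
    return -sum(q * math.log(q) for q in p if q > 0)

def mi_self(p):
    """I(X; X) via D(P_XX || P_X P_X); the joint sits on the diagonal (x, x)."""
    return sum(q * math.log(q / (q * q)) for q in p if q > 0)

p = [0.5, 0.25, 0.125, 0.125]
print(entropy(p), mi_self(p))  # identical values
```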
Conditional MI
We first briefly talk about causal graphs. Consider a Markov chain \(X\to Y\to Z\). This unrolls to \[\begin{align} X\to Y\to Z &\iff P_{X, Y, Z} = P_{Z|Y}P_{Y|X}P_X \\ &\iff P_{X, Z|Y} = P_{X|Y} P_{Z|Y} \iff X\perp\!\!\!\perp Z | Y \\ &\iff P_{X, Y, Z} = P_{X, Z|Y} P_Y = P_{X|Y} P_{Z|Y} P_Y \\ &\iff X\leftarrow Y \rightarrow Z \\ &\iff Z\rightarrow Y\rightarrow X \end{align}\] A variable \(V\) is a collider on some undirected path if both adjacent edges point into it: \[ \cdots \to V \leftarrow \cdots \] Two subsets of vertices \(A, B\) are \(d\)-connected by a subset \(C\) if there exists an undirected path from some \(a\in A\) to some \(b\in B\) such that:
- There are no non-colliders in \(C\).
- Every collider is either in \(C\) or has a descendant in \(C\).
\(A\perp\!\!\!\perp B|C\) in every distribution satisfying the graphical model iff \(A, B\) are not \(d\)-connected by \(C\).
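The factorization \(P_{X,Y,Z} = P_X P_{Y|X} P_{Z|Y}\) above can be checked numerically: any joint built this way satisfies \(I(X; Z|Y) = 0\), and the mutual-information DPI \(I(X; Z)\leq I(X; Y)\) holds. A sketch (the marginal `px` and kernels `py_x`, `pz_y` are arbitrary illustrative choices):

```python
import math
from itertools import product

# A toy chain X -> Y -> Z over binary alphabets: P_XYZ = P_X P_{Y|X} P_{Z|Y}.
px = [0.3, 0.7]
py_x = [[0.9, 0.1], [0.2, 0.8]]    # rows indexed by x, columns by y
pz_y = [[0.6, 0.4], [0.25, 0.75]]  # rows indexed by y, columns by z
joint = {(x, y, z): px[x] * py_x[x][y] * pz_y[y][z]
         for x, y, z in product(range(2), repeat=3)}

def mi(pmf, a, b):
    """I between coordinates a and b of a pmf over tuples."""
    pa, pb, pab = {}, {}, {}
    for k, p in pmf.items():
        pa[k[a]] = pa.get(k[a], 0.0) + p
        pb[k[b]] = pb.get(k[b], 0.0) + p
        pab[(k[a], k[b])] = pab.get((k[a], k[b]), 0.0) + p
    return sum(p * math.log(p / (pa[u] * pb[v]))
               for (u, v), p in pab.items() if p > 0)

def cmi_xz_given_y(pmf):
    """I(X; Z | Y): the P_Y-average of I(X; Z) under each conditional."""
    py = {}
    for (x, y, z), p in pmf.items():
        py[y] = py.get(y, 0.0) + p
    total = 0.0
    for y0, w in py.items():
        cond = {(x, y0, z): p / w for (x, y, z), p in pmf.items() if y == y0}
        total += w * mi(cond, 0, 2)
    return total

print(cmi_xz_given_y(joint))               # ≈ 0: X ⫫ Z | Y
print(mi(joint, 0, 2) <= mi(joint, 0, 1))  # True: the DPI
```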
Definition 4.2 (conditional mutual information) Given standard Borel \(\mathcal X, \mathcal Y\), define \[ I(X; Y|Z) = D(P_{X, Y|Z} \| P_{X|Z} P_{Y|Z} \, | \, P_Z) = \mathbb E_{z\sim P_Z}\left[I(X; Y|Z=z)\right] \]
Theorem 4.2 (more properties of mutual information) Given standard Borel R.Vs, then
- \(I(X; Z|Y)\geq 0\) with equality iff \(X\perp\!\!\!\perp Z | Y\).
- Chain rule (Kolmogorov identities):
\[ I(X, Y; Z) = I(X; Z) + I(Y; Z|X) = I(Y; Z) + I(X; Z|Y) \]
- DPI for mutual information: given \(X\to Y\to Z\), then \[ I(X; Z) \leq I(X; Y) \] with equality iff \(X\to Z\to Y\).
- Full chain rule: \(I(X^n; Y) = \sum_{k=1}^n I(X_k; Y|X^{k-1})\).
- Bijective invariance: given bijective \(f, g\), then \(I(fX; gY) = I(X; Y)\).
Proof: Positivity follows from applying the information inequality to each conditional term \(I(X; Z|Y=y)\), with equality iff \(P_{X,Z|Y=y} = P_{X|Y=y}P_{Z|Y=y}\) almost surely, i.e. iff \(X\perp\!\!\!\perp Z|Y\). The Kolmogorov identities follow from expanding \(I(X, Y; Z)\) via the conditional expression in both orders. For the DPI, expand \(I(X; Y, Z)\) both ways: \[ I(X; Y, Z) = I(X; Y) + I(X; Z|Y) = I(X; Z) + I(X; Y|Z) \] the Markov property gives \(I(X; Z|Y) = 0\), hence \(I(X; Z) \leq I(X; Y)\), and equality is fulfilled iff \(I(X; Y|Z) = 0\), i.e. iff \(X\to Z\to Y\). The full chain rule follows by induction on the Kolmogorov identities. Bijective invariance follows from applying DPI to \(f\) then \(f^{-1}\): consider the equally valid chains under bijection: \[ Y\to X\to fX \implies I(fX; Y) \leq I(X; Y), \quad Y\to fX\to X \implies I(X; Y) \leq I(fX; Y) \]
Remark (mutual information chain rule). One should remember the Kolmogorov identities as a decomposition of \(I(X, Y; Z)\) into an individual component \(I(X; Z)\) plus the second term \(I(Y; Z|X)\), which is \(I(Y; Z)\) adjusted for over-counting.
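The Kolmogorov identity \(I(X, Y; Z) = I(X; Z) + I(Y; Z|X)\) holds for every joint, so it can be spot-checked numerically. A sketch (the pmf `raw` is an arbitrary illustrative choice; coordinate groups are passed as index tuples):

```python
import math
from itertools import product

# An arbitrary illustrative joint pmf over binary (X, Y, Z).
raw = [3, 1, 2, 2, 1, 4, 2, 1]
joint = {k: r / sum(raw) for k, r in zip(product(range(2), repeat=3), raw)}

def marg(pmf, coords):
    """Marginal pmf over the given coordinate indices."""
    out = {}
    for k, p in pmf.items():
        key = tuple(k[c] for c in coords)
        out[key] = out.get(key, 0.0) + p
    return out

def mi(pmf, a, b):
    """I(A; B) for groups of coordinates a, b (tuples of indices)."""
    pa, pb, pab = marg(pmf, a), marg(pmf, b), marg(pmf, a + b)
    return sum(p * math.log(p / (pa[k[:len(a)]] * pb[k[len(a):]]))
               for k, p in pab.items() if p > 0)

def cmi(pmf, a, b, c):
    """I(A; B | C): average I(A; B) over the conditionals given C."""
    total = 0.0
    for cv, w in marg(pmf, c).items():
        cond = {k: p / w for k, p in pmf.items()
                if tuple(k[i] for i in c) == cv}
        total += w * mi(cond, a, b)
    return total

lhs = mi(joint, (0, 1), (2,))                               # I(X, Y; Z)
rhs = mi(joint, (0,), (2,)) + cmi(joint, (1,), (2,), (0,))  # I(X; Z) + I(Y; Z|X)
print(abs(lhs - rhs) < 1e-12)  # True
```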
Corollary 4.1 If \(X\to Y\to Z\to W\), then \(I(X; W) \leq I(Y; Z)\).
Proof: Two applications of the mutual-information DPI: \(I(Y; Z) \geq I(X; Z)\) along \(X\to Y\to Z\), and \(I(X; Z) \geq I(X; W)\) along \(X\to Z\to W\).
Remark (incomparable mutual information under conditioning). Under a Markov chain \(X\to Y\to Z\), mutual information decreases upon conditioning: \(I(X; Y|Z) = I(X; Y) - I(X; Z) \leq I(X; Y)\). To find an example such that \(I(X; Y|Z) > I(X; Y)\), the only non-isomorphic graph is the collider \[ X\to Y\leftarrow Z \] Let \(X, Z\sim \mathrm{Ber}(1/2)\) i.i.d. and \(Y=X\oplus Z\), then \(X\perp\!\!\!\perp Y \implies I(X; Y)=0\) and \[ I(X; Y|Z) = H(X|Z) - H(X|Y, Z) = H(X) - 0 = \log 2 \]
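The XOR collider example can be checked by direct computation on the four-point joint. A minimal sketch (helper names are illustrative):

```python
import math

# X, Z ~ Ber(1/2) i.i.d. and Y = X XOR Z: the collider X -> Y <- Z.
joint = {(x, x ^ z, z): 0.25 for x in (0, 1) for z in (0, 1)}

def mi2(pairs):
    """I between the two coordinates of a pmf {(a, b): p}."""
    pa, pb = {}, {}
    for (a, b), p in pairs.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for (a, b), p in pairs.items() if p > 0)

# I(X; Y): marginalize out Z.
pxy = {}
for (x, y, z), p in joint.items():
    pxy[(x, y)] = pxy.get((x, y), 0.0) + p
i_xy = mi2(pxy)

# I(X; Y | Z): average I(X; Y) over the conditionals given Z = z.
i_xy_z = 0.0
for z0 in (0, 1):
    cond = {(x, y): p / 0.5 for (x, y, z), p in joint.items() if z == z0}
    i_xy_z += 0.5 * mi2(cond)

print(i_xy, i_xy_z)  # 0.0 and log 2 ≈ 0.6931
```

Conditioning on \(Z\) turns \(Y\) into a bijective image of \(X\), which is why the conditional mutual information jumps to a full bit.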
Proposition 4.2 (alternate proof of mutual information DPI) Mutual information DPI is implied by the divergence-DPI theorem 3.5. Given a Markov chain \(X\to Y\to Z\), we have \[ I(X; Z) = D(P_{XZ} \| P_XP_Z) = D(P_{Z|X} \| P_Z | P_X) \leq D(P_{Y|X} \| P_Y | P_X) = I(X; Y) \]
Proof: The Markov kernel in question is simply \(P_{Z|Y}\), which maps \((P_{Y|X=x}, P_Y) \mapsto (P_{Z|X=x}, P_Z)\); apply the divergence-DPI for each \(x\) and average over \(P_X\).
Probability of error, Fano’s inequality
Given a R.V. \(W\) and our prediction \(\hat W\), we can
- Guess randomly: \(W\perp\!\!\!\perp\hat W\).
- Guess with data: \(W\to X\to \hat W\), where \(X=f(W)\) is a deterministic function of \(W\).
- \(W\to X\to Y\to \hat W\), where \(X\to Y\) is some noisy channel.
The following theorem upper bounds the discrete entropy in terms of the probability that an independent guess is correct, and in terms of the maximum probability mass. We would guess that:
- For the maximum probability of the distribution, the bound should be largest when \(P_0 = 1/M\) and decay as \(P_0\) increases.
- The collision probability \(\sum_x P_X(x)^2\) is at least \(1/M\) across all distributions, with the minimum attained by the uniform distribution.
The second statement in the following theorem can be viewed as a bound on entropy using the \(\infty\)-Rényi entropy \(H_\infty(X) = \log(1/P_0)\).
Theorem 4.3 (entropy bound using independent guess) Given a finite alphabet \(|\mathcal X|=M<\infty\), for any \(\hat X\perp\!\!\!\perp X\) \[ H(X) \leq F_M(\mathrm{Pr}[X=\hat X]), \quad F_M(x) = (1-x)\log(M-1)+h(x) \] and \(h(x)=-x\log x - \bar x \log \bar x\) is the binary entropy. Let \(P_0 = \max_{x\in \mathcal X} P_X(x)\), then \[ H(X) \leq F_M(P_0) \] with equality iff \(P_X=(P_0, \alpha, \cdots, \alpha)\) where \(\alpha=(1-P_0)/(M-1)\). Note that \(F_M(0)=\log(M-1)\), \(F_M(1/M)=\log M\), and \(F_M(1)=0\).
Proof
For the first inequality, consider \(Q_{X\hat X} = U_XP_{\hat X}\) with \(U_X\) uniform, so that \(Q[X=\hat X]=1/M\); let \(P[X=\hat X]=p\). Since \(X\perp\!\!\!\perp\hat X\), \[ D(P_{X\hat X} \| Q_{X\hat X}) = \mathbb E_{X\sim P_X} \log \dfrac{P_X(X)}{1/M} = \log M - H(X) \] Applying the divergence DPI to the kernel \((X, \hat X)\mapsto 1[X=\hat X]\) yields \[\begin{align} \log M - H(X) &\geq d(p\|1/M) = p \log pM + \bar p \log \dfrac{M \bar p}{M-1} \\ &= -h(p) + p\log M + \bar p \log M - \bar p\log(M-1) \\ &= -h(p) + \log M - \bar p \log(M-1) \end{align}\] which rearranges to \(H(X) \leq h(p) + \bar p \log(M-1) = F_M(p)\). To prove the second claim, choose \(\hat X\) to be the mode of \(X\), so that \(p = P_0\).

Theorem 4.4 (Fano's inequality) Let \(|\mathcal X|=M\) and \(X\to Y\to \hat X\). Let \(p_e = P[X\neq \hat X]\), then \[ H(X|Y) \leq F_M(1-p_e) = p_e\log(M-1) + h(p_e) \] Furthermore, for \(p_0 = \max_x P_X(x)\), regardless of \(M\) we have \[ I(X; Y) \geq (1 - p_e) \log \dfrac 1 {p_0} - h(p_e) \]
Proof: Apply the argument of the previous theorem conditionally on \(Y\): for each \(y\), bound \(H(X|Y=y) \leq F_M(P[X=\hat X|Y=y])\) with \(U_X\) uniform, then average over \(y\) and use concavity of \(F_M\) to obtain \(H(X|Y) \leq F_M(1-p_e)\).
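The bound \(H(X)\leq F_M(P_0)\) of theorem 4.3 and its equality case can be checked numerically. A small sketch (the example pmfs are arbitrary illustrative choices; logs are natural):

```python
import math

def H(p):
    """Discrete entropy in nats for a pmf given as a list."""
    return -sum(q * math.log(q) for q in p if q > 0)

def F(M, x):
    """F_M(x) = (1 - x) log(M - 1) + h(x), with h the binary entropy."""
    h = 0.0 if x in (0.0, 1.0) else -x * math.log(x) - (1 - x) * math.log(1 - x)
    return (1 - x) * math.log(M - 1) + h

# H(X) <= F_M(P_0) for each pmf, with M the alphabet size and P_0 the max mass.
for p in ([0.5, 0.3, 0.2], [0.7, 0.1, 0.1, 0.1], [0.25] * 4):
    print(H(p) <= F(len(p), max(p)) + 1e-12)  # True for each

# Equality case: P_X = (P_0, alpha, ..., alpha) with alpha = (1 - P_0)/(M - 1).
print(abs(H([0.7, 0.1, 0.1, 0.1]) - F(4, 0.7)) < 1e-9)  # True
```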