6 Extremization
Key perspectives:
- Capacity: Given channel \(P_{Y|X}\), maximize \(I(X; Y)\) over a convex set of inputs \(P_X\).
- Rate-distortion: Given \(P_X\), minimize \(I(X; Y)\) over a convex set of \(P_{Y|X}\).
- Maximum likelihood: Given \(P\), minimize \(D(P\|Q)\) over a class of \(Q\).
- Information projection: Given \(Q\), minimize \(D(P\|Q)\) over a convex class of \(P\).
The first two objectives correspond to the concavity of \(I(P_X, P_{Y|X})\) in the first argument and its convexity in the second argument. The last two objectives correspond to the convexity of \(D(P\|Q)\) in each argument.
Key mathematical ideas:
- Convexity of \(D(P\|Q)\) is equivalent to “conditioning increases divergence” (proposition 3.4) \(\iff\) “mixing decreases divergence.”
- For convexity proofs, introduce a Bernoulli latent variable, then use the chain rule to decompose the quantity (MI, KL, entropy, etc.) into the two compared components.
- The saddle point of mutual information yields a game-theoretic perspective on the duality between achieving capacity and rate-distortion: corollary 6.2.
- The capacity of a channel is the information radius of the set of conditionals \(\{P_{Y|X=x}\}\) under divergence.
Convexity
Theorem 6.1 (KL convexity) The map \((P, Q)\mapsto D(P\|Q)\) is convex: \[ \forall \lambda \in [0, 1]: \lambda D(P_0 \| Q_0) + \bar \lambda D(P_1\|Q_1) \geq D(\lambda P_0 + \bar \lambda P_1 \| \lambda Q_0 + \bar \lambda Q_1) \]
Proof: The first quantity can be seen as an expected divergence over \(R\sim \mathrm{Ber}(\lambda)\): \[ \lambda D(P_0 \| Q_0) + \bar \lambda D(P_1\|Q_1) = D(P\|Q | R), \quad (P, Q)_{R=j} = (P_j, Q_j) \] The second quantity is the divergence between the marginals \(\lambda P_0 + \bar\lambda P_1\) and \(\lambda Q_0 + \bar\lambda Q_1\); by the chain rule (the \(R\)-marginals agree) and data processing (marginalization decreases divergence), \[\begin{align} D(P\|Q | R) = D(PR \|QR) \geq D(P\|Q) \end{align}\]
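As a numerical sanity check (a sketch, not part of the proof; the distributions and \(\lambda\) are randomly generated), the inequality of theorem 6.1 can be tested directly:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """D(p||q) in nats; assumes supp(p) is contained in supp(q)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def rand_dist(k):
    p = rng.random(k)
    return p / p.sum()

# lam*D(P0||Q0) + (1-lam)*D(P1||Q1) >= D(lam*P0+(1-lam)*P1 || lam*Q0+(1-lam)*Q1)
for _ in range(1000):
    lam = rng.random()
    p0, p1, q0, q1 = (rand_dist(5) for _ in range(4))
    lhs = lam * kl(p0, q0) + (1 - lam) * kl(p1, q1)
    rhs = kl(lam * p0 + (1 - lam) * p1, lam * q0 + (1 - lam) * q1)
    assert lhs >= rhs - 1e-12
```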
Theorem 6.2 (concavity of entropy) \(P_X\mapsto H(X)\) is concave. Fixing a channel \(P_{Y|X}\), \(P_X\mapsto H(X|Y)\) is concave, and continuous if \(\mathcal X\) is finite.
Proof: The first claim follows from the KL-entropy relation \(H(X) = \log |\mathcal X| - D(P_X\|U_X)\) and convexity of KL. The second claim follows by a similar latent variable argument: consider \(U\sim \mathrm{Ber}(\lambda)\) with \(U\to X\to Y\) and \(P_{X|U=j} = P_j\); let \(f(P_X) = H(X|Y)\), then \[\begin{align} f(\lambda P_0 + \bar \lambda P_1) = H(X|Y), \quad \lambda f(P_0) + \bar \lambda f(P_1) = H(X|Y, U) \end{align}\] Concavity follows from \(H(X|Y) \geq H(X|Y, U)\) (conditioning reduces entropy). Continuity follows from \(H(X|Y) = H(X) - I(X; Y)\), both terms being continuous in \(P_X\) for finite \(\mathcal X\).
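The latent-variable step admits the same kind of numerical check: below, `mix` computes \(H(X|Y)\) at the mixed input and `avg` computes \(H(X|Y, U)\) (a sketch with an arbitrary random channel; the matrix convention \(W[x,y] = P_{Y|X}(y|x)\) is a choice of the example):

```python
import numpy as np

rng = np.random.default_rng(1)

def cond_entropy(p_x, W):
    """H(X|Y) in nats for input p_x and channel W[x, y] = P(Y=y|X=x)."""
    joint = p_x[:, None] * W                      # P(X=x, Y=y)
    p_y = joint.sum(axis=0)
    post = joint / np.where(p_y > 0, p_y, 1.0)    # posterior P(X=x|Y=y)
    terms = np.where(post > 0, post * np.log(np.where(post > 0, post, 1.0)), 0.0)
    return float(-(p_y * terms.sum(axis=0)).sum())

lam = 0.3
W = rng.random((3, 4)); W /= W.sum(axis=1, keepdims=True)
p0 = rng.random(3); p0 /= p0.sum()
p1 = rng.random(3); p1 /= p1.sum()

mix = cond_entropy(lam * p0 + (1 - lam) * p1, W)                    # H(X|Y)
avg = lam * cond_entropy(p0, W) + (1 - lam) * cond_entropy(p1, W)   # H(X|Y,U)
assert mix >= avg - 1e-12   # concavity: H(X|Y) >= H(X|Y,U)
```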
Theorem 6.3 (extremality of MI)
- Fixing channel \(P_{Y|X}\), \(P_X\mapsto I(P_X, P_{Y|X})\) is concave.
- Fixing input distribution \(P_X\), \(P_{Y|X} \mapsto I(P_X, P_{Y|X})\) is convex.
Proof: We give three proofs of the first statement, which asserts \[ \lambda I(P^0_X, P_{Y|X}) + \bar \lambda I(P^1_X, P_{Y|X}) \leq I(P_X, P_{Y|X}), \quad P_X = \lambda P^0_X + \bar \lambda P^1_X \]
- Standard latent variable proof: Take \(Z\to X\to Y\) with \(Z\sim \mathrm{Ber}(\lambda)\) and \(P_{X|Z=j} = P^j_X\). The LHS is \(I(X; Y|Z)\), and by the chain rule \[ I(Y; X, Z) = I(Y; X) + I(Y; Z|X)_{=0} = I(Y; X|Z) + I(Y; Z) \] where \(I(Y; Z|X) = 0\) by the Markov property. Hence \(I(X; Y) = I(X; Y|Z) + I(Y; Z) \geq I(X; Y|Z)\), since \(I(Y; Z)\geq 0\).
- Center of gravity formula: Use corollary 5.1: \(I(X; Y) = \min_{Q_Y} D(P_{Y|X} \|Q_Y | P_X)\); this is the pointwise minimum of affine functions in \(P_X\) hence concave.
- Golden formula: Since \(I(X; Y) = D(P_{Y|X} \|Q_Y|P_X) - D(P_Y\|Q_Y)\) for any fixed \(Q_Y\), where the first term is affine in \(P_X\) and the second term \(P_X\mapsto D(P_{Y|X} \circ P_X \|Q_Y)\) is convex (a convex function of the linear image \(P_Y = P_{Y|X}\circ P_X\)), the difference is concave.
For the second statement, note that \(I(X; Y) = D(P_{Y|X} \| P_Y|P_X) = \mathbb E_{x\sim P_X}\, D(P_{Y|X=x} \| P_Y)\); each \(D(P_{Y|X=x} \| P_Y)\) is jointly convex in the pair, and \(P_Y = P_{Y|X}\circ P_X\) is a linear function of \(P_{Y|X}\), so the average is convex in \(P_{Y|X}\).
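Both statements can be checked numerically (a sketch; channels are row-stochastic matrices \(W[x,y] = P_{Y|X}(y|x)\), and all instances are random):

```python
import numpy as np

rng = np.random.default_rng(2)

def mi(p_x, W):
    """I(X;Y) in nats for input p_x and channel W[x, y] = P(Y=y|X=x)."""
    p_y = p_x @ W
    ratio = np.where((W > 0) & (p_y > 0), W / p_y, 1.0)
    return float(np.sum(p_x[:, None] * W * np.log(ratio)))

def rand_dist(k):
    p = rng.random(k)
    return p / p.sum()

def rand_channel(m, n):
    W = rng.random((m, n))
    return W / W.sum(axis=1, keepdims=True)

lam = rng.random()
W, W0, W1 = (rand_channel(3, 4) for _ in range(3))
p, p0, p1 = (rand_dist(3) for _ in range(3))

# Concave in the input: I(mixture of inputs) >= mixture of I's
assert mi(lam*p0 + (1-lam)*p1, W) >= lam*mi(p0, W) + (1-lam)*mi(p1, W) - 1e-12
# Convex in the channel: I(p, mixture of channels) <= mixture of I's
assert mi(p, lam*W0 + (1-lam)*W1) <= lam*mi(p, W0) + (1-lam)*mi(p, W1) + 1e-12
```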
Minimax and saddle-point
Proposition 6.1 (minimax inequality) \[ \inf_y \sup_x f(x, y) \geq \sup_x \inf_y f(x, y) \] The inner optimizer, who sees the outer choice, is weakly better off.
Proof: Fix \(y=y_0\): then \(\sup_x f(x, y_0) \geq \sup_x \inf_y f(x, y)\), since \(f(x, y_0)\geq \inf_y f(x, y)\) pointwise in \(x\). Now take the infimum over \(y_0\) on the left.
- Minimax equality is implied by the existence of a saddle point \((x^*, y^*)\) such that \[ f(x, y^*) \leq f(x^*, y^*) \leq f(x^*, y), \quad \forall x, y \] here \(x^*\) is a best response to \(y^*\), and \(y^*\) is a best response to \(x^*\). Indeed, \(\inf_y \sup_x f \leq \sup_x f(x, y^*) \leq f(x^*, y^*) \leq \inf_y f(x^*, y) \leq \sup_x \inf_y f\), and the minimax inequality closes the loop.
- Conversely, if the extrema are attained (\(\inf, \sup\) replaced by \(\min, \max\)), then minimax equality implies the existence of a saddle point.
- von Neumann: Given \(A, B\) with finite alphabets and \(g(A, B)\) arbitrary, \[ \min_{P_A} \max_{P_B} \mathbb E[g(A, B)] = \max_{P_B} \min_{P_A} \mathbb E[g(A, B)] \] This is a special case of theorem 6.4 with the bilinear \(f(P_A, P_B) = \sum_{ab} P_A(a)P_B(b) g(a, b)\) over the compact, convex probability simplices; see the numerical sketch after this list.
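A minimal numerical sketch of von Neumann's statement, computing the value from both sides as linear programs with scipy.optimize.linprog (the payoff matrix `g` is an arbitrary random example):

```python
import numpy as np
from scipy.optimize import linprog

def value_min_side(g):
    """min over P_A of max_b sum_a P_A(a) g(a, b), as an LP in (p, v)."""
    m, n = g.shape
    c = np.r_[np.zeros(m), 1.0]                    # objective: minimize v
    A_ub = np.c_[g.T, -np.ones(n)]                 # (g^T p)_b - v <= 0 for all b
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)   # sum_a p(a) = 1
    bounds = [(0, None)] * m + [(None, None)]      # p >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n),
                  A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return res.fun

rng = np.random.default_rng(3)
g = rng.random((4, 5))
v_minmax = value_min_side(g)            # min_{P_A} max_{P_B} E[g]
v_maxmin = -value_min_side(-g.T)        # max_{P_B} min_{P_A} E[g], by sign flip
assert abs(v_minmax - v_maxmin) < 1e-8  # minimax equality
```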
Theorem 6.4 (minimax theorem) If \(\mathcal X, \mathcal Y\) are compact convex domains in \(\mathbb R^n\), and \(f(x, y)\) is continuous in \((x, y)\), concave in \(x\) and convex in \(y\), then \[ \max_x \min_y f(x, y) = \min_y \max_x f(x, y) \] In particular, this implies the existence of a saddle point.
Capacity and the saddle point of MI
Definition 6.1 (capacity, caid, caod) The capacity of a channel \(P_{Y|X}\) over a (usually convex) set \(\mathcal P\) of input distributions is \[ C = \sup_{P_X\in \mathcal P} I(P_X, P_{Y|X}) = \sup_{P_X\in \mathcal P} D(P_{Y|X}P_X \| (P_{Y|X}\circ P_X)P_X) \] If the supremum is attained by \(P_X^*\in\mathcal P\), then \(P_X^*\) is a capacity-achieving input distribution (caid) and \(P_Y^* = P_{Y|X}\circ P_X^*\) is the capacity-achieving output distribution (caod).
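For concreteness, a brute-force sketch of the capacity of a binary symmetric channel, compared against the closed form \(C = \log 2 - h(\delta)\) in nats (the crossover \(\delta = 0.11\) is an arbitrary choice):

```python
import numpy as np

def mi(p_x, W):
    """I(X;Y) in nats; W[x, y] = P(Y=y|X=x)."""
    p_y = p_x @ W
    ratio = np.where((W > 0) & (p_y > 0), W / p_y, 1.0)
    return float(np.sum(p_x[:, None] * W * np.log(ratio)))

delta = 0.11
W = np.array([[1 - delta, delta], [delta, 1 - delta]])     # BSC(delta)

# Sup over the one-parameter family P_X = (p, 1 - p)
grid = np.linspace(1e-6, 1 - 1e-6, 10_001)
caps = [mi(np.array([p, 1 - p]), W) for p in grid]
h = lambda d: -d * np.log(d) - (1 - d) * np.log(1 - d)     # binary entropy (nats)
assert abs(max(caps) - (np.log(2) - h(delta))) < 1e-6      # C = log 2 - h(delta)
assert abs(grid[int(np.argmax(caps))] - 0.5) < 1e-3        # caid is uniform
```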
Theorem 6.5 (implications of MI saddle point) Fix a convex set \(\mathcal P\) of input distributions on \(\mathcal X\). The existence of a caid \(P_X^*\) implies, for every \(P_X\in \mathcal P\) and every distribution \(Q_Y\), \[ D(P_{Y|X} \| P_Y^* |P_X) \leq D(P_{Y|X} \| P_Y^* | P_X^*) \leq D(P_{Y|X} \| Q_Y | P_X^*) \]
Proof: The second inequality is simply the center of gravity formula 5.1 applied to the middle quantity \(C=I(X^*; Y^*) = D(P_{Y|X} \| P_Y^* | P_X^*)\). For the first inequality, assume \(C<\infty\) and let \[ P_{X_\lambda} = \lambda P_X + \bar \lambda P_X^*, \quad P_{Y_\lambda} = P_{Y|X} \circ P_{X_\lambda} = \lambda P_Y + \bar \lambda P_Y^* \] The following chain yields (the third line applies the center of gravity characterization again) \[\begin{align} C &\geq I(X_\lambda; Y_\lambda) = D(P_{Y|X} \| P_{Y_\lambda} | P_{X_\lambda}) \\ &= \lambda D(P_{Y|X} \| P_{Y_\lambda} | P_X) + \bar \lambda D(P_{Y|X} \| P_{Y_\lambda} | P_{X^*}) \\ &\geq \lambda D(P_{Y|X} \| P_{Y_\lambda} | P_X) + \bar \lambda C \end{align}\] Subtracting \(\bar\lambda C\) and dividing by \(\lambda\), \[ C \geq D(P_{Y|X} \| P_{Y_\lambda} | P_X) = D(P_{Y|X} P_X \| P_{Y_\lambda} P_X) \] Since \(P_{Y_\lambda} P_X \to P_{Y^*}P_X\) as \(\lambda\to 0\), lower semi-continuity 5.6 gives the desired quantity \[ D(P_{Y|X} \| P_{Y^*} | P_X) = D(P_{Y|X}P_X \| P_{Y^*}P_X) \leq \liminf_{\lambda\to 0} D(P_{Y|X} P_X \| P_{Y_\lambda} P_X) \leq C \]
Corollary 6.1 (uniqueness of caod) In addition to the assumptions of theorem 6.5, if the capacity is finite, then the caod \(P_Y^*\) is unique and satisfies \[ D(P_{Y|X} \circ P_X \| P_Y^*) \leq C < \infty, \quad \forall P_X\in \mathcal P \] In particular, finiteness of the divergence implies \(P_Y \ll P_Y^*\) for every output \(P_Y\in P_{Y|X}\circ \mathcal P\).
Proof: Take any \(P_X\in \mathcal P\) with output \(P_Y = P_{Y|X}\circ P_X\). By the divergence chain rule 3.2 (the golden formula), \[\begin{align} D(P_{Y|X} \| P_Y^* | P_X) = D(P_{Y|X} \| P_Y | P_X) + D(P_Y \| P_Y^*) = I(P_X, P_{Y|X}) + D(P_Y \| P_Y^*) \end{align}\] The left side is at most \(C\) by the left inequality of theorem 6.5, which gives the display. If moreover \(P_X\) is itself a caid with caod \(P_Y\), then \(C \geq C + D(P_Y \| P_Y^*)\), forcing \(D(P_Y \| P_Y^*)=0\iff P_Y=P_Y^*\).
Note that the caid need not be unique.
Corollary 6.2 (minimax MI) Under saddle point assumptions, we additionally have \[\begin{align} C = \max_{P_X\in \mathcal P} I(X; Y) &= \max_{P_X\in \mathcal P} \min_{Q_Y} D(P_{Y|X} \| Q_Y | P_X) \\ &= \min_{Q_Y} \sup_{P_X\in \mathcal P} D(P_{Y|X} \| Q_Y|P_X) \end{align}\]
Proof: The first \(\max\min\) equality is the center of gravity characterization 5.1. For the second, the left inequality of theorem 6.5 gives \[ C = D(P_{Y|X} \| P_Y^* | P_X^*) = \max_{P_X\in \mathcal P} D(P_{Y|X} \| P_Y^* | P_X) \geq \min_{Q_Y} \sup_{P_X\in \mathcal P} D(P_{Y|X} \| Q_Y | P_X) \] while the reverse direction is the minimax inequality 6.1 combined with the first line.
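This minimax structure underlies the Blahut-Arimoto algorithm, which alternates the two optimizations. A minimal sketch for a discrete channel (the standard multiplicative update, stated here without its derivation; the iteration count is arbitrary):

```python
import numpy as np

def blahut_arimoto(W, iters=200):
    """Approximate capacity (nats) and caid for channel W[x, y] = P(Y=y|X=x)."""
    m = W.shape[0]
    p = np.full(m, 1.0 / m)                   # start from the uniform input
    logW = np.log(np.where(W > 0, W, 1.0))    # safe log; 0 log 0 := 0 below
    for _ in range(iters):
        q = p @ W                             # current output distribution
        # d[x] = D(P_{Y|X=x} || q): per-input divergence to the current center
        d = np.sum(np.where(W > 0, W * (logW - np.log(q)), 0.0), axis=1)
        p = p * np.exp(d)                     # reweight toward far-away inputs
        p /= p.sum()
    q = p @ W
    d = np.sum(np.where(W > 0, W * (logW - np.log(q)), 0.0), axis=1)
    return float(p @ d), p                    # I(P_X, P_{Y|X}) at the final p

delta = 0.11
W = np.array([[1 - delta, delta], [delta, 1 - delta]])    # BSC(delta)
C, p_star = blahut_arimoto(W)
h = lambda x: -x * np.log(x) - (1 - x) * np.log(1 - x)
assert abs(C - (np.log(2) - h(delta))) < 1e-9             # C = log 2 - h(delta)
```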
Capacity as information radius
Some review: consider a metric space \((X, d)\) and a bounded set \(A\subseteq X\).
Definition 6.2 (radius, diameter) The (Chebyshev) radius of \(A\) is the radius of the smallest ball covering \(A\): \[ \mathrm{rad}(A) = \inf_{y\in X}\sup_{x\in A} d(x, y) \] The diameter of \(A\) is the least upper bound on pairwise distances within the set: \[ \mathrm{diam}(A) = \sup_{x, y\in A} d(x, y) \]
Compare the radius to the second equation in corollary 6.2: channel capacity is the information radius of the set of conditionals \(\{P_{Y|X=x} : x\in\mathcal X\}\).
Corollary 6.3 Given finite \(\mathcal X\) and channel \(P_{Y|X}\), the maximal mutual information over all input distributions satisfies \[ \max_{P_X} I(P_X, P_{Y|X}) = \max_{x\in \mathcal X} D(P_{Y|X=x} \| P_Y^*) \]
Proof: This follows from the left equality \(C=\max_{P_X\in \mathcal P} D(P_{Y|X} \| P_Y^* | P_X)\): for finite \(\mathcal X\), \(D(P_{Y|X} \| P_Y^* | P_X) = \sum_x P_X(x)\, D(P_{Y|X=x} \| P_Y^*)\) is linear in \(P_X\), so the maximum over the simplex is attained by concentrating all mass on the single \(x\in \mathcal X\) which maximizes \(D(P_{Y|X=x} \| P_Y^*)\).
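A quick numerical check of corollary 6.3 on the BSC from the earlier sketches (reusing the `blahut_arimoto` routine sketched after corollary 6.2; for this channel the caid has full support, so every conditional sits at divergence exactly \(C\) from the caod):

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

delta = 0.11
W = np.array([[1 - delta, delta], [delta, 1 - delta]])
C, p_star = blahut_arimoto(W)      # the sketch after corollary 6.2
q_star = p_star @ W                # caod P_Y^*

# Capacity as information radius: C = max_x D(P_{Y|X=x} || P_Y^*)
radius = max(kl(W[x], q_star) for x in range(W.shape[0]))
assert abs(radius - C) < 1e-9
```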