7 Tensorization

  1. Theorem 7.1 states that the MI-maximizing input for a memoryless (product) channel is a product input, and that the MI-minimizing channel for a product input is a memoryless channel.
  2. Gaussian saddle point problem: solve an \(n=1\) problem by lifting it to an \(n>1\) problem, then use convexity / concavity of information measures and symmetry to obtain the optimal solution.
  3. For stationary processes the entropy rate exists and is approached from above (theorem 7.3).
  4. The entropy rate of an HMM is sandwiched between two converging bounds (proposition 7.1).

MI, Gaussian saddle point

Theorem 7.1 (MI behavior)

  1. For a memoryless channel \(P_{Y^n|X^n} = \prod P_{Y_j|X_j}\) \[ I(X^n; Y^n) \leq \sum_{j=1}^n I(X_j; Y_j) \tag{7.1} \] with equality iff \(P_{Y^n} = \prod P_{Y_j}\).
  2. The (unconstrained) capacity is additive for memoryless channels: \[ \max_{P_X^n} I(X^n; Y^n) = \sum_{j=1}^n \max_{P_{X_j}} I(X_j; Y_j) \]
  3. If the source is memoryless \(P_{X^n} = \prod P_{X_j}\), then \[ I(X^n; Y) \geq \sum_{j=1}^n I(X_j; Y) \tag{7.2} \] with equality iff \(P_{X^n|Y} = \prod P_{X_j|Y}\) almost-surely under \(P_Y\).
  4. \(\min_{P_{Y^n|X^n}} I(X^n; Y^n) = \sum_j \min_{P_{Y_j|X_j}} I(X_j; Y_j)\).
Proof For (1), recall definition 4.1 for MI, multiply and divide by a product and cross-identify the terms \[\begin{align} I(X^n; Y^n) &= D(P_{X^nY^n} \| P_{X^n} P_{Y^n}) = D\left(\prod P_{Y_j|X_j}\| P_{Y^n} | P_{X^n}\right) \\ &= \mathbb E \left[ \log \dfrac{P_{Y^n|X^n}}{P_{Y^n}} \cdot \dfrac{\prod P_{Y_j}}{\prod P_{Y_j}} \right] = \mathbb E \log \prod \dfrac{P_{Y_j|X_j}}{P_{Y_j}} - \mathbb E\log \dfrac{P_{Y^n}}{\prod P_{Y_j}} \\ &= \sum_{j=1}^n I(X_j; Y_j) - D\left(P_{Y^n} \| \prod P_{Y_j}\right) \end{align}\] (1) immediately implies (2): optimizing each \(P_{X_j}\) separately attains the right-hand side, and a product input makes the output a product distribution, so (7.1) holds with equality. To show (3), again multiply and divide by a product and cross-identify terms \[\begin{align} I(X^n; Y) &= \mathbb E \log \dfrac{P_{X^n|Y}}{P_{X^n}} \cdot \prod \dfrac{P_{X_j|Y}}{P_{X_j|Y}} \\ &= \mathbb E\log \prod \dfrac{P_{X_j|Y}}{P_{X_j}} + \mathbb E\log \dfrac{P_{X^n|Y}}{\prod P_{X_j|Y}} \\ &= \sum_{j=1}^n I(X_j; Y) + D\left(P_{X^n|Y} \| \prod P_{X_j|Y} \Big| P_Y\right) \end{align}\] Finally (3) implies (4): taking \(Y=Y^n\) gives \(I(X^n; Y^n) \geq \sum_j I(X_j; Y^n) \geq \sum_j I(X_j; Y_j)\), and a product of single-letter minimizers achieves the bound, since the saturation condition together with the memoryless source implies that the minimizing channel is memoryless.
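A numerical sanity check of inequality (7.1) for \(n=2\): a correlated binary input sent through a memoryless pair of BSC(0.1) channels (the channel and input parameters are arbitrary illustrative choices, and the helper `mi` is a hypothetical utility, not from the text).

```python
import numpy as np

def mi(joint):
    """Mutual information in bits of a 2-D joint probability table."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (px @ py)[mask])).sum())

delta = 0.1
bsc = np.array([[1 - delta, delta], [delta, 1 - delta]])  # P_{Y|X} of a BSC(0.1)

# A non-product input on {0,1}^2: X1 = X2 with probability 0.9.
p_x = np.array([[0.45, 0.05], [0.05, 0.45]])

# Joint law of (X1, X2, Y1, Y2) under the memoryless two-letter channel.
p = np.einsum('ab,ai,bj->abij', p_x, bsc, bsc)

block_mi = mi(p.reshape(4, 4))   # I(X^2; Y^2), treating pairs as 4-ary symbols
i1 = mi(p.sum(axis=(1, 3)))      # I(X1; Y1)
i2 = mi(p.sum(axis=(0, 2)))      # I(X2; Y2)

# Inequality (7.1): strict here, since the output law is not a product.
assert block_mi < i1 + i2
```

The gap `i1 + i2 - block_mi` is exactly the divergence \(D(P_{Y^2} \| P_{Y_1}P_{Y_2})\) from the proof.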

Several examples to the theorem above:

  1. A non-memoryless channel can break (7.1): let \(X_1\perp\!\!\!\perp X_2\sim \mathrm{Ber}(1/2)\) and \(Y_1=X_1\oplus X_2, Y_2=X_1\); then \(I(X_1; Y_1)=I(X_2; Y_2)=0\) but \(I(X_1, X_2; Y_1, Y_2)=2\) bits.
  2. Strict inequality in 7.1.1: with all inputs equal, \(Y_k=X_k=U\sim \mathrm{Ber}(1/2)\), we have \(I(X^n; Y^n)=1 < n = \sum_k I(X_k; Y_k)\) for \(n>1\).
  3. Strict inequality in 7.1.3: let \(X_1 \perp\!\!\!\perp X_2\sim \mathrm{Ber}(1/2)\) and \(Y=X_1\oplus X_2\); then \(I(X_1; Y)=I(X_2; Y)=0\) but \(I(X_1, X_2; Y)=1\) bit.
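The examples above can be verified by direct enumeration over the (at most four) equiprobable outcomes; `mi_bits` below is a hypothetical helper, not from the text.

```python
import math
from collections import Counter

def mi_bits(samples):
    """Exact MI in bits of a list of equiprobable (x, y) outcome pairs."""
    n = len(samples)
    pxy = Counter(samples)
    px = Counter(x for x, _ in samples)
    py = Counter(y for _, y in samples)
    return sum(c / n * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

inputs = [(x1, x2) for x1 in (0, 1) for x2 in (0, 1)]  # uniform on {0,1}^2

# Example 1: Y1 = X1 xor X2, Y2 = X1 (a non-memoryless channel).
assert mi_bits([(x1, x1 ^ x2) for x1, x2 in inputs]) == 0        # I(X1; Y1)
assert mi_bits([(x2, x1) for x1, x2 in inputs]) == 0             # I(X2; Y2)
assert mi_bits([((x1, x2), (x1 ^ x2, x1)) for x1, x2 in inputs]) == 2.0

# Example 3: Y = X1 xor X2 is independent of each X_j alone.
assert mi_bits([(x1, x1 ^ x2) for x1, x2 in inputs]) == 0        # I(X1; Y)
assert mi_bits([((x1, x2), x1 ^ x2) for x1, x2 in inputs]) == 1.0
```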

Theorem 7.2 (extremality of Gaussian channels) Let \(X_g\sim \mathcal N(0, \sigma_X^2), N_g\sim \mathcal N(0, \sigma_N^2)\perp\!\!\!\perp X_g\), then

  1. Gaussian capacity: \[ C = I(X_g; X_g + N_g) = \dfrac 1 2 \log \left( 1 + \dfrac{\sigma_X^2}{\sigma_N^2} \right) \]
  2. Gaussian input is the caid for Gaussian noise: under the power constraint \(\mathrm{Var}(X) \leq \sigma_X^2\) and \(X\perp\!\!\!\perp N_g\), \[ I(X; X+N_g) \leq I(X_g; X_g+N_g) \] with equality iff \(X=X_g\) (in distribution).
  3. Gaussian noise is the worst for Gaussian input: under the power constraint \(\mathbb E[N^2] \leq \sigma_N^2\) and \(\mathbb E[X_gN] = 0\), \[ I(X_g; X_g+N) \geq I(X_g; X_g+N_g) \] with equality iff \(N=N_g\) (in distribution) and \(N\perp\!\!\!\perp X_g\).
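Item 2 can be illustrated numerically (a sketch, not part of the proof; the choice \(\sigma_X^2=\sigma_N^2=1\) and the binary input are illustrative assumptions): for \(X=\pm 1\) equiprobable over the same AWGN channel, \(I(X; X+N_g) = h(Y) - h(N_g)\) falls strictly below the Gaussian capacity.

```python
import numpy as np

sigma2 = 1.0                                 # noise variance sigma_N^2
c_gauss = 0.5 * np.log2(1 + 1.0 / sigma2)    # capacity with sigma_X^2 = 1

# Y = X + N_g with X = ±1 equiprobable is a two-component Gaussian mixture;
# integrate -p(y) log2 p(y) on a fine grid for its differential entropy.
y = np.linspace(-12, 12, 400001)
phi = lambda t: np.exp(-t**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
p_y = 0.5 * phi(y - 1) + 0.5 * phi(y + 1)
h_y = -(p_y * np.log2(p_y)).sum() * (y[1] - y[0])
h_n = 0.5 * np.log2(2 * np.pi * np.e * sigma2)   # h(N_g) in closed form

i_binary = h_y - h_n
assert i_binary < c_gauss    # Gaussian input is strictly better
```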
Proof by direct calculation

Define \(Y_g=X_g+N_g\) and the conditional divergence \[\begin{align} f(x) &= D(P_{Y_g|X_g=x} \| P_{Y_g}) = D(\mathcal N(x, \sigma_N^2) \| \mathcal N(0, \sigma_X^2 + \sigma_N^2)) \\ &= \dfrac 1 2 \log \left(1 + \dfrac{\sigma_X^2}{\sigma_N^2}\right) + \dfrac{\log e}{2} \dfrac{x^2 - \sigma_X^2}{\sigma_X^2 + \sigma_N^2} \end{align}\]

  1. By definition 4.1 of mutual information, \(I(X_g; X_g+N_g) = \mathbb E[f(X_g)] = C\).
  2. By the center of gravity formula 5.1, \[ I(X; X+N_g) \leq D(P_{Y|X} \| P_{Y_g} |P_X) = \mathbb E[f(X)] \leq C \] where WLOG \(\mathbb E[X]=0\) (mutual information is shift-invariant), so \(\mathbb E[X^2] = \mathrm{Var}(X) \leq \sigma_X^2\). By the uniqueness of the caod 6.1, equality forces \(P_Y=P_{Y_g}\implies X\sim \mathcal N(0, \sigma_X^2)\).
  3. Let \(P_{Y|X_g}\) denote the kernel \(Y=X_g+N\); apply theorem 5.4 and Bayes' formula \[\begin{align} I(X_g; Y=X_g+N) &\geq \mathbb E\log \dfrac{dP_{X_g|Y_g}(X_g|Y)}{dP_{X_g}(X_g)} = \mathbb E\log \dfrac{dP_{Y_g|X_g}(Y|X_g)}{dP_{Y_g}(Y)} \\ &= C + \dfrac{\log e}{2} \mathbb E\left[ \dfrac{Y^2}{\sigma_X^2 + \sigma_N^2} - \dfrac{N^2}{\sigma_N^2} \right] \end{align}\] The second term is nonnegative: writing \(t=\mathbb E[N^2]\leq \sigma_N^2\), the orthogonality \(\mathbb E[X_gN]=0\) gives \(\mathbb E[Y^2]=\sigma_X^2+t\) and \((\sigma_X^2+t)/(\sigma_X^2+\sigma_N^2)\geq t/\sigma_N^2\). The first inequality is saturated when \(P_{X_g|Y} = P_{X_g|Y_g}\), i.e. \(X_g\) must be conditionally Gaussian given \(Y\). This is enough to imply that \(Y\) (hence \(N\)) is Gaussian.
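The closed form for \(f(x)\) above can be sanity-checked by direct numerical integration of the divergence integrand (in bits, so the \(\log e\) prefactor becomes \(\log_2 e\); the variances below are arbitrary test values).

```python
import numpy as np

s_x2, s_n2 = 2.0, 0.5   # test values of sigma_X^2, sigma_N^2

def f_closed(x):
    """The stated closed form of D(N(x, s_n2) || N(0, s_x2 + s_n2)), bits."""
    return 0.5 * np.log2(1 + s_x2 / s_n2) \
        + np.log2(np.e) / 2 * (x**2 - s_x2) / (s_x2 + s_n2)

def f_numeric(x):
    """Same divergence via a Riemann sum of p*log2(p/q) on a fine grid."""
    t = np.linspace(x - 30 * np.sqrt(s_n2), x + 30 * np.sqrt(s_n2), 400001)
    p = np.exp(-(t - x)**2 / (2 * s_n2)) / np.sqrt(2 * np.pi * s_n2)
    q = np.exp(-t**2 / (2 * (s_x2 + s_n2))) / np.sqrt(2 * np.pi * (s_x2 + s_n2))
    return float((p * np.log2(p / q)).sum() * (t[1] - t[0]))

for x in (0.0, 1.0, -2.5):
    assert abs(f_closed(x) - f_numeric(x)) < 1e-4
```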
Proof by symmetry and tensorization

Proof of (1), (2): for \(Z^n\sim \mathcal N(0, I_n)\) (so \(\sigma_N^2=1\)), consider the following \(n\)-letter optimization. Apply equation (7.1) to obtain \[ \max_{\mathbb E[\sum X_k^2]\leq n \sigma_X^2} I(X^n; X^n+Z^n) = \max_{\mathbb E[\sum X_k^2]\leq n \sigma_X^2} \sum_{k=1}^n I(X_k; X_k+Z_k) \] Next apply an averaging argument: given a distribution \(P_{X^n}\), the average of its marginals \(\bar P_X = \frac 1 n \sum P_{X_k}\) also satisfies the power constraint and yields mutual information no smaller, by the concavity of \(I(P_X, P_{Y|X})\) in \(P_X\) (theorem 6.3). Thus \[ \max_{\mathbb E[\sum X_k^2]\leq n \sigma_X^2} I(X^n; X^n+Z^n) = n \max_{\mathbb E[X^2] \leq \sigma_X^2} I(X; X+Z) \] Next, for any orthogonal transformation \(U\in O(n)\), we have \[ I(P_{X^n}, P_{Y^n|X^n}) = I(P_{UX^n}, P_{UY^n|UX^n}) = I(P_{UX^n}, P_{Y^n|X^n}) \] Averaging over orthogonal rotations \(U\) can again only increase the mutual information, so the optimal input distribution can be chosen invariant under orthogonal transformations. The only product distribution satisfying the power constraint and rotational symmetry is the isotropic Gaussian, so \(P_{Y^n} = \mathcal N(0, 1+\sigma_X^2)^{\otimes n}\). This also determines \(P_{X^*}\) uniquely.

Proof of (3): apply equation (7.2) and use the convexity of \(P_{Y|X}\mapsto I(P_X, P_{Y|X})\) together with the same averaging and rotation arguments to conclude that the minimizing noise must be additive and rotationally symmetric, hence Gaussian.

Information rates

See the preliminaries section for more details on stochastic processes.

Definition 7.1 (entropy rate) The entropy rate of a random process \(\mathbb X=(X_1, X_2, \cdots)\) is \[ H(\mathbb X) = \lim_{n\to \infty} \dfrac 1 n H(X^n) \]

Theorem 7.3 (properties of stationary processes) Given \(\mathbb X\) stationary,

  1. \(H(X_n|X^{n-1}) \leq H(X_{n-1}|X^{n-2})\).
  2. \(H(X_n|X^{n-1}) \leq H(X^n)/n \leq H(X^{n-1})/(n-1)\).
  3. The entropy rate exists: \(H(\mathbb X) = \lim_{n\to \infty} H(X^n)/n = \lim_{n\to \infty} H(X_n|X^{n-1})\), with both sequences decreasing to the limit.
  4. If \(\mathbb X\) can be extended to a \(\mathbb Z\)-indexed stationary process, then \(H(\mathbb X) = H(X_1|X^0_{-\infty})\) provided \(H(X_1)<\infty\).
Proof For (1), by conditioning and stationarity \[ H(X_n|X^{n-1}) \leq H(X_n|X_2^{n-1}) = H(X_{n-1}|X^{n-2}) \] For (2), use the chain rule \[\begin{align} \dfrac 1 n H(X^n) &= \dfrac 1 n \sum H(X_j|X^{j-1}) \geq H(X_n|X^{n-1}) \\ H(X^n) &= H(X^{n-1}) + H(X_n|X^{n-1}) \leq H(X^{n-1}) + \dfrac 1 n H(X^n) \end{align}\] For (3), use the right inequality in (2) to argue that \(H(X^n)/n\) is a decreasing sequence lower-bounded by \(0\), so the limit \(H(\mathbb X)\) exists. By the chain rule, \(H(X^n)/n = \frac 1 n \sum H(X_j|X^{j-1})\) is the Cesàro mean of the decreasing sequence \(H(X_n|X^{n-1})\), so both converge to the same limit. For (4), use the limit definition of \(H(\mathbb X)\) and pass the limit through the monotone convergence of mutual information: \[ \lim_{n\to \infty} H(X_1) - H(X_1|X_{-n}^0) = \lim_{n\to \infty} I(X_1; X_{-n}^0) = I(X_1; X_{-\infty}^0) = H(X_1) - H(X_1|X_{-\infty}^0) \]

Examples of stationary processes:

  1. Memoryless source (i.i.d)
  2. A mixed source: given two stationary sources \(\mathbb X, \mathbb Y\) and flip a single biased coin to set \(\mathbb Z=\mathbb X\) or \(\mathbb Y\).
  3. Stationary Markov process: let \(\mathbb X\) be a Markov chain \(X_1\to X_2\to \cdots\) with transition kernel \(K(b|a)\) and initialized with an invariant distribution \(X_1\sim \mu\) so that \(X_2\sim \mu\). Then \(H(X_n|X^{n-1}) = H(X_n|X_{n-1})\) and \[ H(\mathbb X) = H(X_2|X_1) = \sum_{a,b}\mu(a)K(b|a) \log \dfrac 1 {K(b|a)} \]
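The Markov-chain formula for \(H(\mathbb X)\) can be checked against the definition by exact enumeration; the two-state kernel and invariant law below are hypothetical example values, not from the text.

```python
import itertools
import numpy as np

K = np.array([[0.9, 0.1],    # K[a, b] = K(b|a)
              [0.3, 0.7]])
mu = np.array([0.75, 0.25])  # invariant distribution: mu @ K = mu
assert np.allclose(mu @ K, mu)

def h2(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * np.log2(p) for p in probs if p > 0)

# Closed form: H(X) = H(X2|X1) = sum_a mu(a) H(K(.|a)).
rate = sum(mu[a] * h2(K[a]) for a in range(2))

def block_entropy(n):
    """H(X^n), by enumerating all 2^n sample paths."""
    probs = []
    for path in itertools.product((0, 1), repeat=n):
        p = mu[path[0]]
        for a, b in zip(path, path[1:]):
            p *= K[a, b]
        probs.append(p)
    return h2(probs)

n = 12
upper = block_entropy(n) / n                    # H(X^n)/n decreases to the rate
cond = block_entropy(n) - block_entropy(n - 1)  # H(X_n|X^{n-1})
assert cond <= upper
assert abs(cond - rate) < 1e-9   # equals the rate exactly for a Markov chain
```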

Definition 7.2 (Hidden Markov Model (HMM)) Given a stationary Markov chain \(\cdots, S_{-1}, S_0, S_1, \cdots\) on state space \(\mathcal S\) and a Markov kernel \(P_{X|S}:\mathcal S\to \mathcal X\), an HMM is the stationary process \(\cdots, X_{-1}, X_0, X_1, \cdots\) obtained by observing \(\mathbb S\) through the stationary memoryless (emission) channel \(P_{X_1|S_1}\).

Proposition 7.1 (entropy rate of HMM) Given an HMM process \(\mathbb X\) with state process \(\mathbb S\), we have \[ H(X_n|X_1^{n-1}, S_0) \leq H(\mathbb X)\leq H(X_n|X_1^{n-1}) \] Both sides converge monotonically to \(H(\mathbb X)\) as \(n\to \infty\).

Proof The upper bound is established by theorem 7.3. To establish the lower bound, consider \[\begin{align} H(\mathbb X) &= H(X_n | X_{-\infty}^{n-1}) \geq H(X_n | X_{-\infty}^{n-1}, S_0) = H(X_n|X_1^{n-1}, S_0) \end{align}\] where the last equality follows by the HMM assumption: given \(S_0\), the past \((X_{-\infty}^0, S_{-\infty}^{-1})\) is independent of the future. To show that the lower bound is increasing in \(n\), \[\begin{align} H(X_{n+1} | X_1^n, S_0) &= H(X_n | X_0^{n-1}, S_{-1}) \geq H(X_n | X_0^{n-1}, S_{-1}, S_0) \\ &= H(X_n | X_1^{n-1}, S_0) \end{align}\] Finally, to show that both bounds converge to the same limit, use the mutual information chain rule (theorem 4.2): \[\begin{align} I(S_0; X_1^n) - I(S_0; X_1^{n-1}) &= I(X_n; S_0 | X_1^{n-1}) \\ &= H(X_n|X_1^{n-1}) - H(X_n | X_1^{n-1}, S_0)\to 0 \end{align}\] since \(I(S_0; X_1^n)\) is increasing in \(n\) and bounded above by \(H(S_0) < \infty\), hence convergent, so its increments vanish.
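Proposition 7.1 can be seen numerically on a toy binary HMM computed exactly via the forward recursion (the kernel, invariant law, and flip probability below are illustrative assumptions, not from the text).

```python
import itertools
import numpy as np

K = np.array([[0.9, 0.1], [0.3, 0.7]])   # hidden-state kernel K[s, s']
mu = np.array([0.75, 0.25])              # invariant distribution: mu @ K = mu
eps = 0.2
E = np.array([[1 - eps, eps],            # emission E[s, x] = P(X = x | S = s)
              [eps, 1 - eps]])

def seq_prob(x_seq, init):
    """P(X_1^n = x_seq) with S_1 ~ init, via the forward recursion."""
    alpha = init * E[:, x_seq[0]]
    for x in x_seq[1:]:
        alpha = (alpha @ K) * E[:, x]
    return alpha.sum()

def block_entropy(n, init):
    """H(X^n) with S_1 ~ init, by enumerating all 2^n observation sequences."""
    probs = [seq_prob(x, init) for x in itertools.product((0, 1), repeat=n)]
    return -sum(p * np.log2(p) for p in probs if p > 0)

n = 10
# Upper bound H(X_n | X_1^{n-1}) and lower bound H(X_n | X_1^{n-1}, S_0):
upper = block_entropy(n, mu) - block_entropy(n - 1, mu)
lower = sum(mu[s] * (block_entropy(n, K[s]) - block_entropy(n - 1, K[s]))
            for s in range(2))           # conditioning on S_0 = s sets S_1 ~ K[s]
assert lower <= upper                    # the sandwich of proposition 7.1
```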

Definition 7.3 (mutual information rate) The mutual information rate between two processes \((\mathbb X, \mathbb Y)\) is \[ I(\mathbb X; \mathbb Y) = \lim_{n\to \infty} \dfrac 1 n I(X^n; Y^n) \]