3 Kullback-Leibler Divergence
Divergence is a fundamental object in information theory. We begin with KL divergence and later proceed to more general \(f\)-divergences. Shannon entropy is related to KL divergence by proposition 3.3. This relation sheds light on the following correspondences:
- Entropy subadditivity \(H(A, B)\leq H(A)+H(B)\) corresponds to super-additivity of divergence when the sampling distribution is factorizable \[ D(P_{AB} \|Q_AQ_B) \geq D(P_AP_B \| Q_AQ_B) = D(P_A\|Q_A) + D(P_B \|Q_B) \] with equality iff \(P_{AB} = P_AP_B\) (special case of full chain rule 3.5).
- Conditional factorization: the definition of \(H(Y|X)\) satisfying \[ H(Y|X) = H(X, Y) - H(X) \] corresponds to the conditional factorization of divergence in 3.2 \[ D(P_{Y|X} \| Q_{Y|X} \, | \, P_X) = D(P_{XY}\|Q_{XY}) - D(P_X \| Q_X) \]
- Conditioning reduces entropy (recall the equivalence of entropy strong subadditivity and submodularity) \[ H(A|B, C) \leq H(A|B) \] corresponds to conditioning increasing divergence \[ D(P_Y \| Q_Y) \leq D(P_{Y|X} \| Q_{Y|X}\, |\, P_X) \] with equality iff \(D(P_{X|Y} \| Q_{X|Y} \, | \, P_Y) = 0\).
Some other ideas:
- \(D(P\|Q)\) is the un-likelihood of sampling \(P\) from \(Q\).
- It’s insightful to view \(P_{XY}\) as generative processes \(P_{Y|X}P_X\) by disintegration.
- Markov-kernels are randomized functions. A standalone distribution is a randomized function from a singleton domain.
- Discrete Markov kernels are simultaneously transition matrices and functions. Composition \(K_{A|B}\circ K_{B|C} = K_{A|C}\) corresponds to the matrix product (einsum) \(\sum_b K^{A|B}_{ab} K^{B|C}_{bc} = K^{A|C}_{ac}\). Multiplication \(K_{A|B} K_{B|1} = K_{AB|1}\) corresponds to the einsum without contraction \(K^{A|B}_{ab} K^{B|1}_{b} \mapsto K^{AB|1}_{ab}\). In the function picture, composition is function composition \(f, g\mapsto (x\mapsto f(g(x)))\), while multiplication corresponds to a more complicated operation.
- The singular most important tool for divergence is chain rule 3.2.
- To produce conditional-quantity bounds, consider the joint quantity and decompose in \(2\) different ways; this underpins the proof of KL-DPI and monotonicity of divergence under conditioning.
- The large deviations estimate 3.3 demonstrates how KL divergence gives a uniform (across all events) bound on how much an event's probability can differ between the two distributions.
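The transition-matrix picture of kernel composition and multiplication above can be sketched with `np.einsum` (a NumPy sketch; the shapes and names are illustrative, with kernels stored column-stochastically as \(K[a, b] = K(a|b)\)):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_kernel(n_out, n_in, rng):
    """Random column-stochastic matrix K[a, b] = K(a | b)."""
    K = rng.random((n_out, n_in))
    return K / K.sum(axis=0, keepdims=True)

K_AB = random_kernel(2, 3, rng)   # kernel B -> A
K_BC = random_kernel(3, 4, rng)   # kernel C -> B

# Composition K_{A|C} = K_{A|B} o K_{B|C}: contract the shared index b.
K_AC = np.einsum("ab,bc->ac", K_AB, K_BC)
assert np.allclose(K_AC.sum(axis=0), 1.0)  # still a kernel

# Multiplication (joint): no contraction -- b stays as an output index.
K_ABC = np.einsum("ab,bc->abc", K_AB, K_BC)
# Marginalizing out b recovers the composition (partial trace).
assert np.allclose(K_ABC.sum(axis=1), K_AC)
```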
Definition
Definition 3.1 (KL-Divergence) Given distributions \(P, Q\) on \(\mathcal A\) with \(Q\) being the reference measure, the (Kullback-Leibler) divergence (relative entropy) between \(P, Q\) is \[ D(P\|Q) = \begin{cases} \mathbb E_Q \left[d_QP \log d_QP\right] & P \ll Q \\ +\infty & \text{otherwise} \end{cases} \] By a change of measure, the first quantity is equivalent to \[ \mathbb E_Q \left[d_QP \log d_QP\right] = \mathbb E_P\left[ \log d_QP \right] \] Here \(d_QP\) is the Radon-Nikodym derivative \(\mathrm dP/\mathrm dQ\), which in the case of a discrete alphabet is the ratio \(P(x)/Q(x)\). The relation \(P\ll Q\) is read as “\(P\) is absolutely continuous w.r.t. \(Q\)”.
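For a discrete alphabet, definition 3.1 reduces to a finite sum, sketched below (a NumPy sketch; `kl_divergence` is an illustrative name, and divergences are in nats):

```python
import numpy as np

def kl_divergence(p, q):
    """D(P||Q) in nats for discrete distributions p, q (arrays summing to 1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.any((q == 0) & (p > 0)):   # P not absolutely continuous w.r.t. Q
        return np.inf
    mask = p > 0                     # 0 log 0 = 0 convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))       # > 0, since P != Q
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))       # 0.0
print(kl_divergence([1.0, 0.0], [0.0, 1.0]))       # inf: P not << Q
```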
Theorem 3.1 (information inequality) For all \(P\ll Q\), \(D(P\|Q) \geq 0\), with equality iff \(P=Q\).
Proof: Applying Jensen’s inequality to the convex function \(\varphi(x)=x\log x\): \[\begin{align} D(P\|Q) = \mathbb E_Q[\varphi(d_QP)] \geq \varphi\left(\mathbb E_Q[d_QP]\right) = \varphi(1) = 0 \end{align}\] Since \(\varphi\) is strictly convex, equality holds iff \(d_QP = 1\) \(Q\)-a.e., i.e. \(P = Q\).
The following corollary shows that minimizing divergence recovers the true distribution.
Corollary 3.1 (minimal cross-entropy recovers true distribution) Let \(X\) be any discrete random variable with \(H(X)<\infty\). Then \[ \min_Q \mathbb E_{X\sim P_X} \left[\log \dfrac 1 {Q(X)} \right] = H(X) \] where \(Q\) ranges over valid probability distributions. The unique minimizer is \(Q=P_X\).
Proof: Using the previous theorem: \[\begin{align} \mathbb E_{P_X} \left[\log \dfrac 1 {Q(X)}\right] - H(X) &= \mathbb E_{P_X} \left[\log \dfrac{P_X(X)} {Q(X)}\right] = D(P_X\|Q) \geq 0 \end{align}\] with equality iff \(Q = P_X\).
Remark (perspective on KL-divergence). One should think of \(D(P\|Q)\) as the un-likelihood of producing the “candidate” distribution \(P\) by sampling from \(Q\).
Definition 3.2 (binary divergence) Binary divergence is defined by \[ d(p, q) = D(\mathrm{Ber}_p\| \mathrm{Ber}_q) = p \log \dfrac p q + \bar p \log \dfrac{\bar p}{\bar q} \]
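A minimal implementation of binary divergence, handling the \(0\log 0 = 0\) convention (a sketch; `binary_kl` is an illustrative name, divergences in nats):

```python
import numpy as np

def binary_kl(p, q):
    """d(p, q) = D(Ber_p || Ber_q) in nats, with the 0 log 0 = 0 convention."""
    total = 0.0
    for a, b in ((p, q), (1 - p, 1 - q)):
        if a > 0:
            if b == 0:
                return np.inf    # absolute continuity fails
            total += a * np.log(a / b)
    return total

print(binary_kl(0.5, 0.1))  # > 0
print(binary_kl(0.3, 0.3))  # 0.0
```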
Differential entropy
Differential entropy extends the divergence formalism by allowing a non-probability reference measure; it lacks many of the desirable properties of divergence, however (in particular, it is not invariant under bijective transforms).
Definition 3.3 (differential entropy) The differential entropy of a random vector \(X\) is \[ h(X) = -D(P_X\|\mathrm{Leb}) \] where \(\mathrm{Leb}\) is the Lebesgue measure (just think of it as the constant \(1\) everywhere). In particular, if \(X\) has pdf \(P_X\), then \[ h(X) = \mathbb E_{P_X}\left[\log \dfrac 1 {P_X(X)}\right] \]
Theorem 3.2 (properties of differential entropy)
- Uniform distribution maximizes \(h\): given \(\mathrm{Pr}(X\in S)=1\) for \(S\subset \mathbb R^n\), then \(h(X) \leq \log \mathrm{Leb}(S)\), with equality iff \(X\) is uniform on \(S\).
- Linear transform: \(h(AX+c) = h(X) + \log|\det A|\) for invertible \(A\).
- Conditioning reduces entropy: \(h(X|Y) \leq h(X)\).
Proposition 3.1 (differential entropy of Gaussian) Recall the multivariate Gaussian pdf: \[ f(x) = \dfrac{1}{\sqrt{(2\pi )^d |\Sigma|}} \exp \left[-\dfrac 1 2 (x-\mu)^T \Sigma^{-1}(x-\mu)\right] \] the differential entropy of a multivariate Gaussian \(X=\mathcal N(\mu, \Sigma)\) is \[ h(X) = \dfrac 1 2 \log[(2\pi e)^d |\Sigma|] \]
Proof: Direct computation: the logarithm of the constant term yields \(\dfrac 1 2 \log[(2\pi )^d |\Sigma|]\), while the exponential term yields \[\begin{align} \mathbb E\left[ (X-\mu)^T \Sigma^{-1}(X-\mu) \right] &= \mathbb E\,\mathrm{tr}\left[ (X-\mu)(X-\mu)^T \Sigma^{-1} \right]\\ &= \mathrm{tr}\left( \mathbb E[(X-\mu)(X-\mu)^T]\Sigma^{-1} \right) = \mathrm{tr}(I) = d \end{align}\] contributing \(\dfrac d 2 \log e\); summing the two gives the claim.
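The closed form can be sanity-checked against a Monte Carlo estimate of \(-\mathbb E[\log f(X)]\) (a NumPy sketch; the covariance below is an arbitrary positive-definite example):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
A = rng.random((d, d))
Sigma = A @ A.T + d * np.eye(d)      # an arbitrary positive-definite covariance

# Closed form: h = (1/2) log((2 pi e)^d det Sigma)   (nats)
h_closed = 0.5 * np.log((2 * np.pi * np.e) ** d * np.linalg.det(Sigma))

# Monte Carlo: h = -E[log f(X)] under X ~ N(0, Sigma)
x = rng.multivariate_normal(np.zeros(d), Sigma, size=200_000)
Sinv = np.linalg.inv(Sigma)
log_f = (-0.5 * np.einsum("ni,ij,nj->n", x, Sinv, x)
         - 0.5 * np.log((2 * np.pi) ** d * np.linalg.det(Sigma)))
h_mc = -log_f.mean()

assert abs(h_closed - h_mc) < 0.05   # agreement up to Monte Carlo error
```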
Recall that the positive semidefinite (Loewner) order on symmetric matrices is \[ A\preceq B \iff B - A\text{ positive semidefinite.} \]
Theorem 3.3 (entropy extremality under covariance constraint) For any \(d\times d\) covariance \(\Sigma\), differential entropy is saturated by multivariate Gaussian \[ \max_{\mathrm{Cov}(X) \preceq \Sigma} h(X) = h(\mathcal N(0, \Sigma)) = \dfrac 1 2 \log[(2\pi e)^d |\Sigma|] \] Expected power constraint is saturated by independent Gaussian \[ \max_{\mathbb E[\|X\|^2] \leq a} h(X) = h\left(\mathcal N\left(0, \dfrac a d I_d\right)\right) = \dfrac d 2 \log \dfrac{2\pi e a}{d} \]
Proof: let \(\mathbb E[X]=0\) w.l.o.g. and \(X_G = \mathcal N(0, \Sigma)\) with pdf \(P_G\); apply the information inequality \[\begin{align} 0 &\leq D(P_X \| P_G) = \mathbb E_P \left[\log \dfrac{P_X(X)}{P_G(X)}\right] = \mathbb E_P \left[\log P_X(X)\right] - \mathbb E_P \left[\log P_G(X)\right] \\ &= -h(X) + \dfrac 1 2 \log[(2\pi)^d |\Sigma|] + \dfrac{\log e}{2} \mathbb E_P[X^T\Sigma^{-1}X] \leq -h(X) + h(X_G) \end{align}\] On the last step, we apply the usual trick \(\mathbb E_P[X^T\Sigma^{-1}X] = \mathrm{tr}\left( \mathbb E[XX^T]\Sigma^{-1} \right) = \mathrm{tr}(\Sigma_X \Sigma^{-1}) \leq \mathrm{tr}(\Sigma\Sigma^{-1}) = d\), using \(\Sigma_X \preceq \Sigma\).
Corollary 3.2 \(\Sigma \mapsto \log \det \Sigma = \mathrm{tr}\log \Sigma\) is concave on the space of real positive definite \(n\times n\) matrices.
Proof: Let \(\lambda\sim \mathrm{Ber}(1/2)\) and let \(X = \lambda X_1 + \bar \lambda X_2\) be the mixture with \(X_i \sim \mathcal N(0, \Sigma_i)\); concavity follows from \(h(X|\lambda) \leq h(X) \leq h\left(\mathcal N\left(0, \tfrac 1 2 (\Sigma_1 + \Sigma_2)\right)\right)\), where the second inequality is theorem 3.3 applied with \(\mathrm{Cov}(X) = \tfrac 1 2 (\Sigma_1+\Sigma_2)\).
Channel, conditional divergence
Information theorists see Markov kernels everywhere!
Definition 3.4 (Markov kernel (channel)) A Markov kernel is a function \(K(-|-)\), whose first argument is a measurable subset of \(\mathcal Y\) and the second an element of \(\mathcal X\) such that
- \(E\to K(E|x)\) is a probability measure for every \(x\in \mathcal X\).
- \(x\to K(E|x)\) is measurable.
A Markov kernel formalizes the concept of a randomized function: an input \(x\) results in a distribution \(K(\cdot|x)\) on \(\mathcal Y\) rather than a single output.
Common operations include:
- Joint multiplication: Maps \(P_{Y|X} P_X\mapsto P_{X, Y}\).
- Composition: given a probability distribution \(P_X\) on \(\mathcal X\) and \(K:\mathcal X\to \mathcal Y\), the output distribution \(K\circ P_X\) on \(\mathcal Y\) is \[ (K\circ P_X)(E) = \int K(E|x)\, P_X(dx) \] This corresponds to matrix-vector multiplication in the transition-matrix picture and function composition in the function picture. It is also the marginalization (partial trace) of joint multiplication.
- (Tensor / Cartesian) product: \(P_X, P_Y\mapsto P_X \times P_Y\). We also overload this with joint multiplication: for example: \(P_{Y|X}P_{Z|X}P_X\) should be understood as \((P_{Y|X}\times P_{Z|X})P_X\).
- Disintegration (standard Borel spaces): every \(P_{X, Y}\) on \(\mathcal X\times \mathcal Y\) can be decomposed into \[ P_{X, Y} = P_X\times P_{Y|X} \]
Definition 3.5 (Binary symmetric channel) The channel \(\mathrm{BSC}_\delta : \{0, 1\}\to \{0, 1\}\) is defined as \[ Y = (X + Z)\mod 2, \quad Z\sim \mathrm{Ber}(\delta) \] This has the transition matrix in basis \(\{0, 1\}\): \[ \begin{pmatrix} 1 - \delta & \delta \\ \delta & 1 - \delta \end{pmatrix} \]
Matrix multiplication shows that \(\mathrm{BSC}_\delta \circ \mathrm{BSC}_\delta = \mathrm{BSC}_{2\delta \bar \delta} = \mathrm{BSC}_{\delta \ast \delta}\), where \(\ast\) denotes binary convolution \(p\ast q = p\bar q + \bar p q\).
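This composition identity is a one-line matrix check (NumPy sketch):

```python
import numpy as np

def bsc(delta):
    """Transition matrix of BSC_delta in the basis {0, 1}."""
    return np.array([[1 - delta, delta],
                     [delta, 1 - delta]])

delta = 0.1
composed = bsc(delta) @ bsc(delta)
# Composing two BSCs gives BSC with crossover 2 * delta * (1 - delta).
assert np.allclose(composed, bsc(2 * delta * (1 - delta)))
```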
We will next see that the joint divergence can be written as the sum of marginal and conditional divergences; the latter is defined below:
Definition 3.6 (Conditional divergence) Given a distribution \(P_X\) and two Markov kernels \(P_{Y|X}, Q_{Y|X}\), the divergence between \(P, Q\) given \(P_X\) is defined as \[ D(P_{Y|X}\|Q_{Y|X}\,|\, P_X) \equiv \mathbb E_{x_0\sim P_X} \left[ D(P_{Y|X=x_0} \| Q_{Y|X=x_0}) \right] = \mathbb E_{X, Y\sim P_{Y|X}P_X} \left[\log \dfrac{P_{Y|X}}{Q_{Y|X}}\right] \]
Chain Rule, DPI
Proposition 3.2 (chain rule)
The joint divergence is (1) the sum of the marginal and conditional divergences, and (2) the sum of the marginal divergence and the joint divergence upon replacing the source of \(Q_{Y|X}\) with \(P_X\): \[ D(P_{X, Y}\|Q_{X, Y}) = D(P_X\| Q_X) + D(P_{Y|X}\|Q_{Y|X} \, |P_X) = D(P_X\| Q_X) + D(P_{XY} \| Q_{Y|X}P_X) \]
Proof: Expand the definition and factor joint into conditionals \[\begin{align} D(P_{X, Y}\|Q_{X, Y}) &= \mathbb E_P\left[\log \dfrac{P_{XY}}{Q_{XY}}\right] = \mathbb E_P\left[\log \dfrac{P_{Y|X}P_X}{Q_{Y|X}Q_X}\right] \\ &= \mathbb E_P \left[\log \dfrac{P_{Y|X}}{Q_{Y|X}}\right] + \mathbb E_P \left[\log \dfrac{P_X}{Q_X}\right] \\ &= D(P_{Y|X}\|Q_{Y|X}\, | P_X) + D(P_X\|Q_X) \end{align}\] The second equality follows from the first equality (see property 1 below).
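The chain rule can be verified numerically on a random discrete joint (a NumPy sketch; the names are illustrative):

```python
import numpy as np

def kl(p, q):
    """D(P||Q) in nats for discrete distributions (flattened)."""
    p, q = np.asarray(p, float).ravel(), np.asarray(q, float).ravel()
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

rng = np.random.default_rng(1)
P = rng.random((3, 4)); P /= P.sum()     # joint P_{XY}, rows indexed by x
Q = rng.random((3, 4)); Q /= Q.sum()     # joint Q_{XY}

Px, Qx = P.sum(1), Q.sum(1)                           # marginals on X
P_cond, Q_cond = P / Px[:, None], Q / Qx[:, None]     # kernels P_{Y|X}, Q_{Y|X}

# D(P_{Y|X} || Q_{Y|X} | P_X) = sum_x P_X(x) D(P_{Y|X=x} || Q_{Y|X=x})
d_cond = sum(Px[i] * kl(P_cond[i], Q_cond[i]) for i in range(3))

# Chain rule: joint = marginal + conditional
assert np.isclose(kl(P, Q), kl(Px, Qx) + d_cond)
```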
Remark (care in factoring Q). In the expression above, note that \(Q_{XY}=Q_{Y|X}Q_X\) must be factored according to its own marginal and kernel, not those of \(P\). This is important in the proof of the Kolmogorov identities in theorem 4.2, where the sampling distribution \(P_{XY}P_Z\), factored against \(XZ\), has conditional kernel \((P_{XY}P_Z)_{Y|XZ} = P_{Y|X}\).
Almost every property of Shannon entropy has a counterpart in KL-divergence. The following relation provides some intuition as to why:
Proposition 3.3 (entropy-divergence relation) Given a distribution \(P\) supported on a finite set \(\mathcal A\), \[ H(P) = \log |\mathcal A| - D(P\|U_{\mathcal A}) \] where \(U_{\mathcal A}\) is the uniform distribution on \(\mathcal A\). Given \(X, Y\) on \(\mathcal A, \mathcal B\), the conditional entropy is written as \[ H(Y|X) = \log |\mathcal B| - D(P_{Y|X} \| U_{\mathcal B} \,|\, P_X) \]
Proof: The first claim follows by direct computation: let \(n=|\mathcal A|\), then \[ \log |\mathcal A| - D(P\|U_{\mathcal A}) = \log n - \mathbb E_{x\sim P} \left[\log n P(x)\right] =\mathbb E_{x\sim P} \left[\log \dfrac 1 {P(x)}\right] =H(P) \] Let \(m=|\mathcal B|\); the second claim follows from the chain rule 3.2: \[\begin{align} H(Y|X) &= H(X, Y) - H(X)\\ &= \log (nm) - \log n - \left[ D(P_{XY} \| U_{\mathcal A\times \mathcal B}) - D(P_X \| U_{\mathcal A}) \right]\\ &= \log |\mathcal B| - D(P_{Y|X} \| U_{\mathcal B}\, |\, P_X) \end{align}\]
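A quick numeric check of the entropy-divergence relation (NumPy sketch, in nats):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
p = rng.random(n); p /= p.sum()          # a random distribution on n points
u = np.full(n, 1 / n)                    # uniform reference

H = -np.sum(p * np.log(p))               # Shannon entropy H(P)
D = np.sum(p * np.log(p / u))            # D(P || U)

# H(P) = log |A| - D(P || U_A)
assert np.isclose(H, np.log(n) - D)
```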
Theorem 3.4 (properties of divergence) Given standard Borel \(\mathcal X, \mathcal Y\), then
- Unconditional expression for divergence
\(D(P_{Y|X}\|Q_{Y|X}\, |\, P_X) = D(P_{Y|X}P_X \| Q_{Y|X} P_X)\).
- Monotonicity: \(D(P_{X, Y}\|Q_{X, Y}) \geq D(P_Y\|Q_Y)\) (follows from chain rule + information inequality).
- Full chain rule: \(D(P_{X_1\cdots X_n} \| Q_{X_1\cdots X_n}) = \sum_{j=1}^n D( P_{X_j|X_1, \cdots, X_{j-1}} \| Q_{X_j|X_1, \cdots, X_{j-1}} \, | \, P_{X_1, \cdots, X_{j-1}} )\).
- Tensorization : \(D\left(\prod P_{X_j} \| \prod Q_{X_j} \right) = \sum D(P_{X_j} \| Q_{X_j})\)
Proof: The unconditional expression follows from chain rule 3.2 applied with \(Q_X = P_X\): \[ D(P_{Y|X} P_X\|Q_{Y|X} P_X) = D(P_X\|P_X) + D(P_{Y|X}\|Q_{Y|X}\, |\, P_X) = D(P_{Y|X}\|Q_{Y|X}\, |\, P_X) \] The full chain rule follows from inductive application of the chain rule. For tensorization, see proposition 3.5 below.
Proposition 3.4 (conditioning increases divergence) Given \(P_{Y|X}, Q_{Y|X}, P_X\), we have \[ D(P_{Y|X}\circ P_X \| Q_{Y|X} \circ P_X) \leq D(P_{Y|X} \| Q_{Y|X} \, | \, P_X) = D(P_{Y|X} P_X \| Q_{Y|X} P_X) \] Equality is saturated iff \(D(P_{X|Y} \| Q_{X|Y} \, |\, P_Y) = 0\).
Proof: Written in this form, this is apparent since the composition \(A\circ B\) discards information present in the joint \(AB\). To see this explicitly, let \[ P_Y = P_{Y|X} \circ P_X, \quad Q_Y = Q_{Y|X} \circ P_X, \quad P_{XY} = P_{Y|X}P_X, \quad Q_{XY} = Q_{Y|X}P_X \] Using the chain rule in both directions yields \[\begin{align} D(P_{XY} \| Q_{XY}) &= D(P_{X|Y} \| Q_{X|Y} \, | \, P_Y) + D(P_Y \| Q_Y) \\ &= D(P_{Y|X} \| Q_{Y|X} \, | \, P_X) + D(P_X \| P_X)_{=0} \end{align}\] Since \(D(P_{X|Y} \| Q_{X|Y} \,|\, P_Y) \geq 0\), the inequality follows, with equality iff that term vanishes.
Proposition 3.5 (chain rule: independent sampling distribution) Consider the chain rule with independent \(Q\)’s, then \[ D(P_{X_1\cdots X_n} \|Q_{X_1}\cdots Q_{X_n}) = D(P_{X_1\cdots X_n} \| P_{X_1}\cdots P_{X_n}) + \sum_{j=1}^n D(P_{X_j} \|Q_{X_j}) \geq \sum_{j=1}^n D(P_{X_j} \|Q_{X_j}) \] with equality saturated iff \(P\) is factorizable.
Proof: Use the second equality in the chain rule 3.2 inductively to swap out \(Q_{X_j}\).
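Proposition 3.5 can be checked numerically (NumPy sketch; the identity requires strictly positive \(Q_{X_j}\), which random draws satisfy):

```python
import numpy as np

def kl(p, q):
    """D(P||Q) in nats for discrete distributions (flattened)."""
    p, q = np.asarray(p, float).ravel(), np.asarray(q, float).ravel()
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

rng = np.random.default_rng(3)
P = rng.random((3, 3)); P /= P.sum()     # joint P_{X1 X2} (dependent in general)
q1 = rng.random(3); q1 /= q1.sum()       # independent Q_{X1}
q2 = rng.random(3); q2 /= q2.sum()       # independent Q_{X2}

p1, p2 = P.sum(1), P.sum(0)              # marginals of P

# D(P || Q1 x Q2) = D(P || P1 x P2) + D(P1 || Q1) + D(P2 || Q2)
lhs = kl(P, np.outer(q1, q2))
rhs = kl(P, np.outer(p1, p2)) + kl(p1, q1) + kl(p2, q2)
assert np.isclose(lhs, rhs)
```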
Theorem 3.5 (data-processing inequality (DPI)) Given a Markov kernel \(K:\mathcal X\to \mathcal Y\) and a chain \((P_X, Q_X) \xrightarrow{K} (P_Y, Q_Y)\) so that \(P_Y = K_{Y|X} \circ P_X, \quad Q_Y = K_{Y|X} \circ Q_X\), then processing reduces the ability to distinguish between the distributions \[ D(P_X\|Q_X) \geq D(P_Y\|Q_Y) = D(K_{Y|X} \circ P_X \| K_{Y|X} \circ Q_X) \]
Proof: Decompose the joint in two different ways: \[\begin{align} D(P_{XY}\|Q_{XY}) &= D(P_{X|Y} \| Q_{X|Y} |P_Y)_{\geq 0} + D(P_Y \| Q_Y) \\ &= D(P_{Y|X} \| Q_{Y|X}|P_X)_{=0} + D(P_X\|Q_X) \end{align}\] The second term on the second line vanishes because both joints use the same channel \(K_{Y|X}\). Using nonnegativity on the first line yields \[ D(P_Y\|Q_Y) \leq D(P_X\|Q_X) \text{ with equality } \iff D(P_{X|Y} \| Q_{X|Y} |P_Y) = 0 \] i.e. the processing inequality is saturated iff the reverse inference distributions coincide.
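The DPI is easy to check numerically for random discrete distributions pushed through a random column-stochastic kernel (NumPy sketch):

```python
import numpy as np

def kl(p, q):
    """D(P||Q) in nats for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

rng = np.random.default_rng(4)
K = rng.random((4, 3)); K /= K.sum(0)    # kernel: column K[:, x] is K(.|x)
p = rng.random(3); p /= p.sum()
q = rng.random(3); q /= q.sum()

# Processing both inputs through K cannot increase divergence.
assert kl(K @ p, K @ q) <= kl(p, q) + 1e-12
```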
Corollary 3.3 (large deviations estimate) For any subset \(E\subset \mathcal X\) we have \[ D(P_X \| Q_X) \geq d(P_X[E], Q_X[E]) \]
Proof: Apply the DPI to the binary-output channel \(X \mapsto \mathbf 1_E(X)\), which maps \(P_X, Q_X\) to \(\mathrm{Ber}_{P_X[E]}, \mathrm{Ber}_{Q_X[E]}\).
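Combining the DPI with the indicator channel gives a checkable inequality (NumPy sketch; `kl` and `binary_kl` are illustrative helpers):

```python
import numpy as np

def kl(p, q):
    """D(P||Q) in nats for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def binary_kl(p, q):
    """d(p, q) = D(Ber_p || Ber_q) in nats."""
    total = 0.0
    for a, b in ((p, q), (1 - p, 1 - q)):
        if a > 0:
            total += a * np.log(a / b)
    return total

rng = np.random.default_rng(5)
p = rng.random(6); p /= p.sum()
q = rng.random(6); q /= q.sum()
E = np.array([0, 2, 3])                  # an arbitrary event E

# Large deviations estimate: D(P||Q) >= d(P[E], Q[E]) for every event E.
assert binary_kl(p[E].sum(), q[E].sum()) <= kl(p, q) + 1e-12
```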