11 Hypothesis Testing, Large Deviations

Key takeaways:

  1. The Neyman-Pearson region is reducible to the closure of the convex hull of deterministic tests (theorem 11.1). Use the layer-cake representation to argue convexity.
  2. The Neyman-Pearson lemma fundamentally follows from the convexity of \(\mathcal R(P, Q)\) and the fact that its supporting lines are given by likelihood ratio tests.
  3. Stein’s and Chernoff’s asymptotic regimes.
  4. The cumulant-generating function shows up naturally as the normalization factor in tilting, and extremizing the tilting formula leads naturally to the rate function (Legendre conjugate of the CGF).
    • The one-parameter family of tilted distributions forms a natural exponential family.
  5. A sharper version of Chernoff’s bound for mean deviations (equation (11.1)).
  6. A crucial bridge between the large-deviation estimate and information projection (theorem 11.9).
  7. For any event \(E\) with \(P_X[E]>0\), we have \(\log \frac 1 {P_X(E)} = D(P_{X|X\in E} \| P_X)\)

Neyman-Pearson

This part corresponds to chapter 14 of the textbook.

Definition 11.1 (NP formulation of BHT) Consider two distributions \(P, Q\) on \(\Omega\), one of which generated our observation \(X\). The two hypotheses are \(H_0:X\sim P\) and \(H_1:X\sim Q\). Let \(Z\in\{0, 1\}\) denote accepting and rejecting the null, respectively.

  • A test is a (possibly random) decision rule.
    • It is deterministic if it is of the form \(f:\mathcal X\to \{0, 1\}\), which corresponds to a partition of the sample space.
    • A randomized test is specified by a Markov kernel \(P_{Z|X}\) from \(\mathcal X\) to \(\{0, 1\}\), so that \(P_{Z|X}(1|x)\in [0, 1]\) is the probability of rejecting the null upon observing \(X=x\).
  • \(\alpha = \pi_{0|0} = P[Z=0]\): probability of success given \(H_0\) is true.
    • \(1 - \alpha\) is the false positive rate.
  • \(\beta = \pi_{0|1} = Q[Z=0]\): probability of error given \(H_1\) is true.
    • \(1-\beta\) is the power.

Definition 11.2 (Neyman-Pearson region) Given \((P, Q)\), the Neyman-Pearson region consists of the points achievable by randomized tests \[ \mathcal R(P, Q) = \{(P[Z=0], Q[Z=0]): P_{Z|X}\text{ a randomized test}\}\subset [0, 1]^2. \] Fixing \(\alpha\), we wish to obtain the lowest \(\beta\), so the lower boundary is defined by \[ \beta_\alpha(P, Q) = \inf_{P[Z=0]\geq \alpha} Q[Z=0]. \] Each deterministic test corresponds to a measurable subset \(E\), so the deterministic Neyman-Pearson region is \[ \mathcal R_{\mathrm{det}}(P, Q) = \{(P[E], Q[E]): E\text{ measurable}\}. \]

Note that \(\mathcal R(P, Q)=[0, 1]^2\) if \(P\perp Q\), while it collapses to the diagonal \(\{\alpha=\beta\}\) if \(P=Q\).

Proposition 11.1 (properties of the NP region)

  1. \(\mathcal R(P, Q)\) is closed and convex.
  2. \(\mathcal R(P, Q)\) contains the diagonal.
  3. \(\mathcal R(P, Q)\) is symmetric about \(\alpha=\beta\).
Proof Given \((\alpha_0, \beta_0), (\alpha_1, \beta_1)\in \mathcal R(P, Q)\) corresponding to tests \(P_{Z_0|X}, P_{Z_1|X}\), randomizing between the two tests yields the desired convexity. Closedness is established later. Random guessing \(Z\sim \mathrm{Ber}(1-\alpha)\perp X\) achieves \((\alpha, \alpha)\). To establish symmetry, the flipped test \(P_{1-Z|X}\) achieves \((1-\alpha, 1-\beta)\).

Theorem 11.1 (reduction to deterministic tests) \(\mathcal R(P, Q) = \overline{\mathrm{co}(\mathcal R_{\mathrm{det}}(P, Q))}\). Consequently, if \(P, Q\) are on a finite alphabet \(\mathcal X\), then \(\mathcal R(P, Q)\) is a polygon with at most \(2^{|\mathcal X|}\) vertices.

Proof To prove the nontrivial inclusion \(\subset\), given any randomized test \(P_{Z|X}\), define \(g:\mathcal X\to [0, 1]\) by \(g(x) = P_{Z|X}(0|x)\); then \[\begin{align} P[Z=0] &= \sum_x g(x)P(x) = \mathbb E_P[g] = \int_0^1 P[g\geq t ]\, dt \\ Q[Z=0] &= \mathbb E_Q[g] = \int_0^1 Q[g\geq t]\, dt \end{align}\] The last equality in each line holds because any nonnegative r.v. admits the layer-cake representation in terms of the indicators \(1_{U\geq t}\), namely \(U = \int_0^\infty 1_{U\geq t}\, dt\), so \[\begin{align} \mathbb E[U] = \mathbb E\left[\int_0^\infty 1_{U\geq t}\, dt\right] = \int_0^\infty \mathbb E[1_{U\geq t}]\, dt. \end{align}\] Thus \((P[Z=0], Q[Z=0])\) is a mixture of points \((P[g\geq t], Q[g\geq t])\in \mathcal R_{\mathrm{det}}\), hence lies in the closure of the convex hull.
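
To make the reduction concrete, here is a minimal numerical sketch on an assumed 4-letter alphabet (the distributions below are illustrative, not from the text): it enumerates all \(2^{|\mathcal X|}\) deterministic tests and takes the convex hull of the resulting \((P[E], Q[E])\) points, which by theorem 11.1 is exactly \(\mathcal R(P, Q)\) for a finite alphabet (scipy is assumed available).

```python
# Minimal sketch of theorem 11.1 on a made-up 4-letter alphabet.
from itertools import product

import numpy as np
from scipy.spatial import ConvexHull

P = np.array([0.50, 0.25, 0.15, 0.10])
Q = np.array([0.10, 0.20, 0.30, 0.40])

# Every deterministic test is a subset E of the alphabet, i.e. a 0/1 vector.
det_points = np.array([(np.dot(e, P), np.dot(e, Q))
                       for e in product([0, 1], repeat=len(P))])

# For a finite alphabet, R(P, Q) is the convex hull of these points.
hull = ConvexHull(det_points)
print("Vertices of R(P, Q):")
print(det_points[hull.vertices])
```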

We define the extended log-likelihood ratio (LLR) \(T(x)\) by extending \(\log p(x)/q(x)\) to \(+\infty\) where \(q(x)=0\), to \(-\infty\) where \(p(x)=0\), and to \(0\) where both vanish.

Definition 11.3 ((extended) log-likelihood ratio (LLR)) Given \(P, Q\), define the LLR by \[ T(x) = \begin{cases} 0 & q(x) = p(x) = 0 \\ \infty & q(x) = 0 \\ -\infty & p(x) = 0 \\ \log \dfrac{p(x)}{q(x)} & \text{otherwise} \end{cases} \]

Theorem 11.2 (properties of LLR, change of measures)

  1. Fixing \(t\in \mathbb R\), \(Q[T=t] = e^{-t} P[T=t]\). In other words, on the common support of \(P, Q\) we obtain \(dQ = e^{-T} dP\).
  2. The expectation value of \(h:\mathcal X\to \mathbb R\) on the common support of \(P, Q\) under the two distributions are related by \[\begin{align} \mathbb E_Q\left[h \cdot 1_{T> -\infty}\right] = \mathbb E_P[h\cdot e^{-T}], \quad \mathbb E_P[h \cdot 1_{T<+\infty}] = \mathbb E_Q[h \cdot e^T] \end{align}\]
  3. For any \(f\geq 0\) and \(\tau\in \mathbb R\) we have \[\begin{align} \mathbb E_Q[f \cdot 1_{T\geq \tau}] \leq e^{-\tau} \mathbb E_P[f\cdot 1_{T\geq \tau}], \quad e^{-\tau} \mathbb E_P[f\cdot 1_{T\leq \tau}] \leq \mathbb E_Q[f\cdot 1_{T\leq \tau}] \end{align}\]
Proof To prove (1), compute \[\begin{align} Q[T=t] &= \sum_x Q(x) 1\left[\log \dfrac{P(x)}{Q(x)} = t\right] = \sum_x Q(x) 1\left[e^t Q(x) = P(x)\right] \\ &= e^{-t} \sum_x P(x) 1[e^t Q(x)=P(x)] = e^{-t}P[T=t] \end{align}\] For (2), note from the definition that \(Q[T=\infty]=P[T=-\infty]=0\). Then we can w.l.o.g. operate on the common support \(R\) of \(P, Q\), on which \(dQ = e^{-T}\, dP\): \[ \mathbb E_Q[h\cdot 1_{T>-\infty}] = \int_R h\, dQ = \int_R e^{-T} h\, dP = \mathbb E_P [h\cdot e^{-T}] \] The second equation is proved similarly. For (3), substitute \(h\mapsto f\cdot 1_{T\geq \tau}\) and \(h\mapsto f\cdot 1_{T\leq \tau}\) into part (2) to obtain \[\begin{align} \mathbb E_Q[f\cdot 1_{T\geq \tau}] &= \mathbb E_P[f\cdot 1_{T\geq \tau} e^{-T}] \leq e^{-\tau} \, \mathbb E_P[f\cdot 1_{T\geq \tau}] \\ \mathbb E_P[f\cdot 1_{T\leq \tau}] &= \mathbb E_Q[f\cdot 1_{T\leq \tau} e^T] \leq e^\tau \mathbb E_Q[f\cdot 1_{T\leq \tau}] \end{align}\]
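
As a quick sanity check of part (1), the following sketch verifies \(Q[T=t] = e^{-t}P[T=t]\) on an assumed 4-letter example with full common support.

```python
# Sanity check of the change-of-measure identity Q[T = t] = exp(-t) P[T = t]
# on a small made-up alphabet with full common support.
import numpy as np

P = np.array([0.50, 0.25, 0.15, 0.10])
Q = np.array([0.10, 0.20, 0.30, 0.40])
T = np.log(P / Q)                                  # log-likelihood ratio

for t in np.unique(T):
    mask = np.isclose(T, t)
    assert np.isclose(Q[mask].sum(), np.exp(-t) * P[mask].sum())
print("change-of-measure identity verified")
```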

Definition 11.4 (likelihood ratio test) The LRT with threshold \(\tau\in \mathbb R\cup \{\pm \infty\}\) is defined by \(\mathrm{LRT}_\tau(x) = 1_{T(x)>\tau}\).

Corollary 11.1 (sufficiency) \(T\) is a sufficient statistic for testing \(P\) versus \(Q\).

Proof It suffices to prove that \(P_{X|T}=Q_{X|T}\). Note that \(P_{T|X}=Q_{T|X}\) since \(T\) is a deterministic function of \(X\), so \[\begin{align} P_{X|T}(x|t) &= \dfrac{P_X(x) P_{T|X}(t|x)}{P_T(t)} = \dfrac{P_X(x) 1[T(x) = t]}{P_T(t)} \\ &= \dfrac{e^t Q(x) 1[T(x) = t]}{P_T(t)} = \dfrac{Q_{XT}(x, t)}{e^{-t}P_T(t)} = Q_{X|T}(x|t), \end{align}\] where the last equality uses \(Q_T(t) = e^{-t}P_T(t)\) from theorem 11.2(1).

We proceed to provide two types of bounds on \(\mathcal R(P, Q)\):

  1. Converse (outer) bounds: any point in \(\mathcal R(P, Q)\) must satisfy certain constraints.
  2. Achievability (inner) bounds: points satisfying certain constraints belong to \(\mathcal R(P, Q)\).

The following result provides an “envelope” for \(\mathcal R(P, Q)\) and, in fact, applies to any \(f\)-divergence. It follows from simply applying the data-processing inequality to the channel \(P_{Z|X}\).

Theorem 11.3 (weak converse) \(\forall (\alpha, \beta)\in \mathcal R(P, Q)\), we obtain \[ d(\alpha \| \beta) \leq D(P\|Q), \quad d(\beta \| \alpha) \leq D(Q\|P) \]

Lemma 11.1 Given any test \(Z\) and \(\gamma>0\), we obtain \[ P[Z=0] - \gamma Q[Z=0] \leq P[T>\log \gamma] \iff \alpha - \gamma \beta \leq P[T>\log \gamma] \]

Proof Let \(\tau = \log \gamma\) and apply theorem 11.2(3) with \(f = P_{Z|X}(0|\cdot)\) to obtain \[ P[Z=0, T\leq \tau] \leq \gamma Q[Z=0, T\leq \tau]. \] Decomposing the marginal \(P[Z=0]=P[Z=0, T\leq \tau] + P[Z=0, T>\tau]\) yields \[ P[Z=0] - \gamma Q[Z=0] \leq P[Z=0, T>\tau] - \gamma Q[Z=0, T>\tau] \leq P[T>\tau]. \]

Applying this lemma to \((P, Q, \gamma)\) and \((Q, P, 1/\gamma)\) yields the strong converse. It effectively states that \(\mathcal R(P, Q)\) is contained in the intersection of an infinite collection of halfplanes indexed by \(\gamma\). Compared to the weak converse, we need to know the CDF of \(T\) rather than just its expectation (which is given by the divergence).

Theorem 11.4 (strong converse) For every \(\gamma>0\) and \((\alpha, \beta)\in\mathcal R(P, Q)\), we have \(\alpha - \gamma \beta \leq P[T>\log \gamma]\) and \(\beta - \alpha/\gamma \leq Q[T<\log \gamma]\).

Proceeding to achievability, a convex set can be efficiently described by its supporting hyperplanes. Characterizing \(\mathcal R(P, Q)\) is thus equivalent to solving, for each \(t>0\), \[ \min_{(\alpha, \beta)\in \mathcal R(P, Q)} t\beta - \alpha. \] Geometrically, this finds the smallest \(c\) such that the line \(t\beta - \alpha = c\) touches \(\mathcal R(P, Q)\); equivalently, it minimizes a weighted probability of error, with \((1-\alpha, \beta)\) weighted by \((1, t)\). To solve this, \[\begin{align} \alpha^* - t\beta^* = \max_{P_{Z|X}} \sum_{x\in \mathcal X} \left[ P(x) - t Q(x) \right] P_{Z|X}(0|x) = \sum_{x\in \mathcal X} \max\{0, P(x) - tQ(x)\} \end{align}\] The last equality follows from the obvious choice \[ P_{Z|X}^*(0|x) = 1\left\{ P(x) \geq tQ(x) \right\} = 1_{T\geq \log t}. \]

Theorem 11.5 (Neyman-Pearson lemma) For each \(\alpha\), the optimal error \(\beta_\alpha\) is attained by the test \[ P_{Z|X}(0|x) = \begin{cases} 1 & T>\tau \\ \lambda & T=\tau \\ 0 & T < \tau \end{cases}, \quad \text{where } \tau, \lambda \text{ are chosen so that } \alpha = P[T>\tau] + \lambda P[T=\tau]. \]

Proof Fixing \(\tau\in \mathbb R\), let \(t=e^\tau\). Given any test \(P_{Z|X}\), let \(g(x) = P_{Z|X}(0|x)\) denote the probability of accepting the null. We wish to show that \[ \alpha = \mathbb E_P[g] = P[T>\tau] + \lambda P[T=\tau] \implies \beta = \mathbb E_Q[g] \geq Q[T>\tau] + \lambda Q[T=\tau]. \] Applying theorem 11.2 twice yields \[\begin{align} \beta &= \mathbb E_Q[g\cdot 1_{T>\tau}] + \mathbb E_Q[g\cdot 1_{T\leq \tau}] \geq \mathbb E_Q[g\cdot 1_{T>\tau}] + \dfrac 1 t \mathbb E_P[g\cdot 1_{T\leq \tau}] \\ &= \mathbb E_Q[g\cdot 1_{T>\tau}] + \dfrac 1 t \left( \mathbb E_P\left[(1-g) 1_{T>\tau}\right] + \lambda P[T=\tau] \right) \\ &\geq \mathbb E_Q[g\cdot 1_{T>\tau}] + \mathbb E_Q[(1-g) 1_{T>\tau}] + \lambda Q[T=\tau] \\ &= Q[T>\tau] + \lambda Q[T=\tau] \end{align}\] where the middle equality uses \(\mathbb E_P[g\cdot 1_{T\leq\tau}] = \alpha - \mathbb E_P[g\cdot 1_{T>\tau}]\) together with the assumed value of \(\alpha\). Since the claimed test attains this lower bound with equality, it attains \(\beta_\alpha\).
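
On a finite alphabet the lemma is also an algorithm: sort the letters by decreasing LLR, accumulate \(P\)-mass until the level \(\alpha\) is reached, and randomize on the boundary letter. A minimal sketch (same assumed example distributions as above; the function name is ours):

```python
# Randomized likelihood-ratio test of the Neyman-Pearson lemma on a finite
# alphabet (illustrative example; assumes a common full support).
import numpy as np

def beta_alpha(P, Q, alpha):
    """Smallest beta = Q[Z=0] achievable subject to P[Z=0] = alpha."""
    T = np.log(P / Q)                      # log-likelihood ratio
    order = np.argsort(-T)                 # letters in decreasing LLR order
    a = b = 0.0
    for p_x, q_x in zip(P[order], Q[order]):
        if a + p_x < alpha:
            a, b = a + p_x, b + q_x        # accept this letter with prob. 1
        else:
            lam = (alpha - a) / p_x        # randomize on the boundary letter
            return b + lam * q_x
    return b                               # alpha = 1: accept everything

P = np.array([0.50, 0.25, 0.15, 0.10])
Q = np.array([0.10, 0.20, 0.30, 0.40])
print(beta_alpha(P, Q, 0.9))               # a point on the lower boundary of R(P, Q)
```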

Definition 11.5 (asymptotic regimes) For large-sample i.i.d asymptotics, we are interested in the following two questions:

  1. Stein’s regime: subject to \(\pi_{1|0}\leq \epsilon\), what is the best decay rate for \(\pi_{0|1}\)?
  2. Chernoff’s regime: when \(\pi_{1|0}, \pi_{0|1}\) both vanish exponentially, what is the optimal tradeoff between their exponents?

Definition 11.6 (Stein's exponent) The \(\epsilon\)-optimal exponent in Stein’s regime is \[ V_\epsilon = \sup\{ E: \exists n_0\ \forall n\geq n_0\ \exists P_{Z|X^n} \text{ such that } \alpha > 1 - \epsilon,\ \beta < e^{-nE} \} \] Stein’s exponent is \(V=\lim_{\epsilon\to 0}V_\epsilon\).

Proposition 11.2 (equivalent definition of Stein's exponent) \[ V_\epsilon = \liminf_{n\to \infty} \dfrac 1 n \log \dfrac 1 { \beta_{1-\epsilon}(P_{X^n}, Q_{X^n}) } \]

Theorem 11.6 (Stein's lemma) \(V_\epsilon = D(P\|Q)\) for all \(\epsilon \in (0, 1)\).
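
A numerical illustration under assumed Bernoulli hypotheses: since the LLR of an i.i.d. sample depends only on the number of ones, the optimal test reduces to a randomized threshold test on the binomial count, so \(\beta_{1-\epsilon}(P^{\otimes n}, Q^{\otimes n})\) can be computed exactly and its exponent compared with \(D(P\|Q)\) (the parameters are illustrative; scipy is assumed available).

```python
# Exact computation of beta_{1-eps}(P^n, Q^n) for P = Ber(p), Q = Ber(q)
# via the binomial sufficient statistic; the exponent approaches D(P||Q).
import numpy as np
from scipy.stats import binom

p, q, eps = 0.6, 0.3, 0.1
D = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))   # D(P||Q)

for n in (10, 100, 1000):
    k = np.arange(n + 1)
    Pk, Qk = binom.pmf(k, n, p), binom.pmf(k, n, q)
    order = np.argsort(-k)            # for p > q the LLR is increasing in k
    cumP, cumQ = np.cumsum(Pk[order]), np.cumsum(Qk[order])
    i = np.searchsorted(cumP, 1 - eps)         # boundary count of the NP test
    prevP = cumP[i - 1] if i > 0 else 0.0
    prevQ = cumQ[i - 1] if i > 0 else 0.0
    lam = (1 - eps - prevP) / Pk[order][i]     # randomization on the boundary
    beta = prevQ + lam * Qk[order][i]
    print(n, -np.log(beta) / n, "vs D(P||Q) =", D)
```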

Large deviations theory

This part corresponds to chapter 15 of the textbook.

Consider an i.i.d. sequence \(X_1, \dots, X_n\sim P\) and let \(\hat P_n\) denote their empirical distribution. We define the cumulant-generating function (CGF) \[ \psi_X(\lambda) = \log \mathbb E[e^{\lambda X}] \] Its Legendre transform, also known as the rate function, is \[ \psi_X^*(\gamma) = \sup_{\lambda\in\mathbb R}\, \lambda \gamma - \psi_X(\lambda) \] We assume for regularity that \(\psi_X\) is finite everywhere; this is known as Cramér’s condition.
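
For example, for \(X\sim\mathrm{Ber}(p)\) the CGF is \(\psi_X(\lambda) = \log(1-p+pe^\lambda)\) and its Legendre transform is the binary divergence \(d(\gamma\|p)\). The sketch below (illustrative \(p\) and \(\gamma\); scipy assumed) checks this by computing the supremum numerically.

```python
# CGF and rate function of X ~ Ber(p): the Legendre transform psi*(gamma)
# should equal the binary divergence d(gamma||p).  (p, gamma illustrative.)
import numpy as np
from scipy.optimize import minimize_scalar

p = 0.3
psi = lambda lam: np.log(1 - p + p * np.exp(lam))            # CGF of Ber(p)
rate = lambda g: -minimize_scalar(lambda lam: psi(lam) - lam * g).fun
d = lambda a, b: a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

for gamma in (0.4, 0.5, 0.7):
    print(gamma, rate(gamma), d(gamma, p))                   # the two values agree
```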

Definition 11.7 (CGF properties) Under Cramér’s condition, we obtain

  1. \(\psi_X\) is convex.
  2. \(\psi_X\) is continuous.
  3. \(\psi_X\) smooth (infinitely differentiable) and \[ \psi'_X(\lambda) = \dfrac{\mathbb E[X e^{\lambda X}]}{\mathbb E[\exp(\lambda X)]} = e^{-\psi_X(\lambda)}\mathbb E[X e^{\lambda X}] \] In particular, \(\psi'_X(0) = \mathbb E[X]\).
  4. If \(a\leq X\leq b\) a.s. then \(a\leq \psi'_X\leq b\). Conversely, \(\inf \psi'_X \leq X\leq \sup \psi'_X\) a.s.
  5. If \(X\) is not a constant, then \(\psi_X\) is strictly convex.
Proof (partial)
  1. Fix \(\theta\in (0, 1)\) and define the \(L_p\) norm of a r.v. by \(\|U\|_p = \left(\mathbb E|U|^p\right)^{1/p}\). Applying Hölder’s inequality \(\mathbb E|UV| \leq \|U\|_p\|V\|_q\) with \((p, q) = (1/\theta, 1/\bar \theta)\) yields \[\begin{align} \mathbb E\left[e^{(\theta\lambda_1 + \bar\theta\lambda_2) X}\right] \leq \left\| e^{\theta\lambda_1 X}\right\|_{1/\theta} \left\| e^{\bar\theta\lambda_2 X}\right\|_{1/\bar\theta} = \mathbb E[e^{\lambda_1 X}]^\theta\, \mathbb E[e^{\lambda_2 X}]^{\bar \theta} \end{align}\] Take the logarithm on both sides to obtain convexity.
  2. A convex function must be continuous on the interior of its domain.
  3. To first demonstrate that \(\mathbb E|X e^{\lambda X}|\) is finite, note that \(|X|\leq e^{|X|} \leq e^X + e^{-X}\), so \[ |X e^{\lambda X}| \leq e^{(\lambda+1)X} + e^{(\lambda - 1)X}, \] and both terms are integrable by Cramér’s condition. Next, \(u\mapsto \mathbb E|X e^{uX}|\) is integrable on \([0, \lambda]\), so applying Fubini yields \[\begin{align} e^{\psi_X(\lambda)} &= \mathbb E[e^{\lambda X}] = \mathbb E\left[ 1 + \int_0^\lambda X e^{uX}\, du \right] = 1 + \int_0^\lambda \mathbb E[X e^{uX}]\, du \end{align}\] and differentiating in \(\lambda\) gives \[ \psi'_X(\lambda) e^{\psi_X(\lambda)} = \mathbb E[X e^{\lambda X}]. \]
  4. \(a\leq X\leq b\implies \psi'_X(\lambda) \in [a, b]\) by applying part (3).
  5. The converse of (4) and part (5) are omitted.

Lemma 11.2 (Chernoff bound) Let \(X^n\) be i.i.d., let \(\bar X\) denote the empirical mean, and fix \(\lambda \geq 0\). For \(\mathbb E[X]<\gamma<\mathrm{esssup}\, X\), \[ \mathrm{Pr}\left[ \bar X \geq \gamma \right] \leq \exp \left[-n\lambda \gamma + n \psi_X(\lambda)\right] \]

Proof Apply Markov’s inequality \(\mathrm{Pr}[U\geq a]\leq \dfrac 1 a \mathbb E[U]\) (for \(U\geq 0\)) after exponentiating the inequality; note that by independence \(\mathbb E[\exp(n\lambda \bar X)] = \mathbb E[e^{\lambda X}]^n = e^{n\psi_X(\lambda)}\). \[\begin{align} \mathrm{Pr}[n\bar X\geq n\gamma] &= \mathrm{Pr}[\exp(n\lambda \bar X) \geq \exp(n\lambda \gamma)] \\ &\leq \exp(-n\lambda \gamma) \mathbb E[\exp(n\lambda \bar X)] \\ &= \exp \left(-n\lambda \gamma + n\psi_X(\lambda)\right) \end{align}\]

Theorem 11.7 (large-sample mean deviations) Let \(X\) be a r.v. whose log-MGF \(\psi_X\) is finite everywhere, let \(B=\mathrm{esssup}\, X\), and let \(\mathbb E[X]<\gamma<B\). Then \[ \mathrm{Pr}\left[\dfrac 1 n \sum_{j=1}^n X_j \geq \gamma\right] = \exp \left[-nE(\gamma) + o(n)\right] \] where \(E(\gamma) = \psi_X^*(\gamma) = \sup_{\lambda \geq 0}\, \lambda \gamma - \psi_X(\lambda)\) is the rate function.

Proof (non-asymptotic upper bound) Recalling the Chernoff bound (lemma 11.2), we obtain \(\mathrm{Pr}[\bar X\geq \gamma] \leq \exp[-n\lambda \gamma + n\psi_X(\lambda)]\). Optimizing over \(\lambda\) naturally yields the Legendre transform \(\psi_X^*\), giving the non-asymptotic bound \[ \mathrm{Pr}[\bar X\geq \gamma] \leq \exp[-n \psi_X^*(\gamma)] \tag{11.1} \]
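
To see how tight (11.1) is, here is a quick check against the exact binomial tail under an assumed Bernoulli model, where \(\psi_X^*(\gamma) = d(\gamma\|p)\) as computed above (parameters illustrative; scipy assumed).

```python
# The non-asymptotic bound (11.1) for X ~ Ber(p):
#   P[mean >= gamma] <= exp(-n d(gamma||p)),
# compared against the exact binomial tail.
import numpy as np
from scipy.stats import binom

p, gamma = 0.3, 0.5
rate = gamma * np.log(gamma / p) + (1 - gamma) * np.log((1 - gamma) / (1 - p))

for n in (10, 50, 200):
    exact = binom.sf(np.ceil(n * gamma) - 1, n, p)       # P[sum >= n * gamma]
    print(n, exact, np.exp(-n * rate))                   # the bound always dominates
```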

Definition 11.8 (tilting) Given \(X\sim P\) and \(\lambda \in \mathbb R\), the tilted measure \(P_\lambda\) is defined by \[ P_\lambda(dx) = \dfrac{e^{\lambda x}}{\mathbb E[e^{\lambda X}]} P(dx) = e^{\lambda x - \psi_X(\lambda)} P(dx) \]

Fixing \(P\), the one-parameter family of tilted distributions \(P_\lambda\) is exactly the natural exponential family generated by \(P\). In particular, the following results follow from straightforward calculation (a numerical check appears after the theorem below):

Theorem 11.8 (properties of NEF)

  1. \(\psi_{P_\lambda}(u) = \psi_X(\lambda + u) - \psi_X(\lambda)\).
  2. Tilting trades mean for divergence: \[ \lambda>0\implies \mathbb E_{P_\lambda}[X] = \psi'_X(\lambda) \geq \mathbb E_P[X] \] The same is true with \(>\) replaced by \(<\). In terms of divergence, \[ D(P_\lambda \|P) = (\psi_X^*\circ \psi_X')(\lambda) = \psi_X^*(\mathbb E_{P_\lambda}[X]) \]
  3. \(\mathrm{Var}_{P_\lambda}(X) = \psi_X''(\lambda)\log e\).
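
The following sketch checks these identities numerically for an assumed finite-support \(X\): the tilted mean matches \(\psi_X'(\lambda)\) (by finite differences) and \(D(P_\lambda\|P)\) matches \(\psi_X^*(\mathbb E_{P_\lambda}[X])\) computed as a numerical supremum (scipy assumed).

```python
# Numerical check of the NEF properties for a finite-support X (values and
# probabilities are illustrative): E_{P_lam}[X] = psi'(lam) and
# D(P_lam || P) = psi*(E_{P_lam}[X]).
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([-1.0, 0.0, 2.0, 3.0])
P = np.array([0.2, 0.4, 0.3, 0.1])
psi = lambda lam: np.log(np.sum(P * np.exp(lam * x)))        # CGF of X under P

lam = 0.7
P_lam = P * np.exp(lam * x - psi(lam))                       # tilted distribution
mean_tilted = np.sum(P_lam * x)

eps = 1e-6
psi_prime = (psi(lam + eps) - psi(lam - eps)) / (2 * eps)    # finite-difference psi'

kl = np.sum(P_lam * np.log(P_lam / P))                       # D(P_lam || P)
rate = -minimize_scalar(lambda u: psi(u) - u * mean_tilted).fun

print(mean_tilted, psi_prime)     # agree up to finite-difference error
print(kl, rate)                   # agree: D(P_lam||P) = psi*(psi'(lam))
```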

Theorem 11.9 (large-deviation estimate is related to information projection) Given \(X_1, \dots\sim P\) i.i.d, then for any \(\gamma\in \mathbb R\): \[ \lim_{n\to \infty} \dfrac 1 n \log \dfrac 1 {P[\bar X > \gamma]} = \inf_{Q:\mathbb E_Q[X]>\gamma} D(Q\|P) \] The same holds for \(>\mapsto \geq\). For every \(n\) we obtain the firm upper bound \[ P[\bar X_n \geq \gamma] \leq \exp \left[-n \inf_{\mathbb E_Q[X]\geq \gamma} D(Q\|P)\right]. \]

Proof (strict deviation case)

The inequalities hold trivially if the events have zero probability. Let \(E_n = \{\bar X_n > \gamma\}\) denote the event that the empirical mean of \(n\) samples exceeds \(\gamma\).

To prove the lower bound (on \(P[E_n]\)), fix \(Q\) such that \(\mathbb E_Q[X]>\gamma\). Start by applying the DPI: \[\begin{align} -\log 2 + Q[E_n] \log \dfrac 1 {P[E_n]} &\leq d(Q[E_n]\|P[E_n]) \leq D(Q_{X^n}\| P_{X^n}) = nD(Q\|P) \\ \implies \log \dfrac 1 {P[E_n]} &\leq \dfrac{nD(Q\|P) + \log 2}{Q[E_n]} \end{align}\] Also note that by the WLLN, \(Q[E_n] = 1 - o(1)\). Optimizing over \(Q\) yields (the \(\limsup\) creeps in in order to absorb the \(\log 2\) and \(Q[E_n]\approx 1\) factors) \[ \limsup_{n\to \infty} \dfrac 1 n \log \dfrac 1 {P[E_n]} \leq \inf_{\mathbb E_Q[X]>\gamma} D(Q\|P) \] To prove the upper bound, crucially note that for any event \(E\) with \(P_X[E]>0\), we have \(\log \frac 1 {P_X(E)} = D(P_{X|X\in E} \| P_X)\). Define \(\tilde P_{X^n} = P_{X^n|\bar X>\gamma}\); since \(\tilde P_{X^n}\) is supported on \(E_n\), it satisfies \(\mathbb E_{\tilde P}[\bar X]\geq \gamma\), so \[ \log \dfrac 1 {P[E_n]} = D(\tilde P_{X^n} \| P_{X^n}) \geq \inf_{\mathbb E_{Q_{X^n}}[\bar X]\geq\gamma} D(Q_{X^n} \| P_{X^n}) \] The last infimum problem tensorizes, by the superadditivity of the KL divergence when the second argument is a product measure and its convexity in the first argument: writing \(\bar Q = \frac 1 n \sum_j Q_{X_j}\), which satisfies \(\mathbb E_{\bar Q}[X] = \mathbb E_{Q_{X^n}}[\bar X]\geq\gamma\), \[ D(Q_{X^n} \| P_{X^n}) \geq \sum_{j=1}^n D(Q_{X_j} \| P) \geq nD(\bar Q \| P). \] This yields the desired non-asymptotic bound \[ \log \dfrac 1 {P[E_n]} \geq n\inf_{\mathbb E_Q[X]\geq \gamma} D(Q\|P). \] Combining the two bounds yields the claimed limit whenever the two infima coincide, which holds for \(\mathbb E[X]<\gamma<\mathrm{esssup}\, X\).
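
A numerical sketch of the information projection for an assumed finite-support \(X\): the projection of \(P\) onto \(\{Q: \mathbb E_Q[X]\geq \gamma\}\) is the tilted distribution \(P_\lambda\) with mean exactly \(\gamma\) (solved by bisection below), its divergence from \(P\) equals \(\lambda\gamma - \psi_X(\lambda)=\psi_X^*(\gamma)\), and no randomly drawn feasible \(Q\) does better.

```python
# Information projection onto {Q : E_Q[X] >= gamma} for a finite-support X
# (support, P and gamma are illustrative assumptions).  The minimizer is the
# tilted distribution with mean gamma; we also spot-check minimality.
import numpy as np

rng = np.random.default_rng(0)
x = np.array([-1.0, 0.0, 2.0, 3.0])
P = np.array([0.2, 0.4, 0.3, 0.1])
gamma = 1.5                                     # E_P[X] = 0.7 < gamma < max(x)

psi = lambda lam: np.log(np.sum(P * np.exp(lam * x)))

# Solve psi'(lam) = gamma by bisection (psi' is increasing by convexity).
lo, hi = 0.0, 50.0
for _ in range(200):
    mid = (lo + hi) / 2
    tilted = P * np.exp(mid * x - psi(mid))
    lo, hi = (mid, hi) if np.sum(tilted * x) < gamma else (lo, mid)
lam = (lo + hi) / 2
Q_star = P * np.exp(lam * x - psi(lam))
proj = np.sum(Q_star * np.log(Q_star / P))      # = lam*gamma - psi(lam) = psi*(gamma)

# Spot-check: no randomly drawn feasible Q has smaller divergence from P.
for _ in range(10_000):
    Q = rng.dirichlet(np.ones(len(P)))
    if Q @ x >= gamma:
        assert np.sum(Q * np.log(Q / P)) >= proj - 1e-9
print("psi*(gamma) =", proj, "  lam =", lam)
```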

The previous part motivates the study of the problem \(\inf_{Q\in \mathcal E} D(Q\|P)\), where \(\mathcal E\) is a convex set of distributions on \(\Omega\).

Theorem 11.10 (orthogonality of information projection) Given a convex set \(\mathcal E\) of distributions. If there exists \(Q^*\in \mathcal E\) such that \(D(Q^* \| P) = \min_{Q\in \mathcal E} D(Q\|P)<\infty\), then \(\forall Q\in \mathcal E\), we obtain \[ D(Q\|P) \geq D(Q\|Q^*) + D(Q^* \| P) \]

Proof Consider the nontrivial case \(D(Q^* \| P) \leq D(Q\|P)<\infty\). For \(\lambda\in [0, 1]\), consider the convex combination \(Q^{(\lambda)} = \bar \lambda Q^* + \lambda Q \in \mathcal E\). Since \(Q^*\) minimizes \(D(\cdot\|P)\) over \(\mathcal E\), we obtain \[ 0 \leq \dfrac d {d\lambda}\bigg|_{\lambda=0} D(Q^{(\lambda)} \| P) = D(Q\|P) - D(Q\|Q^*) - D(Q^* \| P) \]
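
The key step here is the derivative identity, which in fact holds for any full-support \(Q^*, Q, P\), not only for the minimizer. A quick finite-difference check on made-up distributions:

```python
# Finite-difference check of the derivative identity used above:
#   d/dl D((1-l) Q* + l Q || P) at l = 0  equals  D(Q||P) - D(Q||Q*) - D(Q*||P).
# (The three distributions are arbitrary full-support choices, for illustration.)
import numpy as np

P      = np.array([0.40, 0.30, 0.20, 0.10])
Q_star = np.array([0.25, 0.25, 0.25, 0.25])
Q      = np.array([0.10, 0.20, 0.30, 0.40])

kl = lambda a, b: np.sum(a * np.log(a / b))
mix = lambda l: (1 - l) * Q_star + l * Q

eps = 1e-6
lhs = (kl(mix(eps), P) - kl(Q_star, P)) / eps          # one-sided derivative at 0
rhs = kl(Q, P) - kl(Q, Q_star) - kl(Q_star, P)
print(lhs, rhs)                                        # agree up to O(eps)
```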