10 Lossless compression
This section corresponds to Chapter 10 of the book. Important takeaways:
- Regimes for lossless data compression:
- Single-shot: encode a symbol, then decode a symbol. For single-shot encoding, one may consider grouping several symbols into a super-symbol; a special case of this is when the source is i.i.d.
- Code extension (definition 10.6): codewords are concatenated.
- The compression length of the optimal single-shot lossless compressor is bounded by the entropy density (proposition 10.2), and the compressor is optimal in the strong sense of stochastic dominance (proposition 10.1).
- Kraft-McMillan (theorem 10.3): every uniquely decodable code satisfies the Kraft inequality, and any lengths satisfying it are realized by a constructively built prefix-free code. In particular, in terms of encoding length, uniquely decodable codes offer no advantage over prefix codes.
- Theorem 10.4: bound on optimal prefix code.
- Main inequality, combining theorems 10.1 and 10.4:
\[
H(X) - \log_2[e(H(X)+1)] \leq \mathbb E[L_{\mathrm{single-shot}}^*(X)]
\leq H(X) \leq \mathbb E[L_{\mathrm{prefix}}^*(X)] \leq H(X)+1
\]
- In fixed-length almost-lossless compression (definition 10.7), a single emission of the source is compressed to \(\{0, 1\}^k\) instead of \(\{0, 1\}^*\).
- Shannon’s noiseless coding theorem (corollary 10.3): holds in the limit of asymptotically many copies \(S^n\) of an i.i.d. source, compressed to \(nR\) bits at rate \(R\).
- Asymptotic equipartition property (proposition 10.3): for asymptotically many copies of an i.i.d. source, the dominant occurrences are (nearly) equiprobable.
To motivate later studies, we note two limitations of Huffman codes (the optimal prefix-free variable-length code):
- It requires knowing the source distribution; this is addressed by the study of universal compression.
- Lempel-Ziv is universal, has low complexity, and is provably optimal for all ergodic sources.
- Often \(H(S^n)\ll nH(S)\) (e.g. not all arrangements of letters are words, and sentences are subject to semantic constraints). Thus applying Huffman to a block \((S_1, \cdots, S_n)\) is much more rate-efficient; however, this construction has complexity exponential in \(n\).
- Arithmetic coding: complexity linear in the block-length while attaining \(H(S_1^n)\) length.
Variable-length source coding
Fixing a source domain \(\mathcal X\), we can define:
- A compressor \(f:\mathcal X\to \{0, 1\}^*\).
- A decompressor \(g:\{0, 1\}^*\to \mathcal X\).
The single-shot in the definition below refers to the fact that we compress and decompress one symbol at a time, rather than compressing a stream of symbols jointly; thus we need not impose any constraints on \(f\), such as prefix-freeness or unique decodability.
Definition 10.1 (variable-length single-shot lossless compression) A pair \((f, g)\) is a variable-length single-shot lossless compressor if:
- \(f\) is of the type \(f:\mathcal X\to \{0, 1\}^*\). The image of \(f\) is a codebook, and an element \(f(x)\) in the image of \(f\) is a codeword.
- The decompressor \(g\) satisfies \(g\circ f = 1_{\mathcal X}\).
Two immediate results of this definition:
- Lossless compression is only possible for discrete \(X\).
- Without loss of generality, we can sort \(\mathcal X = \{1, 2, \cdots\}\) so that \(P(1)\geq P(2)\geq \cdots\).
We begin with several definitions.
Definition 10.2 (entropy density) Given a finite alphabet and fixing binary units, the entropy density function of a source \(X\) is \[ i_X(a) = \log_2 \dfrac 1 {P_X(a)}, \qquad\text{so that}\qquad H(X) = \mathbb E[i_X(X)] \]
Definition 10.3 (varentropy) The varentropy \(V(S)\) of a source \(S\) is \[ V(S) = \mathrm{Var}[i_S] \] Note that \(H(S)^2 + V(S) = \mathbb E[i_S^2]\).
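As a quick numerical illustration of these two definitions, here is a minimal Python sketch (the helper `entropy_stats` is ours, not from the book) that computes \(H\), \(V\), and \(\mathbb E[i_S^2]\) for a PMF and checks the identity \(H^2 + V = \mathbb E[i^2]\):

```python
import math

def entropy_stats(pmf):
    """Entropy H, varentropy V, and E[i^2] of a PMF, in bits."""
    i = [math.log2(1 / p) for p in pmf]            # entropy density i(a)
    H = sum(p * x for p, x in zip(pmf, i))         # H = E[i]
    Ei2 = sum(p * x**2 for p, x in zip(pmf, i))    # E[i^2]
    return H, Ei2 - H**2, Ei2                      # V = E[i^2] - H^2

H, V, Ei2 = entropy_stats([0.5, 0.25, 0.125, 0.125])
assert abs(H**2 + V - Ei2) < 1e-12                 # H^2 + V = E[i^2]
```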
Definition 10.4 (stochastic dominance) Given two real-valued stochastic variables \(X, Y\), we say that \(X\preceq Y\), or \(Y\) stochastically dominates \(X\), if the CDF of \(X\) dominates that of \(Y\) everywhere; this formalizes the notion that \(X\) is concentrated on smaller values.
Definition 10.5 (optimal single-shot lossless compressor) For a down-sorted PMF \(P_X\), the optimal single-shot lossless compressor \(f^*\) assigns strings of nondecreasing length \((\varnothing, 0, 1, 00, 01, 10, 11, 000, \cdots)\) to the symbols \(j\in \mathcal X=\{1, 2, \cdots\}\). In particular, \(l(f^*(j)) = \lfloor \log_2 j\rfloor\), where \(l\) is the string-length function. Let \(L(j) = \lfloor \log_2 j \rfloor\) denote the compression length of input \(j\).
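The codeword assignment has a closed form: \(f^*(j)\) is the binary expansion of \(j\) with its leading 1 removed. A minimal sketch (the function name `f_star` is ours):

```python
import math

def f_star(j: int) -> str:
    """Optimal single-shot codeword for symbol j >= 1 (PMF sorted
    descending): the binary expansion of j with the leading 1 stripped."""
    assert j >= 1
    return bin(j)[3:]   # bin(j) = '0b1...'; drop '0b' and the leading '1'

# symbols 1..7 receive '', '0', '1', '00', '01', '10', '11'
for j in range(1, 8):
    assert len(f_star(j)) == math.floor(math.log2(j))   # l(f*(j)) = floor(log2 j)
```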
An immediate corollary of the following result is that \(\mathbb E[(l\circ f^*)(X)] \leq \mathbb E[(l\circ f)(X)]\) for any compressor \(f\) and any source \(X\).
Proposition 10.1 (optimality of optimal encoding) For any lossless single-shot compressor \(f\) and source distribution \(X\), \(l\circ f^*\preceq l\circ f\).
Proof
Let \(A_k = \{x:l(f(x)) \leq k\}\) denote the set of inputs which are encoded to length at most \(k\). Since the encoding is lossless (injective) and there are at most \(2^{k+1}-1\) binary strings of length at most \(k\), we have \[ |A_k| \leq \sum_{j=0}^k 2^j = 2^{k+1}-1 = |A_k^*| \] The stochastic dominance follows from the fact that \(f^*\) sorts the PMF in descending order: \[ \mathrm{Pr}[(l\circ f)(X)\leq k] = \sum_{x\in A_k} P_X(x) \leq \sum_{x\in A^*_k} P_X(x) = \mathrm{Pr}[(l\circ f^*)(X) \leq k] \]

Proposition 10.2 (entropy density dominates encoding length) When \(X\) is sorted (as for the optimal compressor), \(L\preceq i_X\).
Proof
Since the PMF is sorted in descending order, each of the symbols \(1, \cdots, x\) has probability at least \(P_X(x)\), so \(x P_X(x) \leq 1\), i.e. \(P_X(x)\leq 1/x\). Then \(L(x)\leq i_X(x)\) pointwise, hence \(L\preceq i_X\), based on \[ L(x) = \lfloor \log_2 x\rfloor \leq \log_2 x \leq \log_2 \dfrac 1 {P_X(x)} = i_X(x) \]

A coding theorem is one that relates an operational, a priori intractable compression quantity (e.g. \(\mathbb E[L(X)]\)) to an information measure (e.g. \(H(X)\)).
Lemma 10.1 \(H(X|L=k)\leq k\), hence \(H(X|L)\leq \mathbb E[L]\). This follows since \(X|L=k\) can take at most \(2^k\) values, and the uniform distribution on \(2^k\) values has entropy \(k\).
Lemma 10.2 For \(X\) taking values on \(\mathbb N=\{1, 2, \cdots\}\) and \(\mathbb E[X]<\infty\), we have \[ H(X) \leq h\left(\mathbb E[X]^{-1}\right)\mathbb E[X] \] The inequality is saturated by the geometric distribution.
Proof
Let \(p = 1 / \mathbb E[X]\) and note that \(\mathbb E[Xp] = 1\) and \(\mathbb E[X\bar p] = \mathbb E[X(1-p)] = \mathbb E[X - 1]\), then \[\begin{align} H(X) - \mathbb E[X] h(p) &= \mathbb E\left[-\log p(X) - X[-p\log p - \bar p\log \bar p]\right] \\ &= \mathbb E\left[-\log p(X) + Xp \log p + X\bar p\log \bar p\right] \\ &= \mathbb E\left[-\log p(X) + (X-1) \log \bar p + \log p\right] \\ -D(P_X\|\mathrm{Geom}_p) &= -\mathbb E\left[\log \dfrac{p(X)}{p{\bar p}^{X-1}}\right] = \mathbb E\left[-\log p(X) + (X-1) \log \bar p + \log p\right] \end{align}\] hence \(H(X) - \mathbb E[X]h(p) = -D(P_X\|\mathrm{Geom}_p)\), i.e. \[ H(X) = \mathbb E[X] h(p) - \underbrace{D(P_X\|\mathrm{Geom}_p)}_{\geq 0} \leq \mathbb E[X] h(p) \] with equality iff \(P_X = \mathrm{Geom}_p\).

Theorem 10.1 (average length of optimal encoding) \(H(X) - \log_2[e(H(X)+1)] \leq \mathbb E[L(X)] \leq H(X)\)
Proof
Assume \(X\) sorted w.l.o.g. In light of proposition 10.2, \(L\preceq i_X\implies \mathbb E[L] \leq H(X)\), which is the upper bound. For the lower bound, recall from lemma 10.1 that \(H(X|L)\leq \mathbb E[L]\), and apply lemma 10.2 to \(L+1\) (which takes values in \(\{1, 2, \cdots\}\)):
\[\begin{align} H(X) &= H(X, L) = H(X|L) + H(L) \\ &\leq \mathbb E[L] + (\mathbb E[L] + 1)\, h\left(\dfrac 1 {1+\mathbb E[L]}\right) \\ &= \mathbb E[L] + \log_2(1+\mathbb E[L]) + \mathbb E[L] \log_2 \left(1 + \dfrac 1 {\mathbb E[L]}\right) \\ &\leq \mathbb E[L] + \log_2(1+\mathbb E[L]) + \log_2 e \leq \mathbb E[L] + \log_2 [e(1+H(X))] \end{align}\] In the last two steps, we used \(x\log_2(1+1/x) \leq \log_2 e\) and \(\mathbb E[L]\leq H(X)\).

Theorem 10.2 (code length distribution of optimal compression) \(\forall \tau>0, k\geq 0\): \[ \mathrm{Pr}\left[i_X(X)\leq k \right] \leq \mathrm{Pr}[L(X) \leq k] \leq \mathrm{Pr}\left[i_X(X) \leq k + \tau\right] + 2^{1-\tau} \] In other words, \(i_X\) stochastically dominates \(L\), but \(L\) cannot be much smaller: the distribution of the optimal code length is pinned to that of the entropy density up to a slack of \(\tau\) bits.
Proof
- Lower bound (achievability, or “compression length is smaller than”): proposition 10.2.
- Upper bound (converse, or “compression length cannot be much smaller than”): \[\begin{align} \mathrm{Pr}[L\leq k] &= \mathrm{Pr}[L \leq k, i_X(X)\leq k+\tau] + \mathrm{Pr}[L\leq k, i_X(X)> k+\tau] \\ &\leq \mathrm{Pr}[i_X(X)\leq k+\tau] + (2^{k+1}-1) \sup_{x: i_X(x)>k+\tau} P_X(x) \\ &\leq \mathrm{Pr}[i_X(X)\leq k + \tau] + 2^{1-\tau} \end{align}\] To bound the first term, we used \(\mathrm{Pr}[A, B]\leq \mathrm{Pr}[B]\). For the second, the event \(\{L\leq k\}\) contains at most \(2^{k+1}-1\) atoms, each of mass at most \(2^{-k-\tau}\).
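The sandwich can be checked exactly on a small example. The following sketch (the PMF is an illustrative choice of ours) verifies both inequalities of theorem 10.2 for several values of \(k\) and \(\tau\):

```python
import math

pmf = [0.4, 0.3, 0.2, 0.1]                    # sorted PMF on symbols 1..4
i_x = [math.log2(1 / p) for p in pmf]         # entropy density
L   = [math.floor(math.log2(j)) for j in range(1, 5)]   # optimal lengths

def prob(flags):                              # Pr over symbols where flag holds
    return sum(p for p, ok in zip(pmf, flags) if ok)

for k in range(4):
    for tau in (0.5, 1.0, 2.0):
        lo  = prob([x <= k for x in i_x])               # Pr[i_X <= k]
        mid = prob([l <= k for l in L])                 # Pr[L <= k]
        hi  = prob([x <= k + tau for x in i_x]) + 2 ** (1 - tau)
        assert lo <= mid <= hi
```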
We can consider the source as a random process \((S_1, S_2, \cdots)\) and group the first \(n\) symbols (the blocklength) into a supersymbol \(S^n\).
Corollary 10.1 (asymptotic optimal-coding properties) Let \((S_1, S_2, \cdots)\) be a random process (such that each \(S^n\) is sorted) and \(U, V\) real-valued r.v.s; then \[\begin{align} \dfrac 1 n i_{S^n}(S^n)\xrightarrow{d} U &\iff \dfrac 1 n L(S^n)\xrightarrow{d} U \\ \dfrac 1 {\sqrt n} \left[ i_{S^n}(S^n) - H(S^n) \right] \xrightarrow{d} V & \iff \dfrac 1 {\sqrt n} \left[ L(S^n) - H(S^n) \right] \xrightarrow{d} V \end{align}\]
Proof
To obtain the first result, apply theorem 10.2 with \(k=un\) and \(\tau = \sqrt n\). For the second, apply it with \(k=H(S^n) + \sqrt n u\) and \(\tau = n^{1/4}\); writing \(i_X\) for \(i_{S^n}(S^n)\) and \(H\) for \(H(S^n)\): \[\begin{align} \mathrm{Pr}[i_X \leq H + \sqrt n u] &\leq \mathrm{Pr}[L \leq H + \sqrt n u] \leq \mathrm{Pr}[i_X \leq H + \sqrt n u + n^{1/4}] + 2^{1-n^{1/4}} \\ \mathrm{Pr}\left[ \dfrac{i_X - H} {\sqrt n} \leq u \right] &\leq \mathrm{Pr}\left[ \dfrac{L-H}{\sqrt n} \leq u \right] \leq \mathrm{Pr}\left[ \dfrac{i_X - H}{\sqrt n} \leq u + n^{-1/4} \right] + 2^{1-n^{1/4}} \end{align}\] and the residual terms \(n^{-1/4}\) and \(2^{1-n^{1/4}}\) vanish as \(n\to\infty\).

Corollary 10.2 (i.i.d. asymptotics) When the \(S_j\) are i.i.d., we have
- \(L(S^n)/n \xrightarrow P H(S)\).
- If the varentropy \(V(S)<\infty\), then by the CLT the normalized entropy density is asymptotically Gaussian, and by corollary 10.1 so is the normalized code length: \[ \dfrac 1 {\sqrt {n V(S)}} [L(S^n) - nH(S)] \xrightarrow d \mathcal N(0, 1) \] equivalently, in shorthand, \(L(S^n) \approx nH(S) + \sqrt{nV(S)}\, \mathcal N(0, 1)\) (see the sketch below).
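A Monte-Carlo sketch of the Gaussian limit for a Bernoulli source; we substitute the entropy density for \(L(S^n)\), which is legitimate in distribution by corollary 10.1 (all parameters are illustrative choices of ours):

```python
import math, random

random.seed(0)
p, n, trials = 0.11, 2_000, 1_000
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)   # entropy, bits
V = p * (1 - p) * math.log2((1 - p) / p) ** 2        # varentropy of Ber(p)

def normalized_length():
    ones = sum(random.random() < p for _ in range(n))
    # i_{S^n}(S^n) = sum_j i_S(S_j) for an i.i.d. Bernoulli block
    i = ones * math.log2(1 / p) + (n - ones) * math.log2(1 / (1 - p))
    return (i - n * H) / math.sqrt(n * V)            # should be ~ N(0, 1)

samples = sorted(normalized_length() for _ in range(trials))
print(samples[trials // 2])   # sample median; close to 0 for N(0, 1)
```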
Uniquely decodable codes
We now focus on compressors whose output stream can be uniquely decoded. We consider a finite alphabet \(\mathcal A\) and let \(\mathcal A^*=\bigcup_{j=1}^\infty \mathcal A^j\).
Definition 10.6 (code extension, unique decodability, prefix code) The symbol-by-symbol extension of \(f:\mathcal A\to \{0, 1\}^*\) to \(\mathcal A^*\to \{0, 1\}^*\) is obtained by concatenating the outputs. A compressor \(f\) is uniquely decodable if its extension is injective; it is a prefix(-free) code if no codeword is a prefix of another.
Lossless codes \(\supsetneq\) uniquely decodable codes \(\supsetneq\) prefix codes. Consider
- \(f(a)=0, f(b)=1, f(c)=10\): lossless but not uniquely decodable.
- \(f(a)=0, f(b)=01, f(c)=011, f(d)=0111\): uniquely decodable but not prefix; the decoder must wait for the next \(0\) (the start of the following codeword) to know where a token ends.
- \(f(a)=0, f(b)=10, f(c)=11\): prefix code.
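These three examples can be checked mechanically. A small sketch (helper names are ours) that tests prefix-freeness and evaluates the Kraft sum, foreshadowing theorem 10.3 below:

```python
def is_prefix_free(code):
    """True iff no codeword is a proper prefix of another."""
    w = list(code.values())
    return not any(u != v and v.startswith(u) for u in w for v in w)

def kraft_sum(code):
    return sum(2.0 ** -len(w) for w in code.values())

c1 = {'a': '0', 'b': '1', 'c': '10'}                  # lossless, not UD
c2 = {'a': '0', 'b': '01', 'c': '011', 'd': '0111'}   # UD, not prefix
c3 = {'a': '0', 'b': '10', 'c': '11'}                 # prefix code

assert not is_prefix_free(c1) and not is_prefix_free(c2) and is_prefix_free(c3)
assert kraft_sum(c1) > 1       # 1.25: consistent with c1 not being UD
assert kraft_sum(c2) <= 1      # 0.9375: Kraft holds for every UD code
assert kraft_sum(c3) == 1.0    # saturates the Kraft inequality
```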
Prefix codes are in bijective correspondence with binary trees, with codewords as the leaves.
Theorem 10.3 (Kraft-McMillan) Let \(f:\mathcal A\to \{0, 1\}^*\) be uniquely decodable and let \(L=l\circ f\) be its code length function. Then \(f\) satisfies the Kraft inequality \[ \sum_{a\in \mathcal A} 2^{-L(a)} \leq 1 \] (with \(2\mapsto D\) for codes over a size-\(D\) code alphabet). Conversely, for any set of code lengths \(\{L(a):a\in \mathcal A\}\) satisfying the Kraft inequality, there exists an efficiently computable prefix code with length function \(L\).
Proof
Let \(f\) be a uniquely decodable code. Assume \(\mathcal A\) finite and define \(L^*=\max_{a\in \mathcal A} L(a)\). Let \[ G(f, z) = \sum_{a\in \mathcal A} z^{L(a)} = \sum_{l=0}^{L^*} A(f, l) z^l \] where \(A(f, l)\) is the number of codewords of length \(l\) in \(f\). Consider the \(k\)-fold extension \(f^k\) for \(k\geq 1\) and note that \[ G(f^k, z) = \sum_{a^k \in \mathcal A^k} z^{L(a_1)+\cdots + L(a_k)} = G(f, z)^k = \sum_{l=0}^{kL^*} A(f^k, l)z^l \] The key step here is the combinatorial identity \(G(f^k, z) = G(f, z)^k\). Since \(f^k\) is injective (unique decodability), we have \(A(f^k, l)\leq 2^l\) for every \(l\), hence \(G(f^k, 1/2) \leq kL^*+1\) for all \(k\), so \(G(f, 1/2) = G(f^k, 1/2)^{1/k} \leq (kL^*+1)^{1/k}\).
Take the limit \(k\to \infty\), in which \((kL^*+1)^{1/k}\to 1\), to obtain \(G(f, 1/2)\leq 1\), i.e. the Kraft inequality. The countably infinite case follows from the arbitrariness of the finite subset \(\mathcal A'\subset \mathcal A\). For the converse, without loss of generality relabel \(\mathcal A\) to \(\mathbb N\) and assume \(1\leq L(1)\leq L(2) \leq \cdots\). Given a set of lengths \(\{L(a):a\in \mathcal A\}\) satisfying \(\sum_{a\in \mathcal A} 2^{-L(a)} \leq 1\), define for each \(j\) \[ a_j = \sum_{k=1}^{j-1} 2^{-L(k)} \leq 1, \quad a_1 = 0 \] Define \(f(j)\) as the first \(L(j)\) bits in the binary expansion of \(a_j\). In other words, to encode \(j\):
- Start with a string of \(L(j)\) zeros.
- Add \(2^{-L(k)}\) in binary for \(k = 1, \cdots, j-1\); when the lengths are strictly increasing, this simply flips positions \(L(1), \cdots, L(j-1)\) to \(1\). The result is the codeword.
Prefix-freeness follows because \(a_{j'} \geq a_j + 2^{-L(j)}\) for every \(j' > j\), so the expansions of \(a_{j'}\) and \(a_j\) must differ within the first \(L(j)\) bits. A runnable sketch of this construction follows below.
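The converse construction is short enough to implement directly. A minimal sketch, assuming the lengths are sorted ascending and satisfy the Kraft inequality (function name is ours):

```python
def prefix_code_from_lengths(lengths):
    """Kraft converse: f(j) = first L(j) bits of a_j = sum_{k<j} 2^-L(k)."""
    assert lengths == sorted(lengths)
    assert sum(2.0 ** -l for l in lengths) <= 1
    code, a = [], 0.0
    for l in lengths:
        code.append(format(int(a * 2 ** l), f'0{l}b'))  # a truncated to l bits
        a += 2.0 ** -l          # exact: sums of powers of 2 in binary floats
    return code

print(prefix_code_from_lengths([1, 2, 3, 3]))   # ['0', '10', '110', '111']
```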
In particular, the Kraft inequality implies that the sorted prefix-code lengths can be at best \(1, 2, 3, \cdots\) (a length assignment that saturates the inequality). One can thus formulate the optimal-length prefix code as the following integer-programming problem: \[ L^*(X) = \min_{L:\mathcal A\to \mathbb N} \sum_{a\in \mathcal A} P_X(a) L(a) \text{ s.t. } \sum_{a\in \mathcal A}2^{-L(a)} \leq 1 \tag{10.1} \]
We now bound the length of the optimal prefix code; by Kraft-McMillan, this also settles uniquely decodable codes.
Theorem 10.4 (length of optimal prefix code) \(H(X) \leq L^*(X) \leq H(X)+1\) in units of bits.
Proof
Achievability is established by the Shannon code with \[ l_a = \left\lceil \log_2 \dfrac 1 {P_X(a)}\right\rceil \implies \sum_{a\in \mathcal A} 2^{-l_a} \leq \sum_{a\in \mathcal A} P_X(a) = 1 \] which satisfies Kraft and achieves \(\mathbb E[l] \leq H(X)+1\) since \(l_a < \log_2 \frac 1 {P_X(a)} + 1\). For the converse, given an encoding \(f\) with lengths \(\{l_a\}\) satisfying the Kraft inequality, define a normalized probability measure \(Q_X(a) = 2^{-l_a} / Z\) with \(Z=\sum_a 2^{-l_a} \leq 1\). Then \[\begin{align} \mathbb E_P[l\circ f] - H(X) &= \sum P_X(a) l_a + \sum P_X(a) \log_2 P_X(a) \\ &= \sum P_X(a) \log_2 \left[2^{l_a} P_X(a)\right] = \sum P_X(a) \log_2 \dfrac{P_X(a)}{2^{-l_a}} \\ &= \sum P_X(a) \log_2 \dfrac{P_X(a)}{Q_X(a)} - \log_2 Z \\ &= D(P\|Q) + \underbrace{(-\log_2 Z)}_{\geq 0} \geq 0 \end{align}\]

The optimal prefix code is achieved by the Huffman code. Given the PMF \(P_X(a)\), \(a\in\mathcal A\): at each step, choose the two least probable symbols in the current alphabet, merge them into a super-symbol with the combined probability, then update the tree representation with the super-symbol at the node and the merged symbols as leaves. Repeat for \(|\mathcal A| - 1\) steps.
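A compact sketch of Huffman's merging procedure, tracking only code lengths (which determine the code up to relabeling); the helper name is ours:

```python
import heapq
from itertools import count

def huffman_lengths(pmf):
    """Repeatedly merge the two least probable (super-)symbols;
    returns the Huffman code length of each input symbol."""
    tie = count()                 # tie-breaker: never compare the list payloads
    heap = [(p, next(tie), [a]) for a, p in enumerate(pmf)]
    heapq.heapify(heap)
    lengths = [0] * len(pmf)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for a in s1 + s2:         # every symbol under the merge gains one bit
            lengths[a] += 1
        heapq.heappush(heap, (p1 + p2, next(tie), s1 + s2))
    return lengths

# dyadic PMF: Huffman meets the entropy exactly, with lengths 1, 2, 3, 3
assert sorted(huffman_lengths([0.5, 0.25, 0.125, 0.125])) == [1, 2, 3, 3]
```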
Fixed-length source coding
To be precise, we consider fixed-length almost-lossless compression. A single emission of the source is compressed to \(\{0, 1\}^k\) instead of \(\{0, 1\}^*\).
Definition 10.7 (fixed-length almost-lossless compression) A compressor-decompressor pair \(f:\mathcal X\to \{0, 1\}^k\), \(g:\{0, 1\}^k\to\mathcal X\cup \{e\}\) is a fixed-length almost-lossless \((k, \epsilon)\) source code for \(X\) if \[ (g\circ f)(x) \in \{x, e\}\ \ \forall x \quad\text{and}\quad \mathrm{Pr}[(g\circ f)(X) = e] \leq \epsilon \] Fixing \(X, k\), the fundamental limit of fixed-length compression is \[ \epsilon^*(X, k) = \inf\{\epsilon: \exists (k, \epsilon)\text{-code for } X\} \]
Recall the optimal single-shot (variable-length) lossless compressor length \(L\) in definition 10.5.
Theorem 10.5 (fundamental limit of fixed-length compression) Assuming the alphabet is \(\mathbb N = \{1, 2, \cdots\}\), sorted so the PMF is decreasing, \[ \epsilon^*(X, k) = \mathrm{Pr}[L(X) \geq k] = \sum_{x\geq 2^k} P_X(x) \]
Proof
The compressor encodes the \(2^k - 1\) most probable symbols losslessly with distinct strings and maps every remaining symbol to the one reserved string (say \(1^k\)) that the decompressor reports as the error \(e\); this choice minimizes the error probability.

Corollary 10.3 (Shannon's noiseless source coding theorem) Given an i.i.d. discrete source \(S^n\), for any rate \(R>0\) we have \[ \lim_{n\to \infty} \epsilon^*(S^n, nR) = \begin{cases} 0 & R > H(S) \\ 1 & R < H(S) \end{cases} \] If the varentropy \(V(S) < \infty\), then for \(Q\) being one minus the CDF of \(\mathcal N(0, 1)\) and any \(\gamma\in \mathbb R\): \[ \lim_{n\to \infty} \epsilon^*(S^n, nH(S)+ \sqrt{nV(S)}\,\gamma) = Q(\gamma) \]
The second claim follows from the asymptotic normality of the optimal encoding length (corollary 10.2). This implies that if we allow a non-vanishing error \(\epsilon\), then compression is possible down to \[ k = nH(S) + \sqrt{nV(S)}\, Q^{-1}(\epsilon) \]
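Theorem 10.5 makes \(\epsilon^*\) trivial to evaluate for any finite PMF. A minimal sketch (the PMF is an illustrative choice of ours):

```python
def eps_star(pmf, k):
    """Optimal error of a fixed-length (k, eps)-code: the probability
    mass outside the 2^k - 1 most probable symbols (theorem 10.5)."""
    pmf = sorted(pmf, reverse=True)
    return sum(pmf[2 ** k - 1:])

pmf = [0.4, 0.3, 0.15, 0.1, 0.05]
print([eps_star(pmf, k) for k in (1, 2, 3)])   # ~ [0.6, 0.15, 0.0]
```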
Theorem 10.6 (finite block-length bounds) Placeholder.
Proposition 10.3 (asymptotic equipartition property (AEP)) Given i.i.d. \(S^n\) and \(\delta>0\), define the entropy \(\delta\)-typical set \[ T_n^\delta = \left\{ s^n : \left| \dfrac 1 n \log_2 \dfrac 1 {P_{S^n}(s^n)} - H(S) \right| \leq \delta \right\} \] Then \(\mathrm{Pr}[S^n\in T_n^\delta]\to 1\) as \(n\to\infty\), and \(|T_n^\delta| \leq 2^{n(H(S)+\delta)}\) (entropy in bits).
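A Monte-Carlo sketch of the AEP for a Bernoulli(0.3) source: the empirical fraction of sequences falling in \(T_n^\delta\) rises toward 1 as \(n\) grows (all parameters are illustrative choices of ours):

```python
import math, random

random.seed(0)
p = 0.3
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def in_typical_set(n, delta):
    ones = sum(random.random() < p for _ in range(n))
    # (1/n) log2 1/P(s^n) for an i.i.d. Bernoulli sequence with `ones` ones
    rate = (ones * math.log2(1 / p) + (n - ones) * math.log2(1 / (1 - p))) / n
    return abs(rate - H) <= delta

for n in (100, 1_000, 10_000):
    hits = sum(in_typical_set(n, delta=0.05) for _ in range(200))
    print(n, hits / 200)   # fraction in the typical set; tends to 1
```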