10 Lossless compression
This section corresponds to Chapter 10 of the book. Important takeaways:
- Regimes for lossless data compression:
- Single-shot: encode a symbol, then decode a symbol. For single-shot encoding, one may consider grouping several symbols into a super-symbol; a special case of this is when the source is i.i.d.
- Code extension (definition 10.6): codewords are concatenated.
- The compression length of the optimal single-shot lossless compressor is bounded by the entropy density (proposition 10.2), and the compressor is optimal in the strong sense of stochastic dominance (proposition 10.1).
- Kraft-McMillan (theorem 10.3): every uniquely decodable code satisfies the Kraft inequality, and any lengths satisfying it are realized by a constructively built prefix-free code. In particular, in terms of encoding length, uniquely decodable codes offer no advantage over prefix codes.
- Theorem 10.4: bound on optimal prefix code.
- Main inequality, combining theorems 10.1 and 10.4:
\[
H(X) - \log_2[e(H(X)+1)] \leq \mathbb E[L_{\mathrm{single-shot}}^*(X)]
\leq H(X) \leq \mathbb E[L_{\mathrm{prefix}}^*(X)] \leq H(X)+1
\]
- In fixed-length almost-lossless compression (definition 10.7), a single emission of the source is compressed to \(\{0, 1\}^k\) instead of \(\{0, 1\}^*\).
- Shannon’s noiseless coding theorem (corollary 10.3): holds in the limit of asymptotically many copies \(S^n\) of an i.i.d. source, compressed to \(nR\) bits at rate \(R\).
- Asymptotic equipartition property (proposition 10.3): for asymptotically many copies of an i.i.d. source, the dominant occurrences are (nearly) equiprobable.
To motivate later studies, we note two limitations of Huffman codes (the optimal prefix-free variable-length code):
- It requires knowing the source distribution; this is addressed by the study of universal compression.
- Lempel-Ziv is universal, has low complexity, and is provably optimal for all ergodic sources.
- Often \(H(S^n)\ll nH(S)\) (e.g. not all arrangements of letters are words, and sentences are subject to semantic constraints). Thus applying Huffman to a block \((S_1, \cdots, S_n)\) is much more rate-efficient; however, this construction has complexity exponential in \(n\).
- Arithmetic coding: complexity linear in the block-length while attaining \(H(S_1^n)\) length.
Variable-length source coding
Fixing a source domain \(\mathcal X\), we can define:
- A compressor \(f:\mathcal X\to \{0, 1\}^*\).
- A decompressor \(g:\{0, 1\}^*\to \mathcal X\).
The single-shot in the definition below refers to the fact that we compress and decompress one symbol at a time, rather than compressing a stream of symbols jointly; thus we need not impose any constraints on \(f\), such as prefix-freeness or unique decodability.
Definition 10.1 (variable-length single-shot lossless compression) A pair \((f, g)\) is a variable-length single-shot lossless compressor if:
- \(f\) is of the type \(f:\mathcal X\to \{0, 1\}^*\). The image of \(f\) is a codebook, and an element \(f(x)\) in the image of \(f\) is a codeword.
- The decompressor \(g\) satisfies \(g\circ f = 1_{\mathcal X}\).
Two immediate results of this definition:
- Lossless compression is only possible for discrete \(X\).
- Without loss of generality, we can sort \(\mathcal X = \{1, 2, \cdots\}\) so that \(P(1)\geq P(2)\geq \cdots\).
We begin with several definitions.
Definition 10.2 (entropy density) Given a finite alphabet and fixing binary units, the entropy density function of a source \(X\) is \[ i_X(a) = \log_2 \dfrac 1 {P_X(a)}, \qquad\text{so that}\qquad H(X) = \mathbb E[i_X(X)] \]
Definition 10.3 (varentropy) The varentropy \(V(S)\) of a source \(S\) is \[ V(S) = \mathrm{Var}[i_S] \] Note that \(H(S)^2 + V(S) = \mathbb E[i_S^2]\).
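As a quick numerical illustration of these two definitions, here is a minimal Python sketch (the helper `entropy_stats` is ours, not from the book) that computes \(H\), \(V\), and \(\mathbb E[i_S^2]\) for a PMF and checks the identity \(H^2 + V = \mathbb E[i^2]\):

```python
import math

def entropy_stats(pmf):
    """Entropy H, varentropy V, and E[i^2] of a PMF, in bits."""
    i = [math.log2(1 / p) for p in pmf]            # entropy density i(a)
    H = sum(p * x for p, x in zip(pmf, i))         # H = E[i]
    Ei2 = sum(p * x**2 for p, x in zip(pmf, i))    # E[i^2]
    return H, Ei2 - H**2, Ei2                      # V = E[i^2] - H^2

H, V, Ei2 = entropy_stats([0.5, 0.25, 0.125, 0.125])
assert abs(H**2 + V - Ei2) < 1e-12                 # H^2 + V = E[i^2]
```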
Definition 10.4 (stochastic dominance) Given two real-valued stochastic variables \(X, Y\), we say that \(X\preceq Y\), or \(Y\) stochastically dominates \(X\), if the CDF of \(X\) dominates that of \(Y\) everywhere; this formalizes the notion that \(X\) is concentrated on smaller values.
Definition 10.5 (optimal single-shot lossless compressor) For a down-sorted PMF \(P_X\), the optimal single-shot lossless compressor \(f^*\) assigns strings of nondecreasing length \((\varnothing, 0, 1, 00, 01, 10, 11, 000, \cdots)\) to the symbols \(j\in \mathcal X=\{1, 2, \cdots\}\). In particular, \(l(f^*(j)) = \lfloor \log_2 j\rfloor\), where \(l\) is the string-length function. Let \(L(j) = \lfloor \log_2 j \rfloor\) denote the compression length of input \(j\).
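The codeword assignment has a closed form: \(f^*(j)\) is the binary expansion of \(j\) with its leading 1 removed. A minimal sketch (the function name `f_star` is ours):

```python
import math

def f_star(j: int) -> str:
    """Optimal single-shot codeword for symbol j >= 1 (PMF sorted
    descending): the binary expansion of j with the leading 1 stripped."""
    assert j >= 1
    return bin(j)[3:]   # bin(j) = '0b1...'; drop '0b' and the leading '1'

# symbols 1..7 receive '', '0', '1', '00', '01', '10', '11'
for j in range(1, 8):
    assert len(f_star(j)) == math.floor(math.log2(j))   # l(f*(j)) = floor(log2 j)
```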
An immediate corollary of the following result is that \(\mathbb E[(l\circ f^*)(X)] \leq \mathbb E[(l\circ f)(X)]\) for any compressor \(f\) and any source \(X\).
Proposition 10.1 (optimality of optimal encoding) For any lossless single-shot compressor \(f\) and source distribution \(X\), \(l\circ f^*\preceq l\circ f\).
Proof
Let \(A_k = \{x:l(f(x)) \leq k\}\) denote the set of inputs which are encoded to length at most \(k\). Since the encoding is lossless (injective) and there are at most \(2^{k+1}-1\) binary strings of length at most \(k\), we have \[ |A_k| \leq \sum_{j=0}^k 2^j = 2^{k+1}-1 = |A_k^*| \] The stochastic dominance follows from the fact that \(f^*\) sorts the PMF in descending order: \[ \mathrm{Pr}[(l\circ f)(X)\leq k] = \sum_{x\in A_k} P_X(x) \leq \sum_{x\in A^*_k} P_X(x) = \mathrm{Pr}[(l\circ f^*)(X) \leq k] \]

Proposition 10.2 (entropy density dominates encoding length) When \(X\) is sorted (as for the optimal compressor), \(L\preceq i_X\).
Proof
Since the PMF is sorted in descending order, each of the symbols \(1, \cdots, x\) has probability at least \(P_X(x)\), so \(x P_X(x) \leq 1\), i.e. \(P_X(x)\leq 1/x\). Then \(L(x)\leq i_X(x)\) pointwise, hence \(L\preceq i_X\), based on \[ L(x) = \lfloor \log_2 x\rfloor \leq \log_2 x \leq \log_2 \dfrac 1 {P_X(x)} = i_X(x) \]

A coding theorem is one that relates an operational, a priori intractable compression quantity (e.g. \(\mathbb E[L(X)]\)) to an information measure (e.g. \(H(X)\)).
Lemma 10.1 \(H(X|L=k)\leq k\), hence \(H(X|L)\leq \mathbb E[L]\). This follows since \(X|L=k\) can take at most \(2^k\) values, and the uniform distribution on \(2^k\) values has entropy \(k\).
Lemma 10.2 For \(X\) taking values on \(\mathbb N=\{1, 2, \cdots\}\) and \(\mathbb E[X]<\infty\), we have \[ H(X) \leq h\left(\mathbb E[X]^{-1}\right)\mathbb E[X] \] The inequality is saturated by the geometric distribution.
Proof
Let \(p = 1 / \mathbb E[X]\) and note that \(\mathbb E[Xp] = 1\) and \(\mathbb E[X\bar p] = \mathbb E[X(1-p)] = \mathbb E[X - 1]\), then \[\begin{align} H(X) - \mathbb E[X] h(p) &= \mathbb E\left[-\log p(X) - X[-p\log p - \bar p\log \bar p]\right] \\ &= \mathbb E\left[-\log p(X) + Xp \log p + X\bar p\log \bar p\right] \\ &= \mathbb E\left[-\log p(X) + (X-1) \log \bar p + \log p\right] \\ -D(P_X\|\mathrm{Geom}_p) &= -\mathbb E\left[\log \dfrac{p(X)}{p{\bar p}^{X-1}}\right] = \mathbb E\left[-\log p(X) + (X-1) \log \bar p + \log p\right] \end{align}\] hence \(H(X) - \mathbb E[X]h(p) = -D(P_X\|\mathrm{Geom}_p)\), i.e. \[ H(X) = \mathbb E[X] h(p) - \underbrace{D(P_X\|\mathrm{Geom}_p)}_{\geq 0} \leq \mathbb E[X] h(p) \] with equality iff \(P_X = \mathrm{Geom}_p\).

Theorem 10.1 (average length of optimal encoding) \(H(X) - \log_2[e(H(X)+1)] \leq \mathbb E[L(X)] \leq H(X)\)
Proof
Assume \(X\) sorted w.l.o.g. In light of proposition 10.2, \(L\preceq i_X\implies \mathbb E[L] \leq H(X)\), which is the upper bound. For the lower bound, recall from lemma 10.1 that \(H(X|L)\leq \mathbb E[L]\), and apply lemma 10.2 to \(L+1\) (which takes values in \(\{1, 2, \cdots\}\)):
\[\begin{align} H(X) &= H(X, L) = H(X|L) + H(L) \\ &\leq \mathbb E[L] + (\mathbb E[L] + 1)\, h\left(\dfrac 1 {1+\mathbb E[L]}\right) \\ &= \mathbb E[L] + \log_2(1+\mathbb E[L]) + \mathbb E[L] \log_2 \left(1 + \dfrac 1 {\mathbb E[L]}\right) \\ &\leq \mathbb E[L] + \log_2(1+\mathbb E[L]) + \log_2 e \leq \mathbb E[L] + \log_2 [e(1+H(X))] \end{align}\] In the last two steps, we used \(x\log_2(1+1/x) \leq \log_2 e\) and \(\mathbb E[L]\leq H(X)\).

Theorem 10.2 (code length distribution of optimal compression) \(\forall \tau>0, k\geq 0\): \[ \mathrm{Pr}\left[i_X(X)\leq k \right] \leq \mathrm{Pr}[L(X) \leq k] \leq \mathrm{Pr}\left[i_X(X) \leq k + \tau\right] + 2^{1-\tau} \] In other words, \(i_X\) stochastically dominates \(L\), but \(L\) cannot be much smaller: the distribution of the optimal code length is pinned to that of the entropy density up to a slack of \(\tau\) bits.
Proof
- Lower bound (achievability, or “compression length is smaller than”): proposition 10.2.
- Upper bound (converse, or “compression length cannot be much smaller than”): \[\begin{align} \mathrm{Pr}[L\leq k] &= \mathrm{Pr}[L \leq k, i_X(X)\leq k+\tau] + \mathrm{Pr}[L\leq k, i_X(X)> k+\tau] \\ &\leq \mathrm{Pr}[i_X(X)\leq k+\tau] + (2^{k+1}-1) \sup_{x: i_X(x)>k+\tau} P_X(x) \\ &\leq \mathrm{Pr}[i_X(X)\leq k + \tau] + 2^{1-\tau} \end{align}\] To bound the first term, we used \(\mathrm{Pr}[A, B]\leq \mathrm{Pr}[B]\). For the second, the event \(\{L\leq k\}\) contains at most \(2^{k+1}-1\) atoms, each of mass at most \(2^{-k-\tau}\).
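The sandwich can be checked exactly on a small example. The following sketch (the PMF is an illustrative choice of ours) verifies both inequalities of theorem 10.2 for several values of \(k\) and \(\tau\):

```python
import math

pmf = [0.4, 0.3, 0.2, 0.1]                    # sorted PMF on symbols 1..4
i_x = [math.log2(1 / p) for p in pmf]         # entropy density
L   = [math.floor(math.log2(j)) for j in range(1, 5)]   # optimal lengths

def prob(flags):                              # Pr over symbols where flag holds
    return sum(p for p, ok in zip(pmf, flags) if ok)

for k in range(4):
    for tau in (0.5, 1.0, 2.0):
        lo  = prob([x <= k for x in i_x])               # Pr[i_X <= k]
        mid = prob([l <= k for l in L])                 # Pr[L <= k]
        hi  = prob([x <= k + tau for x in i_x]) + 2 ** (1 - tau)
        assert lo <= mid <= hi
```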
We can consider the source as a random process \((S_1, S_2, \cdots)\) and group the first \(n\) symbols (the blocklength) into a supersymbol \(S^n\).
Corollary 10.1 (asymptotic optimal-coding properties) Let \((S_1, S_2, \cdots)\) be a random process (such that each \(S^n\) is sorted) and \(U, V\) real-valued r.v.s; then \[\begin{align} \dfrac 1 n i_{S^n}(S^n)\xrightarrow{d} U &\iff \dfrac 1 n L(S^n)\xrightarrow{d} U \\ \dfrac 1 {\sqrt n} \left[ i_{S^n}(S^n) - H(S^n) \right] \xrightarrow{d} V & \iff \dfrac 1 {\sqrt n} \left[ L(S^n) - H(S^n) \right] \xrightarrow{d} V \end{align}\]
Proof
To obtain the first result, apply theorem 10.2 with \(k=un\) and \(\tau = \sqrt n\). For the second, apply it with \(k=H(S^n) + \sqrt n u\) and \(\tau = n^{1/4}\); writing \(i_X\) for \(i_{S^n}(S^n)\) and \(H\) for \(H(S^n)\): \[\begin{align} \mathrm{Pr}[i_X \leq H + \sqrt n u] &\leq \mathrm{Pr}[L \leq H + \sqrt n u] \leq \mathrm{Pr}[i_X \leq H + \sqrt n u + n^{1/4}] + 2^{1-n^{1/4}} \\ \mathrm{Pr}\left[ \dfrac{i_X - H} {\sqrt n} \leq u \right] &\leq \mathrm{Pr}\left[ \dfrac{L-H}{\sqrt n} \leq u \right] \leq \mathrm{Pr}\left[ \dfrac{i_X - H}{\sqrt n} \leq u + n^{-1/4} \right] + 2^{1-n^{1/4}} \end{align}\] and the residual terms \(n^{-1/4}\) and \(2^{1-n^{1/4}}\) vanish as \(n\to\infty\).

Corollary 10.2 (i.i.d. asymptotics) When the \(S_j\) are i.i.d., we have
- \(L(S^n)/n \xrightarrow P H(S)\).
- If the varentropy \(V(S)<\infty\), then by the CLT the normalized entropy density is asymptotically Gaussian, and by corollary 10.1 so is the normalized code length: \[ \dfrac 1 {\sqrt {n V(S)}} [L(S^n) - nH(S)] \xrightarrow d \mathcal N(0, 1) \] equivalently, in shorthand, \(L(S^n) \approx nH(S) + \sqrt{nV(S)}\, \mathcal N(0, 1)\) (see the sketch below).
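A Monte-Carlo sketch of the Gaussian limit for a Bernoulli source; we substitute the entropy density for \(L(S^n)\), which is legitimate in distribution by corollary 10.1 (all parameters are illustrative choices of ours):

```python
import math, random

random.seed(0)
p, n, trials = 0.11, 2_000, 1_000
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)   # entropy, bits
V = p * (1 - p) * math.log2((1 - p) / p) ** 2        # varentropy of Ber(p)

def normalized_length():
    ones = sum(random.random() < p for _ in range(n))
    # i_{S^n}(S^n) = sum_j i_S(S_j) for an i.i.d. Bernoulli block
    i = ones * math.log2(1 / p) + (n - ones) * math.log2(1 / (1 - p))
    return (i - n * H) / math.sqrt(n * V)            # should be ~ N(0, 1)

samples = sorted(normalized_length() for _ in range(trials))
print(samples[trials // 2])   # sample median; close to 0 for N(0, 1)
```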
Uniquely decodable codes
We now focus on compressors whose output stream can be uniquely decoded. We consider a finite alphabet \(\mathcal A\) and let \(\mathcal A^*=\bigcup_{j=1}^\infty \mathcal A^j\).
Definition 10.6 (code extension, unique decodability, prefix code) The symbol-by-symbol extension of \(f:\mathcal A\to \{0, 1\}^*\) to \(\mathcal A^*\to \{0, 1\}^*\) is obtained by concatenating the outputs. A compressor \(f\) is uniquely decodable if its extension is injective; it is a prefix(-free) code if no codeword is a prefix of another.
Lossless codes \(\supsetneq\) uniquely decodable codes \(\supsetneq\) prefix codes. Consider
- \(f(a)=0, f(b)=1, f(c)=10\): lossless but not uniquely decodable.
- \(f(a)=0, f(b)=01, f(c)=011, f(d)=0111\): uniquely decodable but not prefix; the decoder must wait for the next \(0\) (the start of the following codeword) to know where a token ends.
- \(f(a)=0, f(b)=10, f(c)=11\): prefix code.
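These three examples can be checked mechanically. A small sketch (helper names are ours) that tests prefix-freeness and evaluates the Kraft sum, foreshadowing theorem 10.3 below:

```python
def is_prefix_free(code):
    """True iff no codeword is a proper prefix of another."""
    w = list(code.values())
    return not any(u != v and v.startswith(u) for u in w for v in w)

def kraft_sum(code):
    return sum(2.0 ** -len(w) for w in code.values())

c1 = {'a': '0', 'b': '1', 'c': '10'}                  # lossless, not UD
c2 = {'a': '0', 'b': '01', 'c': '011', 'd': '0111'}   # UD, not prefix
c3 = {'a': '0', 'b': '10', 'c': '11'}                 # prefix code

assert not is_prefix_free(c1) and not is_prefix_free(c2) and is_prefix_free(c3)
assert kraft_sum(c1) > 1       # 1.25: consistent with c1 not being UD
assert kraft_sum(c2) <= 1      # 0.9375: Kraft holds for every UD code
assert kraft_sum(c3) == 1.0    # saturates the Kraft inequality
```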
Prefix codes are in bijective correspondence with binary trees, with codewords as the leaves.
Theorem 10.3 (Kraft-McMillan) Let \(f:\mathcal A\to \{0, 1\}^*\) be uniquely decodable and let \(L=l\circ f\) be its code length function. Then \(f\) satisfies the Kraft inequality \[ \sum_{a\in \mathcal A} 2^{-L(a)} \leq 1 \] (with \(2\mapsto D\) for codes over a size-\(D\) code alphabet). Conversely, for any set of code lengths \(\{L(a):a\in \mathcal A\}\) satisfying the Kraft inequality, there exists an efficiently computable prefix code with length function \(L\).
Proof
Let \(f\) be a uniquely decodable code. Assume \(\mathcal A\) finite and define \(L^*=\max_{a\in \mathcal A} L(a)\). Let \[ G(f, z) = \sum_{a\in \mathcal A} z^{L(a)} = \sum_{l=0}^{L^*} A(f, l) z^l \] where \(A(f, l)\) is the number of codewords of length \(l\) in \(f\). Consider the \(k\)-fold extension \(f^k\) for \(k\geq 1\) and note that \[ G(f^k, z) = \sum_{a^k \in \mathcal A^k} z^{L(a_1)+\cdots + L(a_k)} = G(f, z)^k = \sum_{l=0}^{kL^*} A(f^k, l)z^l \] The key step here is the combinatorial identity \(G(f^k, z) = G(f, z)^k\). Since \(f^k\) is injective (unique decodability), we have \(A(f^k, l)\leq 2^l\) for every \(l\), hence \(G(f^k, 1/2) \leq kL^*+1\) for all \(k\), so \(G(f, 1/2) = G(f^k, 1/2)^{1/k} \leq (kL^*+1)^{1/k}\).
Take the limit \(k\to \infty\), in which \((kL^*+1)^{1/k}\to 1\), to obtain \(G(f, 1/2)\leq 1\), i.e. the Kraft inequality. The countably infinite case follows from the arbitrariness of the finite subset \(\mathcal A'\subset \mathcal A\). For the converse, without loss of generality relabel \(\mathcal A\) to \(\mathbb N\) and assume \(1\leq L(1)\leq L(2) \leq \cdots\). Given a set of lengths \(\{L(a):a\in \mathcal A\}\) satisfying \(\sum_{a\in \mathcal A} 2^{-L(a)} \leq 1\), define for each \(j\) \[ a_j = \sum_{k=1}^{j-1} 2^{-L(k)} \leq 1, \quad a_1 = 0 \] Define \(f(j)\) as the first \(L(j)\) bits in the binary expansion of \(a_j\). In other words, to encode \(j\):
- Start with a string of \(L(j)\) zeros.
- Add \(2^{-L(k)}\) in binary for \(k = 1, \cdots, j-1\); when the lengths are strictly increasing, this simply flips positions \(L(1), \cdots, L(j-1)\) to \(1\). The result is the codeword.
Prefix-freeness follows because \(a_{j'} \geq a_j + 2^{-L(j)}\) for every \(j' > j\), so the expansions of \(a_{j'}\) and \(a_j\) must differ within the first \(L(j)\) bits. A runnable sketch of this construction follows below.
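The converse construction is short enough to implement directly. A minimal sketch, assuming the lengths are sorted ascending and satisfy the Kraft inequality (function name is ours):

```python
def prefix_code_from_lengths(lengths):
    """Kraft converse: f(j) = first L(j) bits of a_j = sum_{k<j} 2^-L(k)."""
    assert lengths == sorted(lengths)
    assert sum(2.0 ** -l for l in lengths) <= 1
    code, a = [], 0.0
    for l in lengths:
        code.append(format(int(a * 2 ** l), f'0{l}b'))  # a truncated to l bits
        a += 2.0 ** -l          # exact: sums of powers of 2 in binary floats
    return code

print(prefix_code_from_lengths([1, 2, 3, 3]))   # ['0', '10', '110', '111']
```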
In particular, the Kraft inequality implies that the sorted prefix-code lengths can be at best \(1, 2, 3, \cdots\) (a length assignment that saturates the inequality). One can thus formulate the optimal-length prefix code as the following integer-programming problem: \[ L^*(X) = \min_{L:\mathcal A\to \mathbb N} \sum_{a\in \mathcal A} P_X(a) L(a) \text{ s.t. } \sum_{a\in \mathcal A}2^{-L(a)} \leq 1 \tag{10.1} \]
We now bound the length of the optimal prefix code; by Kraft-McMillan, this also settles uniquely decodable codes.
Theorem 10.4 (length of optimal prefix code) \(H(X) \leq L^*(X) \leq H(X)+1\) in units of bits.
Proof
Achievability is established by the Shannon code with \[ l_a = \left\lceil \log_2 \dfrac 1 {P_X(a)}\right\rceil \implies \sum_{a\in \mathcal A} 2^{-l_a} \leq \sum_{a\in \mathcal A} P_X(a) = 1 \] which satisfies Kraft and achieves \(\mathbb E[l] \leq H(X)+1\) since \(l_a < \log_2 \frac 1 {P_X(a)} + 1\). For the converse, given an encoding \(f\) with lengths \(\{l_a\}\) satisfying the Kraft inequality, define a normalized probability measure \(Q_X(a) = 2^{-l_a} / Z\) with \(Z=\sum_a 2^{-l_a} \leq 1\). Then \[\begin{align} \mathbb E_P[l\circ f] - H(X) &= \sum P_X(a) l_a + \sum P_X(a) \log_2 P_X(a) \\ &= \sum P_X(a) \log_2 \left[2^{l_a} P_X(a)\right] = \sum P_X(a) \log_2 \dfrac{P_X(a)}{2^{-l_a}} \\ &= \sum P_X(a) \log_2 \dfrac{P_X(a)}{Q_X(a)} - \log_2 Z \\ &= D(P\|Q) + \underbrace{(-\log_2 Z)}_{\geq 0} \geq 0 \end{align}\]

The optimal prefix code is achieved by the Huffman code. Given the PMF \(P_X(a)\), \(a\in\mathcal A\): at each step, choose the two least probable symbols in the current alphabet, merge them into a super-symbol with the combined probability, then update the tree representation with the super-symbol at the node and the merged symbols as leaves. Repeat for \(|\mathcal A| - 1\) steps.
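A compact sketch of Huffman's merging procedure, tracking only code lengths (which determine the code up to relabeling); the helper name is ours:

```python
import heapq
from itertools import count

def huffman_lengths(pmf):
    """Repeatedly merge the two least probable (super-)symbols;
    returns the Huffman code length of each input symbol."""
    tie = count()                 # tie-breaker: never compare the list payloads
    heap = [(p, next(tie), [a]) for a, p in enumerate(pmf)]
    heapq.heapify(heap)
    lengths = [0] * len(pmf)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for a in s1 + s2:         # every symbol under the merge gains one bit
            lengths[a] += 1
        heapq.heappush(heap, (p1 + p2, next(tie), s1 + s2))
    return lengths

# dyadic PMF: Huffman meets the entropy exactly, with lengths 1, 2, 3, 3
assert sorted(huffman_lengths([0.5, 0.25, 0.125, 0.125])) == [1, 2, 3, 3]
```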
Fixed-length source coding
To be precise, we consider fixed-length almost-lossless compression. A single emission of the source is compressed to \(\{0, 1\}^k\) instead of \(\{0, 1\}^*\).
Definition 10.7 (fixed-length almost-lossless compression) A compressor-decompressor pair \(f:\mathcal X\to \{0, 1\}^k\), \(g:\{0, 1\}^k\to\mathcal X\cup \{e\}\) is a fixed-length almost-lossless \((k, \epsilon)\) source code for \(X\) if \[ (g\circ f)(x) \in \{x, e\}\ \ \forall x \quad\text{and}\quad \mathrm{Pr}[(g\circ f)(X) = e] \leq \epsilon \] Fixing \(X, k\), the fundamental limit of fixed-length compression is \[ \epsilon^*(X, k) = \inf\{\epsilon: \exists (k, \epsilon)\text{-code for } X\} \]
Recall the optimal single-shot (variable-length) lossless compressor length \(L\) in definition 10.5.
Theorem 10.5 (fundamental limit of fixed-length compression) Assuming the alphabet is \(\mathbb N = \{1, 2, \cdots\}\), sorted so the PMF is decreasing, \[ \epsilon^*(X, k) = \mathrm{Pr}[L(X) \geq k] = \sum_{x\geq 2^k} P_X(x) \]
Proof
The compressor encodes the \(2^k - 1\) most probable symbols losslessly with distinct strings and maps every remaining symbol to the one reserved string (say \(1^k\)) that the decompressor reports as the error \(e\); this choice minimizes the error probability.

Corollary 10.3 (Shannon's noiseless source coding theorem) Given an i.i.d. discrete source \(S^n\), for any rate \(R>0\) we have \[ \lim_{n\to \infty} \epsilon^*(S^n, nR) = \begin{cases} 0 & R > H(S) \\ 1 & R < H(S) \end{cases} \] If the varentropy \(V(S) < \infty\), then for \(Q\) being one minus the CDF of \(\mathcal N(0, 1)\) and any \(\gamma\in \mathbb R\): \[ \lim_{n\to \infty} \epsilon^*(S^n, nH(S)+ \sqrt{nV(S)}\,\gamma) = Q(\gamma) \]
The second claim follows from the asymptotic normality of the optimal encoding length (corollary 10.2). This implies that if we allow a non-vanishing error \(\epsilon\), then compression is possible down to \[ k = nH(S) + \sqrt{nV(S)}\, Q^{-1}(\epsilon) \]
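Theorem 10.5 makes \(\epsilon^*\) trivial to evaluate for any finite PMF. A minimal sketch (the PMF is an illustrative choice of ours):

```python
def eps_star(pmf, k):
    """Optimal error of a fixed-length (k, eps)-code: the probability
    mass outside the 2^k - 1 most probable symbols (theorem 10.5)."""
    pmf = sorted(pmf, reverse=True)
    return sum(pmf[2 ** k - 1:])

pmf = [0.4, 0.3, 0.15, 0.1, 0.05]
print([eps_star(pmf, k) for k in (1, 2, 3)])   # ~ [0.6, 0.15, 0.0]
```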
Theorem 10.6 (finite block-length bounds) Placeholder.
Proposition 10.3 (asymptotic equipartition property (AEP)) Given i.i.d. \(S^n\) and \(\delta>0\), define the entropy \(\delta\)-typical set \[ T_n^\delta = \left\{ s^n : \left| \dfrac 1 n \log_2 \dfrac 1 {P_{S^n}(s^n)} - H(S) \right| \leq \delta \right\} \] Then \(\mathrm{Pr}[S^n\in T_n^\delta]\to 1\) as \(n\to\infty\), and \(|T_n^\delta| \leq 2^{n(H(S)+\delta)}\) (entropy in bits).
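A Monte-Carlo sketch of the AEP for a Bernoulli(0.3) source: the empirical fraction of sequences falling in \(T_n^\delta\) rises toward 1 as \(n\) grows (all parameters are illustrative choices of ours):

```python
import math, random

random.seed(0)
p = 0.3
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def in_typical_set(n, delta):
    ones = sum(random.random() < p for _ in range(n))
    # (1/n) log2 1/P(s^n) for an i.i.d. Bernoulli sequence with `ones` ones
    rate = (ones * math.log2(1 / p) + (n - ones) * math.log2(1 / (1 - p))) / n
    return abs(rate - H) <= delta

for n in (100, 1_000, 10_000):
    hits = sum(in_typical_set(n, delta=0.05) for _ in range(200))
    print(n, hits / 200)   # fraction in the typical set; tends to 1
```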