Information theory
Going through lecture series: Information Theory, Pattern Recognition, and Neural Networks by David MacKay. The textbook is available for free online.
Lecture 1: Introduction to Information Theory
Sending information over noisy channels (formulation)
- We’re using systems to send information over imperfect channels:
- Source message \(s\).
- Encoded into a transmission message \(t\).
- Sent across an imperfect channel which adds noise \(n\).
- Received on the other side as message \(r\).
- Decoded into a guess of \(s\), \(\hat{s}\).
- The encoder adds redundancy.
- The decoder infers \(n\) and \(s\).
Binary symmetric channel (BSC)
- \(s = t \in \{0, 1\}^+\)
- Noise is added by flipping each bit of \(t\) independently with probability \(f\).
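A minimal sketch of this channel in Python; the flip probability and message length below are arbitrary choices for illustration:

```python
import random

def binary_symmetric_channel(t, f):
    """Flip each bit of the transmission t independently with probability f."""
    return [bit ^ (random.random() < f) for bit in t]

t = [random.randint(0, 1) for _ in range(20)]   # transmission
r = binary_symmetric_channel(t, f=0.1)          # received message
print(sum(ti != ri for ti, ri in zip(t, r)), "bits flipped")
```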
Lectures 2-5: Entropy and Data Compression
Ensembles
- An “Ensemble” \(X\) is the triple \((x, A_x, P_x)\):
- \(x\) is a random variable.
- \(A_x\) is an alphabet \(\{a_1, a_2, ...\}\).
- \(P_x\) is a probability distribution over \(A_x\), \(\{p_1, p_2, ...\}\).
- \(\sum_i{p_i} = 1\)
Shannon information content
- The Shannon information content of an outcome \(x = a_i\) is:
- \(h(x = a_i) = \log_2\frac{1}{P(x=a_i)}\)
- Measured in bits.
Likely vs. unlikely events
- Unlikely events have a lot of information content, likely events have little.
Relationship to compressed file length
- \(h(x=a_i)\) is the shortest possible compressed length, in bits, for that outcome.
Additive
- \(h(x=a_i)\) is additive for independent random variables.
- \(h(x,y) = \log_2\frac{1}{P(x,y)}\)
- \(h(x,y) = \log_2\frac{1}{P(x)} + \log_2\frac{1}{P(y)}\) (since \(P(x,y) = P(x)P(y)\) for independent variables)
- \(h(x,y) = h(x) + h(y)\)
Entropy
- The entropy of an ensemble is the average Shannon information content.
- \(H(X) = \sum_a{P(x=a) h(x=a)}\)
- \(H(X) = \sum_a{P(x=a) \log_2(\frac{1}{P(x=a)})}\)
- Measured in bits.
Binary entropy
\[ H_2(x) = x \log_2 \frac{1}{x} + (1 - x) \log_2 \frac{1}{1 - x} \]
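A quick Python sketch of the entropy of an ensemble and the binary entropy function; the example distribution is arbitrary:

```python
import math

def entropy(probs):
    """H(X) = sum_a P(x=a) * log2(1 / P(x=a)), in bits."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

def binary_entropy(x):
    """H_2(x) = x log2(1/x) + (1 - x) log2(1/(1 - x))."""
    return entropy([x, 1 - x])

print(entropy([0.5, 0.25, 0.25]))  # 1.5 bits
print(binary_entropy(0.1))         # ~0.47 bits
```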
Source coding theorem
- \(N\) outcomes from a source \(X\) can be compressed into \(\approx NH(X)\) bits.
Typical outcomes
- Source \(X\) produces \(N\) independent outcomes \(\textbf{x}=x_1,x_2,...\).
- Assume \(P_x\) has a long tail, i.e. some outcomes are much more likely than others.
- The sequence \(\textbf{x}\) is likely to be one of \(\approx 2^{NH(X)}\) typical outcomes.
Symbol codes
- Map from symbol \(x\) to codes \(c(x)\).
- Concatenate codes without punctuation.
- \([x_1,x_2,x_3] \to c(x_1)c(x_2)c(x_3)\)
Expected length
\(L(C, X) = \sum_x{P(x)len(C(x))}\)
Kraft inequality
- If a code is uniquely decodable…
- \(\sum_i{2^{-l_i}} \leq 1\)
- When \(\sum_i{2^{-l_i}} = 1\), the code is complete.
Prove that the expected length of a message can’t be less than its entropy
Proving \(L(C, X) \geq H(X)\).
- Alphabet \(x_i\), codes \(c_i\), code lengths \(l_i\), probabilities \(p_i\).
- Say that the ideal code lengths are \(l_i^{*} = h(x_i) = \log_2\frac{1}{p_i}\)
- Given we already have \(l_i\), we can say what the “implicit” probability \(q_i\) of each \(x_i\) is:
- \(q_i = 2^{-l_i}\)
- But we need to make sure \(\sum{q_i} = 1\), so we normalize by \(z\).
- \(z = \sum{2^{-l_i}}\)
- \(z \leq 1\) for any uniquely decodable code (see Kraft inequality).
- \(z = 1\) for any complete code.
- \(q_i = \frac{2^{-l_i}}{z} = \frac{2^{-l_i}}{\sum{2^{-l_i}}}\)
- So, \(l_i = \log\frac{1}{q_i} - \log(z)\)
- \(L(C, X) = \sum_i{p_i l_i}\)
- \(L(C, X) = \sum_i{p_i (\log\frac{1}{q_i} - \log(z))}\)
- \(L(C, X) = \sum_i{p_i \log\frac{1}{q_i}} - \log(z)\)
- \(L(C, X) = \sum_i{p_i \log\frac{1}{p_i}} + \sum_i{p_i \log\frac{p_i}{q_i}} - \log(z)\)
- \(L(C, X) = H(X) + \sum_i{p_i \log\frac{p_i}{q_i}} - \log(z)\)
- \(L(C, X) = H(X) + D_{KL}(p || q) - \log(z)\)
- \(D_{KL}(p || q) = \sum_i{p_i \log\frac{p_i}{q_i}}\)
- \(D_{KL}(p || q) \geq 0\)
- \(-\log(z) \geq 0\) (since \(z \leq 1\))
- So, \(L(C, X) \geq H(X)\)
- And \(L(C, X) = H(X)\) if:
- There’s no difference between \(p\) and \(q\): \(D_{KL}(p || q) = 0\).
- You have a complete code: \(z = 1\).
- If the ideal lengths \(l_i = \log\frac{1}{p_i}\) are not integers:
- \(H \leq L < H + 1\)
Huffman algorithm
- Build a binary tree:
- Start with unconnected leaf nodes with probabilities \(p_1, p_2, ...\).
- Take the two nodes with the smallest probabilities, and join them under a new parent node whose probability is their sum.
- Add the parent node to the list of nodes (and remove its two children).
- Repeat until there’s only one node in the list: the root node.
- Read each symbol’s codeword off the path from the root to its leaf (e.g. 0 for left, 1 for right).
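A minimal sketch of this construction in Python, using `heapq` for the “take the two smallest” step; the example probabilities are arbitrary, and the exact codewords can differ between equally valid Huffman codes:

```python
import heapq
import itertools

def huffman(probs):
    """Build a Huffman code: {symbol: codeword} from {symbol: probability}."""
    counter = itertools.count()   # tie-breaker so heapq never compares trees
    heap = [(p, next(counter), symbol) for symbol, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # the two smallest nodes...
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(counter), (left, right)))  # ...merged
    code = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal (split) node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a symbol
            code[node] = prefix or "0"
    walk(heap[0][2], "")
    return code

print(huffman({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}))
# e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```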
Main issue
- Symbol codes require \(\geq 1\) bit per character, even if \(H \ll 1\).
- For example:
- \(p = [0.01, 0.99]\)
- \(c = [0, 1]\)
- \(L = 1\)
- \(H = 0.01 \log\frac{1}{0.01} + 0.99 \log\frac{1}{0.99} \approx 0.08\)
- So, \(H \leq L < H + 1 \to 0.08 \leq 1 < 1.08\) holds, but this isn’t efficient!
Arithmetic coding
- Given a string in the alphabet \(\textbf{x} = x_1, x_2, ...\)
- Given an oracle \(P(x_t | x_1, x_2, ..., x_{t-1})\).
- Picture \(\textbf{x}\) dividing the interval \([0, 1]\) repeatedly. For example:
- Alphabet \(\{a, b, c\}\).
- \(P(x_1) = \{0.5, 0.25, 0.25\}\)
- Interval of \(x_1 = a\) is \([0, 0.5]\).
- Interval of \(x_1 = b\) is \([0.5, 0.75]\).
- Interval of \(x_1 = c\) is \([0.75, 1]\).
- Further elements \(x_2, x_3, ...\) further subdivide the space.
- Take the interval \([i_{lower}, i_{higher}]\) described by \(\textbf{x}\).
- Find a binary string \(\textbf{y}\), with \(P(y_n=1) = 0.5\), whose interval under the same repeated subdivision of \([0, 1]\) lies within the interval described by \(\textbf{x}\).
- \(\textbf{y}\) is now the encoded version of \(\textbf{x}\)!
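A sketch of the interval-narrowing step in Python. For simplicity it uses the fixed distribution from the example above rather than a conditional oracle \(P(x_t | x_1, ..., x_{t-1})\), and it doesn’t emit bits as it goes (a real coder would, to avoid precision loss):

```python
def string_interval(xs, probs):
    """Return the [lower, upper) interval that the string xs describes."""
    lower, upper = 0.0, 1.0
    for x in xs:
        width = upper - lower
        offset = 0.0
        # Walk the alphabet in a fixed order, carving up the current interval.
        for symbol, p in probs.items():
            if symbol == x:
                lower, upper = lower + offset * width, lower + (offset + p) * width
                break
            offset += p
    return lower, upper

probs = {"a": 0.5, "b": 0.25, "c": 0.25}
print(string_interval("b", probs))    # (0.5, 0.75), matching the example above
print(string_interval("ba", probs))   # (0.5, 0.625)
```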
Length of encoding
- \(L(\textbf{x}) \leq \log\frac{1}{P(\textbf{x})} + 2\)
- \(L(\textbf{x}) \leq H(\textbf{x}) + 2\)
Computation
You only need to compute \(NI\) conditional probabilities, where \(N\) is the length of the string and \(I\) is the size of the alphabet.
Main advantage over Huffman algorithm
- Huffman codes use at least 1 bit per token.
- \(H(x) \leq L(x) < H(x) + 1\) where \(x\) is a single token.
- Arithmetic coding can use less than 1 bit per token.
- \(L(X) \leq H(X) + 2\) where \(X\) is the whole string.
Building a simple predictor
- Collect \(n\)-gram statistics for \(n \in [0, 6]\).
- Use these to predict the next token.
Lectures 6-8: Noisy channel coding
Entropy of joint distributions
- \(H(X, Y) = \sum_{x \in X} \sum_{y \in Y} P(x, y) h(x, y)\)
- where \(h(x, y) = \log\frac{1}{P(x, y)}\)
- \(H(X | Y=y) = \sum_{x \in X} P(x | Y=y) h(x | Y=y)\)
- where \(h(x | Y=y) = \log\frac{1}{P(x | Y=y)}\)
- \(H(X | Y) = \sum_{y \in Y} P(Y=y) H(X | Y=y)\)
- \(H(X) + H(Y) \geq H(X, Y)\)
- \(H(X) + H(Y) = H(X, Y)\) if \(X\) and \(Y\) are independent.
- \(H(X | Y) \leq H(X)\)
- \(H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y)\)
Mutual information
- \(I(X ; Y) = H(X) - H(X | Y)\)
- \(I(X ; Y) = H(Y) - H(Y | X)\)
- \(I(X ; Y) = H(X, Y) - H(X | Y) - H(Y | X)\)
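A quick numerical check of these identities in Python; the 2×2 joint distribution is arbitrary, and \(H(X|Y)\) is computed via the chain rule \(H(X|Y) = H(X,Y) - H(Y)\):

```python
import math

def H(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Arbitrary joint distribution P(x, y) over two binary variables.
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

H_XY = H(P.values())
H_X = H([P[(0, 0)] + P[(0, 1)], P[(1, 0)] + P[(1, 1)]])
H_Y = H([P[(0, 0)] + P[(1, 0)], P[(0, 1)] + P[(1, 1)]])
H_X_given_Y = H_XY - H_Y          # chain rule
H_Y_given_X = H_XY - H_X

# All three expressions for I(X;Y) agree (~0.28 bits here).
print(H_X - H_X_given_Y)
print(H_Y - H_Y_given_X)
print(H_XY - H_X_given_Y - H_Y_given_X)
```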
Formalizing channels as matrices
- We formalize channels as a matrix \(Q\), where:
- \(Q_{i, j} = P(Y = b_j | X = a_i)\)
- \(a\) is the input alphabet, indexed by \(i\).
- \(b\) is the output alphabet, indexed by \(j\).
- We can find \(P(X | Y)\) using Bayes.
Capacity of a channel
- \(C(Q) = \max_{P_x} I(X ; Y)\)
- i.e. the maximum mutual information given we can choose the input probability distribution.
Optimal input distribution
- The \(P_x\) that achieves the maximum in \(C(Q) = \max_{P_x} I(X ; Y)\).
For BSC
- Optimal input distribution is \(P(x_i = 1) = 0.5\)
- Proof:
- \(p_1 = P(x_i = 1)\)
- \(I(X ; Y) = H(Y) - H(Y | X)\)
- \(I(X ; Y) = H_2(p_1 (1 - f) + (1 - p_1) f) - H_2(f)\)
- \(\max_{p_1} I(X ; Y) =\)
- \(\max_{p_1} H_2(p_1 (1 - f) + (1 - p_1) f) - H_2(f) =\)
- \(\max_{p_1} H_2(p_1 (1 - f) + (1 - p_1) f)\)
- The maximum value \(H_2\) can take is \(H_2(0.5) = 1\), so set its argument to \(0.5\):
- \(0.5 = p_1 (1 - f) + (1 - p_1) f\)
- \(0.5 = p_1 - p_1 f + f - p_1 f\)
- \(0.5 = p_1 (1 - 2 f) + f\)
- \(p_1 = \frac{0.5 - f}{1 - 2 f} = 0.5\)
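A quick numerical check of this in Python, sweeping \(p_1\) on a grid for an arbitrary \(f\):

```python
import math

def H2(x):
    return sum(p * math.log2(1 / p) for p in (x, 1 - x) if p > 0)

f = 0.1   # arbitrary flip probability
# I(X;Y) = H_2(p1(1-f) + (1-p1)f) - H_2(f), swept over a grid of p1 values.
best_p1 = max((k / 100 for k in range(101)),
              key=lambda p1: H2(p1 * (1 - f) + (1 - p1) * f) - H2(f))
print(best_p1)        # 0.5
print(1 - H2(f))      # the capacity, ~0.53 bits per channel use
```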
Shannon’s noisy-channel coding theorem
- For any \(\epsilon > 0\), \(R < C\), and large enough \(N\)…
- There exists a code with length \(N\) and rate \(R\)…
- Such that the probability of a block error is \(< \epsilon\).
Parity check matrix
Matrix definition
We transfer 4 bits, \(t_1, t_2, t_3, t_4\).
We append 3 parity bits:
- \(t_5 = t_1 + t_2 + t_3\)
- \(t_6 = t_2 + t_3 + t_4\)
- \(t_7 = t_1 + t_3 + t_4\)
We encode the relationships within \(\textbf{t}\) as a matrix \(H\), where \(H_{i, j}\) indicates whether bit \(t_j\) was included in the \(i\)-th parity calculation (i.e. in \(t_{4+i}\)).
\[ H = \begin{pmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0\\ 0 & 1 & 1 & 1 & 0 & 1 & 0\\ 1 & 0 & 1 & 1 & 0 & 0 & 1\\ \end{pmatrix} \]
- Note that the last three columns make up an identity matrix.
Encoding & decoding
- \(\textbf{t}\) is encoded such that \(H\textbf{t} = [0, 0, 0]^T = \textbf{0}\) (mod 2).
- Noise is added, so the received signal is \(\textbf{r} = \textbf{t} + \textbf{n}\).
- Syndrome \(\textbf{z} = H\textbf{r} = H(\textbf{t} + \textbf{n}) = H\textbf{t} + H\textbf{n} = H\textbf{n}\)
- The column of \(H\) that equals \(\textbf{z}\) identifies which bit was flipped (assuming at most one bit was flipped).
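A sketch of this (7, 4) code in Python with numpy; the source bits and the position of the flipped bit are arbitrary choices:

```python
import numpy as np

H = np.array([[1, 1, 1, 0, 1, 0, 0],
              [0, 1, 1, 1, 0, 1, 0],
              [1, 0, 1, 1, 0, 0, 1]])

def encode(source):
    """Append the three parity bits t5, t6, t7 to the 4 source bits."""
    t1, t2, t3, t4 = source
    return np.array([t1, t2, t3, t4,
                     (t1 + t2 + t3) % 2,
                     (t2 + t3 + t4) % 2,
                     (t1 + t3 + t4) % 2])

t = encode([1, 0, 1, 1])             # arbitrary source bits
assert not (H @ t % 2).any()         # Ht = 0 (mod 2)

n = np.zeros(7, dtype=int)
n[2] = 1                             # flip bit 3 (arbitrary)
r = (t + n) % 2
z = H @ r % 2                        # syndrome = Hn

flipped = next(j for j in range(7) if (H[:, j] == z).all())
r[flipped] ^= 1                      # correct the identified bit
assert (r == t).all()
print("decoded source bits:", r[:4])
```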
Generalizing beyond (7, 4)
- Above, \(H\) is a \((3 \times 7)\) matrix, and \(4\) source bits are sent for every \(7\) transmitted bits.
- Generally, \(H\) is an \(M \times N\) matrix, and \(N - M\) source bits are sent for every \(N\) transmitted bits.
- Thus, the rate is \(\frac{N - M}{N}\) or \(1 - \frac{M}{N}\).
Proving the theorem for BSC
- Say we have:
- A BSC with probability \(f\) of flipping, with a capacity of \(1 - H_2(f)\)
- A parity check matrix \(H \in \{0, 1\}^{M \times N}\) where \(N > M\).
- A source that has \(N - M\) bits, and a noise vector \(n\) that has \(N\) bits.
- A transmission that has \(N\) bits.
- A syndrome that has \(M\) bits.
- A “typical set syndrome decoder”: a lookup table mapping each \(M\)-bit syndrome to a typical \(N\)-bit noise vector, which is then used to recover the transmission.
- This has \(\approx 2^{NH_2(f)}\) entries (one per typical noise vector).
- Transmissions have a rate \(R = 1 - \frac{M}{N}\).
- We want to show that the error probability vanishes whenever \(R < C\), i.e. \(1 - \frac{M}{N} < 1 - H_2(f)\).
- \(P(\text{error}) = P_1 + P_2\) where:
- \(P_1\) is the probability of \(n\) not being in the decoder’s table.
- This is arbitrarily small, as the table contains the typical set.
- \(P_2\) is the probability of there being a \(\hat{n} \neq n\) that has the same syndrome \(z\):
- \(Hn = H\hat{n}\)
- \(H(n - \hat{n}) = 0\)
- We’ll focus on the constraints for \(P_2\).
- \(P_2 = \sum_n P(n) 1(\exists \hat{n}. \hat{n} \neq n, H(n - \hat{n}) = 0)\)
- (the indicator that at least one such \(\hat{n}\) exists is at most the number of such \(\hat{n}\))
- \(P_2 \leq \sum_n P(n) \sum_{\hat{n} \neq n} 1(H(n - \hat{n}) = 0)\)
- (averaging over a random \(H\) is simpler than analysing a specific \(H\))
- \(P_2 \leq \sum_n P(n) \sum_{\hat{n} \neq n} \sum_{H} P(H) 1(H(n - \hat{n}) = 0)\)
- Aside: \(P(H(n - \hat{n}) = 0 | n - \hat{n} \neq 0) = 2^{-M}\), since each of the \(M\) parity checks is satisfied with probability \(\frac{1}{2}\) under a random \(H\).
- \(P_2 \leq \sum_n P(n) \sum_{\hat{n} \neq n} 2^{-M}\)
- \(P_2 \leq \sum_{\hat{n} \neq n} 2^{-M}\)
- \(P_2 \leq 2^{NH_2(f)} \cdot 2^{-M}\) (there are \(\approx 2^{NH_2(f)}\) typical \(\hat{n}\))
- \(P_2 \leq 2^{NH_2(f)- M}\)
- So, \(P_2\) vanishes as \(N\) grows whenever:
- \(M > NH_2(f)\)
- \(\frac{M}{N} > H_2(f)\)
- \(1 - \frac{M}{N} < 1 - H_2(f)\)
- \(R < C\)!!!
Entropy when probabilities are certain
\[ H_2(0) = 0 \cdot \log\frac{1}{0} + 1 \cdot \log\frac{1}{1} = 0 \]
\[ H_2(1) = 1 \cdot \log\frac{1}{1} + 0 \cdot \log\frac{1}{0} = 0 \]
using the convention \(0 \log \frac{1}{0} = 0\), since \(x \log \frac{1}{x} \to 0\) as \(x \to 0\).
Feedback gem
The encoder knowing what noise will be added doesn’t improve the achievable rate (feedback doesn’t increase capacity).
Lectures 9-10: Introduction to Bayesian inference
No notes, covers the basics.
Lectures 11-14: Approximating Probability Distributions
K-means
Clustering algorithm. Given \(\{x_n\}\) data points to put into \(K\) clusters:
- Assign cluster means \(\{\mu_k\}\) randomly.
- Assign each data point to the closest cluster: \(r^n_k = 1\) if \(k = \arg\min_{k'} (x_n - \mu_{k'})^2\), else \(0\).
- Update each cluster mean to the mean of its assigned data points: \(\mu_k = \sum_n{r^n_k x_n} / \sum_n{r^n_k}\)
- Repeat the assignment and update steps until the assignments stop changing.
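A minimal numpy sketch of the loop above; the two 2-D blobs and \(K=2\) are arbitrary choices, and it simply runs a fixed number of iterations rather than checking for convergence:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, K, n_iters=20):
    mu = X[rng.choice(len(X), K, replace=False)]    # random initial means
    for _ in range(n_iters):
        # Assignment step: index of the closest mean for each point.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each mean becomes the average of its assigned points.
        # (A fuller implementation would also handle empty clusters.)
        mu = np.array([X[assign == k].mean(axis=0) for k in range(K)])
    return mu, assign

# Two arbitrary 2-D blobs.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
means, assignments = kmeans(X, K=2)
print(means)
```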
Bayesian interpretation
K-means is a Maximum A Posteriori (MAP) algorithm which assumes:
- There are \(K\) clusters.
- The probability of being in each cluster is equal.
- Each cluster has the same variance.
These assumptions might not hold for the data. Additionally, the hard assignment boundary can push the estimated centers of overlapping clusters apart, giving incorrect means.
Soft K-means
Cluster responsibilities are no longer binary. Introduces an additional \(\beta\) hyperparameter. Update rule is the same.
\[ r^n_k = \frac{\exp(-\beta d(x_n, \mu_k))}{\sum_{k'} \exp(-\beta d(x_n, \mu_{k'}))} \]
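A sketch of the soft assignment step in numpy; the data points, means, and \(\beta\) are arbitrary, and \(d\) is taken to be squared Euclidean distance:

```python
import numpy as np

def soft_responsibilities(X, mu, beta):
    """r[n, k] proportional to exp(-beta * d(x_n, mu_k)), normalized over k."""
    d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # squared distances
    r = np.exp(-beta * d)
    return r / r.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])            # arbitrary points
mu = np.array([[0.0, 0.0], [5.0, 5.0]])                       # arbitrary means
print(soft_responsibilities(X, mu, beta=1.0))
```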
Monte Carlo methods
The problem
We’re given:
- \(P(x) = \frac{P^*(x)}{Z}\), some probability distribution that we can’t sample from (we can evaluate \(P^*\), but we don’t know \(Z\)).
- \(Q(x)\), some simple probability distribution (e.g. uniform, gaussian) that we can sample from.
- \(\phi(x)\), some function of interest that we want to run over \(P\).
We want to either be able to draw samples from \(P\), or to calculate the expectation of \(\phi\) under \(P\): \(\Phi = \sum_x P(x) \phi(x)\).
MC importance sampling
\[ \Phi = \frac{\sum_n \frac{P^*(x_n)}{Q(x_n)} \phi(x_n)}{\sum_n \frac{P^*(x_n)}{Q(x_n)}} \]
where \(\{x_n\}_N\) are \(N\) samples drawn from \(Q\).
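A numpy sketch of this estimator, with an arbitrary unnormalized target \(P^*\) (a narrow Gaussian), a wider Gaussian \(Q\), and \(\phi(x) = x^2\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary choices for illustration:
P_star = lambda x: np.exp(-0.5 * ((x - 2) / 0.5) ** 2)   # unnormalized target
phi = lambda x: x ** 2                                    # function of interest
Q_pdf = lambda x: np.exp(-0.5 * (x / 3) ** 2) / (3 * np.sqrt(2 * np.pi))  # N(0, 3^2)

x = rng.normal(0, 3, size=100_000)       # samples from Q
w = P_star(x) / Q_pdf(x)                 # importance weights
Phi = (w * phi(x)).sum() / w.sum()       # self-normalized estimate of E_P[phi]
print(Phi)                               # ~4.25 = 2^2 + 0.5^2 under the target
```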
Choice of Q
If \(Q\) has little overlap with \(P\), we’ll mostly be sampling from the wrong part of the distribution. Although we’ll still eventually converge, this can take a while, and it’s hard to tell whether we’ve converged.
Rejection sampling
Given a \(Q\) and constant \(c\) such that:
\[ \forall x. cQ(x) \ge P^*(x) \]
- Sample \(x \sim Q\).
- Sample \(u \sim \text{Uniform}(0, cQ(x))\).
- Store \(x\) if \(u < P^*(x)\), reject otherwise.
This provides \(\{x\}\) that are sampled from \(P\).
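A sketch in numpy with an arbitrary bimodal \(P^*\), a uniform \(Q\) on \([-5, 5]\), and a hand-picked \(c\) that satisfies \(cQ(x) \ge P^*(x)\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary choices for illustration:
P_star = lambda x: np.exp(-0.5 * (x - 2) ** 2) + np.exp(-0.5 * (x + 2) ** 2)
Q_pdf = 1 / 10.0     # uniform density on [-5, 5]
c = 15.0             # chosen so that c * Q(x) >= P_star(x) everywhere (max P* is ~1)

samples = []
while len(samples) < 10_000:
    x = rng.uniform(-5, 5)              # sample x ~ Q
    u = rng.uniform(0, c * Q_pdf)       # sample u ~ Uniform(0, cQ(x))
    if u < P_star(x):                   # accept, otherwise reject
        samples.append(x)

print(np.mean(samples))                 # ~0, since P* is symmetric about 0
```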
Markov Chain Monte Carlo (MCMC) methods
Unlike the Monte Carlo methods above, our samples are no longer independent. We draw from \(Q(x; x')\), a proposal that depends on the current state \(x'\), rather than a fixed \(Q(x)\).
This works well even when \(Q\) and \(P\) are very different.
Metropolis sampling
- Sample \(x \sim Q(x; x_t)\).
- Calculate \(a = \frac{P^*(x) Q(x_t; x)}{P^*(x_t) Q(x; x_t)}\).
- If \(a > 1\), accept; otherwise accept with probability \(a\).
- Accept means \(x_{t+1} = x\).
- Reject means \(x_{t+1} = x_t\).
Asymptotically, \(\{x\}\) is sampled from \(P\), but they won’t be independent.
Gaussian Q
\(Q(x; x_t) = \mathcal{N}(x_t, \sigma)\). By symmetry, the \(Q\)s in \(a\) cancel out.
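A sketch of Metropolis sampling with this Gaussian proposal; the target \(P^*\), step size \(\sigma\), and chain length are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary bimodal unnormalized target and proposal width:
P_star = lambda x: np.exp(-0.5 * (x - 2) ** 2) + np.exp(-0.5 * (x + 2) ** 2)
sigma = 1.0

x = 0.0
chain = []
for _ in range(50_000):
    x_new = rng.normal(x, sigma)        # sample from Q(x'; x_t) = N(x_t, sigma)
    a = P_star(x_new) / P_star(x)       # symmetric Q, so the Q terms cancel
    if a >= 1 or rng.uniform() < a:     # accept
        x = x_new
    chain.append(x)                     # on rejection, x_{t+1} = x_t

print(np.mean(chain))                   # ~0 by symmetry (samples are correlated)
```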
Gibbs sampling
Assume we can sample from \(P^*(x_n | x_{!n})\), where \(x_{!n}\) is all elements of the hypothesis apart from \(x_n\).
We do Metropolis sampling, but cycle through the variables of the hypothesis, proposing each one from its conditional distribution (so the acceptance probability is always 1):
- \(x_{1, t+1} \sim P^*(x_1 | x_{2, t}, x_{3, t}, ...)\)
- \(x_{2, t+1} \sim P^*(x_2 | x_{1, t+1}, x_{3, t}, ...)\)
- …
Picking step sizes in Metropolis methods
\[ \epsilon = l \]
- We have \(k\) dimensions, of which \(\gamma\) have small acceptable width \(l\), and \(k - \gamma\) have large acceptable width \(L\).
- How do we define the step size \(\epsilon\) when exploring the space?
- We want a large step size: it takes \(T \approx (\frac{L}{\epsilon})^2\) random-walk steps to explore the space, so we want to maximise \(\epsilon\).
- We want a small step size: Acceptance rate is \((\frac{l}{\epsilon})^\gamma\), so if \(\epsilon \gg l\) then the acceptance rate falls to zero.
- Thus, the middle ground is \(\epsilon = l\) with an acceptance rate of 50%.
Hamiltonian Monte Carlo
Uses momentum when selecting \(Q(x' ; x)\)
Slice sampling
- Evaluate \(y = P^*(x)\) at the current point \(x\).
- Sample a height uniformly: \(z \sim [0, y]\).
- “Step out” from \(x\) in steps of size \(w\) until \(P^*(x + i w) < z\).
- Start with a random offset drawn from \([-w/2, w/2]\).
- Do this in both directions, yielding \(x_l, x_r\).
- Sample uniformly: \(x' \sim [x_l, x_r]\).
- If \(P^*(x') > z\), accept \(x'\).
- If \(P^*(x') < z\), sample again from \([x_l, x_r]\).
- But this time shrink the interval so it doesn’t include points “beyond” \(x'\) w.r.t. the starting point \(x\) (i.e. move the endpoint on \(x'\)’s side to \(x'\)).
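A sketch of one slice-sampling update in numpy, following the steps above (using the common variant of placing the initial width-\(w\) interval at a uniformly random position around \(x\)); the target \(P^*\) is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

P_star = lambda x: np.exp(-0.5 * (x - 2) ** 2) + np.exp(-0.5 * (x + 2) ** 2)  # arbitrary

def slice_sample_step(x, w=1.0):
    z = rng.uniform(0, P_star(x))          # height under the curve
    # Step out: place a width-w interval at a random position around x, then
    # extend each end in steps of w until it is below the slice height z.
    x_l = x - rng.uniform(0, w)
    x_r = x_l + w
    while P_star(x_l) > z:
        x_l -= w
    while P_star(x_r) > z:
        x_r += w
    # Sample within [x_l, x_r], shrinking the interval on each rejection.
    while True:
        x_new = rng.uniform(x_l, x_r)
        if P_star(x_new) > z:
            return x_new
        if x_new < x:
            x_l = x_new                    # don't keep points beyond x_new
        else:
            x_r = x_new

x, chain = 0.0, []
for _ in range(20_000):
    x = slice_sample_step(x)
    chain.append(x)
print(np.mean(chain))                      # ~0 by symmetry
```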
Sensitivity of w
Insensitive: if \(w\) is too small, we need more stepping-out steps, but their cost only grows linearly. If it’s too big, we have more rejection steps, but each rejection also trims down the sampling interval \([x_l, x_r]\).
Exact sampling
Sampling method where you can be certain you’ve drawn an exact sample from the target distribution.
- We want a sample at \(t=0\).
- Run two chains starting from states as far apart in sample space as possible, starting at \(t=-N\).
- Use the same RNG (the same sequence of random numbers) for both chains.
- If they converge to the same state, then you know that all starting states would have converged to it.
- Thus, starting from \(t<-N\) would not tell us anything more about the state.
- If they don’t converge, start again at \(t=-2N\).
- Run the samples until \(t=0\), and you have a random sample.
Variational methods
Core idea
We’re interested in \(P(x) = \frac{1}{Z}P^*(x) = \frac{1}{Z}e^{-E(x)}\), but we can’t sample from \(P(x)\); we can only evaluate the energy \(E(x)\).
So we approximate \(P(x)\) by a simpler distribution \(Q(x; \theta)\) and adjust \(\theta\) to get the best approximation.
KL divergence
\[ D_{KL}(Q || P) = \sum_x Q(x) \log \frac{Q(x)}{P(x)} \]
Effect of order of parameters
Zero probabilities
\(D_{KL}(Q || P)\) punishes when the approximate distribution puts mass where the true distribution is zero.
\(D_{KL}(P || Q)\) punishes when the approximate distribution puts zero mass where the true distribution is non-zero.
Computability
\(D_{KL}(A || B)\) is an expectation under \(A\). Since we can’t sample from \(P\), we are effectively forced into using \(D_{KL}(Q || P)\).
Variational free energy
We want to compute \(D_{KL}(Q || P)\). \(P(x)\) is not computable (we don’t know \(Z\)), but \(E(x)\) is. Thus we rewrite \(P\) in terms of \(E\) and drop the \(\log Z\) term, which is constant and so doesn’t affect the optimization. The result is the variational free energy.
\[
\begin{aligned}
D_{KL}(Q || P) &= \sum_x Q(x) \log \frac{Q(x)}{P(x)} \\
&= \sum_x Q(x) (\log Q(x) - \log P(x)) \\
&= \sum_x Q(x) \left(\log Q(x) - \log\left(\frac{1}{Z} e^{-E(x)}\right)\right) \\
&= \sum_x Q(x) \left(\log Q(x) - \log \frac{1}{Z} + E(x)\right) \\
&= \sum_x Q(x) E(x) + \sum_x Q(x) \log Q(x) + \log Z \\
&= \sum_x Q(x) E(x) - H_Q(x) + \log Z
\end{aligned}
\]
Dropping the constant \(\log Z\) gives the variational free energy:
\[ \sum_x Q(x) E(x) - H_Q(x) \]
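A toy numerical check in numpy: for a small discrete \(x\) with arbitrary energies, brute-force minimizing the free energy over candidate \(Q\)s recovers \(Q \approx P\), and the minimum is close to \(-\log Z\):

```python
import numpy as np

# Toy discrete example: x takes 3 values, with arbitrary energies E(x).
E = np.array([0.0, 1.0, 3.0])
Z = np.exp(-E).sum()
P = np.exp(-E) / Z                          # the true P(x) = e^{-E(x)} / Z

def free_energy(Q):
    """sum_x Q(x)E(x) - H_Q(x); equals D_KL(Q || P) - log Z (natural log)."""
    return (Q * E).sum() + (Q * np.log(Q)).sum()

# Brute-force search over candidate Qs on a grid.
grid = [np.array([a, b, 1 - a - b])
        for a in np.arange(0.01, 0.99, 0.01)
        for b in np.arange(0.01, 0.99, 0.01) if a + b < 0.99]
Q_best = min(grid, key=free_energy)
print(Q_best, P)                            # Q_best is close to P
print(free_energy(Q_best), -np.log(Z))      # minimum is close to -log Z
```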
Lectures 15-16: Data Modelling With Neural Networks
Capacity of a single neuron
Two bits per weight: a neuron with \(k\) inputs can memorize the labels of \(\approx 2k\) random input points.
Effect of input/bias/output weight standard deviations
- Input weights: a higher standard deviation gives more curves, with shorter time scales.
- Bias weights: a higher standard deviation gives more curves, at the same time scales.
- Output weights: a higher standard deviation gives larger-magnitude outputs.
Effect of regularization weight
If your cost function has a regularising term \(\alpha L_2\), then this is equivalent to having a prior on your weights being drawn from \(\mathcal{N}(0, \frac{1}{\alpha})\).