# Information theory

Going through lecture series: Information Theory, Pattern Recognition, and Neural Networks by David MacKay. The textbook is available for free online.

## Lecture 1: Introduction to Information Theory

### Sending information over noisy channels (formulation)

- We’re using systems to send information over imperfect channels:
- Source message \(s\).
- Encoded into a transmission message \(t\).
- Sent across an imperfect channel which adds noise \(n\).
- Received on the other side as message \(r\).
- Decoded into a guess of \(s\), \(\hat{s}\).

- The encoder adds redundancy.
- The decoder infers \(n\) and \(s\).

#### Binary symmetric channel (BSC)

- \(s = t \in \{0, 1\}^+\)
- Noise is added by flipping each token in \(s\) with probability \(f\).
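A minimal sketch of the channel described above, assuming the message is a list of `0`/`1` integers. The function name `binary_symmetric_channel` and the `rng` parameter are illustrative choices, not something from the lectures.

```python
import random

def binary_symmetric_channel(t, f, rng=random):
    """Flip each bit of t independently with probability f."""
    # (rng.random() < f) is True with probability f; XOR with True flips the bit.
    return [bit ^ (rng.random() < f) for bit in t]

t = [0, 1, 1, 0]
r = binary_symmetric_channel(t, f=0.1)  # received message, usually close to t
```

With `f = 0` the message arrives untouched, and with `f = 1` every bit is flipped.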

## Lectures 2-5: Entropy and Data Compression

### Ensembles

- An “Ensemble” \(X\) is \((x, A_x, P_x)\)
- \(x\) is a random variable.
- \(A_x\) is an alphabet \(\{a_1, a_2, ...\}\).
- \(P_x\) is a probability distribution over \(A_x\), \(\{p_1, p_2, ...\}\).
- \(\sum{P_x} = 1\)

### Shannon information content

- The Shannon information content of an outcome \(x = a_i\) is:
- \(h(x = a_i) = log_2(\frac{1}{P(x=a_i)})\)
- Measured in bits.

#### Likely vs. unlikely events

- Unlikely events have a lot of information content, likely events have little.

#### Relationship to compressed file length

- \(h(x=a_i)\) is the best possible compressed length, in bits, for the outcome \(a_i\).

#### Additive

- \(h(x=a_i)\) is additive for independent random variables.
- \(h(x,y) = log_2(\frac{1}{P(x,y)})\)
- \(h(x,y) = log_2(\frac{1}{P(x)}) + log_2(\frac{1}{P(y)})\)
- \(h(x,y) = h(x) + h(y)\)

### Entropy

- The **entropy** of an ensemble is the average Shannon information content.
- \(H(X) = \sum_a{P(x=a) h(x=a)}\)
- \(H(X) = \sum_a{P(x=a) \log_2(\frac{1}{P(x=a)})}\)
- Measured in bits.
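The two definitions can be sketched directly in Python. The function names are illustrative, and zero-probability outcomes are skipped on the assumption that they contribute nothing to the sum.

```python
from math import log2

def information_content(p):
    """Shannon information content, in bits, of an outcome with probability p."""
    return log2(1 / p)

def entropy(probs):
    """Average information content of an ensemble, in bits."""
    return sum(p * information_content(p) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])      # 1 bit
biased_coin = entropy([0.01, 0.99])  # roughly 0.08 bits
```

The biased coin shows the "likely vs. unlikely" point numerically: the rare outcome carries about 6.6 bits but occurs so seldom that the average stays small.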

#### Binary entropy

\[ H_2(x) = x \log_2 \frac{1}{x} + (1 - x) \log_2 \frac{1}{1 - x} \]

### Source coding theorem

- \(N\) outcomes from a source \(X\) can be compressed into \(\approx NH(X)\) bits.

### Typical outcomes

- Source \(X\) produces \(N\) independent outcomes \(\textbf{x}=x_1,x_2,...\).
- Assume \(P_x\) has a long tail, i.e. some outcomes are much more likely than others.
- \(\textbf{x}\) is likely to be one of \(\approx 2^{NH(X)}\) **typical** outcomes.

### Symbol codes

- Map from symbol \(x\) to codes \(c(x)\).
- Concatenate codes without punctuation.
- \([x_1,x_2,x_3] \to c(x_1)c(x_2)c(x_3)\)

#### Expected length

\(L(C, X) = \sum_x{P(x) \, \text{len}(c(x))}\)

### Kraft inequality

- If a code is uniquely decodable…
- \(\sum_i{2^{-l_i}} \leq 1\), where \(l_i\) are the codeword lengths.
- When \(\sum_i{2^{-l_i}} = 1\), the code is **complete**.
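As a quick check of the inequality, a small helper (the name `kraft_sum` is illustrative) can evaluate the sum for a list of codeword lengths:

```python
def kraft_sum(lengths):
    """Sum of 2^(-l) over the codeword lengths l."""
    return sum(2 ** -l for l in lengths)

complete = kraft_sum([1, 2, 3, 3])  # e.g. codewords 0, 10, 110, 111
impossible = kraft_sum([1, 1, 1])   # three 1-bit codewords cannot be uniquely decodable
```

The first set of lengths sums to exactly 1 (a complete code), while the second exceeds 1, so no uniquely decodable binary code has those lengths.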

### Prove the expected length of a code can’t be less than the entropy

Proving \(L(C, X) \geq H(X)\).

- Alphabet \(x_i\), codes \(c_i\), code lengths \(l_i\), probabilities \(p_i\).
- Say that the ideal code lengths are \(l_i^{*} = h(x_i) = log_2(\frac{1}{p_i})\)
- Given we already have \(l_i\), we can say what the “implicit” probability \(q_i\) of each \(x_i\) is:
- \(q_i = 2^{-l_i}\)
- But we need to make sure \(\sum{q_i} = 1\), so we normalize by \(z\).
- \(z = \sum{2^{-l_i}}\)
- \(z \leq 1\) for any uniquely decodable code (see Kraft inequality).
- \(z = 1\) for any complete code.

- \(q_i = \frac{2^{-l_i}}{z} = \frac{2^{-l_i}}{\sum{2^{-l_i}}}\)
- So, \(l_i = log(\frac{1}{q_i}) - log(z)\)

- \(L(C, X) = \sum_i{p_i l_i}\)
- \(L(C, X) = \sum_i{p_i (log(\frac{1}{q_i}) - log(z))}\)
- \(L(C, X) = \sum_i{p_i log(\frac{1}{q_i})} - log(z)\)
- \(L(C, X) = \sum_i{p_i log(\frac{1}{p_i})} + \sum_i{p_i log(\frac{p_i}{q_i})} - log(z)\)
- \(L(C, X) = H(X) + \sum_i{p_i log(\frac{p_i}{q_i})} - log(z)\)
- \(L(C, X) = H(X) + D_{KL}(p || q) - log(z)\)
- \(D_{KL}(p || q) = \sum_i{p_i log(\frac{p_i}{q_i})}\)
- \(D_{KL}(p || q) \geq 0\)

- \(-log(z) \geq 0\)
- So, \(L(C, X) \geq H(X)\)
- And \(L(C, X) = H(X)\) if:
- There’s no difference between \(p\) and \(q\): \(D_{KL}(p || q) = 0\).
- You have a complete code: \(z = 1\).

- If ideal lengths \(l_i^{*} = log(\frac{1}{p_i})\) are not integers, rounding each up to \(\lceil l_i^{*} \rceil\) gives:
- \(H \leq L < H + 1\)

### Huffman algorithm

- Making a binary tree.
- Given *unconnected* leaf nodes with values \(p_1, p_2, ...\).
- Take the two smallest nodes, and make a split node whose value is their sum.
- Replace those two nodes with the split node in the list of nodes.
- Repeat until there’s only one node in the list, the root node.
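The merging procedure can be sketched with a heap. This is one possible implementation, not the lecture's: `huffman_code` is an illustrative name, and instead of building an explicit tree it carries the partial codewords of each subtree, prepending `0` or `1` at every merge.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Return {symbol: codeword} for a dict {symbol: probability}."""
    tiebreak = count()  # unique index so the heap never compares the dict payloads
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Take the two least probable nodes...
        p0, _, codes0 = heapq.heappop(heap)
        p1, _, codes1 = heapq.heappop(heap)
        # ...and merge them into a split node, extending every codeword below it.
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(probs)
expected_length = sum(probs[s] * len(code[s]) for s in probs)
```

For this dyadic distribution the codeword lengths come out as 1, 2, 3, 3, so the expected length equals the entropy.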

#### Main issue

- Symbol codes require \(\geq 1\) bit per character, even if \(H \ll 1\).
- For example:
- \(p = [0.01, 0.99]\)
- \(c = [0, 1]\)
- \(L = 1\)
- \(H = 0.01 log(1 / 0.01) + 0.99 log(1 / 0.99) = 0.08\)
- So, \(H \leq L < H + 1 \to 0.08 \leq 1 < 1.08\) holds, but this isn’t efficient!

### Arithmetic coding

- Given a string in the alphabet \(\textbf{x} = x_1, x_2, ...\)
- Given an oracle \(P(x_t | x_1, x_2, ..., x_{t-1})\).
- Picture \(\textbf{x}\) dividing the interval \([0, 1]\) repeatedly. For example:
- Alphabet \(\{a, b, c\}\).
- \(P(x_1) = \{0.5, 0.25, 0.25\}\)
- Interval of \(x_1 = a\) is \([0, 0.5]\).
- Interval of \(x_1 = b\) is \([0.5, 0.75]\).
- Interval of \(x_1 = c\) is \([0.75, 1]\).
- Further elements \(x_2, x_3, ...\) further subdivide the space.

- Take the interval \([i_{lower}, i_{higher}]\) described by \(\textbf{x}\).
- Find a binary string \(\textbf{y}\) with \(P(y_n=1) = 0.5\) that subdivides to the same interval.
- \(\textbf{y}\) is now the encoded version of \(\textbf{x}\)!
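The interval-narrowing step can be sketched as follows. This is a simplification: it assumes a static, context-free model (the same symbol probabilities at every position) rather than the conditional oracle above, and the name `interval` is illustrative. Exact fractions avoid rounding issues.

```python
from fractions import Fraction

def interval(string, probs):
    """Narrow [0, 1] to the sub-interval identified by `string`.

    probs: dict symbol -> probability, in alphabet order (static model).
    """
    low, width = Fraction(0), Fraction(1)
    for sym in string:
        # Cumulative probability of all symbols that come before sym.
        cum = Fraction(0)
        for s, p in probs.items():
            if s == sym:
                break
            cum += p
        low += width * cum
        width *= probs[sym]
    return low, low + width

probs = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
low, high = interval("ab", probs)  # the sub-interval of a's interval assigned to b
```

The final width equals \(P(\textbf{x})\), which is why roughly \(log \frac{1}{P(\textbf{x})}\) bits are enough to name a binary interval inside it.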

#### Length of encoding

- \(L(\textbf{x}) \leq log \frac{1}{P(\textbf{x})} + 2\)
- \(L(\textbf{x}) \leq H(\textbf{x}) + 2\)

#### Computation

You only need to compute \(NI\) conditional probabilities, where \(N\) is the length of the string and \(I\) is the size of the alphabet.

#### Main advantage over Huffman algorithm

- Huffman codes use at least 1 bit per token.
- \(H(x) \leq L(x) < H(x) + 1\) where \(x\) is a single token.

- Arithmetic coding can use less than 1 bit per token.
- \(L(X) \leq H(X) + 2\) where \(X\) is the whole string.

#### Building simple predictor

- Take \([0, 6]\)-gram statistics.
- Use these to predict the next token.

## Lectures 6-8: Noisy channel coding

### Entropy of joint distributions

- \(H(X, Y) = \sum_{x \in X} \sum_{y \in Y} P(x, y) h(x, y)\)
- where \(h(x, y) = log \frac{1}{P(x, y)}\)

- \(H(X | Y=y) = \sum_{x \in X} P(x | Y=y) h(x | Y=y)\)
- where \(h(x | Y=y) = log \frac{1}{P(x | Y=y)}\)

- \(H(X | Y) = \sum_{y \in Y} P(Y=y) H(X | Y=y)\)
- \(H(X) + H(Y) \geq H(X, Y)\)
- \(H(X) + H(Y) = H(X, Y)\) if \(X\) and \(Y\) are independent.
- \(H(X | Y) \leq H(X)\)
- \(H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y)\)

### Mutual information

- \(I(X ; Y) = H(X) - H(X | Y)\)
- \(I(X ; Y) = H(Y) - H(Y | X)\)
- \(I(X ; Y) = H(X, Y) - H(X | Y) - H(Y | X)\)
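A sketch for computing mutual information from a joint distribution table. It uses the equivalent form \(I(X ; Y) = \sum_{x, y} P(x, y) \log_2 \frac{P(x, y)}{P(x) P(y)}\), which follows from the identities above; the function name and the nested-list layout are assumptions made for illustration.

```python
from math import log2

def mutual_information(joint):
    """joint[i][j] = P(X = a_i, Y = b_j); returns I(X; Y) in bits."""
    px = [sum(row) for row in joint]        # marginal over Y
    py = [sum(col) for col in zip(*joint)]  # marginal over X
    return sum(
        p * log2(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )

independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])
identical = mutual_information([[0.5, 0.0], [0.0, 0.5]])
```

Independent variables share no information, while two perfectly correlated fair bits share one full bit.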

### Formalizing channels as matrices

- We formalize channels as a matrix \(Q\), where:
- \(Q_{i, j} = P(Y = b_j | X = a_i)\)
- \(a\) is the input alphabet, indexed by \(i\).
- \(b\) is the output alphabet, indexed by \(j\).

- We can find \(P(X | Y)\) using Bayes.

### Capacity of a channel

- \(C(Q) = \max_{P_x} I(X ; Y)\)
- i.e. the maximum mutual information given we can choose the input probability distribution.

#### Optimal input distribution

- \(P_x\) chosen in \(C(Q) = \max_{P_x} I(X ; Y)\)

##### For BSC

- Optimal input distribution is \(P(x_i = 1) = 0.5\)
- Proof:
- \(p_1 = P(x_i = 1)\)
- \(I(X ; Y) = H(Y) - H(Y | X)\)
- \(I(X ; Y) = H_2(p_1 (1 - f) + (1 - p_1) f) - H_2(f)\)
- \(\max_{p_1} I(X ; Y) =\)
- \(\max_{p_1} [H_2(p_1 (1 - f) + (1 - p_1) f) - H_2(f)]\)
- \(H_2(f)\) doesn’t depend on \(p_1\), so we only need \(\max_{p_1} H_2(p_1 (1 - f) + (1 - p_1) f)\)
- Maximum value \(H_2\) can have is \(H_2(0.5) = 1\)
- \(0.5 = p_1 (1 - f) + (1 - p_1) f\)
- \(0.5 = p_1 - p_1 f + f - p_1 f\)
- \(0.5 = p_1 + f - 2 p_1 f\)
- \(0.5 = p_1 (1 - 2 f) + f\)
- \(p_1 = \frac{0.5 - f}{1 - 2 f} = 0.5\) (for \(f \neq 0.5\))
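The result can be checked numerically with a grid search over \(p_1\). This is only a sanity check under an assumed flip probability of `f = 0.1`; the helper names are illustrative.

```python
from math import log2

def h2(p):
    """Binary entropy in bits, with H_2(0) = H_2(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))

def bsc_mutual_information(p1, f):
    """I(X; Y) = H(Y) - H(Y | X) for a BSC with P(x = 1) = p1."""
    return h2(p1 * (1 - f) + (1 - p1) * f) - h2(f)

f = 0.1
grid = [i / 1000 for i in range(1001)]
best_p1 = max(grid, key=lambda p1: bsc_mutual_information(p1, f))
capacity = bsc_mutual_information(best_p1, f)  # should match 1 - H_2(f)
```

The maximiser lands on \(p_1 = 0.5\) and the maximum agrees with \(1 - H_2(f)\), roughly 0.53 bits for this `f`.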

### Shannon’s noisy-channel coding theorem

- For any \(\epsilon > 0\), \(R < C\), and large enough \(N\)…
- There exists a code with length \(N\) and rate \(R\)…
- Such that the probability of a block error is \(< \epsilon\).

#### Parity check matrix

##### Matrix definition

We transfer 4 bits, \(t_1, t_2, t_3, t_4\).

We append 3 parity bits (all additions are modulo 2):

- \(t_5 = t_1 + t_2 + t_3\)
- \(t_6 = t_2 + t_3 + t_4\)
- \(t_7 = t_1 + t_3 + t_4\)

We encode the relationships within \(\textbf{t}\) as a matrix \(H\), where \(H_{i, j}\) represents whether bit \(t_j\) takes part in the parity check for \(t_{4+i}\) (each parity bit takes part in its own check).

\[ H = \begin{pmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0\\ 0 & 1 & 1 & 1 & 0 & 1 & 0\\ 1 & 0 & 1 & 1 & 0 & 0 & 1\\ \end{pmatrix} \]

- Note that the last three columns make up an identity matrix.

##### Encoding & decoding

- \(t\) is encoded such that \(Ht = [0, 0, 0]^T = \textbf{0}\)
- Noise is added to the received signal \(r = t + n\)
- Syndrome \(z = Hr = H(t + n) = Ht + Hn = Hn\)
- Assuming at most one bit was flipped, the column in \(H\) that is equal to \(z\) identifies which bit was flipped.
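The encoding and syndrome decoding steps above can be sketched in plain Python for the \(H\) given earlier. The function names are illustrative, and the decoder assumes at most one flipped bit per block.

```python
H = [
    [1, 1, 1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]

def encode(s):
    """Append the three parity bits to four source bits."""
    t1, t2, t3, t4 = s
    return [t1, t2, t3, t4,
            (t1 + t2 + t3) % 2, (t2 + t3 + t4) % 2, (t1 + t3 + t4) % 2]

def syndrome(r):
    """z = H r (mod 2)."""
    return [sum(h * bit for h, bit in zip(row, r)) % 2 for row in H]

def decode(r):
    """Flip the bit whose column of H matches the syndrome, then drop the parity bits."""
    z = syndrome(r)
    r = list(r)
    if any(z):
        columns = [list(col) for col in zip(*H)]
        r[columns.index(z)] ^= 1
    return r[:4]

t = encode([1, 0, 1, 1])
r = list(t)
r[2] ^= 1            # the channel flips the third bit
recovered = decode(r)
```

Because the seven columns of \(H\) are all distinct and non-zero, every single-bit error produces a different syndrome.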

##### Generalizing >4

- Above, \(H\) is a \((3 \times 7)\) matrix where \(4\) source bits are transmitted for every \(7\) transmission bits.
- Generally, \(H\) is an \(M \times N\) matrix where \(N - M\) source bits are transmitted for every \(N\) transmission bits.
- Thus, the *rate* is \(\frac{N - M}{N}\) or \(1 - \frac{M}{N}\).

#### Proving the theorem for BSC

- Say we have:
- A BSC with probability \(f\) of flipping, with a capacity of \(1 - H_2(f)\)
- A parity check matrix \(H \in \{0, 1\}^{M \times N}\) where \(N > M\).
- A source that has \(N - M\) bits.
- A transmission and a noise vector that have \(N\) bits each.
- A syndrome that has \(M\) bits.
- A “typical set syndrome decoder” \(M \in \mathbb{R}^M \times \mathbb{R}^{N}\) that recovers the transmission from the syndrome.
- This has \(2^{NH_2(f) + ...}\) entries.

- Transmissions have a rate \(R = 1 - \frac{M}{N}\).
- We want to prove \(R < C\) i.e. \(1 - \frac{M}{N} < 1 - H_2(f)\).
- \(P(error) = P_1 + P_2\) where:
- \(P_1\) is the probability of \(n\) not being in \(M\).
- This is arbitrarily small as \(M\) contains the typical set.

- \(P_2\) is the probability of there being a \(\hat{n} \neq n\) that has the same syndrome \(z\).
- \(Hn = H\hat{n}\)
- \(H(n - \hat{n}) = 0\)
- We’ll focus on the constraints for \(P_2\).

- \(P_2 = \sum_n P(n) 1(\exists \hat{n}. \hat{n} \neq n, H(n - \hat{n}) = 0)\)
- (the indicator that at least one such \(\hat{n}\) exists is at most the number of such \(\hat{n}\))

- \(P_2 \leq \sum_n P(n) \sum_{\hat{n} \neq n} 1(H(n - \hat{n}) = 0)\)
- (we average over random \(H\), as it is simpler than assuming a specific \(H\))

- \(P_2 \leq \sum_n P(n) \sum_{\hat{n} \neq n} \sum_{H} P(H) 1(H(n - \hat{n}) = 0)\)
- Aside: \(P(H(n - \hat{n}) = 0 | n - \hat{n} \neq 0) = 2^{-M}\)
- \(P_2 \leq \sum_n P(n) \sum_{\hat{n} \neq n} 2^{-M}\)
- \(P_2 \leq \sum_{\hat{n} \neq n} 2^{-M}\)
- \(P_2 \leq 2^{NH_2(f)} \cdot 2^{-M}\)
- \(P_2 \leq 2^{NH_2(f)- M}\)
- So, \(P_2\) vanishes when:
- \(M \gg NH_2(f)\)
- \(\frac{M}{N} > H_2(f)\)
- \(1 - \frac{M}{N} < 1 - H_2(f)\)
- \(R < C\)!!!


### Entropy when probabilities are certain

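Using the convention \(0 \log \frac{1}{0} = 0\) (the limit of \(p \log \frac{1}{p}\) as \(p \to 0\)):

\[ H_2(0) = 0 \log \frac{1}{0} + 1 \log \frac{1}{1} = 0 \]

\[ H_2(1) = 1 \log \frac{1}{1} + 0 \log \frac{1}{0} = 0 \]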

### Feedback gem

The encoder knowing what noise is added **doesn’t improve the rate**.

## Lectures 9-10: Introduction to Bayesian inference

No notes, covers the basics.

## Lectures 11-14: Approximating Probability Distributions

### K-means

Clustering algorithm. Given \(\{x_n\}\) data points to put into \(K\) clusters:

- Assign cluster means \(\{\mu_k\}\) randomly.
- Assign each data point to the closest cluster: \(r^n_k = 1\) if \(k = \arg\min_{k'} (x_n - \mu_{k'})^2\), otherwise \(r^n_k = 0\).
- Update cluster means to the mean of its data points: \(\mu_k = \sum_n{r^n_k x_n} / \sum_n{r^n_k}\)
- Go back to the assignment step and repeat.
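The loop above can be sketched for one-dimensional data. Two simplifications are assumed here: the initial means are passed in rather than drawn randomly (so the example is reproducible), and the loop runs for a fixed number of iterations instead of testing for convergence. The name `kmeans_1d` is illustrative.

```python
def kmeans_1d(xs, means, iterations=10):
    """Hard K-means on scalars; returns (means, assignments)."""
    means = list(means)
    for _ in range(iterations):
        # Assignment step: index of the closest mean for every point.
        assign = [min(range(len(means)), key=lambda k: (x - means[k]) ** 2)
                  for x in xs]
        # Update step: each mean moves to the average of its points.
        for k in range(len(means)):
            members = [x for x, a in zip(xs, assign) if a == k]
            if members:  # leave an empty cluster's mean where it is
                means[k] = sum(members) / len(members)
    return means, assign

xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
means, assign = kmeans_1d(xs, means=[0.0, 12.0])
```

With two well-separated groups the means settle at the group averages after a single pass.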

#### Bayesian interpretation

K-means is a Maximum A Posteriori (MAP) algorithm which assumes:

- There are \(K\) clusters.
- The probability of being in each cluster is equal.
- Each cluster has the same variance.

These assumptions might not be necessary. Additionally, MAP can result in incorrect cluster centers due to a hard boundary pushing clusters apart.

#### Soft K-means

Cluster responsibilities are no longer binary. Introduces an additional \(\beta\) hyperparameter. Update rule is the same.

\[ r^n_k = \frac{\exp(-\beta d(x_n, \mu_k))}{\sum_{k'} \exp(-\beta d(x_n, \mu_{k'}))} \]
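A sketch of the responsibility formula for scalar data, assuming \(d\) is the squared distance (the notes don’t pin down \(d\)). The function name is illustrative.

```python
from math import exp

def responsibilities(x, means, beta):
    """Soft assignment of point x to each cluster mean."""
    weights = [exp(-beta * (x - m) ** 2) for m in means]
    total = sum(weights)
    return [w / total for w in weights]

halfway = responsibilities(0.0, [-1.0, 1.0], beta=1.0)  # equidistant from both means
stiff = responsibilities(0.5, [-1.0, 1.0], beta=50.0)   # large beta approaches hard K-means
```

As \(\beta\) grows, the responsibilities approach the 0/1 assignments of hard K-means.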

### Monte Carlo methods

#### The problem

We’re given:

- \(P(x) = \frac{P^*(x)}{Z}\), some probability distribution *that we can’t sample from*.
- \(Q(x)\), some simple probability distribution (e.g. uniform, gaussian) *that we can sample from*.
- \(\phi(x)\), some function of interest that we want to run over \(P\).

We want to either be able to draw samples from \(P\), or to calculate the expected value of \(\phi\) under \(P\).

#### Importance sampling

\[ \Phi = \frac{\sum_n \frac{P^*(x_n)}{Q(x_n)} \phi(x_n)}{\sum_n \frac{P^*(x_n)}{Q(x_n)}} \]

where \(\{x_n\}_N\) are \(N\) samples drawn from \(Q\).
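The estimator can be sketched as below. The example target is an assumption made for illustration: \(P^*\) is an unnormalised standard normal and \(Q\) is a wider normal with standard deviation 2, so the true answer for \(\phi(x) = x^2\) is 1.

```python
import math
import random

def importance_estimate(phi, p_star, q_pdf, q_sample, n, rng):
    """Self-normalised importance-sampling estimate of the mean of phi under P."""
    xs = [q_sample(rng) for _ in range(n)]
    weights = [p_star(x) / q_pdf(x) for x in xs]
    return sum(w * phi(x) for w, x in zip(weights, xs)) / sum(weights)

# Illustrative target and proposal (not from the lectures).
p_star = lambda x: math.exp(-x * x / 2)                                  # Z is never needed
q_pdf = lambda x: math.exp(-x * x / 8) / (2 * math.sqrt(2 * math.pi))    # N(0, 2^2) density
q_sample = lambda rng: rng.gauss(0, 2)

estimate = importance_estimate(lambda x: x * x, p_star, q_pdf, q_sample,
                               20_000, random.Random(0))
```

Dividing by the sum of the weights is what removes the unknown \(Z\) from the estimate.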

##### Choice of Q

If \(Q\) has little overlap with \(P\), we’ll be sampling from the wrong part of the distribution. Although we’ll still eventually converge, this can take a while and it’s impossible to tell whether we’ve converged.

#### Rejection sampling

Given a \(Q\) and constant \(c\) such that:

\[ \forall x. cQ(x) \ge P^*(x) \]

- Sample \(x \sim Q\).
- Sample \(u \sim \text{Uniform}(0, cQ(x))\).
- Store \(x\) if \(u < P^*(x)\), reject otherwise.

This provides \(\{x\}\) that are sampled from \(P\).
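The three steps can be sketched as a loop. The example target is an assumption for illustration: \(P^*(x) = x(1 - x)\) on \([0, 1]\), with a uniform \(Q\) and \(c = 0.25\) so that \(cQ(x) \ge P^*(x)\) everywhere.

```python
import random

def rejection_sample(p_star, q_sample, q_pdf, c, n, rng):
    """Draw n samples from P using proposals from Q and the bound c * Q >= P*."""
    samples = []
    while len(samples) < n:
        x = q_sample(rng)
        u = rng.uniform(0, c * q_pdf(x))
        if u < p_star(x):  # keep x only if (x, u) lies under the P* curve
            samples.append(x)
    return samples

# Illustrative target and proposal (not from the lectures).
p_star = lambda x: x * (1 - x)
samples = rejection_sample(p_star, lambda rng: rng.random(), lambda x: 1.0,
                           c=0.25, n=5000, rng=random.Random(0))
mean = sum(samples) / len(samples)  # the target is symmetric about 0.5
```

A looser bound \(c\) still gives correct samples, but more proposals are rejected.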

### Markov Chain Monte Carlo (MCMC) methods

Unlike the Monte Carlo methods above, our samples are no longer independent. We draw from \(Q(x; x')\) rather than \(Q(x)\).

This works well when \(Q\) and \(P\) are very different.

#### Metropolis sampling

- Sample \(x \sim Q(x; x_t)\).
- Calculate \(a = \frac{P^*(x) Q(x_t; x)}{P^*(x_t) Q(x; x_t)}\).
- If \(a>1\), accept, otherwise accept with probability \(a\).
- Accept means \(x_{t+1} = x\).
- Reject means \(x_{t+1} = x_t\).

Asymptotically, \(\{x\}\) is sampled from \(P\), but **they won’t be independent**.

##### Gaussian Q

\(Q(x; x_t) = \mathcal{N}(x_t, \sigma)\). By symmetry, the \(Q\)s in \(a\) cancel out.
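A sketch of Metropolis sampling with the Gaussian proposal, where the \(Q\) terms have cancelled so that \(a = P^*(x) / P^*(x_t)\). The target used in the example (an unnormalised standard normal) and the function name are assumptions for illustration.

```python
import math
import random

def metropolis(p_star, x0, sigma, steps, rng):
    """Random-walk Metropolis chain with a symmetric Gaussian proposal."""
    x, chain = x0, []
    for _ in range(steps):
        proposal = rng.gauss(x, sigma)
        a = p_star(proposal) / p_star(x)
        if a >= 1 or rng.random() < a:
            x = proposal      # accept
        chain.append(x)       # on rejection the old state is repeated
    return chain

p_star = lambda x: math.exp(-x * x / 2)  # illustrative target
chain = metropolis(p_star, x0=0.0, sigma=1.0, steps=20_000, rng=random.Random(0))
mean = sum(chain) / len(chain)
```

Rejected proposals still append the current state, which is one reason consecutive samples are correlated.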

#### Gibbs sampling

Assume we can sample from \(P^*(x_n | x_{!n})\), where \(x_{!n}\) is all elements of the hypothesis apart from \(x_n\).

We do Metropolis sampling, but while cycling through the variables in the hypothesis.

- \(x_t \sim Q(x_{1, t}; x_{!1, t-1})\)
- \(x_{t+1} \sim Q(x_{2, t+1}; x_{!2, t})\)
- …

### Picking step sizes in Metropolis methods

\[ \epsilon = l \]

- We have \(k\) dimensions, of which \(\gamma\) have small acceptable width \(l\), and \(k - \gamma\) have large acceptable width \(L\).
- How do we define the step size \(\epsilon\) when exploring the space?
- We want a **large step size**: It takes \(T = (\frac{L}{\epsilon})^2\) steps to explore the space, so we want to maximise \(\epsilon\).
- We want a **small step size**: Acceptance rate is \((\frac{l}{\epsilon})^\gamma\), so if \(\epsilon \gg l\) then the acceptance rate falls to zero.
- Thus, the middle ground is \(\epsilon = l\) with an acceptance rate of 50%.

### Hamiltonian Monte Carlo

Uses momentum when selecting \(Q(x' ; x)\)

### Slice sampling

- Starting from the current point \(y\), evaluate \(P^*(y)\).
- Sample a height uniformly, \(z \sim [0, P^*(y)]\).
- “Step out” from \(y\) in steps of size \(w\) until \(P^*(y + i w) < z\).
- Start with a random offset drawn from \([-w/2, w/2]\).
- Do this in both directions, yielding \(y_l, y_r\).

- Sample uniformly from \(y' \sim [y_l, y_r]\).
- If \(P^*(y') > z\), accept.
- If \(P^*(y') < z\), sample again from \([y_l, y_r]\).
- But this time don’t include samples “beyond” \(y'\) wrt. the starting point \(y\).
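One step of the procedure can be sketched for a one-dimensional target as follows. The names are illustrative, `y` denotes the current point, and the example target (an unnormalised standard normal) is an assumption.

```python
import math
import random

def slice_sample_step(p_star, y, w, rng):
    """Return the next point of the chain, given the current point y."""
    z = rng.uniform(0, p_star(y))          # height under the curve at y
    # Place an interval of width w around y with a random offset.
    left = y + rng.uniform(-w / 2, w / 2) - w / 2
    right = left + w
    # Step out in both directions until both ends are outside the slice.
    while p_star(left) > z:
        left -= w
    while p_star(right) > z:
        right += w
    # Sample inside the interval, shrinking it towards y after each rejection.
    while True:
        y_new = rng.uniform(left, right)
        if p_star(y_new) > z:
            return y_new
        if y_new < y:
            left = y_new
        else:
            right = y_new

p_star = lambda x: math.exp(-x * x / 2)  # illustrative target
rng = random.Random(0)
y, chain = 0.0, []
for _ in range(5000):
    y = slice_sample_step(p_star, y, w=1.0, rng=rng)
    chain.append(y)
mean = sum(chain) / len(chain)
```

Shrinking always keeps the current point inside the interval, so the inner loop is guaranteed to finish.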

#### Sensitivity of w

Insensitive: If it’s too small, we have more stepout steps, which is linear. If it’s too big, we have more rejection steps, but also trim down the sampling space \([y_l, y_r]\).

### Exact sampling

Sampling method where you can be certain you get a random sample.

- We want a sample at \(t=0\).
- Run two samples as far away from each other in sample space as possible, starting at \(t=-N\).
- Use the same RNG for both samples.

- If they converge to the same state, then you know that *all* samples will converge.
- Thus, starting from \(t<-N\) would not tell us anything more about the state.

- If they don’t converge, start again at \(t=-2N\).
- Run the samples until \(t=0\), and you have a random sample.

### Variational methods

#### Core idea

We’re interested in \(P(x) = \frac{1}{Z}P^*(x) = \frac{1}{Z}e^{-E(x)}\), but we can’t sample from \(E(x)\), only evaluate it.

So we approximate \(P(x)\) by a simpler distribution \(Q(x; \theta)\) and adjust \(\theta\) to get the best approximation.

#### KL divergence

\[ D_{KL}(Q || P) = \sum_x Q(x) \log \frac{Q(x)}{P(x)} \]

##### Effect of order of parameters

##### Zero probabilities

\(D_{KL}(Q || P)\) punishes when the approximate distribution puts mass where the true distribution is zero.

\(D_{KL}(P || Q)\) punishes when the approximate distribution puts zero mass where the true distribution is non-zero.
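The asymmetry shows up in a small discrete example. The distributions below are made up for illustration, and the helper uses base-2 logarithms so the result is in bits.

```python
from math import inf, log2

def kl(a, b):
    """D_KL(a || b) in bits for two discrete distributions given as lists."""
    total = 0.0
    for ai, bi in zip(a, b):
        if ai == 0:
            continue      # terms with zero mass in a contribute nothing
        if bi == 0:
            return inf    # a puts mass where b has none
        total += ai * log2(ai / bi)
    return total

P = [0.5, 0.5, 0.0]    # "true" distribution with a zero
Q = [0.25, 0.25, 0.5]  # approximation that puts mass on that zero

q_given_p = kl(Q, P)   # infinite: Q has mass where P is zero
p_given_q = kl(P, Q)   # finite: Q covers everything P does
```

Here \(D_{KL}(Q || P)\) is infinite while \(D_{KL}(P || Q)\) is exactly 1 bit.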

##### Computability

\(D_{KL}(A || B)\) performs an expectation over \(A\). Since we can’t sample from \(P\), we are effectively forced into using \(D_{KL}(Q || P)\).

#### Variational free energy

We want to compute \(D_{KL}(Q || P)\). \(P(x)\) is not computable, \(E(x)\) is computable but not samplable. Thus we replace \(P\) with \(E\) and ignore the \(Z\) term as it is constant and thus not optimizable. This becomes the **variational free energy**.

\[ \begin{aligned} D_{KL}(Q || P) &= \sum_x [ Q(x) \log \frac{Q(x)}{P(x)} ] \\ &= \sum_x [ Q(x) (\log Q(x) - \log P(x)) ] \\ &= \sum_x [ Q(x) (\log Q(x) - \log (\frac{1}{Z} e^{-E(x)})) ] \\ &= \sum_x [ Q(x) (\log Q(x) - \log \frac{1}{Z} - \log e^{-E(x)}) ] \\ &= \sum_x [ Q(x) (\log Q(x) - \log \frac{1}{Z} + E(x)) ] \\ &= \sum_x [ Q(x) E(x) + Q(x) \log Q(x) - Q(x) \log \frac{1}{Z} ] \\ &= \sum_x [ Q(x) E(x) + Q(x) \log Q(x) ] + \log Z \\ &= \sum_x [ Q(x) E(x) ] + \sum_x [ Q(x) \log Q(x) ] + \log Z \\ &= \sum_x [ Q(x) E(x) ] - H_Q(x) + \log Z \end{aligned} \]

Dropping the constant \(\log Z\) leaves the variational free energy:

\[ \sum_x [ Q(x) E(x) ] - H_Q(x) \]

## Lectures 15-16: Data Modelling With Neural Networks

### Capacity of a single neuron

Two bits per weight. A neuron with \(k\) inputs can memorize the labels of roughly \(2k\) random inputs.

### Effect of in/bias/out stds

- Input parameters: Higher has more curves, shorter time scales.
- Bias parameters: Higher has more curves, same time scales.
- Output parameters: Higher has large magnitude outputs.

### Effect of regularization weight

If your cost function has a regularising term \(\alpha L_2\), then this is equivalent to having a prior on your weights being drawn from \(\mathcal{N}(0, \frac{1}{\alpha})\).
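A short sketch of why this holds, assuming the regulariser is defined as \(L_2 = \frac{1}{2} \sum_i w_i^2\) (the notes don’t spell out the factor of \(\frac{1}{2}\)). Taking the negative log of a Gaussian prior with variance \(\frac{1}{\alpha}\) on each weight:

\[ P(w) = \prod_i \sqrt{\frac{\alpha}{2 \pi}} \exp(-\frac{\alpha}{2} w_i^2) \quad \Rightarrow \quad -\log P(w) = \frac{\alpha}{2} \sum_i w_i^2 + \text{const} = \alpha L_2 + \text{const} \]

So if the rest of the cost is a negative log likelihood, minimising the regularised cost gives the same weights as maximising the posterior.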