# Information theory

Going through lecture series: Information Theory, Pattern Recognition, and Neural Networks by David MacKay. The textbook is available for free online.

## Lecture 1: Introduction to Information Theory

### Sending information over noisy channels (formulation)

• We’re using systems to send information over imperfect channels:
• Source message $$s$$.
• Encoded into a transmission message $$t$$.
• Sent across an imperfect channel which adds noise $$n$$.
• Received on the other side as message $$r$$.
• Decoded into a guess of $$s$$, $$\hat{s}$$.
• The decoder infers $$n$$ and $$s$$.

#### Binary symmetric channel (BSC)

• $$s = t \in \{0, 1\}^+$$
• Noise is added by flipping each bit of $$t$$ with probability $$f$$.
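The BSC is easy to simulate; a minimal sketch (the function name `bsc` and the fixed seed are my own choices):

```python
import random

def bsc(t, f, seed=0):
    """Flip each bit of t independently with probability f."""
    rng = random.Random(seed)
    return [bit ^ (rng.random() < f) for bit in t]

t = [0, 1, 1, 0, 1, 0, 0, 1]
r = bsc(t, f=0.1)  # the received message: t with some bits flipped
```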

## Lectures 2-5: Entropy and Data Compression

### Ensembles

• An “Ensemble” $$X$$ is $$(x, A_x, P_x)$$
• $$x$$ is a random variable.
• $$A_x$$ is an alphabet $$\{a_1, a_2, ...\}$$.
• $$P_x$$ is a probability distribution over $$A_x$$ $$\{p_1, p_2, ...\}$$.
• $$\sum_i{p_i} = 1$$

### Shannon information content

• The Shannon information content of an outcome $$x = a_i$$ is:
• $$h(x = a_i) = log_2(\frac{1}{P(x=a_i)})$$
• Measured in bits.

#### Likely vs. unlikely events

• Unlikely events have a lot of information content, likely events have little.

#### Relationship to compressed file length

• $$h(x=a_i)$$ is the ideal compressed length, in bits, for reporting that the outcome $$a_i$$ occurred.

• $$h(x=a_i)$$ is additive for independent random variables.
• $$h(x,y) = log_2(\frac{1}{P(x,y)})$$
• $$h(x,y) = log_2(\frac{1}{P(x)}) + log_2(\frac{1}{P(y)})$$
• $$h(x,y) = h(x) + h(y)$$

### Entropy

• The entropy of an ensemble is the average Shannon information content.
• $$H(X) = \sum_a{P(x=a) h(x=a)}$$
• $$H(X) = \sum_a{P(x=a) \log_2(\frac{1}{P(x=a)})}$$
• Measured in bits.
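The formula can be computed directly; a small sketch (the `entropy` helper is my own, not from the lectures):

```python
import math

def entropy(p):
    """H(X) in bits; terms with zero probability contribute nothing."""
    return sum(pa * math.log2(1 / pa) for pa in p if pa > 0)

entropy([0.5, 0.5])    # a fair coin: 1 bit per outcome
entropy([0.01, 0.99])  # a heavily biased coin: far less than 1 bit
```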

#### Binary entropy

$H_2(x) = x log \frac{1}{x} + (1 - x) log \frac{1}{1 - x}$

### Source coding theorem

• $$N$$ outcomes from a source $$X$$ can be compressed into $$\approx NH(X)$$ bits.

### Typical outcomes

• Source $$X$$ produces $$N$$ independent outcomes $$\textbf{x}=x_1,x_2,...$$.
• Assume $$P_x$$ is non-uniform, i.e. some outcomes are much more likely than others.
• The sequence $$\textbf{x}$$ is likely to be one of $$\approx 2^{NH(X)}$$ typical outcomes.

### Symbol codes

• Map from symbol $$x$$ to codes $$c(x)$$.
• Concatenate codes without punctuation.
• $$[x_1,x_2,x_3] \to c(x_1)c(x_2)c(x_3)$$

#### Expected length

$$L(C, X) = \sum_x{P(x)len(C(x))}$$

### Kraft inequality

• If a code is uniquely decodable…
• $$\sum_i{2^{-l_i}} \leq 1$$
• When $$\sum_i{2^{-l_i}} = 1$$, the code is complete.
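The inequality is quick to check numerically (the `kraft_sum` helper is illustrative):

```python
def kraft_sum(lengths):
    """Sum of 2^{-l_i} over the codeword lengths."""
    return sum(2 ** -l for l in lengths)

kraft_sum([1, 2, 3, 3])  # → 1.0, so e.g. {0, 10, 110, 111} is complete
kraft_sum([1, 1, 2])     # → 1.25 > 1: no uniquely decodable code has these lengths
```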

### Proving a message’s length can’t be less than its entropy

Proving $$L(C, X) \geq H(X)$$.

• Alphabet $$x_i$$, codes $$c_i$$, code lengths $$l_i$$, probabilities $$p_i$$.
• Say that the ideal code lengths are $$l_i^{*} = h(x_i) = log_2(\frac{1}{p_i})$$
• Given we already have $$l_i$$, we can say what the “implicit” probability $$q_i$$ of each $$x_i$$ is:
• $$q_i = 2^{-l_i}$$
• But we need to make sure $$\sum{q_i} = 1$$, so we normalize by $$z$$.
• $$z = \sum{2^{-l_i}}$$
• $$z \leq 1$$ for any uniquely decodable code (see Kraft inequality).
• $$z = 1$$ for any complete code.
• $$q_i = \frac{2^{-l_i}}{z} = \frac{2^{-l_i}}{\sum{2^{-l_i}}}$$
• So, $$l_i = log(\frac{1}{q_i}) - log(z)$$
• $$L(C, X) = \sum_i{p_i l_i}$$
• $$L(C, X) = \sum_i{p_i (log(\frac{1}{q_i}) - log(z))}$$
• $$L(C, X) = \sum_i{p_i log(\frac{1}{q_i})} - log(z)$$
• $$L(C, X) = \sum_i{p_i log(\frac{1}{p_i})} + \sum_i{p_i log(\frac{p_i}{q_i})} - log(z)$$
• $$L(C, X) = H(X) + \sum_i{p_i log(\frac{p_i}{q_i})} - log(z)$$
• $$L(C, X) = H(X) + D_{KL}(p || q) - log(z)$$
• $$D_{KL}(p || q) = \sum_i{p_i log(\frac{p_i}{q_i})}$$
• $$D_{KL}(p || q) \geq 0$$
• $$-log(z) \geq 0$$
• So, $$L(C, X) \geq H(X)$$
• And $$L(C, X) = H(X)$$ if:
• There’s no difference between $$p$$ and $$q$$: $$D_{KL}(p || q) = 0$$.
• You have a complete code: $$z = 1$$.
• If ideal lengths $$l_i = log(\frac{1}{p_i})$$ are not integers:
• $$H \leq L < H + 1$$

### Huffman algorithm

• Making a binary tree.
• Given unconnected leaf nodes with values $$p_1, p_2, ...$$.
• Take two smallest leaf nodes, and make a split node.
• Add the leaf node to the list of nodes.
• Repeat until there’s only one node in the list, the root node.
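The steps above can be sketched with a heap. This `huffman_lengths` helper (my own construction) returns only the codeword lengths, which is enough to check the result against the Kraft inequality:

```python
import heapq

def huffman_lengths(probs):
    """Return Huffman codeword lengths for each symbol.
    Heap entries are (probability, tiebreak id, symbol indices under this node)."""
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)   # two least probable nodes...
        p2, i2, s2 = heapq.heappop(heap)
        for s in s1 + s2:                 # ...each symbol below them gains a bit
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, i2, s1 + s2))
    return lengths

huffman_lengths([0.5, 0.25, 0.125, 0.125])  # → [1, 2, 3, 3]
```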

#### Main issue

• Symbol codes require $$\geq 1$$ bit per character, even if $$H \ll 1$$.
• For example:
• $$p = [0.01, 0.99]$$
• $$c = [0, 1]$$
• $$L = 1$$
• $$H = 0.01 log(1 / 0.01) + 0.99 log(1 / 0.99) = 0.08$$
• So, $$H \leq L < H + 1 \to 0.08 \leq 1 < 1.08$$ holds, but this isn’t efficient!

### Arithmetic coding

• Given a string in the alphabet $$\textbf{x} = x_1, x_2, ...$$
• Given an oracle $$P(x_t | x_1, x_2, ..., x_{t-1})$$.
• Picture $$\textbf{x}$$ dividing the interval $$[0, 1]$$ repeatedly. For example:
• Alphabet $$\{a, b, c\}$$.
• $$P(x_1) = \{0.5, 0.25, 0.25\}$$
• Interval of $$x_1 = a$$ is $$[0, 0.5]$$.
• Interval of $$x_1 = b$$ is $$[0.5, 0.75]$$.
• Interval of $$x_1 = c$$ is $$[0.75, 1]$$.
• Further elements $$x_2, x_3, ...$$ further subdivide the space.
• Take the interval $$[i_{lower}, i_{higher}]$$ described by $$\textbf{x}$$.
• Find a binary string $$\textbf{y}$$ (whose digits have implicit probabilities $$P(y_n=1) = 0.5$$) whose interval lies within $$[i_{lower}, i_{higher}]$$.
• $$\textbf{y}$$ is now the encoded version of $$\textbf{x}$$!
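The repeated subdivision can be sketched for a fixed distribution. This is a simplification of the scheme above, since a real coder would query the oracle $$P(x_t | x_1, ..., x_{t-1})$$ at each step (the `interval` helper is my own):

```python
def interval(xs, probs):
    """Interval [lo, lo + width) that the string xs describes, assuming fixed
    per-symbol probabilities (a real coder would use conditional ones)."""
    lo, width = 0.0, 1.0
    for x in xs:
        # Offset of x's sub-interval: total probability of earlier symbols.
        cum = sum(probs[s] for s in sorted(probs) if s < x)
        lo += width * cum
        width *= probs[x]
    return lo, lo + width

probs = {"a": 0.5, "b": 0.25, "c": 0.25}
interval("b", probs)   # → (0.5, 0.75), matching the example above
interval("ba", probs)  # → (0.5, 0.625): "a" subdivides b's interval
```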

#### Length of encoding

• $$L(\textbf{x}) \leq log \frac{1}{P(\textbf{x})} + 2$$
• $$L(\textbf{x}) \leq H(\textbf{x}) + 2$$

#### Computation

You only need to compute $$NI$$ conditional probabilities, where $$N$$ is the length of the string and $$I$$ is the size of the alphabet.

#### Main advantage over Huffman algorithm

• Huffman codes use at least 1 bit per token.
• $$H(X) \leq L(C, X) < H(X) + 1$$ where $$X$$ is a single token, so the up-to-1-bit overhead is paid per token.
• Arithmetic coding can use less than 1 bit per token.
• $$L(\textbf{x}) \leq H(\textbf{x}) + 2$$ where $$\textbf{x}$$ is the whole string, so the 2-bit overhead is paid once.

#### Building simple predictor

• Take $$[0, 6]$$-gram statistics.
• Use these to predict the next token.

## Lectures 6-8: Noisy channel coding

### Entropy of joint distributions

• $$H(X, Y) = \sum_{x \in X} \sum_{y \in Y} P(x, y) h(x, y)$$
• where $$h(x, y) = log \frac{1}{P(x, y)}$$
• $$H(X | Y=y) = \sum_{x \in X} P(x | Y=y) h(x | y)$$
• where $$h(x | Y=y) = log \frac{1}{P(x | Y=y)}$$
• $$H(X | Y) = \sum_{y \in Y} P(Y=y) H(X | Y=y)$$
• $$H(X) + H(Y) \geq H(X, Y)$$
• $$H(X) + H(Y) = H(X, Y)$$ if $$X$$ and $$Y$$ are independent.
• $$H(X | Y) \leq H(X)$$
• $$H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y)$$

### Mutual information

• $$I(X ; Y) = H(X) - H(X | Y)$$
• $$I(X ; Y) = H(Y) - H(Y | X)$$
• $$I(X ; Y) = H(X, Y) - H(X | Y) - H(Y | X)$$
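These identities are easy to check numerically. A sketch using the equivalent form $$I(X;Y) = H(X) + H(Y) - H(X, Y)$$ (helper names are my own):

```python
import math

def H(ps):
    """Entropy in bits of a distribution given as an iterable of probabilities."""
    return sum(p * math.log2(1 / p) for p in ps if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint given as a 2D list P(x, y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    pxy = [p for row in joint for p in row]
    return H(px) + H(py) - H(pxy)

# Y is a copy of X: I(X;Y) = H(X) = 1 bit.
mutual_information([[0.5, 0.0], [0.0, 0.5]])
# X and Y independent: I(X;Y) = 0.
mutual_information([[0.25, 0.25], [0.25, 0.25]])
```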

### Formalizing channels as matrices

• We formalize channels as a matrix $$Q$$, where:
• $$Q_{i, j} = P(Y = b_j | X = a_i)$$
• $$a$$ is the input alphabet, indexed by $$i$$.
• $$b$$ is the output alphabet, indexed by $$j$$.
• We can find $$P(X | Y)$$ using Bayes.

### Capacity of a channel

• $$C(Q) = \max_{P_x} I(X ; Y)$$
• i.e. the maximum mutual information given we can choose the input probability distribution.

#### Optimal input distribution

• $$P_x$$ chosen in $$C(Q) = \max_{P_x} I(X ; Y)$$
##### For BSC
• Optimal input distribution is $$P(x_i = 1) = 0.5$$
• Proof:
• $$p_1 = P(x_i = 1)$$
• $$I(X ; Y) = H(Y) - H(Y | X)$$
• $$I(X ; Y) = H_2(p_1 (1 - f) + (1 - p_1) f) - H_2(f)$$
• $$max_{p_1} I(X ; Y) =$$
• $$max_{p_1} H_2(p_1 (1 - f) + (1 - p_1) f) - H_2(f)$$
• $$max_{p_1} H_2(p_1 (1 - f) + (1 - p_1) f)$$
• Maximum value $$H_2$$ can have is $$H_2(0.5) = 1$$
• $$0.5 = p_1 (1 - f) + (1 - p_1) f$$
• $$0.5 = p_1 - p_1 f + f - p_1 f$$
• $$0.5 = p_1 + f - 2 p_1 f$$
• $$0.5 = p_1 (1 - 2 f) + f$$
• $$p_1 = \frac{0.5 - f}{1 - 2 f} = 0.5$$
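Putting this together, the BSC’s capacity $$C = 1 - H_2(f)$$ can be evaluated directly (a sketch; the helper names are mine):

```python
import math

def H2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

def bsc_capacity(f):
    """C(Q) = 1 - H2(f), achieved by the uniform input distribution."""
    return 1 - H2(f)

bsc_capacity(0.0)  # → 1.0: a noiseless channel carries 1 bit per use
bsc_capacity(0.5)  # → 0.0: pure noise carries nothing
```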

### Shannon’s noisy-channel coding theorem

• For any $$\epsilon > 0$$, $$R < C$$, and large enough $$N$$
• There exists a code with length $$N$$ and rate $$R$$
• Such that the probability of a block error is $$< \epsilon$$.

#### Parity check matrix

##### Matrix definition
• We transfer 4 bits, $$t_1, t_2, t_3, t_4$$.
• We append 3 parity bits:
• $$t_5 = t_1 + t_2 + t_3$$
• $$t_6 = t_2 + t_3 + t_4$$
• $$t_7 = t_1 + t_3 + t_4$$
• We encode the relationships within $$\textbf{t}$$ as a matrix $$H$$, where $$H_{i, j}$$ represents whether bit $$j$$ was included in parity bit $$t_{4+i}$$’s calculation. $H = \begin{pmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0\\ 0 & 1 & 1 & 1 & 0 & 1 & 0\\ 1 & 0 & 1 & 1 & 0 & 0 & 1\\ \end{pmatrix}$
• Note that the last three columns make up an identity matrix.
##### Encoding & decoding
• $$t$$ is encoded such that $$Ht = [0, 0, 0]^T = \textbf{0}$$
• Noise is added to the received signal $$r = t + n$$
• Syndrome $$z = Hr = H(t + n) = Ht + Hn = Hn$$, since $$Ht = \textbf{0}$$ (arithmetic mod 2).
• The column in $$H$$ that is equal to $$z$$ identifies which bit was flipped.
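The whole check-and-correct loop for the (7, 4) code above fits in a few lines (a sketch; the helper names are my own):

```python
# H from above: rows are the parity checks for t5, t6, t7.
H = [
    [1, 1, 1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]

def syndrome(r):
    """z = Hr (mod 2)."""
    return [sum(h * b for h, b in zip(row, r)) % 2 for row in H]

def correct(r):
    """Flip the bit whose column of H equals the syndrome (assumes <= 1 flip)."""
    z = syndrome(r)
    if any(z):
        j = [[row[j] for row in H] for j in range(7)].index(z)
        r = r[:j] + [r[j] ^ 1] + r[j + 1:]
    return r

t = [1, 0, 1, 1, 0, 0, 1]  # a valid codeword: Ht = 0
r = t[:]
r[2] ^= 1                  # the channel flips one bit
correct(r)                 # → recovers t
```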
##### Generalizing >4
• Above, $$H$$ is a $$(3 \times 7)$$ matrix where $$4$$ source bits are transmitted for every $$7$$ transmission bits.
• Generally, $$H$$ is an $$M \times N$$ matrix where $$N - M$$ source bits are transmitted for every $$N$$ transmission bits.
• Thus, the rate is $$\frac{N - M}{N}$$ or $$1 - \frac{M}{N}$$.

#### Proving the theorem for BSC

• Say we have:
• A BSC with probability $$f$$ of flipping, with a capacity of $$1 - H_2(f)$$
• A parity check matrix $$H \in \{0, 1\}^{M \times N}$$ where $$N > M$$.
• A source that has $$N - M$$ bits.
• A transmission and a noise vector that each have $$N$$ bits.
• A syndrome that has $$M$$ bits.
• A “typical set syndrome decoder”: a lookup table from syndromes in $$\{0, 1\}^M$$ to typical noise vectors in $$\{0, 1\}^N$$.
• This has $$\approx 2^{NH_2(f)}$$ entries, one per typical noise vector.
• Transmissions have a rate $$R = 1 - \frac{M}{N}$$.
• We want to show the probability of error vanishes whenever $$R < C$$, i.e. $$1 - \frac{M}{N} < 1 - H_2(f)$$.
• $$P(error) = P_1 + P_2$$ where:
• $$P_1$$ is the probability of $$n$$ not being in the decoder’s table.
• This is arbitrarily small, as the table contains the typical set.
• $$P_2$$ is the probability of there being a $$\hat{n} \neq n$$ that has the same syndrome $$z$$.
• $$Hn = H\hat{n}$$
• $$H(n - \hat{n}) = 0$$
• We’ll focus on the constraints for $$P_2$$.
• $$P_2 = \sum_n P(n) 1(\exists \hat{n}. \hat{n} \neq n, H(n - \hat{n}) = 0)$$
• (the indicator that at least one such $$\hat{n}$$ exists is at most the number of such $$\hat{n}$$)
• $$P_2 \leq \sum_n P(n) \sum_{\hat{n} \neq n} 1(H(n - \hat{n}) = 0)$$
• (we take the average $$H$$ as it is simpler to not assume a specific $$H$$)
• $$P_2 \leq \sum_n P(n) \sum_{\hat{n} \neq n} \sum_{H} P(H) 1(H(n - \hat{n}) = 0)$$
• Aside: $$P(H(n - \hat{n}) = 0 | n - \hat{n} \neq 0) = 2^{-M}$$
• $$P_2 \leq \sum_n P(n) \sum_{\hat{n} \neq n} 2^{-M}$$
• $$P_2 \leq \sum_{\hat{n} \neq n} 2^{-M}$$
• $$P_2 \leq 2^{NH_2(f)} \cdot 2^{-M}$$
• $$P_2 \leq 2^{NH_2(f)- M}$$
• So, $$P_2$$ vanishes as:
• $$M > NH_2(f)$$
• $$\frac{M}{N} > H_2(f)$$
• $$1 - \frac{M}{N} < 1 - H_2(f)$$
• $$R < C$$!!!

### Entropy when probabilities are certain

$H_2(0) = 0 \cdot log \frac{1}{0} + 1 \cdot log \frac{1}{1} = 0$

$H_2(1) = 1 \cdot log \frac{1}{1} + 0 \cdot log \frac{1}{0} = 0$

using the convention $$0 \cdot log \frac{1}{0} = \lim_{p \to 0} p \, log \frac{1}{p} = 0$$: certain outcomes carry no information.

### Feedback gem

A surprising result: the encoder knowing what noise the channel adds doesn’t improve the achievable rate, i.e. feedback doesn’t increase the capacity of a memoryless channel.

## Lectures 9-10: Introduction to Bayesian inference

No notes, covers the basics.

## Lectures 11-14: Approximating Probability Distributions

### K-means

Clustering algorithm. Given $$\{x_n\}$$ data points to put into $$K$$ clusters:

1. Assign cluster means $$\{\mu_k\}$$ randomly.
2. Assign each data point to the closest cluster: $$r^n_k = 1$$ if $$k = \arg\min_{k'} (x_n - \mu_{k'})^2$$, else $$0$$.
3. Update each cluster mean to the mean of its assigned data points: $$\mu_k = \sum_n{r^n_k x_n} / \sum_n{r^n_k}$$
4. Go to 2.
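The loop above, sketched for 1D data (a toy; the `kmeans` helper and its seeding are my own choices):

```python
import random

def kmeans(xs, k, iters=20, seed=0):
    """Plain 1D K-means: assign each point to its nearest mean, then move
    each mean to the average of its assigned points."""
    rng = random.Random(seed)
    means = rng.sample(xs, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: (x - means[j]) ** 2)
            clusters[nearest].append(x)
        # Keep a mean in place if it owns no points.
        means = [sum(c) / len(c) if c else means[j]
                 for j, c in enumerate(clusters)]
    return sorted(means)

xs = [0.0, 0.1, 0.2, 9.8, 9.9, 10.0]
kmeans(xs, 2)  # → means near 0.1 and 9.9
```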

#### Bayesian interpretation

K-means is a Maximum A Posteriori (MAP) algorithm which assumes:

• There are $$K$$ clusters.
• The probability of being in each cluster is equal.
• Each cluster has the same variance.

These assumptions might not hold for the data. Additionally, MAP can result in incorrect cluster centers: the hard assignment boundary pushes the inferred clusters apart.

#### Soft K-means

Cluster responsibilities are no longer binary. Introduces an additional $$\beta$$ (stiffness) hyperparameter. The mean-update rule is the same, now weighted by the soft responsibilities.

$r^n_k = \frac{\exp(-\beta d(x_n, \mu_k))}{\sum_{k'} \exp(-\beta d(x_n, \mu_{k'}))}$
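The responsibility formula can be evaluated directly (an illustrative helper; 1D squared distance assumed):

```python
import math

def responsibilities(x, means, beta):
    """Soft assignment of point x to each cluster mean."""
    ws = [math.exp(-beta * (x - m) ** 2) for m in means]
    total = sum(ws)
    return [w / total for w in ws]

responsibilities(0.4, [0.0, 1.0], beta=1.0)    # leans toward the nearer mean
responsibilities(0.4, [0.0, 1.0], beta=100.0)  # large beta: nearly hard assignment
```

As $$\beta \to \infty$$ the assignments become binary and we recover ordinary K-means.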

### Monte Carlo methods

#### The problem

We’re given:

• $$P(x) = \frac{P^*(x)}{Z}$$, some probability distribution that we can’t sample from.
• $$Q(x)$$, some simple probability distribution (e.g. uniform, gaussian) that we can sample from.
• $$\phi(x)$$, some function of interest that we want to evaluate under $$P$$.

We want to either be able to draw samples from $$P$$, or to calculate the value of $$\phi$$.

#### Importance sampling

$\Phi = \frac{\sum_n \frac{P^*(x_n)}{Q(x_n)} \phi(x_n)}{\sum_n \frac{P^*(x_n)}{Q(x_n)}}$

where $$\{x_n\}_N$$ are $$N$$ samples drawn from $$Q$$.
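A sketch of the estimator, with a toy unnormalized Gaussian as $$P^*$$ and a uniform $$Q$$ (all names and the toy target are my own choices):

```python
import math, random

def importance_sample(p_star, q_pdf, q_sample, phi, n, seed=0):
    """Estimate E_P[phi] using n samples from Q, weighted by P*(x)/Q(x)."""
    rng = random.Random(seed)
    xs = [q_sample(rng) for _ in range(n)]
    ws = [p_star(x) / q_pdf(x) for x in xs]
    return sum(w * phi(x) for w, x in zip(ws, xs)) / sum(ws)

# Toy target: an unnormalized N(1, 1); proposal: uniform on [-4, 6].
p_star = lambda x: math.exp(-0.5 * (x - 1) ** 2)  # Z deliberately unknown
q_pdf = lambda x: 0.1
q_sample = lambda rng: rng.uniform(-4, 6)
est = importance_sample(p_star, q_pdf, q_sample, phi=lambda x: x, n=100_000)
# est ≈ 1.0, the mean of P, despite never knowing Z
```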

##### Choice of Q

If $$Q$$ has little overlap with $$P$$, we’ll mostly be sampling from the wrong part of the distribution. Although we’ll still eventually converge, this can take a while, and it’s hard to tell whether we have converged.

#### Rejection sampling

Given a $$Q$$ and constant $$c$$ such that: $\forall x. cQ(x) \ge P^*(x)$

1. Sample $$x \sim Q$$.
2. Sample $$u \sim \text{Uniform}(0, cQ(x))$$.
3. Store $$x$$ if $$u < P^*(x)$$, reject otherwise.

This provides $$\{x\}$$ that are sampled from $$P$$.
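A sketch of the three steps, again with a toy unnormalized Gaussian target (the names and constants are my own choices):

```python
import math, random

def rejection_sample(p_star, c, q_pdf, q_sample, n, seed=0):
    """Draw n samples from P, assuming c*Q(x) >= P*(x) everywhere."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        x = q_sample(rng)
        u = rng.uniform(0, c * q_pdf(x))
        if u < p_star(x):  # keep x with probability P*(x) / (c Q(x))
            out.append(x)
    return out

# Toy target: unnormalized N(0, 1); proposal: uniform on [-5, 5] with c = 10,
# so c*Q(x) = 1 >= P*(x) for all x in the support.
samples = rejection_sample(lambda x: math.exp(-0.5 * x * x), c=10.0,
                           q_pdf=lambda x: 0.1,
                           q_sample=lambda rng: rng.uniform(-5, 5), n=20_000)
```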

### Markov Chain Monte Carlo (MCMC) methods

Unlike the Monte Carlo methods above, our samples are no longer independent: we draw each proposal from $$Q(x; x')$$, conditioned on the previous sample $$x'$$, rather than from a fixed $$Q(x)$$.

This works well when no simple $$Q$$ resembles $$P$$ globally: the proposal only needs to be sensible locally.

#### Metropolis sampling

1. Sample $$x \sim Q(x; x_t)$$.
2. Calculate $$a = \frac{P^*(x) Q(x_t; x)}{P^*(x_t) Q(x; x_t)}$$.
3. If $$a>1$$, accept, otherwise accept with probability $$a$$.
• Accept means $$x_{t+1} = x$$.
• Reject means $$x_{t+1} = x_t$$.

Asymptotically, $$\{x\}$$ is sampled from $$P$$, but they won’t be independent.

##### Gaussian Q

$$Q(x; x_t) = \mathcal{N}(x_t, \sigma)$$. By symmetry, the $$Q$$s in $$a$$ cancel out.
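A sketch of the procedure with this Gaussian proposal (the names and toy target are my own choices):

```python
import math, random

def metropolis(p_star, x0, sigma, n, seed=0):
    """Metropolis sampling with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    xs, x = [], x0
    for _ in range(n):
        x_new = rng.gauss(x, sigma)
        a = p_star(x_new) / p_star(x)  # the symmetric Q terms cancel
        if a > 1 or rng.random() < a:
            x = x_new  # accept: x_{t+1} = x_new
        xs.append(x)   # on reject, the old x is kept: x_{t+1} = x_t
    return xs

# Toy target: unnormalized N(0, 1); the chain's histogram approaches P.
xs = metropolis(lambda x: math.exp(-0.5 * x * x), x0=0.0, sigma=1.0, n=50_000)
```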

#### Gibbs sampling

Assume we can sample from $$P^*(x_n | x_{!n})$$, where $$x_{!n}$$ is all elements of the hypothesis apart from $$x_n$$.

We do Metropolis sampling while cycling through the variables of the hypothesis, proposing each from its conditional; such proposals are always accepted.

• $$x_{1, t+1} \sim P^*(x_1 | x_{!1, t})$$
• $$x_{2, t+2} \sim P^*(x_2 | x_{!2, t+1})$$
• …and so on, cycling through all the variables.

### Picking step sizes in Metropolis methods

$\epsilon = l$

• We have $$k$$ dimensions, of which $$\gamma$$ have small acceptable width $$l$$, and $$k - \gamma$$ have large acceptable width $$L$$.
• How do we define the step size $$\epsilon$$ when exploring the space?
• We want a large step size: It takes $$T = (\frac{L}{\epsilon})^2$$ steps to explore the space, so we want to maximise $$\epsilon$$.
• We want a small step size: Acceptance rate is $$(\frac{l}{\epsilon})^\gamma$$, so if $$\epsilon \gg l$$ then the acceptance rate falls to zero.
• Thus, the middle ground is $$\epsilon = l$$ with an acceptance rate of 50%.

### Hamiltonian Monte Carlo

Uses momentum when selecting $$Q(x' ; x)$$

### Slice sampling

• Given the current sample $$x$$, evaluate $$y = P^*(x)$$.
• Sample a height uniformly: $$u \sim [0, y]$$.
• “Step out” from $$x$$ in steps of size $$w$$ until $$P^*(x + i w) < u$$.
• Start with a random offset drawn from $$[-w/2, w/2]$$.
• Do this in both directions, yielding endpoints $$x_l, x_r$$.
• Sample uniformly: $$x' \sim [x_l, x_r]$$.
• If $$P^*(x') > u$$, accept $$x'$$ as the next sample.
• If $$P^*(x') < u$$, shrink the interval so it no longer includes anything “beyond” $$x'$$ wrt. the starting point $$x$$, and sample again.

#### Sensitivity of w

Insensitive: If it’s too small, we take more step-out steps, but that cost is only linear. If it’s too big, we take more rejection steps, but each rejection also trims down the sampling interval.
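The whole procedure, step-out and shrinkage included, sketched in 1D (the names and toy Gaussian target are my own choices):

```python
import math, random

def slice_sample(p_star, x, w, rng):
    """One slice-sampling transition from x, with step-out size w."""
    u = rng.uniform(0, p_star(x))     # height of the slice
    left = x - rng.uniform(0, w)      # randomly offset the initial interval
    right = left + w
    while p_star(left) > u:           # step out until both ends leave the slice
        left -= w
    while p_star(right) > u:
        right += w
    while True:
        x_new = rng.uniform(left, right)
        if p_star(x_new) > u:
            return x_new              # accept
        if x_new < x:                 # reject: shrink the interval towards x
            left = x_new
        else:
            right = x_new

rng = random.Random(0)
x, xs = 0.0, []
for _ in range(20_000):
    x = slice_sample(lambda t: math.exp(-0.5 * t * t), x, w=2.0, rng=rng)
    xs.append(x)
```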

### Exact sampling

Sampling method where you can be certain you’ve drawn an exact sample from the target distribution (coupling from the past).

• We want a sample at $$t=0$$.
• Run two chains from states as far apart in sample space as possible, starting at $$t=-N$$.
• Use the same random numbers for both chains.
• If they converge to the same state, then you know that chains from all starting states would have converged.
• Thus, starting from $$t<-N$$ would not tell us anything more about the state at $$t=0$$.
• If they don’t converge, start again at $$t=-2N$$.
• Run the coalesced chain until $$t=0$$, and you have an exact sample.

### Variational methods

#### Core idea

We’re interested in $$P(x) = \frac{1}{Z}P^*(x) = \frac{1}{Z}e^{-E(x)}$$, but we can’t sample from it; we can only evaluate $$E(x)$$.

So we approximate $$P(x)$$ by a simpler distribution $$Q(x; \theta)$$ and adjust $$\theta$$ to get the best approximation.

#### KL divergence

$D_{KL}(Q || P) = \sum_x Q(x) \log \frac{Q(x)}{P(x)}$

##### Zero probabilities

$$D_{KL}(Q || P)$$ punishes when the approximate distribution puts mass where the true distribution is zero.

$$D_{KL}(P || Q)$$ punishes when the approximate distribution puts zero mass where the true distribution is non-zero.

##### Computability

$$D_{KL}(A || B)$$ performs an expectation over $$A$$. Since we can’t sample from $$P$$, we are effectively forced into using $$D_{KL}(Q || P)$$.

#### Variational free energy

We want to compute $$D_{KL}(Q || P)$$. We can’t evaluate $$P(x)$$ (the normalizer $$Z$$ is intractable), but we can evaluate $$E(x)$$. Substituting $$P = \frac{1}{Z}e^{-E}$$ and dropping the constant $$\log Z$$ term, which doesn’t affect optimization over $$\theta$$, gives the variational free energy.

$D_{KL}(Q || P) \\ = \sum_x [ Q(x) \log \frac{Q(x)}{P(x)} ] \\ = \sum_x [ Q(x) (\log Q(x) - \log P(x)) ] \\ = \sum_x [ Q(x) (\log Q(x) - \log (\frac{1}{Z} e^{-E(x)})) ] \\ = \sum_x [ Q(x) (\log Q(x) + \log Z + E(x)) ] \\ = \sum_x [ Q(x) E(x) ] + \sum_x [ Q(x) \log Q(x) ] + \log Z \\ = \sum_x [ Q(x) E(x) ] - H_Q + \log Z$

Dropping the constant $$\log Z$$ leaves the variational free energy:

$\tilde{F}(Q) = \sum_x [ Q(x) E(x) ] - H_Q$

## Lectures 15-16: Data Modelling With Neural Networks

### Capacity of a single neuron

Two bits per weight: a neuron with $$k$$ inputs can memorize the labels of $$\approx 2k$$ random input points.

### Effect of in/bias/out stds

• Input parameters: Higher has more curves, shorter time scales.
• Bias parameters: Higher has more curves, same time scales.
• Output parameters: Higher has large magnitude outputs.

### Effect of regularization weight

If your cost function has a regularising term $$\alpha L_2$$, then this is equivalent to having a prior on your weights being drawn from $$\mathcal{N}(0, \frac{1}{\alpha})$$.