Information theory
Going through lecture series: Information Theory, Pattern Recognition, and Neural Networks by David MacKay. The textbook is available for free online.
Lecture 1: Introduction to Information Theory
Sending information over noisy channels (formulation)
- We’re using systems to send information over imperfect channels:
- Source message \(s\).
- Encoded into a transmission message \(t\).
- Sent across an imperfect channel which adds noise \(n\).
- Received on the other side as message \(r\).
- Decoded into a guess of \(s\), \(\hat{s}\).
- The encoder adds redundancy.
- The decoder infers \(n\) and \(s\).
Binary symmetric channel (BSC)
- \(s = t \in \{0, 1\}^+\)
- Noise is added by flipping each bit of \(t\) independently with probability \(f\).
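A minimal sketch of this channel in Python; the flip probability and message length below are arbitrary choices for illustration:

```python
import random

def binary_symmetric_channel(t, f):
    """Flip each bit of the transmission t independently with probability f."""
    return [bit ^ (random.random() < f) for bit in t]

t = [random.randint(0, 1) for _ in range(20)]   # transmission
r = binary_symmetric_channel(t, f=0.1)          # received message
print(sum(ti != ri for ti, ri in zip(t, r)), "bits flipped")
```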
Lectures 2-5: Entropy and Data Compression
Ensembles
- An “Ensemble” \(X\) is the triple \((x, A_x, P_x)\):
- \(x\) is a random variable.
- \(A_x\) is an alphabet \(\{a_1, a_2, ...\}\).
- \(P_x\) is a probability distribution over \(A_x\), \(\{p_1, p_2, ...\}\).
- \(\sum_i{p_i} = 1\)
Shannon information content
- The Shannon information content of an outcome \(x = a_i\) is:
- \(h(x = a_i) = \log_2\frac{1}{P(x=a_i)}\)
- Measured in bits.
Likely vs. unlikely events
- Unlikely events have a lot of information content, likely events have little.
Relationship to compressed file length
- \(h(x=a_i)\) is the shortest possible compressed length, in bits, for that outcome.
Additive
- \(h(x=a_i)\) is additive for independent random variables.
- \(h(x,y) = \log_2\frac{1}{P(x,y)}\)
- \(h(x,y) = \log_2\frac{1}{P(x)} + \log_2\frac{1}{P(y)}\) (since \(P(x,y) = P(x)P(y)\) for independent variables)
- \(h(x,y) = h(x) + h(y)\)
Entropy
- The entropy of an ensemble is the average Shannon information content.
- \(H(X) = \sum_a{P(x=a) h(x=a)}\)
- \(H(X) = \sum_a{P(x=a) \log_2(\frac{1}{P(x=a)})}\)
- Measured in bits.
Binary entropy
\[ H_2(x) = x \log_2 \frac{1}{x} + (1 - x) \log_2 \frac{1}{1 - x} \]
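A quick Python sketch of the entropy of an ensemble and the binary entropy function; the example distribution is arbitrary:

```python
import math

def entropy(probs):
    """H(X) = sum_a P(x=a) * log2(1 / P(x=a)), in bits."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

def binary_entropy(x):
    """H_2(x) = x log2(1/x) + (1 - x) log2(1/(1 - x))."""
    return entropy([x, 1 - x])

print(entropy([0.5, 0.25, 0.25]))  # 1.5 bits
print(binary_entropy(0.1))         # ~0.47 bits
```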
Source coding theorem
- \(N\) outcomes from a source \(X\) can be compressed into \(\approx NH(X)\) bits.
Typical outcomes
- Source \(X\) produces \(N\) independent outcomes \(\textbf{x}=x_1,x_2,...\).
- Assume \(P_x\) has a long tail, i.e. some outcomes are much more likely than others.
- The sequence \(\textbf{x}\) is likely to be one of \(\approx 2^{NH(X)}\) typical outcomes.
Symbol codes
- Map from symbol \(x\) to codes \(c(x)\).
- Concatenate codes without punctuation.
- \([x_1,x_2,x_3] \to c(x_1)c(x_2)c(x_3)\)
Expected length
\(L(C, X) = \sum_x{P(x)len(C(x))}\)
Kraft inequality
- If a code is uniquely decodable…
- \(\sum_i{2^{-l_i}} \leq 1\)
- When \(\sum_i{2^{-l_i}} = 1\), the code is complete.
Prove that the expected length of a message can’t be less than its entropy
Proving \(L(C, X) \geq H(X)\).
- Alphabet \(x_i\), codes \(c_i\), code lengths \(l_i\), probabilities \(p_i\).
- Say that the ideal code lengths are \(l_i^{*} = h(x_i) = \log_2\frac{1}{p_i}\)
- Given we already have \(l_i\), we can say what the “implicit” probability \(q_i\) of each \(x_i\) is:
- \(q_i = 2^{-l_i}\)
- But we need to make sure \(\sum{q_i} = 1\), so we normalize by \(z\).
- \(z = \sum{2^{-l_i}}\)
- \(z \leq 1\) for any uniquely decodable code (see Kraft inequality).
- \(z = 1\) for any complete code.
- \(q_i = \frac{2^{-l_i}}{z} = \frac{2^{-l_i}}{\sum{2^{-l_i}}}\)
- So, \(l_i = \log\frac{1}{q_i} - \log(z)\)
- \(L(C, X) = \sum_i{p_i l_i}\)
- \(L(C, X) = \sum_i{p_i (\log\frac{1}{q_i} - \log(z))}\)
- \(L(C, X) = \sum_i{p_i \log\frac{1}{q_i}} - \log(z)\)
- \(L(C, X) = \sum_i{p_i \log\frac{1}{p_i}} + \sum_i{p_i \log\frac{p_i}{q_i}} - \log(z)\)
- \(L(C, X) = H(X) + \sum_i{p_i \log\frac{p_i}{q_i}} - \log(z)\)
- \(L(C, X) = H(X) + D_{KL}(p || q) - \log(z)\)
- \(D_{KL}(p || q) = \sum_i{p_i \log\frac{p_i}{q_i}}\)
- \(D_{KL}(p || q) \geq 0\)
- \(-\log(z) \geq 0\) (since \(z \leq 1\))
- So, \(L(C, X) \geq H(X)\)
- And \(L(C, X) = H(X)\) if:
- There’s no difference between \(p\) and \(q\): \(D_{KL}(p || q) = 0\).
- You have a complete code: \(z = 1\).
- If the ideal lengths \(l_i = \log\frac{1}{p_i}\) are not integers:
- \(H \leq L < H + 1\)
Huffman algorithm
- Build a binary tree:
- Start with unconnected leaf nodes with probabilities \(p_1, p_2, ...\).
- Take the two nodes with the smallest probabilities, and join them under a new parent node whose probability is their sum.
- Add the parent node to the list of nodes (and remove its two children).
- Repeat until there’s only one node in the list: the root node.
- Read each symbol’s codeword off the path from the root to its leaf (e.g. 0 for left, 1 for right).
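A minimal sketch of this construction in Python, using `heapq` for the “take the two smallest” step; the example probabilities are arbitrary, and the exact codewords can differ between equally valid Huffman codes:

```python
import heapq
import itertools

def huffman(probs):
    """Build a Huffman code: {symbol: codeword} from {symbol: probability}."""
    counter = itertools.count()   # tie-breaker so heapq never compares trees
    heap = [(p, next(counter), symbol) for symbol, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # the two smallest nodes...
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(counter), (left, right)))  # ...merged
    code = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal (split) node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a symbol
            code[node] = prefix or "0"
    walk(heap[0][2], "")
    return code

print(huffman({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}))
# e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```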
Main issue
- Symbol codes require \(\geq 1\) bit per character, even if \(H \ll 1\).
- For example:
- \(p = [0.01, 0.99]\)
- \(c = [0, 1]\)
- \(L = 1\)
- \(H = 0.01 \log\frac{1}{0.01} + 0.99 \log\frac{1}{0.99} \approx 0.08\)
- So, \(H \leq L < H + 1 \to 0.08 \leq 1 < 1.08\) holds, but this isn’t efficient!
Arithmetic coding
- Given a string in the alphabet \(\textbf{x} = x_1, x_2, ...\)
- Given an oracle \(P(x_t | x_1, x_2, ..., x_{t-1})\).
- Picture \(\textbf{x}\) dividing the interval \([0, 1]\) repeatedly. For example:
- Alphabet \(\{a, b, c\}\).
- \(P(x_1) = \{0.5, 0.25, 0.25\}\)
- Interval of \(x_1 = a\) is \([0, 0.5]\).
- Interval of \(x_1 = b\) is \([0.5, 0.75]\).
- Interval of \(x_1 = c\) is \([0.75, 1]\).
- Further elements \(x_2, x_3, ...\) further subdivide the space.
- Take the interval \([i_{lower}, i_{higher}]\) described by \(\textbf{x}\).
- Find a binary string \(\textbf{y}\), with \(P(y_n=1) = 0.5\), whose interval under the same repeated subdivision of \([0, 1]\) lies within the interval described by \(\textbf{x}\).
- \(\textbf{y}\) is now the encoded version of \(\textbf{x}\)!
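A sketch of the interval-narrowing step in Python. For simplicity it uses the fixed distribution from the example above rather than a conditional oracle \(P(x_t | x_1, ..., x_{t-1})\), and it doesn’t emit bits as it goes (a real coder would, to avoid precision loss):

```python
def string_interval(xs, probs):
    """Return the [lower, upper) interval that the string xs describes."""
    lower, upper = 0.0, 1.0
    for x in xs:
        width = upper - lower
        offset = 0.0
        # Walk the alphabet in a fixed order, carving up the current interval.
        for symbol, p in probs.items():
            if symbol == x:
                lower, upper = lower + offset * width, lower + (offset + p) * width
                break
            offset += p
    return lower, upper

probs = {"a": 0.5, "b": 0.25, "c": 0.25}
print(string_interval("b", probs))    # (0.5, 0.75), matching the example above
print(string_interval("ba", probs))   # (0.5, 0.625)
```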
Length of encoding
- \(L(\textbf{x}) \leq \log\frac{1}{P(\textbf{x})} + 2\)
- \(L(\textbf{x}) \leq H(\textbf{x}) + 2\)
Computation
You only need to compute \(NI\) conditional probabilities, where \(N\) is the length of the string and \(I\) is the size of the alphabet.
Main advantage over Huffman algorithm
- Huffman codes use at least 1 bit per token.
- \(H(x) \leq L(x) < H(x) + 1\) where \(x\) is a single token.
- Arithmetic coding can use less than 1 bit per token.
- \(L(X) \leq H(X) + 2\) where \(X\) is the whole string.
Building a simple predictor
- Collect \(n\)-gram statistics for \(n \in [0, 6]\).
- Use these to predict the next token.
Lectures 6-8: Noisy channel coding
Entropy of joint distributions
- \(H(X, Y) = \sum_{x \in X} \sum_{y \in Y} P(x, y) h(x, y)\)
- where \(h(x, y) = \log\frac{1}{P(x, y)}\)
- \(H(X | Y=y) = \sum_{x \in X} P(x | Y=y) h(x | Y=y)\)
- where \(h(x | Y=y) = \log\frac{1}{P(x | Y=y)}\)
- \(H(X | Y) = \sum_{y \in Y} P(Y=y) H(X | Y=y)\)
- \(H(X) + H(Y) \geq H(X, Y)\)
- \(H(X) + H(Y) = H(X, Y)\) if \(X\) and \(Y\) are independent.
- \(H(X | Y) \leq H(X)\)
- \(H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y)\)
Mutual information
- \(I(X ; Y) = H(X) - H(X | Y)\)
- \(I(X ; Y) = H(Y) - H(Y | X)\)
- \(I(X ; Y) = H(X, Y) - H(X | Y) - H(Y | X)\)
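A quick numerical check of these identities in Python; the 2×2 joint distribution is arbitrary, and \(H(X|Y)\) is computed via the chain rule \(H(X|Y) = H(X,Y) - H(Y)\):

```python
import math

def H(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Arbitrary joint distribution P(x, y) over two binary variables.
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

H_XY = H(P.values())
H_X = H([P[(0, 0)] + P[(0, 1)], P[(1, 0)] + P[(1, 1)]])
H_Y = H([P[(0, 0)] + P[(1, 0)], P[(0, 1)] + P[(1, 1)]])
H_X_given_Y = H_XY - H_Y          # chain rule
H_Y_given_X = H_XY - H_X

# All three expressions for I(X;Y) agree (~0.28 bits here).
print(H_X - H_X_given_Y)
print(H_Y - H_Y_given_X)
print(H_XY - H_X_given_Y - H_Y_given_X)
```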
Formalizing channels as matrices
- We formalize channels as a matrix \(Q\), where:
- \(Q_{i, j} = P(Y = b_j | X = a_i)\)
- \(a\) is the input alphabet, indexed by \(i\).
- \(b\) is the output alphabet, indexed by \(j\).
- We can find \(P(X | Y)\) using Bayes.
Capacity of a channel
- \(C(Q) = \max_{P_x} I(X ; Y)\)
- i.e. the maximum mutual information given we can choose the input probability distribution.
Optimal input distribution
- The \(P_x\) that achieves the maximum in \(C(Q) = \max_{P_x} I(X ; Y)\).
For BSC
- Optimal input distribution is \(P(x_i = 1) = 0.5\)
- Proof:
- \(p_1 = P(x_i = 1)\)
- \(I(X ; Y) = H(Y) - H(Y | X)\)
- \(I(X ; Y) = H_2(p_1 (1 - f) + (1 - p_1) f) - H_2(f)\)
- \(\max_{p_1} I(X ; Y) =\)
- \(\max_{p_1} H_2(p_1 (1 - f) + (1 - p_1) f) - H_2(f) =\)
- \(\max_{p_1} H_2(p_1 (1 - f) + (1 - p_1) f)\)
- The maximum value \(H_2\) can take is \(H_2(0.5) = 1\), so set its argument to \(0.5\):
- \(0.5 = p_1 (1 - f) + (1 - p_1) f\)
- \(0.5 = p_1 - p_1 f + f - p_1 f\)
- \(0.5 = p_1 (1 - 2 f) + f\)
- \(p_1 = \frac{0.5 - f}{1 - 2 f} = 0.5\)
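A quick numerical check of this in Python, sweeping \(p_1\) on a grid for an arbitrary \(f\):

```python
import math

def H2(x):
    return sum(p * math.log2(1 / p) for p in (x, 1 - x) if p > 0)

f = 0.1   # arbitrary flip probability
# I(X;Y) = H_2(p1(1-f) + (1-p1)f) - H_2(f), swept over a grid of p1 values.
best_p1 = max((k / 100 for k in range(101)),
              key=lambda p1: H2(p1 * (1 - f) + (1 - p1) * f) - H2(f))
print(best_p1)        # 0.5
print(1 - H2(f))      # the capacity, ~0.53 bits per channel use
```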
Shannon’s noisy-channel coding theorem
- For any \(\epsilon > 0\), \(R < C\), and large enough \(N\)…
- There exists a code with length \(N\) and rate \(R\)…
- Such that the probability of a block error is \(< \epsilon\).
Parity check matrix
Matrix definition
We transfer 4 bits, \(t_1, t_2, t_3, t_4\).
We append 3 parity bits:
- \(t_5 = t_1 + t_2 + t_3\)
- \(t_6 = t_2 + t_3 + t_4\)
- \(t_7 = t_1 + t_3 + t_4\)
We encode the relationships within \(\textbf{t}\) as a matrix \(H\), where \(H_{i, j}\) indicates whether bit \(t_j\) was included in the \(i\)-th parity calculation (i.e. in \(t_{4+i}\)).
\[ H = \begin{pmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0\\ 0 & 1 & 1 & 1 & 0 & 1 & 0\\ 1 & 0 & 1 & 1 & 0 & 0 & 1\\ \end{pmatrix} \]
- Note that the last three columns make up an identity matrix.
Encoding & decoding
- \(\textbf{t}\) is encoded such that \(H\textbf{t} = [0, 0, 0]^T = \textbf{0}\) (mod 2).
- Noise is added, so the received signal is \(\textbf{r} = \textbf{t} + \textbf{n}\).
- Syndrome \(\textbf{z} = H\textbf{r} = H(\textbf{t} + \textbf{n}) = H\textbf{t} + H\textbf{n} = H\textbf{n}\)
- The column of \(H\) that equals \(\textbf{z}\) identifies which bit was flipped (assuming at most one bit was flipped).
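A sketch of this (7, 4) code in Python with numpy; the source bits and the position of the flipped bit are arbitrary choices:

```python
import numpy as np

H = np.array([[1, 1, 1, 0, 1, 0, 0],
              [0, 1, 1, 1, 0, 1, 0],
              [1, 0, 1, 1, 0, 0, 1]])

def encode(source):
    """Append the three parity bits t5, t6, t7 to the 4 source bits."""
    t1, t2, t3, t4 = source
    return np.array([t1, t2, t3, t4,
                     (t1 + t2 + t3) % 2,
                     (t2 + t3 + t4) % 2,
                     (t1 + t3 + t4) % 2])

t = encode([1, 0, 1, 1])             # arbitrary source bits
assert not (H @ t % 2).any()         # Ht = 0 (mod 2)

n = np.zeros(7, dtype=int)
n[2] = 1                             # flip bit 3 (arbitrary)
r = (t + n) % 2
z = H @ r % 2                        # syndrome = Hn

flipped = next(j for j in range(7) if (H[:, j] == z).all())
r[flipped] ^= 1                      # correct the identified bit
assert (r == t).all()
print("decoded source bits:", r[:4])
```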
Generalizing beyond (7, 4)
- Above, \(H\) is a \((3 \times 7)\) matrix, and \(4\) source bits are sent for every \(7\) transmitted bits.
- Generally, \(H\) is an \(M \times N\) matrix, and \(N - M\) source bits are sent for every \(N\) transmitted bits.
- Thus, the rate is \(\frac{N - M}{N}\) or \(1 - \frac{M}{N}\).
Proving the theorem for BSC
- Say we have:
- A BSC with probability \(f\) of flipping, with a capacity of \(1 - H_2(f)\)
- A parity check matrix \(H \in \{0, 1\}^{M \times N}\) where \(N > M\).
- A source that has \(N - M\) bits, and a noise vector \(n\) that has \(N\) bits.
- A transmission that has \(N\) bits.
- A syndrome that has \(M\) bits.
- A “typical set syndrome decoder”: a lookup table mapping each \(M\)-bit syndrome to a typical \(N\)-bit noise vector, which is then used to recover the transmission.
- This has \(\approx 2^{NH_2(f)}\) entries (one per typical noise vector).
- Transmissions have a rate \(R = 1 - \frac{M}{N}\).
- We want to show that the error probability vanishes whenever \(R < C\), i.e. \(1 - \frac{M}{N} < 1 - H_2(f)\).
- \(P(\text{error}) = P_1 + P_2\) where:
- \(P_1\) is the probability of \(n\) not being in the decoder’s table.
- This is arbitrarily small, as the table contains the typical set.
- \(P_2\) is the probability of there being a \(\hat{n} \neq n\) that has the same syndrome \(z\):
- \(Hn = H\hat{n}\)
- \(H(n - \hat{n}) = 0\)
- We’ll focus on the constraints for \(P_2\).
- \(P_2 = \sum_n P(n) 1(\exists \hat{n}. \hat{n} \neq n, H(n - \hat{n}) = 0)\)
- (the indicator that at least one such \(\hat{n}\) exists is at most the number of such \(\hat{n}\))
- \(P_2 \leq \sum_n P(n) \sum_{\hat{n} \neq n} 1(H(n - \hat{n}) = 0)\)
- (averaging over a random \(H\) is simpler than analysing a specific \(H\))
- \(P_2 \leq \sum_n P(n) \sum_{\hat{n} \neq n} \sum_{H} P(H) 1(H(n - \hat{n}) = 0)\)
- Aside: \(P(H(n - \hat{n}) = 0 | n - \hat{n} \neq 0) = 2^{-M}\), since each of the \(M\) parity checks is satisfied with probability \(\frac{1}{2}\) under a random \(H\).
- \(P_2 \leq \sum_n P(n) \sum_{\hat{n} \neq n} 2^{-M}\)
- \(P_2 \leq \sum_{\hat{n} \neq n} 2^{-M}\)
- \(P_2 \leq 2^{NH_2(f)} \cdot 2^{-M}\) (there are \(\approx 2^{NH_2(f)}\) typical \(\hat{n}\))
- \(P_2 \leq 2^{NH_2(f)- M}\)
- So, \(P_2\) vanishes as \(N\) grows whenever:
- \(M > NH_2(f)\)
- \(\frac{M}{N} > H_2(f)\)
- \(1 - \frac{M}{N} < 1 - H_2(f)\)
- \(R < C\)!!!
Entropy when probabilities are certain
\[ H_2(0) = 0 \cdot \log\frac{1}{0} + 1 \cdot \log\frac{1}{1} = 0 \]
\[ H_2(1) = 1 \cdot \log\frac{1}{1} + 0 \cdot \log\frac{1}{0} = 0 \]
using the convention \(0 \log \frac{1}{0} = 0\), since \(x \log \frac{1}{x} \to 0\) as \(x \to 0\).
Feedback gem
The encoder knowing what noise will be added doesn’t improve the achievable rate (feedback doesn’t increase capacity).
Lectures 9-10: Introduction to Bayesian inference
No notes, covers the basics.
Lectures 11-14: Approximating Probability Distributions
K-means
Clustering algorithm. Given \(\{x_n\}\) data points to put into \(K\) clusters:
- Assign cluster means \(\{\mu_k\}\) randomly.
- Assign each data point to the closest cluster: \(r^n_k = 1\) if \(k = \arg\min_{k'} (x_n - \mu_{k'})^2\), else \(0\).
- Update each cluster mean to the mean of its assigned data points: \(\mu_k = \sum_n{r^n_k x_n} / \sum_n{r^n_k}\)
- Repeat the assignment and update steps until the assignments stop changing.
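A minimal numpy sketch of the loop above; the two 2-D blobs and \(K=2\) are arbitrary choices, and it simply runs a fixed number of iterations rather than checking for convergence:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, K, n_iters=20):
    mu = X[rng.choice(len(X), K, replace=False)]    # random initial means
    for _ in range(n_iters):
        # Assignment step: index of the closest mean for each point.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each mean becomes the average of its assigned points.
        # (A fuller implementation would also handle empty clusters.)
        mu = np.array([X[assign == k].mean(axis=0) for k in range(K)])
    return mu, assign

# Two arbitrary 2-D blobs.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
means, assignments = kmeans(X, K=2)
print(means)
```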
Bayesian interpretation
K-means is a Maximum A Posteriori (MAP) algorithm which assumes:
- There are \(K\) clusters.
- The probability of being in each cluster is equal.
- Each cluster has the same variance.
These assumptions might not hold for the data. Additionally, the hard assignment boundary can push the estimated centers of overlapping clusters apart, giving incorrect means.
Soft K-means
Cluster responsibilities are no longer binary. Introduces an additional \(\beta\) hyperparameter. Update rule is the same.
\[ r^n_k = \frac{\exp(-\beta d(x_n, \mu_k))}{\sum_{k'} \exp(-\beta d(x_n, \mu_{k'}))} \]
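A sketch of the soft assignment step in numpy; the data points, means, and \(\beta\) are arbitrary, and \(d\) is taken to be squared Euclidean distance:

```python
import numpy as np

def soft_responsibilities(X, mu, beta):
    """r[n, k] proportional to exp(-beta * d(x_n, mu_k)), normalized over k."""
    d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # squared distances
    r = np.exp(-beta * d)
    return r / r.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])            # arbitrary points
mu = np.array([[0.0, 0.0], [5.0, 5.0]])                       # arbitrary means
print(soft_responsibilities(X, mu, beta=1.0))
```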
Monte Carlo methods
The problem
We’re given:
- \(P(x) = \frac{P^*(x)}{Z}\), some probability distribution that we can’t sample from (we can evaluate \(P^*\), but we don’t know \(Z\)).
- \(Q(x)\), some simple probability distribution (e.g. uniform, gaussian) that we can sample from.
- \(\phi(x)\), some function of interest that we want to run over \(P\).
We want to either be able to draw samples from \(P\), or to calculate the expectation of \(\phi\) under \(P\): \(\Phi = \sum_x P(x) \phi(x)\).
MC importance sampling
\[ \Phi = \frac{\sum_n \frac{P^*(x_n)}{Q(x_n)} \phi(x_n)}{\sum_n \frac{P^*(x_n)}{Q(x_n)}} \]
where \(\{x_n\}_N\) are \(N\) samples drawn from \(Q\).
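A numpy sketch of this estimator, with an arbitrary unnormalized target \(P^*\) (a narrow Gaussian), a wider Gaussian \(Q\), and \(\phi(x) = x^2\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary choices for illustration:
P_star = lambda x: np.exp(-0.5 * ((x - 2) / 0.5) ** 2)   # unnormalized target
phi = lambda x: x ** 2                                    # function of interest
Q_pdf = lambda x: np.exp(-0.5 * (x / 3) ** 2) / (3 * np.sqrt(2 * np.pi))  # N(0, 3^2)

x = rng.normal(0, 3, size=100_000)       # samples from Q
w = P_star(x) / Q_pdf(x)                 # importance weights
Phi = (w * phi(x)).sum() / w.sum()       # self-normalized estimate of E_P[phi]
print(Phi)                               # ~4.25 = 2^2 + 0.5^2 under the target
```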
Choice of Q
If \(Q\) has little overlap with \(P\), we’ll mostly be sampling from the wrong part of the distribution. Although we’ll still eventually converge, this can take a while, and it’s hard to tell whether we’ve converged.
Rejection sampling
Given a \(Q\) and constant \(c\) such that:
\[ \forall x. cQ(x) \ge P^*(x) \]
- Sample \(x \sim Q\).
- Sample \(u \sim \text{Uniform}(0, cQ(x))\).
- Store \(x\) if \(u < P^*(x)\), reject otherwise.
This provides \(\{x\}\) that are sampled from \(P\).
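A sketch in numpy with an arbitrary bimodal \(P^*\), a uniform \(Q\) on \([-5, 5]\), and a hand-picked \(c\) that satisfies \(cQ(x) \ge P^*(x)\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary choices for illustration:
P_star = lambda x: np.exp(-0.5 * (x - 2) ** 2) + np.exp(-0.5 * (x + 2) ** 2)
Q_pdf = 1 / 10.0     # uniform density on [-5, 5]
c = 15.0             # chosen so that c * Q(x) >= P_star(x) everywhere (max P* is ~1)

samples = []
while len(samples) < 10_000:
    x = rng.uniform(-5, 5)              # sample x ~ Q
    u = rng.uniform(0, c * Q_pdf)       # sample u ~ Uniform(0, cQ(x))
    if u < P_star(x):                   # accept, otherwise reject
        samples.append(x)

print(np.mean(samples))                 # ~0, since P* is symmetric about 0
```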
Markov Chain Monte Carlo (MCMC) methods
Unlike the Monte Carlo methods above, our samples are no longer independent. We draw from \(Q(x; x')\), a proposal that depends on the current state \(x'\), rather than a fixed \(Q(x)\).
This works well even when \(Q\) and \(P\) are very different.
Metropolis sampling
- Sample \(x \sim Q(x; x_t)\).
- Calculate \(a = \frac{P^*(x) Q(x_t; x)}{P^*(x_t) Q(x; x_t)}\).
- If \(a > 1\), accept; otherwise accept with probability \(a\).
- Accept means \(x_{t+1} = x\).
- Reject means \(x_{t+1} = x_t\).
Asymptotically, \(\{x\}\) is sampled from \(P\), but they won’t be independent.
Gaussian Q
\(Q(x; x_t) = \mathcal{N}(x_t, \sigma)\). By symmetry, the \(Q\)s in \(a\) cancel out.
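A sketch of Metropolis sampling with this Gaussian proposal; the target \(P^*\), step size \(\sigma\), and chain length are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary bimodal unnormalized target and proposal width:
P_star = lambda x: np.exp(-0.5 * (x - 2) ** 2) + np.exp(-0.5 * (x + 2) ** 2)
sigma = 1.0

x = 0.0
chain = []
for _ in range(50_000):
    x_new = rng.normal(x, sigma)        # sample from Q(x'; x_t) = N(x_t, sigma)
    a = P_star(x_new) / P_star(x)       # symmetric Q, so the Q terms cancel
    if a >= 1 or rng.uniform() < a:     # accept
        x = x_new
    chain.append(x)                     # on rejection, x_{t+1} = x_t

print(np.mean(chain))                   # ~0 by symmetry (samples are correlated)
```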
Gibbs sampling
Assume we can sample from \(P^*(x_n | x_{!n})\), where \(x_{!n}\) is all elements of the hypothesis apart from \(x_n\).
We do Metropolis sampling, but cycle through the variables of the hypothesis, proposing each one from its conditional distribution (so the acceptance probability is always 1):
- \(x_{1, t+1} \sim P^*(x_1 | x_{2, t}, x_{3, t}, ...)\)
- \(x_{2, t+1} \sim P^*(x_2 | x_{1, t+1}, x_{3, t}, ...)\)
- …
Picking step sizes in Metropolis methods
\[ \epsilon = l \]
- We have \(k\) dimensions, of which \(\gamma\) have small acceptable width \(l\), and \(k - \gamma\) have large acceptable width \(L\).
- How do we define the step size \(\epsilon\) when exploring the space?
- We want a large step size: it takes \(T \approx (\frac{L}{\epsilon})^2\) random-walk steps to explore the space, so we want to maximise \(\epsilon\).
- We want a small step size: Acceptance rate is \((\frac{l}{\epsilon})^\gamma\), so if \(\epsilon \gg l\) then the acceptance rate falls to zero.
- Thus, the middle ground is \(\epsilon = l\) with an acceptance rate of 50%.
Hamiltonian Monte Carlo
Uses momentum when selecting \(Q(x' ; x)\)
Slice sampling
- Evaluate \(y = P^*(x)\) at the current point \(x\).
- Sample a height uniformly: \(z \sim [0, y]\).
- “Step out” from \(x\) in steps of size \(w\) until \(P^*(x + i w) < z\).
- Start with a random offset drawn from \([-w/2, w/2]\).
- Do this in both directions, yielding \(x_l, x_r\).
- Sample uniformly: \(x' \sim [x_l, x_r]\).
- If \(P^*(x') > z\), accept \(x'\).
- If \(P^*(x') < z\), sample again from \([x_l, x_r]\).
- But this time shrink the interval so it doesn’t include points “beyond” \(x'\) w.r.t. the starting point \(x\) (i.e. move the endpoint on \(x'\)’s side to \(x'\)).
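A sketch of one slice-sampling update in numpy, following the steps above (using the common variant of placing the initial width-\(w\) interval at a uniformly random position around \(x\)); the target \(P^*\) is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

P_star = lambda x: np.exp(-0.5 * (x - 2) ** 2) + np.exp(-0.5 * (x + 2) ** 2)  # arbitrary

def slice_sample_step(x, w=1.0):
    z = rng.uniform(0, P_star(x))          # height under the curve
    # Step out: place a width-w interval at a random position around x, then
    # extend each end in steps of w until it is below the slice height z.
    x_l = x - rng.uniform(0, w)
    x_r = x_l + w
    while P_star(x_l) > z:
        x_l -= w
    while P_star(x_r) > z:
        x_r += w
    # Sample within [x_l, x_r], shrinking the interval on each rejection.
    while True:
        x_new = rng.uniform(x_l, x_r)
        if P_star(x_new) > z:
            return x_new
        if x_new < x:
            x_l = x_new                    # don't keep points beyond x_new
        else:
            x_r = x_new

x, chain = 0.0, []
for _ in range(20_000):
    x = slice_sample_step(x)
    chain.append(x)
print(np.mean(chain))                      # ~0 by symmetry
```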
Sensitivity of w
Insensitive: if \(w\) is too small, we need more stepping-out steps, but their cost only grows linearly. If it’s too big, we have more rejection steps, but each rejection also trims down the sampling interval \([x_l, x_r]\).
Exact sampling
Sampling method where you can be certain you’ve drawn an exact sample from the target distribution.
- We want a sample at \(t=0\).
- Run two chains starting from states as far apart in sample space as possible, starting at \(t=-N\).
- Use the same RNG (the same sequence of random numbers) for both chains.
- If they converge to the same state, then you know that all starting states would have converged to it.
- Thus, starting from \(t<-N\) would not tell us anything more about the state.
- If they don’t converge, start again at \(t=-2N\).
- Run the samples until \(t=0\), and you have a random sample.
Variational methods
Core idea
We’re interested in \(P(x) = \frac{1}{Z}P^*(x) = \frac{1}{Z}e^{-E(x)}\), but we can’t sample from \(P(x)\); we can only evaluate the energy \(E(x)\).
So we approximate \(P(x)\) by a simpler distribution \(Q(x; \theta)\) and adjust \(\theta\) to get the best approximation.
KL divergence
\[ D_{KL}(Q || P) = \sum_x Q(x) \log \frac{Q(x)}{P(x)} \]
Effect of order of parameters
Zero probabilities
\(D_{KL}(Q || P)\) punishes when the approximate distribution puts mass where the true distribution is zero.
\(D_{KL}(P || Q)\) punishes when the approximate distribution puts zero mass where the true distribution is non-zero.
Computability
\(D_{KL}(A || B)\) is an expectation under \(A\). Since we can’t sample from \(P\), we are effectively forced into using \(D_{KL}(Q || P)\).
Variational free energy
We want to compute \(D_{KL}(Q || P)\). \(P(x)\) is not computable (we don’t know \(Z\)), but \(E(x)\) is. Thus we rewrite \(P\) in terms of \(E\) and drop the \(\log Z\) term, which is constant and so doesn’t affect the optimization. The result is the variational free energy.
\[
\begin{aligned}
D_{KL}(Q || P) &= \sum_x Q(x) \log \frac{Q(x)}{P(x)} \\
&= \sum_x Q(x) (\log Q(x) - \log P(x)) \\
&= \sum_x Q(x) \left(\log Q(x) - \log\left(\frac{1}{Z} e^{-E(x)}\right)\right) \\
&= \sum_x Q(x) \left(\log Q(x) - \log \frac{1}{Z} + E(x)\right) \\
&= \sum_x Q(x) E(x) + \sum_x Q(x) \log Q(x) + \log Z \\
&= \sum_x Q(x) E(x) - H_Q(x) + \log Z
\end{aligned}
\]
Dropping the constant \(\log Z\) gives the variational free energy:
\[ \sum_x Q(x) E(x) - H_Q(x) \]
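A toy numerical check in numpy: for a small discrete \(x\) with arbitrary energies, brute-force minimizing the free energy over candidate \(Q\)s recovers \(Q \approx P\), and the minimum is close to \(-\log Z\):

```python
import numpy as np

# Toy discrete example: x takes 3 values, with arbitrary energies E(x).
E = np.array([0.0, 1.0, 3.0])
Z = np.exp(-E).sum()
P = np.exp(-E) / Z                          # the true P(x) = e^{-E(x)} / Z

def free_energy(Q):
    """sum_x Q(x)E(x) - H_Q(x); equals D_KL(Q || P) - log Z (natural log)."""
    return (Q * E).sum() + (Q * np.log(Q)).sum()

# Brute-force search over candidate Qs on a grid.
grid = [np.array([a, b, 1 - a - b])
        for a in np.arange(0.01, 0.99, 0.01)
        for b in np.arange(0.01, 0.99, 0.01) if a + b < 0.99]
Q_best = min(grid, key=free_energy)
print(Q_best, P)                            # Q_best is close to P
print(free_energy(Q_best), -np.log(Z))      # minimum is close to -log Z
```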
Lectures 15-16: Data Modelling With Neural Networks
Capacity of a single neuron
Two bits per weight: a neuron with \(k\) inputs can memorize the labels of \(\approx 2k\) random input points.
Effect of input/bias/output weight standard deviations
- Input weights: a higher standard deviation gives more curves, with shorter time scales.
- Bias weights: a higher standard deviation gives more curves, at the same time scales.
- Output weights: a higher standard deviation gives larger-magnitude outputs.
Effect of regularization weight
If your cost function has a regularising term \(\alpha L_2\), then this is equivalent to having a prior on your weights being drawn from \(\mathcal{N}(0, \frac{1}{\alpha})\).