Information theory

Going through lecture series: Information Theory, Pattern Recognition, and Neural Networks by David MacKay. The textbook is available for free online.

Lecture 1: Introduction to Information Theory

Sending information over noisy channels (formulation)

Binary symmetric channel (BSC)

Lectures 2-5: Entropy and Data Compression


Shannon information content

Likely vs. unlikely events

Relationship to compressed file length



Binary entropy

\[ H_2(x) = x log \frac{1}{x} \]

Source coding theorem

Typical outcomes

Symbol codes

Expected length

\(L(C, X) = \sum_x{P(x)len(C(x))}\)

Kraft inequality

Prove the length a message can’t be greater than its entropy

Proving \(L(C, X) \geq H(X)\).

Huffman algorithm

Main issue

Arithmetic coding

Length of encoding


You only need to compute \(NI\) conditional probabilities, where \(N\) is the length of the string and \(I\) is the size of the alphabet.

Main advantage over Huffman algorithm

Building simple predictor

Lectures 6-8: Noisy channel coding

Entropy of joint distributions

Mutual information

Formalizing channels as matrices

Capacity of a channel

Optimal input distribution


Shannon’s noisy-channel coding theorem

Parity check matrix

Matrix definition
Encoding & decoding
Generalizing >4

Proving the theorem for BSC

Entropy when probabilities are certain

\[ H_2(0) = P(0) \log(\frac{1}{0}) = 0 \]

\[ H_2(1) = P(1) \log(\frac{1}{1}) = 0 \]

Feedback gem

The encoder knowing what noise is added doesn’t improve the rate.

Lectures 9-10: Introduction to Bayesian inference

No notes, covers the basics.

Lectures 11-14: Approximating Probability Distributions


Clustering algorithm. Given \(\{x_n\}\) data points to put into \(K\) clusters:

  1. Assign cluster means \(\{\mu_k\}\) randomly.
  2. Assign each data point to the closest cluster: \(r^n_k = \arg\min_{k} (x_n - \mu_n)^2 = k\)
  3. Update cluster means to the mean of its data points: \(\mu_k = \sum_n{r^n_k x_n} / \sum{r^n_k}\)
  4. Go to 2.

Bayesian interpretation

K-means is a Maximum A Posteriori (MAP) algorithm which assumes:

These assumptions might not be necessary. Additionally, MAP can result in incorrect cluster centers due to a hard boundary pushing clusters apart.

Soft K-means

Cluster responsibilities are no longer binary. Introduces an additional \(\beta\) hyperparameter. Update rule is the same.

\[ r^n_k = \frac{\exp(-\beta d(x_n, \mu_k))}{\sum_{k'} \exp(-\beta d(x_n, \mu_{k'}))} \]

Monte Carlo methods

The problem

We’re given: - \(P(x) = \frac{P^*(x)}{Z}\), some probability distribution that we can’t sample from. - \(Q(x)\), some simple probability distribution (e.g. uniform, gaussian) that we can sample from. - \(\phi(x)\), some function of interest that we want to run over \(P\).

We want to either be able to draw samples from \(P\), or to calculate the value of \(\phi\).

Importance sampling

\[ \Phi = \frac{\sum_n \frac{P^*(x_n)}{Q(x_n)} \phi(x_n)}{\sum_n \frac{P^*(x_n)}{Q(x_n)}} \]

where \(\{x_n\}_N\) are \(N\) samples drawn from \(Q\).

Choice of Q

If \(Q\) has little overlap with \(P\), we’ll be sampling from the wrong part of the distribution. Although we’ll still eventually converge, this can take a while and its impossible to tell that we’re converging.

Rejection sampling

Given a \(Q\) and constant \(c\) such that: \[ \forall x. cQ(x) \ge P^*(x) \]

  1. Sample \(x \sim Q\).
  2. Sample \(u \sim \text{Uniform}(0, cQ(x))\).
  3. Store \(x\) if \(u < P^*(x)\), reject otherwise.

This provides \(\{x\}\) that are sampled from \(P\).

Markov Chain Monte Carlo (MCMC) methods

Different to Monte Carlo methods our samples are no longer independent. We draw from \(Q(x; x')\) rather than \(Q(x)\).

This works well when \(Q\) and \(P\) are very different.

Metropolis sampling

  1. Sample \(x \sim Q(x; x_t)\).
  2. Calculate \(a = \frac{P^*(x) Q(x_t; x)}{P^*(x_t) Q(x; x_t)}\).
  3. If \(a>1\), accept, otherwise accept with probability \(a\).
    • Accept means \(x_{t+1} = x\).
    • Reject means \(x_{t+1} = x_t\).

Asymptotically, \(\{x\}\) is sampled from \(P\), but they won’t be independent.

Gaussian Q

\(Q(x; x_t) = \mathcal{N}(x_t, \sigma)\). By symmetry, the \(Q\)s in \(a\) cancel out.

Gibbs sampling

Assume we can sample from \(P^*(x_n | x_{!n})\), where \(x_{!n}\) is all elements of the hypothesis apart from \(x_n\).

We do Metropolis sampling, but while cycling through the variables in the hypothesis.

Picking step sizes in Metropolis methods

\[ \epsilon = l \]

Hamiltonian Monte Carlo

Uses momentum when selecting \(Q(x' ; x)\)

Slice sampling

Sensitivity of w

Insensitive: If it’s too small, we have more stepout steps, which is linear. If it’s too big, we have more rejection steps, but also trim down the sampling space \([y_l, y_r]\).

Exact sampling

Sampling method where you can be certain you get a random sample.

Variational methods

Core idea

We’re interested in \(P(x) = \frac{1}{Z}P^*(x) = \frac{1}{Z}e^{-E(x)}\), but we can’t sample from \(E(x)\), only evaluate it.

So we approximate \(P(x)\) by a simpler distribution \(Q(x; \theta)\) and adjust \(\theta\) to get the best approximation.

KL divergence

\[ D_{KL}(Q || P) = \sum_x Q(x) \log \frac{Q(x)}{P(x)} \]

Effect of order of parameters
Zero probabilities

\(D_{KL}(Q || P)\) punishes when the approximate distribution puts mass where the true distribution is zero.

\(D_{KL}(P || Q)\) punishes when the approximate distribution puts zero mass where the true distribution is non-zero.


\(D_{KL}(A || B)\) performs an expectation over \(A\). Since we can’t sample from \(P\), we are effectively forced in to using \(D_{KL}(Q || P)\).

Variational free energy

We want to compute \(D_{KL}(Q || P)\). \(P(x)\) is not computable, \(E(x)\) is computable but not samplable. Thus we replace \(P\) with \(E\) and ignore the \(Z\) term as it is constant and thus not optimizable. This becomes the variational free energy.

\[ D_{KL}(Q || P) \\ \sum_x [ Q(x) \log \frac{Q(x)}{P(x)} ] \\ \sum_x [ Q(x) (\log Q(x) - \log P(x)) ] \\ \sum_x [ Q(x) (\log Q(x) - \log (\frac{1}{Z} e^{-E(x)})) ] \\ \sum_x [ Q(x) (\log Q(x) - \log \frac{1}{Z} - \log e^{-E(x)}) ] \\ \sum_x [ Q(x) (\log Q(x) - \log \frac{1}{Z} + E(x)) ] \\ \sum_x [ Q(x) E(x) + Q(x) \log Q(x) - Q(x) \log \frac{1}{Z} ] \\ \sum_x [ Q(x) E(x) + Q(x) \log Q(x) ] + \log Z \\ \sum_x [ Q(x) E(x) ] + \sum_x [ Q(x) \log Q(x) ] + \log Z \\ \sum_x [ Q(x) E(x) ] - H_Q(x) + \log Z \\ \sum_x [ Q(x) E(x) ] - H_Q(x) \\ \]

Lectures 15-16: Data Modelling With Neural Networks

Capacity of a single neuron

Two bits. A neuron with \(k\) inputs can memorize \(2k\) random inputs.

Effect of in/bias/out stds

Effect of regularization weight

If your cost function has a regularising term \(\alpha L_2\), then this is equivalent to having a prior on your weights being drawn from \(\mathcal{N}(0, \frac{1}{\alpha})\).