Information theory

Going through lecture series: Information Theory, Pattern Recognition, and Neural Networks by David MacKay. The textbook is available for free online.

Lecture 1: Introduction to Information Theory

Sending information over noisy channels (formulation)

Binary symmetric channel (BSC)
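The BSC flips each transmitted bit independently with some probability \(f\). A minimal simulation sketch (the function and parameter names are mine):

```python
import random

def bsc(bits, f, seed=0):
    """Transmit bits through a binary symmetric channel:
    each bit is flipped independently with probability f."""
    rng = random.Random(seed)
    # (rng.random() < f) is True with probability f; XOR flips the bit
    return [b ^ (rng.random() < f) for b in bits]
```

With `f = 0` the channel is noiseless; with `f = 0.5` the output carries no information about the input.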

Lectures 2-5: Entropy and Data Compression

Ensembles

Shannon information content

Likely vs. unlikely events

Relationship to compressed file length

Additive

Entropy

Binary entropy

\[ H_2(x) = x \log \frac{1}{x} + (1-x) \log \frac{1}{1-x} \]
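The binary entropy \(H_2(x) = x \log_2 \frac{1}{x} + (1-x) \log_2 \frac{1}{1-x}\) can be sketched as (function name is mine):

```python
import math

def H2(x):
    """Binary entropy in bits, with H2(0) = H2(1) = 0 by continuity."""
    if x in (0, 1):
        return 0.0
    return x * math.log2(1 / x) + (1 - x) * math.log2(1 / (1 - x))
```

It peaks at \(H_2(0.5) = 1\) bit and vanishes at the certain outcomes \(x = 0\) and \(x = 1\).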

Source coding theorem

Typical outcomes

Symbol codes

Expected length

\(L(C, X) = \sum_x{P(x)\,\mathrm{len}(C(x))}\)
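A direct computation of the expected length, sketched in Python (names are mine); for the dyadic distribution below it equals the entropy \(H(X) = 1.5\) bits:

```python
def expected_length(p, code):
    """L(C, X) = sum over x of P(x) * len(C(x)) for a symbol code."""
    return sum(p[x] * len(code[x]) for x in p)

p = {"a": 0.5, "b": 0.25, "c": 0.25}
code = {"a": "0", "b": "10", "c": "11"}
# 0.5*1 + 0.25*2 + 0.25*2 = 1.5 bits, matching H(X) for this p
```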

Kraft inequality

Proving that the expected length of a message can’t be less than its entropy

Proving \(L(C, X) \geq H(X)\).

Huffman algorithm
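A minimal sketch of the Huffman construction (repeatedly merge the two least probable nodes; names and tie-breaking are mine):

```python
import heapq

def huffman(p):
    """Build a prefix code for a symbol->probability dict p by
    repeatedly merging the two least probable nodes."""
    # each heap entry: (probability, tie-breaker, partial codewords)
    heap = [(prob, i, {sym: ""}) for i, (sym, prob) in enumerate(p.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        # prepend 0 to one subtree's codewords, 1 to the other's
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, count, merged))
        count += 1
    return heap[0][2]
```

For dyadic probabilities the resulting lengths are exactly \(\log \frac{1}{P(x)}\), so the code achieves the entropy.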

Main issue

Arithmetic coding

Length of encoding

Computation

You only need to compute \(N \cdot I\) conditional probabilities, where \(N\) is the length of the string and \(I\) is the size of the alphabet.

Main advantage over Huffman algorithm

Building simple predictor

Lectures 6-8: Noisy channel coding

Entropy of joint distributions

Mutual information

Formalizing channels as matrices

Capacity of a channel

Optimal input distribution

For BSC
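For the BSC the optimal input distribution is uniform, giving the standard result \(C = 1 - H_2(f)\) for flip probability \(f\). A small self-contained sketch:

```python
import math

def H2(x):
    """Binary entropy in bits."""
    if x in (0, 1):
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

def bsc_capacity(f):
    """Capacity of a binary symmetric channel with flip probability f,
    achieved by the uniform input distribution: C = 1 - H2(f)."""
    return 1 - H2(f)
```

At \(f = 0\) the channel is perfect (\(C = 1\)); at \(f = 0.5\) it is useless (\(C = 0\)).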

Shannon’s noisy-channel coding theorem

Parity check matrix

Matrix definition
Encoding & decoding
Generalizing >4

Proving the theorem for BSC

Entropy when probabilities are certain

The convention is that \(0 \log \frac{1}{0} = 0\), justified by taking the limit:

\[ H_2(0) = \lim_{p \to 0^+} p \log\frac{1}{p} = 0 \]

\[ H_2(1) = 1 \cdot \log\frac{1}{1} = 0 \]

Feedback gem

The encoder knowing what noise is added doesn’t improve the rate.

Lectures 9-10: Introduction to Bayesian inference

No notes, covers the basics.

Lectures 11-14: Approximating Probability Distributions

K-means

Clustering algorithm. Given \(\{x_n\}\) data points to put into \(K\) clusters:

  1. Assign cluster means \(\{\mu_k\}\) randomly.
  2. Assign each data point to the closest cluster: \(r^n_k = 1\) if \(k = \arg\min_{k'} (x_n - \mu_{k'})^2\), else \(r^n_k = 0\).
  3. Update each cluster mean to the mean of its data points: \(\mu_k = \frac{\sum_n r^n_k x_n}{\sum_n r^n_k}\)
  4. Go to 2.
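The steps above can be sketched for 1-D data (function names and the fixed iteration count are mine; real implementations iterate until the assignments stop changing):

```python
import random

def kmeans(xs, K, iters=20, seed=0):
    """Hard K-means on 1-D data."""
    rng = random.Random(seed)
    mus = rng.sample(xs, K)  # step 1: pick K random points as means
    for _ in range(iters):
        # step 2: hard-assign each point to its nearest mean
        assign = [min(range(K), key=lambda k: (x - mus[k]) ** 2) for x in xs]
        # step 3: move each mean to the centroid of its points
        for k in range(K):
            pts = [x for x, a in zip(xs, assign) if a == k]
            if pts:
                mus[k] = sum(pts) / len(pts)
    return sorted(mus)
```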

Bayesian interpretation

K-means is a Maximum A Posteriori (MAP) algorithm which assumes clusters are spherical, equally weighted, and of equal variance.

These assumptions might not be appropriate for the data. Additionally, MAP can place cluster means incorrectly because the hard assignment boundary pushes neighbouring clusters apart.

Soft K-means

Cluster responsibilities are no longer binary. Introduces an additional \(\beta\) hyperparameter. Update rule is the same.

\[ r^n_k = \frac{\exp(-\beta d(x_n, \mu_k))}{\sum_{k'} \exp(-\beta d(x_n, \mu_{k'}))} \]
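The responsibility computation can be sketched as (function name is mine; \(d\) is squared distance):

```python
import math

def responsibilities(x, mus, beta):
    """Soft assignment of point x to means mus with stiffness beta."""
    ws = [math.exp(-beta * (x - mu) ** 2) for mu in mus]
    total = sum(ws)
    return [w / total for w in ws]
```

As \(\beta \to \infty\) the responsibilities become binary and hard K-means is recovered; small \(\beta\) spreads responsibility across all clusters.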

Monte Carlo methods

The problem

We’re given:

- \(P(x) = \frac{P^*(x)}{Z}\), some probability distribution that we can’t sample from directly.
- \(Q(x)\), some simple probability distribution (e.g. uniform, Gaussian) that we can sample from.
- \(\phi(x)\), some function of interest that we want to average under \(P\).

We want either to draw samples from \(P\), or to estimate the expectation of \(\phi\) under \(P\).

Importance sampling

\[ \Phi = \frac{\sum_n \frac{P^*(x_n)}{Q(x_n)} \phi(x_n)}{\sum_n \frac{P^*(x_n)}{Q(x_n)}} \]

where \(\{x_n\}_N\) are \(N\) samples drawn from \(Q\).
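A sketch of the estimator (names are mine); the example estimates the mean of a standard Gaussian known only up to \(Z\), using a uniform proposal:

```python
import math, random

def importance_sample(phi, p_star, q_pdf, q_sample, N, seed=0):
    """Estimate E_P[phi] from N samples drawn from Q, weighting each
    sample by P*(x)/Q(x) and normalising by the total weight."""
    rng = random.Random(seed)
    xs = [q_sample(rng) for _ in range(N)]
    ws = [p_star(x) / q_pdf(x) for x in xs]
    return sum(w * phi(x) for w, x in zip(ws, xs)) / sum(ws)
```

The normalisation by \(\sum_n w_n\) is what lets us use the unnormalised \(P^*\) without ever knowing \(Z\).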

Choice of Q

If \(Q\) has little overlap with \(P\), we’ll mostly be sampling from the wrong part of the distribution. Although we’ll still eventually converge, this can take a while, and it’s impossible to tell from the samples whether we have converged.

Rejection sampling

Given a \(Q\) and constant \(c\) such that: \[ \forall x. cQ(x) \ge P^*(x) \]

  1. Sample \(x \sim Q\).
  2. Sample \(u \sim \text{Uniform}(0, cQ(x))\).
  3. Store \(x\) if \(u < P^*(x)\), reject otherwise.

This provides \(\{x\}\) that are sampled from \(P\).
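The three steps above, sketched in Python (names are mine); the example samples a standard Gaussian, known only up to \(Z\), under a uniform proposal with \(c\) chosen so \(cQ(x) = 1 \ge P^*(x)\):

```python
import math, random

def rejection_sample(p_star, q_pdf, q_sample, c, N, seed=0):
    """Draw N samples from P, given c such that c*Q(x) >= P*(x) for all x."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < N:
        x = q_sample(rng)                 # step 1: propose from Q
        u = rng.uniform(0, c * q_pdf(x))  # step 2: uniform height under c*Q
        if u < p_star(x):                 # step 3: keep only if under P*
            samples.append(x)
    return samples
```

The acceptance rate is \(Z / c\) (area under \(P^*\) over area under \(cQ\)), so a loose bound \(c\) wastes most proposals.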

Markov Chain Monte Carlo (MCMC) methods

Unlike plain Monte Carlo methods, the samples are no longer independent: we draw from \(Q(x; x')\), a proposal distribution conditioned on the previous sample, rather than from \(Q(x)\).

This works well even when \(Q\) and \(P\) are very different.

Metropolis sampling

  1. Sample \(x \sim Q(x; x_t)\).
  2. Calculate \(a = \frac{P^*(x) Q(x_t; x)}{P^*(x_t) Q(x; x_t)}\).
  3. If \(a>1\), accept, otherwise accept with probability \(a\).
    • Accept means \(x_{t+1} = x\).
    • Reject means \(x_{t+1} = x_t\).

Asymptotically, \(\{x\}\) is sampled from \(P\), but they won’t be independent.

Gaussian Q

\(Q(x; x_t) = \mathcal{N}(x_t, \sigma)\). By symmetry, the \(Q\)s in \(a\) cancel out.
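A sketch of random-walk Metropolis with this symmetric Gaussian proposal (names are mine; working in log space avoids overflow):

```python
import math, random

def metropolis(log_p_star, x0, sigma, N, seed=0):
    """Random-walk Metropolis: Gaussian proposals centred on the
    current state, so the Q terms in the acceptance ratio cancel."""
    rng = random.Random(seed)
    xs, x = [], x0
    for _ in range(N):
        x_new = rng.gauss(x, sigma)
        # accept with probability min(1, P*(x_new)/P*(x))
        if math.log(rng.random()) < log_p_star(x_new) - log_p_star(x):
            x = x_new
        xs.append(x)  # on rejection, the old state is repeated
    return xs
```

Because consecutive samples are correlated, the effective number of independent samples is much smaller than \(N\).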

Gibbs sampling

Assume we can sample from \(P^*(x_n | x_{!n})\), where \(x_{!n}\) is all elements of the hypothesis apart from \(x_n\).

We cycle through the variables of the hypothesis, resampling each one from its conditional given the current values of the others. This is a special case of Metropolis sampling in which every proposal is accepted.
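A sketch for a bivariate Gaussian with correlation \(\rho\), where the conditionals are available in closed form (the example and names are mine):

```python
import math, random

def gibbs_bivariate_normal(rho, N, seed=0):
    """Gibbs sampling for a unit-variance bivariate Gaussian with
    correlation rho: each conditional x_i | x_other is
    N(rho * x_other, 1 - rho^2)."""
    rng = random.Random(seed)
    x1 = x2 = 0.0
    s = math.sqrt(1 - rho * rho)
    out = []
    for _ in range(N):
        x1 = rng.gauss(rho * x2, s)  # resample x1 from P(x1 | x2)
        x2 = rng.gauss(rho * x1, s)  # resample x2 from P(x2 | x1)
        out.append((x1, x2))
    return out
```

When \(\rho\) is close to 1 the axis-aligned moves become tiny relative to the long axis of the distribution, so mixing is slow.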