# Distilling Singular Learning Theory

## 1. The RLCT Measures the Effective Dimension of Neural Networks

### Preliminaries

#### Negative log likelihood

Likelihood is:

\[ p(y | x, w) = \prod_i p(y_i | x_i, w) \]

The negative log likelihood takes the negative logarithm of this and averages over examples:

\[ L_n(w) = -\frac{1}{n} \log p(y | x, w) = - \frac{1}{n} \sum_i \log p(y_i | x_i, w) \]

(we divide by \(n\) to normalize across dataset sizes.)
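As a quick numerical sanity check (a minimal sketch using NumPy; the probabilities are made up), the averaged NLL is just the negative mean of the per-example log-likelihoods:

```python
import numpy as np

def negative_log_likelihood(log_probs):
    """Average NLL L_n(w) from per-example log-likelihoods log p(y_i | x_i, w)."""
    return -np.mean(log_probs)

# Toy example: three examples to which the model assigns probabilities 0.9, 0.5, 0.8.
log_probs = np.log([0.9, 0.5, 0.8])
nll = negative_log_likelihood(log_probs)
```

Because of the \(1/n\) normalization, `nll` stays on the same scale no matter how many examples we average over.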

#### Relationship between KL-divergence and Cross-Entropy loss

\[ D_\text{KL}(p || q) = L_\text{CE}(p, q) - H(p) \]
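This identity is easy to verify numerically; a minimal sketch with two made-up discrete distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # "true" distribution
q = np.array([0.5, 0.3, 0.2])  # model distribution

cross_entropy = -np.sum(p * np.log(q))      # L_CE(p, q)
entropy       = -np.sum(p * np.log(p))      # H(p)
kl            =  np.sum(p * np.log(p / q))  # D_KL(p || q)

# The identity D_KL(p || q) = L_CE(p, q) - H(p) holds numerically:
assert np.isclose(kl, cross_entropy - entropy)
```

Since \(H(p)\) does not depend on the model \(q\), minimizing cross-entropy and minimizing KL-divergence pick out the same \(q\).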

#### True parameters

The set of parameters where the KL-divergence is zero.

\[ W_0 = \{ w \in W \mid K(w) = 0 \} \]

#### Gaussian noise & NLL

If we have a regression model, we can define the likelihood under unit-variance Gaussian noise (where \(M\) is the output dimension):

\[ p(y | x, w) = \frac{1}{(2\pi)^{M/2}} \exp\left(-\frac{1}{2} ||y - f(x, w)||^2\right) \]

If we take the NLL of this, we get:

\[ L_n(w) = \frac{M}{2} \log 2\pi + \frac{1}{2n} \sum_{i=1}^n ||y_i - f(x_i, w)||^2 \]

Which, up to an additive constant and a scale factor, is just the mean squared error: maximum likelihood estimation here reduces to least squares!
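A small sketch (unit-variance Gaussian noise, made-up data) confirming that the Gaussian NLL and the squared-error term differ only by the constant \(\frac{M}{2}\log 2\pi\):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=(10, 3))  # targets, M = 3
f = rng.normal(size=(10, 3))  # model outputs f(x_i, w)
n, M = y.shape

# Per-example NLL under unit-variance Gaussian noise:
sq_err = np.mean(np.sum((y - f) ** 2, axis=1)) / 2
nll = M / 2 * np.log(2 * np.pi) + sq_err

# The two objectives differ only by the constant (M/2) log 2pi,
# so they have the same minimizer in w:
assert np.isclose(nll - sq_err, M / 2 * np.log(2 * np.pi))
```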

### Singular vs. Regular models

#### Score/informant

Measures the sensitivity of the log-likelihood function with respect to the parameter:

\[ s(w) = \frac{\partial}{\partial w} \log p(y | x, w) \]

When \(w \in W_0\), the score is zero in expectation: \(\mathbb{E}[s(w)] = 0\).

#### Fisher information matrix

The Fisher information is the variance of the score:

\[ I = \mathbb{E}[s(w)s(w)^T] \]

So in our case:

\[ I_{jk}(w) = \iint \left( \frac{\partial}{\partial w_j} \log p(y | x, w) \right) \left( \frac{\partial}{\partial w_k} \log p(y | x, w) \right) p(y | x, w) q(x) \, dx \, dy \]

#### Fisher information matrix & KL-divergence

For true parameters, \(I(w)\) is equal to the Hessian of \(K(w)\):

\[ I_{jk}(w^{(0)}) = \left. \frac{\partial^2}{\partial w_j \partial w_k} K(w) \right|_{w=w^{(0)}} \]
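A Monte Carlo sketch of this for the toy model \(y \sim \mathcal{N}(w, 1)\) with true parameter \(w^{(0)} = 0\): the Fisher information (variance of the score) matches the Hessian of \(K(w) = w^2/2\), which is 1:

```python
import numpy as np

# Toy model: y ~ N(w, 1); true parameter w0 = 0, so K(w) = w^2 / 2.
rng = np.random.default_rng(0)
w0 = 0.0
y = rng.normal(loc=w0, scale=1.0, size=100_000)

# Score at w0: d/dw log p(y | w) = (y - w), evaluated at w = w0.
score = y - w0

fisher_mc = np.mean(score ** 2)  # Monte Carlo estimate of E[s(w0)^2]
hessian_K = 1.0                  # d^2/dw^2 [w^2 / 2] = 1

# Fisher information and Hessian of K agree at the true parameter:
assert abs(fisher_mc - hessian_K) < 0.05
```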

#### Regular vs. singular

Regular models meet the following criteria:

- Positive definite Fisher information matrix for all \(w \in W\).
- Identifiable: \(p(y | x, w_1) = p(y | x, w_2)\) implies \(w_1 = w_2\).

Singular models fail at least one of these conditions.

#### Asymptotic normality (regular models)

The posterior converges to a normal distribution centered at the MLE \(w^{(0)}\):

\[ p(w | D_n) \to \mathcal{N}_d(w^{(0)}, \frac{1}{n} I(w^{(0)})^{-1}) \]

#### Deriving the Bayesian Information Criterion (BIC)

We want to find the free energy \(F_n = -\log Z_n\) of a *regular*
model.

- We Taylor expand the negative log-likelihood around \(w = w^{(0)}\).

\[ L_n(w) = L_n(w^{(0)}) + \left. (w - w^{(0)})^T \frac{\partial L_n(w)}{\partial w} \right|_{w=w^{(0)}} + \frac{1}{2} (w - w^{(0)})^T J(w^{(0)}) (w - w^{(0)}) + \dots \]

- We substitute the Hessian \(J(w^{(0)})\) with the Fisher information matrix \(I(w^{(0)})\) (the two agree at a true parameter), and the linear term vanishes because the gradient of \(L_n\) is zero at the minimizer \(w^{(0)}\):

\[ L_n(w) = L_n(w^{(0)}) + \frac{1}{2} (w - w^{(0)})^T I(w^{(0)}) (w - w^{(0)}) + \dots \]

- We can now define the partition function (i.e. the evidence) in terms of this approximated NLL. We also Taylor expand the prior \(\varphi(w)\) around \(w^{(0)}\), but keep only the leading term \(\varphi(w^{(0)})\): the Gaussian factor concentrates around \(w^{(0)}\) as \(n\) grows, so the higher-order prior terms only contribute at lower order.

\[ Z_n \approx \int \exp( -nL_n(w^{(0)}) -\frac{n}{2} (w - w^{(0)})^T I(w^{(0)}) (w - w^{(0)}) ) \times [ \varphi(w^{(0)}) + ... ] dw \]

- We solve the integral, noticing that part of it is an unnormalized multivariate Gaussian whose integral we already know (with covariance matrix \((n I(w^{(0)}))^{-1}\)).

\[ Z_n \approx \exp(-nL_n(w^{(0)}))\varphi(w^{(0)}) \int \exp(-\frac{n}{2} (w - w^{(0)})^T I(w^{(0)}) (w - w^{(0)})) dw \\ = \frac{ \exp(-nL_n(w^{(0)}))\varphi(w^{(0)}) (2\pi)^{d/2} }{ n^{d/2} \sqrt{\det I(w^{(0)})} } \]

- We substitute this to calculate the free energy:

\[ F_n = -\log Z_n = nL_n(w^{(0)}) + \frac{d}{2} \log n - \log \varphi(w^{(0)}) - \frac{d}{2} \log 2\pi + \frac{1}{2} \log \det I(w^{(0)}) \]

- We drop every term that is \(O(1)\) in \(n\), keeping only the terms that grow with \(n\):

\[ \text{BIC} = nL_n(w^{(0)}) + \frac{d}{2} \log n \]

We can't perform the Gaussian integral above if the Fisher information matrix \(I\) is not invertible, so this derivation doesn't apply to singular models.
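As a sketch of how the BIC formula trades off fit against parameter count (the per-example NLL values below are made up):

```python
import numpy as np

def bic(nll_per_example, d, n):
    """BIC = n * L_n(w_hat) + (d/2) * log n, the form derived above."""
    return n * nll_per_example + d / 2 * np.log(n)

# Hypothetical fits: a 2-parameter model with slightly worse fit vs a
# 10-parameter model with slightly better fit, on n = 1000 examples.
n = 1000
bic_small = bic(nll_per_example=0.52, d=2, n=n)
bic_large = bic(nll_per_example=0.51, d=10, n=n)

# The complexity penalty (d/2) log n outweighs the small gain in fit,
# so BIC prefers the smaller model here:
assert bic_small < bic_large
```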

### Real Log Canonical Threshold (RLCT)

We measure the volume of the weight-space region \(V(\epsilon) = \text{vol}\{w \in W \mid K(w) < \epsilon\}\) around a singularity (where \(K(w) = 0\)). For small \(\epsilon\), this volume scales as \(V(\epsilon) \propto \epsilon^\lambda\) (up to log factors), and the exponent \(\lambda\) is the RLCT. It gives an “effective dimensionality” of the singularity: in simple (regular) examples \(\lambda = d/2\), so \(2\lambda\) recovers the dimension.

Intuitively, this ties back to the Fisher information matrix: how sensitive is the model’s loss to the weights? If the loss is insensitive in some directions, the volume \(V(\epsilon)\) is large and the effective dimensionality is small.
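A Monte Carlo sketch of this volume-scaling behavior, comparing a regular loss \(K(w) = w_1^2 + w_2^2\) (zero set is a point, \(\lambda = d/2 = 1\)) against a singular one \(K(w) = w_1^2\) (zero set is a whole line, \(\lambda = 1/2\)):

```python
import numpy as np

# Estimate V(eps) = vol{ w : K(w) < eps } on W = [-1, 1]^2 by sampling,
# then fit the exponent lambda in V(eps) ~ c * eps^lambda.
rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, size=(1_000_000, 2))

K_regular  = w[:, 0] ** 2 + w[:, 1] ** 2  # minimum is a point
K_singular = w[:, 0] ** 2                 # minimum is a line (w_2 is free)

def volume_exponent(K, eps1=1e-3, eps2=1e-2):
    """Fit lambda from the slope of log V(eps) between two epsilon values."""
    v1 = np.mean(K < eps1)
    v2 = np.mean(K < eps2)
    return np.log(v2 / v1) / np.log(eps2 / eps1)

lam_reg = volume_exponent(K_regular)    # close to 1.0 = d/2
lam_sing = volume_exponent(K_singular)  # close to 0.5: flat direction halves lambda
```

The insensitive direction \(w_2\) inflates \(V(\epsilon)\), which shows up as a smaller exponent \(\lambda\), i.e. a smaller effective dimension \(2\lambda\).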

#### Global vs. local

The RLCT can be calculated locally at each singularity in the weight space, and singular models generally have many. The global RLCT is the minimum of the local \(\lambda\) values: the corresponding singularity occupies the largest volume \(V(\epsilon)\) and so dominates the posterior.