Principles of Deep Learning Theory
0: Initialization
0.2: Theoretical minimum
Use of Taylor expansion
We take the Taylor expansion of the trained function \(f(x ; \theta^*)\) around the initialized value of the parameters \(\theta\).
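Concretely, the expansion reads (a sketch in index notation, with \(\mu, \nu\) running over the parameters and all derivatives evaluated at the initialization \(\theta\)):
\[ f(x; \theta^*) = f(x; \theta) + \sum_\mu (\theta^*_\mu - \theta_\mu) \frac{\partial f}{\partial \theta_\mu} + \frac{1}{2} \sum_{\mu, \nu} (\theta^*_\mu - \theta_\mu)(\theta^*_\nu - \theta_\nu) \frac{\partial^2 f}{\partial \theta_\mu \partial \theta_\nu} + \cdots \]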
Three problems
Problem 1: Taylor expansion terms
The Taylor expansion has an infinite number of terms, and we need more and more of them to approximate the trained function when the trained parameters \(\theta^*\) end up far from initialization.
Problem 2: Random initialization
As \(\theta\) is drawn randomly, the function and all of its derivatives evaluated at initialization are random functions, with intricate statistical dependencies among them.
Problem 3: Dependencies
The trained function can depend on everything: the initialization, the function and all of its derivatives, the learning algorithm, and the training data.
1: Pretraining
1.1: Gaussian integrals
Single-variable Gaussian functions
\[ e^{-\frac{z^2}{2}} \]
Integrating single-variable Gaussian functions
Intuition: compute the square of the integral first, which lets you re-parameterize the resulting two-dimensional integral in polar coordinates. This makes the integration trivial. End by taking the square root.
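As a one-line sketch of this trick, writing \(I\) for the integral:
\[ I^2 = \left( \int_{-\infty}^{\infty} dz \, e^{-\frac{z^2}{2}} \right)^2 = \int_0^{2\pi} d\phi \int_0^{\infty} dr \, r \, e^{-\frac{r^2}{2}} = 2\pi, \qquad \text{so } I = \sqrt{2\pi}. \]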
Standard normal distribution
\[ p(z) \equiv \frac{1}{\sqrt{2 \pi}} e^{-\frac{z^2}{2}} \]
Introducing variance & mean
To introduce a variance \(K > 0\), the (non-normalized) Gaussian function becomes:
\[ e^{-\frac{z^2}{2K}} \]
This shows up as an additional factor of \(\sqrt{K}\) when integrating, making the normalized distribution:
\[ p(z) \equiv \frac{1}{\sqrt{2 \pi K}} e^{-\frac{z^2}{2K}} \]
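Explicitly, the \(\sqrt{K}\) comes from the substitution \(u = z / \sqrt{K}\):
\[ \int_{-\infty}^{\infty} dz \, e^{-\frac{z^2}{2K}} = \sqrt{K} \int_{-\infty}^{\infty} du \, e^{-\frac{u^2}{2}} = \sqrt{2 \pi K}. \]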
Introducing the mean is simple: we just replace \(z\) with \(z - s\), where \(s\) is the mean.
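Combining both, the general single-variable Gaussian distribution with mean \(s\) and variance \(K\) is:
\[ p(z) \equiv \frac{1}{\sqrt{2 \pi K}} e^{-\frac{(z-s)^2}{2K}} \]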
Multivariable Gaussian integrals
Unnormalized
\[ \exp \left[ -\frac{1}{2} \sum_{\mu, \nu} z_\mu (K^{-1})_{\mu \nu} z_\nu \right] \]
The covariance matrix \(K\) appears inverted; in the single-variable case the exponent reduces to \(-\frac{z^2}{2K}\).
Integrating multivariable Gaussians
Roughly:
- Use the orthogonal matrix \(O\) that diagonalizes \(K\), i.e. \((O K O^T)_{\mu \nu} = \lambda_\mu \delta_{\mu \nu}\).
- The exponent becomes \(-\frac{1}{2} \sum_\mu \frac{(Oz)_\mu^2}{\lambda_\mu}\).
- Substitute \(u = Oz\), which leaves the integration measure unchanged since \(O\) is orthogonal.
- The integral now factorizes into independent single-variable Gaussians, one per index \(\mu\), each contributing \(\sqrt{2 \pi \lambda_\mu}\); their product is:
\[ \sqrt{\prod_\mu 2 \pi \lambda_\mu} \]
- But the product of a matrix’s eigenvalues is the determinant:
\[ \sqrt{|2 \pi K|} \]
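Putting the steps together (for \(N\) components \(z_\mu\)), the integral and the resulting normalized multivariable Gaussian distribution are:
\[ \int d^N z \, \exp \left[ -\frac{1}{2} \sum_{\mu, \nu} z_\mu (K^{-1})_{\mu \nu} z_\nu \right] = \sqrt{|2 \pi K|}, \qquad p(z) \equiv \frac{1}{\sqrt{|2 \pi K|}} \exp \left[ -\frac{1}{2} \sum_{\mu, \nu} z_\mu (K^{-1})_{\mu \nu} z_\nu \right] \]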