Principles of Deep Learning Theory
0: Initialization
0.2: Theoretical minimum
Use of Taylor expansion
We take the Taylor expansion of the trained function \(f(x ; \theta^*)\) around the initialized value of the parameters \(\theta\).
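Concretely, the expansion reads (a sketch in index notation, with \(\mu, \nu\) running over the parameters and all derivatives evaluated at the initialization \(\theta\)):
\[ f(x; \theta^*) = f(x; \theta) + \sum_\mu (\theta^*_\mu - \theta_\mu) \frac{\partial f}{\partial \theta_\mu} + \frac{1}{2} \sum_{\mu, \nu} (\theta^*_\mu - \theta_\mu)(\theta^*_\nu - \theta_\nu) \frac{\partial^2 f}{\partial \theta_\mu \partial \theta_\nu} + \cdots \]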
Three problems
Problem 1: Taylor expansion terms
The Taylor expansion has an infinite number of terms, and we need more and more of them to approximate the trained function when the trained parameters \(\theta^*\) end up far from initialization.
Problem 2: Random initialization
As \(\theta\) is drawn randomly, the function and all of its derivatives evaluated at initialization are random functions, with intricate statistical dependencies among them.
Problem 3: Dependencies
The trained function can depend on everything: the initialization, the function and all of its derivatives, the learning algorithm, and the training data.
1: Pretraining
1.1: Gaussian integrals
Single-variable Gaussian functions
\[ e^{-\frac{z^2}{2}} \]
Integrating single-variable Gaussian functions
Intuition: compute the square of the integral first, which lets you re-parameterize the resulting two-dimensional integral in polar coordinates. This makes the integration trivial. End by taking the square root.
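As a one-line sketch of this trick, writing \(I\) for the integral:
\[ I^2 = \left( \int_{-\infty}^{\infty} dz \, e^{-\frac{z^2}{2}} \right)^2 = \int_0^{2\pi} d\phi \int_0^{\infty} dr \, r \, e^{-\frac{r^2}{2}} = 2\pi, \qquad \text{so } I = \sqrt{2\pi}. \]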
Standard normal distribution
\[ p(z) \equiv \frac{1}{\sqrt{2 \pi}} e^{-\frac{z^2}{2}} \]
Introducing variance & mean
To introduce a variance \(K > 0\), the (non-normalized) Gaussian function becomes:
\[ e^{-\frac{z^2}{2K}} \]
This shows up as an additional factor of \(\sqrt{K}\) when integrating, making the normalized distribution:
\[ p(z) \equiv \frac{1}{\sqrt{2 \pi K}} e^{-\frac{z^2}{2K}} \]
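Explicitly, the \(\sqrt{K}\) comes from the substitution \(u = z / \sqrt{K}\):
\[ \int_{-\infty}^{\infty} dz \, e^{-\frac{z^2}{2K}} = \sqrt{K} \int_{-\infty}^{\infty} du \, e^{-\frac{u^2}{2}} = \sqrt{2 \pi K}. \]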
Introducing the mean is simple: we just replace \(z\) with \(z - s\), where \(s\) is the mean.
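Combining both, the general single-variable Gaussian distribution with mean \(s\) and variance \(K\) is:
\[ p(z) \equiv \frac{1}{\sqrt{2 \pi K}} e^{-\frac{(z-s)^2}{2K}} \]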
Multivariable Gaussian integrals
Unnormalized
\[ \exp \left[ -\frac{1}{2} \sum_{\mu, \nu} z_\mu (K^{-1})_{\mu \nu} z_\nu \right] \]
The covariance matrix \(K\) appears inverted; in the single-variable case the exponent reduces to \(-\frac{z^2}{2K}\).
Integrating multivariable Gaussians
Roughly:
- Use the orthogonal matrix \(O\) that diagonalizes \(K\), i.e. \((O K O^T)_{\mu \nu} = \lambda_\mu \delta_{\mu \nu}\).
- The exponent becomes \(-\frac{1}{2} \sum_\mu \frac{(Oz)_\mu^2}{\lambda_\mu}\).
- Substitute \(u = Oz\), which leaves the integration measure unchanged since \(O\) is orthogonal.
- The integral now factorizes into independent single-variable Gaussians, one per index \(\mu\), each contributing \(\sqrt{2 \pi \lambda_\mu}\); their product is:
\[ \sqrt{\prod_\mu 2 \pi \lambda_\mu} \]
- But the product of a matrix’s eigenvalues is the determinant:
\[ \sqrt{|2 \pi K|} \]
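Putting the steps together (for \(N\) components \(z_\mu\)), the integral and the resulting normalized multivariable Gaussian distribution are:
\[ \int d^N z \, \exp \left[ -\frac{1}{2} \sum_{\mu, \nu} z_\mu (K^{-1})_{\mu \nu} z_\nu \right] = \sqrt{|2 \pi K|}, \qquad p(z) \equiv \frac{1}{\sqrt{|2 \pi K|}} \exp \left[ -\frac{1}{2} \sum_{\mu, \nu} z_\mu (K^{-1})_{\mu \nu} z_\nu \right] \]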