# Principles of Deep Learning Theory

## 0: Initialization

### 0.2: Theoretical minimum

#### Use of Taylor expansion

We take the Taylor expansion of the trained function \(f(x ; \theta^*)\) around the initialized value of the parameters \(\theta\).
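Written out, the expansion has the standard multivariable Taylor form. This is a sketch: the parameter indices \(\mu, \nu\) are labels introduced here, and every derivative is understood to be evaluated at the initialization \(\theta\).

\[ f(x ; \theta^*) = f(x ; \theta) + \sum_\mu (\theta^* - \theta)_\mu \frac{\partial f}{\partial \theta_\mu} + \frac{1}{2} \sum_{\mu, \nu} (\theta^* - \theta)_\mu (\theta^* - \theta)_\nu \frac{\partial^2 f}{\partial \theta_\mu \partial \theta_\nu} + \dots \]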

#### Three problems

##### Problem 1: Taylor expansion terms

The Taylor expansion has an infinite number of terms, and we need an increasing number of them to approximate trained parameters that are far away from initialization.
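As a toy illustration of this point (not a neural network), the sketch below counts how many Taylor terms are needed to approximate `exp(x)` around 0 to a fixed relative tolerance. The choice of `exp` as a stand-in function, the tolerance, and the helper names `taylor_exp` and `terms_needed` are all assumptions made for the example.

```python
import math

def taylor_exp(x, n_terms):
    """Partial sum of the Taylor series of exp around 0, using n_terms terms."""
    return sum(x**k / math.factorial(k) for k in range(n_terms))

def terms_needed(x, rel_tol=1e-6, max_terms=200):
    """Smallest number of terms whose partial sum is within rel_tol of exp(x)."""
    target = math.exp(x)
    for n in range(1, max_terms + 1):
        if abs(taylor_exp(x, n) - target) <= rel_tol * abs(target):
            return n
    raise ValueError("tolerance not reached within max_terms")

# The further the evaluation point is from the expansion point (0),
# the more terms are required for the same accuracy.
for distance in (0.1, 1.0, 5.0, 10.0):
    print(distance, terms_needed(distance))
```

The required term count grows with the distance from the expansion point, which is the same obstacle the notes describe for parameters that move far from initialization.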

##### Problem 2: Random initialization

As \(\theta\) is drawn randomly, the function \(f(x ; \theta)\) and all of its derivatives are random functions. These have intricate statistical dependencies.

##### Problem 3: Dependencies

The trained function can depend on *everything*: the initialization, the function, all of its derivatives, the learning algorithm, and the training data.

## 1: Pretraining

### 1.1: Gaussian integrals

#### Single-variable Gaussian functions

\[ e^{-\frac{z^2}{2}} \]

#### Integrating single-variable Gaussian functions

Intuition: Compute the square of the integral first, which lets you change to polar coordinates. This makes the integration trivial. Finish by taking the square root.
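The steps can be sketched as follows, where \(I\), \(x\), \(y\), \(r\) and \(\phi\) are labels introduced here for the integral, the two Cartesian variables, and the polar coordinates:

\[ I \equiv \int_{-\infty}^{\infty} dz \, e^{-\frac{z^2}{2}} \]

\[ I^2 = \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} dy \, e^{-\frac{x^2 + y^2}{2}} = \int_0^{2\pi} d\phi \int_0^{\infty} dr \, r \, e^{-\frac{r^2}{2}} = 2 \pi \left[ -e^{-\frac{r^2}{2}} \right]_0^{\infty} = 2 \pi \]

\[ I = \sqrt{2 \pi} \]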

#### Standard normal distribution

\[ p(z) \equiv \frac{1}{\sqrt{2 \pi}} e^{-\frac{z^2}{2}} \]

#### Introducing variance & mean

The (non-normalized) distribution introduces variance \(K>0\): \[ e^{-\frac{z^2}{2K}} \]

This contributes an additional factor of \(\sqrt{K}\) when integrating, so the normalized distribution is: \[ p(z) \equiv \frac{1}{\sqrt{2 \pi K}} e^{-\frac{z^2}{2K}} \]
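The normalization factor \(\sqrt{2 \pi K}\) can be checked numerically. The sketch below uses a plain midpoint rule from the standard library; the function name `gaussian_integral`, the truncation range, and the sample values of \(K\) are assumptions chosen for illustration.

```python
import math

def gaussian_integral(K, half_width=12.0, n=20_000):
    """Midpoint-rule estimate of the integral of exp(-z^2 / (2K)) over the real line.

    The range is truncated to +/- half_width standard deviations (sqrt(K)),
    beyond which the integrand is negligibly small.
    """
    limit = half_width * math.sqrt(K)
    dz = 2 * limit / n
    total = 0.0
    for i in range(n):
        z = -limit + (i + 0.5) * dz  # midpoint of the i-th cell
        total += math.exp(-z * z / (2 * K))
    return total * dz

# Compare the numerical estimate with the closed form sqrt(2 * pi * K).
for K in (0.5, 1.0, 4.0):
    print(K, gaussian_integral(K), math.sqrt(2 * math.pi * K))
```

The two printed columns should agree closely for each \(K\), which is what dividing by \(\sqrt{2 \pi K}\) relies on.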

Introducing the mean is simple: replace \(z\) with \(z-s\), where \(s\) is the mean.
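Applying that substitution to the normalized distribution with variance \(K\) gives the general single-variable Gaussian. Shifting \(z\) by a constant does not change the integral, so the normalization factor stays the same:

\[ p(z) \equiv \frac{1}{\sqrt{2 \pi K}} e^{-\frac{(z-s)^2}{2K}} \]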

#### Multivariable Gaussian integrals

##### Unnormalized

\[ \exp \left[ -\frac{1}{2} \sum_{\mu, \nu} z_\mu (K^{-1})_{\mu \nu} z_\nu \right] \]

The inverse of the covariance matrix \(K\) appears because it generalizes the division by \(K\) in the single-variable exponent \(-\frac{z^2}{2K}\).

##### Integrating multivariable Gaussians

Roughly:

- Use the orthogonal matrix \(O\) that diagonalizes \(K\), with eigenvalues \(\lambda_\mu\).
- The sum in the exponent becomes \(\sum_\mu \frac{1}{\lambda_\mu} (Oz)^2_\mu\).
- Substitute \(u = Oz\), which leaves the integration measure unchanged because \(O\) is orthogonal.
- The integral now factorizes into independent single-variable Gaussian integrals, one per index \(\mu\) with variance \(\lambda_\mu\), and can be expressed as a product:

\[ \sqrt{\prod_\mu 2 \pi \lambda_\mu} \]

- But the product of a matrix's eigenvalues is its determinant:

\[ \sqrt{|2 \pi K|} \]
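The result can be checked numerically in two dimensions. The sketch below uses only the standard library; the example covariance matrix `K`, the helper names `det2`, `inv2`, `eigenvalues2` and `gaussian_integral_2d`, and the grid settings are assumptions chosen for illustration.

```python
import math

# Hypothetical 2x2 covariance matrix (symmetric, positive definite).
K = [[2.0, 0.6],
     [0.6, 1.0]]

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    """Inverse of a 2x2 matrix."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d],
            [-m[1][0] / d, m[0][0] / d]]

def eigenvalues2(m):
    """Eigenvalues of a symmetric 2x2 matrix from its trace and determinant."""
    half_trace = (m[0][0] + m[1][1]) / 2
    root = math.sqrt(half_trace ** 2 - det2(m))
    return half_trace + root, half_trace - root

def gaussian_integral_2d(K, half_width=10.0, n=400):
    """Midpoint-rule estimate of the integral of exp(-z^T K^{-1} z / 2) over the plane."""
    K_inv = inv2(K)
    lim0 = half_width * math.sqrt(K[0][0])
    lim1 = half_width * math.sqrt(K[1][1])
    d0, d1 = 2 * lim0 / n, 2 * lim1 / n
    total = 0.0
    for i in range(n):
        z0 = -lim0 + (i + 0.5) * d0
        for j in range(n):
            z1 = -lim1 + (j + 0.5) * d1
            quad = (K_inv[0][0] * z0 * z0
                    + 2 * K_inv[0][1] * z0 * z1
                    + K_inv[1][1] * z1 * z1)
            total += math.exp(-0.5 * quad)
    return total * d0 * d1

numeric = gaussian_integral_2d(K)
lam1, lam2 = eigenvalues2(K)
via_eigenvalues = math.sqrt((2 * math.pi * lam1) * (2 * math.pi * lam2))
via_determinant = math.sqrt((2 * math.pi) ** 2 * det2(K))  # sqrt(|2 pi K|) for a 2x2 matrix
print(numeric, via_eigenvalues, via_determinant)
```

All three printed values should agree closely: the direct numerical integral, the product over eigenvalues, and the determinant form.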