# Principles of Deep Learning Theory

## 0: Initialization

### 0.2: Theoretical minimum

#### Use of Taylor expansion

We take the Taylor expansion of the trained function $$f(x ; \theta^*)$$ around the initialized value of the parameters $$\theta$$.
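Written out, this is the standard multivariate Taylor series around $$\theta$$ (a sketch; indices $$\mu, \nu$$ run over all parameters):

```latex
f(x;\theta^{\star})
= f(x;\theta)
+ \sum_{\mu} \left(\theta^{\star}_{\mu}-\theta_{\mu}\right)
  \frac{\partial f}{\partial \theta_{\mu}}(x;\theta)
+ \frac{1}{2} \sum_{\mu,\nu}
  \left(\theta^{\star}_{\mu}-\theta_{\mu}\right)
  \left(\theta^{\star}_{\nu}-\theta_{\nu}\right)
  \frac{\partial^{2} f}{\partial \theta_{\mu}\,\partial \theta_{\nu}}(x;\theta)
+ \cdots
```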

#### Three problems

##### Problem 1: Taylor expansion terms

The Taylor expansion has an infinite number of terms, and approximating a trained function whose parameters $$\theta^*$$ end up far from initialization requires keeping more and more of them.

##### Problem 2: Random initialization

As $$\theta$$ is drawn randomly, the function and all of its derivatives at initialization are random functions, with intricate statistical dependencies among them.

##### Problem 3: Dependencies

The trained function can depend on everything: the initialization, the function, all of its derivatives, the learning algorithm, and the training data.

## 1: Pretraining

### 1.1: Gaussian integrals

#### Single-variable Gaussian functions

$e^{-\frac{z^2}{2}}$

#### Integrating single-variable Gaussian functions

Intuition: Find the squared integral first, which allows you to re-parameterize as polar coordinates. This makes the integration trivial. End by taking the square root.
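The steps above can be written out explicitly. Squaring the integral $$I$$ turns it into a two-dimensional integral, which polar coordinates ($$x^2 + y^2 = r^2$$, measure $$r\,dr\,d\theta$$) make elementary:

```latex
I \equiv \int_{-\infty}^{\infty} e^{-\frac{z^2}{2}}\,dz,
\qquad
I^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}
      e^{-\frac{x^2+y^2}{2}}\,dx\,dy
    = \int_0^{2\pi}\!\int_0^{\infty} e^{-\frac{r^2}{2}}\,r\,dr\,d\theta
    = 2\pi,
\qquad
I = \sqrt{2\pi}.
```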

#### Standard normal distribution

$p(z) \equiv \frac{1}{\sqrt{2 \pi}} e^{-\frac{z^2}{2}}$

#### Introducing variance & mean

The (non-normalized) distribution introduces a variance $$K>0$$: $e^{-\frac{z^2}{2K}}$

Substituting $$u = z/\sqrt{K}$$ when integrating produces an additional factor of $$\sqrt{K}$$, so the integral evaluates to $$\sqrt{2\pi K}$$ and the normalized distribution is: $p(z) \equiv \frac{1}{\sqrt{2 \pi K}} e^{-\frac{z^2}{2K}}$

Introducing the mean is simple: replace $$z$$ with $$z-s$$, where $$s$$ is the mean.
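A quick numerical sanity check (a sketch): the normalized Gaussian with variance $$K$$ and mean $$s$$ should integrate to 1 and have exactly those moments. The particular values of `K` and `s` below are arbitrary examples, not from the text.

```python
import numpy as np

# Example values (assumptions, chosen for illustration).
K, s = 2.5, -1.0

# Fine grid covering the distribution out to +-20 standard deviations.
z = np.linspace(s - 20 * np.sqrt(K), s + 20 * np.sqrt(K), 200_001)
dz = z[1] - z[0]

# Normalized Gaussian p(z) = exp(-(z-s)^2 / 2K) / sqrt(2*pi*K).
p = np.exp(-((z - s) ** 2) / (2 * K)) / np.sqrt(2 * np.pi * K)

total = np.sum(p) * dz                   # normalization, should be ~1
mean = np.sum(z * p) * dz                # should be ~s
var = np.sum((z - mean) ** 2 * p) * dz   # should be ~K

print(total, mean, var)
```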

#### Multivariable Gaussian integrals

##### Unnormalized

$\exp [ -\frac{1}{2} \sum_{\mu, \nu} z_\mu (K^{-1})_{\mu \nu} z_\nu]$ Here $$K$$ is the covariance matrix; its inverse appears so that the single-variable exponent $$-\frac{z^2}{2K}$$ is recovered in one dimension.

##### Integrating multivariable Gaussians

Roughly:

- Use the orthogonal matrix $$O$$ that diagonalizes $$K$$, with eigenvalues $$\lambda_\mu$$.
- The exponent becomes $$-\frac{1}{2} \sum_\mu \frac{1}{\lambda_\mu} (Oz)^2_\mu$$.
- Substitute $$u = Oz$$, which leaves the integration measure unchanged because $$O$$ is orthogonal.
- Each term now involves only a single index $$\mu$$, so the integral factorizes into a product of single-variable Gaussians:

$\sqrt{\prod_\mu 2 \pi \lambda_\mu}$

- But the product of a matrix's eigenvalues is the determinant:

$\sqrt{|2 \pi K|}$
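A numerical check of that last step (a sketch): for a symmetric positive-definite $$K$$, the product of eigenvalues equals the determinant, so $$\sqrt{\prod_\mu 2\pi\lambda_\mu} = \sqrt{|2\pi K|}$$. The $$3\times 3$$ matrix below is an arbitrary example, not from the text.

```python
import numpy as np

# Build an arbitrary symmetric positive-definite covariance matrix K.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
K = A @ A.T + 3 * np.eye(3)

# Normalization from the eigenvalues: sqrt(prod_mu 2*pi*lambda_mu).
lam = np.linalg.eigvalsh(K)
norm_from_eigs = np.sqrt(np.prod(2 * np.pi * lam))

# Normalization from the determinant: sqrt(det(2*pi*K)).
norm_from_det = np.sqrt(np.linalg.det(2 * np.pi * K))

print(norm_from_eigs, norm_from_det)
```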