# Math

## Approximating factorials

$$x! \approx x^x e^{-x}$$

## Binomial distribution

• $$f$$ probability of $$1$$, $$(1-f)$$ probability of $$0$$.
• What’s the probability distribution of the number of $$1$$s, given $$N$$ samples?
• $$P(r | f, N) = {N \choose r}f^r(1-f)^{N-r}$$

### Mean and variance

• $$\text{mean}(r) = Nf$$
• $$\text{var}(r) = Nf(1-f)$$

## Differentiation rules

### Exponential

• $$f(x) = e^x$$, $$f'(x) = e^x$$
• $$f(x) = a^x$$, $$f'(x) = a^x \ln(a)$$

### Logarithm

• $$f(x) = \log_e(x) = \ln(x)$$, $$f'(x) = 1 / x$$
• $$f(x) = \log_a(x)$$, $$f'(x) = 1 / (x \ln(a))$$

## Linear algebra

### Cross product

$A \times B = \left\Vert A \right\Vert \left\Vert B \right\Vert \sin{\theta} n$

### Dot product

$a \cdot b = \sum_i a_i b_i$

#### Dot product intuition

$$a \cdot b$$ measures how much $$a$$ and $$b$$ point in the same direction, scaled by their magnitude.

## Gaussian

$P(x | \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp(-\frac{(x - \mu)^2}{2 \sigma^2})$

## Exponential distribution

$P(x | \lambda) = \frac{e^{-\frac{x}{\lambda}}}{\mathcal{Z}}$

where $$\mathcal{Z}$$ is a normalizing factor so that $$\int P(x | \lambda) = 1$$.

## Bayes

$P(A | B) = \frac{P(B | A) P(A)}{P(B)}$

$\text{posterior} = \text{likelihood ratio} \cdot \text{prior}$

$\text{likelihood ratio} = \frac{P(B | A)}{P(B)}$

### Maximum Likelihood Estimate vs. Maximum a Priori

$\theta_{\text{MLE}} = \arg \max_\theta p(x | \theta) \\ \theta_{\text{MAP}} = \arg \max_\theta p(x | \theta) p(\theta)$

If $$p(\theta)$$ is uniform, $$\theta_{\text{MLE}} = \theta_{\text{MAP}}$$.

#### Using logarithms to make calculations easier

For example, for Maximum a Priori, we can do the following:

• $$\arg \max_\theta p(x | \theta) p(\theta)$$
• $$\arg \max_\theta \Pi_i (p(x_i | \theta)) p(\theta)$$
• $$\arg \max_\theta \log \Pi_i (p(x_i | \theta)) p(\theta)$$
• $$\arg \max_\theta \Sigma_i \log (p(x_i | \theta)) + \log p(\theta)$$

## Perplexity

Wiki.

$PP(x) = 2^{H(x)}$

## Properties of binary operations

### Commutative

$f(a, b) = f(b, a)$

### Associativity

$f(a, f(b, c)) = f(f(a, b), c)$

### Distributive

$f(a, g(b, c)) = f(g(a, b), g(a, c))$

For example, we say multiplication distributes over addition.

## Jacobian

Given a function $$f(x) = y$$ where $$x$$ and $$y$$ are vectors, the gradient of $$y$$ with respect to $$x$$ is the Jacobian:

$\frac{\delta y}{\delta x} = J = \begin{bmatrix} \frac{\delta y_1}{\delta x} & \frac{\delta y_2}{\delta x} & \frac{\delta y_3}{\delta x} & ... \end{bmatrix} = \begin{bmatrix} \frac{\delta y_1}{\delta x_1} & \frac{\delta y_2}{\delta x_1} & \frac{\delta y_3}{\delta x_1} & ... \\ \frac{\delta y_1}{\delta x_2} & \frac{\delta y_2}{\delta x_2} & \frac{\delta y_3}{\delta x_2} & ... \\ ... & ... & ... & ... \end{bmatrix}$

## Polar coordinates

Specify coordinates by distance from a central point and angle (wiki).

2.718

1.618

1.414

2.99e8 m/s

1.37e10 years

4.54e9 years

3.7e9 years

3e5 years

## Nats, hartleys, shannons

• Nat: Base $$e$$ bits of information.
• Hartley: Base $$10$$ bits of information.
• Shannons: Base $$2$$ bits of information.

## Cross entropy

Wikipedia. The cross entropy between two distributions $$p, q$$ is the number of bits needed to identify an event in $$p$$ using a coding scheme optimized for $$q$$.

$H(p, q) = - E_p[\log q] \\ H(p, q) = - \sum_x p(x) \log q(x)$

## Transpose rules

$(A + B)^T = A^T + B^T$

### Multiplication by constant

$(kA)^T = k(A^T)$

### Mat muls

$(AB)^T = B^T A^T$

## Singular Vector Decomposition

$M = USV^T$

where:

• $$U, V$$ are orthonormal (full-rank & each row is a unit vector) “rotation” matrices.
• $$S$$ is a diagonal “dilation” matrix

### Relation to eigenvectors

$$V$$ contains the eigenvectors, $$S$$ contains the square roots of the eigenvalues.

### Exponentials of complex numbers

When taking the exponential of a complex number, we’re moving in the direction of that complex number, starting at 1.

$$e^x$$ can be thought of as: we are at a point $$e^x$$, our velocity at that point is also $$e^x$$.

• When $$e^x=1$$, we are at position 1 and will move with velocity 1.
• When $$e^x=2$$, we are at position 2 and will move with velocity 2.
• When $$e^{2x}=1$$, we are at position 1 but will move with velocity 2.
• When $$e^{2x}=2$$, we are at position 2 but will move with velocity 4.
• When $$e^{ix}=1$$, we are at coordinate (1, 0), but will move with velocity (0, 1) - 90 degrees counterclockwise from (1, 0).
• When $$e^{ix}=1i$$, we are at coordinate (0, 1), but will move with velocity (-1, 0) - 90 degrees counterclockwise from (0, 1).

### Reciprocal of complex numbers

$z^{-1} \\ = \frac{1}{a+bi} \\ = \frac{a-bi}{(a+bi)(a-bi)} \\ = \frac{a-bi}{a^2 + b^2} \\ = \frac{a}{a^2 + b^2} - \frac{bi}{a^2 + b^2}$

For unit length complex numbers, $$a^2 + b^2 = 1$$, so:

$(a + bi)^{-1} = a - bi$

Thus you can get the real element of a unit length complex number with:

$(z + z^{-1}) / 2$

## Even and odd functions

• Even: $$f(-x) = f(x)$$
• Odd: $$f(-x) = -f(x)$$

## Taylor series

Taking non-polynomial functions, and approximating them using polynomials.

• Say we want to approximate some function $$f$$ around zero.
• We approximate with $$g(x) = c_0 + c_1x + c_2x^2$$.
• We add the following constraints:
• $$f(0) = g(0)$$
• $$\frac{df}{dx}(0) = \frac{dg}{dx}(0)$$
• $$\frac{d^2f}{dx^2}(0) = \frac{d^2g}{dx^2}(0)$$
• When we differentiate $$g$$, we get:
• $$g(0) = c_0 + c_1x + c_2x^2 = c_0$$
• $$\frac{dg}{dx}(0) = c_1 + 2c_2x = c_1$$
• $$\frac{d^2g}{dx^2}(0) = 2c_2$$
• So, the constraints become the evaluations of differentials of $$f$$ at zero!

$g(x) = f(0) + \frac{1}{1!}\frac{df}{dx}(0)x + \frac{1}{2!}\frac{d^2f}{dx^2}(0)x^2$

### Moving away from zero

Taylor series maths only works if the result of the differentials only includes a single $$c_n$$ term. So we have to introduce an extra term $$s$$:

$g(x) = f(s) + \frac{1}{1!}\frac{df}{dx}(s)(x - s) + \frac{1}{2!}\frac{d^2f}{dx^2}(s)(x-s)^2$

## Derivative of multiplications

$y = f(x) g(x) \\ \frac{dy}{dx} = f(x) \frac{dg(x)}{dx} + \frac{df(x)}{dx} g(x)$

## Chain rule

$y = f(g(x)) \\ \frac{dy}{dx} = \frac{dy}{dg(x)} \frac{dg(x)}{dx}$

## Infimum

$\inf_{x \in (0, 1)} x = 0$

It’s the largest value that is smaller than all elements in a set.

## Newton’s method

### For root finding

Method for finding where a function $$f(x)$$ is equal to zero.

$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$

Derived from the first-order Taylor series expansion of $$f$$:

$f(x) = f(a) + f'(a)(x-a) \\ f(x_{n+1}) = f(x_n) + f'(x_n)(x_{n+1}-x_n) \\ f(x_{n+1}) = 0 \\ f(x_n) + f'(x_n)(x_{n+1}-x_n) = 0 \\ f(x_n) + x_{n+1}f'(x_n) - x_n f'(x_n) = 0 \\ x_{n+1}f'(x_n) = x_n f'(x_n) - f(x_n) \\ x_{n+1}= x_n -\frac{f(x_n)}{f'(x_n)} \\$

### For optimization

Like for root finding, except we want to find where $$f'(x) = 0$$, not where $$f(x) = 0$$. To do this we need to bring in the second derivative (Hessian) in the Taylor expansion.

### Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS)

An approximation of Newton’s method for optimization. Instead of computing the Hessian directly, we approximate it using a FIFO history of point-estimates and first derivatives.

## Affine

Linear plus translation: $$f(x) = Ax + b$$

## Bernoulli distribution

Simple true/false distribution with probability $$p$$.

A special case of the binomial distribution with $$n = 1$$.

## Big-O and variants

### Variants

• $$f(n) = O(g(n))$$ means $$f$$ scales no faster than $$g$$.
• $$f(n) = \Theta(g(n))$$ means $$f$$ scales in order with $$g$$.
• $$f(n) = \Omega(g(n))$$ means $$f$$ scales at least as fast as $$g$$.

### Formally

$$f(n) = \Theta(g(n))$$ states that there exists constants $$c, C > 0$$ such that $$c \cdot g(n) \le f(n) \le C \cdot g(n)$$ for all sufficiently large $$n$$.

## Spectral norm

$\Vert A \Vert_* := \max_{v \in \mathbb{R} \setminus \{0\}} \frac{\Vert Av \Vert_2}{\Vert v \Vert_2}$

i.e. the maximum amount that a matrix $$A$$ will scale up a vector $$v$$.