Math

Approximating factorials

Stirling's approximation: \(x! \approx x^x e^{-x}\). The more accurate form multiplies by \(\sqrt{2 \pi x}\):

\[ x! \approx \sqrt{2 \pi x} \, x^x e^{-x} \]
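As a sanity check, for \(x = 5\) (true value \(5! = 120\)):

\[ 5^5 e^{-5} \approx 21.1, \qquad \sqrt{2 \pi \cdot 5} \cdot 5^5 e^{-5} \approx 118.0 \]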

Binomial distribution

Mean and variance

For \(n\) trials each succeeding with probability \(p\):

\[ E[X] = np, \qquad \mathrm{Var}(X) = np(1 - p) \]

Differentiation rules

Exponential

\[ \frac{d}{dx} e^x = e^x \]

Logarithm

\[ \frac{d}{dx} \ln x = \frac{1}{x} \]

Linear algebra

Cross product

\[ A \times B = \left\Vert A \right\Vert \left\Vert B \right\Vert \sin(\theta) \, \hat{n} \]

where \(\theta\) is the angle between \(A\) and \(B\), and \(\hat{n}\) is the unit vector perpendicular to both.

Dot product

\[ a \cdot b = \sum_i a_i b_i \]

Dot product intuition

\(a \cdot b\) measures how much \(a\) and \(b\) point in the same direction, scaled by their magnitudes.
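This follows from the geometric form, where \(\theta\) is the angle between the two vectors:

\[ a \cdot b = \left\Vert a \right\Vert \left\Vert b \right\Vert \cos{\theta} \]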

Gaussian

\[ P(x | \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp(-\frac{(x - \mu)^2}{2 \sigma^2}) \]

Exponential distribution

\[ P(x | \lambda) = \frac{e^{-\frac{x}{\lambda}}}{\mathcal{Z}} \]

where \(\mathcal{Z}\) is a normalizing factor so that \(\int P(x | \lambda) \, dx = 1\).
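For this parametrization, with support \(x \ge 0\), the normalizer works out to \(\lambda\):

\[ \mathcal{Z} = \int_0^\infty e^{-\frac{x}{\lambda}} \, dx = \lambda \]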

Bayes

\[ P(A | B) = \frac{P(B | A) P(A)}{P(B)} \]

\[ \text{posterior} = \text{likelihood ratio} \cdot \text{prior} \]

\[ \text{likelihood ratio} = \frac{P(B | A)}{P(B)} \]
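A made-up numeric example: with prior \(P(A) = 0.01\), likelihood \(P(B | A) = 0.9\), and evidence \(P(B) = 0.05\), the likelihood ratio is \(18\) and:

\[ P(A | B) = \frac{0.9 \times 0.01}{0.05} = 0.18 \]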

Maximum Likelihood Estimate vs. Maximum a Posteriori

\[ \theta_{\text{MLE}} = \arg \max_\theta p(x | \theta) \\ \theta_{\text{MAP}} = \arg \max_\theta p(x | \theta) p(\theta) \]

If \(p(\theta)\) is uniform, \(\theta_{\text{MLE}} = \theta_{\text{MAP}}\).

Using logarithms to make calculations easier

For example, for Maximum a Posteriori, taking the logarithm turns the product into a sum without moving the argmax (since \(\log\) is monotonic):

\[ \theta_{\text{MAP}} = \arg \max_\theta \left[ \log p(x | \theta) + \log p(\theta) \right] \]

Perplexity

Wiki.

\[ PP(x) = 2^{H(x)} \]

where \(H(x)\) is the entropy measured in bits.
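For example, a uniform distribution over \(k\) outcomes has \(H = \log_2 k\) bits, so \(PP = 2^{\log_2 k} = k\): perplexity measures the effective number of equally likely outcomes.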

Properties of binary operations

Commutative

\[ f(a, b) = f(b, a) \]

Associative

\[ f(a, f(b, c)) = f(f(a, b), c) \]

Distributive

\[ f(a, g(b, c)) = g(f(a, b), f(a, c)) \]

For example, we say multiplication distributes over addition.
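Writing that out with \(f\) as multiplication and \(g\) as addition:

\[ a \times (b + c) = (a \times b) + (a \times c) \]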

Jacobian

Given a function \(f(x) = y\) where \(x\) and \(y\) are vectors, the matrix of all partial derivatives of \(y\) with respect to \(x\) is the Jacobian:

\[ \frac{\partial y}{\partial x} = J = \begin{bmatrix} \frac{\partial y_1}{\partial x} & \frac{\partial y_2}{\partial x} & \frac{\partial y_3}{\partial x} & \dots \end{bmatrix} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \frac{\partial y_3}{\partial x_1} & \dots \\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \frac{\partial y_3}{\partial x_2} & \dots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix} \]
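A concrete example, under the same layout (rows indexed by \(x\), columns by \(y\); note this is the transpose of the other common convention, \(J_{ij} = \partial y_i / \partial x_j\)), for \(y = (x_1 x_2, \; x_1 + x_2)\):

\[ J = \begin{bmatrix} x_2 & 1 \\ x_1 & 1 \end{bmatrix} \]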

Polar coordinates

Specify a point by its distance \(r\) from a central point and its angle \(\theta\) from a reference direction (wiki).
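Conversion to Cartesian coordinates:

\[ x = r \cos{\theta}, \qquad y = r \sin{\theta} \]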

Numbers

Euler’s number \(e\)

2.718

Golden ratio

1.618

Square root of 2

1.414

Speed of light

2.99e8 m/s

Age of the universe

1.37e10 years

Age of Earth

4.54e9 years

Age of life on Earth

3.7e9 years

Age of Humanity

3e5 years

Nats, hartleys, shannons

Units of information: entropy is in nats when computed with \(\log_e\), hartleys with \(\log_{10}\), and shannons (bits) with \(\log_2\).

Cross entropy

Wikipedia. The cross entropy between two distributions \(p, q\) is the average number of bits needed to identify an event drawn from \(p\) when using a coding scheme optimized for \(q\).

\[ H(p, q) = - E_p[\log q] \\ H(p, q) = - \sum_x p(x) \log q(x) \]
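Cross entropy decomposes into the entropy of \(p\) plus the KL divergence from \(p\) to \(q\), the extra bits paid for coding with the wrong distribution:

\[ H(p, q) = H(p) + D_{KL}(p \Vert q) \]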

Transpose rules

Addition

\[ (A + B)^T = A^T + B^T \]

Multiplication by constant

\[ (kA)^T = k(A^T) \]

Mat muls

\[ (AB)^T = B^T A^T \]

Singular Value Decomposition

\[ M = USV^T \]

where \(U\) and \(V\) are orthogonal matrices whose columns are the left and right singular vectors, and \(S\) is a diagonal matrix of singular values.

Relation to eigenvectors

\(V\) contains the eigenvectors of \(M^T M\) (and \(U\) those of \(M M^T\)); the diagonal of \(S\) contains the square roots of the corresponding eigenvalues.
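This follows from substituting the decomposition into \(M^T M\) (using \(U^T U = I\)):

\[ M^T M = (U S V^T)^T (U S V^T) = V S^T U^T U S V^T = V (S^T S) V^T \]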

Exponentials of complex numbers

When taking the exponential of a complex number, we’re moving in the direction of that complex number, starting at 1.

\(e^x\) can be thought of as: we are at the point \(e^x\), and our velocity at that point is also \(e^x\) (since \(\frac{d}{dx} e^x = e^x\)). With an imaginary exponent the velocity \(i e^{ix}\) is always perpendicular to the position, so \(e^{ix}\) travels around the unit circle.
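Following that motion for time \(\theta\) gives Euler's formula:

\[ e^{i\theta} = \cos{\theta} + i \sin{\theta} \]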

Reciprocal of complex numbers

\[ z^{-1} \\ = \frac{1}{a+bi} \\ = \frac{a-bi}{(a+bi)(a-bi)} \\ = \frac{a-bi}{a^2 + b^2} \\ = \frac{a}{a^2 + b^2} - \frac{bi}{a^2 + b^2} \]

For unit length complex numbers, \(a^2 + b^2 = 1\), so:

\[ (a + bi)^{-1} = a - bi \]

Thus you can get the real part of a unit-length complex number with:

\[ (z + z^{-1}) / 2 \]
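For example, for \(z = e^{i\theta} = \cos{\theta} + i \sin{\theta}\):

\[ \frac{z + z^{-1}}{2} = \frac{e^{i\theta} + e^{-i\theta}}{2} = \cos{\theta} \]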

Even and odd functions

Even: \(f(-x) = f(x)\), e.g. \(\cos\). Odd: \(f(-x) = -f(x)\), e.g. \(\sin\).

Taylor series

Taking non-polynomial functions and approximating them using polynomials.

\[ g(x) = f(0) + \frac{1}{1!}\frac{df}{dx}(0)x + \frac{1}{2!}\frac{d^2f}{dx^2}(0)x^2 + \dots \]
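For example, expanding \(\cos\) around zero: \(\cos(0) = 1\), \(\cos'(0) = -\sin(0) = 0\), \(\cos''(0) = -\cos(0) = -1\), so:

\[ \cos x \approx 1 - \frac{x^2}{2} \]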

Moving away from zero

The construction works because evaluating the \(n\)-th derivative at the expansion point isolates a single coefficient \(c_n\). To expand around a point \(s\) other than zero, evaluate the derivatives at \(s\) and use powers of \((x - s)\):

\[ g(x) = f(s) + \frac{1}{1!}\frac{df}{dx}(s)(x - s) + \frac{1}{2!}\frac{d^2f}{dx^2}(s)(x-s)^2 + \dots \]

Derivative of multiplications

\[ y = f(x) g(x) \\ \frac{dy}{dx} = f(x) \frac{dg(x)}{dx} + \frac{df(x)}{dx} g(x) \]
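For example, with \(f(x) = x\) and \(g(x) = e^x\):

\[ \frac{d}{dx} \left( x e^x \right) = x e^x + e^x \]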

Chain rule

\[ y = f(g(x)) \\ \frac{dy}{dx} = \frac{dy}{dg(x)} \frac{dg(x)}{dx} \]
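For example, with \(g(x) = x^2\) and \(y = \sin(g(x))\):

\[ \frac{dy}{dx} = \cos(x^2) \cdot 2x \]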

Infimum

\[ \inf_{x \in (0, 1)} x = 0 \]

It’s the greatest lower bound: the largest value that is less than or equal to every element of the set. Unlike a minimum, it need not belong to the set (here \(0 \notin (0, 1)\)).

Newton’s method

For root finding

Method for finding where a function \(f(x)\) is equal to zero.

\[ x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} \]

Derived from the first-order Taylor series expansion of \(f\):

\[ f(x) = f(a) + f'(a)(x-a) \\ f(x_{n+1}) = f(x_n) + f'(x_n)(x_{n+1}-x_n) \\ f(x_{n+1}) = 0 \\ f(x_n) + f'(x_n)(x_{n+1}-x_n) = 0 \\ f(x_n) + x_{n+1}f'(x_n) - x_n f'(x_n) = 0 \\ x_{n+1}f'(x_n) = x_n f'(x_n) - f(x_n) \\ x_{n+1}= x_n -\frac{f(x_n)}{f'(x_n)} \\ \]
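For example, approximating \(\sqrt{2}\) as a root of \(f(x) = x^2 - 2\):

\[ x_{n+1} = x_n - \frac{x_n^2 - 2}{2 x_n} \]

Starting from \(x_0 = 1.5\): \(x_1 \approx 1.41667\), \(x_2 \approx 1.41422\), already close to \(\sqrt{2} \approx 1.41421\).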

For optimization

Like root finding, except we want to find where \(f'(x) = 0\) rather than where \(f(x) = 0\). To do this we bring the second derivative (the Hessian, in the multivariate case) into the Taylor expansion.
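Applying the root-finding update to \(f'\) instead of \(f\):

\[ x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)} \]

or, in the multivariate case, \(x_{n+1} = x_n - H^{-1} \nabla f(x_n)\) where \(H\) is the Hessian.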

Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS)

An approximation of Newton’s method for optimization. Instead of computing the Hessian directly, we approximate its inverse using a fixed-size FIFO history of recent positions and gradients.

Affine

Linear plus translation: \(f(x) = Ax + b\)

Bernoulli distribution

Simple true/false distribution with probability \(p\).

A special case of the binomial distribution with \(n = 1\).
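Setting \(n = 1\) in the binomial mean and variance:

\[ E[X] = p, \qquad \mathrm{Var}(X) = p(1 - p) \]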

Big-O and variants

Variants

\(O\): asymptotic upper bound. \(\Omega\): asymptotic lower bound. \(\Theta\): tight bound (both). \(o, \omega\): strict (non-tight) versions of \(O\) and \(\Omega\).

Formally

\(f(n) = \Theta(g(n))\) states that there exist constants \(c, C > 0\) such that \(c \cdot g(n) \le f(n) \le C \cdot g(n)\) for all sufficiently large \(n\).
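For example, \(3n^2 + 5n = \Theta(n^2)\): take \(c = 3\) and \(C = 4\), since \(3n^2 \le 3n^2 + 5n \le 4n^2\) for all \(n \ge 5\).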

Spectral norm

\[ \Vert A \Vert_* := \max_{v \in \mathbb{R}^n \setminus \{0\}} \frac{\Vert Av \Vert_2}{\Vert v \Vert_2} \]

i.e. the maximum factor by which the matrix \(A\) can scale a vector \(v\); equivalently, the largest singular value of \(A\).