Math

Approximating factorials

\(x! \approx x^x e^{-x}\)

Binomial distribution

\(f\) probability of \(1\), \((1-f)\) probability of \(0\).
What’s the probability distribution of the number of \(1\)s, given \(N\) samples?
\(P(r | f, N) = {N \choose r}f^r(1-f)^{N-r}\)

Mean and variance

\(\text{mean}(r) = Nf\)
\(\text{var}(r) = Nf(1-f)\)

Differentiation rules

Exponential

\(f(x) = e^x\), \(f'(x) = e^x\)
\(f(x) = a^x\), \(f'(x) = a^x \ln(a)\)

Logarithm

\(f(x) = \log_e(x) = \ln(x)\), \(f'(x) = 1 / x\)
\(f(x) = \log_a(x)\), \(f'(x) = 1 / (x \ln(a))\)

Linear algebra

Cross product

\[ A \times B = \left\Vert A \right\Vert \left\Vert B \right\Vert \sin{\theta} n \]

Dot product

\[ a \cdot b = \sum_i a_i b_i \]

Dot product intuition

\(a \cdot b\) measures how much \(a\) and \(b\) point in the same direction, scaled by their magnitude.

Gaussian

\[ P(x | \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp(-\frac{(x - \mu)^2}{2 \sigma^2}) \]

Exponential distribution

\[ P(x | \lambda) = \frac{e^{-\frac{x}{\lambda}}}{\mathcal{Z}} \]

where \(\mathcal{Z}\) is a normalizing factor so that \(\int P(x | \lambda) = 1\).

Bayes

\[ P(A | B) = \frac{P(B | A) P(A)}{P(B)} \]

\[ \text{posterior} = \text{likelihood ratio} \cdot \text{prior} \]

\[ \text{likelihood ratio} = \frac{P(B | A)}{P(B)} \]

Maximum Likelihood Estimate vs. Maximum a Priori

\[ \theta_{\text{MLE}} = \arg \max_\theta p(x | \theta) \\ \theta_{\text{MAP}} = \arg \max_\theta p(x | \theta) p(\theta) \]

If \(p(\theta)\) is uniform, \(\theta_{\text{MLE}} = \theta_{\text{MAP}}\).

Using logarithms to make calculations easier

For example, for Maximum a Priori, we can do the following:

\(\arg \max_\theta p(x | \theta) p(\theta)\)
\(\arg \max_\theta \Pi_i (p(x_i | \theta)) p(\theta)\)
\(\arg \max_\theta \log \Pi_i (p(x_i | \theta)) p(\theta)\)
\(\arg \max_\theta \Sigma_i \log (p(x_i | \theta)) + \log p(\theta)\)

Perplexity

Wiki.

\[ PP(x) = 2^{H(x)} \]

Properties of binary operations

Commutative

\[ f(a, b) = f(b, a) \]

Associativity

\[ f(a, f(b, c)) = f(f(a, b), c) \]

Distributive

\[ f(a, g(b, c)) = f(g(a, b), g(a, c)) \]

For example, we say multiplication distributes over addition.

Jacobian

Given a function \(f(x) = y\) where \(x\) and \(y\) are vectors, the gradient of \(y\) with respect to \(x\) is the Jacobian:

\[ \frac{\delta y}{\delta x} = J = \begin{bmatrix} \frac{\delta y_1}{\delta x} & \frac{\delta y_2}{\delta x} & \frac{\delta y_3}{\delta x} & ... \end{bmatrix} = \begin{bmatrix} \frac{\delta y_1}{\delta x_1} & \frac{\delta y_2}{\delta x_1} & \frac{\delta y_3}{\delta x_1} & ... \\ \frac{\delta y_1}{\delta x_2} & \frac{\delta y_2}{\delta x_2} & \frac{\delta y_3}{\delta x_2} & ... \\ ... & ... & ... & ... \end{bmatrix} \]

Polar coordinates

Specify coordinates by distance from a central point and angle (wiki).

Numbers

Euler’s constant

2.718

Golden ratio

1.618

Square root of 2

1.414

Speed of light

2.99e8 m/s

Age of the universe

1.37e10 years

Age of Earth

4.54e9 years

Age of life on Earth

3.7e9 years

Age of Humanity

3e5 years

Nats, hartleys, shannons

Nat: Base \(e\) bits of information.
Hartley: Base \(10\) bits of information.
Shannons: Base \(2\) bits of information.

Cross entropy

Wikipedia. The cross entropy between two distributions \(p, q\) is the number of bits needed to identify an event in \(p\) using a coding scheme optimized for \(q\).

\[ H(p, q) = - E_p[\log q] \\ H(p, q) = - \sum_x p(x) \log q(x) \]

Transpose rules

Addition

\[ (A + B)^T = A^T + B^T \]

Multiplication by constant

\[ (kA)^T = k(A^T) \]

Mat muls

\[ (AB)^T = B^T A^T \]

Singular Vector Decomposition

\[ M = USV^T \]

where:

\(U, V\) are orthonormal (full-rank & each row is a unit vector) “rotation” matrices.
\(S\) is a diagonal “dilation” matrix

Relation to eigenvectors

\(V\) contains the eigenvectors, \(S\) contains the square roots of the eigenvalues.

Exponentials of complex numbers

When taking the exponential of a complex number, we’re moving in the direction of that complex number, starting at 1.

\(e^x\) can be thought of as: we are at a point \(e^x\), our velocity at that point is also \(e^x\).

When \(e^x=1\), we are at position 1 and will move with velocity 1.
When \(e^x=2\), we are at position 2 and will move with velocity 2.
When \(e^{2x}=1\), we are at position 1 but will move with velocity 2.
When \(e^{2x}=2\), we are at position 2 but will move with velocity 4.
When \(e^{ix}=1\), we are at coordinate (1, 0), but will move with velocity (0, 1) - 90 degrees counterclockwise from (1, 0).
When \(e^{ix}=1i\), we are at coordinate (0, 1), but will move with velocity (-1, 0) - 90 degrees counterclockwise from (0, 1).

Reciprocal of complex numbers

\[ z^{-1} \\ = \frac{1}{a+bi} \\ = \frac{a-bi}{(a+bi)(a-bi)} \\ = \frac{a-bi}{a^2 + b^2} \\ = \frac{a}{a^2 + b^2} - \frac{bi}{a^2 + b^2} \]

For unit length complex numbers, \(a^2 + b^2 = 1\), so:

\[ (a + bi)^{-1} = a - bi \]

Thus you can get the real element of a unit length complex number with:

\[ (z + z^{-1}) / 2 \]

Even and odd functions

Even: \(f(-x) = f(x)\)
Odd: \(f(-x) = -f(x)\)

Taylor series

Taking non-polynomial functions, and approximating them using polynomials.

Say we want to approximate some function \(f\) around zero.
We approximate with \(g(x) = c_0 + c_1x + c_2x^2\).
We add the following constraints:
- \(f(0) = g(0)\)
- \(\frac{df}{dx}(0) = \frac{dg}{dx}(0)\)
- \(\frac{d^2f}{dx^2}(0) = \frac{d^2g}{dx^2}(0)\)
When we differentiate \(g\), we get:
- \(g(0) = c_0 + c_1x + c_2x^2 = c_0\)
- \(\frac{dg}{dx}(0) = c_1 + 2c_2x = c_1\)
- \(\frac{d^2g}{dx^2}(0) = 2c_2\)
So, the constraints become the evaluations of differentials of \(f\) at zero!

\[ g(x) = f(0) + \frac{1}{1!}\frac{df}{dx}(0)x + \frac{1}{2!}\frac{d^2f}{dx^2}(0)x^2 \]

Moving away from zero

Taylor series maths only works if the result of the differentials only includes a single \(c_n\) term. So we have to introduce an extra term \(s\):

\[ g(x) = f(s) + \frac{1}{1!}\frac{df}{dx}(s)(x - s) + \frac{1}{2!}\frac{d^2f}{dx^2}(s)(x-s)^2 \]

Derivative of multiplications

\[ y = f(x) g(x) \\ \frac{dy}{dx} = f(x) \frac{dg(x)}{dx} + \frac{df(x)}{dx} g(x) \]

Chain rule

\[ y = f(g(x)) \\ \frac{dy}{dx} = \frac{dy}{dg(x)} \frac{dg(x)}{dx} \]

Infimum

\[ \inf_{x \in (0, 1)} x = 0 \]

It’s the largest value that is smaller than all elements in a set.

Newton’s method

For root finding

Method for finding where a function \(f(x)\) is equal to zero.

\[ x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} \]

Derived from the first-order Taylor series expansion of \(f\):

\[ f(x) = f(a) + f'(a)(x-a) \\ f(x_{n+1}) = f(x_n) + f'(x_n)(x_{n+1}-x_n) \\ f(x_{n+1}) = 0 \\ f(x_n) + f'(x_n)(x_{n+1}-x_n) = 0 \\ f(x_n) + x_{n+1}f'(x_n) - x_n f'(x_n) = 0 \\ x_{n+1}f'(x_n) = x_n f'(x_n) - f(x_n) \\ x_{n+1}= x_n -\frac{f(x_n)}{f'(x_n)} \\ \]

For optimization

Like for root finding, except we want to find where \(f'(x) = 0\), not where \(f(x) = 0\). To do this we need to bring in the second derivative (Hessian) in the Taylor expansion.

Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS)

An approximation of Newton’s method for optimization. Instead of computing the Hessian directly, we approximate it using a FIFO history of point-estimates and first derivatives.

Affine

Linear plus translation: \(f(x) = Ax + b\)

Bernoulli distribution

Simple true/false distribution with probability \(p\).

A special case of the binomial distribution with \(n = 1\).

Big-O and variants

Variants

\(f(n) = O(g(n))\) means \(f\) scales no faster than \(g\).
\(f(n) = \Theta(g(n))\) means \(f\) scales in order with \(g\).
\(f(n) = \Omega(g(n))\) means \(f\) scales at least as fast as \(g\).

Formally

\(f(n) = \Theta(g(n))\) states that there exists constants \(c, C > 0\) such that \(c \cdot g(n) \le f(n) \le C \cdot g(n)\) for all sufficiently large \(n\).

Spectral norm

\[ \Vert A \Vert_* := \max_{v \in \mathbb{R} \setminus \{0\}} \frac{\Vert Av \Vert_2}{\Vert v \Vert_2} \]

i.e. the maximum amount that a matrix \(A\) will scale up a vector \(v\).