Math
Approximating factorials
\(x! \approx x^x e^{-x}\)
Binomial distribution
- \(f\) probability of \(1\), \((1-f)\) probability of \(0\).
- What’s the probability distribution of the number of \(1\)s, given \(N\) samples?
- \(P(r | f, N) = {N \choose r}f^r(1-f)^{N-r}\)
Mean and variance
- \(\text{mean}(r) = Nf\)
- \(\text{var}(r) = Nf(1-f)\)
Differentiation rules
Exponential
- \(f(x) = e^x\), \(f'(x) = e^x\)
- \(f(x) = a^x\), \(f'(x) = a^x \ln(a)\)
Logarithm
- \(f(x) = \log_e(x) = \ln(x)\), \(f'(x) = 1 / x\)
- \(f(x) = \log_a(x)\), \(f'(x) = 1 / (x \ln(a))\)
Linear algebra
Cross product
\[ A \times B = \left\Vert A \right\Vert \left\Vert B \right\Vert \sin{\theta} n \]
Dot product
\[ a \cdot b = \sum_i a_i b_i \]
Dot product intuition
\(a \cdot b\) measures how much \(a\) and \(b\) point in the same direction, scaled by their magnitude.
Gaussian
\[ P(x | \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp(-\frac{(x - \mu)^2}{2 \sigma^2}) \]
Exponential distribution
\[ P(x | \lambda) = \frac{e^{-\frac{x}{\lambda}}}{\mathcal{Z}} \]
where \(\mathcal{Z}\) is a normalizing factor so that \(\int P(x | \lambda) = 1\).
Bayes
\[ P(A | B) = \frac{P(B | A) P(A)}{P(B)} \]
\[ \text{posterior} = \text{likelihood ratio} \cdot \text{prior} \]
\[ \text{likelihood ratio} = \frac{P(B | A)}{P(B)} \]
Maximum Likelihood Estimate vs. Maximum a Priori
\[ \theta_{\text{MLE}} = \arg \max_\theta p(x | \theta) \\ \theta_{\text{MAP}} = \arg \max_\theta p(x | \theta) p(\theta) \]
If \(p(\theta)\) is uniform, \(\theta_{\text{MLE}} = \theta_{\text{MAP}}\).
Using logarithms to make calculations easier
For example, for Maximum a Priori, we can do the following:
- \(\arg \max_\theta p(x | \theta) p(\theta)\)
- \(\arg \max_\theta \Pi_i (p(x_i | \theta)) p(\theta)\)
- \(\arg \max_\theta \log \Pi_i (p(x_i | \theta)) p(\theta)\)
- \(\arg \max_\theta \Sigma_i \log (p(x_i | \theta)) + \log p(\theta)\)
Perplexity
Wiki.
\[ PP(x) = 2^{H(x)} \]
Properties of binary operations
Commutative
\[ f(a, b) = f(b, a) \]
Associativity
\[ f(a, f(b, c)) = f(f(a, b), c) \]
Distributive
\[ f(a, g(b, c)) = f(g(a, b), g(a, c)) \]
For example, we say multiplication distributes over addition.
Jacobian
Given a function \(f(x) = y\) where \(x\) and \(y\) are vectors, the gradient of \(y\) with respect to \(x\) is the Jacobian:
\[ \frac{\delta y}{\delta x} = J = \begin{bmatrix} \frac{\delta y_1}{\delta x} & \frac{\delta y_2}{\delta x} & \frac{\delta y_3}{\delta x} & ... \end{bmatrix} = \begin{bmatrix} \frac{\delta y_1}{\delta x_1} & \frac{\delta y_2}{\delta x_1} & \frac{\delta y_3}{\delta x_1} & ... \\ \frac{\delta y_1}{\delta x_2} & \frac{\delta y_2}{\delta x_2} & \frac{\delta y_3}{\delta x_2} & ... \\ ... & ... & ... & ... \end{bmatrix} \]
Polar coordinates
Specify coordinates by distance from a central point and angle (wiki).
Numbers
Euler’s constant
2.718
Golden ratio
1.618
Square root of 2
1.414
Speed of light
2.99e8 m/s
Age of the universe
1.37e10 years
Age of Earth
4.54e9 years
Age of life on Earth
3.7e9 years
Age of Humanity
3e5 years
Nats, hartleys, shannons
- Nat: Base \(e\) bits of information.
- Hartley: Base \(10\) bits of information.
- Shannons: Base \(2\) bits of information.
Cross entropy
Wikipedia. The cross entropy between two distributions \(p, q\) is the number of bits needed to identify an event in \(p\) using a coding scheme optimized for \(q\).
\[ H(p, q) = - E_p[\log q] \\ H(p, q) = - \sum_x p(x) \log q(x) \]
Transpose rules
Addition
\[ (A + B)^T = A^T + B^T \]
Multiplication by constant
\[ (kA)^T = k(A^T) \]
Mat muls
\[ (AB)^T = B^T A^T \]
Singular Vector Decomposition
\[ M = USV^T \]
where:
- \(U, V\) are orthonormal (full-rank & each row is a unit vector) “rotation” matrices.
- \(S\) is a diagonal “dilation” matrix
Relation to eigenvectors
\(V\) contains the eigenvectors, \(S\) contains the square roots of the eigenvalues.
Exponentials of complex numbers
When taking the exponential of a complex number, we’re moving in the direction of that complex number, starting at 1.
\(e^x\) can be thought of as: we are at a point \(e^x\), our velocity at that point is also \(e^x\).
- When \(e^x=1\), we are at position 1 and will move with velocity 1.
- When \(e^x=2\), we are at position 2 and will move with velocity 2.
- When \(e^{2x}=1\), we are at position 1 but will move with velocity 2.
- When \(e^{2x}=2\), we are at position 2 but will move with velocity 4.
- When \(e^{ix}=1\), we are at coordinate (1, 0), but will move with velocity (0, 1) - 90 degrees counterclockwise from (1, 0).
- When \(e^{ix}=1i\), we are at coordinate (0, 1), but will move with velocity (-1, 0) - 90 degrees counterclockwise from (0, 1).
Reciprocal of complex numbers
\[ z^{-1} \\ = \frac{1}{a+bi} \\ = \frac{a-bi}{(a+bi)(a-bi)} \\ = \frac{a-bi}{a^2 + b^2} \\ = \frac{a}{a^2 + b^2} - \frac{bi}{a^2 + b^2} \]
For unit length complex numbers, \(a^2 + b^2 = 1\), so:
\[ (a + bi)^{-1} = a - bi \]
Thus you can get the real element of a unit length complex number with:
\[ (z + z^{-1}) / 2 \]
Even and odd functions
- Even: \(f(-x) = f(x)\)
- Odd: \(f(-x) = -f(x)\)
Taylor series
Taking non-polynomial functions, and approximating them using polynomials.
- Say we want to approximate some function \(f\) around zero.
- We approximate with \(g(x) = c_0 + c_1x + c_2x^2\).
- We add the following constraints:
- \(f(0) = g(0)\)
- \(\frac{df}{dx}(0) = \frac{dg}{dx}(0)\)
- \(\frac{d^2f}{dx^2}(0) = \frac{d^2g}{dx^2}(0)\)
- When we differentiate \(g\), we get:
- \(g(0) = c_0 + c_1x + c_2x^2 = c_0\)
- \(\frac{dg}{dx}(0) = c_1 + 2c_2x = c_1\)
- \(\frac{d^2g}{dx^2}(0) = 2c_2\)
- So, the constraints become the evaluations of differentials of \(f\) at zero!
\[ g(x) = f(0) + \frac{1}{1!}\frac{df}{dx}(0)x + \frac{1}{2!}\frac{d^2f}{dx^2}(0)x^2 \]
Moving away from zero
Taylor series maths only works if the result of the differentials only includes a single \(c_n\) term. So we have to introduce an extra term \(s\):
\[ g(x) = f(s) + \frac{1}{1!}\frac{df}{dx}(s)(x - s) + \frac{1}{2!}\frac{d^2f}{dx^2}(s)(x-s)^2 \]
Derivative of multiplications
\[ y = f(x) g(x) \\ \frac{dy}{dx} = f(x) \frac{dg(x)}{dx} + \frac{df(x)}{dx} g(x) \]
Chain rule
\[ y = f(g(x)) \\ \frac{dy}{dx} = \frac{dy}{dg(x)} \frac{dg(x)}{dx} \]
Infimum
\[ \inf_{x \in (0, 1)} x = 0 \]
It’s the largest value that is smaller than all elements in a set.
Newton’s method
For root finding
Method for finding where a function \(f(x)\) is equal to zero.
\[ x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} \]
Derived from the first-order Taylor series expansion of \(f\):
\[ f(x) = f(a) + f'(a)(x-a) \\ f(x_{n+1}) = f(x_n) + f'(x_n)(x_{n+1}-x_n) \\ f(x_{n+1}) = 0 \\ f(x_n) + f'(x_n)(x_{n+1}-x_n) = 0 \\ f(x_n) + x_{n+1}f'(x_n) - x_n f'(x_n) = 0 \\ x_{n+1}f'(x_n) = x_n f'(x_n) - f(x_n) \\ x_{n+1}= x_n -\frac{f(x_n)}{f'(x_n)} \\ \]
For optimization
Like for root finding, except we want to find where \(f'(x) = 0\), not where \(f(x) = 0\). To do this we need to bring in the second derivative (Hessian) in the Taylor expansion.
Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS)
An approximation of Newton’s method for optimization. Instead of computing the Hessian directly, we approximate it using a FIFO history of point-estimates and first derivatives.
Affine
Linear plus translation: \(f(x) = Ax + b\)
Bernoulli distribution
Simple true/false distribution with probability \(p\).
A special case of the binomial distribution with \(n = 1\).
Big-O and variants
Variants
- \(f(n) = O(g(n))\) means \(f\) scales no faster than \(g\).
- \(f(n) = \Theta(g(n))\) means \(f\) scales in order with \(g\).
- \(f(n) = \Omega(g(n))\) means \(f\) scales at least as fast as \(g\).
Formally
\(f(n) = \Theta(g(n))\) states that there exists constants \(c, C > 0\) such that \(c \cdot g(n) \le f(n) \le C \cdot g(n)\) for all sufficiently large \(n\).
Spectral norm
\[ \Vert A \Vert_* := \max_{v \in \mathbb{R} \setminus \{0\}} \frac{\Vert Av \Vert_2}{\Vert v \Vert_2} \]
i.e. the maximum amount that a matrix \(A\) will scale up a vector \(v\).