Math
Approximating factorials
\(x! \approx x^x e^{-x}\) (Stirling's approximation to leading order; the full form multiplies by \(\sqrt{2 \pi x}\))
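A quick numeric sanity check of both forms, as a minimal Python sketch (the value \(x = 10\) is arbitrary):

```python
import math

x = 10
exact = math.factorial(x)                    # 3628800
crude = x**x * math.exp(-x)                  # ~4.54e5 (leading-order form)
full = math.sqrt(2 * math.pi * x) * crude    # ~3.60e6 (full Stirling form)

print(exact, crude, full)
```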
Binomial distribution
- Each sample is \(1\) with probability \(f\) and \(0\) with probability \(1-f\).
- What’s the probability distribution of the number of \(1\)s, given \(N\) samples?
- \(P(r | f, N) = {N \choose r}f^r(1-f)^{N-r}\)
Mean and variance
- \(\operatorname{mean}(r) = Nf\)
- \(\operatorname{var}(r) = Nf(1-f)\)
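A minimal Python sketch checking the pmf, mean, and variance (the values \(f = 0.3\), \(N = 10\) are just illustrative):

```python
import math

def binom_pmf(r, f, N):
    """P(r | f, N) = C(N, r) f^r (1 - f)^(N - r)."""
    return math.comb(N, r) * f**r * (1 - f)**(N - r)

f, N = 0.3, 10
probs = [binom_pmf(r, f, N) for r in range(N + 1)]

print(sum(probs))                                        # ~1.0 (normalized)
mean = sum(r * p for r, p in zip(range(N + 1), probs))
var = sum((r - mean)**2 * p for r, p in zip(range(N + 1), probs))
print(mean, N * f)             # both ~3.0
print(var, N * f * (1 - f))    # both ~2.1
```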
Differentiation rules
Exponential
- \(f(x) = e^x\), \(f'(x) = e^x\)
- \(f(x) = a^x\), \(f'(x) = a^x \ln(a)\)
Logarithm
- \(f(x) = \log_e(x) = \ln(x)\), \(f'(x) = 1 / x\)
- \(f(x) = \log_a(x)\), \(f'(x) = 1 / (x \ln(a))\)
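A quick finite-difference check of the \(a^x\) and \(\log_a(x)\) rules (a minimal sketch; \(a = 3\), \(x = 2\) are arbitrary):

```python
import math

a, x, h = 3.0, 2.0, 1e-6

# d/dx a^x = a^x ln(a)
numeric = (a**(x + h) - a**(x - h)) / (2 * h)
analytic = a**x * math.log(a)
print(numeric, analytic)   # both ~9.89

# d/dx log_a(x) = 1 / (x ln(a))
numeric = (math.log(x + h, a) - math.log(x - h, a)) / (2 * h)
analytic = 1 / (x * math.log(a))
print(numeric, analytic)   # both ~0.455
```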
Linear algebra
Cross product
\[ A \times B = \left\Vert A \right\Vert \left\Vert B \right\Vert \sin(\theta) \, \hat{n} \] where \(\hat{n}\) is the unit vector perpendicular to both \(A\) and \(B\) (right-hand rule).
Dot product
\[ a \cdot b = \sum_i a_i b_i \]
Dot product intuition
\(a \cdot b\) measures how much \(a\) and \(b\) point in the same direction, scaled by their magnitudes: \(a \cdot b = \Vert a \Vert \Vert b \Vert \cos(\theta)\).
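A small NumPy sketch illustrating both products (the vectors are chosen arbitrarily):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, -1.0])

# Dot product: sum of elementwise products, equals |a||b|cos(theta).
dot = a @ b
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))
print(dot, cos_theta)   # 1.0, ~0.065 -> nearly perpendicular

# Cross product: magnitude |a||b|sin(theta), direction perpendicular to both.
cross = np.cross(a, b)
print(np.linalg.norm(cross),
      np.linalg.norm(a) * np.linalg.norm(b) * np.sqrt(1 - cos_theta**2))  # same magnitude
print(cross @ a, cross @ b)   # ~0, ~0: perpendicular to a and to b
```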
Gaussian
\[ P(x | \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp(-\frac{(x - \mu)^2}{2 \sigma^2}) \]
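A minimal Python sketch of the density, numerically checking that it integrates to 1 (the values of \(\mu\), \(\sigma\), and the integration grid are arbitrary):

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

mu, sigma = 1.0, 2.0
dx = 0.001
xs = [mu - 10 * sigma + i * dx for i in range(int(20 * sigma / dx))]
print(sum(gaussian_pdf(x, mu, sigma) * dx for x in xs))   # ~1.0
```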
Exponential distribution
\[ P(x | \lambda) = \frac{e^{-\frac{x}{\lambda}}}{\mathcal{Z}} \] where \(\mathcal{Z}\) is a normalizing factor so that \(\int P(x | \lambda) \, dx = 1\) (for support \(x \ge 0\), \(\mathcal{Z} = \lambda\)).
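A quick numeric check that \(\mathcal{Z} = \lambda\) under this parameterization (the value \(\lambda = 2.5\) is arbitrary):

```python
import math

lam = 2.5
dx = 0.001
# Riemann sum of exp(-x / lambda) over x >= 0.
Z = sum(math.exp(-(i * dx) / lam) * dx for i in range(int(50 * lam / dx)))
print(Z, lam)   # both ~2.5
```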
Bayes
\[ P(A | B) = \frac{P(B | A) P(A)}{P(B)} \] \[ \text{posterior} = \text{likelihood ratio} \cdot \text{prior} \] \[ \text{likelihood ratio} = \frac{P(B | A)}{P(B)} \]
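A worked example in Python (the numbers are made up for illustration): a test with 99% sensitivity and a 5% false-positive rate for a condition with 1% prevalence.

```python
p_a = 0.01               # prior P(A): prevalence
p_b_given_a = 0.99       # likelihood P(B | A): sensitivity
p_b_given_not_a = 0.05   # false-positive rate

# Evidence P(B) via the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

posterior = p_b_given_a * p_a / p_b
print(posterior)   # ~0.167: a positive test still leaves only ~17% probability
```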
Maximum Likelihood Estimate vs. Maximum a Priori
\[ \theta_{\text{MLE}} = \arg \max_\theta p(x | \theta) \\ \theta_{\text{MAP}} = \arg \max_\theta p(x | \theta) p(\theta) \]
If \(p(\theta)\) is uniform, \(\theta_{\text{MLE}} = \theta_{\text{MAP}}\).
Using logarithms to make calculations easier
For example, for Maximum a Priori, we can do the following:
- \(\arg \max_\theta p(x | \theta) p(\theta)\)
- \(\arg \max_\theta \prod_i p(x_i | \theta) \, p(\theta)\) (assuming i.i.d. samples)
- \(\arg \max_\theta \log \left[ \prod_i p(x_i | \theta) \, p(\theta) \right]\) (taking the log doesn't change the argmax)
- \(\arg \max_\theta \sum_i \log p(x_i | \theta) + \log p(\theta)\) (see the sketch below)
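A minimal sketch of both estimates for a biased-coin model using the log trick and a grid search over \(\theta\) (the Beta(2, 2) prior and the data are illustrative assumptions):

```python
import math

data = [1, 1, 1, 0, 1, 1, 0, 1]   # 6 heads, 2 tails (made-up sample)

def log_likelihood(theta):
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data)

def log_prior(theta):
    # Beta(2, 2) prior, up to an additive constant.
    return math.log(theta) + math.log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=log_likelihood)
theta_map = max(grid, key=lambda t: log_likelihood(t) + log_prior(t))
print(theta_mle, theta_map)   # ~0.75 (= 6/8), ~0.70 (= (6+1)/(8+2))
```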
Perplexity
Wiki.
\[ PP(x) = 2^{H(x)} \] where \(H(x)\) is the entropy in bits.
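A small sketch computing the perplexity of a discrete distribution (the example distribution is arbitrary):

```python
import math

p = [0.5, 0.25, 0.125, 0.125]   # example distribution

# Entropy in bits, then perplexity = 2^H.
H = -sum(pi * math.log2(pi) for pi in p)
print(H, 2**H)   # 1.75 bits, perplexity ~3.36
```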
Properties of binary operations
Commutative
\[ f(a, b) = f(b, a) \]
Associativity
\[ f(a, f(b, c)) = f(f(a, b), c) \]
Distributive
\[ f(a, g(b, c)) = g(f(a, b), f(a, c)) \]
For example, we say multiplication distributes over addition.
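Quick spot checks in Python (illustrative integer values, not a proof):

```python
from operator import add, mul

a, b, c = 2, 3, 5

assert mul(a, b) == mul(b, a)                           # commutative
assert mul(a, mul(b, c)) == mul(mul(a, b), c)           # associative
assert mul(a, add(b, c)) == add(mul(a, b), mul(a, c))   # * distributes over +
assert add(a, mul(b, c)) != mul(add(a, b), add(a, c))   # + does not distribute over *
```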
Jacobian
Given a function \(f(x) = y\) where \(x\) and \(y\) are vectors, the gradient of \(y\) with respect to \(x\) is the Jacobian:
\[ \frac{\partial y}{\partial x} = J = \begin{bmatrix} \frac{\partial y_1}{\partial x} & \frac{\partial y_2}{\partial x} & \frac{\partial y_3}{\partial x} & ... \end{bmatrix} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \frac{\partial y_3}{\partial x_1} & ... \\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \frac{\partial y_3}{\partial x_2} & ... \\ ... & ... & ... & ... \end{bmatrix} \]
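A minimal finite-difference sketch using the same layout (rows indexed by \(x_i\), columns by \(y_j\)); the example function is arbitrary:

```python
import numpy as np

def f(x):
    # Example function from R^2 to R^3.
    return np.array([x[0] * x[1], x[0]**2, np.sin(x[1])])

def jacobian(f, x, h=1e-6):
    y = f(x)
    J = np.zeros((len(x), len(y)))   # J[i, j] = dy_j / dx_i
    for i in range(len(x)):
        dx = np.zeros_like(x)
        dx[i] = h
        J[i] = (f(x + dx) - f(x - dx)) / (2 * h)
    return J

x = np.array([1.0, 2.0])
print(jacobian(f, x))
# Row 0 (d/dx_0): [x_1, 2*x_0, 0]      = [2, 2, 0]
# Row 1 (d/dx_1): [x_0, 0, cos(x_1)]   = [1, 0, -0.416...]
```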
Polar coordinates
Specify coordinates by distance \(r\) from a central point and angle \(\theta\) from a reference direction (wiki).
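Converting between polar \((r, \theta)\) and Cartesian \((x, y)\) coordinates, as a small sketch:

```python
import math

def polar_to_cartesian(r, theta):
    return r * math.cos(theta), r * math.sin(theta)

def cartesian_to_polar(x, y):
    return math.hypot(x, y), math.atan2(y, x)

print(polar_to_cartesian(2.0, math.pi / 4))   # (~1.414, ~1.414)
print(cartesian_to_polar(1.0, 1.0))           # (~1.414, ~0.785 rad)
```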
Numbers
Euler’s number
2.718
Golden ratio
1.618
Square root of 2
1.414
Speed of light
2.99e8 m/s
Age of the universe
1.37e10 years
Age of Earth
4.54e9 years
Age of life on Earth
3.7e9 years
Age of Humanity
3e5 years
Nats, hartleys, shannons
- Nat: information measured with base-\(e\) (natural) logarithms.
- Hartley: information measured with base-\(10\) logarithms.
- Shannon: information measured with base-\(2\) logarithms (one shannon = one bit).
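Converting between units is just a change of logarithm base; a small sketch (the probability value is arbitrary):

```python
import math

# Information content of an event with probability p, in each unit.
p = 0.25
print(-math.log(p))      # nats:     ~1.386
print(-math.log10(p))    # hartleys: ~0.602
print(-math.log2(p))     # shannons: 2.0 bits

# 1 nat = 1 / ln(2) ~ 1.443 bits.
print(-math.log(p) / math.log(2))   # 2.0
```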
Cross entropy
Wikipedia. The cross entropy between two distributions \(p, q\) is the number of bits needed to identify an event in \(p\) using a coding scheme optimized for \(q\).
\[ H(p, q) = - E_p[\log q] \\ H(p, q) = - \sum_x p(x) \log q(x) \]
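A minimal sketch (the distributions are made up); note that \(H(p, q) \ge H(p, p)\), with equality when \(q = p\):

```python
import math

def cross_entropy(p, q):
    # H(p, q) = -sum_x p(x) log2 q(x), in bits.
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

print(cross_entropy(p, p))   # 1.5 bits (= entropy of p)
print(cross_entropy(p, q))   # 1.75 bits (>= H(p); the gap is KL(p || q))
```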
Transpose rules
Addition
\[ (A + B)^T = A^T + B^T \]
Multiplication by constant
\[ (kA)^T = k(A^T) \]
Mat muls
\[ (AB)^T = B^T A^T \]
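A quick NumPy check of the three rules with random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
C = rng.normal(size=(4, 2))
k = 2.5

assert np.allclose((A + B).T, A.T + B.T)
assert np.allclose((k * A).T, k * A.T)
assert np.allclose((A @ C).T, C.T @ A.T)   # note the reversed order
```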
Singular Value Decomposition
\[ M = USV^T \]
where:
- \(U, V\) are orthogonal “rotation” matrices (their rows and columns are mutually orthogonal unit vectors).
- \(S\) is a diagonal “dilation” matrix of non-negative singular values.
Relation to eigenvectors
\(V\) contains the eigenvectors of \(M^T M\) (and \(U\) those of \(M M^T\)); \(S\) contains the square roots of the corresponding eigenvalues.
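A NumPy sketch checking the decomposition and the eigenvector relationship (note that np.linalg.svd returns \(V^T\), and eigenvalues may come back in a different order):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 3))

U, s, Vt = np.linalg.svd(M, full_matrices=False)
assert np.allclose(M, U @ np.diag(s) @ Vt)

# Singular values are the square roots of the eigenvalues of M^T M,
# and the columns of V are its eigenvectors.
eigvals, eigvecs = np.linalg.eigh(M.T @ M)
print(np.sort(s**2), np.sort(eigvals))   # match up to floating-point error
```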
Exponentials of complex numbers
When taking the exponential of a complex number, we’re moving in the direction of that complex number, starting at 1.
\(e^x\) can be thought of as: we are at position \(e^x\), and our velocity at that point is also \(e^x\) (see the sketch after this list).
- When \(e^x=1\), we are at position 1 and will move with velocity 1.
- When \(e^x=2\), we are at position 2 and will move with velocity 2.
- When \(e^{2x}=1\), we are at position 1 but will move with velocity 2.
- When \(e^{2x}=2\), we are at position 2 but will move with velocity 4.
- When \(e^{ix}=1\), we are at coordinate (1, 0), but will move with velocity (0, 1) - 90 degrees counterclockwise from (1, 0).
- When \(e^{ix}=1i\), we are at coordinate (0, 1), but will move with velocity (-1, 0) - 90 degrees counterclockwise from (0, 1).
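A small sketch with cmath: \(z(t) = e^{it}\) stays on the unit circle, and its velocity \(i e^{it}\) is always 90 degrees counterclockwise from the current position (the values of \(t\) are illustrative):

```python
import cmath

for t in [0.0, 0.5, 1.0, cmath.pi / 2]:
    z = cmath.exp(1j * t)     # position on the unit circle
    v = 1j * z                # velocity: d/dt e^{it} = i e^{it}
    print(abs(z), v / z)      # |z| = 1, v/z = 1j (position rotated 90 degrees)
```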
Reciprocal of complex numbers
\[ z^{-1} \\ = \frac{1}{a+bi} \\ = \frac{a-bi}{(a+bi)(a-bi)} \\ = \frac{a-bi}{a^2 + b^2} \\ = \frac{a}{a^2 + b^2} - \frac{bi}{a^2 + b^2} \]
For unit length complex numbers, \(a^2 + b^2 = 1\), so: \[ (a + bi)^{-1} = a - bi \]
Thus you can get the real part of a unit-length complex number with: \[ \operatorname{Re}(z) = (z + z^{-1}) / 2 \]
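Checking both identities numerically (the angle is arbitrary):

```python
import cmath

z = cmath.exp(0.7j)               # a unit-length complex number

print(1 / z, z.conjugate())       # equal: (a + bi)^-1 = a - bi when |z| = 1
print((z + 1 / z) / 2, z.real)    # (z + z^-1)/2 ~ Re(z) (imaginary part ~0)
```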
Even and odd functions
- Even: \(f(-x) = f(x)\)
- Odd: \(f(-x) = -f(x)\)
Taylor series
Approximating non-polynomial functions using polynomials.
- Say we want to approximate some function \(f\) around zero.
- We approximate with \(g(x) = c_0 + c_1x + c_2x^2\).
- We add the following constraints:
- \(f(0) = g(0)\)
- \(\frac{df}{dx}(0) = \frac{dg}{dx}(0)\)
- \(\frac{d^2f}{dx^2}(0) = \frac{d^2g}{dx^2}(0)\)
- When we evaluate \(g\) and its derivatives at zero, we get:
- \(g(0) = c_0 + c_1 \cdot 0 + c_2 \cdot 0^2 = c_0\)
- \(\frac{dg}{dx}(0) = c_1 + 2c_2 \cdot 0 = c_1\)
- \(\frac{d^2g}{dx^2}(0) = 2c_2\)
- So, the coefficients are fixed by the derivatives of \(f\) evaluated at zero!
\[ g(x) = f(0) + \frac{1}{1!}\frac{df}{dx}(0)x + \frac{1}{2!}\frac{d^2f}{dx^2}(0)x^2 \]
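For example, the second-order approximation of \(\cos(x)\) around zero is \(1 - \frac{x^2}{2}\); a minimal sketch:

```python
import math

def taylor_cos(x):
    # g(x) = f(0) + f'(0) x + f''(0)/2! x^2, with f = cos:
    # cos(0) = 1, -sin(0) = 0, -cos(0) = -1.
    return 1 - x**2 / 2

for x in [0.1, 0.5, 1.0]:
    print(x, math.cos(x), taylor_cos(x))
# Close near zero (0.995004 vs 0.995), drifting further away (0.5403 vs 0.5).
```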
Moving away from zero
The same construction only works if each derivative, evaluated at the expansion point, isolates a single \(c_n\) term. To expand around a point \(s\) other than zero, we replace \(x\) with \(x - s\):
\[ g(x) = f(s) + \frac{1}{1!}\frac{df}{dx}(s)(x - s) + \frac{1}{2!}\frac{d^2f}{dx^2}(s)(x-s)^2 \]
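For example, expanding \(\ln(x)\) around \(s = 1\) (where the derivatives are easy to evaluate) gives \(\ln(x) \approx (x - 1) - \frac{(x-1)^2}{2}\); a small sketch:

```python
import math

def taylor_ln(x, s=1.0):
    # g(x) = f(s) + f'(s)(x - s) + f''(s)/2! (x - s)^2, with f = ln:
    # ln(1) = 0, 1/1 = 1, -1/1^2 = -1.
    return math.log(s) + (x - s) - (x - s)**2 / 2

for x in [0.9, 1.1, 1.5]:
    print(x, math.log(x), taylor_ln(x))
# ln(1.1) = 0.0953 vs 0.0950; the approximation degrades away from s = 1.
```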
Derivative of multiplications
\[ y = f(x) g(x) \\ \frac{dy}{dx} = f(x) \frac{dg(x)}{dx} + \frac{df(x)}{dx} g(x) \]
Chain rule
\[ y = f(g(x)) \\ \frac{dy}{dx} = \frac{dy}{dg(x)} \frac{dg(x)}{dx} \]
Infimum
\[ \inf_{x \in (0, 1)} x = 0 \]
It’s the greatest lower bound: the largest value that is less than or equal to every element of the set (it need not belong to the set, as in the example above).