Math
Approximating factorials
\(x! \approx x^x e^{-x}\) (a crude form of Stirling's approximation; multiplying by \(\sqrt{2 \pi x}\) gives the full formula)
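A quick numeric comparison using only the standard library (a sketch; the test values are arbitrary):

```python
import math

for x in [5, 10, 20]:
    exact = math.factorial(x)
    crude = x**x * math.exp(-x)                 # x! ~ x^x e^{-x}
    full = crude * math.sqrt(2 * math.pi * x)   # with the sqrt(2 pi x) factor
    print(x, exact, f"{crude:.4g}", f"{full:.4g}")
```

Even at \(x = 5\) the full formula gives \(\approx 118\) against the exact \(120\), while the crude form is off by a factor of \(\sqrt{2 \pi x}\).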
Binomial distribution
- Each sample is \(1\) with probability \(f\) and \(0\) with probability \(1-f\).
- What’s the probability distribution of the number of \(1\)s, given \(N\) samples?
- \(P(r | f, N) = {N \choose r}f^r(1-f)^{N-r}\)
Mean and variance
- \(\operatorname{mean}(r) = Nf\)
- \(\operatorname{var}(r) = Nf(1-f)\) (checked numerically in the sketch below)
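A minimal sketch that builds the pmf from the formula above and confirms the mean and variance (the choice of \(N\) and \(f\) is arbitrary):

```python
import math

def binom_pmf(r, N, f):
    """P(r | f, N) = C(N, r) f^r (1-f)^(N-r)"""
    return math.comb(N, r) * f**r * (1 - f)**(N - r)

N, f = 100, 0.3
probs = [binom_pmf(r, N, f) for r in range(N + 1)]
mean = sum(r * p for r, p in enumerate(probs))
var = sum((r - mean)**2 * p for r, p in enumerate(probs))
print(sum(probs))             # ~1.0: the distribution normalizes
print(mean, N * f)            # both 30.0
print(var, N * f * (1 - f))   # both 21.0
```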
Differentiation rules
Exponential
- \(f(x) = e^x\), \(f'(x) = e^x\)
- \(f(x) = a^x\), \(f'(x) = a^x \ln(a)\)
Logarithm
- \(f(x) = \log_e(x) = \ln(x)\), \(f'(x) = 1 / x\)
- \(f(x) = \log_a(x)\), \(f'(x) = 1 / (x \ln(a))\)
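A central-difference check of all four rules (a sketch; the step size and test point are arbitrary):

```python
import math

h = 1e-6
deriv = lambda g, x: (g(x + h) - g(x - h)) / (2 * h)  # central difference

a, x = 3.0, 1.5
print(deriv(math.exp, x), math.exp(x))                             # e^x
print(deriv(lambda t: a**t, x), a**x * math.log(a))                # a^x ln(a)
print(deriv(math.log, x), 1 / x)                                   # 1/x
print(deriv(lambda t: math.log(t, a), x), 1 / (x * math.log(a)))   # log_a
```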
Linear algebra
Cross product
\[ A \times B = \left\Vert A \right\Vert \left\Vert B \right\Vert \sin(\theta) \, \hat{n} \]
where \(\hat{n}\) is the unit vector perpendicular to both \(A\) and \(B\), oriented by the right-hand rule.
Dot product
\[ a \cdot b = \sum_i a_i b_i \]
Dot product intuition
\(a \cdot b\) measures how much \(a\) and \(b\) point in the same direction, scaled by their magnitudes: \(a \cdot b = \left\Vert a \right\Vert \left\Vert b \right\Vert \cos{\theta}\).
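Both identities are easy to check with numpy (a sketch; the vectors are arbitrary):

```python
import numpy as np

A = np.array([1.0, 2.0, 0.5])
B = np.array([-0.3, 1.0, 2.0])

# Dot product two ways: componentwise sum and |A||B|cos(theta).
dot = np.dot(A, B)
cos_theta = dot / (np.linalg.norm(A) * np.linalg.norm(B))
theta = np.arccos(cos_theta)
print(dot, np.sum(A * B))

# Cross product: magnitude |A||B|sin(theta), direction perpendicular to both.
cross = np.cross(A, B)
print(np.linalg.norm(cross),
      np.linalg.norm(A) * np.linalg.norm(B) * np.sin(theta))
print(np.dot(cross, A), np.dot(cross, B))  # both ~0
```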
Gaussian
\[ P(x | \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp(-\frac{(x - \mu)^2}{2 \sigma^2}) \]
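A direct transcription of the density (a sketch; the Riemann-sum grid is arbitrary):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """P(x | mu, sigma) from the formula above."""
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

# Peak height at the mean is 1 / sqrt(2 pi sigma^2).
print(gaussian_pdf(0.0, 0.0, 1.0))  # ~0.3989
# Riemann-sum check that the density integrates to ~1.
xs = [i * 0.01 for i in range(-1000, 1000)]
print(sum(gaussian_pdf(x, 0.0, 1.0) * 0.01 for x in xs))  # ~1.0
```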
Exponential distribution
\[ P(x | \lambda) = \frac{e^{-\frac{x}{\lambda}}}{\mathcal{Z}} \] where \(\mathcal{Z}\) is a normalizing factor so that \(\int P(x | \lambda) \, dx = 1\); for \(x \ge 0\) this works out to \(\mathcal{Z} = \lambda\).
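A Riemann-sum check that \(\int_0^\infty e^{-x/\lambda} \, dx = \lambda\) (a sketch; the grid and cutoff are arbitrary):

```python
import math

lam, dx = 2.0, 0.001
# Integrate over [0, 50*lambda]; the tail beyond the cutoff is negligible.
xs = [k * dx for k in range(int(50 * lam / dx))]
Z = sum(math.exp(-x / lam) for x in xs) * dx
print(Z)  # ~2.0 == lambda
```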
Bayes
\[ P(A | B) = \frac{P(B | A) P(A)}{P(B)} \]
\[ \text{posterior} = \text{likelihood ratio} \cdot \text{prior} \]
\[ \text{likelihood ratio} = \frac{P(B | A)}{P(B)} \]
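A worked example with hypothetical numbers: a disease test with \(P(B|A) = 0.99\), false-positive rate \(0.05\), and base rate \(P(A) = 0.01\).

```python
p_disease = 0.01                 # prior P(A)
p_pos_given_disease = 0.99       # P(B | A)
p_pos_given_healthy = 0.05       # P(B | not A)

# Total probability of a positive test, P(B).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

posterior = p_pos_given_disease * p_disease / p_pos
print(posterior)  # ~0.167: a positive test is far from conclusive
```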
Maximum Likelihood Estimate vs. Maximum a Posteriori
\[ \theta_{\text{MLE}} = \arg \max_\theta p(x | \theta) \\ \theta_{\text{MAP}} = \arg \max_\theta p(x | \theta) p(\theta) \]
If \(p(\theta)\) is uniform, \(\theta_{\text{MLE}} = \theta_{\text{MAP}}\).
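A minimal sketch for estimating a Gaussian mean with known variance \(\sigma^2 = 1\): the MLE is the sample mean, and a \(\mathcal{N}(0, \tau^2)\) prior shrinks the MAP estimate toward \(0\). The data and \(\tau^2\) are made up; the closed form follows from completing the square in the log-posterior.

```python
import numpy as np

x = np.array([2.1, 1.7, 2.5, 1.9])  # observations, x_i ~ N(theta, 1)
tau2 = 1.0                           # prior variance: theta ~ N(0, tau^2)

theta_mle = x.mean()
# With sigma^2 = 1, the MAP estimate is n * mean(x) / (n + 1/tau^2).
n = len(x)
theta_map = n * x.mean() / (n + 1 / tau2)
print(theta_mle, theta_map)  # 2.05 vs 1.64: MAP pulled toward the prior mean 0
```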
Using logarithms to make calculations easier
For example, for the Maximum a Posteriori estimate we can take logs; since \(\log\) is monotonic, the \(\arg \max\) is unchanged (see the numeric sketch after this list):
- \(\arg \max_\theta p(x | \theta) p(\theta)\)
- \(\arg \max_\theta \prod_i p(x_i | \theta) \, p(\theta)\)
- \(\arg \max_\theta \log \left( \prod_i p(x_i | \theta) \, p(\theta) \right)\)
- \(\arg \max_\theta \sum_i \log p(x_i | \theta) + \log p(\theta)\)
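The practical payoff: products of many small likelihoods underflow floating point, while log-domain sums stay stable (a sketch with made-up numbers):

```python
import math

# 500 likelihood terms, each small: the raw product underflows to 0.0,
# but the log-domain sum is perfectly usable for argmax comparisons.
likelihoods = [1e-3] * 500
prior = 0.5

product = prior
for p in likelihoods:
    product *= p
print(product)  # 0.0: underflow

log_objective = sum(math.log(p) for p in likelihoods) + math.log(prior)
print(log_objective)  # ~-3454.6
```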
Perplexity
\[ PP(x) = 2^{H(x)} \] where \(H(x)\) is the entropy in bits (wiki).
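A minimal sketch: a uniform distribution over \(k\) outcomes has perplexity exactly \(k\), so perplexity reads as an "effective number of choices."

```python
import math

def perplexity(probs):
    """2^H(p), with H measured in bits."""
    H = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** H

print(perplexity([0.25] * 4))            # 4.0: uniform over 4 outcomes
print(perplexity([0.7, 0.1, 0.1, 0.1]))  # ~2.56: less "surprised" than uniform
```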
Properties of binary operations
Commutative
\[ f(a, b) = f(b, a) \]
Associative
\[ f(a, f(b, c)) = f(f(a, b), c) \]
Distributive
\[ f(a, g(b, c)) = g(f(a, b), f(a, c)) \]
For example, we say multiplication distributes over addition.
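A brute-force check over a small grid of integers (a sketch; the grid is arbitrary): multiplication distributes over addition, but not the other way around.

```python
from itertools import product

vals = [-2, 0, 1, 3]
mul = lambda a, b: a * b
add = lambda a, b: a + b

# Multiplication distributes over addition: a*(b+c) == a*b + a*c.
assert all(mul(a, add(b, c)) == add(mul(a, b), mul(a, c))
           for a, b, c in product(vals, repeat=3))
# Addition does not distribute over multiplication: a + b*c != (a+b)*(a+c).
assert not all(add(a, mul(b, c)) == mul(add(a, b), add(a, c))
               for a, b, c in product(vals, repeat=3))
```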
Jacobian
Given a function \(f(x) = y\) where \(x\) and \(y\) are vectors, the gradient of \(y\) with respect to \(x\) is the Jacobian:
\[ \frac{\partial y}{\partial x} = J = \begin{bmatrix} \frac{\partial y_1}{\partial x} & \frac{\partial y_2}{\partial x} & \frac{\partial y_3}{\partial x} & \cdots \end{bmatrix} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \frac{\partial y_3}{\partial x_1} & \cdots \\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \frac{\partial y_3}{\partial x_2} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix} \]
Here each column is the gradient of one \(y_j\) with respect to \(x\); the common convention \(J_{ij} = \partial y_i / \partial x_j\) is the transpose of this layout.
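A finite-difference Jacobian following the same layout (rows indexed by \(x\), columns by \(y\)); a minimal sketch with an arbitrary test function:

```python
import numpy as np

def jacobian(f, x, h=1e-6):
    """J[i, j] = d y_j / d x_i, matching the layout above."""
    y = f(x)
    J = np.zeros((len(x), len(y)))
    for i in range(len(x)):
        dx = np.zeros_like(x)
        dx[i] = h
        J[i] = (f(x + dx) - f(x - dx)) / (2 * h)  # central difference
    return J

# Test: f(x) = (x0*x1, x0^2), so dy/dx0 = (x1, 2*x0), dy/dx1 = (x0, 0).
f = lambda x: np.array([x[0] * x[1], x[0] ** 2])
print(jacobian(f, np.array([2.0, 3.0])))
# [[3. 4.]
#  [2. 0.]]
```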
Polar coordinates
Specify a point by its distance \(r\) from a central point and its angle \(\theta\) from a reference direction (wiki).
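A minimal conversion sketch using the standard library; `atan2` picks the correct quadrant for \(\theta\):

```python
import math

def to_polar(x, y):
    """Cartesian -> (r, theta)."""
    return math.hypot(x, y), math.atan2(y, x)

def to_cartesian(r, theta):
    return r * math.cos(theta), r * math.sin(theta)

r, theta = to_polar(1.0, 1.0)
print(r, theta)                # sqrt(2), pi/4
print(to_cartesian(r, theta))  # (~1.0, ~1.0)
```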