# Math

## Approximating factorials

$$x! \approx x^x e^{-x}$$
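As a rough numerical check, the sketch below compares $$\ln x!$$ with the log of this approximation. The `stirling_refined` variant multiplies by $$\sqrt{2 \pi x}$$, which comes from the fuller Stirling formula and is not part of the note above; the function names are illustrative only.

```python
import math

def stirling_crude(x):
    """Crude form from the note: x! ~ x^x * e^(-x)."""
    return x ** x * math.exp(-x)

def stirling_refined(x):
    """Adds the sqrt(2*pi*x) factor (an addition, not in the note above)."""
    return math.sqrt(2 * math.pi * x) * stirling_crude(x)

for n in (5, 10, 20):
    exact = math.factorial(n)
    # Compare in log space, where the approximation is normally used.
    print(n, math.log(exact), math.log(stirling_crude(n)), math.log(stirling_refined(n)))
```

The crude form is off by a growing factor in absolute terms, but the gap in log space grows only like $$\frac{1}{2}\ln(2 \pi x)$$.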

## Binomial distribution

• $$f$$ probability of $$1$$, $$(1-f)$$ probability of $$0$$.
• What’s the probability distribution of the number of $$1$$s, given $$N$$ samples?
• $$P(r | f, N) = {N \choose r}f^r(1-f)^{N-r}$$

### Mean and variance

• $$mean(r) = Nf$$
• $$var(r) = Nf(1-f)$$
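The formulas above can be checked by brute force. This is a minimal sketch with arbitrary example values $$N = 10$$, $$f = 0.3$$; `binomial_pmf` is an illustrative name.

```python
import math

def binomial_pmf(r, f, n):
    """P(r | f, N) = C(N, r) * f^r * (1 - f)^(N - r)."""
    return math.comb(n, r) * f ** r * (1 - f) ** (n - r)

n, f = 10, 0.3  # example values, chosen arbitrarily
pmf = [binomial_pmf(r, f, n) for r in range(n + 1)]

# Mean and variance computed directly from the distribution.
mean = sum(r * p for r, p in enumerate(pmf))
var = sum((r - mean) ** 2 * p for r, p in enumerate(pmf))

print(sum(pmf), mean, n * f)       # total probability, mean, Nf
print(var, n * f * (1 - f))        # variance, Nf(1 - f)
```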

## Differentiation rules

### Exponential

• $$f(x) = e^x$$, $$f'(x) = e^x$$
• $$f(x) = a^x$$, $$f'(x) = a^x \ln(a)$$

### Logarithm

• $$f(x) = \log_e(x) = \ln(x)$$, $$f'(x) = 1 / x$$
• $$f(x) = \log_a(x)$$, $$f'(x) = 1 / (x \ln(a))$$
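These rules can be sanity-checked with a central finite difference. The sketch below assumes example values $$a = 3$$ and $$x = 1.5$$; `central_diff` is a hypothetical helper, not a library function.

```python
import math

def central_diff(func, x, h=1e-6):
    """Numerical derivative: (f(x + h) - f(x - h)) / (2h)."""
    return (func(x + h) - func(x - h)) / (2 * h)

a, x = 3.0, 1.5  # example values, chosen arbitrarily

d_exp = central_diff(lambda t: a ** t, x)          # derivative of a^x
d_log = central_diff(lambda t: math.log(t, a), x)  # derivative of log_a(x)

print(d_exp, a ** x * math.log(a))     # numerical vs a^x ln(a)
print(d_log, 1 / (x * math.log(a)))    # numerical vs 1 / (x ln(a))
```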

## Linear algebra

### Cross product

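$A \times B = \left\Vert A \right\Vert \left\Vert B \right\Vert \sin{\theta} \, n$

where $$\theta$$ is the angle between $$A$$ and $$B$$, and $$n$$ is the unit vector perpendicular to both (direction given by the right-hand rule).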

### Dot product

$a \cdot b = \sum_i a_i b_i$

#### Dot product intuition

$$a \cdot b$$ measures how much $$a$$ and $$b$$ point in the same direction, scaled by their magnitude.
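A small sketch of this intuition, using plain lists as vectors. The `cosine` helper relies on the standard identity $$a \cdot b = \left\Vert a \right\Vert \left\Vert b \right\Vert \cos{\theta}$$, which is not stated above; all function names are illustrative.

```python
import math

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    """Dot product with the magnitudes divided out: direction agreement only."""
    return dot(a, b) / (norm(a) * norm(b))

print(dot([1, 0], [0, 1]))      # perpendicular vectors
print(dot([1, 0], [2, 0]))      # same direction, scaled by magnitude
print(cosine([1, 0], [-3, 0]))  # opposite directions
```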

## Gaussian

$P(x | \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp(-\frac{(x - \mu)^2}{2 \sigma^2})$

## Exponential distribution

$P(x | \lambda) = \frac{e^{-\frac{x}{\lambda}}}{\mathcal{Z}}$ where $$\mathcal{Z}$$ is a normalizing factor so that $$\int P(x | \lambda) \, dx = 1$$. For $$x \in [0, \infty)$$, $$\mathcal{Z} = \lambda$$.

## Bayes

$P(A | B) = \frac{P(B | A) P(A)}{P(B)}$

$\text{posterior} = \text{likelihood ratio} \cdot \text{prior}$

$\text{likelihood ratio} = \frac{P(B | A)}{P(B)}$

### Maximum Likelihood Estimate vs. Maximum a Posteriori

$\theta_{\text{MLE}} = \arg \max_\theta p(x | \theta) \\ \theta_{\text{MAP}} = \arg \max_\theta p(x | \theta) p(\theta)$

If $$p(\theta)$$ is uniform, $$\theta_{\text{MLE}} = \theta_{\text{MAP}}$$.

#### Using logarithms to make calculations easier

For example, for Maximum a Posteriori, assuming the samples $$x_i$$ are independent, we can do the following:

• $$\arg \max_\theta p(x | \theta) p(\theta)$$
• $$\arg \max_\theta \left[ \prod_i p(x_i | \theta) \right] p(\theta)$$
• $$\arg \max_\theta \log \left( \left[ \prod_i p(x_i | \theta) \right] p(\theta) \right)$$
• $$\arg \max_\theta \left[ \sum_i \log p(x_i | \theta) \right] + \log p(\theta)$$
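The steps above can be sketched on a toy problem: estimating a coin's bias $$\theta$$ from 7 heads and 3 tails by grid search. The data, the unnormalised Beta(2, 2) prior, and all names here are assumptions made for illustration.

```python
import math

def log_posterior(theta, xs, log_prior):
    """Sum of per-sample log likelihoods plus the log prior."""
    log_lik = sum(math.log(theta if x == 1 else 1.0 - theta) for x in xs)
    return log_lik + log_prior(theta)

def flat_log_prior(theta):
    # Uniform prior: a constant, so it cannot move the arg max.
    return 0.0

def beta22_log_prior(theta):
    # Unnormalised Beta(2, 2) prior, proportional to theta * (1 - theta).
    return math.log(theta) + math.log(1.0 - theta)

xs = [1] * 7 + [0] * 3                      # 7 heads, 3 tails
grid = [i / 1000 for i in range(1, 1000)]   # candidate theta values in (0, 1)

theta_mle = max(grid, key=lambda t: log_posterior(t, xs, flat_log_prior))
theta_map = max(grid, key=lambda t: log_posterior(t, xs, beta22_log_prior))
print(theta_mle, theta_map)

# Why logs help: a long product of probabilities underflows to 0.0,
# while the equivalent sum of logs stays representable.
naive_product = math.prod([0.5] * 2000)
log_sum = sum(math.log(0.5) for _ in range(2000))
print(naive_product, log_sum)
```

With the flat prior the two estimates coincide; the Beta(2, 2) prior pulls the MAP estimate towards 0.5.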

## Perplexity

Wiki.

$PP(x) = 2^{H(x)}$
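A minimal sketch, assuming $$H$$ is the Shannon entropy measured in bits (base-2 logarithm), which matches the base of $$2$$ in the formula. Function names are illustrative.

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def perplexity(p):
    """PP = 2^H: the effective number of equally likely outcomes."""
    return 2 ** entropy_bits(p)

print(perplexity([0.25, 0.25, 0.25, 0.25]))  # uniform over four outcomes
print(perplexity([0.9, 0.05, 0.05]))         # peaked distribution
```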

## Properties of binary operations

### Commutative

$f(a, b) = f(b, a)$

### Associativity

$f(a, f(b, c)) = f(f(a, b), c)$

### Distributive

$f(a, g(b, c)) = g(f(a, b), f(a, c))$

For example, multiplication ($$f$$) distributes over addition ($$g$$): $$a(b + c) = ab + ac$$.
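The properties can be spot-checked over a small sample of integers. This sketch takes "$$f$$ distributes over $$g$$" to mean $$f(a, g(b, c)) = g(f(a, b), f(a, c))$$; passing on a finite sample can refute a property but does not prove it. The helper names are hypothetical.

```python
from itertools import product
from operator import add, mul, sub

def is_commutative(f, values):
    return all(f(a, b) == f(b, a) for a, b in product(values, repeat=2))

def is_associative(f, values):
    return all(f(a, f(b, c)) == f(f(a, b), c)
               for a, b, c in product(values, repeat=3))

def distributes_over(f, g, values):
    """True if f(a, g(b, c)) == g(f(a, b), f(a, c)) on every sampled triple."""
    return all(f(a, g(b, c)) == g(f(a, b), f(a, c))
               for a, b, c in product(values, repeat=3))

sample = range(-3, 4)
print(is_commutative(add, sample), is_commutative(sub, sample))
print(is_associative(mul, sample), is_associative(sub, sample))
print(distributes_over(mul, add, sample), distributes_over(add, mul, sample))
```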

## Jacobian

Given a function $$f(x) = y$$ where $$x$$ and $$y$$ are vectors, the derivative of $$y$$ with respect to $$x$$ is the Jacobian:

$\frac{\partial y}{\partial x} = J = \begin{bmatrix} \frac{\partial y_1}{\partial x} & \frac{\partial y_2}{\partial x} & \frac{\partial y_3}{\partial x} & \cdots \end{bmatrix} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \frac{\partial y_3}{\partial x_1} & \cdots \\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \frac{\partial y_3}{\partial x_2} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}$

This arrangement, with one row per input $$x_i$$, is the denominator layout; many texts define the Jacobian as its transpose, with entries $$\partial y_i / \partial x_j$$.
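A finite-difference sketch of the Jacobian, keeping one row per input $$x_i$$ and one column per output $$y_j$$. The example function and the `jacobian` helper are made up for illustration.

```python
def jacobian(f, x, h=1e-6):
    """Approximate J[i][j] = d y_j / d x_i with central differences."""
    rows = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h    # nudge only the i-th input up
        x_minus[i] -= h   # and down
        y_plus, y_minus = f(x_plus), f(x_minus)
        rows.append([(p - m) / (2 * h) for p, m in zip(y_plus, y_minus)])
    return rows

def example_f(x):
    """Hypothetical map from 2 inputs to 3 outputs."""
    x1, x2 = x
    return [x1 * x2, x1 + x2, x1 ** 2]

J = jacobian(example_f, [2.0, 3.0])
for row in J:
    print(row)
```

Analytically, the row for $$x_1$$ is $$[x_2, 1, 2x_1]$$ and the row for $$x_2$$ is $$[x_1, 1, 0]$$.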

## Polar coordinates

Specify coordinates by distance from a central point and angle (wiki).

## Numbers

### Euler’s number

2.718

### Golden ratio

1.618

### Square root of 2

1.414

### Speed of light

2.998e8 m/s

### Age of the universe

1.38e10 years

### Age of Earth

4.54e9 years

### Age of life on Earth

3.7e9 years

### Age of Humanity

3e5 years

## Nats, hartleys, shannons

• Nat: unit of information when logarithms are taken to base $$e$$.
• Hartley: unit of information when logarithms are taken to base $$10$$.
• Shannon: unit of information when logarithms are taken to base $$2$$ (the bit).

## Cross entropy

Wikipedia. The cross entropy between two distributions $$p, q$$ is the average number of bits needed to identify an event drawn from $$p$$ using a coding scheme optimized for $$q$$.

$H(p, q) = - E_p[\log q] \\ H(p, q) = - \sum_x p(x) \log q(x)$
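A minimal sketch of the sum above, using base-2 logarithms so the result is in bits. The distributions are arbitrary examples and the function name is illustrative.

```python
import math

def cross_entropy_bits(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x); terms with p(x) = 0 are skipped."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

print(cross_entropy_bits(p, p))  # coding p with its own optimal code: the entropy of p
print(cross_entropy_bits(p, q))  # coding p with a code tuned for q costs more
```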

## Transpose rules

### Addition

$(A + B)^T = A^T + B^T$

### Multiplication by constant

$(kA)^T = k(A^T)$

### Mat muls

$(AB)^T = B^T A^T$

## Singular Value Decomposition

$M = USV^T$

where:

• $$U, V$$ are orthogonal “rotation” matrices: square, with orthonormal columns (mutually perpendicular unit vectors).
• $$S$$ is a diagonal “dilation” matrix of non-negative singular values.

### Relation to eigenvectors

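The columns of $$V$$ are the eigenvectors of $$M^T M$$ (the columns of $$U$$ are the eigenvectors of $$M M^T$$), and $$S$$ contains the square roots of the corresponding eigenvalues, since $$M^T M = V S^T U^T U S V^T = V S^T S V^T$$.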

### Exponentials of complex numbers

When taking the exponential of a complex number, we’re moving in the direction of that complex number, starting at 1.

$$e^x$$ can be thought of as: we are at a point $$e^x$$, our velocity at that point is also $$e^x$$.

• When $$e^x=1$$, we are at position 1 and will move with velocity 1.
• When $$e^x=2$$, we are at position 2 and will move with velocity 2.
• When $$e^{2x}=1$$, we are at position 1 but will move with velocity 2.
• When $$e^{2x}=2$$, we are at position 2 but will move with velocity 4.
• When $$e^{ix}=1$$, we are at coordinate (1, 0), but will move with velocity (0, 1) - 90 degrees counterclockwise from (1, 0).
• When $$e^{ix}=i$$, we are at coordinate (0, 1), but will move with velocity (-1, 0) - 90 degrees counterclockwise from (0, 1).
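The position and velocity picture for $$e^{ix}$$ can be checked numerically with `cmath`. This is a sketch: `position` and `velocity` are illustrative names, and the velocity is a finite-difference estimate of the derivative with respect to $$x$$.

```python
import cmath
import math

def position(x):
    """Point on the unit circle reached by e^(ix)."""
    return cmath.exp(1j * x)

def velocity(x, h=1e-6):
    """Central-difference estimate of d/dx e^(ix)."""
    return (position(x + h) - position(x - h)) / (2 * h)

for x in (0.0, math.pi / 2, 0.7):
    # Multiplying by i rotates the position 90 degrees counterclockwise.
    print(position(x), velocity(x), 1j * position(x))
```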

### Reciprocal of complex numbers

$z^{-1} \\ = \frac{1}{a+bi} \\ = \frac{a-bi}{(a+bi)(a-bi)} \\ = \frac{a-bi}{a^2 + b^2} \\ = \frac{a}{a^2 + b^2} - \frac{bi}{a^2 + b^2}$

For unit length complex numbers, $$a^2 + b^2 = 1$$, so: $(a + bi)^{-1} = a - bi$

Thus, since $$z + z^{-1} = (a + bi) + (a - bi) = 2a$$, you can get the real part of a unit length complex number with: $(z + z^{-1}) / 2$
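A short sketch of the reciprocal formula and the unit-length shortcut, using Python's built-in complex type. The helper names are hypothetical.

```python
import cmath
import math

def reciprocal(z):
    """1 / (a + bi) = (a - bi) / (a^2 + b^2)."""
    a, b = z.real, z.imag
    return complex(a, -b) / (a * a + b * b)

def real_part_of_unit(z):
    """Real part of a unit-length z via (z + 1/z) / 2."""
    return ((z + reciprocal(z)) / 2).real

z = cmath.exp(0.8j)                   # a unit-length complex number
print(reciprocal(3 + 4j))             # general case
print(reciprocal(z), z.conjugate())   # unit length: reciprocal equals conjugate
print(real_part_of_unit(z), math.cos(0.8))
```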

## Even and odd functions

• Even: $$f(-x) = f(x)$$
• Odd: $$f(-x) = -f(x)$$

## Taylor series

Taking non-polynomial functions, and approximating them using polynomials.

• Say we want to approximate some function $$f$$ around zero.
• We approximate with $$g(x) = c_0 + c_1x + c_2x^2$$.
• We add the following constraints:
• $$f(0) = g(0)$$
• $$\frac{df}{dx}(0) = \frac{dg}{dx}(0)$$
• $$\frac{d^2f}{dx^2}(0) = \frac{d^2g}{dx^2}(0)$$
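• Evaluating $$g$$ and its derivatives at zero, we get:
• $$g(0) = c_0 + c_1 \cdot 0 + c_2 \cdot 0^2 = c_0$$
• $$\frac{dg}{dx}(0) = c_1 + 2c_2 \cdot 0 = c_1$$
• $$\frac{d^2g}{dx^2}(0) = 2c_2$$
• So the constraints fix each coefficient from the derivatives of $$f$$ at zero: $$c_0 = f(0)$$, $$c_1 = \frac{df}{dx}(0)$$, $$c_2 = \frac{1}{2}\frac{d^2f}{dx^2}(0)$$.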

$g(x) = f(0) + \frac{1}{1!}\frac{df}{dx}(0)x + \frac{1}{2!}\frac{d^2f}{dx^2}(0)x^2$

### Moving away from zero

The derivation above works because evaluating each derivative of $$g$$ at zero leaves a single $$c_n$$ term. To approximate around another point $$s$$, we expand in powers of $$(x - s)$$ so that the same thing happens at $$x = s$$:

$g(x) = f(s) + \frac{1}{1!}\frac{df}{dx}(s)(x - s) + \frac{1}{2!}\frac{d^2f}{dx^2}(s)(x-s)^2$
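A sketch of the second-order approximation around a point $$s$$. It assumes the first and second derivatives are supplied as functions; `taylor2` is an illustrative name, and $$\cos$$ and $$\exp$$ are arbitrary example targets.

```python
import math

def taylor2(f, df, d2f, s):
    """Return g(x) = f(s) + f'(s)(x - s) + f''(s)(x - s)^2 / 2!."""
    def g(x):
        return f(s) + df(s) * (x - s) + 0.5 * d2f(s) * (x - s) ** 2
    return g

# cos around s = 0: derivatives are -sin and -cos.
g_cos = taylor2(math.cos, lambda x: -math.sin(x), lambda x: -math.cos(x), 0.0)
# exp around s = 1: every derivative of exp is exp.
g_exp = taylor2(math.exp, math.exp, math.exp, 1.0)

print(g_cos(0.1), math.cos(0.1))
print(g_exp(1.1), math.exp(1.1))
print(g_cos(2.0), math.cos(2.0))  # far from s, the approximation degrades
```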

## Derivative of multiplications

$y = f(x) g(x) \\ \frac{dy}{dx} = f(x) \frac{dg(x)}{dx} + \frac{df(x)}{dx} g(x)$

## Chain rule

$y = f(g(x)) \\ \frac{dy}{dx} = \frac{dy}{dg(x)} \frac{dg(x)}{dx}$

## Infimum

$\inf_{x \in (0, 1)} x = 0$

It’s the largest value that is less than or equal to every element of a set (the greatest lower bound). It need not belong to the set: here $$0 \notin (0, 1)$$.