# Math

## Approximating factorials

$$x! \approx x^x e^{-x}$$

This is Stirling's approximation without the $$\sqrt{2 \pi x}$$ correction factor; the more accurate form is $$x! \approx \sqrt{2 \pi x} \, x^x e^{-x}$$.
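A quick numeric check of how close the approximation gets (a sketch; the sample values of $$x$$ are arbitrary):

```python
import math

def stirling(x):
    # leading-order Stirling approximation: x! ≈ x^x e^(-x)
    return x ** x * math.exp(-x)

for x in (5, 10, 20):
    ratio = stirling(x) / math.factorial(x)
    # the ratio approaches 1 / sqrt(2 * pi * x), the missing correction factor
    print(x, ratio, 1 / math.sqrt(2 * math.pi * x))
```

Multiplying the approximation by $$\sqrt{2 \pi x}$$ brings it within about $$1/(12x)$$ relative error of the exact factorial.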

## Binomial distribution

• Each sample is $$1$$ with probability $$f$$ and $$0$$ with probability $$(1-f)$$.
• What is the probability distribution of the number of $$1$$s in $$N$$ samples?
• $$P(r | f, N) = {N \choose r}f^r(1-f)^{N-r}$$
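The formula above maps directly onto the standard library (a minimal sketch):

```python
import math

def binom_pmf(r, f, N):
    # P(r | f, N) = C(N, r) * f^r * (1 - f)^(N - r)
    return math.comb(N, r) * f ** r * (1 - f) ** (N - r)

# sanity check: the probabilities over r = 0..N sum to 1
total = sum(binom_pmf(r, 0.3, 10) for r in range(11))
```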

### Mean and variance

• $$\text{mean}(r) = Nf$$
• $$\text{var}(r) = Nf(1-f)$$
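Both formulas can be checked by brute-force expectation over the pmf (the parameter values below are arbitrary):

```python
import math

def binom_pmf(r, f, N):
    return math.comb(N, r) * f ** r * (1 - f) ** (N - r)

N, f = 20, 0.3
mean = sum(r * binom_pmf(r, f, N) for r in range(N + 1))
var = sum((r - mean) ** 2 * binom_pmf(r, f, N) for r in range(N + 1))
# mean ≈ N f = 6.0, var ≈ N f (1 - f) = 4.2
```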

## Differentiation rules

### Exponential

• $$f(x) = e^x$$, $$f'(x) = e^x$$
• $$f(x) = a^x$$, $$f'(x) = a^x \ln(a)$$

### Logarithm

• $$f(x) = \log_e(x) = \ln(x)$$, $$f'(x) = 1 / x$$
• $$f(x) = \log_a(x)$$, $$f'(x) = 1 / (x \ln(a))$$
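The $$a^x$$ and $$\log_a(x)$$ rules can be checked numerically with a symmetric finite difference (the values of $$a$$ and $$x$$ are arbitrary):

```python
import math

def numderiv(f, x, h=1e-6):
    # symmetric finite difference: (f(x+h) - f(x-h)) / (2h)
    return (f(x + h) - f(x - h)) / (2 * h)

a, x = 2.0, 1.5
d_exp = numderiv(lambda t: a ** t, x)          # should be ≈ a^x ln(a)
d_log = numderiv(lambda t: math.log(t, a), x)  # should be ≈ 1 / (x ln(a))
```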

## Linear algebra

### Cross product

$A \times B = \left\Vert A \right\Vert \left\Vert B \right\Vert \sin(\theta) \, \mathbf{n}$

where $$\theta$$ is the angle between $$A$$ and $$B$$, and $$\mathbf{n}$$ is the unit vector perpendicular to both (direction given by the right-hand rule).
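A numeric check that the magnitude works out (a sketch; the example vectors are arbitrary):

```python
import math

def cross(a, b):
    # 3-D cross product, component-wise
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(v):
    return math.sqrt(sum(c * c for c in v))

A, B = (1.0, 2.0, 0.0), (0.0, 3.0, 1.0)
cos_theta = sum(x * y for x, y in zip(A, B)) / (norm(A) * norm(B))
sin_theta = math.sqrt(1 - cos_theta ** 2)
# |A × B| should equal |A| |B| sin(theta)
```

Note the cross product is anti-commutative: $$A \times B = -(B \times A)$$.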

### Dot product

$a \cdot b = \sum_i a_i b_i$

#### Dot product intuition

$$a \cdot b$$ measures how much $$a$$ and $$b$$ point in the same direction, scaled by their magnitudes: $$a \cdot b = \left\Vert a \right\Vert \left\Vert b \right\Vert \cos(\theta)$$.
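A minimal sketch of that intuition with concrete vectors:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

same = dot((1.0, 0.0), (3.0, 0.0))       # parallel: maximal, |a| |b| = 3.0
perp = dot((1.0, 0.0), (0.0, 2.0))       # orthogonal: 0.0
opposite = dot((1.0, 0.0), (-2.0, 0.0))  # anti-parallel: -|a| |b| = -2.0
```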

## Gaussian

$P(x | \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp(-\frac{(x - \mu)^2}{2 \sigma^2})$
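A sketch of the density plus a crude numeric check that it integrates to 1 (the values of $$\mu$$ and $$\sigma$$ are arbitrary):

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Riemann sum over mu ± 8 sigma; the tails beyond are negligible
mu, sigma, dx = 1.0, 2.0, 0.001
total = sum(gaussian_pdf(mu - 8 * sigma + i * dx, mu, sigma) * dx
            for i in range(int(16 * sigma / dx)))
```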

## Exponential distribution

$P(x | \lambda) = \frac{e^{-\frac{x}{\lambda}}}{\mathcal{Z}}$ where $$\mathcal{Z}$$ is a normalizing factor so that $$\int_0^\infty P(x | \lambda) \, dx = 1$$ (here $$\mathcal{Z} = \lambda$$).
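Assuming support $$x \ge 0$$, the normalizer evaluates to $$\mathcal{Z} = \lambda$$; a crude numeric check (the value of $$\lambda$$ is arbitrary):

```python
import math

lam = 1.5
dx = 0.001
# midpoint-rule integral of e^(-x/lam) over [0, 50 lam]; the tail beyond is negligible
Z = sum(math.exp(-(i + 0.5) * dx / lam) * dx for i in range(int(50 * lam / dx)))
# Z comes out ≈ lam, i.e. P(x | lam) = (1/lam) e^(-x/lam)
```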

## Bayes

$P(A | B) = \frac{P(B | A) P(A)}{P(B)}$

$\text{posterior} = \text{likelihood ratio} \cdot \text{prior}$

$\text{likelihood ratio} = \frac{P(B | A)}{P(B)}$
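A worked example with hypothetical diagnostic-test numbers (all the probabilities below are made up for illustration):

```python
# A = "has disease", B = "test is positive"
p_a = 0.01                # prior P(A)
p_b_given_a = 0.95        # likelihood P(B | A)
p_b_given_not_a = 0.05    # false-positive rate P(B | not A)

# evidence P(B) by the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

posterior = p_b_given_a * p_a / p_b     # Bayes' rule
likelihood_ratio = p_b_given_a / p_b    # posterior = likelihood_ratio * prior
```

Even with a fairly accurate test, the small prior keeps the posterior modest, which is the point of weighting the likelihood by the prior.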

### Maximum Likelihood Estimate vs. Maximum a Priori

$\theta_{\text{MLE}} = \arg \max_\theta p(x | \theta)$

$\theta_{\text{MAP}} = \arg \max_\theta p(x | \theta) p(\theta)$

If $$p(\theta)$$ is uniform, $$\theta_{\text{MLE}} = \theta_{\text{MAP}}$$.

#### Using logarithms to make calculations easier

For example, for Maximum a Priori, assuming i.i.d. samples $$x_i$$ (and using that $$\log$$ is monotonic, so it preserves the argmax):

• $$\arg \max_\theta p(x | \theta) p(\theta)$$
• $$= \arg \max_\theta \left( \prod_i p(x_i | \theta) \right) p(\theta)$$
• $$= \arg \max_\theta \log \left( \left( \prod_i p(x_i | \theta) \right) p(\theta) \right)$$
• $$= \arg \max_\theta \sum_i \log p(x_i | \theta) + \log p(\theta)$$
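A small grid-search sketch on made-up Bernoulli data with a Beta(2, 2)-style prior (both the data and the prior are assumptions for illustration):

```python
import math

data = [1, 1, 1, 0, 1]  # hypothetical coin flips: 4 heads, 1 tail

def log_likelihood(theta):
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data)

def log_prior(theta):
    # Beta(2, 2) prior, up to an additive constant
    return math.log(theta) + math.log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=log_likelihood)
theta_map = max(grid, key=lambda t: log_likelihood(t) + log_prior(t))
```

The MLE lands on the sample frequency 4/5 = 0.8, while the prior pulls the MAP estimate toward 0.5 (here to 5/7 ≈ 0.714).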

## Perplexity

Perplexity measures how well a probability distribution predicts a sample (see the Wikipedia article); $$H(x)$$ is the entropy:

$PP(x) = 2^{H(x)}$
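A small sketch, assuming $$H$$ is measured in bits (base-2 log). For a uniform distribution over $$k$$ outcomes, $$H = \log_2 k$$, so the perplexity is exactly $$k$$:

```python
import math

def perplexity(probs):
    # PP = 2^H with H = -sum p log2(p), the entropy in bits
    H = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** H

uniform8 = perplexity([1 / 8] * 8)  # uniform over 8 outcomes -> 8
certain = perplexity([1.0])         # a certain outcome -> 1
```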

## Properties of binary operations

### Commutative

$f(a, b) = f(b, a)$

### Associative

$f(a, f(b, c)) = f(f(a, b), c)$

### Distributive

$f(a, g(b, c)) = g(f(a, b), f(a, c))$

For example, we say multiplication distributes over addition.

## Jacobian

Given a function $$f(x) = y$$ where $$x$$ and $$y$$ are vectors, the matrix of partial derivatives of $$y$$ with respect to $$x$$ is the Jacobian:

$\frac{\partial y}{\partial x} = J = \begin{bmatrix} \frac{\partial y_1}{\partial x} & \frac{\partial y_2}{\partial x} & \frac{\partial y_3}{\partial x} & \dots \end{bmatrix} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \frac{\partial y_3}{\partial x_1} & \dots \\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \frac{\partial y_3}{\partial x_2} & \dots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}$

Note the layout: column $$j$$ here holds the gradient of $$y_j$$, i.e. $$J_{ij} = \partial y_j / \partial x_i$$. The more common convention is the transpose, with one row per output: $$J_{ij} = \partial y_i / \partial x_j$$.
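A numeric sanity check (a sketch; it uses the row-per-output convention $$J_{ij} = \partial y_i / \partial x_j$$, and the example function is made up):

```python
def jacobian(f, x, h=1e-6):
    # numeric Jacobian via symmetric finite differences; J[i][j] = dy_i / dx_j
    y = f(x)
    J = [[0.0] * len(x) for _ in y]
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        yp, ym = f(xp), f(xm)
        for i in range(len(y)):
            J[i][j] = (yp[i] - ym[i]) / (2 * h)
    return J

# f(x1, x2) = (x1 * x2, x1 + x2); analytic Jacobian rows: (x2, x1) and (1, 1)
J = jacobian(lambda x: (x[0] * x[1], x[0] + x[1]), [2.0, 3.0])
```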

## Polar coordinates

Specify a point by its distance $$r$$ from a central point (the pole) and the angle $$\theta$$ from a reference direction (wiki).
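Conversions to and from Cartesian coordinates, as a sketch using the standard library:

```python
import math

def polar_to_cartesian(r, theta):
    # x = r cos(theta), y = r sin(theta)
    return (r * math.cos(theta), r * math.sin(theta))

def cartesian_to_polar(x, y):
    # r = sqrt(x^2 + y^2); atan2 keeps the correct quadrant for theta
    return (math.hypot(x, y), math.atan2(y, x))
```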