Machine learning
Image generation
Quantizing vectors
- Latent space vector \(L \in \mathbb{R}^n\).
- Quantise to \(m\) vectors, represented as the rows of a matrix \(Q \in \mathbb{R}^{m \times n}\).
- In the forward pass, snap \(L\) to the closest row \(L_Q\) of \(Q\).
- Add a cost function to minimize \(\text{dist}(L, L_Q)\).
- In the backward pass, push gradients down to \(L\) as if we never snapped to \(Q\) (sketched below).
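A minimal sketch of the snap-and-straight-through step in PyTorch (function and variable names are hypothetical; any weighting of the cost term is omitted):

```python
import torch

def quantize(L, Q):
    # L: (batch, n) latent vectors; Q: (m, n) codebook, one quantised vector per row.
    dists = torch.cdist(L, Q)        # (batch, m) pairwise distances
    L_q = Q[dists.argmin(dim=1)]     # snap each latent to its closest row
    cost = ((L - L_q) ** 2).mean()   # cost term: keep L close to its code
    # Straight-through estimator: forward pass uses L_q, backward pass sends
    # gradients to L as if the snap never happened.
    L_q = L + (L_q - L).detach()
    return L_q, cost
```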
CLIP
Paper. Creates joint text and image embeddings from 400M image-text pairs. This is done by training the embeddings of matched pairs to have low cosine distance, and the embeddings of non-pairs to have high cosine distance.
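A sketch of this symmetric contrastive objective (simplified; the fixed temperature here is an assumption, whereas CLIP learns it):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    # Normalise so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature               # (batch, batch) similarities
    targets = torch.arange(len(logits), device=logits.device)   # matched pairs sit on the diagonal
    # Pull matched pairs together and push non-pairs apart, in both directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```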
Diffusion models
- Sequentially apply noise to an input image:
- \(x_t \sim q(x_t | x_{t-1}) = \mathcal{N}(x_t; \mu=\sqrt{1 - \beta_t}x_{t-1}, \Sigma=\beta_t I)\)
- Note that the mean converges to zero as \(t\) grows, i.e. \(x_t\) approaches pure noise.
- We can avoid iterating through every step \(t\) by reparameterising with \(\bar{\alpha}_t = \prod_{s=1}^{t}(1 - \beta_s)\):
- \(x_t \sim q(x_t | x_0) = \mathcal{N}(x_t; \mu=\sqrt{\bar{\alpha}_t}x_0, \Sigma=(1 - \bar{\alpha}_t) I)\)
- We then learn a model to reverse the noise:
- \(x_t \sim p_{\theta}(x_t | x_{t + 1})\)
- We use an ELBO loss function, where a lot of complicated maths comes in.
- We will often instead train \(\epsilon_{\theta}\), which models the noise added at each step. We can use this to derive \(\mu_{\theta}\), and fix \(\Sigma_{\theta}\) to a constant (see the sketch below).
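A minimal sketch of the \(\epsilon_{\theta}\) training objective under these definitions (the model interface and the precomputed \(\bar{\alpha}\) schedule are assumptions):

```python
import torch

def diffusion_loss(eps_model, x0, alpha_bar, T):
    # x0: (batch, channels, H, W). Sample a timestep and noise, then jump straight
    # from x_0 to x_t using the reparameterised forward process q(x_t | x_0).
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1, 1, 1)        # cumulative product of (1 - beta_s)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
    # Train eps_theta to predict the noise that was added.
    return ((eps_model(x_t, t) - eps) ** 2).mean()
```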
Guided diffusion
Given a classifier \(p_{\phi}\) and classes \(y \in Y\), we define a new noise-reversion method: $$ \hat{\mu}_{\theta}(x_t | y) = \mu_{\theta}(x_t) + s \, \Sigma_{\theta}(x_t) \nabla_{x_t} \log p_{\phi}(y | x_t) $$
This updates the image means towards being classified correctly. \(s\) controls the weighting of the guidance, trading off between quality & diversity.
Classifier-free guidance
We train \(\epsilon_{\theta}(x_t | y)\). To maintain diversity we don’t use this directly, and instead use: $$ \hat{\epsilon}_{\theta}(x_t | y) = \epsilon_{\theta}(x_t | \emptyset) + s( \epsilon_{\theta}(x_t | y) - \epsilon_{\theta}(x_t | \emptyset) ) $$
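At sampling time this is just a weighted blend of the conditional and unconditional noise predictions; a sketch (the \(\epsilon_{\theta}\) interface is an assumption):

```python
def guided_eps(eps_model, x_t, t, y, s):
    # Classifier-free guidance: blend the unconditional and conditional predictions.
    eps_uncond = eps_model(x_t, t, y=None)   # conditioned on the empty/null label
    eps_cond = eps_model(x_t, t, y=y)
    return eps_uncond + s * (eps_cond - eps_uncond)
```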
CLIP guidance
We train CLIP embeddings with noisy (and non-noisy) images. When diffusing, we use the image & caption CLIP embeddings to guide the images to an embedding that matches the caption:
\[ \hat{\mu}_{\theta} = \mu_{\theta}(x_t | c) + s \Sigma_{\theta}(x_t | c) \nabla_{x_t}(f(x_t) \cdot g(c)) \]
where \(f\) and \(g\) are the image & caption CLIP embedding functions.
GLIDE (precursor to DALLE-2)
Paper. They train a classifier-free guided diffusion model. They also train an up-sampling diffusion model. Text conditioning is done using a Transformer. They find that classifier-free guidance works better than CLIP guidance.
Unconditional image generation
The GLIDE training data has 20% of its samples with no text. This allows the model to produce images without any captions.
Image in-painting
The GLIDE diffusion models are fine-tuned to perform in-painting. The diffusion models take as input (1) the RGB channels of the noisy image, (2) the RGB channels of the masked image, and (3) the boolean mask.
Reinforcement learning
On-policy vs. off-policy RL
Source. On-policy methods learn values conditioned on the current policy being followed. Off-policy methods learn values based on a potentially different policy (e.g. Q-learning updates Q-values based on taking the greedy action in the next step).
The difference between on-policy and off-policy collapses if a greedy policy is always being followed. This doesn’t happen in practice, as greedy policies never explore.
In-context Reinforcement Learning with Algorithm Distillation
Paper. Fit a transformer on RL learning trajectories. The transformer can then learn out-of-distribution tasks. The performance isn’t quite as good as the underlying RL algorithm, but it can be more sample efficient: you can subsample the training data so the transformer learns to imitate a “faster” RL algorithm.
Time Series Forecasting With Deep Learning: A Survey
This paper surveys RNNs, CNNs, and attention-based methods for time series forecasting. One interesting conclusion is that hybrid models are winning the competitions. This might be because adding constraints to the models allows learning from smaller datasets.
Transformers
Mainly taken from A Survey of Transformers.
High-level architecture
- Encoders & decoders both consist of \(L\) blocks.
- Each block consists of two modules:
- A multi-head self-attention module.
- A position-wise feed-forward network.
- Position-wise: no shared info between tokens.
- Each module is wrapped by a residual connection \(x' = f(x) + x\).
- Each module is followed by layer normalization.
- Decoder blocks have an additional multi-head cross-attention module.
- Cross-attention: Attention over the input tokens, not the output tokens.
- Decoder blocks’ self-attention modules are masked such that earlier tokens do not have access to later tokens.
Attention modules
\[ \text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^\top}{\sqrt{D_k}}) V = AV \]
where:
- \(Q \in \mathbb{R}^{N \times D_k}\) are the queries produced by each of the \(N\) output tokens.
- \(K \in \mathbb{R}^{M \times D_k}\) are the keys produced by each of the \(M\) input tokens.
- N.B.: \(N=M\) for self-attention.
- \(V \in \mathbb{R}^{M \times D_v}\) are the values produced by each of the \(M\) input tokens.
- \(QK^\top\) calculates the dot product between all queries in the output and keys in the input.
- \(A \in \mathbb{R}^{N \times M} = \text{softmax}(\frac{QK^\top}{\sqrt{D_k}})\) is the attention matrix, where \(A_{ij}\) is how much output \(i\) is attending to input \(j\).
- N.B.: We divide by \(\sqrt{D_k}\) to alleviate vanishing gradient problems.
- \(AV \in \mathbb{R}^{N \times D_v}\) is a per-output weighted-sum of the input values \(V\), where weights are taken from \(A\).
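A direct transcription of the formula above for a single head (PyTorch, 2-D tensors assumed):

```python
import math
import torch

def attention(Q, K, V):
    # Q: (N, D_k), K: (M, D_k), V: (M, D_v)
    A = torch.softmax(Q @ K.T / math.sqrt(Q.shape[-1]), dim=-1)  # (N, M) attention matrix
    return A @ V                                                 # (N, D_v) weighted sum of values
```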
Attention complexity
\(O(T^2D)\) where \(T\) is the input/output length, and \(D\) is the key/value dimension.
Attention num parameters
\(4D^2\) where \(D\) is the key/value dimension.
Multi-head attention
\[ \text{MultiHeadAttention}(Q, K, V) = \text{Concat}( \text{Attention}(QW^Q_1, KW^K_1, VW^V_1), ..., \text{Attention}(QW^Q_H, KW^K_H, VW^V_H) ) W^O \]
where:
- \(H\) is the number of heads.
- \(Q \in \mathbb{R}^{N \times D_m}, K \in \mathbb{R}^{M \times D_m}, V \in \mathbb{R}^{M \times D_m}\) are the original queries, keys, and values with dimensionality \(D_m\).
- \(W^Q_i, W^K_i, W^V_i\) are projections for each of the heads.
- \(W^O\) is the projection back into \(D_m\).
Position-wise feed-forward network
\[ \text{FFN}(H) = \text{ReLU}(HW_1 + b_1)W_2 + b_2 \]
- \(H\) is the hidden output for a single token.
- Input and output size is the same.
- Hidden layer is typically bigger than input/output layers.
- Weights do not change between positions.
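A sketch of the module (making the hidden size \(4\times\) the model size is a common convention, assumed here):

```python
import torch.nn as nn

class PositionWiseFFN(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        # The same weights are applied independently at every position.
        self.w1 = nn.Linear(d_model, d_hidden)
        self.w2 = nn.Linear(d_hidden, d_model)

    def forward(self, h):
        return self.w2(nn.functional.relu(self.w1(h)))
```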
PWFFN complexity
\(O(TD^2)\) where \(T\) is the input/output length, and \(D\) is the key/value dimension.
PWFFN num parameters
\(8D^2\) where \(D\) is the key/value dimension.
Encoder and decoder use examples
- Encoder & decoder: Language translation.
- Encoder only: Classifying text.
- Decoder only: Generating text.
Causal transformer
These are the same as “auto-regressive” transformers, where the task is to predict the next input token given the previous tokens.
Vision transformers (ViT)
This paper applies Transformers to computer vision in a very simple way:
- Segment the image into patches.
- Flatten each patch’s pixels.
- Prepend a learnable “class” token, to signify to the model that this output should be the class of the image.
- Run a Transformer over the flattened patches.
- Add a 1d position embedding. 2d position embeddings don’t improve performance.
- If doing classification, use an encoder-only Transformer and take the first output’s output embedding as the class.
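A sketch of the patching step (square patches of side \(p\) assumed; the class token and position embeddings are added afterwards):

```python
import torch

def patchify(img, p):
    # img: (channels, H, W) -> (num_patches, p * p * channels) flattened patches.
    c, h, w = img.shape
    patches = img.unfold(1, p, p).unfold(2, p, p)   # (c, H/p, W/p, p, p)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * p * p)
```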
IRIS
This paper introduces Imagination with auto-Regression over an Inner Speech (IRIS). This is a sample efficient RL training scheme.
- Trains a discrete autoencoder over game frames.
- Trains an auto-regressive Transformer to transition between the discrete encoded frames.
- The Transformer also predicts reward & termination probability.
- Trains a policy on the raw frames, entirely using the Transformer’s simulation.
Elastic net regularization
Elastic net regularization is a regularization term that linearly weights the \(L_1\) and \(L_2\) terms of some parameters.
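One way to write it, for parameters \(w\) with mixing weights \(\lambda_1, \lambda_2\):
\[ \Omega(w) = \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2 \]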
Techniques for Training Large Neural Networks
Post.
Data parallelism
- Split the batch between different GPUs.
- Take the average of the gradients, sync across all GPUs, and apply the change.
- You can do this asynchronously, but this degrades stability.
- In practice, you typically do this synchronously.
- Relies on the model’s weights fitting on the GPU.
Pipeline parallelism
- Decompose the model into N stages, such that each stage can fit on a GPU.
- Do typical pipelining optimizations such that each stage is rarely idle.
Tensor parallelism
- If one operation is too expensive for a single GPU, we can split it.
- E.g. separate GPUs can handle different parts of a matrix multiplication.
Mixture-of-Experts (MoE) parallelism
- Each GPU only works with a different set of neurons.
- Add a “gating network” to choose which expert to run.
Kernel trick
- Given a mapping \(f : \mathbb{R}^n \to \mathbb{R}^m\).
- Given two points \(x, y\).
- We want to find \(f(x)^T f(y)\).
- If \(f\) is hard to compute, we can instead use a kernel that computes the same value directly: $$ k(x, y) = f(x)^T f(y) $$
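For example, the kernel \(k(x, y) = (x^T y)^2\) corresponds to the feature map of all pairwise coordinate products, but never needs to construct it; a quick check:

```python
import torch

def f(x):
    # Explicit feature map for k(x, y) = (x . y)^2: all pairwise coordinate products.
    return torch.outer(x, x).flatten()

x, y = torch.randn(3), torch.randn(3)
kernel_value = (x @ y) ** 2     # cheap: stays in the original space
explicit_value = f(x) @ f(y)    # same value via the larger feature space
assert torch.allclose(kernel_value, explicit_value, atol=1e-5)
```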
Gaussian process
Gaussian processes are infinite-dimensional gaussian distributions that use a kernel function to define covariance instead of a matrix. The kernel function can be thought of as a prior over functions.
They can be useful for modelling (e.g. for regression) functions \(f : [0, 1] \to [0, 1]\), where the dimensions are indexed by the infinite set of inputs \([0, 1]\).
Regression (perfect samples)
- We want to model a continuous function \(y = f(x)\) with a covariance kernel \(k(x, x')\).
- We have training datapoints \(X_1, y_1\) and want to predict \(y_2\) for values \(X_2\).
We can define this as drawing from the multivariate gaussian distribution, i.e. “discretizing” the gaussian process.
\[ \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \sim \mathcal{N}( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} ) \]
where \(\mu_i = \text{mean}(X_i)\) and \(\Sigma_{ij} = k(X_i, X_j)\).
We can then get the conditional distribution:
\[ p(y_2 | X_2, y_1, X_1) = \mathcal{N}(\mu_{2|1}, \Sigma_{2|1}) \\ \mu_{2|1} = \mu_2 + \Sigma_{21} \Sigma_{11}^{-1} (y_1 - \mu_1) \\ \Sigma_{2|1} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \]
Regression with noisy samples
As above but we model the noise of the observations by adding a value to the observation’s variance.
\[ \Sigma_{11} = k(X_1, X_1) + \sigma_\epsilon^2 I \]
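A minimal sketch of the noisy-sample case with an RBF kernel (a zero prior mean is assumed here, rather than the \(\text{mean}(X_i)\) used above):

```python
import torch

def rbf(a, b, lengthscale=1.0):
    # Squared-exponential covariance kernel.
    return torch.exp(-torch.cdist(a, b) ** 2 / (2 * lengthscale ** 2))

def gp_predict(X1, y1, X2, noise=1e-2):
    # Posterior mean and covariance of y2 at X2, given noisy observations (X1, y1).
    K11 = rbf(X1, X1) + noise * torch.eye(len(X1))
    K12, K22 = rbf(X1, X2), rbf(X2, X2)
    solve = torch.linalg.solve(K11, K12)   # K11^{-1} K12
    mu = solve.T @ y1                      # Sigma_21 Sigma_11^{-1} y1
    cov = K22 - K12.T @ solve              # Sigma_22 - Sigma_21 Sigma_11^{-1} Sigma_12
    return mu, cov
```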
L2 norm
\[ \sqrt{\sum_i x_i^2} \]
Standard deviation/variance
\[ \text{variance} = \sigma^2 = \frac{1}{n} \sum_i (x_i - \mu)^2 \\ \text{standard deviation} = \sigma \]
Parametric vs. non-parametric
Parametric models assume a fixed functional form (a fixed number of parameters) for the underlying data distribution, while non-parametric models don’t: their capacity can grow with the data.
Adam
- Works well with noisy / sparse gradients, and non-stationary objectives.
- Track two exponentially decayed statistics of the gradient:
- The mean: \(m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t\)
- The uncentered variance: \(v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\)
- N.B.: Statistics are bias-corrected early on in learning as \(\hat{m}_t = m_t / (1 - \beta_1^t) ; \hat{v}_t = v_t / (1 - \beta_2^t)\)
- Steps are \(\Delta_t = \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\).
- i.e. step less if the variance is high.
- Step size is capped at \(|\Delta_t| \leq \alpha\), except in cases of extreme sparsity.
- \(|\hat{m}_t / \sqrt{\hat{v}_t}| \leq 1\) as the expected squared value is greater than or equal to the squared expected value (a sketch of the update follows).
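A sketch of a single Adam update following the statistics above (a standalone function with assumed names, not the torch optimiser API):

```python
import torch

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # decayed mean of gradients
    v = b2 * v + (1 - b2) * g ** 2     # decayed uncentered variance
    m_hat = m / (1 - b1 ** t)          # bias correction, matters early in training
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (v_hat.sqrt() + eps)  # step less where variance is high
    return w, m, v
```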
AdamW
Paper. Excludes the weight decay term from Adam’s tracked statistics, applying it directly to the weights instead. Unlike for SGD, weight decay and \(L_2\) regularisation are not equivalent for Adam, and this decoupled weight decay empirically improves performance.
Bradley-Terry model
\[ P(i > j) = \frac{p_i}{p_i + p_j} \]
Where typically \(p_i\) is parameterized as \(e^{\beta_i}\).
Einsum
Einstein summation convention (einsum) is used by PyTorch/NumPy to make matrix operations easier.
For example, matmul:
```python
import torch

a = torch.randn(5, 6)
b = torch.randn(6, 7)
c = torch.einsum("ij,jk->ik", a, b)  # same as a @ b
```
Einsum follows these rules:
- Repeating letters in the inputs causes those dimensions to be multiplied together.
- Excluding an input letter from the output causes that dimension to be summed.
- The order of the output dimensions is followed, so you can do transposes.
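A couple more illustrations of these rules (shapes are assumed):

```python
import torch

x = torch.randn(5, 6)
torch.einsum("ij->j", x)       # drop i from the output: sum over rows
torch.einsum("ij->ji", x)      # reorder the output dimensions: transpose
torch.einsum("ij,ij->", x, x)  # repeat and drop everything: sum of element-wise products
```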
Rotary Positional Embedding (RoPE)
We embed vectors by rotating them dependent on their position in the sequence. This has a great attribute for positional embeddings: the dot-product of positionally-embedded vectors is dependent only on the pre-embedded vectors & their relative position.
We pick an \(\epsilon \in (0, \frac{\pi}{2N}]\) where \(N\) is the sequence length, and encoding becomes:
\[ \text{RoPE}(x, m) = x e^{mi\epsilon} \]
i.e. We treat pairs of elements as complex numbers and element-wise multiply by a rotation between \(1\) and \(i\), depending on the position in the sequence.
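A minimal sketch of this single-frequency form (the full RoPE uses a different frequency per feature pair; an even feature dimension is assumed):

```python
import torch

def rope(x, eps):
    # x: (seq_len, d) with d even. Treat consecutive feature pairs as complex
    # numbers and rotate the pair at position m by the angle m * eps.
    xc = torch.view_as_complex(x.float().reshape(x.shape[0], -1, 2))  # (seq_len, d/2)
    m = torch.arange(x.shape[0])
    rot = torch.exp(1j * m * eps).unsqueeze(-1)                       # e^{m i eps}
    return torch.view_as_real(xc * rot).flatten(-2)
```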
The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction
Paper. Removes small-singular-value components of LLM weight matrices at specific layers (a low-rank approximation). Finds that this improves performance on tasks, and increases robustness to rephrasing.
Making Real-World Reinforcement Learning Practical
YouTube. What you need for sample-efficient real-world RL:
- Good regularization.
- Have as many gradient steps as possible per datum.
- Limit “drastic” actions when in unfamiliar situations (to maximize useful data).
- Multi-task so that there are experiences to be gained in failure states too.
Gradient flow
Gradient descent in the limit where the step size approaches zero.
On the Paradox of Learning to Reason from Data
Paper. Shows that BERT can’t generalise on and-only SAT problems. This is despite having the model complexity to handle these functions. However, during use chain of thought.
A General Theoretical Paradigm to Understand Learning from Human Preferences
Paper. Presents a unified framework, \(\Psi\)PO, covering RLHF and DPO, and introduces a new learning algorithm, IPO, with better properties: it works offline (like DPO) and is not sensitive to deterministic preferences.
Interesting tidbit from this paper: One reason given for the instability of RLHF is that it has this sensitivity to deterministic preferences, but they counter this by using heavy regularisation when fitting the reward function.
Are Emergent Abilities of Large Language Models a Mirage?
Paper. Shows that jumps in metric values as you scale up models, aka emergent capabilities, are actually a product of nonlinear metric choice.
This suggests that we may be able to measure new capabilities ahead of time.
Calibrating Sequence Likelihood Improves Conditional Language Generation
Paper. Assesses the sequence-level likelihood of various completions, and trains against this metric. Finds that this does away with decoding errors that otherwise have to be patched by things like length normalisation and repetition penalties.