Machine learning

Image generation

Quantizing vectors

CLIP

Paper. Creates joint text and image embeddings from 400M image and text pairs. This is done by training embeddings for the pairs to have low cosine distance, and training non-pairs to have high cosine distance.
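A minimal sketch of the symmetric contrastive loss (the batch size, embedding dimension, and fixed temperature here are illustrative; CLIP learns the temperature):

import torch
import torch.nn.functional as F

# Assume image_emb[i] and text_emb[i] are the embeddings of the i-th matching pair.
image_emb = F.normalize(torch.randn(32, 512), dim=-1)
text_emb = F.normalize(torch.randn(32, 512), dim=-1)

logits = image_emb @ text_emb.T / 0.07   # pairwise cosine similarities / temperature
targets = torch.arange(logits.shape[0])  # the matching pair sits on the diagonal

# Cross-entropy in both directions pulls pairs together and pushes non-pairs apart.
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2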

Diffusion models

Tutorial, more math.

Guided diffusion

Given a classifier \(p_{\phi}\) and classes \(y \in Y\), we define a new noise-reversion method:

\[ \hat{\mu}_{\theta}(x_t | y) = \mu_{\theta}(x_t) + s \Sigma_{\theta}(x_t) \nabla_{x_t} \log p_{\phi}(y | x_t) \]

This updates the image means towards being classified correctly. \(s\) controls the weighting of the guidance, trading off between quality & diversity.

Classifier-free guidance

We train \(\epsilon_{\theta}(x_t | y)\). To maintain diversity we don’t use this directly, and instead use:

\[ \hat{\epsilon}_{\theta}(x_t | y) = \epsilon_{\theta}(x_t | \emptyset) + s( \epsilon_{\theta}(x_t | y) - \epsilon_{\theta}(x_t | \emptyset) ) \]
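A sketch of the sampling-time combination (eps_model and the null conditioning token are illustrative names, not from the paper):

def guided_eps(eps_model, x_t, t, y, null_y, s):
    # eps_model is the trained noise predictor; null_y stands for "no caption".
    eps_uncond = eps_model(x_t, t, null_y)
    eps_cond = eps_model(x_t, t, y)
    # Move the prediction away from the unconditional output, towards the conditional one.
    return eps_uncond + s * (eps_cond - eps_uncond)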

CLIP guidance

We train CLIP embeddings with noisy (and non-noisy) images. When diffusing, we use the image & caption CLIP embeddings to guide the images to an embedding that matches the caption:

\[ \hat{\mu}_{\theta} = \mu_{\theta}(x_t | c) + s \Sigma_{\theta}(x_t | c) \nabla_{x_t}(f(x_t) \cdot g(c)) \]

where \(f\) and \(g\) are the image & caption CLIP embedding functions.

GLIDE (precursor to DALLE-2)

Paper. They train a classifier-free guided diffusion model. They also train an up-sampling diffusion model. Text conditioning is done using a Transformer. They find that classifier-free guidance works better than CLIP guidance.

Unconditional image generation

The GLIDE training data has 20% of its samples with no text. This allows the model to produce images without any captions.

Image in-painting

The GLIDE diffusion models are fine-tuned to perform in-painting. The diffusion model’s input is (1) the RGB channels of the noisy image, (2) the RGB channels of the masked image, and (3) the boolean mask.
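A sketch of how the extra channels could be stacked into the model input (names and sizes are illustrative):

import torch

noisy_rgb = torch.randn(1, 3, 64, 64)   # x_t
masked_rgb = torch.randn(1, 3, 64, 64)  # original image with the region to in-paint removed
mask = torch.ones(1, 1, 64, 64)         # 1 where pixels are known, 0 where they must be filled in
model_input = torch.cat([noisy_rgb, masked_rgb, mask], dim=1)  # (1, 7, 64, 64)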

Reinforcement learning

On-policy vs. off-policy RL

Source. On-policy learns values conditioned on the current policy being followed. Off-policy learns values based on a potentially different policy being followed (e.g. Q-learning updates Q-values based on taking the greedy action in the next step).

The difference between on-policy and off-policy collapses if a greedy policy is always being followed. This doesn’t happen in practice, as greedy policies never explore.
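The contrast shows up in the one-step updates: SARSA (on-policy) bootstraps from the action the policy actually takes next, Q-learning (off-policy) from the greedy action. A tabular sketch (all values illustrative):

import numpy as np

n_states, n_actions, alpha, gamma = 10, 4, 0.1, 0.99
Q = np.zeros((n_states, n_actions))
s, a, r, s2, a2 = 0, 1, 1.0, 3, 2  # one transition; a2 is the action the policy chose in s2

# SARSA (on-policy): use the action actually taken next.
Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])

# Q-learning (off-policy): use the greedy action, regardless of what the policy did.
Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])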

In-context Reinforcement Learning with Algorithm Distillation

Paper. Fit a transformer on RL learning trajectories. The transformer can then learn out-of-distribution tasks. The performance isn’t quite as good as the underlying RL algorithm, but it can be more sample efficient: you can subsample the training data so the transformer learns to imitate a “faster” RL algorithm.

Time Series Forecasting With Deep Learning: A Survey

This paper surveys RNNs, CNNs, and attention-based methods for time series forecasting. One interesting conclusion is that hybrid models are winning the competitions. This might be because adding constraints to the models allows learning from smaller datasets.

Transformers

Mainly taken from A Survey of Transformers.

High-level architecture

Attention modules

\[ \text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^\top}{\sqrt{D_k}}) V = AV \]

where: \(Q\), \(K\), and \(V\) are the query, key, and value matrices, \(D_k\) is the key dimension, and \(A\) is the matrix of attention weights.
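A direct implementation of the formula (shapes are illustrative):

import torch
import torch.nn.functional as F

T, D_k, D_v = 8, 64, 64
Q, K, V = torch.randn(T, D_k), torch.randn(T, D_k), torch.randn(T, D_v)

A = F.softmax(Q @ K.T / D_k**0.5, dim=-1)  # (T, T) attention weights
out = A @ V                                # (T, D_v)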

Attention complexity

\(O(T^2D)\) where \(T\) is the input/output length, and \(D\) is the key/value dimension.

Attention num parameters

\(4D^2\) where \(D\) is the key/value dimension.

Multi-head attention

\[ \text{MultiHeadAttention}(Q, K, V) = \text{Concat}( \text{Attention}(QW^Q_1, KW^K_1, VW^V_1), ..., \text{Attention}(QW^Q_H, KW^K_H, VW^V_H) ) W^O \]

where: the \(W^Q_h\), \(W^K_h\), \(W^V_h\) matrices project the inputs into each of the \(H\) heads’ subspaces, and \(W^O\) projects the concatenated head outputs back to the model dimension.
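PyTorch’s built-in module bundles the per-head projections, the concatenation, and the output projection; a usage sketch:

import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(2, 16, 512)       # (batch, sequence length, model dim)
out, attn_weights = mha(x, x, x)  # self-attention: Q = K = V = x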

Position-wise feed-forward network

\[ \text{FFN}(H) = \text{ReLU}(HW_1 + b_1)W_2 + b_2 \]
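As a module (the 4x hidden expansion is the usual convention, assumed here):

import torch.nn as nn

D = 512
ffn = nn.Sequential(nn.Linear(D, 4 * D), nn.ReLU(), nn.Linear(4 * D, D))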

PWFFN complexity

\(O(TD^2)\) where \(T\) is the input/output length, and \(D\) is the key/value dimension.

PWFFN num parameters

\(8D^2\) where \(D\) is the key/value dimension.

Encoder and decoder use examples

Causal transformer

These are the same as “auto-regressive” transformers, where the task is to predict the next input token given the previous tokens.

Vision transformers (ViT)

This paper applies Transformers to computer vision in a very simple way:

  1. Segment the image into patches.
  2. Flatten each patch’s pixels (see the sketch after this list).
  3. Prepend a learnable “class” token, to signify to the model that this output should be the class of the image.
  4. Add a 1d position embedding; 2d position embeddings don’t affect performance.
  5. Run a Transformer over the flattened patches.
  6. If doing classification, use an encoder-only Transformer and take the class token’s output embedding as the class.
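A sketch of steps 1-3, plus the linear projection ViT applies to each flattened patch (patch size and dimensions are illustrative):

import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)
patch, dim = 16, 768
# 1-2. Segment into 16x16 patches and flatten each patch's pixels.
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)        # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, -1)  # (1, 196, 768)
tokens = nn.Linear(3 * patch * patch, dim)(patches)                  # project to the model dim
# 3. Prepend the learnable class token.
cls = nn.Parameter(torch.zeros(1, 1, dim))
tokens = torch.cat([cls.expand(1, -1, -1), tokens], dim=1)           # (1, 197, 768)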

IRIS

This paper introduces Imagination with auto-Regression over an Inner Speech (IRIS). This is a sample-efficient RL training scheme.

Elastic net regularization

Elastic net regularization is a regularization term that linearly weights the \(L_1\) and \(L_2\) terms of some parameters.
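e.g. for parameters \(w\) the added penalty is:

\[ \lambda_1 \lVert w \rVert_1 + \lambda_2 \lVert w \rVert_2^2 \]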

Techniques for Training Large Neural Networks

Post.

Data parallelism

Pipeline parallelism

Tensor parallelism

Mixture-of-Experts (MoE) parallelism

Kernel trick

Gaussian process

Gaussian processes are infinite-dimensional gaussian distributions that use a kernel function to define covariance instead of a matrix. The kernel function can be thought of as a prior over functions.

They can be useful for modelling (e.g. for regression) functions \(f : [0, 1] \to [0, 1]\), where the dimensions are indexed by the infinite set of inputs \([0, 1]\).

Regression (perfect samples)

We can define this as drawing from the multivariate gaussian distribution, i.e. “discretizing” the gaussian process.

\[ \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right) \]

where \(\mu_i = \text{mean}(X_i)\) and \(\Sigma_{ij} = k(X_i, X_j)\).

We can then get the conditional distribution:

\[ p(y_2 | X_2, y_1, X_1) = \mathcal{N}(\mu_{2|1}, \Sigma_{2|1}) \\ \mu_{2|1} = \mu_2 + \Sigma_{21} \Sigma_{11}^{-1} (y_1 - \mu_1) \\ \Sigma_{2|1} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \]

Regression with noisy samples

As above, but we model observation noise by adding a noise variance term to the observed points’ covariance.

\[ \Sigma_{11} = k(X_1, X_1) + \sigma_\epsilon^2 I \]
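A sketch of the noisy-observation posterior with an RBF kernel and zero prior mean (the kernel and hyperparameters are illustrative):

import numpy as np

def rbf(A, B, length_scale=1.0):
    # k(a, b) = exp(-(a - b)^2 / (2 l^2)) for every pair of points.
    return np.exp(-((A[:, None] - B[None, :]) ** 2) / (2 * length_scale**2))

X1 = np.array([0.1, 0.4, 0.8])  # observed inputs
y1 = np.sin(2 * np.pi * X1)     # observed outputs
X2 = np.linspace(0, 1, 50)      # inputs to predict at
noise = 0.01                    # sigma_epsilon^2

S11 = rbf(X1, X1) + noise * np.eye(len(X1))
S12, S21, S22 = rbf(X1, X2), rbf(X2, X1), rbf(X2, X2)

mu_2_given_1 = S21 @ np.linalg.solve(S11, y1)
Sigma_2_given_1 = S22 - S21 @ np.linalg.solve(S11, S12)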

L2 norm

\[ \sqrt{\sum_i x_i^2} \]

Standard deviation/variance

\[ \text{variance} = \sigma^2 = \frac{1}{n} \sum_i (x_i - \mu)^2 \\ \text{standard deviation} = \sigma \]

Parametric vs. non-parametric

Parametric models assume a fixed functional form (a fixed number of parameters) for the underlying data distribution; non-parametric models don’t, so their effective complexity can grow with the data.

Adam

Paper.

AdamW

Paper. Decouples weight decay from Adam’s tracked statistics: the decay is applied directly to the weights rather than being folded into the gradient as an \(L_2\) term, which empirically improves performance.
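The decoupled update, with Adam’s bias-corrected moments \(\hat{m}\), \(\hat{v}\), learning rate \(\eta\) and decay \(\lambda\):

\[ w \leftarrow w - \eta \left( \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon} + \lambda w \right) \]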

Bradley-Terry model

\[ P(i > j) = \frac{p_i}{p_i + p_j} \]

Where typically \(p_{i,j}\) is parameterized by \(e^{\beta_{i,j}}\).
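With that parameterization the win probability reduces to a sigmoid of the score difference:

\[ P(i > j) = \frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}} = \sigma(\beta_i - \beta_j) \]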

Einsum

Einstein summation convention (einsum) is used by PyTorch/NumPy to make matrix operations easier.

For example, matmul:

import torch

a = torch.randn(5, 6)
b = torch.randn(6, 7)
c = torch.einsum("ij,jk->ik", a, b)  # same as a @ b, shape (5, 7)

Einsum follows these rules:

  1. An index repeated across input operands means those axes are matched up and multiplied element-wise.
  2. An index that is omitted from the output is summed over.
  3. The order of the indices after "->" determines the order of the output’s axes (so reordering them transposes the result).

Rotary Positional Embedding (RoPE)

We embed vectors by rotating them dependent on their position in the sequence. This has a great attribute for positional embeddings: the dot-product of positionally-embedded vectors is dependent only on the pre-embedded vectors & their relative position.

We pick an \(\epsilon \in (0, \frac{\pi}{2N}]\) where \(N\) is the sequence length, and encoding becomes:

\[ \text{RoPE}(x, m) = x e^{mi\epsilon} \]

i.e. we element-wise multiply by a rotation whose angle grows with the position in the sequence, ranging from \(1\) (no rotation) up to at most \(i\) (a quarter turn).
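A sketch with a single rotation frequency, pairing adjacent dimensions into complex numbers (names are illustrative; the full RoPE uses a different frequency per pair):

import torch

def rope(x, m, eps):
    # Treat adjacent pairs of dimensions as complex numbers and rotate them by the angle m * eps.
    xc = torch.view_as_complex(x.reshape(*x.shape[:-1], -1, 2))
    rot = torch.polar(torch.ones_like(xc.real), torch.full_like(xc.real, m * eps))
    return torch.view_as_real(xc * rot).reshape(x.shape)

q = torch.randn(64)
q_at_pos_3 = rope(q, m=3, eps=0.01)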

The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction

Paper. Removes low singular value components of LLM weight matrices at specific layers. Finds that this improves performance on tasks, and increases robustness to rephrasing.

Making Real-World Reinforcement Learning Practical

YouTube. What you need for sample-efficient real-world RL:

Gradient flow

Gradient descent in the limit where the step size approaches zero.
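i.e. the parameters follow the ODE:

\[ \frac{d\theta(t)}{dt} = -\nabla_\theta L(\theta(t)) \]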

On the Paradox of Learning to Reason from Data

Paper. Shows that BERT can’t generalise on and-only SAT problems. This is despite having the model complexity to handle these functions. However, during use chain of thought.

A General Theoretical Paradigm to Understand Learning from Human Preferences

Paper. Shows a unified framework for RLHF and DPO, \(\Psi\)PO, and introduces a new learning algorithm IPO that has better properties: it works offline (like DPO) and is not sensitive to deterministic preferences.

Interesting tidbit from this paper: One reason given for the instability of RLHF is that it has this sensitivity to deterministic preferences, but they counter this by using heavy regularisation when fitting the reward function.

Are Emergent Abilities of Large Language Models a Mirage?

Paper. Shows that jumps in metric values as you scale up models, aka emergent capabilities, are actually a product of nonlinear metric choice.

This suggests that we may be able to measure new capabilities ahead of time.

Calibrating Sequence Likelihood Improves Conditional Language Generation

Paper. Assesses the sequence-level likelihood of various completions and trains against this metric. Finds that this does away with decoding errors that otherwise have to be patched by things like length normalisation and repetition penalties.