Machine learning

Image generation

Quantizing vectors


CLIP

Paper. Creates joint text and image embeddings from 400M image and text pairs. This is done contrastively: embeddings of matched pairs are trained to have low cosine distance, and embeddings of non-pairs to have high cosine distance.
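The symmetric contrastive objective can be sketched in numpy; the function name and the `temperature` value are illustrative, not from the paper.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Sketch of a CLIP-style symmetric contrastive loss.

    Matched (image, text) pairs sit on the diagonal of the similarity
    matrix and are pushed together; off-diagonal entries are pushed apart.
    `temperature` is a typical value, not taken from the notes.
    """
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (N, N) similarity matrix
    labels = np.arange(len(img))         # image i matches caption i

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image->text and text->image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```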

Diffusion models

Tutorial, more math.

Guided diffusion

Given a classifier \(p_{\phi}\) and classes \(y \in Y\), we define a new reverse-process mean that nudges samples towards the target class: \[ \hat{\mu}_{\theta}(x_t | y) = \mu_{\theta}(x_t) + s \Sigma_{\theta}(x_t) \nabla_{x_t} \log p_{\phi}(y | x_t) \]

This updates the image means towards being classified correctly. \(s\) controls the weighting of the guidance, trading off between quality & diversity.
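A minimal sketch of this mean update, treating the covariance as diagonal; all argument names are placeholders for the corresponding model outputs.

```python
import numpy as np

def guided_mean(mu, sigma, grad_log_p, s=1.0):
    """Shift the predicted denoising mean toward higher classifier likelihood.

    Implements mu_hat = mu + s * Sigma * grad_x log p(y | x_t), assuming a
    diagonal covariance `sigma` so the product is elementwise. Larger s
    means stronger guidance (quality over diversity).
    """
    return mu + s * sigma * grad_log_p
```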

Classifier-free guidance

We train \(\epsilon_{\theta}(x_t | y)\). To maintain diversity we don’t use this directly, and instead use: \[ \hat{\epsilon}_{\theta}(x_t | y) = \epsilon_{\theta}(x_t | \emptyset) + s( \epsilon_{\theta}(x_t | y) - \epsilon_{\theta}(x_t | \emptyset) ) \]
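The update itself is a one-liner; a sketch with illustrative names:

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, s):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the conditional one by guidance scale s.

    Per the formula above: s=0 ignores the condition, s=1 recovers the
    conditional prediction, and s>1 over-emphasizes the condition
    (higher fidelity, less diversity)."""
    return eps_uncond + s * (eps_cond - eps_uncond)
```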

CLIP guidance

We train CLIP embeddings with noisy (and non-noisy) images. When diffusing, we use the image & caption CLIP embeddings to guide the images to an embedding that matches the caption:

\[ \hat{\mu}_{\theta} = \mu_{\theta}(x_t | c) + s \Sigma_{\theta}(x_t | c) \nabla_{x_t}(f(x_t) \cdot g(c)) \]

where \(f\) and \(g\) are the image & caption CLIP embedding functions.

GLIDE (precursor to DALLE-2)

Paper. They train a classifier-free guided diffusion model. They also train an up-sampling diffusion model. Text conditioning is done using a Transformer. They find that classifier-free guidance works better than CLIP guidance.

Unconditional image generation

20% of GLIDE’s training samples have their captions replaced with the empty sequence. This allows the model to produce images without any caption (and supplies the unconditional model used in classifier-free guidance).

Image in-painting

The GLIDE diffusion models are fine-tuned to perform in-painting. The fine-tuned models take as input (1) the RGB channels of the noisy image, (2) the RGB channels of the masked source image, and (3) a boolean mask channel.
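Assembling that conditioning input might look like this in numpy; the channel-last layout and the convention that mask = 1 means "keep this pixel" are assumptions, not from the paper.

```python
import numpy as np

def inpainting_input(noisy_rgb, image_rgb, mask):
    """Stack the 7-channel in-painting input: 3 channels of the noisy image
    x_t, 3 channels of the masked source image, and 1 mask channel.

    Assumes (H, W, C) layout and mask values in {0, 1}, where 1 marks pixels
    to keep and 0 marks the region to fill in.
    """
    masked = image_rgb * mask[..., None]  # zero out the region to fill
    return np.concatenate(
        [noisy_rgb, masked, mask[..., None].astype(noisy_rgb.dtype)], axis=-1
    )
```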

Reinforcement learning

On-policy vs. off-policy RL

Source. On-policy learns values conditioned on the current policy being followed. Off-policy learns values based on a potentially different policy than the one being followed (e.g. Q-learning updates Q-values assuming the greedy action is taken in the next step).

The difference between on-policy and off-policy collapses if a greedy policy is always being followed. This doesn’t happen in practice, as greedy policies never explore.
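The distinction shows up directly in the tabular update rules, sketched here with a dict-of-lists Q-table; SARSA stands in as the on-policy example, and the hyperparameter values are illustrative.

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.99):
    """On-policy (SARSA): bootstrap from the action a2 that the behavior
    policy actually took in the next state s2."""
    Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])

def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.99):
    """Off-policy (Q-learning): bootstrap from the greedy action in s2,
    regardless of what the behavior policy will actually do there."""
    Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
```

When the behavior policy is itself greedy, `a2` is always the argmax action and the two updates coincide, which is exactly the collapse described above.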

In-context Reinforcement Learning with Algorithm Distillation

Paper. Fit a transformer on RL learning trajectories. The transformer can then learn out-of-distribution tasks. The performance isn’t quite as good as the underlying RL algorithm, but it can be more sample efficient: you can subsample the training data so the transformer learns to imitate a “faster” RL algorithm.

Time Series Forecasting With Deep Learning: A Survey

This paper surveys RNNs, CNNs, and attention-based methods for time series forecasting. One interesting conclusion is that hybrid models are winning the competitions. This might be because adding constraints to the models allows learning from smaller datasets.


Transformers

Mainly taken from A Survey of Transformers.

High-level architecture

Attention modules

\[ \text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^\top}{\sqrt{D_k}}) V = AV \] where \(Q\), \(K\), and \(V\) are the query, key, and value matrices, \(D_k\) is the key dimension, and \(A\) is the matrix of attention weights.
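A numpy sketch of the formula above:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q, K: (T, D_k); V: (T, D_v). Returns A V where
    A = softmax(Q K^T / sqrt(D_k)), with the softmax taken row-wise."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # each row sums to 1
    return A @ V
```

With identical keys the attention weights are uniform, so every output row is just the mean of the value rows, a handy sanity check.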

Attention complexity

\(O(T^2D)\) where \(T\) is the input/output length and \(D\) is the key/value dimension.

Attention num parameters

\(4D^2\) where \(D\) is the key/value dimension: one \(D \times D\) matrix each for the query, key, value, and output projections.

Multi-head attention

\[ \text{MultiHeadAttention}(Q, K, V) = \text{Concat}( \text{Attention}(QW^Q_1, KW^K_1, VW^V_1), ..., \text{Attention}(QW^Q_H, KW^K_H, VW^V_H) ) W^O \] where \(H\) is the number of heads, \(W^Q_i\), \(W^K_i\), and \(W^V_i\) are the per-head projection matrices, and \(W^O\) is the output projection.
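A sketch of the self-attention case in numpy; passing the per-head projections as plain lists and giving each head dimension \(D/H\) are assumptions for illustration.

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o):
    """Multi-head self-attention over one sequence x: (T, D).

    W_q, W_k, W_v: lists of H per-head projections, each (D, D/H);
    W_o: (D, D). Self-attention case where Q = K = V = x."""
    def attend(Q, K, V):
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        A = np.exp(scores - scores.max(axis=-1, keepdims=True))
        A /= A.sum(axis=-1, keepdims=True)
        return A @ V

    # Run each head on its own projected copy, concat, then project out.
    heads = [attend(x @ wq, x @ wk, x @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o
```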

Position-wise feed-forward network

\[ \text{FFN}(H) = \text{ReLU}(HW_1 + b_1)W_2 + b_2 \]

PWFFN complexity

\(O(TD^2)\) where \(T\) is the input/output length and \(D\) is the key/value dimension.

PWFFN num parameters

\(8D^2\) where \(D\) is the key/value dimension, assuming the usual inner dimension of \(4D\): \(W_1\) is \(D \times 4D\) and \(W_2\) is \(4D \times D\).

Encoder and decoder use examples

Causal transformer

These are the same as “auto-regressive” transformers, where the task is to predict the next input token given the previous tokens.

Vision transformers (ViT)

This paper applies Transformers to computer vision in a very simple way:

  1. Segment the image into patches.
  2. Flatten each patch’s pixels and linearly project them to the model dimension.
  3. Prepend a learnable “class” token, to signify to the model that this output should be the class of the image.
  4. Add a 1d position embedding. 2d position embeddings don’t improve performance.
  5. Run a Transformer over the embedded patches.
  6. If doing classification, use an encoder-only Transformer and take the first output embedding (the class token’s) as the class.
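The early steps of this recipe can be sketched in numpy; the helper names `patchify` and `vit_tokens` are mine, not from the paper, and learned parameters are passed in as plain arrays.

```python
import numpy as np

def patchify(image, patch):
    """Cut an (H, W, C) image into non-overlapping patches and flatten
    each one. Returns (num_patches, patch * patch * C)."""
    H, W, C = image.shape
    patches = []
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            patches.append(image[i:i + patch, j:j + patch].reshape(-1))
    return np.stack(patches)

def vit_tokens(image, patch, W_proj, cls_token, pos_emb):
    """Flattened patches -> linear projection -> prepend class token ->
    add 1d position embeddings. The result is what the Transformer sees."""
    x = patchify(image, patch) @ W_proj       # (num_patches, D)
    x = np.concatenate([cls_token[None], x])  # prepend class token
    return x + pos_emb                        # (num_patches + 1, D)
```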


IRIS

This paper introduces Imagination with auto-Regression over an Inner Speech (IRIS), a sample-efficient RL training scheme.

Elastic net regularization

Elastic net regularization is a regularization term that linearly weights the \(L_1\) and \(L_2\) terms of some parameters.
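A sketch of the penalty; the `l1_ratio` parameterization follows scikit-learn's convention and is an assumption, since the notes don't fix a weighting.

```python
import numpy as np

def elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5):
    """Linear combination of L1 and L2 penalties on parameters w.

    l1_ratio=1 is pure lasso (L1), l1_ratio=0 is pure ridge (L2);
    the 0.5 factor on the L2 term follows scikit-learn's convention."""
    l1 = np.abs(w).sum()
    l2 = (w ** 2).sum()
    return alpha * (l1_ratio * l1 + (1 - l1_ratio) * 0.5 * l2)
```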

Techniques for Training Large Neural Networks


Data parallelism

Run the same model on a different slice of the batch on each worker, then average the gradients.

Pipeline parallelism

Split the model’s layers across workers, passing activations (and gradients) between them.

Tensor parallelism

Split individual operations, e.g. large matrix multiplications, across workers.

Mixture-of-Experts (MoE) parallelism

Route each example through a small subset of “expert” sub-networks, which can be placed on different workers.

Kernel trick

Replace inner products \(\langle \phi(x), \phi(x') \rangle\) in a high-dimensional feature space with a kernel function \(k(x, x')\), without ever computing \(\phi\) explicitly.

Gaussian process

Gaussian processes are infinite-dimensional gaussian distributions that use a kernel function to define covariance instead of a matrix. The kernel function can be thought of as a prior over functions.

They can be useful for modelling (e.g. for regression) functions \(f : [0, 1] \to [0, 1]\), where the dimensions are indexed by the infinite set of inputs \([0, 1]\).

Regression (perfect samples)

We can define this as drawing from the multivariate gaussian distribution, i.e. “discretizing” the gaussian process.

\[ \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \sim \mathcal{N}( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} ) \]

where \(\mu_i = \text{mean}(X_i)\) and \(\Sigma_{ij} = k(X_i, X_j)\).

We can then get the conditional distribution: \[ p(y_2 | X_2, y_1, X_1) = \mathcal{N}(\mu_{2|1}, \Sigma_{2|1}) \\ \mu_{2|1} = \mu_2 + \Sigma_{21} \Sigma_{11}^{-1} (y_1 - \mu_1) \\ \Sigma_{2|1} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \]

Regression with noisy samples

As above, but we model observation noise by adding a variance term to the observed block: \[ \Sigma_{11} = k(X_1, X_1) + \sigma_\epsilon^2 I \]
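Both regression cases can be sketched in numpy; the RBF kernel, its length scale, and the zero-mean assumption are illustrative choices, and `noise=0` recovers the perfect-sample case.

```python
import numpy as np

def rbf(a, b, length=1.0):
    """Squared-exponential kernel k(x, x') over 1-d inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_posterior(X1, y1, X2, noise=0.0):
    """Condition a zero-mean GP on observations (X1, y1) and return the
    posterior mean and covariance at test points X2.

    `noise` is the observation variance sigma_eps^2 added to the observed
    block; noise=0 is the perfect-sample case."""
    K11 = rbf(X1, X1) + noise * np.eye(len(X1))
    K12 = rbf(X1, X2)
    K22 = rbf(X2, X2)
    solve = np.linalg.solve(K11, K12)  # K11^{-1} K12
    mu = solve.T @ y1                  # Sigma_21 Sigma_11^{-1} y1
    cov = K22 - K12.T @ solve          # Sigma_22 - Sigma_21 Sigma_11^{-1} Sigma_12
    return mu, cov
```

With no noise the posterior interpolates the observations exactly (zero variance at the training points); with noise the posterior keeps some uncertainty there.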

L2 norm

\[ \sqrt{\sum_i x_i^2} \]

Standard deviation/variance

\[ \text{variance} = \sigma^2 = \frac{1}{n} \sum_i (x_i - \mu)^2 \\ \text{standard deviation} = \sigma \]
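A tiny worked example of the \(1/n\) (population) form above; sample variance would divide by \(n - 1\) instead.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mu = x.mean()                        # 5.0
variance = ((x - mu) ** 2).mean()    # population variance (1/n), = 4.0
std = np.sqrt(variance)              # = 2.0
```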