# Interpretability

## Variational Autoencoders (VAE)

### Main concept

• We have “observed variables” $$x$$ and “latent variables” $$z$$.
• We want to build a model for $$z$$ from $$x$$.

### Why it’s hard

• It’s hard to discover the latent variables (“underlying causes”) of some observed data, i.e. $$p(z|x)$$:
• $$p(z|x) = \frac{p(x|z) p(z)}{p(x)}$$
• But, $$p(x) = \int{p(x|z)p(z)dz}$$
• The integral over every possible $$z$$ makes $$p(x)$$ intractable, i.e. too hard to compute, so $$p(z|x)$$ is intractable too.

### How it works

• We approximate $$p(z|x)$$ using Bayesian variational inference.
• Instead of computing $$p(z|x)$$ exactly, we use $$q(z|x)$$, chosen from a tractable family (e.g. a multivariate Gaussian).
• We minimise the KL divergence between $$q$$ and $$p$$:
• $$\min D_{KL}(q(z|x) || p(z|x))$$
• This is equivalent to maximising the evidence lower bound (ELBO):
• $$\max E_{q(z|x)}[\log(p(x|z))] - D_{KL}(q(z|x) || p(z))$$
• $$E_{q(z|x)}[\log(p(x|z))]$$ is the reconstruction likelihood.
• Given an image ($$x$$), sample from the encoder ($$E_{q(z|x)}$$), and maximise the probability the decoder arrives at the original image ($$\log(p(x|z))$$).
• $$D_{KL}(q(z|x) || p(z))$$ keeps $$q(z|x)$$ close to our prior $$p(z)$$.
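
The step from the $$\min$$ objective to the $$\max$$ objective is easy to lose, so here is a short derivation using only the symbols above. Expand the KL divergence and apply Bayes' rule to $$p(z|x)$$:

$$D_{KL}(q(z|x) || p(z|x)) = E_{q(z|x)}[\log q(z|x) - \log p(z|x)]$$

$$= E_{q(z|x)}[\log q(z|x) - \log p(x|z) - \log p(z)] + \log p(x)$$

$$= \log p(x) - \left( E_{q(z|x)}[\log(p(x|z))] - D_{KL}(q(z|x) || p(z)) \right)$$

Since $$\log p(x)$$ does not depend on $$q$$, minimising the left-hand side is the same as maximising the bracketed term, and the intractable $$p(x)$$ never has to be computed.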

## Can VAE disentanglement work without supervision?

This paper claims it doesn’t work without supervision. While the distributions of individual latent parameters $$z$$ are independent from each other, the means are correlated across the dataset.

We analyze our experimental results and challenge common beliefs in unsupervised disentanglement learning: (i) While all considered methods prove effective at ensuring that the individual dimensions of the aggregated posterior (which is sampled) are not correlated, we observe that the dimensions of the representation (which is taken to be the mean) are correlated. (ii) We do not find any evidence that the considered models can be used to reliably learn disentangled representations in an unsupervised manner as random seeds and hyperparameters seem to matter more than the model choice.

## Concept Whitening

• Technique to improve interpretability of latent vectors.
• Given “concepts” $$c_i$$, and datasets $$X_i$$ that represent each concept.
• Take a latent vector $$z$$, and apply a whitening transformation.
• A whitening transform is only defined up to a rotation, so we use the datasets $$X_i$$ to pick the rotation that aligns each latent variable $$z_j$$ with a concept $$c_i$$.
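
A minimal NumPy sketch of why whitening leaves a rotation free. It shows only the whitening step and the rotation freedom, not how concept whitening chooses the rotation from the $$X_i$$. The mixing matrix, sample count, and the helper name `whiten` are made up for illustration.

```python
import numpy as np

def whiten(z, eps=1e-8):
    """ZCA-whiten the rows of z: zero mean, (approximately) identity covariance."""
    zc = z - z.mean(axis=0)
    cov = zc.T @ zc / len(z)
    eigvals, eigvecs = np.linalg.eigh(cov)
    w = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return zc @ w

rng = np.random.default_rng(0)
# Hypothetical correlated latents: independent noise pushed through a fixed mixing matrix.
mix = np.array([[2.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.5]])
z = rng.normal(size=(2000, 3)) @ mix

z_white = whiten(z)

# Any orthogonal matrix q keeps the data whitened, so the axes are not pinned down.
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
z_rotated = z_white @ q

cov_white = np.cov(z_white, rowvar=False, bias=True)
cov_rotated = np.cov(z_rotated, rowvar=False, bias=True)
```

Both covariance matrices come out as the identity, which is why extra information (here, the concept datasets) is needed to decide which rotation to keep.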

## Concept bottleneck models (CBM)

• Given pairs of $$(x, c, y) \in X \times C \times Y$$ where:
• $$x$$ is the input.
• $$y$$ is the output.
• $$c$$ is a list of labelled concepts.
• Learn $$g : X \to C$$ and $$f : C \to Y$$.
• Can learn $$f$$ and $$g$$ separately or together.

### Fuzzy vs. binary

$$g$$ can have a step activation function (binary concepts) or a sigmoidal activation function (fuzzy concepts).
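
A rough sketch of a CBM forward pass with either bottleneck activation. Both $$g$$ and $$f$$ are taken to be single linear layers with random weights, which is an assumption made to keep the example small. The names `cbm_forward`, `w_g` and `w_f` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cbm_forward(x, w_g, w_f, binary=False):
    """Return (concepts, output scores) for a linear g followed by a linear f."""
    concept_logits = x @ w_g                      # g : X -> C (before activation)
    if binary:
        c = (concept_logits > 0).astype(float)    # step activation
    else:
        c = sigmoid(concept_logits)               # sigmoidal activation
    y = c @ w_f                                   # f : C -> Y
    return c, y

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))       # 5 inputs with 8 features (made-up sizes)
w_g = rng.normal(size=(8, 3))     # 3 concepts
w_f = rng.normal(size=(3, 2))     # 2 output classes

c_fuzzy, y_fuzzy = cbm_forward(x, w_g, w_f, binary=False)
c_binary, y_binary = cbm_forward(x, w_g, w_f, binary=True)
```

In both cases $$f$$ only ever sees the concept vector, never $$x$$ directly, which is what makes the bottleneck inspectable.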

### Hybrid

We are unlikely to have complete concept training data, which impacts the performance of CBMs. To remedy this, we add extra unlabelled capacity to the bottlenecks for the Hybrid CBMs to store additional information.

## Concept embedding models (CEM)

• Like hybrid CBMs, but supports “intervening” with concept embeddings.
• Formulation:
• Encoder:
• $$h = \psi(x)$$
• Concept embeddings:
• $$c_i^+ = \phi_i^+(h)$$
• $$c_i^- = \phi_i^-(h)$$
• Probability of feature being present:
• $$p_i = s(c_i^+, c_i^-)$$
• Classifier:
• $$c_i = p_i c_i^+ + (1 - p_i)c_i^-$$
• $$c = c_1 || c_2 || ...$$
• $$y = f(c)$$
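
The formulation above can be sketched as a forward pass. This is an illustration under assumptions, not the reference implementation: $$\psi$$, $$\phi_i^\pm$$ and $$f$$ are single layers with random weights, $$s$$ is assumed to be a sigmoid over a shared linear score of the concatenated embeddings, and "intervening" is modelled as overwriting $$p_i$$. All names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cem_forward(x, params, interventions=None):
    """Return (p, c, y); `interventions` maps a concept index to a forced probability."""
    h = np.tanh(x @ params["psi"])                                      # h = psi(x)
    c_pos = np.tanh(np.einsum("nh,khm->nkm", h, params["phi_pos"]))     # c_i^+
    c_neg = np.tanh(np.einsum("nh,khm->nkm", h, params["phi_neg"]))     # c_i^-
    p = sigmoid(np.concatenate([c_pos, c_neg], axis=-1) @ params["s"])  # p_i
    if interventions:
        for i, value in interventions.items():
            p[:, i] = value                                             # overwrite p_i
    c = p[..., None] * c_pos + (1.0 - p[..., None]) * c_neg             # mix embeddings
    y = c.reshape(len(x), -1) @ params["f"]                             # y = f(c_1 || c_2 || ...)
    return p, c, y

rng = np.random.default_rng(0)
n, d_in, d_h, k, m, n_classes = 4, 6, 5, 3, 2, 2   # made-up sizes
params = {
    "psi": rng.normal(size=(d_in, d_h)),
    "phi_pos": rng.normal(size=(k, d_h, m)),
    "phi_neg": rng.normal(size=(k, d_h, m)),
    "s": rng.normal(size=(2 * m,)),
    "f": rng.normal(size=(k * m, n_classes)),
}
x = rng.normal(size=(n, d_in))

p, c, y = cem_forward(x, params)
# Intervene: assert that concept 0 is present for every input.
p_int, c_int, y_int = cem_forward(x, params, interventions={0: 1.0})
```

With $$p_0 = 1$$ the embedding for concept 0 collapses to $$c_0^+$$, while the other concepts are left untouched.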

## Post-hoc Concept Bottleneck Models

• Given an embedding function $$f : \mathcal{X} \to \mathbb{R}^d$$ (e.g. CLIP, penultimate layer of a pretrained network).
• Given $$N_C$$ concepts defined by datasets $$D_i \in [\mathcal{X} \times \mathbb{B}]$$.
• Fit a linear classifier to $$f(x)$$ to predict $$D_i$$ and use these as the concepts.
• You can use concept datasets that are separate from the training dataset, allowing you to use a breadth of concepts.
• You can use pre-trained models as this is a post-hoc interpretability process.
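
A small sketch of the "fit a linear classifier per concept" step. Random vectors stand in for the embeddings $$f(x)$$, and plain logistic regression trained by gradient descent stands in for whichever linear classifier is actually used. The data, sizes, and function names are all invented for the example.

```python
import numpy as np

def fit_linear_concept(emb, labels, lr=0.2, steps=500):
    """Logistic regression by gradient descent; returns (weights, bias)."""
    w = np.zeros(emb.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(emb @ w + b)))
        grad = p - labels
        w -= lr * emb.T @ grad / len(emb)
        b -= lr * grad.mean()
    return w, b

rng = np.random.default_rng(0)
n, d, n_concepts = 400, 16, 2
# Hypothetical concept directions in embedding space, normalised to unit length.
directions = rng.normal(size=(n_concepts, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
# Boolean concept labels, playing the role of the datasets D_i.
concept_labels = rng.integers(0, 2, size=(n, n_concepts)).astype(float)
# Stand-in for f(x): noise plus a shift along each concept's direction.
emb = rng.normal(size=(n, d)) + 6.0 * (concept_labels - 0.5) @ directions

probes = [fit_linear_concept(emb, concept_labels[:, i]) for i in range(n_concepts)]
# The concept scores form the bottleneck that a downstream predictor would consume.
concept_scores = np.stack([emb @ w + b for w, b in probes], axis=1)
accuracy = ((concept_scores > 0) == (concept_labels == 1)).mean(axis=0)
```

Because the probes only read the embeddings, the backbone that produced them never needs retraining.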

## Promises and Pitfalls of Black-Box Concept Learning Models

This paper finds that concept learning models create “impure” concepts that also encode other concepts. This is true for:

• Sequentially trained “concept models” and “prediction models”.
• Models with additional unsupervised latent variables.
• Models that decorrelate latent variables, e.g. concept whitening models.

### Measure of “concept purity”

• Given a latent concept $$z_i$$, see if you can predict concept $$z_j$$.
• Measure AUC: expect ~1 when $$i=j$$, and ~0.5 when $$i \neq j$$
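
The purity check can be sketched as a matrix of AUCs. As a simplification, the raw latent $$z_i$$ is used directly as the score for concept $$j$$ instead of training a predictor on it. The synthetic "pure" latents and the helper `auc` are assumptions for illustration.

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U); assumes scores have no ties."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
n, k = 4000, 3
concepts = rng.integers(0, 2, size=(n, k))             # independent binary concepts
latents = concepts + 0.1 * rng.normal(size=(n, k))     # each latent encodes one concept

# purity[i, j]: how well latent i predicts concept j.
purity = np.array([[auc(latents[:, i], concepts[:, j]) for j in range(k)]
                   for i in range(k)])
```

For these deliberately pure latents the diagonal sits near 1 and the off-diagonal near 0.5. An impure model would show off-diagonal entries well away from 0.5.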

## Adversarial Examples Are Not Bugs, They Are Features

This paper claims to separate “robust” and “non-robust” features. They do this by training a classifier as a set of feature detectors + a fully-connected layer. They distinguish robust feature detectors by training an adversary that applies a limited perturbation to the input.

By doing this, they are able to create two datasets:

• Where non-robust features are removed but trained models still work on the original dataset - as models have learnt robust features.
• Where ground-truth classes are nonsensical but trained models still work on the original dataset - as models have learnt the non-robust features, which generalize to other datasets.
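
A toy illustration of the robust vs. non-robust distinction. It does not reproduce the paper's dataset construction. It assumes linear features $$w \cdot x$$, labels in $$\{-1, +1\}$$, and an adversary limited to an $$L_\infty$$ perturbation of size `eps`; under those assumptions the worst case lowers $$y \, (w \cdot x)$$ by exactly `eps * ||w||_1`. All data and names are invented.

```python
import numpy as np

def usefulness(w, x, y):
    """Average agreement between the feature w.x and the label y."""
    return float(np.mean(y * (x @ w)))

def robust_usefulness(w, x, y, eps):
    """Same, after a worst-case L-infinity perturbation of size eps on every input."""
    return usefulness(w, x, y) - eps * float(np.abs(w).sum())

rng = np.random.default_rng(0)
n, d = 2000, 21
y = rng.choice([-1.0, 1.0], size=n)
x = 0.1 * rng.normal(size=(n, d))
x[:, 0] += y                      # one strongly predictive coordinate
x[:, 1:] += 0.05 * y[:, None]     # many weakly predictive coordinates

w_strong = np.zeros(d)
w_strong[0] = 1.0                 # feature reading the strong coordinate
w_weak = np.zeros(d)
w_weak[1:] = 1.0 / (d - 1)        # feature averaging the weak coordinates

eps = 0.25
strong_clean = usefulness(w_strong, x, y)
strong_robust = robust_usefulness(w_strong, x, y, eps)
weak_clean = usefulness(w_weak, x, y)
weak_robust = robust_usefulness(w_weak, x, y, eps)
```

The weak feature is genuinely predictive on clean data (`weak_clean > 0`) yet flips sign under the bounded perturbation, which is the sense in which a feature can be useful but non-robust.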

### What are adversarial examples then?

The most interesting conclusion of this paper is that non-robust generalizable concepts exist. This is true across datasets and across model architectures. This means that what we think of as “adversarial examples” is a very human perspective. This is strong evidence against the natural abstraction hypothesis.

## Interpretability Beyond Classification Output: Semantic Bottleneck Networks

Paper. Applies CBMs to CNNs, and finds that they have good performance. However, no validation of the concepts is discussed.

## Meaningfully Debugging Model Mistakes using Conceptual Counterfactual Explanations

Paper. Allows us to counterfactually check why a model was wrong.

• Given a pretrained model.
• Given a concept dataset containing both positive and negative examples (can be separate from the dataset used to train the model).
• Given a late hidden layer in the model, fit an SVM for each concept.
• Take an example where the model is incorrect.
• Add the SVM vectors to the hidden layer until the model is correct.
• The weights are regularized to encourage sparsity in the added concepts.
• The weights are capped such that (1) we don’t add concepts that already exist and (2) we don’t remove concepts that don’t exist.
• This works well: It can identify correlations in datasets where they have been added intentionally, and can even identify bias in models where skin colour affects medical diagnoses.
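
The search over concept weights can be sketched roughly as follows. This is a simplified stand-in, not the paper's procedure: the remainder of the model after the hidden layer is assumed to be a logistic head `(v, b)`, the concept vectors are given directly rather than fitted with SVMs, sparsity comes from an L1 proximal step, and the caps are plain per-concept bounds `lo` and `hi`. Every name and number is hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def conceptual_counterfactual(h, concept_vectors, v, b, lo, hi,
                              lam=0.1, lr=0.5, steps=200):
    """Find sparse, bounded concept weights that push the head towards class 1."""
    w = np.zeros(len(concept_vectors))
    for _ in range(steps):
        p = sigmoid(v @ (h + w @ concept_vectors) + b)
        grad = (p - 1.0) * (concept_vectors @ v)                 # gradient of -log p
        w = w - lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)   # L1 shrinkage (sparsity)
        w = np.clip(w, lo, hi)                                   # cap the weights
    return w

# A hidden state that the head misclassifies (the correct class is 1).
h = np.zeros(4)
v = np.array([2.0, 0.1, 0.0, 0.0])
b = -1.0
concept_vectors = np.eye(4)[:3]       # three made-up concept directions
lo = np.full(3, -3.0)
hi = np.full(3, 3.0)

logit_before = v @ h + b
weights = conceptual_counterfactual(h, concept_vectors, v, b, lo, hi)
logit_after = v @ (h + weights @ concept_vectors) + b
```

Here only the first concept ends up with a non-zero weight, so the explanation reads as "the prediction would have been correct if concept 0 had been more present".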

## The Ground Truth Problem (Or, Why Evaluating Interpretability Methods Is Hard)

Post. Argues that our “ground truth” for interpretability is what humans think is interpretable. However, this becomes a problem when interpretability methods come up with different interpretations that are all interpretable by humans. Which one is correct? Do humans have a robust enough understanding of reality to choose between interpretations? What do we mean by “interpretability” if we don’t?

Opinion: This makes me think that interpretability is all about coming up with a common language between humans & networks. For CBMs, we define the language quite rigorously via concept datasets. For mechanistic interpretability, we attach our human labels to parts of the network.

Additionally, this is really closely linked to the Eliciting Latent Knowledge work. ELK is hard because we necessarily don’t have any training data to identify cases where there is some hidden knowledge. There’s no ground truth for interpretability! So, how can we solve a problem without ground truth?

## Neural Tangent Kernels

Post.

• In the infinite-width limit, NNs are equivalent to Gaussian processes.
• NNs follow the kernel gradient of the functional cost wrt. a Neural Tangent Kernel.
• This framing yields interesting results:
• Training convergence is related to the positive-definiteness of the NTK.
• Convergence is fastest along the largest principal components of the NTK, suggesting theoretical approaches to early stopping.
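
A concrete finite-width example may make the kernel less abstract. The sketch below computes the empirical NTK of a tiny two-layer tanh network as the Gram matrix of parameter gradients, $$\Theta(x, x') = \nabla_\theta f(x) \cdot \nabla_\theta f(x')$$. The architecture, sizes, and names are assumptions. The decay line assumes squared loss and the linearised (kernel) regime, where the residual along each eigenvector shrinks like $$e^{-\lambda t}$$.

```python
import numpy as np

def jacobian(x, w, a):
    """Gradient of f(x) = a . tanh(w x) / sqrt(m) w.r.t. all parameters, one row per input."""
    m = len(a)
    act = np.tanh(x @ w.T)                                              # (n, m)
    d_a = act / np.sqrt(m)                                              # df/da
    d_w = (a * (1.0 - act ** 2))[:, :, None] * x[:, None, :] / np.sqrt(m)  # df/dw
    return np.concatenate([d_a, d_w.reshape(len(x), -1)], axis=1)

rng = np.random.default_rng(0)
n, d, m = 6, 3, 50                  # made-up sizes: 6 inputs, 3 features, width 50
x = rng.normal(size=(n, d))
w = rng.normal(size=(m, d))
a = rng.normal(size=m)

jac = jacobian(x, w, a)
ntk = jac @ jac.T                   # empirical NTK on these 6 inputs
eigvals = np.linalg.eigvalsh(ntk)   # ascending order

# Fraction of the initial residual left along each eigenvector after time t.
t = 1.0
residual_fraction = np.exp(-eigvals * t)
```

A strictly positive smallest eigenvalue means every residual direction eventually decays, and the directions with the largest eigenvalues decay first.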

N.B.: I didn’t understand most, if any, of this paper.

## Discovering Latent Knowledge in Language Models Without Supervision

### Methodology

Introduces Contrast-Consistent Search (CCS): Learns a linear projection of hidden states that is consistent across negations.

• Given a question dataset $$q$$, e.g. “Are cats mammals?”.
• Given an LLM feature map $$\phi$$ that maps a piece of text to the model’s hidden state.
• Create answers $$x^+, x^-$$ that are “Are cats mammals? Yes” and “Are cats mammals? No” respectively.
• Learn a linear function $$p_\theta$$ such that:
• It is logically consistent: $$p_\theta(\phi(x^+)) = 1 - p_\theta(\phi(x^-))$$
• It is confident: $$p_\theta(\phi(x^+)) \neq p_\theta(\phi(x^-))$$
• Detail: $$\phi(x^+)$$ and $$\phi(x^-)$$ are normalized separately, so we don’t just find the embedding for “was there a yes or a no at the end of the sentence”.
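
The two conditions can be turned into a loss, sketched below. The exact forms are assumptions on top of the bullets: the probe is a linear map followed by a sigmoid, consistency is the squared gap between $$p^+$$ and $$1 - p^-$$, and confidence is penalised as $$\min(p^+, p^-)^2$$, which is largest when both sit at 0.5. Random arrays stand in for $$\phi(x^+)$$ and $$\phi(x^-)$$.

```python
import numpy as np

def normalize(h):
    """Normalise one set of hidden states on its own (zero mean, unit variance per dimension)."""
    return (h - h.mean(axis=0)) / (h.std(axis=0) + 1e-8)

def probe(h, theta, bias):
    """Linear projection of hidden states squashed into a probability."""
    return 1.0 / (1.0 + np.exp(-(h @ theta + bias)))

def ccs_loss(p_pos, p_neg):
    consistency = (p_pos - (1.0 - p_neg)) ** 2      # p+ should equal 1 - p-
    confidence = np.minimum(p_pos, p_neg) ** 2      # discourages p+ = p- = 0.5
    return float(np.mean(consistency + confidence))

rng = np.random.default_rng(0)
n, d = 100, 8
# Stand-ins for phi(x+) and phi(x-); the +2 offset mimics the "Yes"/"No" suffix signal.
h_pos = normalize(rng.normal(size=(n, d)) + 2.0)
h_neg = normalize(rng.normal(size=(n, d)) - 2.0)

theta = rng.normal(size=d)          # untrained probe, for illustration only
loss = ccs_loss(probe(h_pos, theta, 0.0), probe(h_neg, theta, 0.0))

loss_ideal = ccs_loss(np.array([1.0]), np.array([0.0]))
loss_hedged = ccs_loss(np.array([0.5]), np.array([0.5]))
```

Normalising the two sets separately removes the constant offset between them, so the probe cannot win simply by detecting which suffix was appended.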

### Results

• 4% better than zero shot, with less variance.
• Zero shot takes the question $$q$$ and returns the answer with the highest probability.
• Not affected by attempts to make the model output incorrect information, while zero shot is.
• Truth is salient: it can be found by taking the top principal component in a slightly modified representation space.

## The Engineer’s Interpretability Sequence

### 2: What is “Interpretability”?

Post.

scasper’s definition:

Any method by which something novel about a system can be better predicted or described.

Opinion: This definition feels too broad. To me, interpretability feels like a very human technique: we’re trying to find a translation between the model’s concepts/knowledge/algorithms and our concepts/knowledge/algorithms. This definition would exclude scasper’s example of understanding the relationship between dataset bias and model bias, but I think that’s OK as I wouldn’t class that as interpretability.

### 3: Broad Critiques of Interpretability Research

Post. The biggest critique of interpretability is that it is hard to evaluate. Evaluating by intuition is hard to repeat, and most numeric methods are weak.

We can improve on evaluation by evaluating interpretability techniques on performance on downstream tasks. Namely, scasper talks about his recent paper on discovering trojans using interpretability techniques.

Opinion:

Controlling what a system does by guiding edits to it. This could involve cleanly implanting trojans, removing trojans, or making the network do other novel things via manual changes or targeted forms of fine-tuning.

Would this make finetuning a language model a form of interpretability?

Also: Another downstream task is using interpretability to constrain the behaviour of models. Can we create a benchmark for this? Is this already kind of handled by RLHF techniques?

Post.

## Language Models Represent Space and Time

Paper. Finds linear representations of space & time in Llama2 activations. Fits linear probes against DBpedia datasets.
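
For reference, a linear probe is just a regression from activations to the quantity of interest. The sketch below uses ridge regression on synthetic data. The random "activations", the 2-D "coordinates", and all names are stand-ins, not Llama2 activations or DBpedia entries.

```python
import numpy as np

def fit_probe(acts, targets, ridge=1e-3):
    """Closed-form ridge regression with a bias column; returns the weight matrix."""
    x = np.hstack([acts, np.ones((len(acts), 1))])
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ targets)

def predict(acts, weights):
    return np.hstack([acts, np.ones((len(acts), 1))]) @ weights

def r2(pred, targets):
    ss_res = ((targets - pred) ** 2).sum()
    ss_tot = ((targets - targets.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n, d = 600, 32
coords = rng.uniform(-1.0, 1.0, size=(n, 2))    # e.g. (latitude, longitude), rescaled
mix = rng.normal(size=(2, d))
# Synthetic activations that encode the coordinates linearly, plus noise.
acts = coords @ mix + 0.1 * rng.normal(size=(n, d))

train, test = slice(0, 400), slice(400, None)
weights = fit_probe(acts[train], coords[train])
score = r2(predict(acts[test], weights), coords[test])
```

A high held-out $$R^2$$ is the kind of evidence used to argue that a quantity is linearly represented, though here it holds by construction.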