Variational Autoencoders (VAE)

Main concept

Why it’s hard

How it works

Can VAE disentanglement work without supervision?

This paper claims that it doesn’t. The individual dimensions of the sampled latent \(z\) are uncorrelated with each other, but the means, which are what is actually used as the representation, are correlated across the dataset.

We analyze our experimental results and challenge common beliefs in unsupervised disentanglement learning: (i) While all considered methods prove effective at ensuring that the individual dimensions of the aggregated posterior (which is sampled) are not correlated, we observe that the dimensions of the representation (which is taken to be the mean) are correlated. (ii) We do not find any evidence that the considered models can be used to reliably learn disentangled representations in an unsupervised manner as random seeds and hyperparameters seem to matter more than the model choice.

Concept Whitening


Concept bottleneck models (CBM)

Fuzzy vs. binary

\(g\) can have a step activation function (binary concepts) or a sigmoidal activation function (fuzzy concepts).
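A minimal sketch of the two options, assuming \(g\) produces one logit per concept. The function name `concept_activation` and the threshold at zero are illustrative choices, not taken from any CBM implementation.

```python
import math


def concept_activation(logit: float, fuzzy: bool = True) -> float:
    """Map a concept logit to a bottleneck activation.

    fuzzy=True  -> sigmoid, a value in (0, 1)
    fuzzy=False -> step function, exactly 0.0 or 1.0 (threshold at 0 is an assumption)
    """
    if fuzzy:
        return 1.0 / (1.0 + math.exp(-logit))
    return 1.0 if logit >= 0.0 else 0.0


# Hypothetical concept logits for a single input.
logits = [-2.0, 0.0, 3.0]
fuzzy_concepts = [concept_activation(z, fuzzy=True) for z in logits]
binary_concepts = [concept_activation(z, fuzzy=False) for z in logits]
```

The fuzzy version keeps how confident the concept predictor is, while the binary version throws that away and passes only a hard decision to the label predictor.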


We are unlikely to have complete concept training data, which impacts the performance of CBMs. To remedy this, we add extra unlabelled capacity to the bottlenecks for the Hybrid CBMs to store additional information.
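One way to picture the extra capacity: the concept loss only touches the labelled units, so the remaining units are free to store whatever helps the downstream task. This is a sketch under that assumption; `concept_loss` and the layout (labelled units first, unlabelled units last) are hypothetical.

```python
import math


def concept_loss(bottleneck, concept_labels):
    """Mean binary cross-entropy over the supervised units only.

    `bottleneck` holds sigmoid activations for every unit. The first
    len(concept_labels) units are treated as labelled concepts; the
    remaining units are unlabelled and receive no concept supervision.
    """
    k = len(concept_labels)
    eps = 1e-12  # avoids log(0)
    total = 0.0
    for p, y in zip(bottleneck[:k], concept_labels):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / k


labels = [1, 0]  # two labelled concepts
loss = concept_loss([0.9, 0.2, 0.7, 0.4], labels)
# Changing only the two unlabelled units leaves the concept loss unchanged.
loss_other_extra = concept_loss([0.9, 0.2, 0.01, 0.99], labels)
```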

Concept embedding models (CEM)


Post-hoc Concept Bottleneck Models


Promises and Pitfalls of Black-Box Concept Learning Models

This paper finds that concept learning models create “impure” concepts that also encode other concepts. This is true for:

Measure of “concept purity”

Adversarial Examples Are Not Bugs, They Are Features

This paper claims to separate “robust” and “non-robust” features. They do this by training a classifier as a set of feature detectors + a fully-connected layer. They distinguish robust feature detectors by training an adversary that applies a limited perturbation to the input.
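To make “limited perturbation” concrete, here is a toy sketch: a single sign-gradient step inside an L-infinity ball, applied to a linear classifier. This is a simpler stand-in and not the paper's procedure; the weights, input and `epsilon` are made up.

```python
def sign(v: float) -> float:
    """Return +1.0, -1.0 or 0.0 depending on the sign of v."""
    if v > 0:
        return 1.0
    if v < 0:
        return -1.0
    return 0.0


def score(weights, bias, x):
    """Linear classifier score; the predicted label is the sign of the score."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def perturb(weights, x, label, epsilon):
    """Move each input coordinate by at most epsilon to reduce label * score.

    For a linear score the gradient with respect to x is just `weights`,
    so the worst-case bounded step is -epsilon * label * sign(w_i).
    """
    return [xi - epsilon * label * sign(w) for xi, w in zip(x, weights)]


weights, bias = [2.0, -1.0, 0.1], 0.0  # hypothetical classifier
x, label = [0.3, 0.2, 1.0], 1          # hypothetical input with true label +1

clean = score(weights, bias, x)
x_adv = perturb(weights, x, label, epsilon=0.2)
adv = score(weights, bias, x_adv)      # the score changes sign under a small perturbation
```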

By doing this, they are able to create two datasets:

What are adversarial examples then?

The most interesting conclusion of this paper is that non-robust generalizable concepts exist. This is true across datasets and across model architectures. This means that what we think of as “adversarial examples” is a very human perspective. This is strong evidence against the natural abstraction hypothesis.

Interpretability Beyond Classification Output: Semantic Bottleneck Networks

Paper. Applies CBMs to CNNs, and finds that they have good performance. However, no validation of the concepts is discussed.

Meaningfully Debugging Model Mistakes using Conceptual Counterfactual Explanations

Paper. Allows us to counterfactually check why a model was wrong.

The Ground Truth Problem (Or, Why Evaluating Interpretability Methods Is Hard)

Post. Argues that our “ground truth” for interpretability is what humans think is interpretable. However, this becomes a problem when interpretability methods come up with different interpretations that are all interpretable by humans. Which one is correct? Do humans have a robust enough understanding of reality to choose between interpretations? What do we mean by “interpretability” if we don’t?

Opinion: This makes me think that interpretability is all about coming up with a common language between humans & networks. For CBMs, we define the language quite rigorously via concept datasets. For mechanistic interpretability, we attach our human labels to parts of the network.

Additionally, this is really closely linked to the Eliciting Latent Knowledge work. ELK is hard because we necessarily don’t have any training data to identify cases where there is some hidden knowledge. There’s no ground truth for interpretability! So, how can we solve a problem without ground truth?

Neural Tangent Kernels


N.B.: I didn’t understand most, if any, of this paper.

Discovering Latent Knowledge in Language Models Without Supervision



Introduces Contrast-Consistent Search (CCS): Learns a linear projection of hidden states that is consistent across negations.
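A rough sketch of the CCS objective for one statement/negation pair: a consistency term asking the two probabilities to sum to one, plus a confidence term that rules out the degenerate answer of 0.5 for both. The normalisation of hidden states described in the paper is omitted, and the hidden states and probe weights below are made up.

```python
import math


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def probe(hidden, weights, bias):
    """Linear projection of a hidden state followed by a sigmoid."""
    return sigmoid(sum(w * h for w, h in zip(weights, hidden)) + bias)


def ccs_loss(h_pos, h_neg, weights, bias):
    """Consistency + confidence loss for one (statement, negation) pair."""
    p_pos = probe(h_pos, weights, bias)
    p_neg = probe(h_neg, weights, bias)
    consistency = (p_pos - (1.0 - p_neg)) ** 2  # p(x+) should equal 1 - p(x-)
    confidence = min(p_pos, p_neg) ** 2         # penalises p(x+) = p(x-) = 0.5
    return consistency + confidence


# Hypothetical 2-d hidden states for "X is true" / "X is false".
h_pos, h_neg = [1.0, 0.0], [-1.0, 0.0]

good_loss = ccs_loss(h_pos, h_neg, weights=[4.0, 0.0], bias=0.0)
degenerate_loss = ccs_loss(h_pos, h_neg, weights=[0.0, 0.0], bias=0.0)
```

The all-zero probe is perfectly consistent but outputs 0.5 everywhere, so only the confidence term separates it from a probe that actually distinguishes the pair.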


The Engineer’s Interpretability Sequence

2: What is “Interpretability”?


scasper’s definition:

Any method by which something novel about a system can be better predicted or described.

Opinion: This definition feels too broad. To me, interpretability feels like a very human technique: we’re trying to find a translation between the model’s concepts/knowledge/algorithms and our concepts/knowledge/algorithms. This definition would exclude scasper’s example of understanding the relationship between dataset bias and model bias, but I think that’s OK as I wouldn’t class that as interpretability.

3: Broad Critiques of Interpretability Research

Post. The biggest critique of interpretability is that it is hard to evaluate. Evaluating by intuition is hard to repeat, and most numeric methods are weak.

We can improve on evaluation by evaluating interpretability techniques on performance on downstream tasks. Namely, scasper talks about his recent paper on discovering trojans using interpretability techniques.


Controlling what a system does by guiding edits to it. This could involve cleanly implanting trojans, removing trojans, or making the network do other novel things via manual changes or targeted forms of fine-tuning.

Would this make finetuning a language model a form of interpretability?

Also: Another downstream task is using interpretability to constrain the behaviour of models. Can we create a benchmark for this? Is this already kind of handled by RLHF techniques?

4: A Spotlight on Feature Attribution/Saliency


Language Models Represent Space and Time

Paper. Finds linear representations of space & time in Llama 2 activations. Fits linear probes against DBpedia datasets.
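For reference, a linear probe is just a linear regression from activations to the target quantity. The sketch below fits one with plain gradient descent on synthetic data; the 2-d “activations”, the targets and the helper `fit_linear_probe` are all made up for illustration and are not the paper's setup.

```python
def fit_linear_probe(activations, targets, lr=0.1, steps=2000):
    """Fit target ~ w . activation + b by gradient descent on mean squared error."""
    n = len(activations)
    d = len(activations[0])
    weights = [0.0] * d
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * d
        grad_b = 0.0
        for a, y in zip(activations, targets):
            err = sum(w * ai for w, ai in zip(weights, a)) + bias - y
            for j in range(d):
                grad_w[j] += 2.0 * err * a[j] / n
            grad_b += 2.0 * err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


# Synthetic data: the target is exactly 2 * a0 - 1 * a1 + 0.5.
activations = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [0.5, 2.5, -0.5, 1.5]

weights, bias = fit_linear_probe(activations, targets)
```

If a quantity such as latitude or year can be read off this way with low error on held-out examples, that is evidence it is linearly represented in the activations.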