Sparse autoencoders depend too much on theories

N.B.: The below isn’t a holistic review of sparse autoencoder work, but instead focuses on one specific lens through which I think things don’t bode well. In general, I’m pretty happy with the progress being made in the field.


Henri Victor Regnault was a prominent empiricist credited with advances in the measurement of temperature. There were two common types of thermometer at the time, mercury and air, and Regnault set out to discover which was best.

He did this by creating several instances of each thermometer type, but with varying attributes: different types of glass, different amounts of pressure inside the thermometer, and so on. The results showed that the different air thermometers agreed with each other much more closely than the mercury thermometers did. Air thermometers were robust to variation, and thus became the preferred choice for precision temperature measurement.

More importantly, this work also moved thermometry from producing thermoscopes, which only provide relative measurements of heat, to thermometers, which provide a common scale for temperature measurement.


Inventing Temperature by Hasok Chang provides an excellent narrative for what Regnault achieved here.

Thermometry, at the time, was a bit stuck. Researchers had theories about how heat was supposed to work, and based their experiments on those theories. For example, one theory held that heat relationships were linear: if you identified two fixed points (in this case, the freezing and boiling points of water), you could assume that the midpoint between them was equally far from either. This led to experiments where one would mix equal parts of 0°C and 100°C water and evaluate whether thermometers read 50°C.

This led to some confused experiments, and generally got researchers running around in circles. The key problem was a circular dependency between theory and empirics: your experiment is meant to validate your theory, but which experiment you choose to run is decided by your theory.

Regnault wasn’t able to sidestep the circularity, but he was able to tighten the circle: the only theoretical notion he depended on was comparability between thermometers, a trivial theory (the book has a lot more to say on this topic; I recommend reading it if curious). This allowed Regnault to stay in the realm of empirics. Of course, empirical results were still being judged against other empirical results. The circle wasn’t gone, but it was tightened, and this is what moved the field forwards.

(N.B.: I’ve not spent much time reading philosophy of science, I’ve probably messed up some details. Again, I recommend reading the book directly for the full story.)


A lot of interpretability work depends on a bunch of theories about how neural networks work. Take, for example, sparse autoencoders.

Sparse autoencoders aim to decompose the activations of \(n\) neurons into \(k\) human-interpretable features, where \(n \ll k\), by learning a sparse linear mapping between the two. This relies on the theory of superposition: that each neuron isn’t an isolated feature, but instead contains parts of several different features.
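To make the setup concrete, here is a minimal sketch of that idea in PyTorch. This is my own toy formulation, not the architecture from any particular paper: the ReLU encoder, the L1 penalty, and the specific sizes and coefficient are all illustrative choices.

```python
# Toy sparse autoencoder: decompose n-dimensional activations into k >> n
# features, with an L1 penalty encouraging only a few features to fire at once.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, n_neurons: int, k_features: int):
        super().__init__()
        self.encoder = nn.Linear(n_neurons, k_features)
        self.decoder = nn.Linear(k_features, n_neurons)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse, non-negative feature activations
        reconstruction = self.decoder(features)           # map back to neuron space
        return reconstruction, features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    # Reconstruct the original activations while keeping the feature vector sparse.
    mse = ((reconstruction - activations) ** 2).mean()
    return mse + l1_coeff * features.abs().mean()

sae = SparseAutoencoder(n_neurons=768, k_features=16384)  # n much smaller than k (example sizes)
```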

What’s more, the sparse autoencoder work relies on a particularly strong interpretation of superposition that also assumes:

  - Linearity: features are represented as linear directions in activation space.
  - Isolation: each feature is interpretable on its own, in isolation from the others.1

These two theories may not be correct, or at least may not be correct all of the time. It’s possible that meaningful concepts are represented non-linearly, and whenever they’re pulled down into a linear representation, it’s a subtly different definition. (As a toy example, you can imagine a non-linear “truth” concept, which becomes linear only when it becomes “truth from the perspective of X”.)

Features may also not be interpretable in isolation. For example, we usually think of features as flags: is this token in parentheses, is this sentence about Canada, etc. But it’s possible that the model has things closer to “CPU registers” than “features”, where the meaning of each feature/register thing depends on the context.

(A subtle point here: say you have some directions used as context “features” and some used as context-dependent “registers”. Then you can actually find directions that correspond to context-independent features by taking the sum of a context feature direction and a register direction (and fitting some threshold). I would still argue that this is a “worse interpretation” (whatever the hell that means!) of what the model is doing.)
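Here is a toy numerical version of that construction, with made-up directions and threshold, just to show the mechanics: projecting onto the sum of a context direction and a register direction, then thresholding, fires only when both are active.

```python
# Toy illustration: assume activations along each direction are roughly 0 or 1.
import numpy as np

c = np.array([1.0, 0.0])  # "context" direction: flags which context we're in
r = np.array([0.0, 1.0])  # "register" direction: its meaning depends on the context

def context_independent_feature(x, threshold=1.5):
    # Fires only when both the context flag and the register are active.
    return float(x @ (c + r)) > threshold

print(context_independent_feature(c + r))  # True: right context, register set
print(context_independent_feature(r))      # False: register set, but wrong context
```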

The point here isn’t to tear down these specific assumptions, but to illustrate how completely your interpretability technique2 can depend on your theory of how neural networks work! The concern is that we’ll continue to see somewhat good results, but get stuck at a point where we can’t really claim that we understand our models because the underlying assumptions aren’t true.


I’ll end on a counter-example, an area of interpretability that relies less on theories: patching.

Patching involves taking activations from one forward pass of a model and using them to override the activations of a different forward pass. For example, you can patch the activations of “the city of Paris is in” over to the activations of “the city of Rome is in” to see what parts of the network are responsible for predicting “Italy”.

Patching relies on a trivially true aspect of neural networks: computational dependency. Layers only depend on previous layers. We can slice the network into pieces, and determine exactly what pieces rely on what other pieces. We can then fiddle with nodes in this graph to figure out what roles they play.
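Here’s a rough sketch of what a single patch looks like in code, using GPT-2 through Hugging Face transformers and PyTorch forward hooks. The layer index, the choice to patch only the final token position, and the `model.transformer.h` module path are my own assumptions for this toy example, not part of the technique itself.

```python
# Patch one layer's activations from a source prompt into a destination prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

source_ids = tok("The city of Paris is in", return_tensors="pt").input_ids
dest_ids = tok("The city of Rome is in", return_tensors="pt").input_ids
layer = 6  # arbitrary choice of transformer block to patch

# 1. Run the source prompt and record the chosen block's output.
stored = {}
def save_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    stored["act"] = hidden.detach()

handle = model.transformer.h[layer].register_forward_hook(save_hook)
with torch.no_grad():
    model(source_ids)
handle.remove()

# 2. Run the destination prompt, overriding that block's output at the final
#    token position with the stored source activations.
def patch_hook(module, inputs, output):
    hidden = (output[0] if isinstance(output, tuple) else output).clone()
    hidden[:, -1, :] = stored["act"][:, -1, :]
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer].register_forward_hook(patch_hook)
with torch.no_grad():
    logits = model(dest_ids).logits
handle.remove()

# If the top prediction moves away from " Italy", this layer (at this position)
# carries the country information.
print(tok.decode(logits[0, -1].argmax().item()))
```

Sweeping this kind of patch over layers and token positions, and seeing where the prediction flips, is how one maps out which pieces of the network matter for a given behaviour.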

This paints a rosy picture of patching-based techniques! We’re close to the true conceptual ground of neural networks, we’ve made no assumptions about how they work, and we’re still able to make statements about their behaviour. Unfortunately, it’s very unclear how far patching can take us towards a complete understanding of neural networks.


  1. Technically, the sparse autoencoder work only depends on isolation because of its evaluation methods, which inspect the interpretability of each feature in isolation.↩︎

  2. A similar thing can also be said for representation engineering and its reliance on linearity, although I think the goals of representation engineering are less ambitious than those of sparse autoencoders.↩︎