Preventing steganography in latent vectors

N.B.: A more complete write up is here on LessWrong.

This work is done in collaboration with Hoagy Cunningham.

There is a lot of prior research into manually assigning labels to latent dimensions (for example, concept whitening). However, in the extremes of optimization, neural networks may be incentivized to encode additional features into these labelled dimensions - especially if the size of the latent space is constrained. This can be hard to detect: the latent space will not significantly diverge from the labelling, but still conceal additional information.

What we end up with is steganography in the latent space. A latent vector seems to represent human-understandable features, but the network is able to extract some information hidden in the vector.

This research investigates preventing this steganography.

Eliciting steganography

We are able to design toy examples that elicit this behaviour. We train an autoencoder to reconstruct \(2n\)-digit binary strings using a latent space of \(\mathbb{R}^n\). On top of the reconstruction loss, we introduce a representation loss. This incentivizes the model to use the \(n\)th latent variable to represent the \(n\)th binary digit. The \((n+1)\)th to \((2n)\)th binary digits have no representation cost.

vector = random_binary_vector()  # shape=2N
latent_vector = encoder(vector)  # shape=N
reconstructed_vector = decoder(latent_vector)  # shape=2N

reconstruction_loss = torch.nn.MSELoss()(vector, reconstructed_vector)
representation_loss = torch.nn.MSELoss()(latent_vector, vector[:N])
loss = reconstruction_loss + representation_loss

We find that, perhaps unsurprisingly, the autoencoder learns to “hide” the second half of the vector in the representation of first half. This holds even if we weight the representation loss higher, or make the representation loss sparse. It also holds if we add regularization (dropout, L1, L2) or add noise to the latent space.

Preventing steganography

We are actively experimenting with different methodologies to remove the steganography. These methodologies rely on the core insight that independently trained models will have the same representation for vector[:N], but independent representations for vector[N:]. By training multiple models, and shuffling encoders & decoders, we can filter out the steganography and keep the desired representation.

As an example, we use this insight in the following methodology:

Train \(M\) encoder/decoder pairs:
- Use both the representation & reconstruction losses.
- Expect:
  - Good performance on the representation loss
  - Good performance on reconstructing vector[:N].
  - Good performance on reconstructing vector[N:] - hidden information was encoded in the latent space!
Discard the encoders.
Train a new encoder:
- Use only the representation loss.
- Randomly sample which of the \(M\) decoders to use on each step.
- Expect:
  - Good performance on the representation loss
  - Good performance on reconstructing vector[:N].
  - Bad performance on reconstructing vector[N:] - hidden information was filtered out!

The intuition is that the new encoder can’t simultaneously hide information for all \(M\) decoders, so it doesn’t hide any information at all.

We find that this methodology works in the toy example, and performs better with increasing numbers of trained models.

Motivation for this research

We believe this work is useful under three related framings.

We are aiming to improve the robustness of interpretability tools by ensuring a one-to-one mapping between human-labelled features and latent space variables. This can be done by filtering out all non-human-labelled information from the latent space, as described above.

Additionally, this is a first step towards a potential solution to the Eliciting Latent Knowledge problem. Hoagy has written about this here and here. Briefly, we can use this methodology to constrain a “note taking” system to only write interpretable statements.

The third motivation is to build secure ML tools. We hope that, in the end, this approach will not rely on the incentives of cost functions to avoid undesired behavior, but instead rely on provable statements about what information is shared between models.