# Preventing steganography in latent vectors

**N.B.: A more complete write up is here
on LessWrong.**

*This work is done in collaboration with Hoagy
Cunningham*.

There is a lot of prior research into manually assigning labels to latent dimensions (for example, concept whitening). However, in the extremes of optimization, neural networks may be incentivized to encode additional features into these labelled dimensions - especially if the size of the latent space is constrained. This can be hard to detect: the latent space will not significantly diverge from the labelling, but still conceal additional information.

What we end up with is steganography in the latent space. A latent vector seems to represent human-understandable features, but the network is able to extract some information hidden in the vector.

This research investigates preventing this steganography.

## Eliciting steganography

We can design toy examples that elicit this behaviour. We train an autoencoder to reconstruct \(2n\)-digit binary strings using a latent space of \(\mathbb{R}^n\). On top of the reconstruction loss, we introduce a **representation loss**, which incentivizes the model to use the \(i\)th latent variable to represent the \(i\)th binary digit, for \(i = 1, \dots, n\). The \((n+1)\)th to \((2n)\)th binary digits carry no representation cost.

```
vector = random_binary_vector()                # shape=2N
latent_vector = encoder(vector)                # shape=N
reconstructed_vector = decoder(latent_vector)  # shape=2N
reconstruction_loss = torch.nn.MSELoss()(vector, reconstructed_vector)
representation_loss = torch.nn.MSELoss()(latent_vector, vector[:N])
loss = reconstruction_loss + representation_loss
```
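To make the setup concrete, here is a minimal runnable sketch of this toy experiment in PyTorch. The layer sizes, batch size, step count, and optimizer settings are illustrative assumptions, not the exact configuration used in the experiments.

```python
import torch

N = 8       # latent width (assumed toy value); inputs are 2N digits
BATCH = 64  # assumed batch size

def random_binary_vector(batch=BATCH, n=N):
    """Batch of random 2n-digit binary strings."""
    return torch.randint(0, 2, (batch, 2 * n)).float()

# Small MLP encoder/decoder; the architecture is an illustrative assumption.
encoder = torch.nn.Sequential(
    torch.nn.Linear(2 * N, 32), torch.nn.ReLU(), torch.nn.Linear(32, N))
decoder = torch.nn.Sequential(
    torch.nn.Linear(N, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2 * N))

opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)
mse = torch.nn.MSELoss()

for step in range(200):
    vector = random_binary_vector()                # shape (BATCH, 2N)
    latent_vector = encoder(vector)                # shape (BATCH, N)
    reconstructed_vector = decoder(latent_vector)  # shape (BATCH, 2N)
    reconstruction_loss = mse(vector, reconstructed_vector)
    # Pin the i-th latent variable to the i-th binary digit.
    representation_loss = mse(latent_vector, vector[:, :N])
    loss = reconstruction_loss + representation_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Trained long enough, the representation loss pulls `latent_vector` toward `vector[:, :N]`, while the reconstruction loss still rewards recovering `vector[:, N:]`, setting up the incentive to hide the second half in small perturbations of the pinned latents.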

We find that, perhaps unsurprisingly, the autoencoder learns to “hide” the second half of the vector in the representation of the first half. This holds even if we weight the representation loss higher, or make the representation loss sparse. It also holds if we add regularization (dropout, L1, L2) or add noise to the latent space.

## Preventing steganography

We are actively experimenting with different methodologies to remove the steganography. These methodologies rely on the core insight that **independently trained models will have the same representation for `vector[:N]`, but independent representations for `vector[N:]`**. By training multiple models, and shuffling encoders & decoders, we can *filter out* the steganography and keep the desired representation.

As an example, we use this insight in the following methodology:

- Train \(M\) encoder/decoder pairs:
  - Use both the representation & reconstruction losses.
  - Expect:
    - Good performance on the representation loss
    - Good performance on reconstructing `vector[:N]`
    - Good performance on reconstructing `vector[N:]` (*hidden information was encoded in the latent space!*)

- Discard the encoders.
- Train a new encoder:
  - Use only the representation loss.
  - Randomly sample which of the \(M\) decoders to use on each step.
  - Expect:
    - Good performance on the representation loss
    - Good performance on reconstructing `vector[:N]`
    - Bad performance on reconstructing `vector[N:]` (*hidden information was filtered out!*)

The intuition is that the new encoder can’t simultaneously hide information for all \(M\) decoders, so it doesn’t hide any information at all.

We find that this methodology works in the toy example, and performs better with increasing numbers of trained models.
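The two phases above can be sketched as follows. Sizes, architectures, and step counts are illustrative assumptions; following the steps as written, the new encoder in phase two trains on the representation loss alone, and the randomly sampled frozen decoder is used to check reconstruction of the second half.

```python
import random
import torch

N, M = 8, 4  # assumed toy sizes: latent width and number of model pairs
mse = torch.nn.MSELoss()

def random_binary_vector(batch=64, n=N):
    return torch.randint(0, 2, (batch, 2 * n)).float()

def make_encoder():
    return torch.nn.Sequential(
        torch.nn.Linear(2 * N, 32), torch.nn.ReLU(), torch.nn.Linear(32, N))

def make_decoder():
    return torch.nn.Sequential(
        torch.nn.Linear(N, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2 * N))

# Phase 1: train M independent encoder/decoder pairs on both losses.
pairs = [(make_encoder(), make_decoder()) for _ in range(M)]
for enc, dec in pairs:
    opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)
    for _ in range(100):
        v = random_binary_vector()
        z = enc(v)
        loss = mse(v, dec(z)) + mse(z, v[:, :N])
        opt.zero_grad()
        loss.backward()
        opt.step()

# Phase 2: discard the encoders; train a fresh encoder on the
# representation loss only, and on each step sample one of the M
# frozen decoders to monitor reconstruction of vector[N:].
decoders = [dec for _, dec in pairs]
new_encoder = make_encoder()
opt = torch.optim.Adam(new_encoder.parameters(), lr=1e-3)
for _ in range(100):
    v = random_binary_vector()
    z = new_encoder(v)
    loss = mse(z, v[:, :N])  # representation loss only
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        dec = random.choice(decoders)  # shuffle decoders each step
        # Expected to stay bad: the hidden second half is filtered out.
        recon_second_half = mse(dec(z)[:, N:], v[:, N:])
```

Because each frozen decoder learned its own private code for `vector[N:]`, no single latent perturbation decodes correctly for all of them, which is why the filtering improves as \(M\) grows.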

## Motivation for this research

We believe this work is useful under three related framings.

We are aiming to **improve the robustness of
interpretability tools** by ensuring a one-to-one
mapping between human-labelled features and latent space
variables. This can be done by filtering out all
non-human-labelled information from the latent space, as
described above.

Additionally, this is a **first step towards a
potential solution to the Eliciting
Latent Knowledge problem**. Hoagy has written
about this here
and here.
Briefly, we can use this methodology to constrain a
“note taking” system to only write interpretable
statements.

The third motivation is to **build secure ML
tools**. We hope that, in the end, this approach
will not rely on the incentives of cost functions to
avoid undesired behavior, but instead rely on provable
statements about what information is shared between
models.