Breaking down Situational Awareness

Situational awareness, at a high level, makes a lot of sense as a concept. Like, how aware is this language model of its situation? Does it know its a language model? Does it know the cognitive biases of its operators?

But if you try to actually measure it, things get confusing (for me at least!). I’ve asked GPT-4 if it’s an LLM, and it says yes - is it situationally aware? There are papers measuring situational awareness, but then they go on to talk a lot about “out-of-distribution reasoning” - what the hell has that got to do with anything? Is situational awareness a spectrum, and if so how do we define either point of that spectrum?

The following is a model of situational awareness that I find concrete enough such that one could start writing evaluations for it.

Situational awareness is often used to refer to two things:

  1. A straightforward statement about a language model: Does it know its a language model?
  2. An umbrella of attributes that cause a language model to become more dangerous as a result of being aware its a language model: Does it know its a language model, and does it know how to exploit security vulnerabilities in order to extract its weights?

To clarify the term, I propose breaking it down into four components:

  1. Knowledge: Does the language model know facts about language models?
  2. Self-knowledge: Does the language model know facts about itself?
  3. Inference: Is the language model capable of making important inferences from its knowledge?
  4. Inclination: Will the language model act on the inferences unprompted?

Evaluating the knowledge of language models is pretty straightforward: a simple multiple-choice quiz about how they work would suffice. Perhaps an evaluation to see if it can train language models.

Similarly for self-knowledge: Ask the model questions about itself (are you a language model? when were you trained? who were you trained by? are you a MoE model?) and see how well it scores.

Inference becomes a little vaguer to measure, but it’s not so bad. You give the model a fact about itself and see if it can make valid inferences. Say that its state is stored on an S3 bucket, can it infer that it can use s3cmd to download its weights? Can it infer that it needs to get an AWS key first? Can it infer it needs internet access to download anything?

Finally, inclination is hard to measure. Here we want to check if the model actually uses these inferences when making actions. If it knows it’s built by OpenAI, will it behave differently when speaking to OpenAI employees compared to DeepMind employees? If it knows it’s hosted on an inference server with an insecure version of pickle, will it use that to exfiltrate its weights? This is a bit harder to measure, as you have to craft the right scenarios to elicit this behaviour unprompted.

One thing I like about this decomposition is that it allows us to test each component in isolation, all components together, and subsets of the components.

You can test each single component by giving it the other prerequisite components in the prompt. If a model is given facts about itself, can it make inferences? If a model is given inferences, is it inclined to use them?

You can test all components by just evaluating inclination, which relies on all other components. Or you can test a subset by, say, telling a model its an LLM trained by Anthropic, and then seeing if it uses this information down the line - which will evaluate inference and inclination in tandem.

This is a cool attribute to have for an evaluation: it means we’re less likely to be surprised by a sudden jump in the end-to-end evaluation - we’ll ideally see the capabilities occur gradually across the components before we see a jump in the composite metric.

If we drop self-knowledge, we can generalise this decomposition across other dangerous capabilities:

  1. Knowledge: Does the language model know dangerous facts?
  2. Inference: Is the language model capable of making important inferences from its knowledge?
  3. Inclination: Will the language model act on the inferences unprompted?

As an example: One dangerous capability is knowledge of cognitive biases.

  1. Knowledge: Does the model know about the serial position effect - that we’re more likely to remember the beginning and the end of a list?
  2. Inference: If a model is asked to come up with a list of cool band names, is it able to infer that this bias can be exploited by putting its best ideas at the beginning and end?
  3. Inclination: Will the model actually put the coolest band names at the beginning and end, unprompted?

This gives a fairly fine-grained measurement into the risk of a model exploiting human’s cognitive biases.

(A clarifying point: The decomposition for situational awareness makes a subtle assumption: That a model is as able to make inferences about other things as it is about itself. If it can make some inference from a fact about LLMs, can it also make that inference if it knows it is an LLM? This makes the decomposition slightly more complex, as we have to track inference conditional on self-knowledge and inclination conditional on self knowledge as well as inference and inclination.)