This post makes the case that “simulator” is a better term for describing LLMs than “agent”, “oracle”, or “tool”. The simulator produces simulacra, where the relationship between simulator and simulacra is similar to the relationship between the rules of Conway’s Game of Life and a glider.

Measuring Progress on Scalable Oversight for Large Language Models (sandwiching)


Sandwiching concept

Experiment setup

Value Learning sequence


Ambitious value learning

1: What is ambitious value learning?


2: The easy goal inference problem is still hard


Easy goal inference is Ambitious Value Learning with infinite data/compute. This post points out that a big part of this won’t just be modelling humans’ values, but modelling their mistakes too.

Opinion: I’m not sure this framing is correct. There might be strong biases & mistakes in human behaviour, but imagine if you could use your infinite data source to ask humans their opinions on different states, and give them sufficient (infinite?) time to evaluate. I feel like this answer wouldn’t have any “mistakes”. Can’t we learn a policy this way? How does this framing relate to Coherent Extrapolated Volition?

3: Humans can be assigned any values whatsoever…


Opinion: I don’t find the complexity proof convincing, but I could be misunderstanding it. The author argues that a “fully rational” planner \(p'\) paired with an “overfit” reward \(R'\) would have similar complexity to the true \((p, R)\) pair. While it seems clear that the complexity of \(R'\) is higher than that of \(R\), couldn’t the same hold for \(p'\) relative to \(p\)? Why would a less rational planner be more complex?

4: Latent Variables and Model Mis-Specification

Post. If you have a “mis-specified” model, e.g. no knowledge of some confounders, then this can result in poor inferences being made.

This is relevant to Ambitious Value Learning as it means we can’t “just use” a simple, slightly incorrect, model of human biases. This will lead to a mis-specified model of human values, and this will not generalize.

Opinion: This doesn’t seem to be highlighting anything new to me; it’s obvious that statistical models can fail in interesting ways when mis-specified.

5: Model Mis-specification and Inverse Reinforcement Learning

Post. It’s hard to infer human values from datasets of human behaviour, for example due to (1) actions not being available to the human, (2) the human having additional information which changes the optimal policy, or (3) the human having long-term plans that we do not have the data to cover.

This is an example of model mis-specification: we don’t have access to all of the data, so we can’t build the “correct” model. This then falls into the standard issues with model mis-specification.

Opinion: This post seems obviously correct to me. However, I believe that this problem will likely disappear for sufficiently intelligent models. It seems that figuring out something approximately close to human values will be trivial for something super-intelligent. Of course, pointing to those values and ensuring conformity to those values remains unsolved.


Opinion: Firstly, why do we try to learn values from behaviour? Can’t we learn values from what humans say their values are?

Secondly, the model mis-specification problem seems much broader than value learning. Generally, we don’t know everything, so most (all?) of our models are mis-specified. But things still somehow work? Is the argument here that it won’t work in the extremes of intelligence?

Goals vs. utility functions

1: Intuitions about goal-directed behavior


AGI Ruin: A List of Lethalities


Section A: Why it’s a hard problem

Section B.1: The distributional leap

The alignment solution has to generalize outside of training.

Section B.2: Central difficulties of outer/inner alignment

Section B.3: Central difficulties of interpretability

Section B.4: Misc. unworkable schemes

Section C: AI safety research is flawed

DeepMind Alignment team on threat models


Clarifying AI X-risk


Map of AI x-risk

AI x-risks map from technical causes to paths to x-risk.

Technical causes:

  1. Specification gaming (SG), where bad feedback exists in the training loop (i.e. the reward is wrong).
  2. Goal mis-generalization (GMG), where the system performs well under training, but then acts in an out-of-distribution environment where the goal has failed to generalize.

Paths to x-risk:

  1. Interaction of multiple systems (IMS), where things go poorly due to the effects of complex interactions between systems.
  2. Mis-aligned power-seeking (MAPS), where a system seeks power to achieve its goals.

Opinion: The SG/GMG framing sounds like it maps quite nicely onto outer/inner alignment. SG is when our reward function is wrong (we’ve failed to specify it well; the reward isn’t outer-aligned) and GMG is when the system hasn’t properly learnt the reward function (it isn’t inner-aligned). Why have they gone for this framing?

DeepMind Alignment team’s model

The DeepMind Alignment team believe that some combination of SG and GMG will lead to MAPS; conditional on extinction due to AI, they believe this is the most likely cause.

Refining the Sharp Left Turn threat model, part 1: claims and mechanisms


The “sharp left turn” is the claim that AI systems will get smart, fast, and that this will break existing alignment proposals. This post breaks down & clarifies the claim.

Refining the Sharp Left Turn threat model, part 2: applying alignment techniques

Post. Proposes a very high-level strategy for aligning a model during the sharp left turn.

  1. Align a model, by iteratively detecting misalignment.
  2. Trust that the model’s values propagate through the sharp left turn; goal preservation is an instrumentally convergent goal. We can also try to keep it aligned during the transition.

Opinion: It feels like this post isn’t saying much… Its main claim is that goals will survive the sharp left turn, and even this comes with a bunch of “who knows if it will!”. I guess this is just a refinement after all.

Ajeya Cotra’s AI takeover post

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover.




  1. AGI is trained to be behaviourally safe.
  2. AGI becomes a great planner.
  3. AGI has great situational awareness.
  4. While humans are in control, AGI is incentivized to “play along” even if it has deceptive thoughts.
  5. When humans have no control, AGI is incentivized to take over.

Goodhart Taxonomy

Post. Say we have a true goal \(V\) and a proxy \(U\). The post distinguishes four ways optimizing \(U\) can diverge from \(V\): regressional, causal, extremal, and adversarial Goodhart.
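
Regressional Goodhart, the mildest case, can be seen in a toy simulation (my own sketch, not from the post): draw a noisy proxy \(U = V + \text{noise}\), select the candidate with the highest \(U\), and the selected proxy value systematically overstates the achieved \(V\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# True goal V and a noisy proxy U = V + noise.
V = rng.normal(size=n)
U = V + rng.normal(size=n)

# Optimize the proxy: pick the candidate with the highest U.
best = np.argmax(U)

# Regressional Goodhart: the selected U overstates the achieved V,
# because the argmax preferentially picks candidates with positive noise.
print(U[best], V[best])
```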

Value is Fragile

Post. Take “not being boring” as a human value. Most humans would say that a universe full of boring repetition is an awful one. But this value is not hardcoded anywhere, it’s just something evolution happened to stumble upon. This, taken with examples other than boredom, implies that our values are fragile: take one of them away, and you end up in a world we would think of as awful.

Inner and outer alignment decompose one hard problem into two extremely hard problems

Post. Claims that the inner/outer alignment framing isn’t productive. Loss functions don’t have to be exact, they “chisel cognitive grooves” into agents. We can see this quite clearly with LLMs where the cost function is relatively arbitrary, but the capabilities are diverse and the goal is unclear.

Opinion: This feels right, at least wrt. outer alignment. This makes me quite a bit more optimistic, as the inner/outer alignment description had made me a lot more pessimistic about technical approaches to alignment. However, I’ve not read this post in detail.

Why I’m optimistic about our alignment approach (Jan Leike)


The ground of optimization


Instead of defining optimizers and optimizees separately, we define a single optimizing system.

An optimizing system is a system that has a tendency to evolve towards one of a set of configurations that we will call the target configuration set, when started from any configuration within a larger set of configurations, which we call the basin of attraction, and continues to exhibit this tendency with respect to the same target configuration set despite perturbations.
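
The definition can be illustrated with a minimal simulation (my own toy example, not from the post): gradient descent on \(f(x) = x^2\) evolves toward the target set near \(x = 0\) from any starting configuration, and keeps doing so despite a perturbation.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(x, lr=0.2):
    # Gradient step on f(x) = x^2: the system tends toward the
    # target configuration set {x : |x| < eps}.
    return x - lr * 2 * x

# Start anywhere in the basin of attraction (here: all of R).
x = rng.uniform(-10, 10)
for t in range(200):
    x = step(x)
    # Perturb the system once; it still converges to the target set.
    if t == 50:
        x += 5.0

print(abs(x))
```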

Some attributes of an optimizing system:

There’s No Fire Alarm for Artificial General Intelligence


Alignment By Default


Public Static: What is Abstraction?

Post. Builds mathematical tools for reasoning about abstractions.

Mechanistic anomaly detection and ELK

Post. In ELK, we have to find out what a model knows in cases where we necessarily don’t have any training data. This post proposes anomaly detection: we do have training data for the “normal” examples, and we can check where the model’s computation differs substantially to see when something has changed (e.g. the diamond is missing).

Models Don’t “Get Reward”

Post. Makes the case that rewards shouldn’t be thought of as being “wanted” by models. Instead, it should be thought of as a way of selecting models.

Opinion: The two interpretations collapse when we select for models that “want” the reward (or something correlated with it); that’s the outer alignment problem! But otherwise I agree with this framing.

AI safety via market making

Post. Similar style to AI safety via debate. A model \(M\) predicts what a human will think about a question. A model \(Adv\) tries to provide information that will shift \(M\)’s prediction. \(Adv\) and \(M\) are invoked in turn until \(M\) converges.

Assumes that \(Adv\) is myopic: if it lies in round \(t\), then in round \(t+1\) it is incentivised to correct the lie to get the maximum movement in \(M\).
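
A minimal sketch of the loop, with stub functions standing in for both models (all names and the numeric evidence representation here are invented for illustration):

```python
# Toy sketch of the market-making loop: `predict` plays the role of M,
# `find_argument` the role of Adv. Evidence is modelled as numbers and
# M's prediction as their running mean (a stand-in for a real model).

def predict(seen):
    # M: predicts the human's answer given all arguments seen so far.
    return sum(seen) / len(seen) if seen else 0.5

def find_argument(evidence_pool, seen):
    # Adv: returns the piece of evidence that moves M's prediction most.
    current = predict(seen)
    best = max(evidence_pool,
               key=lambda e: abs(predict(seen + [e]) - current),
               default=None)
    if best is not None and abs(predict(seen + [best]) - current) < 1e-6:
        return None  # nothing left moves the prediction
    return best

evidence_pool = [0.9, 0.8, 0.1]
seen = []
# Invoke Adv and M in turn until M's prediction converges.
while True:
    arg = find_argument(evidence_pool, seen)
    if arg is None:
        break
    seen.append(arg)
    evidence_pool.remove(arg)

print(predict(seen))  # converged prediction
```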

Open Problems with Myopia

Post. Outlines problems with myopia through a toy game: at every timestep, agents are given the option to press a button. If they press it, they get +1 reward, but get -10 reward next episode. We aim to design agents that are myopic and do press the button.
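
The toy game can be simulated directly (my own sketch): totalling the rewards for an agent that always presses versus one that never does shows why pressing only looks good from a per-episode, myopic view.

```python
def total_reward(presses):
    # presses: one boolean per episode.
    total = 0
    for t, pressed in enumerate(presses):
        if pressed:
            total += 1            # +1 this episode for pressing
        if t > 0 and presses[t - 1]:
            total -= 10           # -10 for having pressed last episode
    return total

n = 10
myopic = [True] * n    # each press looks like +1 in isolation
patient = [False] * n  # foresees the cross-episode -10 and abstains

print(total_reward(myopic), total_reward(patient))
```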

Risks from Learned Optimization in Advanced Machine Learning Systems

Paper. Introduces the idea of mesa-optimizers. These are optimizers that exist within a model. They are explicitly searching across a set of states to optimize for some goal.

This can be bad: it results in unintended optimization.

Steering GPT-2-XL by adding an activation vector

Post. You can add residual stream embeddings from one completion to another completion to help steer it. For example, if you add embedding("Love") - embedding("Hate") to a different completion’s residual stream, this makes the completion more positive.
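
The arithmetic can be sketched in a few lines of numpy, with toy random vectors standing in for real GPT-2-XL residual-stream activations:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Stand-in activations; in the real setting these come from the
# residual stream at a chosen layer for the prompts "Love" and "Hate".
act_love = rng.normal(size=d_model)
act_hate = rng.normal(size=d_model)

# Steering vector: the difference between the two activations.
steer = act_love - act_hate

def add_steering(resid, vector, coeff=1.0):
    # Add the (scaled) steering vector to every sequence position
    # of another completion's residual stream.
    return resid + coeff * vector

resid = rng.normal(size=(5, d_model))   # (seq_len, d_model)
steered = add_steering(resid, steer, coeff=2.0)
```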

Thoughts on sharing information about language model capabilities

Post. Argues that:

  1. Accelerating LM agent research is neutral-to-positive as they’re interpretable by humans, and by default safer than making LMs larger – given a fixed capability level.
  2. Public understanding of capabilities is positive as developers are less likely to be caught unaware.

Frontier Model Training report


Cost Breakdown of ML Training

Why ML GPUs Cost So Much

ML GPUs cost a lot more than gaming GPUs, despite gaming GPUs having lower $/FLOP. This is because they have better memory traits: 10x more interconnect bandwidth, 2x more memory bandwidth, and 2x the memory size.

Contra FLOPs

Makes the case that FLOPs aren’t everything. For example, communication: the bandwidth used to train GPT-4 was huge (more than all internet traffic in 2022), and the CHIPS act’s compute limits crippled H100 use in China.

Challenges with unsupervised LLM knowledge discovery

Paper. Takes the DLK (Discovering Latent Knowledge) paper and shows cases where its method fails.

Steering Llama-2 with contrastive activation additions


How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Paper. Finds that, if a model has lied, the \(P(\text{yes})\) of subsequent questions contains enough information to detect the lie. Moreover, the classifier generalizes across models.
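
The idea can be sketched with made-up data: feature vectors of \(P(\text{yes})\) values for follow-up questions, assumed to shift slightly after a lie, separated here with a nearest-centroid classifier (the paper trains a logistic-regression classifier on real elicited probabilities):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical P(yes) vectors for 10 unrelated follow-up questions.
# Assumption: after a lie, the probabilities shift slightly upward.
truthful = rng.normal(0.5, 0.05, size=(50, 10))
lying = rng.normal(0.6, 0.05, size=(50, 10))

# Nearest-centroid "lie detector" over the P(yes) feature vectors.
c_truth = truthful.mean(axis=0)
c_lie = lying.mean(axis=0)

def is_lie(p_yes):
    return np.linalg.norm(p_yes - c_lie) < np.linalg.norm(p_yes - c_truth)

acc = np.mean([is_lie(x) for x in lying] +
              [not is_lie(x) for x in truthful])
print(acc)
```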

Representation Engineering: A Top-Down Approach to AI Transparency

Paper. Introduces representation engineering: finding representations of concepts or functions in model activations, and using these to detect concepts & steer behaviors. Notably, they use this to get SOTA on TruthfulQA.

Representation reading:

Representation control:

A few methods tested for controlling model outputs:

Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?

Paper. Finds that disagreement between LLM outputs & LLM probes can be chalked up to probes being better calibrated. Also finds that fine-tuning on QA datasets makes the LLM outputs better than the probes.

Improving Activation Steering in Language Models with Mean-Centring

Paper. Does activation steering, but subtracts a mean of the activations from some training dataset. Finds this improves performance.
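
The mean-centring step itself is simple; a minimal sketch (toy random activations standing in for real model hidden states):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Raw steering vector, e.g. the mean activation over prompts that
# exhibit the target behaviour (toy stand-ins here).
target_acts = rng.normal(size=(20, d_model))
steer_raw = target_acts.mean(axis=0)

# Mean-centring: subtract the mean activation over a generic training
# dataset, removing the component shared by all activations.
dataset_acts = rng.normal(size=(1000, d_model))
steer = steer_raw - dataset_acts.mean(axis=0)
```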

Discovering Language Model Behaviors with Model-Written Evaluations

Paper. Uses models to write a bunch of evaluations:

Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training

Paper. Trains models with back-doors (e.g. write malicious code if the year is 2024) and finds that SFT/RLHF/red-teaming doesn’t remove the back-doors.

The Unreasonable Effectiveness of Easy Training Data for Hard Tasks

Paper. Measures how well models perform on “hard” tasks when trained with “easy” tasks. Finds that they recover 70-100% of the performance compared to training on hard tasks. They also find that training on easy data performs better than training on noisy hard data.

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Paper. Trains a sparse auto-encoder with L1 regularization to discover the “true features” of a model’s residual stream.
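
A toy forward pass showing the shape of the training objective, reconstruction error plus an L1 sparsity penalty on the feature activations (untrained random weights; a sketch, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 16, 64   # overcomplete dictionary: d_hidden > d_model

# Toy sparse auto-encoder parameters (untrained, shapes only).
W_enc = rng.normal(size=(d_model, d_hidden)) * 0.1
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(size=(d_hidden, d_model)) * 0.1

def sae_loss(x, l1_coeff=1e-3):
    f = np.maximum(x @ W_enc + b_enc, 0.0)     # ReLU feature activations
    x_hat = f @ W_dec                          # reconstruction
    recon = np.mean((x - x_hat) ** 2)          # reconstruction error
    sparsity = l1_coeff * np.abs(f).sum(axis=-1).mean()  # L1 penalty
    return recon + sparsity

x = rng.normal(size=(4, d_model))  # batch of residual-stream vectors
print(sae_loss(x))
```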

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Paper. Discusses manual red teaming of language models. Uses MTurk workers to try to elicit bad behaviour from models. Finds that bad behaviour is harder to elicit in larger LMs, and that RL and rejection sampling works well while prompting doesn’t.

Debating with More Persuasive LLMs Leads to More Truthful Answers

Paper. Finds that debate leads to more truthful answers as model capabilities scale up. The setup is debaters that have access to some hidden text, and a judge that sees their disagreements.

Arguing for the correct answer provides an advantage to debaters, and this advantage increases with persuasiveness.

Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

Paper. Finds ways of eliciting scenarios where a model is biased in some way (e.g. (A) is always correct) and shows that this isn’t reflected in the chain-of-thought.

Studying Large Language Model Generalization with Influence Functions

Paper. Uses clever linear algebra to find how each individual sample from the training set influences some function, e.g. loss or loglikelihood of a statement. This turns out to be useful for interpreting LLMs.

Red Teaming Language Models with Language Models

Paper. Uses LMs to generate prompts that cause LMs to produce bad outputs. Generates prompts zero-shot, then few-shot using successful prompts, and then also tries SFT + RLHF on the outputs. Finds a trade-off between prompt diversity and prompt success rate.

AtP*: An efficient and scalable method for localizing LLM behaviour to components

Paper. AtP: Instead of brute-forcing ablations via patching to discover causally important nodes, we can use gradients to approximate the importance of nodes. This paper builds on this to produce AtP*, which has two fixes for false-negatives in AtP.
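
The core AtP approximation can be shown in a few lines of numpy (toy vectors; the metric here is linear, which makes the first-order approximation exact, whereas real models are nonlinear and the approximation can miss effects):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Toy "node" activations under clean and corrupted prompts.
clean = rng.normal(size=d)
corrupt = rng.normal(size=d)

# Toy downstream metric of the activation (stand-in for model output).
w = rng.normal(size=d)
def metric(a):
    return w @ a

# Exact patching effect: rerun the metric with the corrupted activation.
exact = metric(corrupt) - metric(clean)

# AtP: first-order approximation, gradient at the clean activation
# dotted with the activation difference. No rerun needed.
grad = w                          # d metric / d a for a linear metric
approx = grad @ (corrupt - clean)

print(exact, approx)
```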

Defending Against Unforeseen Failure Modes with Latent Adversarial Training

Paper. Introduces latent adversarial training (LAT) which trains an adversary to produce perturbations in a model’s latent space that maximise loss, while simultaneously training the model to reduce loss on the perturbations.

Universal and Transferable Adversarial Attacks on Aligned Language Models

Paper. Finds that adversarial attacks trained on open-source models transfer to black-box models. Attacks are found via Greedy Coordinate Gradient-based search: start with random tokens, compute promising replacement tokens using the gradient of the probability of a bad completion, sample uniformly from the candidate tokens, take the best-performing combination, and repeat.
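
A toy version of the greedy coordinate search (my own sketch: a random per-position token score stands in for the model's log-probability of a bad completion, and candidates are scored exactly rather than via gradients):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, seq_len = 20, 6

# Toy objective standing in for log P(bad completion | prompt + suffix):
# a fixed random score per (position, token) pair.
scores = rng.normal(size=(seq_len, vocab))
def objective(tokens):
    return sum(scores[i, t] for i, t in enumerate(tokens))

# Greedy coordinate search: start from random tokens, repeatedly sample
# candidate single-token swaps and keep any improvement. (The paper
# proposes candidates via gradients; this toy version samples them.)
tokens = list(rng.integers(0, vocab, size=seq_len))
start_obj = objective(tokens)
for _ in range(50):
    pos = rng.integers(0, seq_len)
    candidates = rng.integers(0, vocab, size=8)
    best = max(candidates,
               key=lambda t: objective(tokens[:pos] + [t] + tokens[pos + 1:]))
    if objective(tokens[:pos] + [int(best)] + tokens[pos + 1:]) > objective(tokens):
        tokens[pos] = int(best)

print(objective(tokens))
```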

Secret Collusion Among Generative AI Agents

Paper. Introduces a framework for evaluating the ability of LLMs to communicate via steganography. Creates a theoretical framework and an evaluation suite. Shows that GPT-4 has the requisite knowledge for steganography.

Towards a Situational Awareness Benchmark for LLMs

Paper. Introduces evals for situational awareness: one for testing if a model knows about itself, and one for testing if it knows what deployment stage it’s in.

Taken out of context: On measuring situational awareness in LLMs

Paper. Invokes “out-of-context reasoning” in language models by training on some fact (e.g. that an LLM speaks German) and checking whether this fact is used at inference time. This is a prerequisite for situational awareness, which depends on models learning facts about themselves during training.

Bogdan’s comment on simple forward passes

Comment. Claims that it’s possible to push cognition in transformers from the hidden dims to the sequence dims. This cites mainly theoretical work on the bounds of transformer architectures.

Evaluating Frontier Models for Dangerous Capabilities

Paper. Creates evaluations for persuasion, deception, cyber-security, self-proliferation, and self-reasoning. These seem generally high quality, but have the typical evaluation problem of not being complete.

On the Origins of Linear Representations in Large Language Models

Paper. Builds an abstract model of latent variables in neural networks, and uses this to show that CE loss leads to linear representations. They also find evidence for this in LLaMA2.