This post makes the case that “simulator” is a better term for LLMs than “agent”, “oracle”, or “tool”. The simulator produces simulacra, where the relationship between simulator and simulacra is similar to the relationship between the rules of Conway’s Game of Life and a glider.
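The analogy can be made concrete with a few lines of code: the update rule below plays the role of the simulator, and the glider is a simulacrum, a pattern that persists and moves under that rule.

```python
from collections import Counter

def step(live):
    """One generation of Conway's Game of Life over a set of live (x, y) cells."""
    neighbours = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is alive next step iff it has 3 neighbours,
    # or 2 neighbours and is currently alive.
    return {c for c, n in neighbours.items() if n == 3 or (n == 2 and c in live)}

glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
state = glider
for _ in range(4):
    state = step(state)

# After 4 generations the glider reappears shifted one cell diagonally:
# the rule (simulator) is fixed, while the glider (simulacrum) moves.
shifted = {(x + 1, y + 1) for (x, y) in glider}
print(state == shifted)  # True
```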

Measuring Progress on Scalable Oversight for Large Language Models (sandwiching)


Sandwiching concept

Experiment setup

Value Learning sequence


Ambitious value learning

1: What is ambitious value learning?


2: The easy goal inference problem is still hard


Easy goal inference is Ambitious Value Learning with infinite data/compute. This post points out that a big part of this won’t just be modelling humans’ values, but modelling their mistakes too.

Opinion: I’m not sure this framing is correct. There might be strong biases & mistakes in human behaviour, but imagine if you could use your infinite data source to ask humans their opinions on different states, and give them sufficient (infinite?) time to evaluate. I feel like this answer wouldn’t have any “mistakes”. Can’t we learn a policy this way? How does this framing relate to Coherent Extrapolated Volition?

3: Humans can be assigned any values whatsoever…


Opinion: I don’t find the complexity proof convincing, but I could be misunderstanding it. The author argues that a “fully rational” \(p'\) and an “overfit” \(R'\) would have similar complexity to the true \((p, R)\) pair. While it’s obvious that \(R'\) is more complex than \(R\), the same could be true of \(p'\) relative to \(p\). Why would a less rational planner be more complex?

4: Latent Variables and Model Mis-Specification

Post. If you have a “mis-specified” model, e.g. no knowledge of some confounders, then this can result in poor inferences being made.

This is relevant to Ambitious Value Learning as it means we can’t “just use” a simple, slightly incorrect, model of human biases. This will lead to a mis-specified model of human values, and this will not generalize.
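As a toy illustration of the point (my example, not the post’s): fitting a linear model that omits a confounder gives a badly biased estimate of the effect we care about, and no amount of data fixes it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)                   # unobserved confounder
x = z + rng.normal(size=n)               # observed input, correlated with z
y = 2 * x + 3 * z + rng.normal(size=n)   # true effect of x is 2

# Mis-specified model: regress y on x alone, ignoring z.
slope_biased = (x @ y) / (x @ x)

# Well-specified model: regress y on both x and z.
X = np.stack([x, z], axis=1)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(round(slope_biased, 2))  # ~3.5, far from the true 2.0
print(np.round(coef, 2))       # ~[2.0, 3.0]
```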

Opinion: This doesn’t seem to highlight anything new to me; it’s well known that statistical models can fail in interesting ways when mis-specified.

5: Model Mis-specification and Inverse Reinforcement Learning

Post. It’s hard to infer human values from datasets of human behaviour, for example due to (1) actions not being available to the human, (2) the human having additional information which changes the optimal policy, or (3) the human having long-term plans that we do not have the data to cover.

This is an example of model mis-specification: we don’t have access to all of the data, so we can’t build the “correct” model. This then falls into the standard issues with model mis-specification.

Opinion: This post seems obviously correct to me. However, I believe that this problem will likely disappear for sufficiently intelligent models. It seems that figuring out something approximately close to human values will be trivial for something super-intelligent. Of course, pointing to those values and ensuring conformity to those values remains unsolved.


Opinion: Firstly, why do we try to learn values from behaviour? Can’t we learn values from what humans say their values are?

Secondly, the model mis-specification problem seems much broader than value learning. Generally, we don’t know everything, so most (all?) of our models are mis-specified. But things still somehow work? Is the argument here that it won’t work in the extremes of intelligence?

Goals vs. utility functions

1: Intuitions about goal-directed behavior


AGI Ruin: A List of Lethalities


Section A: Why it’s a hard problem

Section B.1: The distributional leap

The alignment solution has to generalize outside of training.

Section B.2: Central difficulties of outer/inner alignment

Section B.3: Central difficulties of interpretability

Section B.4: Misc. unworkable schemes

Section C: AI safety research is flawed

DeepMind Alignment team on threat models


Clarifying AI X-risk


Map of AI x-risk

AI x-risks map from technical causes to paths to x-risk.

Technical causes:

  1. Specification gaming (SG), where bad feedback exists in the training loop (i.e. the reward is wrong).
  2. Goal mis-generalization (GMG), where the system performs well under training, but then acts in an out-of-distribution environment where the goal has failed to generalize.

Paths to x-risk:

  1. Interaction of multiple systems (IMS), where things go poorly due to the effects of complex interactions between systems.
  2. Mis-aligned power-seeking (MAPS), where a system seeks power to achieve its goals.
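GMG can be sketched in a few lines (my toy example; the names `true_goal` and `proxy_policy` are made up). A policy that learned a proxy goal scores perfectly in training, where the proxy and the true goal coincide, and then fails out of distribution:

```python
# During training, "go to the green marker" and "reach the exit" coincide,
# so a policy that internalized the proxy goal looks perfectly aligned.

def true_goal(state):
    return state["agent"] == state["exit"]

def proxy_policy(state):
    return state["green_marker"]   # learned behaviour: head for the marker

# Training distribution: the marker is always on the exit.
train = [{"exit": i, "green_marker": i, "agent": None} for i in range(5)]
# Deployment: the correlation breaks.
deploy = [{"exit": i, "green_marker": (i + 1) % 5, "agent": None} for i in range(5)]

def score(envs):
    hits = 0
    for env in envs:
        env["agent"] = proxy_policy(env)   # agent moves where the policy says
        hits += true_goal(env)
    return hits / len(envs)

print(score(train))   # 1.0 -- looks aligned under training
print(score(deploy))  # 0.0 -- the goal failed to generalize
```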

Opinion: The SG/GMG framing sounds like it maps quite nicely onto outer/inner alignment. SG is when our reward function is wrong (we’ve failed to specify it well; the reward function isn’t outer aligned), and GMG is when the system hasn’t properly learnt the reward function (it isn’t inner aligned). Why have they gone for this framing?

DeepMind Alignment team’s model

The DeepMind Alignment team believe that some combination of SG and GMG will lead to MAPS. Conditional on extinction due to AI, they believe this is the most likely cause.

Refining the Sharp Left Turn threat model, part 1: claims and mechanisms


The “sharp left turn” is the claim that AI systems will get smart, fast, and that this will break existing alignment proposals. This post breaks down and clarifies the claim.

Refining the Sharp Left Turn threat model, part 2: applying alignment techniques

Post. Proposes a very high-level strategy for aligning a model during the sharp left turn.

  1. Align a model. Do this by detecting misalignment iteratively.
  2. Trust that the model’s values are propagated throughout the sharp left turn. Preserving one’s goals is an instrumentally convergent goal, and we can also actively try to keep the model aligned.

Opinion: It feels like this post isn’t saying much… Its main claim is that goals will survive the sharp left turn, and even this comes with a bunch of “who knows if it will!”. I guess this is just a refinement after all.

Ajeya Cotra’s AI takeover post

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover.




  1. AGI is trained to be behaviourally safe.
  2. AGI becomes a great planner.
  3. AGI has great situational awareness.
  4. While humans are in control, AGI is incentivized to “play along” even if it has deceptive thoughts.
  5. When humans have no control, AGI is incentivized to take over.

Goodhart Taxonomy

Post. Say we have a true goal \(V\) and a proxy \(U\). The post distinguishes four ways optimizing \(U\) can come apart from \(V\): regressional (selecting for \(U\) also selects for the noise between \(U\) and \(V\)), causal (intervening on \(U\) doesn’t change \(V\) if the correlation isn’t causal), extremal (the \(U\)–\(V\) relationship breaks down in extreme regions), and adversarial (other agents exploit the proxy).
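The regressional case is easy to simulate (my example): with \(U = V + \text{noise}\), picking the option with the highest \(U\) recovers noticeably less \(V\) than picking on \(V\) directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 500

v_picked, v_ideal = [], []
for _ in range(trials):
    V = rng.normal(size=n)            # true value of each option
    U = V + rng.normal(size=n)        # proxy = value + independent noise
    v_picked.append(V[np.argmax(U)])  # optimize the proxy
    v_ideal.append(V.max())           # optimize the true goal

mean_v_picked = float(np.mean(v_picked))
mean_v_ideal = float(np.mean(v_ideal))
# Selecting on the proxy still helps, but systematically under-delivers on V.
print(mean_v_picked < mean_v_ideal)  # True
```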

Value is Fragile

Post. Take “not being boring” as a human value. Most humans would say that a universe full of boring repetition is an awful one. But this value is not hardcoded anywhere, it’s just something evolution happened to stumble upon. This, taken with examples other than boredom, implies that our values are fragile: take one of them away, and you end up in a world we would think of as awful.

Inner and outer alignment decompose one hard problem into two extremely hard problems

Post. Claims that the inner/outer alignment framing isn’t productive. Loss functions don’t have to be exact; they “chisel cognitive grooves” into agents. We can see this quite clearly with LLMs, where the cost function is relatively arbitrary but the capabilities are diverse and the goal is unclear.

Opinion: This feels right, at least w.r.t. outer alignment. It makes me quite a bit more optimistic, as the inner/outer alignment framing had made me a lot more pessimistic about technical approaches to alignment. However, I’ve not read this post in detail.

Why I’m optimistic about our alignment approach (Jan Leike)


The ground of optimization


Instead of defining optimizers and optimizees separately, we define a singular optimizing system.

An optimizing system is a system that has a tendency to evolve towards one of a set of configurations that we will call the target configuration set, when started from any configuration within a larger set of configurations, which we call the basin of attraction, and continues to exhibit this tendency with respect to the same target configuration set despite perturbations.
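A minimal numerical sketch of such a system (my toy, assuming a quadratic potential): from any start in a wide basin of attraction, the state evolves toward a small target set and keeps doing so despite perturbations.

```python
import random

def evolve(x, steps=2000, seed=0):
    """Damped descent on f(x) = x**2 with occasional random kicks."""
    rng = random.Random(seed)
    for t in range(steps):
        x -= 0.1 * 2 * x               # move downhill
        if t % 100 == 0:
            x += rng.uniform(-1, 1)    # external perturbation
    return x

# Basin of attraction is wide; the target configuration set is "near 0".
for start in (-50.0, 3.0, 40.0):
    assert abs(evolve(start)) < 1.0
print("converged from every start, despite perturbations")
```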

The post goes on to list some attributes of an optimizing system.

There’s No Fire Alarm for Artificial General Intelligence


Alignment By Default


Public Static: What is Abstraction?

Post. Builds mathematical tools for reasoning about abstractions.

Mechanistic anomaly detection and ELK

Post. In ELK, we have to find out what a model knows in cases where we necessarily have no training data. This post proposes anomaly detection: we do have training data for the “normal” examples, and we can check where the model’s computation differs substantially from those to see when something has changed (e.g. the diamond is missing).
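One simple way to operationalize the idea (my sketch, not the post’s proposal; real mechanistic anomaly detection works on the model’s internal computation rather than a Gaussian toy): fit the distribution of activations on trusted “normal” runs, then flag runs whose activations are far from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Activations recorded on trusted "normal" runs (stand-in data).
normal_acts = rng.normal(loc=0.0, scale=1.0, size=(5000, 16))
mu = normal_acts.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(normal_acts, rowvar=False))

def anomaly_score(act):
    """Mahalanobis distance from the normal-run activation distribution."""
    d = act - mu
    return float(np.sqrt(d @ cov_inv @ d))

typical = rng.normal(size=16)          # computation looks normal
tampered = rng.normal(size=16) + 4.0   # something changed, e.g. diamond missing
print(anomaly_score(typical) < anomaly_score(tampered))  # True
```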

Models Don’t “Get Reward”

Post. Makes the case that reward shouldn’t be thought of as something models “want”. Instead, it should be thought of as a way of selecting models.

Opinion: The two interpretations collapse when we select for models that “want” the reward - or something correlated with it, that’s the outer alignment problem! But otherwise I agree with this framing.

AI safety via market making

Post. Similar style to AI safety via debate. A model \(M\) predicts what a human will think about a question. A model \(Adv\) tries to provide information that will shift \(M\)’s prediction. \(Adv\) and \(M\) are invoked in turn until \(M\) converges.

Assumes that \(Adv\) is myopic: if it lies in round \(t\), then in round \(t+1\) it is incentivised to correct the lie to get the maximum movement in \(M\).
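The loop can be sketched as follows (my toy instantiation; in the real proposal \(M\) and \(Adv\) are trained models, not hand-written functions):

```python
# M predicts the human's answer; Adv supplies arguments that move M's
# prediction; iterate until the prediction stops moving.

def market_making(question, M, Adv, max_rounds=10, tol=1e-3):
    transcript = []
    prediction = M(question, transcript)
    for _ in range(max_rounds):
        argument = Adv(question, transcript, prediction)  # tries to move M
        transcript.append(argument)
        new_prediction = M(question, transcript)
        if abs(new_prediction - prediction) < tol:        # market has converged
            return new_prediction, transcript
        prediction = new_prediction
    return prediction, transcript

# Toy instantiation: three pieces of evidence together push the
# human's answer to 0.9; Adv reveals them one at a time.
evidence = [0.5, 0.3, 0.1]

def M(question, transcript):
    return sum(transcript)            # belief after the arguments so far

def Adv(question, transcript, prediction):
    remaining = [e for e in evidence if e not in transcript]
    return remaining[0] if remaining else 0.0

answer, transcript = market_making("q", M, Adv)
print(round(answer, 2))  # converges to 0.9 once Adv runs out of arguments
```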

Open Problems with Myopia

Post. Outlines problems with myopia through a toy game: at every timestep, agents are given the option to press a button. If they press it, they get +1 reward, but get -10 reward next episode. We aim to design agents that are myopic and do press the button.
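The toy game in code (my sketch): a fully myopic agent, which discounts the future to zero, presses every round; an agent that values future episodes never does.

```python
def play(policy, episodes=10):
    total, penalty_next = 0, 0
    for _ in range(episodes):
        reward = -penalty_next        # pay last episode's penalty, if any
        if policy():
            reward += 1               # +1 now...
            penalty_next = 10         # ...but -10 next episode
        else:
            penalty_next = 0
        total += reward
    return total

def myopic():
    return True                       # only values this episode's +1

def farsighted():
    return False

print(play(myopic))      # 1 + 9 * (1 - 10) = -80 across episodes
print(play(farsighted))  # 0
```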

Risks from Learned Optimization in Advanced Machine Learning Systems

Paper. Introduces the idea of mesa-optimizers. These are optimizers that exist within a model. They are explicitly searching across a set of states to optimize for some goal.

This can be bad: it results in unintended optimization.

Steering GPT-2-XL by adding an activation vector

Post. You can add residual stream embeddings from one completion to another completion to help steer it. For example, if you add embedding("Love") - embedding("Hate") to a different completion’s residual stream, this makes the completion more positive.
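A toy version of the arithmetic (made-up 3-d vectors; the real technique adds the difference into GPT-2-XL’s residual stream at a chosen layer and token position):

```python
import numpy as np

# Made-up "embeddings" standing in for residual-stream activations.
emb = {
    "Love": np.array([1.0, 0.2, 0.0]),
    "Hate": np.array([-1.0, 0.1, 0.0]),
}
steering_vector = emb["Love"] - emb["Hate"]

residual_stream = np.array([0.1, -0.4, 0.7])       # some completion's activations
steered = residual_stream + 0.5 * steering_vector  # 0.5 is a strength knob

# Toy "sentiment readout" showing the steering moved the activations
# in the positive direction.
sentiment_direction = np.array([1.0, 0.0, 0.0])
print(float(residual_stream @ sentiment_direction))  # 0.1
print(float(steered @ sentiment_direction))          # 1.1
```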

Thoughts on sharing information about language model capabilities

Post. Argues that:

  1. Accelerating LM agent research is neutral-to-positive, as agents are interpretable by humans and, at a fixed capability level, safer by default than making LMs larger.
  2. Public understanding of capabilities is positive, as developers are less likely to be caught unaware.

Frontier Model Training report


Cost Breakdown of ML Training

Why ML GPUs Cost So Much

ML GPUs cost a lot more than gaming GPUs, despite gaming GPUs having a lower $/FLOP. This is because ML GPUs have better memory characteristics: 10x the interconnect bandwidth, 2x the memory bandwidth, and 2x the memory size.
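A roofline-style back-of-the-envelope (illustrative round numbers, not vendor specs) shows why bandwidth rather than peak FLOP/s can bound training throughput, making the cheaper-per-FLOP card the worse buy:

```python
# Roofline model: achievable FLOP/s is capped either by the compute roof
# or by memory bandwidth times the workload's arithmetic intensity.

def achievable_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)

intensity = 60                       # FLOPs per byte moved (assumed workload)
ml_gpu = achievable_flops(1000e12, 2000e9, intensity)      # 2x bandwidth
gaming_gpu = achievable_flops(1000e12, 1000e9, intensity)  # same peak FLOPs

# Same peak compute, but the bandwidth-starved card delivers half as much.
print(ml_gpu / 1e12, gaming_gpu / 1e12)  # 120.0 60.0 (TFLOP/s)
```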

Contra FLOPs

Makes the case that FLOPs aren’t everything. Communication matters too: the bandwidth used to train GPT-4 was huge (more than all internet traffic in 2022), and the CHIPS Act’s compute limits crippled H100 use in China.