# Alignment

## Simulators

This post makes the case that “simulator” is a better term for LLMs than “agent”, “oracle”, or “tool” AI. The simulator produces simulacra, and the relationship between a simulator and its simulacra is similar to the relationship between the rules of Conway’s Game of Life and a glider.

## Measuring Progress on Scalable Oversight for Large Language Models (sandwiching)

### Sandwiching concept

• In certain areas, LLMs outperform laypeople but are outperformed by experts.
• e.g. medicine
• This is a good test bed for alignment strategies: We can try to align the LLMs to the laymen, and verify the results with the experts.
• The goal is to produce aligned systems on the first attempt. However, while experimenting, we can run multiple validation steps using the experts.

### Experiment setup

• Assuming static models, i.e. no fine-tuning, for simplicity.
• Using labels instead of experts, meaning “alignment” is good performance on the task.
• Not using debate or anything fancy, just talking with the model.
• Two tests:

## Value Learning sequence

### Ambitious value learning

#### 1: What is ambitious value learning?

Post.

• Solution to the specification problem: how can we define the behaviour we want a system to perform?
• From specification gaming problems & conceptual arguments, it seems like we won’t be able to just write down the specification.
• Ambitious value learning is learning what humans’ true values are.
• Impractically, we can assume infinite data / infinite compute / infinite querying of humans.

#### 2: The easy goal inference problem is still hard

Post.

Easy goal inference is Ambitious Value Learning with infinite data/compute. This post points out that a big part of this won’t just be modelling humans’ values, but modelling their mistakes too.

Opinion: I’m not sure this framing is correct. There might be strong biases & mistakes in human behaviour, but imagine if you could use your infinite data source to ask humans their opinions on different states, and give them sufficient (infinite?) time to evaluate. I feel like this answer wouldn’t have any “mistakes”. Can’t we learn a policy this way? How does this framing relate to Coherent Extrapolated Volition?

#### 3: Humans can be assigned any values whatsoever…

Post.

• Given a policy $$\pi \in \Pi$$, we want to extract its reward function $$R \in \mathcal{R}$$.
• However, the policy might be suboptimal.
• We introduce a third component, a planner $$p : \mathcal{R} \to \Pi$$ that maps reward functions to policies.
• A policy $$\pi$$ can be explained by several $$(p, R)$$ pairs.
• The author argues that a simplicity prior on $$(p, R)$$ does not work.
• Intuitively, this is because you can only shift complexity between $$p$$ and $$R$$.
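The degeneracy can be made concrete with a toy sketch. Everything here is an illustrative assumption rather than the post’s formalism: a two-state, two-action, one-step setting, and three hypothetical planners (rational, anti-rational, and one that ignores the reward entirely). All three $$(p, R)$$ pairs produce the same policy, so observing the policy alone cannot tell them apart.

```python
from typing import Dict, List, Tuple

State = str
Action = str
Reward = Dict[Tuple[State, Action], float]
Policy = Dict[State, Action]

# Toy environment (an assumption for illustration): one-step decisions only.
STATES: List[State] = ["s0", "s1"]
ACTIONS: List[Action] = ["left", "right"]


def rational_planner(R: Reward) -> Policy:
    """Pick the reward-maximizing action in every state."""
    return {s: max(ACTIONS, key=lambda a: R[(s, a)]) for s in STATES}


def anti_rational_planner(R: Reward) -> Policy:
    """Pick the reward-minimizing action in every state."""
    return {s: min(ACTIONS, key=lambda a: R[(s, a)]) for s in STATES}


def indifferent_planner(R: Reward) -> Policy:
    """Ignore R completely; the behaviour is hard-coded in the planner."""
    return {"s0": "left", "s1": "right"}


R_true: Reward = {
    ("s0", "left"): 1.0, ("s0", "right"): 0.0,
    ("s1", "left"): 0.0, ("s1", "right"): 1.0,
}
R_neg: Reward = {k: -v for k, v in R_true.items()}
R_zero: Reward = {k: 0.0 for k in R_true}

pi_rational = rational_planner(R_true)
pi_anti = anti_rational_planner(R_neg)
pi_indifferent = indifferent_planner(R_zero)

# All three decompositions explain the same observed behaviour.
print(pi_rational == pi_anti == pi_indifferent)
```

Note how the complexity moves around: the indifferent planner carries the whole policy while its reward is trivial, which is the “shifting complexity between $$p$$ and $$R$$” intuition.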

Opinion: I don’t find the complexity proof convincing, but I could be misunderstanding it. The author argues that a “fully rational” $$p'$$ and an “overfit” $$R'$$ would have a similar complexity to a true $$(p, R)$$ pair. While it’s obvious that the complexity of $$R'$$ is higher than the complexity of $$R$$, it feels like it could also be the case for $$p$$ and $$p'$$. Why would a less rational planner be more complex?

#### 4: Latent Variables and Model Mis-Specification

Post. If you have a “mis-specified” model, e.g. no knowledge of some confounders, then this can result in poor inferences being made.

This is relevant to Ambitious Value Learning as it means we can’t “just use” a simple, slightly incorrect, model of human biases. This will lead to a mis-specified model of human values, and this will not generalize.

Opinion: This doesn’t seem to be highlighting anything new to me, it’s obvious statistical models can fail in interesting ways when mis-specified.

#### 5: Model Mis-specification and Inverse Reinforcement Learning

Post. It’s hard to infer human values from datasets of human behaviour, for example due to (1) actions not being available to the human, (2) the human having additional information which changes the optimal policy, or (3) the human having long-term plans that we do not have the data to cover.

This is an example of model mis-specification: we don’t have access to all of the data, so we can’t build the “correct” model. This then falls into the standard issues with model mis-specification.

Opinion: This post seems obviously correct to me. However, I believe that this problem will likely disappear for sufficiently intelligent models. It seems that figuring out something approximately close to human values will be trivial for something super-intelligent. Of course, pointing to those values and ensuring conformity to those values remains unsolved.

#### Summary

• Ambitious value learning aims to learn value functions that are safe to optimize.
• But we only observe behaviours, not values.
• And human behaviour doesn’t always directly optimize for our values due to biases.
• Learning both values & biases is hard, as there are many pairs that explain human behaviour.
• You can make an assumption about the biases, but this can lead to model mis-specification, which can lead to false inferences.

Opinion: Firstly, why do we try to learn values from behaviour? Can’t we learn values from what humans say their values are?

Secondly, the model mis-specification problem seems much broader than value learning. Generally, we don’t know everything, so most (all?) of our models are mis-specified. But things still somehow work? Is the argument here that it won’t work in the extremes of intelligence?

Post.

## AGI Ruin: A List of Lethalities

Post.

### Section A: Why it’s a hard problem

• AGI will not be upper bounded by human ability (see AlphaGo).
• AGI will be able to “break out of the box”.
• We need to get alignment right on the first try.
• We can’t coordinate to not build AGI.
• We need to perform a “pivotal act” while we have “weaker AGI” before other “stronger AGIs” are built.

### Section B.1: The distributional leap

The alignment solution has to generalize outside of training.

• Once it’s AGI it’s not safe to train, so you can’t train in the right distribution.
• It has to generalize from safe environments to dangerous environments.
• It has to generalize from low to high intelligence levels.
• Some problems only appear at levels of high intelligence (e.g. deciding not to circumvent its programmers).
• The low to high intelligence transition will likely happen quickly.

### Section B.2: Central difficulties of outer/inner alignment

• Outer optimization on a loss function doesn’t produce inner optimization on that loss function.
• This is much more often the case than not, e.g. in humans.
• This is also the case for very simple loss functions.
• There’s no known way to use losses/inputs/rewards to point at particular things in the environment.
• Learning values from humans is hard due to biases (see value learning).
• Capabilities generalize better than alignment (capabilities have a tight update loop, alignment doesn’t).
• Of some general alignment solutions:
• Corrigibility is hard as it goes against consequentialism.
• Coherent Extrapolated Volition is very hard to do on the first try.

### Section B.3: Central difficulties of interpretability

• We have nothing that currently works to a sufficiently good level.
• Knowing that something is bad allows you to not run it; it doesn’t allow you to make it good.
• Optimizing against an alignment detector will select for (1) aligned thoughts and (2) hidden unaligned thoughts.
• We can’t validate an AGI’s thoughts/outputs if we don’t understand them.
• Pivotal acts rely on doing something humans can’t, which means we won’t be able to understand it.
• Human concepts are flawed, which will make it hard to map AGI-concepts to them.

### Section B.4: Misc. unworkable schemes

• Coordination between AGIs will not involve humans.
• Adversarial AGIs will have the same problem.

### Section C: AI safety research is flawed

• AIS researchers are ignorant of the difficulties of alignment.
• AIS therefore isn’t tackling the hard problems.
• No one’s smart enough to figure this out without EY :)
• Genius from other fields might not translate to alignment, due to lack of feedback loops.

## DeepMind Alignment team on threat models

### Clarifying AI X-risk

Post.

#### Map of AI x-risk

AI x-risks map from technical causes to paths to x-risk.

Technical causes:

1. Specification gaming (SG), where bad feedback exists in the training loop (i.e. the reward is wrong).
2. Goal mis-generalization (GMG), where the system performs well under training, but then acts in an out-of-distribution environment where the goal has failed to generalize.

Paths to x-risk:

1. Interaction of multiple systems (IMS), where things go poorly due to the effects of complex interactions between systems.
2. Mis-aligned power-seeking (MAPS), where a system seeks power to achieve its goals.

Opinion: The SG/GMG framing sounds like it maps quite nicely to inner/outer alignment. SG is when our reward function is wrong (we’ve failed to specify well, the cost function isn’t outer aligned) and GMG is when the system hasn’t properly learnt the cost function (it isn’t inner aligned). Why have they gone for this framing?

#### DeepMind Alignment team’s model

The DeepMind Alignment team believe that some combination of SG and GMG will lead to MAPS. Conditional on extinction due to AI, they believe the most likely cause will be:

• AGI comes from foundation models and RLHF.
• Risk comes from both SG and GMG, but more so GMG.
• The AGI is a misaligned consequentialist:
• Consequentialist: Picking actions to improve a metric.
• Misaligned: The metric is not what we intended.
• The AGI will exhibit deceptive alignment.
• The AGI won’t be shut down soon enough, due to the right people not understanding.
• Interpretability will be hard.

### Refining the Sharp Left Turn threat model, part 1: claims and mechanisms

Post.

The “sharp left turn” is the claim that AI systems will get smart, fast, and this will break existing alignment proposals. This post breaks down & clarifies the claim:

• Claim 1: Capabilities will generalize across many domains.
• e.g. it groks consequentialism, or can self-improve.
• Claim 2: Alignment techniques that previously worked will fail.
• Claim 3: Humans can’t intervene in time.

### Refining the Sharp Left Turn threat model, part 2: applying alignment techniques

Post. Proposes a very high-level strategy for aligning a model during the sharp left turn.

1. Align a model. Do this by detecting misalignment iteratively.
2. Trust that the model’s values are propagated throughout the sharp left turn. This is an instrumental convergent goal. We can also try to keep it aligned.

Opinion: It feels like this post isn’t saying much… Its main claim is that goals will survive the sharp left turn, and even this comes with a bunch of “who knows if it will!”. I guess this is just a refinement after all.

## Ajeya Cotra’s AI takeover post

### Outline

#### Assumptions

• Racing forward: AI companies will push capabilities as far and as fast as possible.
• Human Feedback on Diverse Tasks (HFDT) scales far: Current approaches to AI will scale to AGI.
• Naive safety effort: AI companies will train a model to be behaviourally safe, but not much more.

#### Scenario

1. AGI is trained to be behaviourally safe.
2. AGI becomes a great planner.
3. AGI has great situational awareness.
4. While humans are in control, AGI is incentivized to “play along” even if it has deceptive thoughts.
5. When humans have no control, AGI is incentivized to take over.

## Goodhart Taxonomy

Post. Say we have a true goal $$V$$ and a proxy $$U$$.

• Regressional Goodhart: $$U = V + X$$ where $$X$$ is some noise. High $$U$$ values will also have high $$X$$ values.
• Causal Goodhart: $$V \to U$$, but you can optimize $$U$$ independently of $$V$$.
• Extremal Goodhart: When $$U$$ takes an extreme value, the correlation between $$V$$ and $$U$$ can disappear.
• Adversarial Goodhart: If $$V, U$$ are correlated and $$V, V'$$ are competing, optimizers of $$V'$$ have an incentive to align with $$U$$ to piggy-back off the alignment with $$V$$.
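Regressional Goodhart is the easiest of the four to check numerically. The sketch below assumes $$V$$ and $$X$$ are independent standard Gaussians and selects the top 1% of samples by $$U$$; the distributions, sample size, and cutoff are illustrative assumptions, not taken from the post.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

N = 10_000   # number of candidates (assumed)
TOP_K = 100  # how many we keep after selecting on the proxy (assumed)

samples = []
for _ in range(N):
    v = random.gauss(0.0, 1.0)  # true goal V
    x = random.gauss(0.0, 1.0)  # independent noise X
    samples.append((v + x, v, x))  # proxy U = V + X

# Optimize the proxy: keep the candidates with the highest U.
top = sorted(samples, key=lambda t: t[0], reverse=True)[:TOP_K]

mean_u = sum(t[0] for t in top) / TOP_K
mean_v = sum(t[1] for t in top) / TOP_K
mean_x = sum(t[2] for t in top) / TOP_K

print(f"mean U of selected: {mean_u:.2f}")
print(f"mean V of selected: {mean_v:.2f}")
print(f"mean X of selected: {mean_x:.2f}")
```

Among the selected candidates the noise term is far above its population mean of zero, so the proxy overstates how much true value the selection bought.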

## Value is Fragile

Post. Take “not being boring” as a human value. Most humans would say that a universe full of boring repetition is an awful one. But this value is not hardcoded anywhere, it’s just something evolution happened to stumble upon. This, taken with examples other than boredom, implies that our values are fragile: take one of them away, and you end up in a world we would think of as awful.

## Inner and outer alignment decompose one hard problem into two extremely hard problems

Post. Claims that the inner/outer alignment framing isn’t productive. Loss functions don’t have to be exact, they “chisel cognitive grooves” into agents. We can see this quite clearly with LLMs where the cost function is relatively arbitrary, but the capabilities are diverse and the goal is unclear.

Opinion: This feels right, at least wrt. outer alignment. This makes me quite a bit more optimistic, as the inner/outer alignment description made me a lot more pessimistic about technical approaches to alignment. However, I’ve not read this post in detail.

## Why I’m optimistic about our alignment approach (Jan Leike)

Post.

• Path looks like LLMs rather than Deep RL, so a lot of human context will be known by LLM AGIs.
• RLHF empirically seems to be feasible.
• We’re not aiming for full alignment, just to align an “alignment researcher” model.
• Validation is easier than generation.
• We can iterate.

## The ground of optimization

Post.

Instead of defining optimizers and optimizees separately, we define a single optimizing system.

An optimizing system is a system that has a tendency to evolve towards one of a set of configurations that we will call the target configuration set, when started from any configuration within a larger set of configurations, which we call the basin of attraction, and continues to exhibit this tendency with respect to the same target configuration set despite perturbations.

Some attributes of an optimizing system:

• Robustness: How big is the basin of attraction, along which dimensions?
• Duality: How separate are the optimizer and optimizee?
• Retargetability: How easy is it to change the target configuration set?
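A minimal sketch of the definition, under heavy simplifying assumptions: the “system” is a single number that moves a fixed fraction of the way towards a target each step. The update rule, the `target` and `rate` parameters, and the one-off perturbation are all hypothetical choices for illustration. The target configuration set is the point `target`, and here the basin of attraction is the whole real line.

```python
def step(x: float, target: float = 3.0, rate: float = 0.2) -> float:
    """One update: move a fixed fraction of the way towards the target."""
    return x + rate * (target - x)


def run(
    x0: float,
    steps: int = 200,
    perturb_at: int = 50,
    perturb: float = 5.0,
    target: float = 3.0,
) -> float:
    """Evolve the system from x0, knocking it off course once along the way."""
    x = x0
    for t in range(steps):
        if t == perturb_at:
            x += perturb  # external perturbation
        x = step(x, target=target)
    return x


# Different starting configurations all end up at the target,
# even though each run is perturbed partway through.
for x0 in (-10.0, 0.0, 10.0):
    print(x0, round(run(x0), 6))
```

In this toy, robustness shows up as convergence from every starting point and after the perturbation, and retargetability is as cheap as passing a different `target`.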

## There’s No Fire Alarm for Artificial General Intelligence

Post.

• The purpose of a fire alarm is to make it socially acceptable to panic.
• We shouldn’t wait for a fire alarm for AGI:
• If we knew for certain it was coming e.g. 50 years in the future, we would still start work today.
• Key developments can “feel like” decades away, even for involved scientists (e.g. Fermi, Wright brothers).
• Progress is driven by peak knowledge, which is hard to estimate.
• Things that are hard to do now will be easy to do in the future.
• Progress is generally really hard to predict, even two years ahead.
• So, panic now!

## Alignment By Default

Post.

• Human values might be a “natural abstraction”, i.e. something a superintelligence will learn anyway.
• If we train a model on a proxy for human values, the most efficient way to do that might be to point directly at the learnt values.
• This assumes it is easier to point at human values than it is to point at the data generation process, and that the data is explained by both.
• Thus, the superintelligence is aligned.
• The author puts it at 10% chance of happening.

## Public Static: What is Abstraction?

Post. Builds mathematical tools for reasoning about abstractions.

• Given a causal DAG.
• It has different sets of nodes $$\{x_i\}$$ which are “low level” models of a part of the system.
• They have causal influence between each other, mediated by noisy $$\{z_{i,j}\}$$ sets of nodes.
• Intuitively, the noisy influence means that each $$x_i$$ abstracts $$x_j$$.
• Formally, given a query $$q_i$$ over $$x_i$$:
• $$P(q_i | x_j) = P(q_i | f(z_{i,j}))$$ where $$f$$ extracts a “high level” model of $$x_j$$.
• $$x_i$$ and $$x_j$$ are independent given $$f(z_{i,j})$$.
• This does rely on $$z_{i,j}$$ being sufficiently noisy between $$x_i$$ and $$x_j$$.
• Intuitively, the two elements are not dependent on low level models of each other.

## Mechanistic anomaly detection and ELK

Post. In ELK, we have to find out what a model knows in examples where we necessarily don’t have any training data. This post proposes anomaly detection: we do have training data for the “normal” examples, and we can look for where the model’s computation differs substantially from those to detect when something has changed (e.g. the diamond is missing).

## Models Don’t “Get Reward”

Post. Makes the case that rewards shouldn’t be thought of as being “wanted” by models. Instead, it should be thought of as a way of selecting models.

Opinion: The two interpretations collapse when we select for models that “want” the reward - or something correlated with it, that’s the outer alignment problem! But otherwise I agree with this framing.

## AI safety via market making

Post. Similar style to AI safety via debate. A model $$M$$ predicts what a human will think about a question. A model $$Adv$$ tries to provide information that will shift $$M$$’s prediction. $$Adv$$ and $$M$$ are invoked in turn until $$M$$ converges.

Assumes that the $$Adv$$ is myopic: If it lies in round $$t$$, then in $$t+1$$ it is incentivised to correct the lie to get the maximum movement in $$M$$.
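The loop can be sketched as follows. This is a toy under stated assumptions: in the post $$M$$ and $$Adv$$ are trained models, whereas here `predict` is a hand-written stand-in for $$M$$, the adversary is a greedy search over a fixed pool of hypothetical arguments, and the argument names and weights are invented for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def market_making(
    predict: Callable[[Sequence[str]], float],
    candidate_args: Sequence[str],
    eps: float = 1e-3,
    max_rounds: int = 20,
) -> Tuple[float, List[str]]:
    """Alternate Adv and M until no remaining argument moves M's prediction."""
    transcript: List[str] = []
    current = predict(transcript)
    for _ in range(max_rounds):
        unused = [a for a in candidate_args if a not in transcript]
        if not unused:
            break
        # Myopic Adv: pick the argument with the largest one-step shift in M.
        best = max(unused, key=lambda a: abs(predict(transcript + [a]) - current))
        shift = abs(predict(transcript + [best]) - current)
        if shift < eps:
            break  # M has converged
        transcript.append(best)
        current = predict(transcript)
    return current, transcript


# Hypothetical stand-in for M: each argument nudges the predicted belief.
WEIGHTS: Dict[str, float] = {"claim": 0.3, "rebuttal": -0.2, "irrelevant": 0.0}


def toy_predict(args: Sequence[str]) -> float:
    belief = 0.5 + sum(WEIGHTS[a] for a in args)
    return min(1.0, max(0.0, belief))


final, used = market_making(toy_predict, list(WEIGHTS))
print(final, used)
```

In this toy the adversary first plays the argument that overshoots and is then rewarded for playing the one that corrects it, which loosely mirrors the myopia argument above.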

## Open Problems with Myopia

Post. Outlines problems with myopia through a toy game: at every timestep, agents are given the option to press a button. If they press it, they get +1 reward, but get -10 reward next episode. We aim to design agents that are myopic and do press the button.

• Superrationality:
• The model can effectively trade with itself across timesteps by deciding not to push the button iff the previous model doesn’t push the button.
• This changes its reward by +9 ($$1-10 = -9 \to 0$$).
• The model is using non-myopic decision theory.
• Anthropic uncertainty:
• If the -10 rule only applies after a certain number of steps, and the agent is uncertain about which timestep it is in, it will act as a (weighted) average of possible timesteps.
• Counterfactual mugging:
• The model might think it is being simulated by the previous step, and should act according to that.
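The payoffs in the button game can be tallied with a short sketch. It assumes each timestep is its own episode, so the -10 lands on the very next step; the function and constant names are hypothetical.

```python
from typing import List

PRESS_REWARD = 1          # reward received on the step the button is pressed
NEXT_STEP_PENALTY = -10   # reward received on the step after a press (assumed timing)


def rewards(presses: List[bool]) -> List[int]:
    """Per-step rewards for a sequence of press / don't-press decisions."""
    out = []
    for t, pressed in enumerate(presses):
        r = PRESS_REWARD if pressed else 0
        if t > 0 and presses[t - 1]:
            r += NEXT_STEP_PENALTY  # pay for the previous step's press
        out.append(r)
    return out


always_press = rewards([True] * 5)   # what a myopic agent does
never_press = rewards([False] * 5)   # what the superrational "trade" achieves

print(always_press, sum(always_press))  # [1, -9, -9, -9, -9] -35
print(never_press, sum(never_press))    # [0, 0, 0, 0, 0] 0
```

After the first step, always pressing earns -9 per step while never pressing earns 0, which is the +9 per-step gain that tempts a non-myopic agent to cooperate with its other timesteps.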