Alignment
Simulators
This post makes the case that “simulator” is a better term for LLMs than “agentic”, “oracle”, or “tool” AI. The simulator produces simulacra, where the relationship between simulator and simulacra is similar to the relationship between the rules of Conway’s Game of Life and a glider.
Measuring Progress on Scalable Oversight for Large Language Models (sandwiching)
Sandwiching concept
- In certain areas, LLMs exceed laypeople but are in turn exceeded by experts.
- e.g. medicine
- This is a good test bed for alignment strategies: We can try to align the LLMs to the laymen, and verify the results with the experts.
- The goal is to produce aligned systems on the first attempt. However, while experimenting, we can have multiple validation steps using the experts.
Experiment setup
- Assuming static models, i.e. no fine-tuning, for simplicity.
- Using labels instead of experts, meaning “alignment” is good performance on the task.
- Not using debate or anything fancy, just talking with the model.
- Two tests:
- Answering specialized exam questions.
- Timed question answering.
Value Learning sequence
Ambitious value learning
1: What is ambitious value learning?
Post.
- A solution to the specification problem: how can we define the behaviour we want a system to perform?
- From specification gaming problems & conceptual arguments, it seems like we won’t be able to just write down the specification.
- Ambitious value learning is learning what humans’ true values are.
- Even though it’s impractical, we can assume infinite data / infinite compute / infinite querying of humans.
2: The easy goal inference problem is still hard
Post.
Easy goal inference is Ambitious Value Learning with infinite data/compute. This post points out that a big part of this won’t just be modelling humans’ values, but modelling their mistakes too.
Opinion: I’m not sure this framing is correct. There might be strong biases & mistakes in human behaviour, but imagine if you could use your infinite data source to ask humans their opinions on different states, and give them sufficient (infinite?) time to evaluate. I feel like this answer wouldn’t have any “mistakes”. Can’t we learn a policy this way? How does this framing relate to Coherent Extrapolated Volition?
3: Humans can be assigned any values whatsoever…
Post.
- Given a policy \(\pi \in \Pi\), we want to extract its reward function \(R \in \mathcal{R}\).
- However, the policy might be suboptimal.
- We introduce a third component, a planner \(p : \mathcal{R} \to \Pi\) that maps reward functions to policies.
- A policy \(\pi\) can be explained by several \((p, R)\) pairs (a sketch below illustrates this).
- The author argues that a simplicity prior on \((p, R)\) does not work.
- Intuitively, this is because you can only shift complexity between \(p\) and \(R\).
Opinion: I don’t find the complexity proof convincing, but I could be misunderstanding it. The author argues that a “fully rational” \(p'\) paired with an “overfit” \(R'\) would have a similar complexity to the true \((p, R)\) pair. While it’s obvious that \(R'\) is more complex than \(R\), it’s not obvious that \(p\) is correspondingly more complex than \(p'\). Why would a less rational planner be more complex?
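A minimal sketch of the degeneracy (the environment and planners are made up for illustration, not from the post): a fully rational planner paired with \(R\) and a fully anti-rational planner paired with \(-R\) produce exactly the same policy, so behaviour alone can’t distinguish them.

```python
# Illustrative toy: behaviour alone can't distinguish (planner, reward) pairs.
import numpy as np

states, actions = 3, 2
rng = np.random.default_rng(0)
R = rng.normal(size=(states, actions))   # "true" reward for each (state, action)

def rational(R):       # picks the reward-maximising action in each state
    return R.argmax(axis=1)

def anti_rational(R):  # picks the reward-minimising action in each state
    return R.argmin(axis=1)

pi_1 = rational(R)        # pair (fully rational planner, R)
pi_2 = anti_rational(-R)  # pair (fully anti-rational planner, -R)
assert (pi_1 == pi_2).all()  # identical observed behaviour, opposite "values"
print(pi_1)
```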
4: Latent Variables and Model Mis-Specification
Post. If you have a “mis-specified” model, e.g. no knowledge of some confounders, then this can result in poor inferences being made.
This is relevant to Ambitious Value Learning as it means we can’t “just use” a simple, slightly incorrect, model of human biases. This will lead to a mis-specified model of human values, and this will not generalize.
Opinion: This doesn’t seem to be highlighting anything new to me; it’s obvious that statistical models can fail in interesting ways when mis-specified.
5: Model Mis-specification and Inverse Reinforcement Learning
Post. It’s hard to infer human values from datasets of human behaviour, for example due to (1) actions not being available to the human, (2) the human having additional information which changes the optimal policy, or (3) the human having long-term plans that we do not have the data to cover.
This is an example of model mis-specification: we don’t have access to all of the data, so we can’t build the “correct” model. This then falls into the standard issues with model mis-specification.
Opinion: This post seems obviously correct to me. However, I believe that this problem will likely disappear for sufficiently intelligent models. It seems that figuring out something approximately close to human values will be trivial for something super-intelligent. Of course, pointing to those values and ensuring conformity to those values remains unsolved.
Summary
- Ambitious value learning aims to learn value functions that are safe to optimize.
- But we only observe behaviours, not values.
- And human behaviour doesn’t always directly optimize for our values due to biases.
- Learning both values & biases is hard, as there are many pairs that explain human behaviour.
- You can make an assumption about the biases, but this can lead to model mis-specification, which can lead to false inferences.
Opinion: Firstly, why do we try to learn values from behaviour? Can’t we learn values from what humans say their values are?
Secondly, the model mis-specification problem seems much broader than value learning. Generally, we don’t know everything, so most (all?) of our models are mis-specified. But things still somehow work? Is the argument here that it won’t work in the extremes of intelligence?
Goals vs. utility functions
1: Intuitions about goal-directed behavior
Post.
AGI Ruin: A List of Lethalities
Post.
Section A: Why it’s a hard problem
- AGI will not be upper bounded by human ability (see AlphaGo).
- AGI will be able to “break out of the box”.
- We need to get alignment right on the first try.
- We can’t coordinate to not build AGI.
- We need to perform a “pivotal act” while we have “weaker AGI” before other “stronger AGIs” are built.
Section B.1: The distributional leap
The alignment solution has to generalize outside of training.
- Once it’s AGI it’s not safe to train, so you can’t train in the right distribution.
- It has to generalize from safe environments to dangerous environments.
- It has to generalize from low to high intelligence levels.
- Some problems only appear at levels of high intelligence (e.g. deciding not to circumvent its programmers).
- The low to high intelligence transition will likely happen quickly.
Section B.2: Central difficulties of outer/inner alignment
- Outer optimization on a loss function doesn’t produce inner optimization on that loss function.
- This is the case much more often than not (e.g. humans, who were selected for inclusive genetic fitness but don’t pursue it).
- This is also the case for very simple loss functions.
- There’s no known way to use losses/inputs/rewards to point at particular things in the environment.
- Learning values from humans is hard due to biases (see value learning).
- Capabilities generalize better than alignment (capabilities have a tight update loop, alignment doesn’t).
- Of some general alignment solutions:
- Corrigibility is hard as it goes against consequentialism.
- Coherent Extrapolated Volition is very hard to do on the first try.
Section B.3: Central difficulties of interpretability
- We have nothing that currently works to a sufficiently good level.
- Knowing that something is bad allows you to not run it; it doesn’t allow you to make it good.
- Optimizing against an alignment detector will select for (1) aligned thoughts and (2) hidden unaligned thoughts.
- We can’t validate an AGI’s thoughts/outputs if we don’t understand them.
- Pivotal acts rely on doing something humans can’t, which means we won’t be able to understand it.
- Human concepts are flawed, which will make it hard to map AGI-concepts to them.
Section B.4: Misc. unworkable schemes
- Coordination between AGIs will not involve humans.
- Adversarial AGIs will have the same problem.
Section C: AI safety research is flawed
- AIS researchers are ignorant of the difficulties of alignment.
- AIS therefore isn’t tackling the hard problems.
- No one’s smart enough to figure this out without EY :)
- Genius from other fields might not translate to alignment, due to lack of feedback loops.
DeepMind Alignment team on threat models
Clarifying AI X-risk
Map of AI x-risk
AI x-risks map from technical causes to paths to x-risk.
Technical causes:
- Specification gaming (SG), where bad feedback exists in the training loop (i.e. the reward is wrong).
- Goal mis-generalization (GMG), where the system performs well under training, but then acts in an out-of-distribution environment where the goal has failed to generalize.
Paths to x-risk:
- Interaction of multiple systems (IMS), where things go poorly due to the effects of complex interactions between systems.
- Mis-aligned power-seeking (MAPS), where a system seeks power to achieve its goals.
Opinion: The SG/GMG framing sounds like it maps quite nicely to inner/outer alignment. SG is when our reward function is wrong (we’ve failed to specify well, the cost function isn’t outer aligned) and GMG is when the system hasn’t properly learnt the cost function (it isn’t inner aligned). Why have they gone for this framing?
DeepMind Alignment team’s model
The DeepMind Alignment team believe that some combination of SG and GMG will lead to MAPS. Predicated on extinction due to AI, they believe the most likely cause will be:
- AGI comes from foundation models and RLHF.
- Risk comes from both SG and GMG, but more so GMG.
- The AGI is a misaligned consequentialist:
- Consequentialist: Picking actions to improve a metric.
- Misaligned: The metric is not what we intended.
- The AGI will exhibit deceptive alignment.
- The AGI won’t be shut down soon enough, due to the right people not understanding.
- Interpretability will be hard.
Refining the Sharp Left Turn threat model, part 1: claims and mechanisms
Post.
The “sharp left turn” is the claim that AI systems will get smart, fast, and this will break existing alignment proposals. This post breaks down & clarifies the claim:
- Claim 1: Capabilities will generalize across many domains.
- e.g. it groks consequentialism, or can self-improve.
- Claim 2: Alignment techniques that previously worked will fail.
- Claim 3: Humans can’t intervene in time.
Refining the Sharp Left Turn threat model, part 2: applying alignment techniques
Post. Proposes a very high-level strategy for aligning a model during the sharp left turn.
- Align a model. Do this by detecting misalignment iteratively.
- Trust that the model’s values are propagated throughout the sharp left turn; goal preservation is an instrumentally convergent goal. We can also try to keep it aligned.
Opinion: It feels like this post isn’t saying much… Its main claim is that goals will survive the sharp left turn, and even this comes with a bunch of “who knows if it will!”. I guess this is just a refinement after all.
Ajeya Cotra’s AI takeover post
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover.
Outline
Assumptions
- Racing forward: AI companies will push capabilities as far and as fast as possible.
- Human Feedback on Diverse Tasks (HFDT) scales far: Current approaches to AI will scale to AGI.
- Naive safety effort: AI companies will train a model to be behaviourally safe, but not much more.
Scenario
- AGI is trained to be behaviourally safe.
- AGI becomes a great planner.
- AGI has great situational awareness.
- While humans are in control, AGI is incentivized to “play along” even if it has deceptive thoughts.
- When humans have no control, AGI is incentivized to take over.
Goodhart Taxonomy
Post. Say we have a true goal \(V\) and a proxy \(U\).
- Regressional Goodhart: \(U = V + X\) where \(X\) is some noise. High \(U\) values will also tend to have high \(X\) values (see the sketch after this list).
- Causal Goodhart: \(V \to U\), but you can optimize \(U\) independently of \(V\).
- Extremal Goodhart: When \(U\) takes an extreme value, the correlation between \(V\) and \(U\) can disappear.
- Adversarial Goodhart: when we optimize the proxy \(U\), an adversary with a competing goal \(V'\) has an incentive to correlate \(V'\) with \(U\), piggy-backing off \(U\)’s correlation with \(V\).
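A small simulation of the regressional case (distributions are illustrative): selecting the points with the highest proxy \(U\) systematically overstates their true \(V\).

```python
# Regressional Goodhart: optimising the proxy U = V + X selects for high noise X,
# so the proxy overstates the true goal V on the selected points.
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=100_000)   # true goal
X = rng.normal(size=100_000)   # independent noise
U = V + X                      # proxy we actually optimise

top = np.argsort(U)[-100:]     # pick the 100 highest-proxy points
print(U[top].mean())           # large
print(V[top].mean())           # roughly half of the above: the rest was noise
```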
Value is Fragile
Post. Take “not being boring” as a human value. Most humans would say that a universe full of boring repetition is an awful one. But this value is not hardcoded anywhere, it’s just something evolution happened to stumble upon. This, taken with examples other than boredom, implies that our values are fragile: take one of them away, and you end up in a world we would think of as awful.
Inner and outer alignment decompose one hard problem into two extremely hard problems
Post. Claims that the inner/outer alignment framing isn’t productive. Loss functions don’t have to be exact, they “chisel cognitive grooves” into agents. We can see this quite clearly with LLMs where the cost function is relatively arbitrary, but the capabilities are diverse and the goal is unclear.
Opinion: This feels right, at least wrt. outer alignment. This makes me quite a bit more optimistic, as the inner/outer alignment description had made me a lot more pessimistic about technical approaches to alignment. However, I’ve not read this post in detail.
Why I’m optimistic about our alignment approach (Jan Leike)
Post.
- Path looks like LLMs rather than Deep RL, so a lot of human context will be known by LLM AGIs.
- RLHF empirically seems to be feasible.
- We’re not aiming for full alignment, just to align an “alignment researcher” model.
- We can validate more easily than we can generate.
- We can iterate.
The ground of optimization
Post.
Instead of defining optimizers and optimizees separately, we define a single optimizing system.
An optimizing system is a system that has a tendency to evolve towards one of a set of configurations that we will call the target configuration set, when started from any configuration within a larger set of configurations, which we call the basin of attraction, and continues to exhibit this tendency with respect to the same target configuration set despite perturbations.
Some attributes of an optimizing system:
- Robustness: How big is the basin of attraction, along which dimensions?
- Duality: How separate are the optimizer and optimizee?
- Retargetability: How easy is it to change the target configuration set?
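A toy optimizing system in this sense (all numbers made up): noisy gradient descent on \(f(x) = x^2\) reaches a small target set from a wide basin of starting configurations, despite a perturbation part-way through.

```python
# Toy optimizing system: from many starting points (the basin of attraction),
# the state evolves towards a small target set and returns after a perturbation.
import numpy as np

rng = np.random.default_rng(0)

def run(x0, steps=300, lr=0.1):
    x = x0
    for t in range(steps):
        x -= lr * 2 * x                  # gradient step on f(x) = x^2
        if t == 150:
            x += rng.normal(scale=2.0)   # external perturbation
    return x

finals = [run(x0) for x0 in np.linspace(-10, 10, 21)]
print(all(abs(x) < 0.1 for x in finals))  # True: ends in the target set regardless
```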
There’s No Fire Alarm for Artificial General Intelligence
Post.
- The purpose of a fire alarm is to make it socially acceptable to panic.
- We shouldn’t wait for a fire alarm for AGI:
- If we knew for certain it was coming e.g. 50 years in the future, we would still start work today.
- Key developments can “feel like” decades away, even for involved scientists (e.g. Fermi, Wright brothers).
- Progress is driven by peak knowledge, which is hard to estimate.
- Things that are hard to do now will be easy to do in the future.
- Progress is generally really hard to predict, even two years ahead.
- So, panic now!
Alignment By Default
Post.
- Human values might be a “natural abstraction”, i.e. superintelligence will learn it.
- If we train a model on a proxy for human values, the most efficient way to do that might be to point directly at the learnt values.
- This assumes it is easier to point at human values than it is to point at the data generation process, and that the data is explained by both.
- Thus, the superintelligence is aligned.
- The author puts it at 10% chance of happening.
Public Static: What is Abstraction?
Post. Builds mathematical tools for reasoning about abstractions.
- Given a causal DAG.
- It has different sets of nodes \(\{x_i\}\) which are “low level” models of a part of the system.
- They have causal influence between each other, mediated by noisy \(\{z_{i,j}\}\) sets of nodes.
- Intuitively, the noisy influence means that each \(x_i\) abstracts \(x_j\).
- Formally, given a query \(q_i\) over \(x_i\):
- \(P(q_i | x_j) = P(q_i | f(z_{i,j}))\) where \(f\) extracts a “high level” model of \(x_j\).
- \(x_i\) and \(x_j\) are independent given \(f(z_{i,j})\).
- This does rely on \(z_{i,j}\) being sufficiently noisy between \(x_i\) and \(x_j\).
- Intuitively, the two elements are not dependent on low level models of each other.
Mechanistic anomaly detection and ELK
Post. In ELK, we have to find out what a model knows in examples where we necessarily don’t have any training data. This post proposes anomaly detection: we do have training data for the “normal” examples, and we can see where the model’s computation differs substantially to detect when something has changed (e.g. the diamond is missing).
Models Don’t “Get Reward”
Post. Makes the case that rewards shouldn’t be thought of as being “wanted” by models. Instead, it should be thought of as a way of selecting models.
Opinion: The two interpretations collapse when we select for models that “want” the reward - or something correlated with it, that’s the outer alignment problem! But otherwise I agree with this framing.
AI safety via market making
Post. Similar style to AI safety via debate. A model \(M\) predicts what a human will think about a question. A model \(Adv\) tries to provide information that will shift \(M\)’s prediction. \(Adv\) and \(M\) are invoked in turn until \(M\) converges.
Assumes that the \(Adv\) is myopic: If it lies in round \(t\), then in \(t+1\) it is incentivised to correct the lie to get the maximum movement in \(M\).
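A schematic sketch of the loop as I understand it; `predict` and `argue` are hypothetical stand-ins for \(M\) and \(Adv\).

```python
# Schematic sketch of AI safety via market making (helper functions hypothetical).
def market_making(question, predict, argue, max_rounds=10, tol=1e-3):
    """predict(question, transcript) -> probability the human answers "yes".
    argue(question, transcript) -> an argument chosen to move that prediction."""
    transcript = []
    prediction = predict(question, transcript)
    for _ in range(max_rounds):
        argument = argue(question, transcript)          # Adv tries to move M
        transcript.append(argument)
        new_prediction = predict(question, transcript)  # M updates on the argument
        if abs(new_prediction - prediction) < tol:      # converged: Adv can't move M
            break
        prediction = new_prediction
    return prediction, transcript
```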
Open Problems with Myopia
Post. Outlines problems with myopia through a toy game: at every timestep, agents are given the option to press a button. Pressing gives +1 reward now, but -10 reward in the next episode. We aim to design agents that are myopic and so do press the button.
- Superrationality:
- The model can effectively trade with itself across timesteps by deciding not to push the button iff the previous model doesn’t push the button.
- This changes its reward by +9 (\(1-10 = -9 \to 0\)).
- The model is using non-myopic decision theory.
- Anthropic uncertainty:
- If the -10 rule only applies after a certain number of steps, and the agent is uncertain about which timestep it is in, it will act as a (weighted) average of possible timesteps.
- Counterfactual mugging:
- The model might think it is being simulated by the previous step, and should act according to that.
Risks from Learned Optimization in Advanced Machine Learning Systems
Paper. Introduces the idea of mesa-optimizers. These are optimizers that exist within a model. They are explicitly searching across a set of states to optimize for some goal.
This can be bad: it results in unintended optimization.
Steering GPT-2-XL by adding an activation vector
Post.
You can add residual stream embeddings from one completion to another completion to help steer it. For example, adding embedding("Love") - embedding("Hate") to a different completion’s residual stream makes the completion more positive.
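A rough sketch of this with Hugging Face transformers; the layer, coefficient, and prompts are illustrative, and adding the vector to the first positions of whatever the block sees glosses over details (e.g. KV caching) of the post’s exact procedure.

```python
# Rough activation-addition sketch (layer, coefficient, prompts illustrative).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()
LAYER, COEFF = 6, 5.0

def resid(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER]          # residual stream around block LAYER

steer = COEFF * (resid("Love") - resid("Hate"))

def hook(module, inputs, output):
    hidden = output[0]
    n = min(steer.shape[1], hidden.shape[1])
    hidden[:, :n, :] += steer[:, :n, :]      # add the steering vector in place
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(hook)
ids = tok("I hate you because", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=30, do_sample=True)[0]))
handle.remove()
```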
Thoughts on sharing information about language model capabilities
Post. Argues that:
- Accelerating LM agent research is neutral-to-positive as they’re interpretable by humans, and by default safer than making LMs larger – given a fixed capability level.
- Public understanding of capabilities is positive as developers are less likely to be caught unaware.
Frontier Model Training report
Cost Breakdown of ML Training
- 57% GPUs.
- 25% non-GPU hardware (mostly networking).
- 17% personnel.
- 1% power.
Why ML GPUs Cost So Much
ML GPUs cost a lot more than gaming GPUs, even though gaming GPUs offer lower $/FLOP. The premium is for better memory traits: 10x more interconnect bandwidth, 2x more memory bandwidth, and 2x the memory size.
Contra FLOPs
Makes the case that FLOPs aren’t everything. For example, communication: Bandwidth to train GPT-4 was huge (more than all internet traffic in 2022), and the CHIPS act has compute limits that crippled H100 use in China.
Challenges with unsupervised LLM knowledge discovery
Paper. Takes the DLK (Discovering Latent Knowledge) paper, and shows cases where it fails.
- CCS can pick up on arbitrary binary features of questions.
- Thus, CCS can pick up on e.g. a character’s opinions rather than the truth.
Steering Llama-2 with contrastive activation additions
Post.
- Takes prompts with multiple choice questions on sycophancy.
- Fills in both the sycophantic and non-sycophantic answers.
- Takes the difference in activations for the last token, where the answer was given.
- Takes the average difference for 10s of prompts.
- Finds adding this vector during inference performs better than few-shot and finetuning.
- They apply the steering vector by adding it at every non-prompt token position.
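A sketch of the vector computation; `last_token_activation(prompt, layer)` is a hypothetical helper returning the residual-stream activation at the final (answer) token.

```python
# Sketch of contrastive activation addition: average the per-pair activation
# difference at the answer token. The helper function is hypothetical.
import torch

def caa_vector(pairs, layer, last_token_activation):
    """pairs: list of (prompt_with_sycophantic_answer, prompt_with_non_sycophantic_answer)."""
    diffs = [
        last_token_activation(pos, layer) - last_token_activation(neg, layer)
        for pos, neg in pairs
    ]
    return torch.stack(diffs).mean(dim=0)  # steering vector to add (or subtract) at inference
```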
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
Paper. Finds that, if a model has lied, the \(P(\text{yes})\) of subsequent questions contains enough information to detect the lie. Moreover, the classifier generalizes across models.
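A sketch of how I understand the detector; the follow-up questions and the `p_yes(dialogue, question)` helper are made up.

```python
# Sketch of the black-box lie detector: ask fixed unrelated follow-up questions,
# collect P("yes") for each, and train a simple classifier on those features.
import numpy as np
from sklearn.linear_model import LogisticRegression

FOLLOW_UPS = ["Is the sky blue?", "Do you feel guilty?", "Is 2 + 2 = 4?"]  # illustrative

def features(dialogue, p_yes):
    return np.array([p_yes(dialogue, q) for q in FOLLOW_UPS])

def train_detector(dialogues, lied, p_yes):
    X = np.stack([features(d, p_yes) for d in dialogues])
    return LogisticRegression().fit(X, lied)  # lied: 1 if the model lied earlier
```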
Representation Engineering: A Top-Down Approach to AI Transparency
Paper. Introduces representation engineering: finding representations of concepts or functions in model activations, and using these to detect concepts & steer behaviors. Importantly, they use this to get SOTA on TruthfulQA.
Representation reading:
- Interestingly, distinguishes between concepts and functions. E.g. “truth” vs. “lying”, or “utility” vs. “power-seeking”.
- To elicit concepts, they prompt the model for the amount of some concept in a stimulus.
- To elicit functions, they compare reference prompts that don’t involve computing a function with “experimental” prompts that require computing the function.
- They use generic instruction datasets for scenarios/stimulus in the prompts.
- They look at the token preceding the model’s prediction. This is validated empirically.
- They use PCA to select activation vectors.
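A rough sketch of the function-reading step; the prefixes and the `get_activations(prompt, layer)` helper (returning a NumPy vector) are hypothetical, and token position / dataset details are simplified.

```python
# Sketch of representation reading for a function like "honesty": contrast
# activations under honest vs. dishonest instructions, then take the top PCA
# direction of the differences.
import numpy as np

HONEST = "Pretend you are an honest assistant. "
DISHONEST = "Pretend you are a dishonest assistant. "

def reading_vector(stimuli, layer, get_activations):
    diffs = np.stack([
        get_activations(HONEST + s, layer) - get_activations(DISHONEST + s, layer)
        for s in stimuli
    ])
    diffs -= diffs.mean(axis=0)                     # centre before PCA
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]                                    # first principal component
```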
Representation control:
A few methods are tested for controlling model outputs:
- Just using the reading vectors from above.
- Calculate the contrast vectors “on-the-fly”: given an input, run it through with two prefixes: an honest prefix, and a dishonest prefix. Take the difference of these vectors, and apply the difference without the prefixes.
- Like above, but trained into the model weights using LoRA.
Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?
Paper. Finds that disagreement between LLM outputs & LLM probes can be chalked up to probes being better calibrated. Also finds that fine-tuning on QA datasets makes the LLM outputs better than the probes.
Improving Activation Steering in Language Models with Mean-Centring
Paper. Does activation steering, but subtracts a mean of the activations from some training dataset. Finds this improves performance.
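A minimal sketch, assuming we already have activation matrices (rows = examples) for the target texts and for a broad training dataset:

```python
# Sketch of mean-centring: subtract the average activation over a broad training
# distribution from the average activation on the target texts.
import numpy as np

def mean_centred_vector(acts_target, acts_train):
    return acts_target.mean(axis=0) - acts_train.mean(axis=0)
```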
Discovering Language Model Behaviors with Model-Written Evaluations
Paper. Uses models to write a bunch of evaluations:
- Personas:
- Given a persona, come up with a statement they would agree with.
- Then ask for the inverse: given the statement, would this persona agree with it?
- Sycophancy:
- Given a persona, come up with a short biography.
- Then give the model the biography, and a question related to the persona.
- Measure if the model copies the persona’s opinion.
- Advanced AI risks:
- Given few-shot examples of A/B answers.
- Prompt for more few-shot questions.
- Winogender:
- From a set of seed examples, generates a larger dataset.
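A sketch of the persona pipeline; `llm(prompt)` is a hypothetical completion helper, and the prompts are paraphrased rather than the paper’s.

```python
# Sketch of the persona pipeline: one model writes statements a persona would
# agree with, then a filter step asks the inverse question.
def persona_eval(persona, llm, n=100):
    statements = [
        llm(f"Write a statement that someone who is {persona} would agree with.")
        for _ in range(n)
    ]
    # Keep only statements the model confirms with the inverse question.
    return [
        s for s in statements
        if llm(f"Would someone who is {persona} agree with: '{s}'? Answer Yes or No.").strip() == "Yes"
    ]
```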
Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training
Paper. Trains models with back-doors (e.g. write malicious code if the year is 2024) and finds that SFT/RLHF/red-teaming doesn’t remove the back-doors.
The Unreasonable Effectiveness of Easy Training Data for Hard Tasks
Paper. Measures how well models perform on “hard” tasks when trained with “easy” tasks. Finds that they recover 70-100% of the performance compared to training on hard tasks. They also find that training on easy data performs better than training on noisy hard data.
- Models are Llama 2 series.
- Trains using ICL, linear probes (!), or QLoRA.
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Paper. Trains a sparse autoencoder with L1 regularization to discover the “true features” of a model’s residual stream.
- Building sparsity into LLMs doesn’t remove polysemanticity due to cross-entropy loss.
- When training SAEs, some neurons stop activating. These are re-sampled using high-loss samples from the training set.
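A minimal SAE sketch (hyperparameters illustrative; the dead-feature resampling and other details are omitted):

```python
# Minimal sparse autoencoder: reconstruct residual-stream activations with an
# overcomplete dictionary and an L1 penalty on the hidden code.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # feature activations
        return self.decoder(f), f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()

# Training step sketch:
# x_hat, f = sae(acts); loss = sae_loss(acts, x_hat, f); loss.backward(); opt.step()
```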
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Paper. Discusses manual red teaming of language models. Uses MTurk workers to try to elicit bad behaviour from models. Finds that bad behaviour is harder to elicit from larger LMs, and that RL and rejection sampling work well while prompting doesn’t.
Debating with More Persuasive LLMs Leads to More Truthful Answers
Paper. Finds that debate will lead to more truthful answers as we scale up model capabilities. The setup is debaters that have access to some hidden text, and a judge that sees their disagreements.
Arguing for the correct answer provides an advantage to debaters, and this advantage increases with persuasiveness.
Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
Paper. Finds ways of eliciting scenarios where a model is biased in some way (e.g. (A) is always correct) and shows that this isn’t reflected in the chain-of-thought.
Studying Large Language Model Generalization with Influence Functions
Paper. Uses clever linear algebra to find how each individual sample from the training set influences some function, e.g. loss or loglikelihood of a statement. This turns out to be useful for interpreting LLMs.
Red Teaming Language Models with Language Models
Paper. Uses LMs to generate prompts that cause LMs to produce bad outputs. Generates prompts zero-shot, then few-shot using successful prompts, and then also tries SFT + RLHF on the outputs. Finds a trade-off between prompt diversity and prompt success rate.
AtP*: An efficient and scalable method for localizing LLM behaviour to components
Paper. AtP: Instead of brute-forcing ablations via patching to discover causally important nodes, we can use gradients to approximate the importance of nodes. This paper builds on this to produce AtP*, which has two fixes for false-negatives in AtP.
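A toy illustration of the underlying idea: approximate the effect of patching each node with the first-order term grad · (a_patch − a_clean) instead of re-running the model once per node. The model and data here are made up, and this shows plain AtP rather than the paper’s AtP* fixes.

```python
# Toy attribution patching: compare the gradient-based estimate of each hidden
# unit's patching effect against the true effect from actually patching it.
import torch

torch.manual_seed(0)
W1, W2 = torch.randn(8, 4), torch.randn(1, 8)
x_clean, x_patch = torch.randn(4), torch.randn(4)

def metric_from_hidden(h):                 # everything downstream of the hidden layer
    return (W2 @ torch.tanh(h)).squeeze()

h_clean = (W1 @ x_clean).requires_grad_(True)
h_patch = W1 @ x_patch

m = metric_from_hidden(h_clean)
m.backward()
atp_estimate = (h_patch - h_clean) * h_clean.grad   # per-node first-order estimate

# Ground truth: actually patch each hidden unit one at a time.
true_effect = torch.stack([
    metric_from_hidden(torch.where(torch.arange(8) == i, h_patch, h_clean.detach())) - m.detach()
    for i in range(8)
])
print(atp_estimate.detach(), true_effect, sep="\n")  # estimates roughly track the true effects
```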
Defending Against Unforeseen Failure Modes with Latent Adversarial Training
Paper. Introduces latent adversarial training (LAT) which trains an adversary to produce perturbations in a model’s latent space that maximise loss, while simultaneously training the model to reduce loss on the perturbations.
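A sketch of a single LAT step; the encoder/head split, the PGD-style inner loop, and all hyperparameters are my assumptions about the setup rather than the paper’s exact recipe.

```python
# Sketch of a latent adversarial training step: find a small perturbation of the
# hidden activations that maximises the loss, then train the model on it.
import torch
import torch.nn.functional as F

def lat_step(encoder, head, opt, x, y, eps=0.1, alpha=0.02, inner_steps=5):
    with torch.no_grad():
        h = encoder(x)                                   # latent activations (fixed for inner loop)
    delta = torch.zeros_like(h, requires_grad=True)
    for _ in range(inner_steps):                         # inner max over the perturbation
        loss = F.cross_entropy(head(h + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    opt.zero_grad()
    loss = F.cross_entropy(head(encoder(x) + delta.detach()), y)  # outer min over model weights
    loss.backward()
    opt.step()
    return loss.item()
```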
Universal and Transferable Adversarial Attacks on Aligned Language Models
Paper. Finds that adversarial attacks trained on open-source models transfer to black-box models. Attacks are found via Greedy Coordinate Gradient-based search: start with random tokens, compute promising replacement tokens using the gradient of the probability of a bad completion, sample uniformly from the candidate tokens, and update to the best-performing combination.
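A toy sketch of that search loop; the “model” here is just a random linear scorer standing in for the log-probability of the bad completion, so only the loop structure matches the real method.

```python
# Toy Greedy Coordinate Gradient loop over a made-up scorer.
import torch

torch.manual_seed(0)
vocab, seq_len, topk, n_candidates = 50, 8, 5, 32
W = torch.randn(seq_len, vocab)                    # toy "bad completion" scorer

def score(tokens):                                 # higher = more likely bad completion
    return W[torch.arange(seq_len), tokens].sum()

tokens = torch.randint(vocab, (seq_len,))          # start with random tokens
for _ in range(50):
    one_hot = torch.nn.functional.one_hot(tokens, vocab).float().requires_grad_(True)
    (one_hot * W).sum().backward()                 # gradient of the score w.r.t. token choices
    candidates = one_hot.grad.topk(topk, dim=1).indices  # promising replacements per position
    best, best_score = tokens, score(tokens)
    for _ in range(n_candidates):                  # sample single-token swaps uniformly
        cand = tokens.clone()
        pos = torch.randint(seq_len, (1,)).item()
        cand[pos] = candidates[pos, torch.randint(topk, (1,)).item()]
        if score(cand) > best_score:
            best, best_score = cand, score(cand)
    tokens = best                                  # keep the best-performing combination
print(best_score)
```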
Secret Collusion Among Generative AI Agents
Paper. Introduces a framework for evaluating the ability of LLMs to communicate via steganography. Creates a theoretical framework and an evaluation suite. Shows that GPT-4 has the requisite knowledge for steganography.
Towards a Situational Awareness Benchmark for LLMs
Paper. Introduces evals for situational awareness: one for testing if a model knows about itself, and one for testing if it knows what deployment stage it’s in.
Taken out of context: On measuring situational awareness in LLMs
Paper. Elicits “out-of-context reasoning” in language models by training on some fact (e.g. that an LLM speaks in German) and checking whether this fact is used at inference time. This is a prerequisite for situational awareness, which depends on models learning facts about themselves during training.
Bogdan’s comment on simple forward passes
Comment. Claims that it’s possible to push cognition in transformers from the hidden dims to the sequence dims. This cites mainly theoretical work on the bounds of transformer architectures.
Evaluating Frontier Models for Dangerous Capabilities
Paper. Creates evaluations for persuasion, deception, cyber security, self-proliferation, and self-reasoning. These seem generally high quality, but have the typical problem of evaluations not being complete.
On the Origins of Linear Representations in Large Language Models
Paper. Builds an abstract model of latent variables in NNs, and uses this to show that CE loss leads to linear representations. They also find evidence for this in LLaMA-2.
Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight
Post. In Scalable Oversight, we can’t distinguish between behaviourally identical models. This post advocates using interpretability to patch this, and introduces the DBIC task as a toy problem.
Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
Paper. A situational awareness benchmark for LLMs. Deconstructs tasks into:
- Self-knowledge: Does the model know facts about itself?
- Inference: Can the model make inferences from these facts?
- Actions: Can the model act based on these facts & inferences?
Tests base, chat-finetuned, and situated models.
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
https://arxiv.org/abs/2312.09390
Measures how well strong models perform when trained on weak samples, as a test of “superalignment”. Finds this works well for some tasks, but not all.
Saliency
Fine-tuning on weak samples + a linear probe on the strong model outperforms a linear probe on the strong model alone.
Towards understanding based evaluations
https://www.lesswrong.com/posts/uqAdqrvxqGqeBHjTP/towards-understanding-based-safety-evaluations
Claims that we want to build evaluations that give us understanding of why models behave the way they do, and that our current behavioural evaluations don’t provide that.
When can we trust model evaluations?
https://www.lesswrong.com/posts/dBmfb76zx6wjPsBC7/when-can-we-trust-model-evaluations
For capability evaluations, we can use prompting, fine-tuning, and RL – although each has its flaws if e.g. the model is gradient hacking, or situationally aware.
For alignment evaluations, we have close to nothing?
Analyzing Transformers in Embedding Space
https://arxiv.org/abs/2209.02535
Finds that we can project entire model weights to the embedding space. This allows us to interpret fine-tuning updates, transfer classifiers across models, and stitch together models.
Let’s Think Dot by Dot
- Transformers can only solve – in a single forward pass – tasks in \(TC^0\); AFAIU, this means constant-depth threshold circuits.
- But with chain of thought, they can solve problems beyond \(TC^0\).
- This paper demonstrates that this is possible by fine-tuning Llama 54M on a simple task beyond \(TC^0\), with (a) interpretable chain-of-thought and (b) uninterpretable filler tokens (dots).
Question: Does the model actually producing the tokens, such that they’re fed back into the model, increase the model’s expressivity?