Correlated failures and coordinated misalignment

If we have a stack of unreliable HDDs, we can combine them into a single reliable logical unit using RAID.

If we have a potentially insecure binary, we can run it in a virtual machine, isolating it from the rest of the world and maintaining the security of our network.

If we have a stack of unreliable LLMs, responsible for ever larger parts of our economy, how can we make these reliable?

In a system, reliability-inducing components take unreliable components and make them reliable.

In the case of RAID, we exploit the fact that disk failures are uncorrelated. If we copy the data across four disks (for example), and each disk has a failure rate of 1% per year, we lose data only when all four fail in the same year: a rate of 0.000001%.
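This arithmetic is easy to check in a few lines. A minimal sketch, assuming fully independent disk failures and that data is lost only if every mirror fails in the same year (the function name is just for illustration):

```python
def mirrored_failure_rate(per_disk_rate: float, n_disks: int) -> float:
    """Annual probability of losing data: every mirrored disk must fail,
    and failures are assumed to be fully independent."""
    return per_disk_rate ** n_disks

rate = mirrored_failure_rate(0.01, 4)
print(f"{rate:.0e}")  # prints 1e-08, i.e. 0.000001%
```

The independence assumption is doing all the work here, which is exactly the point of this post.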

In the case of insecure binaries, we exploit the interface of the insecure binary. If we limit its inputs to keystrokes when the window is selected, and its outputs to pixels on a screen, then this rules out an extremely large chunk of failure cases.
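One way to picture this interface limiting is as a type signature: the untrusted code is only ever handed keystrokes, and its only output channel is a frame of pixels. A minimal sketch, with placeholder names and a stubbed-out binary standing in for the real untrusted component:

```python
from typing import Callable

Keystroke = str
Frame = list[list[int]]  # a grid of pixel intensities

def run_sandboxed(untrusted: Callable[[Keystroke], Frame], key: Keystroke) -> Frame:
    # The untrusted code receives only a keystroke and can only return a frame.
    # No network handle, filesystem, or other capability is ever passed in.
    return untrusted(key)

def untrusted_binary(key: Keystroke) -> Frame:
    # Stub for the real untrusted component.
    return [[ord(key) % 256]]

frame = run_sandboxed(untrusted_binary, "a")
```

Real sandboxes enforce this at the OS or hypervisor level rather than the type level, but the principle is the same: shrink the interface until most failure cases simply cannot be expressed through it.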

One method for increasing the truthfulness of unreliable LLMs is debate: get two LLMs to argue opposite sides of an argument, and have a human (or another LLM) judge the outcome. This induces reliability by exploiting two slightly different properties: first, that the LLMs' failures are anti-correlated, since they have inverse incentives; and second, that debate generally favours truth. There are of course ways to convince someone of falsehoods in a debate, but there is some signal among the noise, and the debate structure attempts to elicit it.
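The debate structure can be sketched as a simple loop. Everything here is a stub: `ask_model` stands in for a real LLM call, and the judge stands in for a human or another model; the names and control flow are assumptions, not a reference implementation.

```python
def ask_model(role: str, question: str, transcript: list[str]) -> str:
    # In a real system this would call an LLM API, conditioned on the
    # transcript so far. Stubbed here so the control flow is runnable.
    return f"[{role}] argument about: {question}"

def debate(question: str, rounds: int = 2) -> list[str]:
    """Alternate a fixed number of rounds between two debaters with
    inverse incentives: one argues for, the other against."""
    transcript: list[str] = []
    for _ in range(rounds):
        for role in ("pro", "con"):
            transcript.append(ask_model(role, question, transcript))
    return transcript

def judge(transcript: list[str]) -> str:
    # Placeholder verdict; a real judge reads the transcript and decides.
    return "pro" if transcript else "con"

verdict = judge(debate("Is the claim X true?"))
```

Note that the anti-correlation lives entirely in the two debaters having opposite goals; if they share the same blind spots, the loop above runs just the same while the signal quietly disappears.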

Think of your system as a graph. Some nodes are unreliable, while some are reliability-inducing. The unreliable parts are often where the power of your system lies - you can’t build a product without storing data! - but wouldn’t work well without the injection of reliability.

The above examples induce reliability by balancing correlations, limiting interfaces, and eliciting biases (debate is biased towards truth).

But each additional reliability-inducing component also introduces some unreliability of its own: another moving gear, another place things can go wrong. Your RAID implementation could have a bug. Your VM could have a security hole. Debate might favour truth when GPT-4 debates, but favour persuasive speeches that draw on human biases when GPT-5 does. It’s a balancing act.

What else can go wrong?

You use RAID to keep redundant copies of data across four disks, hosting a small website. The site lands on the front page of Reddit and traffic picks up. The disks keep up, but some code starts repeatedly writing to the same part of the virtual RAID disk. This strains all the HDDs in the same way, and they start to fail in quick succession.

You’re running untrusted software that was written in a country known for its state-sponsored cyber attacks. But that’s OK, you’re running it in a VM - developed by a company that has a satellite office in the same country. You notice some strange logs on your internal network monitoring tools.

You set up debate between two LLMs. Sure, they’re fine-tuned from the same base model, but they’re fine-tuned on different datasets using different techniques. During testing, you notice that they’ve convinced the judge of subtly incorrect answers. That’s strange - it would’ve been easy for one LLM to point out the mistake in the other LLM’s argument.

All these problems are caused by correlated failures. When designing secure, reliable and safe systems, we unfortunately can’t always trust that failures will be uncorrelated.

When trying to build safe LLM systems, we have a number of reliability-inducing tools to bring out: balancing correlations, limiting interfaces, and eliciting helpful biases.

If our goal is to engineer safe systems, we need to be careful that (1) we’re not introducing unnecessary additional unreliability and (2) our failures are as uncorrelated as possible.