Welcome!
I’m currently at Anthropic, working on evaluating misalignment risk. I previously worked at Conjecture, and at Google on a team collaborating with DeepMind.
- [2024-10-19] Sabotage evaluations for frontier models
- [2024-04-01] Scaling calculator
- [2024-04-01] Sparse autoencoders depend too much on theories
- [2024-03-15] Breaking down Situational Awareness
- [2024-02-24] How well do truth probes generalise?
- [2024-02-21] Jailbreaking GPT-4 with the tool API
- [2024-02-15] Conway’s law and coding assistants
- [2022-12-31] AI x-risk model (perpetual draft)
- [2022-10-18] Distilled Representations Research Agenda
- [2022-10-12] Preventing steganography in latent vectors
- [2022-09-25] Feature visualisation
- [2021-03-13] vaxtldr.uk: England vaccination progress TL;DR
- [2020-02-17] Introducing KIPA: Distributed Key to IP Address translation
- [2018-11-21] Answering Visual What-If Questions: From Actions to Predicted Scene Descriptions
I often write to think things through, but I don’t then rigorously critique or tidy that writing, so these posts convey ideas I hold with low confidence. I still like to link them:
- [2024-03-17] In research, optimise for bitrate
- [2024-02-28] Correlated failures and coordinated misalignment
- [2024-02-28] Concerns with RSPs
- [2024-01-07] Evals are as hard as product development
- [2023-05-13] Trust
I’m just living off thoughts and feelings, with a weak and a lazy mind