Welcome!
I’m currently at Anthropic, working on evaluating misalignment risk. I previously worked at Conjecture, and at Google on a team collaborating with DeepMind.
- [2024-10-19] Sabotage evaluations for frontier models
- [2024-04-01] Scaling calculator
- [2024-04-01] Sparse autoencoders depend too much on theories
- [2024-03-15] Breaking down Situational Awareness
- [2024-02-24] How well do truth probes generalise?
- [2024-02-21] Jailbreaking GPT-4 with the tool API
- [2024-02-15] Conway’s law and coding assistants
- [2022-12-31] AI x-risk model (perpetual draft)
- [2022-10-18] Distilled Representations Research Agenda
- [2022-10-12] Preventing steganography in latent vectors
- [2022-09-25] Feature visualisation
- [2021-03-13] vaxtldr.uk: England vaccination progress TL;DR
- [2020-02-17] Introducing KIPA: Distributed Key to IP Address translation
- [2018-11-21] Answering Visual What-If Questions: From Actions to Predicted Scene Descriptions
I often write to think things through, but I don’t then rigorously critique or tidy that writing, so these posts convey ideas I hold with low confidence. I still like to link them:
- [2024-03-17] In research, optimise for bitrate
- [2024-02-28] Correlated failures and coordinated misalignment
- [2024-02-28] Concerns with RSPs
- [2024-01-07] Evals are as hard as product development
- [2023-05-13] Trust
I’m just living off thoughts and feelings, with a weak and a lazy mind