mishajw github notes

Reading list

Stack ranked
Backlog
New

Stack ranked

A Mathematical Framework for Transformer Circuits.

Backlog

Open Problems in Cooperative AI.
Density of States Estimation for Out-of-Distribution Detection.
CICERO.
Threat Model Literature Review - LessWrong.
Geometric Rationality.
AlphaFold 2 paper.
AGI debate.

New

Delay, Detect, Defend: Preparing for a Future in which Thousands Can Release New Pandemics by Kevin Esvelt
Orthogonality Thesis - Arbital
Musings on the Speed Prior - LessWrong
TODO: Find writing on why we wouldn’t expect consequentialism to be convergent.
TODO: Find writing on why uncertainty in values wouldn’t work.
DeepMind: Generally capable agents emerge from open-ended play - LessWrong
Mechanistic anomaly detection and ELK - LessWrong
The limited upside of interpretability - LessWrong
How can we develop transformative tools for thought?
Staring into the abyss as a core life skill - LessWrong
Shard Theory in Nine Theses: a Distillation and Critical Appraisal - AI Alignment Forum
Shard Theory: An Overview - LessWrong
The “Minimal Latents” Approach to Natural Abstractions - LessWrong
Why have Sex? Information Acquisition and Evolution.
The story of VaccinateCA - Works in Progress
How to do theoretical research, a personal perspective - LessWrong
[1906.01820] Risks from Learned Optimization in Advanced Machine Learning Systems
[2210.01892] Polysemanticity and Capacity in Neural Networks
Finite Factored Sets - LessWrong
[2111.06206] Towards Axiomatic, Hierarchical, and Symbolic Explanation for Deep Models
[2301.05062] Tracr: Compiled Transformers as a Laboratory for Interpretability