A Walkthrough of Toy Models of Superposition

Dec 27

In a collaboration with Jess Smith, we read through the Anthropic paper Toy Models of Superposition, discussing it and giving intuitions and high-level takeaways. Watch it here and check out the original paper here. An explainer I wrote may be a helpful reference.

This walkthrough mostly focuses on high-level ideas and themes. Let me know if you want a part 2 that goes through the rest of the paper in detail!

We had some audio and connection issues while recording this, so sorry for any disruptions to the viewing experience! And thanks to Jess for a valiant effort in editing the video and cleaning things up.

Neel Nanda
