"Sparse Autoencoders Find Highly Interpretable Directions in Language Models" by Logan Riggs et al
This is a linkpost for "Sparse Autoencoders Find Highly Interpretable Directions in Language Models."

We use a scalable, unsupervised method called sparse autoencoders to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M), for both the residual stream and MLPs. We showcase monosemantic features, feature replacement for Indirect Object Identification (IOI), and use OpenAI's automatic interpretation protocol to demonstrate a significant improvement in interpretability.

Source: https://www.lesswrong.com/posts/Qryk6FqjtZk9FHHJR/sparse-autoencoders-find-highly-interpretable-directions-in

Narrated for LessWrong by TYPE III AUDIO.
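As background, a sparse autoencoder of this kind is typically trained to reconstruct cached model activations under an L1 sparsity penalty, so that each activation is explained by a small number of dictionary features. Below is a minimal PyTorch sketch of that setup; the layer sizes, L1 coefficient, and variable names are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, activation_dim: int, dict_size: int):
        super().__init__()
        # Overcomplete dictionary: dict_size > activation_dim
        self.encoder = nn.Linear(activation_dim, dict_size)
        self.decoder = nn.Linear(dict_size, activation_dim)

    def forward(self, activations: torch.Tensor):
        # Sparse, non-negative feature coefficients
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def loss_fn(activations, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features
    mse = ((reconstruction - activations) ** 2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity

# Example: residual-stream activations of width 512 (Pythia-70M's hidden size),
# with an 8x overcomplete feature dictionary (an illustrative choice).
sae = SparseAutoencoder(activation_dim=512, dict_size=4096)
batch = torch.randn(64, 512)  # stand-in for cached LLM activations
recon, feats = sae(batch)
loss = loss_fn(batch, recon, feats)
loss.backward()
```

After training, each learned dictionary direction can be inspected (e.g. via the tokens that most strongly activate it) to judge how interpretable and monosemantic it is.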