[Summary] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

TL;DR: Transformer models often map multiple concepts to the same neuron, making it unclear which features they have learned. This work makes internal representations interpretable by using a sparse autoencoder layer to map neuron activations to concepts. The method extracts relatively monosemantic features, can steer transformer generation, and shows that 512 neurons can represent tens of thousands of features (a minimal sketch of such an autoencoder follows below).

Method: A major challenge in reverse engineering neural networks is the curse of dimensionality: as models grow, the latent space volume increases exponentially....
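As a rough illustration of the sparse-autoencoder idea (not the paper's exact implementation), the sketch below trains an overcomplete dictionary on neuron activations in PyTorch. The layer sizes, the ReLU encoder, and the L1 coefficient are illustrative assumptions; only the 512-neuron input width comes from the summary above.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Maps 512 neuron activations to an overcomplete set of candidate features."""

    def __init__(self, d_model: int = 512, d_features: int = 16384):  # sizes are assumptions
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # ReLU keeps feature activations non-negative; the L1 penalty below keeps them sparse.
        features = torch.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coef: float = 1e-3):
    # Reconstruction error plus a sparsity penalty on feature activations.
    mse = (reconstruction - x).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coef * sparsity

# Usage sketch: train on a batch of neuron activations collected from a transformer MLP layer.
sae = SparseAutoencoder()
activations = torch.randn(64, 512)  # stand-in for real activations
recon, feats = sae(activations)
loss = sae_loss(activations, recon, feats)
loss.backward()
```

The key design choice is the overcomplete hidden layer: because it has far more units than the 512 input neurons, each learned dictionary direction is free to specialize to a single concept, which is what makes the recovered features relatively monosemantic.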

March 15, 2025 · 2 min · 333 words