[Summary] On the Biology of a Large Language Model
TL;DR

Large Language Models (LLMs) are often perceived as “black boxes,” making their decision-making and reasoning processes difficult to interpret. A novel method simplifies these complex models by replacing internal nonlinear layers with linear modules tailored to clearly understandable features. This approach reveals structured reasoning, planning behaviors, and even hidden intentions within the model’s computations.

Method

Interpreting LLMs is challenging because individual neurons often represent multiple, unrelated concepts simultaneously (polysemanticity). To address this, the approach creates a simplified “replacement model”, preserving most of the original model’s performance while enhancing interpretability through these steps:...
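As a rough illustration of the core substitution described above, the sketch below trains a small module, whose hidden units are sparse and intended to be individually interpretable, to reproduce a frozen MLP layer's outputs on cached activations. This is a minimal conceptual sketch, not the authors' implementation; all names, shapes, and the sparsity penalty are illustrative assumptions.

```python
# Minimal sketch (assumptions throughout): approximate one nonlinear MLP layer
# with a module whose hidden units are sparse feature activations.
import torch
import torch.nn as nn


class ReplacementFeatureLayer(nn.Module):
    """Maps residual-stream input -> sparse features -> approximate MLP output."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encode = nn.Linear(d_model, n_features)  # input -> candidate feature activations
        self.decode = nn.Linear(n_features, d_model)  # feature activations -> output approximation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = torch.relu(self.encode(x))         # sparse, nonnegative feature activations
        return self.decode(features)


def fit_replacement(original_mlp: nn.Module,
                    replacement: ReplacementFeatureLayer,
                    activations: torch.Tensor,
                    sparsity_coeff: float = 1e-3,
                    steps: int = 1000,
                    lr: float = 1e-3) -> None:
    """Train the replacement to mimic the frozen original layer on cached activations."""
    opt = torch.optim.Adam(replacement.parameters(), lr=lr)
    with torch.no_grad():
        targets = original_mlp(activations)           # what the original layer computes
    for _ in range(steps):
        opt.zero_grad()
        features = torch.relu(replacement.encode(activations))
        recon = replacement.decode(features)
        # Reconstruction error keeps the model's behavior; the L1 term pushes
        # each input to activate only a few features, aiding interpretability.
        loss = ((recon - targets) ** 2).mean() + sparsity_coeff * features.abs().mean()
        loss.backward()
        opt.step()
```

In this toy setup, interpretability comes from inspecting which sparse features fire on which inputs after fitting; how the actual method defines, trains, and connects such modules across layers is specified by the steps outlined above.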