The Building Blocks of Interpretability
Christian Yoon
- A Sparse Autoencoder (SAE) is a tool used to identify interpretable features inside neural networks.
- These features are learned directions in activation space (combinations of neurons) whose activations represent meaningful patterns like “dog,” “texture,” or “shape.”
- The main goal is to make sense of what a neural network has learned by turning complex, dense activations into clear, human-understandable components.
Sparse Autoencoders
Imagine you're trying to understand what makes a song catchy by breaking it down into individual elements, like the beat, the melody, and the lyrics, rather than just listening to the whole thing at once. That's essentially what sparse autoencoders do with AI models: they help us peek inside these black boxes and see what specific concepts or features the AI has learned. We're going to explore how this technique is helping researchers finally understand what's actually happening inside neural networks.
Problems with Complex ML Models
- In deep models, neuron activations are dense: almost every neuron activates for many different inputs.
- Each neuron encodes multiple overlapping meanings, a phenomenon called superposition.
- Because of this, it’s hard to tell what any single neuron actually represents.
- Sparse autoencoders help separate these overlapping signals into more interpretable features.
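A toy numerical sketch of superposition (the numbers here are made up for illustration, not taken from any real model): three concepts share only two neurons, so the dense neuron activations are ambiguous about which concepts are present.

```python
import numpy as np

# Hypothetical toy example: 3 concepts stored in only 2 neurons.
# Each concept is a direction in activation space; the directions overlap,
# so any single neuron fires for several concepts at once.
concept_directions = np.array([
    [1.0, 0.2],   # concept A leans on neuron 0, slightly on neuron 1
    [0.2, 1.0],   # concept B leans on neuron 1
    [0.7, 0.7],   # concept C uses both neurons equally
])

# An input containing concepts A and C (B absent) yields dense activations:
concept_strengths = np.array([1.0, 0.0, 1.0])
neuron_acts = concept_strengths @ concept_directions

print(neuron_acts)  # [1.7 0.9]: both neurons fire, but for which concepts?
```

Reading `neuron_acts` alone, you cannot tell whether neuron 0 fired for concept A, concept C, or both — that is exactly the signal an SAE tries to disentangle.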
Core Idea of Sparse Autoencoders
- A sparse autoencoder learns a new representation in which most components are zero (sparse).
- Each nonzero component ideally corresponds to a specific concept.
- The encoder maps activations into this sparse, higher-dimensional space; the decoder reconstructs the original activations from it.
- The model is trained to rebuild the activations accurately while keeping the representation as sparse as possible.
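As a minimal sketch of the encode/decode round trip — with assumed dimensions, untrained random weights, and plain numpy in place of a deep-learning framework:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_feat = 4, 16             # feature space wider than the activations
W_enc = rng.normal(size=(d_model, d_feat)) * 0.1
b_enc = np.zeros(d_feat)
W_dec = rng.normal(size=(d_feat, d_model)) * 0.1
b_dec = np.zeros(d_model)

def encode(x):
    # ReLU keeps feature activations non-negative; after training with a
    # sparsity penalty, most of them become exactly zero for any given input.
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(f):
    # Linear reconstruction of the original activation vector.
    return f @ W_dec + b_dec

x = rng.normal(size=d_model)        # a model activation vector (4 dims)
features = encode(x)                # feature vector (16 dims, sparse when trained)
x_hat = decode(features)            # reconstruction (4 dims)
```

The key design choice is that the feature space is wider than the activation space, giving the overlapping concepts room to separate into their own dimensions.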
Core Idea of Sparse Autoencoders (continued)
- SAEs have two parts:
  - Encoder: expands input activations into a new, wider feature space.
  - Decoder: reconstructs the original activations from that space.
- During training, we minimize two terms:
  - Reconstruction loss → how close the decoded output is to the original activations.
  - Sparsity penalty → encourages most feature activations to be zero.
- This balance ensures the model captures meaningful information without clutter.
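The two training terms can be written out directly. This is a sketch using mean-squared reconstruction error and an L1 sparsity penalty; the coefficient `l1_coeff` and all input numbers are illustrative choices, not prescribed values.

```python
import numpy as np

def sae_loss(x, x_hat, features, l1_coeff=1e-3):
    # Reconstruction loss: how closely the decoder rebuilt the activations.
    recon = np.mean((x - x_hat) ** 2)
    # Sparsity penalty: the L1 norm pushes most feature activations to zero.
    sparsity = l1_coeff * np.sum(np.abs(features))
    return recon + sparsity

# Made-up numbers for illustration:
x        = np.array([1.0, -0.5])
x_hat    = np.array([0.9, -0.4])
features = np.array([0.0, 2.0, 0.0, 1.0])   # mostly zero, as desired
loss = sae_loss(x, x_hat, features)
print(loss)  # ≈ 0.013: 0.01 reconstruction error + 0.003 sparsity penalty
```

Raising `l1_coeff` trades reconstruction accuracy for sparser (and often more interpretable) features — that is the balance the slide describes.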