Christian Yoon
Input goes in and output comes out, but the inner workings are hidden
Models make accurate predictions, but we can't see the steps in between
This makes them hard to trust and debug
Using an SAE is like opening up the vending machine's panel to see how it works
They break model activations into interpretable components
Each “part” (or feature) activates for a specific concept, letting us see what the model notices
SAEs help turn the black box into something we can inspect, label, and understand
In complex/deep models, neuron activations are dense
Almost all neurons activate for many inputs
Hard to tell what any single neuron actually represents
SAEs help us separate these signals
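To make "dense" concrete, here's a toy PyTorch sketch. The numbers are random, standing in for a real model's hidden layer, but they show the problem: most neurons fire on a huge fraction of inputs, so no single neuron cleanly tracks one concept.

```python
# Toy sketch: random numbers standing in for real hidden-layer activations
import torch

torch.manual_seed(0)
acts = torch.randn(1000, 512)  # pretend: 1000 inputs x 512 hidden neurons

# Fraction of inputs each neuron "fires" on (activation above zero)
fires = (acts > 0).float().mean(dim=0)
print(f"median neuron fires on {fires.median().item():.0%} of inputs")  # ~50%
```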
SAEs have 2 parts:
Encoder: maps input activations into a new, higher-dimensional feature space.
Decoder: reconstructs the original input from that space.
A sparse autoencoder learns a new representation where most components are zero (sparse).
Each nonzero component ideally corresponds to a specific concept.
The encoder maps activations to this sparse, high-dimensional space, and the decoder reconstructs the original activations.
The model is trained to rebuild accurately while keeping the representation as sparse as possible.
Don't worry if not much of this makes sense right now; there's a quick code sketch below, and I'll show visuals later!
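Here's a minimal PyTorch sketch of that encoder/decoder structure. The dimensions are made up; real SAEs are trained on activations pulled from an actual model.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_feat: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_feat)  # activations -> features
        self.decoder = nn.Linear(d_feat, d_model)  # features -> reconstruction

    def forward(self, x: torch.Tensor):
        # ReLU keeps features non-negative; the sparsity penalty (next section)
        # is what pushes most of them all the way to zero
        features = torch.relu(self.encoder(x))
        x_hat = self.decoder(features)  # rebuild the original activations
        return x_hat, features
```

Note that d_feat is bigger than d_model on purpose: the feature space is high-dimensional so each concept can get its own slot.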
These 2 concepts are super important:
Reconstruction loss → how close the decoded output is to the original activations.
Sparsity penalty → encourages most feature activations to be zero.
In layman's terms, SAEs try to find the simplest way to represent what a model's neurons are doing!
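Here's a minimal sketch of both loss terms, assuming the SparseAutoencoder above (l1_coeff is a made-up hyperparameter that trades reconstruction quality against sparsity):

```python
import torch

def sae_loss(x, x_hat, features, l1_coeff: float = 1e-3):
    recon = (x_hat - x).pow(2).mean()             # reconstruction loss (MSE)
    sparsity = features.abs().sum(dim=-1).mean()  # L1 penalty pushes features to 0
    return recon + l1_coeff * sparsity

sae = SparseAutoencoder()
x = torch.randn(8, 512)  # stand-in for real model activations
x_hat, features = sae(x)
sae_loss(x, x_hat, features).backward()
```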
[Figure: "Dog" and "Cat" inputs each produce a latent vector; the diagram highlights the features most correlated to dog and the features most correlated to cat]
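In code, that inspection step might look like this. It's a hedged sketch: dog_acts is random here, but in practice it would be the model's activations on a dog input, and the SAE would already be trained.

```python
import torch

sae = SparseAutoencoder()    # untrained here; imagine it's been trained
dog_acts = torch.randn(512)  # stand-in for the model's activations on "dog"
_, features = sae(dog_acts)

# After training, only a handful of features would be nonzero; those are the
# "features most correlated to dog" in the figure above
vals, idxs = features.topk(5)
for i, v in zip(idxs.tolist(), vals.tolist()):
    print(f"feature {i}: {v:.3f}")
```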