SWC Neuroinformatics 2024
head-fixed mouse
image 50K neurons
single neuron activity
decode "angle > 45°?" averaged over trials
Encoding:
Decoding:
Reverse-engineering:
a single unit ("neuron"), linear:
$\hat{y} = b + \sum_i w_i x_i$
preactivation: $z = b + \sum_i w_i x_i$
activation function: $g(z) = \max(0, z)$
postactivation: $h = g\left(b + \sum_i w_i x_i\right)$
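As a concrete sketch (NumPy, with made-up weights, bias, and inputs), a single ReLU unit computed exactly as above:

```python
import numpy as np

# made-up inputs and parameters for one unit (purely illustrative)
x = np.array([0.5, -1.2, 3.0])   # inputs x_i
w = np.array([0.1, 0.4, -0.2])   # weights w_i
b = 0.05                         # bias

z = b + np.dot(w, x)             # preactivation: z = b + sum_i w_i x_i
h = np.maximum(0.0, z)           # postactivation: h = g(z) with g = ReLU
print(z, h)
```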
hyperbolic tangent: $g(z) = \tanh(z)$
rectified linear (ReLU): $g(z) = \max(0, z)$
sigmoid: $g(z) = \dfrac{1}{1 + e^{-z}}$
leaky ReLU: $g(z) = \max(\alpha z, z)$
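The same four activations written as NumPy functions (the leaky-ReLU slope α = 0.01 is an assumed default, not specified above):

```python
import numpy as np

def tanh(z):                       # hyperbolic tangent
    return np.tanh(z)

def relu(z):                       # rectified linear
    return np.maximum(0.0, z)

def sigmoid(z):                    # logistic sigmoid
    return 1.0 / (1.0 + np.exp(-z))

def leaky_relu(z, alpha=0.01):     # leaky ReLU with small negative slope
    return np.maximum(alpha * z, z)
```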
a single layer
(collection of neurons):
$h = g(Wx)$
a deep net
(sequence of layers):
$h^{(\ell)} = g\left(W^{(\ell)} h^{(\ell-1)}\right)$
$h^{(0)} = x$
$\hat{y} = W^{(L)} h^{(L-1)}$
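A minimal forward-pass sketch of such a feedforward net (assuming ReLU activations and arbitrary random weights and layer sizes, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

# arbitrary layer sizes for illustration: 10 -> 32 -> 32 -> 2
sizes = [10, 32, 32, 2]
Ws = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x, Ws):
    h = x                          # h^(0) = x
    for W in Ws[:-1]:
        h = relu(W @ h)            # h^(l) = g(W^(l) h^(l-1))
    return Ws[-1] @ h              # y_hat = W^(L) h^(L-1), linear readout

x = rng.normal(size=10)
print(forward(x, Ws).shape)        # (2,)
```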
feedforward net
multi-layer perceptron
...anything that can be topologically ordered!
(all DAGs)
convolutional net
recurrent net
autoencoder
ground truth input
reconstructed input
("surrogate loss")
ground truth target
predicted target
intermediate representation
supervised:
mean squared error (MSE): usually for regression
cross entropy (xent): usually for classification
unsupervised:
(L2) reconstruction error: usually for an autoencoder
contrastive loss: usually for self-supervised learning
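Per-datapoint sketches of the two supervised losses (NumPy; the function names and the softmax inside the cross entropy are my own choices):

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error: usually for regression."""
    return 0.5 * np.sum((y - y_hat) ** 2)

def cross_entropy(logits, target_class):
    """Cross entropy of a softmax over logits: usually for classification."""
    logits = logits - np.max(logits)                     # numerical stability
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return -log_probs[target_class]

print(mse(np.array([1.0, 2.0]), np.array([1.5, 1.0])))   # 0.625
print(cross_entropy(np.array([2.0, 0.5, -1.0]), 0))      # ~0.24
```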
Loss function for each datapoint:
$\ell(\hat{y}, y) = \frac{1}{2}(y - \hat{y})^2$
Let's consider the average loss across N training examples as a function of the weights:
$L(w_1, w_2) = \frac{1}{N} \sum_{i=1}^{N} \ell(\hat{y}_i, y_i)$
$\hat{y} = w_1 x_1 + w_2 x_2$
(equivalent to linear regression!)
Training corresponds to searching for a minimum: $\arg\min_{w_1, w_2} L(w_1, w_2)$
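A quick numerical check (made-up data) that the average MSE of this linear neuron is the linear-regression objective, so its minimizer is the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
X = rng.normal(size=(N, 2))                    # columns are x1, x2
w_true = np.array([1.5, -0.7])
y = X @ w_true + 0.1 * rng.normal(size=N)      # noisy linear targets

def avg_loss(w, X, y):
    y_hat = X @ w                              # y_hat = w1*x1 + w2*x2
    return 0.5 * np.mean((y - y_hat) ** 2)     # L(w1, w2)

# least squares minimizes exactly this objective
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_ls, avg_loss(w_ls, X, y))
```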
linear neuron, MSE
$L(w_1, w_2)$
deep neural network
(low-D projection)
$L(W^{(1)}, W^{(2)}, \ldots)$
Idea: Pick a starting point. Use local information about the loss function to decide where to move next.
The best local information is usually the direction of steepest decrease of the loss, equivalent to the negative of the gradient:
$L(w_1, w_2)$
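That idea as a loop, sketched for the linear-neuron MSE loss (learning rate, step count, and data are made-up choices):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.5, -0.7]) + 0.1 * rng.normal(size=200)

w = np.zeros(2)                            # pick a starting point
eta = 0.1                                  # learning rate (step size)

for step in range(100):
    y_hat = X @ w
    grad = -(X.T @ (y - y_hat)) / len(y)   # gradient of 0.5 * mean((y - y_hat)^2)
    w = w - eta * grad                     # step against the gradient
print(w)                                   # approaches the least-squares weights
```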
Think: A ball rolled from any point on the loss surface will find its way to the lowest possible height.
A sufficient condition for all minima to be global minima.
i.e., if minima exist (the loss is bounded below), then all minima are equally good.
(Or will roll to negative infinity!)
$L(w_1, w_2)$
(strictly) convex → unique global minimum
$L(w_1, w_2)$
$L(W^{(1)}, W^{(2)}, \ldots)$
nonconvex → local minima, saddle points, plateaux, ravines
(possible but not guaranteed)
What can't we do with a linear neuron?
No linear decision boundary can separate the purple and yellow classes!
(Impossible to find weights for the linear neuron that achieve low error.)
exclusive OR (XOR)
Adding just one multiplicative feature makes the problem linearly separable;
i.e., with this augmented "dataset", we can find weights that enable a linear neuron to solve the task.
Neural networks automate this process of feature learning!
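A sketch of the XOR trick: on the raw (x1, x2) inputs no linear unit works, but after appending the multiplicative feature x1·x2, hand-picked (illustrative) weights separate the classes:

```python
import numpy as np

# XOR: label is 1 when exactly one input is 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# augmented "dataset": append the multiplicative feature x1 * x2
X_aug = np.column_stack([X, X[:, 0] * X[:, 1]])

# one set of weights that works for a linear unit on (x1, x2, x1*x2)
w = np.array([1.0, 1.0, -2.0])
b = -0.5

pred = (X_aug @ w + b > 0).astype(int)
print(pred, (pred == y).all())             # [0 1 1 0] True
```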
Recall: steepest descent follows the negative direction of the gradient:
$w_1 \leftarrow w_1 - \eta \frac{\partial L}{\partial w_1}$
and similarly for the other weight.
$\nabla L(W^{(1)}, W^{(2)}, \ldots)$ is very complex! Can we do better than applying the chain rule to every weight separately?
forward pass:
(compute loss)
backward pass:
(propagate error signal)
This procedure generalizes to all ANNs because they are DAGs!
forward pass:
$z = W^{(1)} x$
$h = g(z)$
$\hat{y} = W^{(2)} h$
$\ell = \frac{1}{2} \lVert y - \hat{y} \rVert^2$
backward pass:
$\frac{\partial \ell}{\partial \ell} = 1$
$\frac{\partial \ell}{\partial \hat{y}} = \frac{\partial \ell}{\partial \ell} (\hat{y} - y)$
$\frac{\partial \ell}{\partial W^{(2)}} = \frac{\partial \ell}{\partial \hat{y}} h^{T}$
$\frac{\partial \ell}{\partial h} = W^{(2)T} \frac{\partial \ell}{\partial \hat{y}}$
$\frac{\partial \ell}{\partial z} = \frac{\partial \ell}{\partial h} \odot g'(z)$
$\frac{\partial \ell}{\partial W^{(1)}} = \frac{\partial \ell}{\partial z} x^{T}$
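The same forward and backward pass, sketched in NumPy for the two-layer net above (ReLU for g; sizes and data are made up), with each line matching one of the equations:

```python
import numpy as np

rng = np.random.default_rng(3)

def g(z):                                 # ReLU
    return np.maximum(0.0, z)

def g_prime(z):                           # its (sub)derivative
    return (z > 0).astype(float)

# made-up sizes: 4 inputs -> 8 hidden units -> 3 outputs
x = rng.normal(size=4)
y = rng.normal(size=3)
W1 = rng.normal(scale=0.5, size=(8, 4))
W2 = rng.normal(scale=0.5, size=(3, 8))

# forward pass (compute loss)
z = W1 @ x                                # z = W1 x
h = g(z)                                  # h = g(z)
y_hat = W2 @ h                            # y_hat = W2 h
loss = 0.5 * np.sum((y - y_hat) ** 2)     # l = 0.5 * ||y - y_hat||^2

# backward pass (propagate error signal)
d_y_hat = y_hat - y                       # dl/dy_hat
d_W2 = np.outer(d_y_hat, h)               # dl/dW2 = dl/dy_hat h^T
d_h = W2.T @ d_y_hat                      # dl/dh = W2^T dl/dy_hat
d_z = d_h * g_prime(z)                    # dl/dz = dl/dh . g'(z)
d_W1 = np.outer(d_z, x)                   # dl/dW1 = dl/dz x^T

print(loss, d_W1.shape, d_W2.shape)       # scalar, (8, 4), (3, 8)
```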
Neural networks are a normative model:
They produce behavior and neural activity that is similar to biological intelligence when trained in an ecological task setting.
But the architectures, the artificial neurons themselves, and the learning rules are not biologically plausible.
We can still:
make architectural manipulations motivated by biological plausibility:
hierarchical convolutional structure
local and long-range recurrence