Sidney Bell, PhD
Last updated 2024
Text
Loops
Conditionals
Variables
Functions
Data encoding
Model form
Training algo
Evaluation
1. Curate input: Data
(counts, images, other)
2. Formulate: Based on your hypothesis, encode relationships between data “features” as mathematical function(s)
3. Train: “Learn” the value of variables in the model functions through iterative adjustment and evaluation
4. Test: Evaluate the generalizability of these functions on held-out data
5. Interpret: this is the tricky part :)
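A minimal sketch of the five steps on a toy problem. The data are made up, the hypothesis (a straight line through the origin) is chosen only for illustration, and the training step uses a closed-form least-squares shortcut rather than iterative adjustment.

```python
import random

# 1. Curate input: toy (x, y) pairs generated from y = 2x plus noise (made-up data).
random.seed(0)
data = [(x, 2.0 * x + random.gauss(0, 0.1)) for x in range(1, 21)]
random.shuffle(data)
train, held_out = data[:15], data[15:]

# 2. Formulate: the hypothesis is y = slope * x, with one unknown variable (slope).
def model(x, slope):
    return slope * x

# 3. Train: least-squares estimate of slope (closed form for a line through the origin).
slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

# 4. Test: average absolute error on the held-out pairs, which training never used.
test_error = sum(abs(model(x, slope) - y) for x, y in held_out) / len(held_out)

# 5. Interpret: slope is the model's estimate of how y scales with x.
print(f"slope={slope:.2f}, held-out error={test_error:.2f}")
```

If the held-out error were much larger than the error on the training pairs, that would suggest the hypothesis does not generalize.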
Vaccines are more effective against some variants than others. Why?
Bell et al, eLife 2019
(Glossing over for now)
Antigenic distance between viruses i, j
Mutations between i,j
Effect of each mutation (unknown to "learn")
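Written out, the three labels above describe a sum. This is a sketch of the form only: the symbols D_ij, M_ij and d_m are notation assumed here for illustration, and the terms being glossed over are left out.

```latex
% D_{ij}: antigenic distance between viruses i and j
% M_{ij}: set of mutations between i and j
% d_m:    effect of mutation m (unknown, to be "learned")
D_{ij} \approx \sum_{m \in M_{ij}} d_m
```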
Maximum likelihood
P(data | hypothesis is true)
Easier to estimate
Bayesian
P(hypothesis is true | data)
Harder to estimate
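A toy coin-flip illustration of why one quantity is easier to estimate than the other. The two candidate hypotheses and the prior are made-up values; the point is only that the Bayesian quantity needs extra ingredients.

```python
from math import comb

def likelihood(heads, flips, p_heads):
    """P(data | hypothesis): binomial probability of the observed flips."""
    return comb(flips, heads) * p_heads**heads * (1 - p_heads) ** (flips - heads)

heads, flips = 8, 10                        # made-up observations
hypotheses = {"fair": 0.5, "biased": 0.8}   # hypothetical candidate values of P(heads)
prior = {"fair": 0.5, "biased": 0.5}        # assumed prior belief in each hypothesis

# Maximum likelihood view: needs only the data and one hypothesis at a time.
lik = {h: likelihood(heads, flips, p) for h, p in hypotheses.items()}

# Bayesian view: also needs a prior and a sum over every competing hypothesis.
evidence = sum(lik[h] * prior[h] for h in hypotheses)
posterior = {h: lik[h] * prior[h] / evidence for h in hypotheses}

print(lik)
print(posterior)
```

With only two hypotheses the sum is trivial; with a continuous space of hypotheses it becomes an integral, which is where most of the difficulty comes from.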
Maximum likelihood
P(data | hypothesis is true under these conditions / parameters)
3. Train: Find the parameters that maximize the likelihood, hence "maximum likelihood"
(see what we did there?)
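A minimal sketch of "find the parameters that maximize the likelihood", assuming a deliberately simple hypothesis: the measurements are one true value plus Gaussian noise of known size. The data, the noise level `SIGMA` and the grid of candidates are all made up for illustration.

```python
from math import log, pi

measurements = [1.9, 2.1, 2.0, 2.2, 1.8]   # made-up data
SIGMA = 0.2                                # assumed known noise level

def log_likelihood(mu):
    """log P(data | true value is mu), under Gaussian noise."""
    return sum(
        -0.5 * log(2 * pi * SIGMA**2) - (x - mu) ** 2 / (2 * SIGMA**2)
        for x in measurements
    )

# Try every candidate parameter value and keep the one with the highest likelihood.
candidates = [i / 100 for i in range(100, 301)]   # 1.00, 1.01, ..., 3.00
best_mu = max(candidates, key=log_likelihood)
print(best_mu)
```

Working with the log of the likelihood does not change which parameter wins, and it avoids multiplying many tiny probabilities together.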
4. Test: Assess how well this hypothesis & best-fit parameters (experimental conditions) explain real-world data
5. Interpret: If it's a reasonably good model,
use the model to learn other things
"Training error": how different is the model's guess from the actual data?
"Regularization" is a corollary hypothesis:
Most antigenic change will be the result of a few large changes, not many small changes. So we expect most values of d (the effect of each mutation) to be at or near 0, such that the distribution of d looks like an exponential distribution.
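One common way to combine training error and a regularization hypothesis into a single score to minimize is to add a penalty to the error. The sketch below assumes a squared-error term and a penalty proportional to the summed size of the effects, which is the usual counterpart of expecting exponentially distributed values. The function names and the `strength` parameter are hypothetical, not taken from the source.

```python
def training_error(predicted, observed):
    """Sum of squared differences between the model's guesses and the data."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))

def regularization_penalty(effects, strength):
    """Penalty that grows with the total size of the mutation effects (assumed form)."""
    return strength * sum(abs(d) for d in effects)

def cost(predicted, observed, effects, strength=1.0):
    """Quantity to minimize: how wrong the guesses are, plus the penalty."""
    return training_error(predicted, observed) + regularization_penalty(effects, strength)

# Same fit to the data, but the second set of effects has more non-zero values.
few_effects = cost([1.0], [1.0], effects=[1.0, 0.0, 0.0])
many_effects = cost([1.0], [1.0], effects=[1.0, 0.3, 0.2])
print(few_effects, many_effects)
```

The `strength` value sets how strongly the expectation of mostly-zero effects is enforced relative to fitting the data, and it is itself a design decision.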
"Cost Function"
Try a value => assess training error => update value
Lots of algorithms + implementations readily available for how to pick the next value to try.
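The try / assess / update loop can be sketched with gradient descent, one of the many readily available algorithms for picking the next value. The toy data, the one-parameter model and the learning rate are assumptions chosen for illustration.

```python
xs = [1.0, 2.0, 3.0, 4.0]   # made-up inputs
ys = [2.0, 4.0, 6.0, 8.0]   # made-up observations (exactly y = 2x)

def training_error(slope):
    """Mean squared difference between the model's guesses and the data."""
    return sum((slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(slope):
    """Direction in which the training error increases as slope changes."""
    return sum(2 * (slope * x - y) * x for x, y in zip(xs, ys)) / len(xs)

slope = 0.0            # try a value
learning_rate = 0.01   # assumed step size
for step in range(500):
    # assess training error (through its gradient), then update the value
    slope -= learning_rate * gradient(slope)

print(slope, training_error(slope))
```

Each pass nudges the value in the direction that reduces the training error, so the estimate approaches the best-fit slope of 2.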
"Root mean squared error" = 0.75 (95% CI 0.74–0.77)
Variation explained ~0.78 (95% CI 0.77–0.79)
On average, the model's predictions of the antigenic distance between pairs of strains are within ~0.75 normalized log2 titer units of the measured values
On average, this model (hypothesis) is able to explain about 78% of the variation in titer distances between strains
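A sketch of how these two summary numbers can be computed from predicted and observed values. The "variation explained" function uses the coefficient of determination, which is one common definition; the source does not say which statistic was used, and the numbers below are made up.

```python
from math import sqrt

def rmse(predicted, observed):
    """Root mean squared error: typical size of a prediction miss."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return sqrt(sum(squared) / len(squared))

def variation_explained(predicted, observed):
    """1 - (unexplained variation / total variation in the observations)."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - p) ** 2 for p, o in zip(predicted, observed))
    total = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - residual / total

observed = [1.0, 2.0, 3.0, 4.0]    # made-up measurements
predicted = [1.5, 1.5, 3.5, 3.5]   # made-up model output

print(rmse(predicted, observed))                 # 0.5
print(variation_explained(predicted, observed))  # approximately 0.8
```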
Data
Interpolated data using model parameters
Antigenic distance between viruses i, j
Effect of each mutation
We learned these and they reliably predict observed data
Predict antigenic distance between existing vaccine and the new circulating strain by adding up these values for each mutation in its genome
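Adding up learned values can be sketched as a dictionary lookup and a sum. The mutation names and effect sizes below are invented for illustration, and treating an unseen mutation as having zero effect is an assumption, not something the source states.

```python
# Hypothetical learned effects (made-up names and values).
mutation_effects = {"A123T": 0.8, "G45D": 0.0, "K200R": 0.3}

def predicted_antigenic_distance(mutations, effects):
    """Sum the learned effect of every mutation separating two strains."""
    # Assumption: a mutation with no learned effect contributes 0.
    return sum(effects.get(m, 0.0) for m in mutations)

# Mutations separating the vaccine strain from a new circulating strain (made up).
new_strain_mutations = ["A123T", "K200R", "Q9L"]
print(predicted_antigenic_distance(new_strain_mutations, mutation_effects))
```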
"Prompting"
Interpolated data using model parameters
1. Curate input
2. Formulate your hypothesis
3. Train to learn params
4. Test generalizability
5. Interpret
Additional model with another task
Models are just hypotheses, written down in math. They are not magic or scary. You use them every day.
There are a lot of decisions (art) involved in model design. Use this as a starting point for understanding and evaluation.
"Training" is just iteratively improving your estimates.
"Testing" is evaluating generalizability -- KEY!
Common interpretation pathways / tasks include interpolation, prediction, simulation, and input to other models.
You can think of neural networks as just many linear regressions, recursively chained together
A "neuron" = one linear regression
=> some non-linear "activation function" (design decision) => output of this neuron, in the first layer
Inputs to the linear regression: Gene 1, Gene 2, Gene 3, plus the bias term b0
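A single neuron can be sketched as a weighted sum plus a bias, passed through an activation function. The sigmoid is one common choice of activation, picked here only as an example; the expression values and weights are made up.

```python
from math import exp

def sigmoid(z):
    """One possible non-linear activation function (a design decision)."""
    return 1.0 / (1.0 + exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Linear regression on the inputs, followed by the activation function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

expression = [0.5, 1.0, 0.0]   # made-up values for Gene 1, Gene 2, Gene 3
weights = [0.2, -0.1, 0.4]     # hypothetical weights, normally learned in training
b0 = 0.0                       # bias term

output = neuron(expression, weights, b0)
print(output)
```

Swapping `sigmoid` for a function that returns its input unchanged turns the neuron back into a plain linear regression.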
1st "layer"
Linear model of Cell 1 as a function of its gene expression vector
Repeat for Cells 2, 3, ... N
2nd "layer"
Linear model of Cell N as a function of the outputs of layer 1, across all cells
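Chaining layers means the outputs of one layer become the inputs of the next. The sketch below is a generic two-layer forward pass with made-up weights and a ReLU activation; it is an assumption-laden illustration of the chaining idea, not the specific architecture on the slide.

```python
def relu(z):
    """A simple non-linear activation: negative values become 0."""
    return max(0.0, z)

def layer(inputs, weight_rows, biases, activation=relu):
    """One layer: each row of weights plus its bias defines one neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weight_rows, biases)
    ]

x = [1.0, 2.0, 3.0]   # made-up input vector

# 1st layer: two neurons, each a linear model of the three inputs.
hidden = layer(x, [[1.0, 0.0, 0.0], [0.0, 1.0, -1.0]], [0.0, 0.0])

# 2nd layer: one neuron, a linear model of the first layer's outputs.
out = layer(hidden, [[2.0, 1.0]], [0.5])

print(hidden, out)   # [1.0, 0.0] [2.5]
```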
Nth "layer"
Output values* of the last layer are used to calculate how well your model is performing (Evaluation)
Update weights and biases for every node in the network
(Training algorithm)
Nth "layer"
Outputs vary based on the activation function and number of nodes in the final layer
Neural network 1
Evaluation
Neural network 2
Simulated samples
Probability distribution of expression of "meta genes"
Input to other classifier
Anomaly detection
Embedding
Prompt
"Attention" values are just weights applied to the input like before. They allow for:
1 - Looking at the whole input at once
2 - Weighting the most salient bits of the input more heavily (adapted for each unique input)
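A minimal sketch of attention as input-dependent weights. It assumes the relevance scores are already given and uses a softmax to turn them into weights; how real models compute those scores from the input is not shown, and all numbers are made up.

```python
from math import exp

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)                        # subtract the max for numerical stability
    exps = [exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values):
    """Weight every input value by its attention weight and add them up."""
    weights = softmax(scores)
    return weights, sum(w * v for w, v in zip(weights, values))

values = [1.0, 2.0, 3.0]   # made-up input, seen all at once
scores = [0.1, 3.0, 0.2]   # made-up relevance scores for this particular input

weights, summary = attend(scores, values)
print(weights, summary)
```

Because the scores depend on the input, a different input gets a different set of weights, which is what "adapted for each unique input" refers to.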
"Understand-er"
"Generator"