Differentiable biology: using deep learning for biophysics-based and data-driven modeling of molecular mechanisms
Harshavardhan Kamarthi
Presented at Data Seminar 10 Dec 2021
Introduction
- Survey/opinion paper
- Using differentiable programs to model a wide range of biophysical phenomena
- Enabled by four key developments:
- Pattern recognizers
- Bespoke Modeling
- Joint optimization
- Software frameworks
- Example applications
Pattern Recognizers
- DNNs, and ML models in general, are powerful representation learners
- Complex inputs: images, speech, multimodal data
- Biological applications:
- Cell classification, segmentation
- Protein structure prediction
- Drug discovery
Examples
CNN for cell classification [Oei+ PLOS ONE 2019] (see the sketch below)
Cancer detection [Zhang+ Nature 2019]
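As a rough illustration of the first example, here is a minimal CNN cell classifier in PyTorch. The input resolution, channel counts, and number of classes are illustrative assumptions, not values from Oei et al.

```python
import torch
import torch.nn as nn

# Minimal CNN cell classifier (sketch). Input size (1x64x64) and the
# number of cell classes (5) are illustrative assumptions.
class CellClassifier(nn.Module):
    def __init__(self, n_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 64 -> 32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 32 -> 16
        )
        self.head = nn.Linear(32 * 16 * 16, n_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

logits = CellClassifier()(torch.randn(8, 1, 64, 64))  # (8, 5)
```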
Examples
Predicting DNA specificities [Alipanahi+ Nature Biotechnology 2015]
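The DeepBind-style idea can be sketched as a 1D convolution over one-hot-encoded DNA, where each filter behaves like a learned position weight matrix scanned along the sequence. The motif width and filter count below are assumptions, not the published hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Motif-scanning sketch: a 1D convolution over one-hot DNA acts like a
# bank of learned position weight matrices. Motif width (11) and the
# number of motifs (16) are assumptions.
class MotifScanner(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(4, 16, kernel_size=11)  # 4 channels: A, C, G, T
        self.head = nn.Linear(16, 1)

    def forward(self, x):            # x: (batch, 4, seq_len), one-hot
        h = torch.relu(self.conv(x))
        h = h.max(dim=-1).values     # max-pool over sequence positions
        return self.head(h)          # binding score per sequence

dna = F.one_hot(torch.randint(0, 4, (2, 100)), 4).float().permute(0, 2, 1)
score = MotifScanner()(dna)          # (2, 1)
```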
Bespoke Modeling
- DNNs can leverage aspects of the underlying chemical/physical process
- Encode prior knowledge (priors) into the architecture
- CNN: translational invariance
- Ex: Learning protein configurations from EM images [Rosenbaum+ 2021 DeepMind]
Model
Encoder
- Input: EM image
- Encodes into vector representation
- Split into pose and conformation latent variables
Decoder
- Combines pose and backbone configuration into a sequence of 3D atom positions (deterministic formula)
- Renderer converts 3D configuration into image.
Training
- Loss on the backbone configuration along with the usual VAE loss (a structural sketch of the model follows)
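A data-flow sketch of this encoder/decoder split is below. It is not the DeepMind model: all dimensions, the atom-placement module, and the linear "renderer" are stand-ins that only mirror the structure described above.

```python
import torch
import torch.nn as nn

# Data-flow sketch of the cryo-EM autoencoder: the latent vector is
# partitioned into pose and conformation variables. All sizes, the
# atom-placement module, and the linear "renderer" are stand-ins.
class EMAutoencoder(nn.Module):
    def __init__(self, img_dim=64 * 64, pose_dim=6, conf_dim=8, n_atoms=100):
        super().__init__()
        self.pose_dim = pose_dim
        self.encoder = nn.Sequential(
            nn.Linear(img_dim, 256), nn.ReLU(),
            nn.Linear(256, pose_dim + conf_dim),
        )
        self.decode_atoms = nn.Linear(conf_dim, 3 * n_atoms)     # 3D coords
        self.render = nn.Linear(3 * n_atoms + pose_dim, img_dim)

    def forward(self, img):
        z = self.encoder(img.flatten(1))
        pose, conf = z[:, :self.pose_dim], z[:, self.pose_dim:]
        atoms = self.decode_atoms(conf)                  # atom configuration
        recon = self.render(torch.cat([atoms, pose], dim=1))
        return recon, atoms

recon, atoms = EMAutoencoder()(torch.randn(2, 64, 64))
```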
Joint Optimization
- Multiple components of the architecture are composable and jointly trainable
- Components learn from each other
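A minimal sketch of what joint optimization means in practice: two composed modules, one loss, one backward pass updating both. The shapes and the loss are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Two composable components trained end to end: gradients from a single
# loss flow through both, so each adapts to the other.
feature_extractor = nn.Sequential(nn.Linear(10, 32), nn.ReLU())
predictor = nn.Linear(32, 1)

params = list(feature_extractor.parameters()) + list(predictor.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x, y = torch.randn(16, 10), torch.randn(16, 1)
loss = nn.functional.mse_loss(predictor(feature_extractor(x)), y)
opt.zero_grad()
loss.backward()   # one backward pass reaches both components
opt.step()
```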
Software Frameworks
- Off-the-shelf NN architectures
- Combine with custom mathematical transformations
- Only worry about the forward pass; the framework derives gradients automatically
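This is the core convenience of autodiff frameworks: write only the forward computation, including custom (non-neural) math, and gradients come out automatically. The toy energy function below is an arbitrary smooth example, not a real biophysical potential.

```python
import torch

# Only the forward pass is written; torch.autograd supplies gradients,
# even through custom math with no neural-network layers in it.
def energy(coords):                                   # toy, smooth "energy"
    diff = coords.unsqueeze(0) - coords.unsqueeze(1)  # (N, N, 3)
    dist2 = (diff ** 2).sum(-1)                       # squared distances
    return (1.0 / (dist2 + 1.0)).sum()

coords = torch.randn(10, 3, requires_grad=True)
energy(coords).backward()                             # gradients for free
print(coords.grad.shape)                              # torch.Size([10, 3])
```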
DiffBio Primitives
- Biological patterns: capture patterns/representations from multiple sources of data
- Mechanistic priors: structure, chemistry
- Data priors: heterogeneous, noisy, incomplete data
Mechanistic Priors
- Translation and rotation invariance: generalize CNNs to Lie groups [Dehmamy+ ICLR 2021, Cohen+ NeurIPS 2019] (see the sketch below)
- Use a vocabulary of discrete features [Gainza+ Nature Methods 2020]
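The cheapest version of this prior needs no special architecture at all: featurize coordinates by pairwise distances, which are invariant to rotations and translations by construction. This is a simple stand-in for, not an implementation of, the group-equivariant networks cited above.

```python
import torch

# Pairwise distances are rotation- and translation-invariant features.
def invariant_features(coords):            # coords: (n_atoms, 3)
    return torch.cdist(coords, coords)     # pairwise distance matrix

coords = torch.randn(5, 3)
Q, _ = torch.linalg.qr(torch.randn(3, 3))  # random orthogonal transform
moved = coords @ Q.T + torch.tensor([1.0, 2.0, 3.0])  # rotate + translate
print(torch.allclose(invariant_features(coords),
                     invariant_features(moved), atol=1e-5))  # True
```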
Data Priors
- Deal with incomplete, noisy, and heterogeneous data
- Differentiable programs enable learning sophisticated error models (sketched below)
- Can combine simple features from experiments with complex equations using Neural ODE solvers.
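One concrete form of a learned error model is heteroscedastic regression: the network predicts a per-point variance alongside the mean, so noisier measurements are automatically down-weighted by a Gaussian negative log-likelihood. Sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Heteroscedastic error-model sketch: predict mean and variance jointly,
# then train with a Gaussian negative log-likelihood.
net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

x, y = torch.randn(64, 10), torch.randn(64, 1)
out = net(x)
mean = out[:, :1]
var = F.softplus(out[:, 1:]) + 1e-4        # keep variance positive
loss = F.gaussian_nll_loss(mean, y, var)   # noisy points weigh less
loss.backward()
```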
Homogenizing PPI data
- Goal: Learn protein-protein interaction affinities from diverse data sources [Cunningham+ Nature Methods 2020]
- Data sources:
- Quantitative binding data: direct measurements from experiments
- Mass spectrometry of cell extracts: noisy
- Functional data from gene expression studies: readily available but very indirect
- Structural data: used to measure affinities (for training labels)
Homogenizing PPI data
- Two-step training
- Homogenizers specific to each data type
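The homogenizer idea can be sketched as a shared affinity model plus a small per-data-type head that maps the common affinity scale onto each heterogeneous readout. The module names and sizes below are assumptions, not the architecture of Cunningham et al.

```python
import torch
import torch.nn as nn

# Shared affinity model + per-source "homogenizer" heads (sketch).
shared = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
homogenizers = nn.ModuleDict({
    "binding": nn.Linear(1, 1),      # near-direct measurements
    "mass_spec": nn.Linear(1, 1),    # noisy cell-extract readouts
    "expression": nn.Linear(1, 1),   # very indirect functional data
})

def predict(pair_features, source):
    affinity = shared(pair_features)          # common latent affinity
    return homogenizers[source](affinity)     # source-specific readout

y_hat = predict(torch.randn(4, 20), "mass_spec")  # (4, 1)
```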
AlphaFold
- Goal: Protein Structure Prediction
- Input: Protein sequence (sequence of symbols)
- Output: 3D coordinates of tertiary structure
AlphaFold
1. Leverages evolutionarily related sequences from protein sequence databases and performs multiple sequence alignment (MSA)
2. Passes the aligned sequences through a transformer-based architecture (Evoformer) and directly predicts the 3D structure
AlphaFold
1. Transformer architecture learns interpretable intermediate folding stages, especially for complex proteins
2. Intermediate loss functions: local protein structure from the data bank, full sequence using torsion angles (a toy sketch of this pattern follows)
3. Achieves state-of-the-art performance
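Point 2 amounts to auxiliary-loss training: intermediate predictions get their own loss terms added to the final one, so gradients shape every stage. The sketch below shows only this summation pattern; the terms and the 0.5 weight are placeholders, not AlphaFold's actual losses.

```python
import torch
import torch.nn.functional as F

# Auxiliary-loss pattern (sketch): sum a final structure loss with
# losses on intermediate predictions. Terms and weights are placeholders.
def total_loss(final_pred, intermediate_preds, target):
    loss = F.mse_loss(final_pred, target)
    for p in intermediate_preds:               # e.g. per-block outputs
        loss = loss + 0.5 * F.mse_loss(p, target)
    return loss

target = torch.randn(10, 3)
loss = total_loss(torch.randn(10, 3), [torch.randn(10, 3)], target)
```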
Upcoming developments
- Self-supervised learning:
- Learning without labels
- Eg: masking in sequences (see the sketch after this list)
- Similarity in representations from homologous proteins
- Generative models
- Molecule generation; RNA/DNA imputation
- Simulation
- Data augmentation, inverse graphics generation (2D -> 3D)
- Non-differentiable discrete spaces
- Eg: organic synthesis, RNA design
- Reinforcement learning for optimizing in such spaces
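A minimal masked-modeling sketch for protein sequences: hide random residues and train the model to recover them, with no labels required. The 20-letter amino-acid vocabulary is standard; the 15% masking rate is a common convention, not something from the paper.

```python
import torch
import torch.nn.functional as F

# Self-supervised masking sketch: corrupt random positions, predict them.
VOCAB, MASK = 21, 20                             # 20 residues + mask token
seq = torch.randint(0, 20, (1, 50))              # toy residue indices
mask = torch.rand(seq.shape) < 0.15              # 15% masking rate
corrupted = seq.masked_fill(mask, MASK)

embed = torch.nn.Embedding(VOCAB, 32)
head = torch.nn.Linear(32, 20)
logits = head(embed(corrupted))                  # (1, 50, 20)
loss = F.cross_entropy(logits[mask], seq[mask])  # recover masked residues
```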