Differentiable biology: using deep learning for biophysics-based and data-driven modeling of molecular mechanisms

Harshavardhan Kamarthi

Presented at Data Seminar 10 Dec 2021

Introduction

  • Survey/opinion paper
  • Using differentiable programs to model a wide range of biophysical phenomena
  • Enabled by four key developments:
    • Pattern recognizers
    • Bespoke Modeling
    • Joint optimization
    • Software frameworks
  • Example applications

Pattern Recognizers

  • DNNs, and ML models in general, are powerful representation learners
  • They handle complex inputs: images, speech, multimodal data
  • Bioapplications:
    • Cell classification, segmentation
    • Protein structure prediction
    • Drug discovery
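
A minimal sketch of such a pattern recognizer in code, assuming PyTorch; the CellClassifier name, layer sizes, and class count are illustrative, not taken from the cited papers:

    # Hypothetical CNN cell classifier: grayscale microscopy image -> class logits
    import torch
    import torch.nn as nn

    class CellClassifier(nn.Module):
        def __init__(self, num_classes: int = 4):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),            # pool to a fixed-size vector
            )
            self.head = nn.Linear(32, num_classes)  # class logits

        def forward(self, x):                       # x: (batch, 1, H, W)
            return self.head(self.features(x).flatten(1))

    logits = CellClassifier()(torch.randn(8, 1, 64, 64))  # -> (8, 4)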

Examples

CNN for cell classification [Oei+ PLOS ONE 2019]

Cancer detection [Zhang+ Nature 2019]

Examples

Predicting DNA-binding specificities [Alipanahi+ Nature Biotech 2015]

Bespoke Modeling

  • DNNs can leverage aspects of the underlying chemical/physical processes
  • Encode prior knowledge directly into the architecture
    • CNN: translational invariance
  • Ex: Learning protein configurations from EM images [Rosenbaum+ 2021 DeepMind]

Model

Encoder

  • Input: cryo-EM image
  • Encodes it into a vector representation
  • The representation is split into pose and conformation latent variables
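
A minimal sketch of this split, assuming PyTorch; the sizes and names are made up, and the VAE reparameterization is omitted, so this is not the actual Rosenbaum et al. model:

    # Hypothetical encoder: EM image -> (pose, conformation) latent variables
    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        def __init__(self, pose_dim=6, conf_dim=8):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(16, pose_dim + conf_dim),
            )
            self.pose_dim = pose_dim

        def forward(self, img):                     # img: (batch, 1, H, W)
            z = self.backbone(img)                  # single latent vector
            return z[:, :self.pose_dim], z[:, self.pose_dim:]  # pose, conformation

    pose, conf = Encoder()(torch.randn(4, 1, 64, 64))  # -> (4, 6), (4, 8)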

Decoder

  • Combines the pose with the decoded backbone configuration to produce a sequence of 3D atom coordinates (via a deterministic formula)
  • A differentiable renderer converts the 3D configuration back into an image.

Training

  • Loss on the backbone configuration, along with the usual VAE (reconstruction + KL) loss.
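
A sketch of such an objective, assuming PyTorch; the linear decoder and renderer are stand-ins, all shapes are hypothetical, and the KL term is omitted for brevity:

    # Hypothetical training objective: image reconstruction + backbone supervision
    import torch
    import torch.nn as nn

    decoder  = nn.Linear(6 + 8, 3 * 50)    # pose+conformation -> 50 atoms x 3D coords
    renderer = nn.Linear(3 * 50, 64 * 64)  # differentiable stand-in for rendering

    def loss_fn(pose, conf, image, backbone_ref):
        coords = decoder(torch.cat([pose, conf], dim=-1))      # (batch, 150)
        recon  = renderer(coords).view(-1, 64, 64)             # rendered image
        recon_loss    = ((recon - image) ** 2).mean()          # reconstruction term
        backbone_loss = ((coords.view(-1, 50, 3) - backbone_ref) ** 2).mean()
        return recon_loss + backbone_loss                      # KL term omitted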

Joint Optimization

  • Multiple components of the architecture are composable and can be trained jointly
  • Components learn from each other, since gradients flow between them
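
A minimal sketch of joint end-to-end training, assuming PyTorch; the two modules and their shapes are illustrative:

    # Two composable components trained jointly by one optimizer
    import torch
    import torch.nn as nn

    feature_net = nn.Linear(10, 5)   # e.g. a learned representation
    mechanism   = nn.Linear(5, 1)    # e.g. a downstream mechanistic head
    opt = torch.optim.Adam(list(feature_net.parameters()) + list(mechanism.parameters()))

    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = ((mechanism(feature_net(x)) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()                  # gradients flow through both components
    opt.step()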

Software Frameworks

  • Off-the-shelf NN architectures
  • Combine them with custom mathematical transformations
  • Only the forward pass needs to be written; the framework derives gradients automatically.
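
A sketch of this "forward pass only" workflow, assuming PyTorch autograd; custom_transform is a made-up transformation:

    # Write only the forward computation; the framework derives gradients
    import torch

    def custom_transform(x, k):
        return torch.sin(k * x).sum()   # any differentiable forward pass

    k = torch.tensor(2.0, requires_grad=True)
    y = custom_transform(torch.linspace(0, 1, 100), k)
    y.backward()                        # autograd computes dy/dk
    print(k.grad)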

DiffBio Primitives

  • Biological Patterns: Capture patterns/representations from multiple sources of data
  • Mechanistic Priors: Structure, chemistry
  • Data Priors: Heterogeneous, noisy, incomplete data

Mechanistic Priors

  • Translation and rotation invariance: generalize CNNs to Lie groups [Dehmamy+ ICLR 2021, Cohen+ NeurIPS 2019]
  • Use a vocabulary of discrete features [Gainza+ Nature Methods 2020]
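
A numeric sketch of the simplest such prior, assuming PyTorch: an ordinary convolution commutes with translation (the equivariance property underlying CNNs' translation invariance); circular padding keeps boundaries from interfering. The Lie-group CNNs cited above generalize this to other symmetries:

    # Check translation equivariance: shift-then-convolve == convolve-then-shift
    import torch
    import torch.nn as nn

    conv = nn.Conv1d(1, 1, kernel_size=3, padding=1, padding_mode='circular', bias=False)
    x = torch.randn(1, 1, 32)

    shift_then_conv = conv(torch.roll(x, shifts=4, dims=-1))
    conv_then_shift = torch.roll(conv(x), shifts=4, dims=-1)
    print(torch.allclose(shift_then_conv, conv_then_shift, atol=1e-6))  # True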

Data Priors

  • Deal with incomplete, noisy, and heterogeneous data
  • Differentiable programs enable learning sophisticated error models
  • Can combine simple experimental features with mechanistic differential equations using neural ODE solvers, as sketched below.
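
A minimal sketch, assuming PyTorch plus the torchdiffeq package; the Dynamics model, mixing a fixed decay term with a learned correction, is purely illustrative:

    # Learned component inside an ODE, integrated by a differentiable solver
    import torch
    import torch.nn as nn
    from torchdiffeq import odeint   # pip install torchdiffeq

    class Dynamics(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Linear(2, 2)   # learned correction term (hypothetical)

        def forward(self, t, y):         # dy/dt = known decay + learned part
            return -y + self.net(y)

    y0 = torch.randn(2)
    t = torch.linspace(0.0, 1.0, 10)
    traj = odeint(Dynamics(), y0, t)     # (10, 2); differentiable end to end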

Homogenizing PPI data

  • Goal: Learn Protein-Protein interaction affinities from diverse data sources [Cunningham+ Nature 2020]
  • Data sources:
    • Quantitative binding data: direct measurements from experiments
    • Mass spectrometry of cell extracts: noisy
    • Functional data from gene expression studies: readily available but very indirect
    • Structural data: used to derive affinities (for training labels)

Homogenizing PPI data

  • Two-step training
  • Separate homogenizer modules, one per data type, map each source onto a common affinity scale
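
A sketch of the homogenizer idea, assuming PyTorch; the data-type names, shapes, and the shared/per-source split are illustrative, not the actual Cunningham et al. architecture:

    # Shared affinity model + per-data-type homogenizer heads
    import torch
    import torch.nn as nn

    shared = nn.Linear(64, 1)                 # predicts a common affinity score
    homogenizers = nn.ModuleDict({
        "binding_assay": nn.Linear(1, 1),     # near-direct measurements
        "mass_spec":     nn.Linear(1, 1),     # noisier source, own mapping
    })

    def predict(features, source):            # features: (batch, 64)
        return homogenizers[source](shared(features))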

AlphaFold

  • Goal: Protein Structure Prediction
  • Input: Protein sequence (sequence of symbols)
  • Output: 3D coordinates of the tertiary structure

AlphaFold

1. Leverages evolutionarily related sequences from sequence databases and performs multiple sequence alignment (MSA)

2. Passes this set of sequences through a transformer-based architecture (Evoformer) and directly predicts the 3D structure
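
A toy sketch of the overall sequence-to-coordinates mapping, assuming PyTorch; this cartoon (ToyStructureNet) ignores the MSA entirely and is not the Evoformer:

    # Toy model: residue indices -> per-residue 3D coordinates
    import torch
    import torch.nn as nn

    class ToyStructureNet(nn.Module):
        def __init__(self, n_residue_types=20, d=32):
            super().__init__()
            self.embed = nn.Embedding(n_residue_types, d)
            layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.to_xyz = nn.Linear(d, 3)        # per-residue 3D coordinates

        def forward(self, seq):                  # seq: (batch, length) indices
            return self.to_xyz(self.encoder(self.embed(seq)))

    coords = ToyStructureNet()(torch.randint(0, 20, (1, 50)))  # -> (1, 50, 3)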

AlphaFold

1. The transformer architecture learns interpretable intermediate folding stages, especially for complex proteins.

2. Intermediate loss functions: local protein structure from the Protein Data Bank; full-sequence structure via torsion angles.

3. Achieves state-of-the-art performance.
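
A sketch of what a torsion-angle term can look like, assuming PyTorch; the dihedral computation is standard geometry, but the loss form is illustrative, not AlphaFold's exact formulation:

    # Dihedral angle of four consecutive atoms, plus a periodic-safe angle loss
    import torch

    def dihedral(p0, p1, p2, p3):                # each: (..., 3)
        b0, b1, b2 = p1 - p0, p2 - p1, p3 - p2   # bond vectors
        n1 = torch.cross(b0, b1, dim=-1)
        n2 = torch.cross(b1, b2, dim=-1)
        m1 = torch.cross(n1, b1 / b1.norm(dim=-1, keepdim=True), dim=-1)
        return torch.atan2((m1 * n2).sum(-1), (n1 * n2).sum(-1))

    def torsion_loss(pred_atoms, ref_angle):     # pred_atoms: (..., 4, 3)
        ang = dihedral(*pred_atoms.unbind(-2))
        return (1 - torch.cos(ang - ref_angle)).mean()  # handles angle wrap-around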

Upcoming developments

  • Self-supervised learning:
    • Learning without labels
    • E.g., masking in sequences (see the sketch after this list)
    • Similarity in representations from homologous proteins
  • Generative models
    • Molecule generation, RNA/DNA imputation
  • Simulation
    • Data augmentation, inverse graphics generation (2D -> 3D)
  • Non-differentiable discrete spaces
    • E.g., organic synthesis, RNA design
    • Reinforcement learning
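
A sketch of the masked-sequence idea, assuming PyTorch; the toy model and 15% masking rate are illustrative (real systems use large transformers):

    # Self-supervision: hide random residues, train the model to recover them
    import torch
    import torch.nn as nn

    VOCAB, MASK = 21, 20                       # 20 amino acids + a [MASK] token
    model = nn.Sequential(nn.Embedding(VOCAB, 32), nn.Linear(32, VOCAB))

    seq = torch.randint(0, 20, (8, 64))        # unlabeled protein sequences
    mask = torch.rand(seq.shape) < 0.15        # mask ~15% of positions
    inp = seq.masked_fill(mask, MASK)

    logits = model(inp)                        # (8, 64, VOCAB)
    loss = nn.functional.cross_entropy(logits[mask], seq[mask])  # recover residues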

DiffBio

By Harshavardhan Kamarthi