Large Science Models: Digital Twins for Viral Evolution

Ishanu Chattopadhyay, PhD

Assistant Professor of Biomedical Informatics & Computer Science

University of Kentucky

ishanu_ch@uky.edu

Stamping Out the Next Pandemic **Before** The First Human Infection

BioNorad

Chattopadhyay, Ishanu, Kevin Wu, Jin Li, and Aaron Esser-Kahn. "Emergenet: Fast Scalable Pandemic Risk Assessment of Influenza A Strains Circulating In Non-human Hosts." (2023). Under Review 

PREEMPT

Predicting Future Mutations for Viral Genomes in the Wild

Predict future emergence risk

Hemagglutinin

Neuraminidase

Mediates Cellular Entry

Surface structures  involved in host interaction

Mediates Cellular Exit

*Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. "Unbiased recursive partitioning: A conditional inference framework." Journal of Computational and Graphical Statistics 15, no. 3 (2006): 651-674.

emergent macro-structure

Component predictor (Conditional Inference Tree*)

Example: Influenza A HA protein

Recursive

LSM

forest

LSM Forest of Conditional Inference Trees*

Revealing Emergent Cross-talk from observed sequence variations

>200,000 HA sequences

LSM Forest

Recursive LSM forest: hyperlinked nodes capturing emergent macro-structures

  • Set of conditional inference trees (CIT)
    • Strict statistical guarantees: quantifies inference uncertainty
  • Each tree models exactly one variable as a function of potentially all other variables
  • Non-leaf nodes are "hyperlinked" to other trees
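
The "one tree per variable" layout listed above can be sketched in a few lines. This is a structural illustration only: it swaps each conditional inference tree for a depth-1 frequency table, a much cruder predictor with none of the CIT's statistical guarantees. The names `fit_site_predictor`, `fit_forest`, and `split_site`, and the toy alignment, are hypothetical and not part of Emergenet.

```python
from collections import Counter, defaultdict


def fit_site_predictor(rows, target, alphabet, pseudocount=1.0):
    """Fit a stand-in predictor for site `target` (NOT a real CIT).

    The predictor conditions on the single other site whose values best
    separate the target's values, and returns a smoothed distribution.
    """
    def smoothed(counts):
        total = sum(counts.values()) + pseudocount * len(alphabet)
        return {a: (counts.get(a, 0) + pseudocount) / total for a in alphabet}

    marginal = Counter(row[target] for row in rows)
    best_site, best_score, best_table = None, -1, {}
    for j in range(len(rows[0])):
        if j == target:
            continue
        table = defaultdict(Counter)
        for row in rows:
            table[row[j]][row[target]] += 1
        # Score a candidate split by how many rows its majority classes explain.
        score = sum(max(c.values()) for c in table.values())
        if score > best_score:
            best_site, best_score, best_table = j, score, table

    def predict(row):
        key = row[best_site] if best_site is not None else None
        # Unseen or missing values fall back to the marginal counts.
        return smoothed(best_table.get(key, marginal))

    predict.split_site = best_site  # index of the site this predictor "links" to
    return predict


def fit_forest(rows, alphabet):
    """One predictor per site, each a function of the other sites."""
    return [fit_site_predictor(rows, i, alphabet) for i in range(len(rows[0]))]


# Toy alignment: sites 0 and 1 co-vary, site 2 is constant.
rows = ["KGA", "KGA", "TSA", "TSA"]
forest = fit_forest(rows, alphabet="AGKST")
dist = forest[0]("?GA")  # predicted distribution for site 0 given the other sites
```

In this toy, `split_site` plays the role of a hyperlink: the predictor for site 0 points at site 1, whose own predictor can be consulted in turn. The pseudocount keeps every predicted distribution non-degenerate.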

H3
Northern Hemisphere

2021

Influenza A: HA

H5

2013

Influenza A: HA

Large Science Models: Mathematical Framework

\begin{aligned}
\text{Observables:} \quad & \color{yellow} X = \{x^1, \ldots, x^N\}, \quad \overbrace{x^i \in \Sigma^i}^{\text{finite alphabet}} \\
{\color{gray}\text{Notation:}} \quad & \color{gray} x^{-i} = \{x^j : j \ne i\} \\
\text{Crosstalk:} \quad & \forall i \quad P(x^i) = \color{red} f_i(x^{-i}) \\
\text{System state:} \quad & \color{Cyan} \psi = \bigotimes_{i=1}^N \psi^i, \quad \psi^i \in \mathscr{D}(\Sigma^i) \cup \{\varnothing\} \\
{\color{gray}\text{Notation:}} \quad & \color{gray} \psi^{-i} = \bigotimes_{j \ne i} \psi^j
\end{aligned}
[Figure: data matrix with one row per sample (strain 1, strain 2, ..., strain m) and one column per observable (columns labeled 222, 223, 224, ..., 560)]

Distributions over alphabet \(\Sigma^i\)

\phi = \bigotimes_{i=1}^N \phi^i, \quad \phi^i(\psi^{-i}) \in \mathscr{D}(\Sigma^i)

Individual Predictor (CIT)

cross-talk

\phi(\psi) \vert \vert \psi

Tension between predicted and observed distribution drives change

Example: HA Site 223 on Influenza A

\(\psi^i\)

K G Y S T
\Sigma^i

Digital Twin

\phi^i(\psi^{-i}) \sim \widetilde{\psi}^i

\(\phi\) estimates \(\psi\)

population

individual

estimate is always a non-empty non-degenerate distribution

missing observation

Large Science Models: Properties

LSM-Distance Metric*

\theta(\psi,\psi') \triangleq \frac{1}{N}\sum_{i=1}^{N} \sqrt{D_{JS}\Bigl(\phi^i(\psi^{-i}) \vert \vert \phi^i(\psi'^{-i})\Bigr)}

 where \(D_{JS}(P\vert \vert Q)\) is the Jensen-Shannon divergence.
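
A minimal sketch of the distance computation, assuming the per-site predicted distributions \(\phi^i(\psi^{-i})\) and \(\phi^i(\psi'^{-i})\) have already been evaluated and are stored as dictionaries. The function names and the toy numbers are illustrative, and the natural logarithm is an assumed convention (the slide does not state the base).

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions.

    `p` and `q` are dicts mapping symbols to probabilities.
    """
    total = 0.0
    for s in set(p) | set(q):
        ps, qs = p.get(s, 0.0), q.get(s, 0.0)
        ms = 0.5 * (ps + qs)
        if ps > 0:
            total += 0.5 * ps * math.log(ps / ms)
        if qs > 0:
            total += 0.5 * qs * math.log(qs / ms)
    return max(total, 0.0)  # guard against tiny negative rounding errors


def lsm_distance(predicted_a, predicted_b):
    """theta: mean over sites of sqrt(D_JS) between per-site predictions."""
    if len(predicted_a) != len(predicted_b):
        raise ValueError("both states must cover the same N sites")
    roots = [math.sqrt(js_divergence(p, q)) for p, q in zip(predicted_a, predicted_b)]
    return sum(roots) / len(roots)


# Hypothetical predicted distributions phi^i(psi^{-i}) at N = 2 sites.
pred_psi = [{"K": 0.9, "T": 0.1}, {"G": 1.0}]
pred_psi_prime = [{"K": 0.9, "T": 0.1}, {"S": 1.0}]
theta = lsm_distance(pred_psi, pred_psi_prime)
```

Here the first site contributes nothing (identical predictions) and the second contributes the maximum possible value, \(\sqrt{\ln 2}\), because the two predictions have disjoint support.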

g_{ij}(\psi) \;=\; \frac{1}{2}\,\frac{\partial^2}{\partial \psi^i\,\partial \psi^j}\,\theta^2(\psi,\psi')\Biggr|_{\psi'=\psi}
\left \lvert \ln \frac{\Pr(\psi\to \psi')}{\Pr(\psi' \rightarrow \psi')}\right \rvert \le \beta\,\theta(\psi,\psi')

Large Deviation Bound*

 Induced  Riemannian metric tensor

This bound connects the "closeness" of two samples to the odds of perturbing from one to the other, bridging geometry and dynamics
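
As a worked reading of the large deviation bound, exponentiating both sides gives a two-sided constraint on the odds ratio. The value \(\beta\,\theta = 2\) used below is purely illustrative and is not taken from the source.

```latex
e^{-\beta\,\theta(\psi,\psi')} \;\le\; \frac{\Pr(\psi \rightarrow \psi')}{\Pr(\psi' \rightarrow \psi')} \;\le\; e^{\beta\,\theta(\psi,\psi')},
\qquad
\beta\,\theta = 2 \;\Rightarrow\; 0.135 \lesssim \frac{\Pr(\psi \rightarrow \psi')}{\Pr(\psi' \rightarrow \psi')} \lesssim 7.39
```

As \(\theta \to 0\) the ratio is pinned to 1, so two states that are close in the LSM metric are nearly as likely to be reached from one another as to persist.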

Ergodic Projection

\psi_\star \triangleq \bigotimes_{i=1}^N\phi^i\left (\prod_{1}^{N-1}\varnothing\right )

(Sanov's Theorem, Pinsker's Inequality)

\(\psi\)

\(\psi'\)

\(\theta\)

"spatial average":  average of all plausible worldviews or states

* Sizemore, Nicholas, Kaitlyn Oliphant, Ruolin Zheng, Camilia R. Martin, Erika C. Claud, and Ishanu Chattopadhyay. "A digital twin of the infant microbiome to predict neurodevelopmental deficits." Science Advances 10, no. 15 (2024): eadj0400.  https://www.science.org/doi/full/10.1126/sciadv.adj0400

persistence probability

Ergodic dispersion

\Psi_\star = \theta(\psi,\psi_\star)

Central to Model Drift Quantification

Start with opinion vector with all entries missing

This is a standard Physics construct, quantifying curvature of the underlying latent geometry

Pr(\psi \rightarrow \psi')

Easily computable in LSM framework!

Apply \(\phi^i\)

Random variable quantifying dispersion around the spatial average of worldviews

const. scaling as \(N^2\) 

\frac{\delta \theta(x,y)}{\delta y}
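
The ergodic projection described above (start from a state with every entry missing, then apply each \(\phi^i\)) can be sketched as follows. The predictor construction `make_predictor` and all numbers are hypothetical. Each toy predictor reads a single other site, which is an assumption made to keep the example small.

```python
def make_predictor(parent, table, marginal):
    """Hypothetical per-site predictor phi^i that reads one other site."""
    def phi(state):
        # A missing observation (None) or an unseen symbol falls back to the
        # marginal, so the output is always a proper distribution.
        return table.get(state[parent], marginal)
    return phi


def ergodic_projection(predictors):
    """psi_star: apply every phi^i to the state with all entries missing."""
    null_state = [None] * len(predictors)  # each phi^i ignores its own slot
    return [phi(null_state) for phi in predictors]


# Toy two-site model (all numbers are illustrative).
predictors = [
    make_predictor(1, {"G": {"K": 0.9, "T": 0.1}}, {"K": 0.6, "T": 0.4}),
    make_predictor(0, {"K": {"G": 0.8, "S": 0.2}}, {"G": 0.5, "S": 0.5}),
]
psi_star = ergodic_projection(predictors)
```

The ergodic dispersion of a state would then be its LSM distance \(\theta\) to `psi_star`.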

Influenza Risk Assessment Tool (IRAT) scoring for animal strains

slow (months), quasi-subjective, expensive

*https://www.cdc.gov/flu/pandemic-resources/monitoring/irat-virus-summaries.htm

24 scores in 14 years

~10,000 strains collected annually

CDC

Emergenet time: 1 second
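
A back-of-the-envelope comparison of the throughput figures above, assuming the 1 second figure is per strain and that strains are scored sequentially (both are assumptions; the slide does not say):

```latex
\frac{24 \text{ scores}}{14 \text{ years}} \approx 1.7 \text{ scores/year},
\qquad
10{,}000 \text{ strains} \times 1 \text{ s/strain} = 10^{4} \text{ s} \approx 2.8 \text{ hours}
```

Under these assumptions, a full year of collected strains could be scored in an afternoon.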

Digital Twin & Fidelity of Simulation

\mathcal{N}_\epsilon(\psi) \triangleq \big\{ \psi' : {\color{red}\forall i \ \psi'^{\,i} \sim \phi^i\left( \psi^{-i}\right)} \wedge {\color{yellow} \theta(\psi,\psi') \leqq \epsilon }\big\}

Sample predicted distributions   

perturbed state within \(\epsilon\) of \(\psi\)

Digital Twin

\(\epsilon\)-Neighborhood of state \(\psi\)

Definition

Sample neighborhood to impute missing data

LSM sampling: sampling the \(\epsilon\)-neighborhood of a strain reveals local "valid perturbations"

  • Predict new "likely" strains before first observation, given current population
{\color{Tomato}\psi_\star }\rightarrow \psi \rightarrow \cdots \rightarrow \psi'

Null state (all missing observations)

Valid perturbations/ simulations

  • Construct complete strains from scratch (from all "missing" site information)
  • Track likely trajectories through strain space
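
One way to sample the \(\epsilon\)-neighborhood is plain rejection sampling: draw each site from its predicted distribution, then keep the candidate only if it lies within \(\epsilon\) of the starting state. The sketch below assumes this strategy; it is not necessarily how Emergenet samples. The predictors, the helper names, and the choice \(\epsilon = 0.2\) are all hypothetical.

```python
import math
import random


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two dict distributions."""
    total = 0.0
    for s in set(p) | set(q):
        ps, qs = p.get(s, 0.0), q.get(s, 0.0)
        ms = 0.5 * (ps + qs)
        if ps > 0:
            total += 0.5 * ps * math.log(ps / ms)
        if qs > 0:
            total += 0.5 * qs * math.log(qs / ms)
    return max(total, 0.0)


def theta(predictors, state_a, state_b):
    """LSM distance between two states, via their per-site predictions."""
    roots = [math.sqrt(js_divergence(phi(state_a), phi(state_b))) for phi in predictors]
    return sum(roots) / len(roots)


def sample_neighborhood(predictors, state, epsilon, n_samples, rng, max_tries=10_000):
    """Rejection-sample psi' with psi'^i ~ phi^i(psi^{-i}) and theta <= epsilon."""
    predicted = [phi(state) for phi in predictors]
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == n_samples:
            break
        candidate = [rng.choices(list(d), weights=list(d.values()))[0] for d in predicted]
        if theta(predictors, state, candidate) <= epsilon:
            accepted.append(candidate)
    return accepted


def make_predictor(parent, table):
    """Hypothetical predictor: distribution for one site given one other site."""
    return lambda state: table[state[parent]]


# Toy two-site model with strongly coupled sites (illustrative numbers).
predictors = [
    make_predictor(1, {"G": {"K": 0.9, "T": 0.1}, "S": {"K": 0.1, "T": 0.9}}),
    make_predictor(0, {"K": {"G": 0.9, "S": 0.1}, "T": {"G": 0.1, "S": 0.9}}),
]
psi = ["K", "G"]
neighbors = sample_neighborhood(predictors, psi, epsilon=0.2, n_samples=5,
                                rng=random.Random(0))
```

Repeating the call from an accepted neighbor, rather than from `psi`, gives a simple way to trace a trajectory through strain space. Rejection sampling becomes inefficient when \(\epsilon\) is small relative to typical distances, so `max_tries` caps the work.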

 

Note: The LSM can evolve too (the rules can change over time)

Emergent Equations of Motion

L \triangleq \frac{1}{2} \sum_i g_{kl} P^k_p \dot{\psi}^p_i P^l_n \dot{\psi}^n_i - \theta(\psi, \phi)

Define Lagrangian*

\frac{d}{dt} \left( \frac{\partial L}{\partial \dot{\psi}^m_i} \right) - \frac{\partial L}{\partial \psi^m_i} = 0

Via the Euler-Lagrange Equations\(^\dag\):

\ddot{\psi}^m_i = -g^{km} P^k_m \frac{1}{2N} \sum_j \frac{1}{\sqrt{D_{JS}(\psi^m_j \| \phi^m_j)}} \left[ \ln\left( \frac{2e\psi^m_j}{\psi^m_j + \phi^m_j} \right) - \frac{1}{2(\psi^m_j + \phi^m_j)} \right]

Over-damped Gradient flow Equation*

where \(g^{km}\) is the inverse metric tensor

kinetic energy

potential energy

* Einstein notation used

\(^\dag\) Goldstein, Herbert, et al. Classical Mechanics. 3rd ed., Pearson, 2002.

Principle of stationary action

Local potential field eqn

Question:

Why has the Mississippi lineage of Influenza C vanished from human circulation recently, while other lineages continue to exist?

H0

H1

M0

LSM-clustering on human HEV sequences

The three bovine sequences are not part of these clusters, which contain only human ICV HE sequences, but we can still compute the distance from each human sequence to each of the three bovine strains. The cluster that comes closest to the bovine strains is clearly the one labeled M0. The other two clusters are labeled H0 and H1.

Distance of bovine sequences to M0 cluster

'C/Miyagi/2/94',  'C/Saitama/2/2000',  'C/Yamagata/3/2000',  'C/Miyagi/7/93',  'C/Miyagi/4/96',  'C/Saitama/1/2004',  'C/Miyagi/7/96',  'C/Greece/1/79',  'C/Yamagata/5/92',  'C/Miyagi/3/93',  'C/Miyagi/4/93',  'C/Kyoto/41/82',  'C/Nara/82',  'C/Hyogo/1/83',  'C/Miyagi/1/94',  'C/Miyagi/6/93',  'C/Miyagi/3/94',  'C/Mississippi/80',  'C/Yamagata/26/2004',  'C/Mississippi/80'

Variation by Time of collection

Suggests movement from M0 to H0 to H1

Estimation of Cluster Fitness from LSM

\omega(\mathcal{C}) = \frac{1}{\vert \mathcal{C}\vert} \sum_{x \in \mathcal{C}} \log \Pr(x \rightarrow x)

Cluster fitness \(\omega\):
M0: -64.251
H0: -32.586
H1: -15.964

Fitness calculations are based on the Emergenet model and correspond to the estimated log-likelihood of a strain NOT perturbing out of its cluster. The H1 cluster is therefore the most "fit": it is where the strains have moved over time, and it is also the largest cluster in the data. The overlap in collection times between H0 and H1 implies that the difference in cluster sizes is not simply a collection-bias effect. The result is that the strain has disappeared from humans, as the virus found a fitter niche on the landscape.
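
The fitness estimate can be sketched as an average of log persistence probabilities. The site-wise factorization \(\Pr(x \rightarrow x) = \prod_i \phi^i(x^{-i})\vert_{x^i}\) used below is an assumption made for illustration, as are the toy predictors and function names; the slide itself only gives the averaged form.

```python
import math


def log_persistence(predictors, strain):
    """log Pr(x -> x) under the ASSUMED site-wise factorization
    Pr(x -> x) = prod_i phi^i(x^{-i})[x^i]."""
    return sum(math.log(phi(strain)[strain[i]]) for i, phi in enumerate(predictors))


def cluster_fitness(predictors, cluster):
    """omega(C): mean log persistence probability over the strains in C."""
    return sum(log_persistence(predictors, x) for x in cluster) / len(cluster)


# Toy predictors that ignore their input (illustrative numbers only).
predictors = [
    lambda strain: {"K": 0.5, "T": 0.5},
    lambda strain: {"G": 0.25, "S": 0.75},
]
omega = cluster_fitness(predictors, ["KG", "TS"])
```

Values closer to zero indicate strains that are more likely to persist, which matches the ordering M0 < H0 < H1 reported above. The logarithm is well defined only if every predicted distribution assigns nonzero probability to the observed residue.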

Maximal Site Contribution to Fitness Delta

8 75 87 97 141 154 165 178 181 183 203 205 211 216 230 252 327 361 506 588

\{i\} = \operatorname{arg\,max} \; \delta \omega(x)

Local Potential Fields

Local potential fields can be computed given the LSM and dynamical considerations, which reveal future evolution

Stable

(captured by local extrema)

Free to move locally towards extrema

Observation: This lineage (the Mississippi lineage) has been extinct since 2022/23

stable lineage

Future

  • Collaboration with Equine Expertise at UK to allow wet-lab validation

 

 

  • Atlas of viral digital twins
  • Bio-NORAD testbed
  • Pathogenicity assays
  • Future-proof Vaccine design

Feng Li

Professor, William Robert Mills Chair in Equine Infectious Diseases

Maxwell H. Gluck Equine Research Center

Martin-Gatton College of Agriculture, Food and Environment

Influenza

HIV

COVID

CCHF

ishanu_ch@uky.edu
