Generative AI as Engines of Scientific Discovery:

Understanding Viral Emergence to Microbial Ecosystems to Opinion Dynamics and Deception

Ishanu Chattopadhyay, PhD

Assistant Professor of Biomedical Informatics & Computer Science

University of Kentucky

The Laboratory for Zero-knowledge Discovery

ishanu_ch@uky.edu

 

09.10.2025

first wave

 

rule-based systems

(1950s-1980s)

second wave

 

Big Data / ML / Deep Learning

recognize patterns, make predictions, might improve over time, but struggle on tasks not trained for

(1990s - 2010s)

third wave

 

contextual reasoning, generalizable models: towards human-like or trans-human intelligence

(2020s - )

Stamping Out the Next Pandemic **Before** The First Human Infection

"BioNORAD"

Assuming a 1,000-species ecosystem and 1 successful experiment every day to discern a single two-way relationship, we would need 1,368 years to go through all possibilities.

Digital Twin for the Human Microbiome 

The Missing Heritability Problem

How do we explore the darkome?

Autism heritability: 80-90%

GWAS explains: 5-10%

Chattopadhyay, Ishanu, Kevin Wu, Jin Li, and Aaron Esser-Kahn. "Emergenet: Fast Scalable Pandemic Risk Assessment of Influenza A Strains Circulating In Non-human Hosts." (2023). Under Review 

PREEMPT

Predicting Future Mutations for Viral Genomes in the Wild

predict future  emergence risk

Hemagglutinin (HA): mediates cellular entry

Neuraminidase (NA): mediates cellular exit

Surface structures involved in host interaction

*Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. "Unbiased recursive partitioning: A conditional inference framework." Journal of Computational and Graphical Statistics 15, no. 3 (2006): 651-674.

emergent macro-structure

Component predictor (Conditional Inference Tree*)

Example: Influenza A HA protein

Recursive LSM forest

LSM Forest of Conditional Inference Trees*

Revealing Emergent Cross-talk from observed sequence variations

>200,000 HA sequences

LSM Forest

Recursive LSM forest: hyperlinked nodes capturing emergent macro-structures

  • Set of conditional inference trees (CIT)
    • Strict statistical guarantees: quantifies inference uncertainty
  • Each tree models exactly one variable as a function of potentially all other variables
  • Non-leaf nodes are "hyperlinked" to other trees
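
The "one predictor per variable" structure described in the bullets can be sketched in a few lines. This is only an illustration under loud assumptions: the conditional inference trees are replaced by a much simpler stand-in (a lookup table on the single most informative other column), and the names `fit_forest`, `predict`, and the toy `rows` are hypothetical, not part of any published LSM code.

```python
from collections import Counter, defaultdict
import math


def conditional_entropy(rows, i, j):
    """Estimate H(x^i | x^j) in bits from raw counts."""
    joint = Counter((r[j], r[i]) for r in rows)
    parent = Counter(r[j] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / parent[vj]) for (vj, _), c in joint.items())


def fit_forest(rows):
    """Fit one stand-in predictor per column i (NOT a conditional inference tree):
    a lookup table keyed on the other column j that minimizes H(x^i | x^j)."""
    n_cols = len(rows[0])
    forest = []
    for i in range(n_cols):
        j = min((k for k in range(n_cols) if k != i),
                key=lambda k: conditional_entropy(rows, i, k))
        table = defaultdict(Counter)
        for r in rows:
            table[r[j]][r[i]] += 1
        marginal = Counter(r[i] for r in rows)
        forest.append((j, dict(table), marginal))
    return forest


def predict(forest, i, state):
    """Distribution over the alphabet of column i given the other entries of
    `state`; None marks a missing observation and falls back to the marginal."""
    j, table, marginal = forest[i]
    counts = table.get(state[j], marginal)
    total = sum(counts.values())
    return {sym: c / total for sym, c in counts.items()}


# Toy data: 4 samples, 3 observables (hypothetical values).
rows = [("K", "A", "T"), ("K", "A", "T"), ("G", "C", "T"), ("G", "C", "S")]
forest = fit_forest(rows)
print(predict(forest, 0, (None, "A", "T")))
```

In this stand-in, the chosen column `j` plays a role loosely analogous to a hyperlink: predictor `i` depends on a variable that has its own predictor in the same forest.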

H3
Northern Hemisphere

2021

Influenza A: HA

H5

2013

Influenza A: HA

Large Science Models: Mathematical Framework

\begin{aligned}
\text{Observables:} \quad & X = \{x^1, \ldots, x^N\}, \quad \overbrace{x^i \in \Sigma^i}^{\text{finite alphabet}} \\
\text{Notation:} \quad & x^{-i} = \{x^j : j \ne i\} \\
\text{Crosstalk:} \quad & \forall i \ \ P(x^i) = f_i(x^{-i}) \\
\text{System state:} \quad & \psi = \bigotimes_{i=1}^N \psi^i, \quad \psi^i \in \mathscr{D}(\Sigma^i) \cup \{\varnothing\} \\
% Degenerate case: \psi^i is a delta distribution (fully observed)
\text{Notation:} \quad & \psi^{-i} = \bigotimes_{j \ne i} \psi^j
\end{aligned}
Data matrix: rows are samples (strain 1, strain 2, ..., strain m); columns are observables (222, 223, 224, ..., 560)

Distributions over alphabet \(\Sigma^i\)

cross-talk

Example: HA Site 223 on Influenza A

\(\Sigma^i = \{K, G, Y, S, T\}\)

\(\psi^i\) panels: population, individual, missing observation

We want insight, not just a classifier.

We want to discover the "physics" of  complex systems

Large Science Models: Mathematical Framework

\phi = \bigotimes_{i=1}^N \phi^i, \quad \phi^i(\psi^{-i}) \in \mathscr{D}(\Sigma^i)

Individual Predictor (CIT)

Digital Twin

\phi(\psi) \vert \vert \psi

Tension between predicted and observed distribution drives change

estimate is always a non-empty non-degenerate distribution

\phi^i(\psi^{-i}) \sim \widetilde{\psi}^i

\(\phi\) estimates \(\psi\)

Large Science Models: Properties

LSM-Distance Metric*

\theta(\psi,\psi') \triangleq \frac{1}{N}\sum_{i=1}^{N} \sqrt{D_{JS}\Bigl(\phi^i(\psi^{-i}) \vert \vert \phi^i(\psi'^{-i})\Bigr)}

 where \(D_{JS}(P\vert \vert Q)\) is the Jensen-Shannon divergence.
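
As a rough illustration of the distance above, the sketch below averages the square root of the Jensen-Shannon divergence over sites. It assumes the predicted distributions \(\phi^i(\psi^{-i})\) are already available as dictionaries, and it assumes base-2 logarithms (so each term lies in [0, 1]); the slide does not fix the base, and the function names are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two dict-valued distributions."""
    js = 0.0
    for k in set(p) | set(q):
        pk, qk = p.get(k, 0.0), q.get(k, 0.0)
        mk = 0.5 * (pk + qk)
        if pk > 0:
            js += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            js += 0.5 * qk * math.log2(qk / mk)
    return max(js, 0.0)  # guard against tiny negative rounding error


def lsm_distance(pred_a, pred_b):
    """theta: mean over sites of sqrt(JS) between per-site predicted distributions."""
    if len(pred_a) != len(pred_b):
        raise ValueError("both states need one predicted distribution per site")
    return sum(math.sqrt(js_divergence(p, q))
               for p, q in zip(pred_a, pred_b)) / len(pred_a)


# Two sites: predictions disagree completely at site 0 and agree at site 1.
pred_a = [{"K": 1.0}, {"A": 0.5, "C": 0.5}]
pred_b = [{"G": 1.0}, {"A": 0.5, "C": 0.5}]
print(lsm_distance(pred_a, pred_b))
```

With base-2 logarithms the first site contributes 1 and the second contributes 0, so the toy distance is 0.5.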

Induced Riemannian metric tensor

g_{ij}(\psi) \;=\; \frac{1}{2}\,\frac{\partial^2}{\partial \psi^i\,\partial \psi^j}\,\theta^2(\psi,\psi')\Biggr|_{\psi'=\psi}

This is a standard Physics construct, quantifying curvature of the underlying latent geometry

Large Deviation Bound*

\left \lvert \ln \frac{\Pr(\psi\to \psi')}{\Pr(\psi' \rightarrow \psi')}\right \rvert \le \beta\,\theta(\psi,\psi')

where \(\Pr(\psi \rightarrow \psi')\) is the probability of perturbing from \(\psi\) to \(\psi'\), \(\Pr(\psi' \rightarrow \psi')\) is the persistence probability, and \(\beta\) is a constant scaling as \(N^2\)

This bound connects "closeness" of samples to the odds of perturbing from one to the other, bridging geometry to dynamics

(Sanov's Theorem, Pinsker's Inequality)


Large Science Models: Properties

Ergodic Projection

\psi_\star \triangleq \bigotimes_{i=1}^N\phi^i\left (\prod_{1}^{N-1}\varnothing\right )

"spatial average":  average of all plausible worldviews or states

* Sizemore, Nicholas, Kaitlyn Oliphant, Ruolin Zheng, Camilia R. Martin, Erika C. Claud, and Ishanu Chattopadhyay. "A digital twin of the infant microbiome to predict neurodevelopmental deficits." Science Advances 10, no. 15 (2024): eadj0400.  https://www.science.org/doi/full/10.1126/sciadv.adj0400

Ergodic dispersion

\Delta_\star = \theta(\Psi,\psi_\star)

Central to Model Drift Quantification

Start with opinion vector with all entries missing

Easily computable in LSM framework!

Apply \(\phi^i\)

Random variable quantifying dispersion around the spatial average of samples

Digital Twin & Perturbation Simulation

\mathcal{N}_\epsilon(\psi) \triangleq \big\{ \psi' : \forall i \ \ \psi'^i \sim \phi^i\left( \psi^{-i}\right) \ \wedge \ \theta(\psi,\psi') \leq \epsilon \big\}

Sample predicted distributions   

perturbed state within \(\epsilon\) of \(\psi\)

Digital Twin

\(\epsilon\)-Neighborhood of state \(\psi\)

Definition

Sample neighborhood to impute missing data


LSM sampling: sampling the \(\epsilon\)-neighborhood of a strain reveals local "valid perturbations"

  • Predict new "likely" strains before first observation, given current population

  • Construct complete strains from scratch (from all "missing" site information)

  • Track likely trajectories through strain space
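
The \(\epsilon\)-neighborhood sampling described above can be sketched as rejection sampling: draw each site from its predicted distribution, then keep the candidate only if it lies within \(\epsilon\) of the starting state. Everything concrete here is an assumption for illustration: the two-letter alphabet, the predictor `toy_phi` (a smoothed frequency count standing in for a trained \(\phi^i\)), base-2 logarithms, and fully observed states.

```python
import math
import random

ALPHABET = "AB"  # toy alphabet standing in for Sigma^i


def toy_phi(state, i):
    """Hypothetical stand-in for phi^i(psi^{-i}): Laplace-smoothed symbol
    frequencies over all sites of `state` except site i."""
    others = [s for k, s in enumerate(state) if k != i]
    denom = len(others) + len(ALPHABET)
    return {a: (others.count(a) + 1) / denom for a in ALPHABET}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; smoothing keeps all entries positive."""
    js = 0.0
    for a in ALPHABET:
        m = 0.5 * (p[a] + q[a])
        js += 0.5 * p[a] * math.log2(p[a] / m) + 0.5 * q[a] * math.log2(q[a] / m)
    return max(js, 0.0)


def theta(state_a, state_b):
    """LSM distance between two fully observed states under toy_phi."""
    n = len(state_a)
    return sum(math.sqrt(js_divergence(toy_phi(state_a, i), toy_phi(state_b, i)))
               for i in range(n)) / n


def sample_neighborhood(state, eps, n_draws, rng):
    """Draw candidates site by site from phi^i(psi^{-i}); keep those within eps."""
    accepted = []
    for _ in range(n_draws):
        candidate = tuple(
            rng.choices(ALPHABET, weights=[toy_phi(state, i)[a] for a in ALPHABET])[0]
            for i in range(len(state))
        )
        if theta(state, candidate) <= eps:
            accepted.append(candidate)
    return accepted


rng = random.Random(0)
state = ("A", "A", "A", "B")
neighbors = sample_neighborhood(state, eps=0.2, n_draws=200, rng=rng)
print(len(neighbors), "of 200 draws fall within eps of the state")
```

Rejection sampling is the simplest choice here, not necessarily the one used in practice; for small \(\epsilon\) most draws are discarded.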

{\color{Tomato}\psi_\star }\rightarrow \psi \rightarrow \cdots \rightarrow \psi'

Null state (all missing observations)

Valid perturbations/ simulations

\frac{\delta \theta(x,y)}{\delta y}

Influenza Risk Assessment Tool (IRAT) scoring for animal strains

slow (months), quasi-subjective, expensive

*https://www.cdc.gov/flu/pandemic-resources/monitoring/irat-virus-summaries.htm

24 scores in 14 years

~10,000 strains collected annually

CDC

Emergenet time: 1 second

Stamping Out the Next Pandemic **Before** The First Human Infection

"BioNORAD"

Uncovering "Physics"

Emergent Equations of Motion

Define Lagrangian*

L \triangleq \underbrace{\frac{1}{2} \sum_i g_{kl} P^k_p \dot{\psi}^p_i P^l_n \dot{\psi}^n_i}_{\text{kinetic energy}} - \underbrace{\theta(\psi, \phi)}_{\text{potential energy}}

Via the Euler-Lagrange Equations\(^\dag\) (principle of stationary action):

\frac{d}{dt} \left( \frac{\partial L}{\partial \dot{\psi}^m_i} \right) - \frac{\partial L}{\partial \psi^m_i} = 0

Over-damped Gradient flow Equation* (local potential field eqn)

\ddot{\psi}^m_i = -g^{km} P^k_m \frac{1}{2N} \sum_j \frac{1}{\sqrt{D_{JS}(\psi^m_j \| \phi^m_j)}} \left[ \ln\left( \frac{2e\psi^m_j}{\psi^m_j + \phi^m_j} \right) - \frac{1}{2(\psi^m_j + \phi^m_j)} \right]

where \(g^{km}\) is the inverse metric tensor

* Einstein notation used

\(^\dag\) Goldstein, Herbert, et al. Classical Mechanics. 3rd ed., Pearson, 2002.

Question:

Why has the Mississippi lineage of Influenza C vanished from human circulation recently, while other lineages continue to exist?

Feng Li

William Robert Mills Chair in Equine Infectious Diseases

Maxwell H. Gluck Equine Research Center

Martin-Gatton College of Agriculture, Food and Environment

LSM-clustering on human HEV sequences

Clusters: H0, H1, M0

Clusters from Human ICV strains

bovine strains are closest to this

Distance of bovine sequences to M0 cluster

'C/Miyagi/2/94',  'C/Saitama/2/2000',  'C/Yamagata/3/2000',  'C/Miyagi/7/93',  'C/Miyagi/4/96',  'C/Saitama/1/2004',  'C/Miyagi/7/96',  'C/Greece/1/79',  'C/Yamagata/5/92',  'C/Miyagi/3/93',  'C/Miyagi/4/93',  'C/Kyoto/41/82',  'C/Nara/82',  'C/Hyogo/1/83',  'C/Miyagi/1/94',  'C/Miyagi/6/93',  'C/Miyagi/3/94',  'C/Mississippi/80',  'C/Yamagata/26/2004',  'C/Mississippi/80'

Mississippi Lineage

Variation by Time of collection

Suggests movement from M0 to H0 to H1

Estimation of Cluster Fitness from LSM

\omega(\mathcal{C}) =\frac{1}{\vert \mathcal{C}\vert} \sum_{x \in \mathcal{C}}\log Pr(x \rightarrow x)
M0 -64.251
H0 -32.586
H1 -15.964
  • Fitness corresponds to the estimated log-likelihood of a strain NOT perturbing out of the cluster.
  • Thus the H1 cluster, towards which the strains have moved over time, is the most "fit".
  • Overlap of collection times between H0 and H1 implies this is not simply a collection bias effect (the sizes of the clusters).
  • This has resulted in the strain disappearing from humans, as the virus found a fitter niche on the landscape.
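
A minimal sketch of the cluster fitness \(\omega(\mathcal{C})\) follows. It assumes, as an illustration only, that the persistence probability \(Pr(x \rightarrow x)\) factorizes as the product over sites of the probability each site predictor assigns to the symbol actually observed there; the slides do not spell out this factorization, and the function names and numbers are hypothetical.

```python
import math


def log_persistence(self_probs):
    """log Pr(x -> x), ASSUMING it factorizes over sites.

    `self_probs[i]` is the probability that predictor i assigns to the symbol
    actually observed at site i of strain x."""
    return sum(math.log(p) for p in self_probs)


def cluster_fitness(cluster):
    """omega(C): mean log persistence probability over the strains in a cluster."""
    return sum(log_persistence(strain) for strain in cluster) / len(cluster)


# Toy cluster: two strains, two sites each (made-up probabilities).
toy_cluster = [[0.5, 0.5], [0.25, 1.0]]
print(cluster_fitness(toy_cluster))
```

Values closer to zero mean strains are more likely to stay put, which matches the ordering M0 < H0 < H1 reported above.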

Maximal Site Contribution to Fitness Delta

8 75 87 97 141 154 165 178 181 183 203 205 211 216 230 252 327 361 506 588

i = \operatorname*{arg\,max} \, \delta \omega(x)

Local Potential Fields

Stable

(captured by local extrema)

Free to move locally towards extrema

Observation: This lineage (Mississippi lineage) has been extinct since 2022/23

stable lineage


THE PROBLEM

Assuming a 1,000-species ecosystem and 1 successful experiment every day to discern a single two-way relationship, we would need 1,368 years to go through all possibilities.
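
The 1,368-year figure can be checked with a short calculation, assuming "two-way relationship" means an unordered pair of species and a 365-day year:

```python
import math

n_species = 1000
pairs = math.comb(n_species, 2)   # unordered species pairs
years = pairs / 365               # one successful experiment per day

print(pairs, round(years, 1))     # 499500 1368.5
```

This counts pairwise relationships only; higher-order interactions would make the number far larger.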

Digital Twin for the Maturing Human Microbiome 

  • Forecast microbiome maturation trajectories

 

  • Predict neurodevelopmental deficits

Boston U

U Chicago 

Two centers

Ability to "fill in" missing data is equivalent to making trajectory forecasts

predicting neurodevelopmental deficits

forecasting ecosystem trajectories

What other problems can it solve?

Phase 1

Phase 2

PREPARE: Pioneering Research for Early Prediction of Alzheimer's and Related Dementias EUREKA Challenge

Algorithm for early diagnosis

Find Data for early prediction

Second Prize 40,000 USD

Let's give them:

  • Clinical data for 1M patients aged 60-80 years diagnosed with ADRD/AD
  • 1M African-American patients from Chicagoland
  • Open source, under the GNU General Public License

licensed patient data

digital twin

(generative AI)

teomims

(open cohort)

Uncorrelated, yet indistinguishable!

Can a Generative AI Tell If You Are Lying?

VeRITaAS: Vetting Response Integrity from cross-Talk in Adversarial Surveys

Data set: 300 participants with confirmed PTSD answering a 211-question set based on PCL5+

LSM

Hidden structure of cross-talk between responses to interview items

PTSD diagnostic interview

Number of possible responses

Minimum Performance (n=624)

Average Time: 3.5 min

No. of questions: 20

AUC > 0.95

PPV > 0.86

NPV > 0.92

At least 83.3% sensitivity at 94% specificity

Minimum AUC = \(0.95 \pm 0.005\)

Cannot be coached or memorized

Datasets for training & validation

1. VA (n=294)

2. Prolific (n=300)

3. Psychiatrists (n=30)

10^{25}

Beat the test!

Can-You-Fake-PTSD Challenge Results (successful attempts)

  • 200 participants in US: 10
  • 100 participants in UK: 6
  • 30 forensic psychiatrists: 1

A General Framework to analyze complex systems

  • Develop foundation models of complex systems with
    • hundreds to thousands of evolving variables with a priori unknown cross-talk
    • no governing equations known a priori
    • reflexivity: the system changes if observed

LSM for Social Surveys (GSS)

  • Set of conditional inference trees (CIT)
    • Strict statistical guarantees: quantifies inference uncertainty
  • Each tree models exactly one variable as a function of potentially all other variables
  • Non-leaf nodes are "hyperlinked" to other trees

Large Science Models

GSS 2018 dataset

  • Each predictor is inferred independently
  • Can scale up to thousands of variables in Python implementation
  • Further scale-up to \(10^6 - 10^8\) variables needs a C/C++ implementation

Full Example  of Hyperlinked Trees

Global Emergent Structure via Clusters & Poles

2018 GSS

\theta_t(\psi_+,\psi_-)

Polar separation over time

2016 Presidential Election Vote Prediction

2004

GSS variable | conservative pole (\(\psi_+\)) | liberal pole (\(\psi_-\))
abany | no | yes
abdefctw | always wrong | not wrong at all
abdefect | no | yes
abhlth | no | yes
abnomore | no | yes
abpoor | no | yes
abpoorw | always wrong | not wrong at all
abrape | no | yes
absingle | no | yes
bible | inspired word | book of fables
colcom | fired | not fired
colmil | not fired | not allowed
comfort | strongly agree | strongly disagree
conlabor | hardly any | a great deal
godchnge | believe now, always have | don't believe now, never have
grass | not legal | legal
gunlaw | oppose | favor
intmil | very interested | not at all interested
libcom | remove | not remove
libmil | not remove | remove
maboygrl | true | false
owngun | yes | no
pillok | agree | strongly agree
pilloky | strongly disagree | strongly agree
polabuse | no | yes
pray | several times a day | never
prayer | disapprove | approve
prayfreq | several times a day | never
religcon | strongly disagree | strongly agree
religint | strongly disagree | strongly agree
reliten | strong | no religion
rowngun | yes | no
shotgun | yes | no
spkcom | not allowed | allowed
spkmil | allowed | not allowed
taxrich | about right | much too low

Clustering LSM distance \(\theta(x,y)\) between out-of-sample individuals

conservative

liberal

Poles: partial states aligning with extreme opposing worldviews

  • Compare across time and different GSS surveys
  • Derived features for individuals (ideology index)
I(x) = \frac{\theta(x,\psi_+) - \theta(x,\psi_-)}{\theta(\psi_+,\psi_-)}

Predict 2016 votes using ideology index
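
The ideology index \(I(x)\) is a simple ratio of LSM distances. The sketch below assumes the three distances have already been computed; the function name and the numbers are hypothetical.

```python
import math


def ideology_index(d_plus, d_minus, d_poles):
    """I(x) = (theta(x, psi_+) - theta(x, psi_-)) / theta(psi_+, psi_-).

    d_plus:  distance from individual x to the conservative pole psi_+
    d_minus: distance from individual x to the liberal pole psi_-
    d_poles: distance between the two poles (must be positive)
    """
    if d_poles <= 0:
        raise ValueError("the poles must be separated by a positive distance")
    return (d_plus - d_minus) / d_poles


# Made-up distances: x sits closer to psi_+ than to psi_-.
index = ideology_index(d_plus=0.2, d_minus=0.6, d_poles=0.8)
print(index)
```

With this sign convention, negative values indicate an individual closer to the conservative pole \(\psi_+\) and positive values one closer to the liberal pole \(\psi_-\).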

Emergent global structure

Reflexivity and State Collapse on Observation

state collapse

Figure (two panels): response distributions over strongly agree / agree / neutral / disagree / strongly disagree; query/observation of \(X_i\)

Non-local Influence propagation on measurement/observation (QM-like)

\phi^i(\psi^{-i})

Psychohistory: A Mathematical Theory of Passive Persuasion

Local Potential Fields

Stable

(captured by local extrema)

Free to move locally towards extrema

Why propaganda works so well

* Bail, Christopher A., et al. "Exposure to opposing views on social media can increase political polarization." PNAS 115, no. 37 (2018): 9216-9221. DOI: 10.1073/pnas.1804840115

GSS 2018 individuals and  neighborhoods

Influenza C :  strains and their neighborhoods

Even random perturbations will tend to move individuals towards local extrema, increasing polarization*

  • Polarization is "easy", can occur via random perturbations (falling into the local well)

Hypotheses

Observation: This lineage (Mississippi lineage) has been extinct since 2022/23

stable lineage

Implications on Social Theory

The LSM tells the latent opinion "space-time" how to curve, the curved "space-time" tells opinions how to change.

Local potential fields can be computed given the LSM and dynamical considerations, which reveal future evolution

  • De-polarization is "hard", needs specific communication (climbing up from the well)

Data Sufficiency  via Conservation of Complexity


The No-cheating Theorem: Generative models cannot cheat on complexity

Kolmogorov Complexity

Optimal Generative Model

compressed data representation

compressed model representation

Theorem

K(\textrm{data}) = K(\textrm{LSM}) +O(1)

Conservation Law arising from the continuous symmetry of typicality*

\mu_0(X) \triangleq \frac{\delta(\vert \langle S(X) \rangle \vert)}{\delta(\vert \langle X \rangle \vert)} \leq 1

Saturation relation:

Data Sufficiency Statistic \(\mu_0\)

We need LSM-sampling to calculate this

*Noether's Theorem: For every continuous symmetry of a physical system, there exists a corresponding conserved quantity

\vert \langle X' \rangle \vert \approx \max\{1,\mu_0(X)\} \vert \langle X \rangle \vert

How much more data do we need?

Data saturation

Data deficient

Needed

Current

Empirical Validation

Model Drift Quantification

Ergodic dispersion

\Delta_\star = \theta(\Psi,\psi_\star)
z(\Delta_\star^{[t]}) = \frac{\Delta_\star^{[t]} - \langle \Delta_\star^{[t]} \rangle}{\sigma(\Delta_\star^{[t]} )}

Z-value of dispersion

Do new samples (survey respondents) still conform to the model?

GSS Model drift

ergodic projection (all missing values)

A random belief state (with possibly missing entries)

random variable

normal variate

\zeta(M) = \vert z(\Delta_\star^{[0]}) - z(\Delta_\star^{[t]}) \vert

Model drift stochastic process (\(\zeta\))

\mathbf{E}(\zeta(M) )

assess if \(\zeta\) is stationary: if not then new samples are not conforming to model

Example for GSS LSM inferred for year 2000

Large Science Models: Broader Applications

A General Framework for modeling Complex Systems

Genomic database: Missing heritability problem

Personalized Clinical Digital Twin, Virtual Patients

Any structured interview, PTSD fabrication

Assess symptom data and co-pathologies

Predict future mutations; which animal strain is closest to jumping to humans

Mental health diagnosis

Microbiome Analysis**

Algorithmic lie detector

Viral emergence

Teomims

Opinion Dynamics

Darkome

Generative model of complex microbial ecosystems, and their impact on health and disease

Data requirements

  • Tabular data
  • Potentially large number of features/covariates (\(10^2 - 10^8 \))
  • Sufficient number of samples (\(10^3 - 10^6\))
  • Small number of longitudinal samples (currently, \( < 100\))

LSMs for complex systems

**published (https://www.science.org/doi/10.1126/sciadv.adj0400)

ishanu_ch@uky.edu

Rotaru, Victor, Yi Huang, Timmy Li, James Evans, and Ishanu Chattopadhyay. "Event-level prediction of urban crime reveals a signature of enforcement bias in US cities." Nature human behaviour 6, no. 8 (2022): 1056-1068.

Large Science Models for Scientific Discovery

By Ishanu Chattopadhyay
