ML | Data Science Biomedical Informatics | Social Science | Assistant Professor
Ishanu Chattopadhyay, PhD
Assistant Professor of Biomedical Informatics & Computer Science
University of Kentucky
The Laboratory for Zero-knowledge Discovery
ishanu_ch@uky.edu
09.10.2025
first wave (1950s-1980s): rule-based systems
second wave (1990s-2010s): Big Data / ML / Deep Learning; recognize patterns, make predictions, might improve over time, but struggle on tasks not trained for
third wave (2020s- ): contextual reasoning, generalizable models; towards human-like or trans-human intelligence
Stamping Out the Next Pandemic **Before** The First Human Infection
Assuming a 1000 species ecosystem, and 1 successful experiment every day to discern a single two-way relationship, we would need 1,368 years to go through all possibilities.
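The 1,368-year figure follows from counting unordered species pairs. A quick check, assuming 365-day years and one pairwise relationship resolved per day:

```python
from math import comb

species = 1000
pairs = comb(species, 2)   # unordered two-way relationships among 1000 species
years = pairs / 365        # one successful experiment per day

print(pairs, int(years))
```

Only pairwise relationships are counted here; higher-order interactions would make the number larger still.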
How do we explore the darkome?
Autism heritability: 80-90%
GWAS explains: 5-10%
Chattopadhyay, Ishanu, Kevin Wu, Jin Li, and Aaron Esser-Kahn. "Emergenet: Fast Scalable Pandemic Risk Assessment of Influenza A Strains Circulating In Non-human Hosts." (2023). Under Review
PREEMPT
predict future emergence risk
Hemagglutinin
Neuraminidase
Mediates Cellular Entry
Surface structures involved in host interaction
Mediates Cellular Exit
*Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. "Unbiased recursive partitioning: A conditional inference framework." Journal of Computational and Graphical Statistics 15, no. 3 (2006): 651-674.
emergent macro-structure
Component predictor (Conditional Inference Tree*)
Example: Influenza A HA protein
Recursive
LSM
forest
Revealing Emergent Cross-talk from observed sequence variations
>200,000 HA sequences
Influenza A: HA, H3, Northern Hemisphere, 2021
Influenza A: HA, H5, 2013
| | 222 | 223 | 224 | ... | 560 |
|---|---|---|---|---|---|
| strain 1 | | | | | |
| strain 2 | | | | | |
| ... | | | | | |
| strain m | | | | | |
observables
samples
Distributions over alphabet \(\Sigma^i\)
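A minimal sketch of the data layout above: rows are strains (samples), columns are sequence positions (observables), and each column has an empirical distribution over its alphabet. The toy alignment and the helper name `site_distribution` are illustrative assumptions, not data or code from the talk.

```python
from collections import Counter

# Hypothetical toy alignment: one string per strain, one character per site.
strains = [
    "KGYS",
    "KGYT",
    "KSYT",
    "KGFT",
]

def site_distribution(rows, site):
    """Empirical distribution over the residues observed at one column."""
    counts = Counter(row[site] for row in rows)
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}

for i in range(len(strains[0])):
    print(i, site_distribution(strains, i))
```

These per-site marginals ignore dependence between sites; that dependence is the cross-talk the per-site predictors are meant to capture.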
cross-talk
Example: HA Site 223 on Influenza A
\(\psi^i\)
| K | G | Y | S | T |
population
individual
missing observation
Individual Predictor (CIT)
Tension between predicted and observed distribution drives change
estimate is always a non-empty non-degenerate distribution
\(\phi\) estimates \(\psi\)
where \(D_{JS}(P\vert \vert Q)\) is the Jensen-Shannon divergence.
(Sanov's Theorem, Pinsker's Inequality)
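For reference, the Jensen-Shannon divergence between a predicted and an observed distribution over a residue alphabet can be computed as below. This is the standard definition in base-2 logarithms; the two example distributions are made up for illustration.

```python
from math import log2

def kl(p, q):
    """Kullback-Leibler divergence in bits; q must be positive wherever p is."""
    return sum(p[a] * log2(p[a] / q[a]) for a in p if p[a] > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions given as dicts."""
    alphabet = set(p) | set(q)
    p = {a: p.get(a, 0.0) for a in alphabet}
    q = {a: q.get(a, 0.0) for a in alphabet}
    m = {a: 0.5 * (p[a] + q[a]) for a in alphabet}  # equal-weight mixture
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical predicted (phi) and observed (psi) distributions at one site.
phi = {"K": 0.7, "G": 0.2, "S": 0.1}
psi = {"K": 0.1, "G": 0.8, "T": 0.1}
print(js_divergence(phi, psi))
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (disjoint supports), and it is symmetric in its arguments.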
persistence probability
This is a standard Physics construct, quantifying curvature of the underlying latent geometry
const. scaling as \(N^2\)
\(\psi\)
\(\psi'\)
\(\theta\)
"spatial average": average of all plausible worldviews or states
* Sizemore, Nicholas, Kaitlyn Oliphant, Ruolin Zheng, Camilia R. Martin, Erika C. Claud, and Ishanu Chattopadhyay. "A digital twin of the infant microbiome to predict neurodevelopmental deficits." Science Advances 10, no. 15 (2024): eadj0400. https://www.science.org/doi/full/10.1126/sciadv.adj0400
Central to Model Drift Quantification
Start with opinion vector with all entries missing
Easily computable in LSM framework!
Apply \(\phi^i\)
Random variable quantifying dispersion around the spatial average of samples
Sample predicted distributions
perturbed state within \(\epsilon\) of \(\psi\)
Definition
Sample neighborhood to impute missing data
LSM sampling: sampling the \(\epsilon\)-neighborhood of a strain reveals local "valid perturbations"
Null state (all missing observations)
Valid perturbations/ simulations
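The idea of filling in missing observations by sampling from per-site predicted distributions can be sketched as follows. Everything here is a toy stand-in: `toy_predictor` is a hypothetical replacement for the per-site predictors \(\phi^i\) (which in the talk are conditional inference trees), and the neighbour-based weighting is invented purely so the example runs.

```python
import random

ALPHABET = "KGYST"  # toy alphabet, assumed for illustration

def toy_predictor(site, state):
    """Hypothetical stand-in for phi^i: a distribution for `site` given the state.

    Every residue keeps some mass, so the estimate is never degenerate.
    """
    weights = {a: 1.0 for a in ALPHABET}
    for j in (site - 1, site + 1):
        if 0 <= j < len(state) and state[j] is not None:
            weights[state[j]] += 3.0  # invented "cross-talk" with neighbouring sites
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def impute(state, predictor, rng):
    """Fill each missing entry (None) by sampling from its predicted distribution."""
    filled = list(state)
    for i, value in enumerate(filled):
        if value is None:
            dist = predictor(i, filled)
            residues = list(dist)
            probs = [dist[a] for a in residues]
            filled[i] = rng.choices(residues, weights=probs)[0]
    return filled

rng = random.Random(0)
partial = ["K", None, "Y", None, "T"]   # a state with missing observations
null_state = [None] * 5                 # null state: all observations missing
print(impute(partial, toy_predictor, rng))
print(impute(null_state, toy_predictor, rng))
```

Observed entries are left untouched, so repeated calls on the same partial state give different completions of the same observations.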
Influenza Risk Assessment Tool (IRAT) scoring for animal strains
slow (months), quasi-subjective, expensive
*https://www.cdc.gov/flu/pandemic-resources/monitoring/irat-virus-summaries.htm
24 scores in 14 years
~10,000 strains collected annually
CDC
Emergenet time: 1 second
BioNorad
Via the Euler-Lagrange Equations\(^\dag\):
Define Lagrangian*
Over-damped Gradient flow Equation*
where \(-g^{km}\) is the inverse metric tensor
kinetic energy
potential energy
* Einstein notation used
Goldstein, Herbert, et al. Classical Mechanics. 3rd ed., Pearson, 2002.
\(^\dag\) Principle of stationary action
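A minimal numerical sketch of over-damped gradient flow, assuming the generic form \(\dot{x}^k = -g^{km}\,\partial_m V\) reduced to one dimension with an identity metric. The double-well potential is a hypothetical choice for illustration, not the potential inferred in the talk.

```python
def potential(x):
    """Hypothetical double-well potential with minima at x = -1 and x = +1."""
    return (x * x - 1.0) ** 2

def grad_potential(x):
    """Derivative dV/dx of the double-well potential."""
    return 4.0 * x * (x * x - 1.0)

def gradient_flow(x0, inverse_metric=1.0, dt=0.01, steps=2000):
    """Forward-Euler integration of dx/dt = -g^{-1} dV/dx in one dimension."""
    x = x0
    for _ in range(steps):
        x -= dt * inverse_metric * grad_potential(x)
    return x

for start in (-0.3, 0.2, 1.7):
    print(start, round(gradient_flow(start), 4))
```

Each starting point slides downhill and settles in the nearest minimum of the potential, which is the sense in which states are "captured by local extrema".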
Question:
Why has the Mississippi lineage of Influenza C vanished from human circulation recently, while other lineages continue to exist?
William Robert Mills Chair in Equine Infectious Diseases
Clusters: M0, H0, H1
bovine strains are closest to this
Distance of bovine sequences to M0 cluster
'C/Miyagi/2/94', 'C/Saitama/2/2000', 'C/Yamagata/3/2000', 'C/Miyagi/7/93', 'C/Miyagi/4/96', 'C/Saitama/1/2004', 'C/Miyagi/7/96', 'C/Greece/1/79', 'C/Yamagata/5/92', 'C/Miyagi/3/93', 'C/Miyagi/4/93', 'C/Kyoto/41/82', 'C/Nara/82', 'C/Hyogo/1/83', 'C/Miyagi/1/94', 'C/Miyagi/6/93', 'C/Miyagi/3/94', 'C/Mississippi/80', 'C/Yamagata/26/2004', 'C/Mississippi/80'
Mississippi Lineage
Suggests movement from M0 to H0 to H1
| | |
|---|---|
| M0 | -64.251 |
| H0 | -32.586 |
| H1 | -15.964 |
8 75 87 97 141 154 165 178 181 183 203 205 211 216 230 252 327 361 506 588
Stable
(captured by local extrema)
Free to move locally towards extrema
Observation: The Mississippi lineage has been extinct since 2022/23
stable lineage
Local potential field eqn
THE PROBLEM
Assuming a 1000 species ecosystem, and 1 successful experiment every day to discern a single two-way relationship, we would need 1,368 years to go through all possibilities.
Digital Twin for the Maturing Human Microbiome
Boston U
U Chicago
Two centers
Ability to "fill in" missing data is equivalent to making trajectory forecasts
predicting neurodevelopmental deficits
forecasting ecosystem trajectories
Phase 1
Phase 2
PREPARE: Pioneering Research for Early Prediction of Alzheimer's and Related Dementias EUREKA Challenge
Algorithm for early diagnosis
Find Data for early prediction
Phase 1
Phase 2
Second Prize 40,000 USD
Let's give them:
licensed patient data
digital twin
(generative AI)
Teomims
(open cohort)
Phase 1
Phase 2
Uncorrelated, yet indistinguishable!
Can A Generative AI Tell if you Are Lying?
VeRITaAS: Vetting Response Integrity from cross-Talk in Adversarial Surveys
Data set: 300 participants with confirmed PTSD answering a 211-question set based on PCL5+
LSM
Hidden structure of cross-talk between responses to interview items
PTSD diagnostic interview
Number of possible responses
Minimum Performance (n=624)
Average Time: 3.5 min
No. of questions: 20
AUC > 0.95
PPV > 0.86
NPV > 0.92
At least 83.3% sensitivity at 94% specificity
Minimum AUC = \(0.95 \pm 0.005\)
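Sensitivity, specificity, PPV and NPV all derive from one confusion matrix. The sketch below uses the standard definitions; the counts are hypothetical and chosen only for illustration, they are not the study's data.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard screening metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Hypothetical counts for illustration only.
metrics = diagnostic_metrics(tp=80, fp=10, tn=90, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

PPV and NPV depend on how common the condition is in the tested group, so they shift with prevalence even when sensitivity and specificity stay fixed.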
Cannot be coached or memorized
Datasets for training & validation
1. VA (n=294)
2. Prolific (n=300)
3. Psychiatrists (n=30)
Beat the test!
200 participants in the US
100 participants in the UK
30 forensic psychiatrists
10
6
1
Can-You-Fake-PTSD Challenge Results
successful attempts
2018 GSS
Polar separation over time
2016 Presidential Election Vote Prediction
2004
| GSS variable | conservative pole | liberal pole |
|---|---|---|
| abany | no | yes |
| abdefctw | always wrong | not wrong at all |
| abdefect | no | yes |
| abhlth | no | yes |
| abnomore | no | yes |
| abpoor | no | yes |
| abpoorw | always wrong | not wrong at all |
| abrape | no | yes |
| absingle | no | yes |
| bible | inspired word | book of fables |
| colcom | fired | not fired |
| colmil | not fired | not allowed |
| comfort | strongly agree | strongly disagree |
| conlabor | hardly any | a great deal |
| godchnge | believe now, always have | don't believe now, never have |
| grass | not legal | legal |
| gunlaw | oppose | favor |
| intmil | very interested | not at all interested |
| libcom | remove | not remove |
| libmil | not remove | remove |
| maboygrl | true | false |
| owngun | yes | no |
| pillok | agree | strongly agree |
| pilloky | strongly disagree | strongly agree |
| polabuse | no | yes |
| pray | several times a day | never |
| prayer | disapprove | approve |
| prayfreq | several times a day | never |
| religcon | strongly disagree | strongly agree |
| religint | strongly disagree | strongly agree |
| reliten | strong | no religion |
| rowngun | yes | no |
| shotgun | yes | no |
| spkcom | not allowed | allowed |
| spkmil | allowed | not allowed |
| taxrich | about right | much too low |
conservative pole
liberal pole
Clustering LSM distance \(\theta(x,y)\) between out-of-sample individuals
conservative
liberal
poles:
partial states aligning with extreme opposing worldviews
Predict 2016 votes using ideology index
Emergent global structure
state collapse
Response scale (both panels): strongly agree, agree, neutral, disagree, strongly disagree
\(X_i\)
Stable
(captured by local extrema)
Free to move locally towards extrema
Why propaganda works so well
* Bail, Christopher A., et al. "Exposure to opposing views on social media can increase political polarization." PNAS 115, no. 37 (2018): 9216-9221. DOI: 10.1073/pnas.1804840115
GSS 2018 individuals and neighborhoods
Influenza C : strains and their neighborhoods
Even random perturbations will tend to move individuals towards local extrema, increasing polarization*
Observation: The Mississippi lineage has been extinct since 2022/23
stable lineage
The LSM tells the latent opinion "space-time" how to curve, the curved "space-time" tells opinions how to change.
Local potential fields can be computed given the LSM and dynamical considerations, which reveal future evolution
The No-cheating Theorem: Generative models cannot cheat on complexity
Kolmogorov Complexity
Optimal Generative Model
compressed data representation
compressed model representation
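Kolmogorov complexity itself is uncomputable, but the length of any lossless compression gives an upper bound on it, up to a constant. The sketch below uses `zlib` as a crude proxy; this is an illustrative assumption and not the estimator used in the talk.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (a rough complexity upper bound)."""
    return len(zlib.compress(data, 9))

# Two hypothetical 10,000-symbol sequences over the same 4-letter alphabet.
structured = b"ACGT" * 2500                                  # highly regular
rng = random.Random(0)
noisy = bytes(rng.choice(b"ACGT") for _ in range(10000))     # no regularity to exploit

print(compressed_size(structured), compressed_size(noisy))
```

The regular sequence compresses to a small fraction of the random one, even though both have the same length and alphabet.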
Theorem
Conservation Law arising from the continuous symmetry of typicality*
Saturation relation:
Data Sufficiency Statistic \(\mu_0\)
We need LSM-sampling to calculate this
*Noether's Theorem
For every continuous symmetry of a physical system, there exists a corresponding conserved quantity
How much more data do we need?
Data saturation
Data deficient
Needed
Current
Empirical Validation
Do new samples (survey respondents) still conform to the model?
GSS Model drift
ergodic projection (all missing values)
A random belief state (with possibly missing entries)
random variable
normal variate
assess if \(\zeta\) is stationary: if not then new samples are not conforming to model
Example for GSS LSM inferred for year 2000
A General Framework for modeling Complex Systems
Genomic database: Missing heritability problem
Personalized Clinical Digital Twin, Virtual Patients
Any structured interview, PTSD fabrication
Assess symptom data and co-pathologies
Predict future mutations; which animal strain is closest to jumping to humans
Mental health diagnosis
Microbiome Analysis**
Algorithmic lie detector
Viral emergence
Teomims
Opinion Dynamics
Darkome
Generative model of complex microbial ecosystems, and their impact on health and disease
Data requirements
LSMs for complex systems
**published (https://www.science.org/doi/10.1126/sciadv.adj0400)
ishanu_ch@uky.edu
Rotaru, Victor, Yi Huang, Timmy Li, James Evans, and Ishanu Chattopadhyay. "Event-level prediction of urban crime reveals a signature of enforcement bias in US cities." Nature Human Behaviour 6, no. 8 (2022): 1056-1068.
By Ishanu Chattopadhyay
Large Science Models for Scientific Discovery