ML | Data Science Biomedical Informatics | Social Science | Assistant Professor
Ishanu Chattopadhyay, PhD
Assistant Professor of Biomedical Informatics & Computer Science
University of Kentucky
quantized output levels
*Brook, D. (1964). On the distinction between the conditional probability and the joint probability approaches in the specification of nearest-neighbour systems. Biometrika, 51(3/4), 481–483.
Hundreds of thousands to tens of millions of features
The Goal: Create a digital twin which can reveal valid perturbations
Completely uninformative state
Observed state
?
[Figure: panels Bacilli 30, Coriobacteria 32, Gammaproteobacteria 32; legend: typical, deficit]
All Patients
Feeding Variables added
Building classifier based on LSM metric
No! The LSM indicates that supplantations need to be patient-specific
No transplantation is guaranteed to work reliably
[Figure: two panels annotated "Predicted to reduce risk reliably"; legend: Typical, Deficit]
Dataset from Metabolomics Workbench
| Study ID | ST000923 |
|---|---|
| Study Title | Longitudinal Metabolomics of the Human Microbiome in Inflammatory Bowel Disease |
| Institute | Broad Institute of MIT and Harvard |
| Last Name | Avila-Pacheco |
| First Name | Julian |
| Submit Date | 2017-11-14 |
| Num Groups | 3 |
| Total Subjects | 546 |
| Num Males | 276 |
| Num Females | 270 |
| Analysis Type Detail | LC-MS |
State-of-the-art microbiome-based classification (~10 species) *
| IBD vs UC | 0.82 |
| IBD vs CD | 0.76 |
*Zheng, J., et al. (2024). Noninvasive, microbiome-based diagnosis of inflammatory bowel disease. Nature Medicine, 30(12), 3555–3567. https://doi.org/10.1038/s41591-024-03280-4
| IBD vs non IBD | 0.85 |
Gut-Metabolome based Classification (~36 metabolites) *
Application 2
1. How will proposer form and maintain a computationally tractable LSM tree structure given, as proposed, hundreds to thousands of observable variables?
\(\checkmark\)
https://34.66.189.202/data/trees_mbol_UC
LSM model
| Comparison | AUC (out of sample) |
|---|---|
| Healthy vs IBD | 96.1% |
| Healthy vs UC | 92% |
| UC vs CD | 99% |
| Healthy vs CD | 99% |
| Quantity | Value | Notes |
|---|---|---|
| Number of metabolites | 5,280 | 85% untargeted metabolites |
| Number of parameters | 4,002,306 | |
| Average tree depth | 38.62 | |
| Number of constraints inferred | 108,884 | ~83% involve untargeted metabolites |
| Number of samples used | 180 | (no clinical phenotype information used) |
Application 3
sample profile of new patient \(y\)
Need one patient!
| Method | AUC | Sensitivity at 95% specificity |
|---|---|---|
| LSM | 92.7% | 74% |
| M-CHAT/F | 67% | 39% |
| ADOS-2 | 90-97% | 85% |
getting close to the gold standard
1 false positive
1 false negative
10% flag in TBD (expected 8.3% positives)
Patient specific driver profile
Average driver profile
Application 4
LSM model for healthy profiles
Every profile generated by \(\mathcal{H}\) is a healthy profile, although individual profiles may differ from one another
average healthy profile
A New Paradigm of AI-driven Discovery in Metabolome Biology
*Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. "Unbiased recursive partitioning: A conditional inference framework." Journal of Computational and Graphical Statistics 15, no. 3 (2006): 651-674.
Revealing Emergent Cross-talk between mutations in a viral protein (Influenza A HA)
Component predictor (Conditional Inference Tree*)
Example: Influenza A HA protein
where \(D_{JS}(P\vert \vert Q)\) is the Jensen-Shannon divergence.
This bound connects the "closeness" of samples to the odds of perturbing from one to the other, bridging geometry and dynamics
(Sanov's Theorem, Pinsker's Inequality)
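The bound itself is not reproduced in this text, but the two ingredients it names can be illustrated. The following is a minimal sketch, assuming natural-log units and two made-up response distributions `p` and `q`: it computes the Jensen-Shannon divergence and numerically checks Pinsker's inequality (total variation is at most the square root of half the KL divergence).

```python
import math

def kl_divergence(p, q):
    """D_KL(P||Q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """D_JS(P||Q) in nats, computed through the mixture M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Hypothetical distributions over five response levels (illustrative only).
p = [0.50, 0.30, 0.10, 0.05, 0.05]
q = [0.05, 0.05, 0.10, 0.30, 0.50]

js = js_divergence(p, q)
pinsker_holds = total_variation(p, q) <= math.sqrt(kl_divergence(p, q) / 2)
print(f"D_JS = {js:.4f} nats, Pinsker holds: {pinsker_holds}")
```

The JS divergence is symmetric and bounded by log 2 in these units, which is what makes it usable as a building block for a distance between samples.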
\(\psi\)
\(\psi'\)
\(\theta\)
"spatial average": average of all plausible worldviews or states
* Sizemore, Nicholas, Kaitlyn Oliphant, Ruolin Zheng, Camilia R. Martin, Erika C. Claud, and Ishanu Chattopadhyay. "A digital twin of the infant microbiome to predict neurodevelopmental deficits." Science Advances 10, no. 15 (2024): eadj0400. https://www.science.org/doi/full/10.1126/sciadv.adj0400
persistence probability
Central to Model Drift Quantification
Start with opinion vector with all entries missing
This is a standard Physics construct, quantifying curvature of the underlying latent geometry
Easily computable in LSM framework!
Apply \(\phi^i\)
Random variable quantifying dispersion around the spatial average of worldviews
const. scaling as \(N^2\)
Q-Net
recursive forest
Data inference boundaries & limitations
Alignment validation
Complex phenomena
Adaptation to model obsolescence
Precise validation protocols to assess process drift triggering re-calibration/training
Built-in flexibility for changing contexts and non-ergodicity
Scalable to thousands to millions of variables, intrinsic reflexivity
Component LSM predictors enforce statistical significance of splits in recursive partitioning, ensuring precise uncertainty quantification
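To illustrate what "enforcing statistical significance of splits" can look like, here is a simplified sketch in the spirit of conditional inference trees: a feature is accepted as a split variable only if a permutation test of its association with the target passes a Bonferroni-corrected threshold. The function names, the mean-difference statistic, and the toy data are assumptions for illustration, not the LSM implementation.

```python
import random

def association(x, y):
    """Absolute difference in mean of y between the x == 1 and x == 0 groups."""
    g1 = [yi for xi, yi in zip(x, y) if xi == 1]
    g0 = [yi for xi, yi in zip(x, y) if xi == 0]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))

def permutation_p_value(x, y, n_perm=999, seed=0):
    """P-value of the observed association under random relabelling of y."""
    rng = random.Random(seed)
    observed = association(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if association(x, y_perm) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def select_split(features, y, alpha=0.05):
    """Return (best feature or None, p-values); split only if significant."""
    p_values = {name: permutation_p_value(x, y) for name, x in features.items()}
    best = min(p_values, key=p_values.get)
    if p_values[best] * len(features) <= alpha:  # Bonferroni correction
        return best, p_values
    return None, p_values

# Toy data: one feature tracks the target exactly, the other is balanced noise.
y = [0] * 20 + [1] * 20
features = {"assoc": list(y), "indep": [0, 1] * 20}
chosen, p_values = select_split(features, y)
print(chosen, p_values)
```

If no feature passes the test, the node is not split, which is what keeps tree growth from chasing noise.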
emergent macro-structure
Recursive
LSM
forest
Revealing Emergent Cross-talk
| | reliten | gunlaw | abany | … | grass |
|---|---|---|---|---|---|
| Person 1 | | | | | |
| Person 2 | | | | | |
| … | | | | | |
| Person m | | | | | |
observables
samples
Distributions over alphabet \(\Sigma^i\)
Individual Predictor (CIT)
cross-talk
Tension between predicted and observed distribution drives change
Example
GSS topic: There should be more gun-control
\(\psi^i\)
| strongly agree | agree | neutral | disagree | strongly disagree |
\(\phi\) estimates \(\psi\)
Examples: GSS, ANES, WVS, ESS, Eurobarometer, Afrobarometer, Asian Barometer, etc.
group
individual
estimate is always a non-empty non-degenerate distribution
missing observation
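As a toy stand-in for a component predictor, the sketch below estimates the response distribution for one observable from the rows that agree with whatever else has been observed, skipping missing entries. A pseudocount keeps the estimate non-degenerate even when no rows match. The variable names `topic_a` and `topic_b`, the counting rule, and the pseudocount are illustrative assumptions; the actual predictors are conditional inference trees.

```python
from collections import Counter

ALPHABET = ["strongly agree", "agree", "neutral", "disagree", "strongly disagree"]

def predict_distribution(samples, target, observed, alphabet=ALPHABET, pseudocount=1.0):
    """Smoothed distribution of `target` given the non-missing entries of `observed`.

    samples:  list of dicts mapping variable name -> response (None = missing)
    observed: dict of known responses for one individual (None = missing)
    """
    counts = Counter()
    for row in samples:
        if row.get(target) is None:
            continue
        # Keep rows that agree with every observed (non-missing) entry.
        if all(row.get(var) == val for var, val in observed.items() if val is not None):
            counts[row[target]] += 1
    total = sum(counts.values()) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

samples = [
    {"topic_a": "agree", "topic_b": "agree"},
    {"topic_a": "agree", "topic_b": "neutral"},
    {"topic_a": "disagree", "topic_b": "disagree"},
]
conditional = predict_distribution(samples, "topic_b", {"topic_a": "agree"})
all_missing = predict_distribution(samples, "topic_b", {"topic_a": None})
print(conditional)
print(all_missing)
```

With every entry missing, the estimate falls back to the smoothed marginal, so a prediction is always available.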
Sample predicted distributions
perturbed state within \(\epsilon\) of \(\psi\)
Definition
Sample neighborhood to impute missing data
LSM sampling: sampling the \(\epsilon\)-neighborhood of a state or worldview allows reconstruction of censored opinions
Predictive ability of LSM quantified as ability to reconstruct censored out-of-sample observations
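The censor-and-reconstruct evaluation can be sketched as follows: hide some entries of each held-out row, impute them from the remaining entries, and report the fraction recovered. The imputer here is a deliberately simple majority vote over compatible training rows and is an assumption for illustration; in the LSM setting the imputation would come from LSM sampling.

```python
import random
from collections import Counter

def impute(train, partial, target):
    """Most common `target` value among training rows matching the observed entries."""
    compatible = [r[target] for r in train
                  if all(r[v] == val for v, val in partial.items() if val is not None)]
    pool = compatible or [r[target] for r in train]  # fall back to the overall mode
    return Counter(pool).most_common(1)[0][0]

def reconstruction_accuracy(train, test, n_censored=1, seed=0):
    """Fraction of censored out-of-sample entries that are recovered exactly."""
    rng = random.Random(seed)
    correct = total = 0
    for row in test:
        hidden = rng.sample(sorted(row), n_censored)
        partial = {v: (None if v in hidden else val) for v, val in row.items()}
        for v in hidden:
            correct += impute(train, partial, v) == row[v]
            total += 1
    return correct / total

# Toy data in which all three answers move together.
yes_row = {"a": "yes", "b": "yes", "c": "yes"}
no_row = {"a": "no", "b": "no", "c": "no"}
train = [dict(yes_row) for _ in range(3)] + [dict(no_row) for _ in range(3)]
test = [dict(yes_row), dict(no_row)]
print(reconstruction_accuracy(train, test, n_censored=2))
```

Because the test rows never enter `train`, the score measures out-of-sample reconstruction rather than memorization.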
Null state (all missing observations)
Valid perturbations/ simulations
LSM sampling allows simulating opinion perturbations
2018 GSS
Polar separation over time
2016 Presidential Election Vote Prediction
2004
| GSS variable | Conservative pole | Liberal pole |
|---|---|---|
| abany | no | yes |
| abdefctw | always wrong | not wrong at all |
| abdefect | no | yes |
| abhlth | no | yes |
| abnomore | no | yes |
| abpoor | no | yes |
| abpoorw | always wrong | not wrong at all |
| abrape | no | yes |
| absingle | no | yes |
| bible | inspired word | book of fables |
| colcom | fired | not fired |
| colmil | not fired | not allowed |
| comfort | strongly agree | strongly disagree |
| conlabor | hardly any | a great deal |
| godchnge | believe now, always have | don't believe now, never have |
| grass | not legal | legal |
| gunlaw | oppose | favor |
| intmil | very interested | not at all interested |
| libcom | remove | not remove |
| libmil | not remove | remove |
| maboygrl | true | false |
| owngun | yes | no |
| pillok | agree | strongly agree |
| pilloky | strongly disagree | strongly agree |
| polabuse | no | yes |
| pray | several times a day | never |
| prayer | disapprove | approve |
| prayfreq | several times a day | never |
| religcon | strongly disagree | strongly agree |
| religint | strongly disagree | strongly agree |
| reliten | strong | no religion |
| rowngun | yes | no |
| shotgun | yes | no |
| spkcom | not allowed | allowed |
| spkmil | allowed | not allowed |
| taxrich | about right | much too low |
Clustering LSM distance \(\theta(x,y)\) between out-of-sample individuals
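The defining equation for \(\theta\) is not reproduced in this text. The sketch below assumes one common form: the average, over observables, of the square root of the Jensen-Shannon divergence between the component predictors' distributions for the two individuals. The function names and the example distributions are hypothetical.

```python
import math

def js_divergence_bits(p, q):
    """Jensen-Shannon divergence in bits, so values lie in [0, 1]."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def lsm_distance(dists_x, dists_y):
    """Mean over observables of sqrt(JS) between predicted distributions (assumed form)."""
    terms = [math.sqrt(max(0.0, js_divergence_bits(p, q)))
             for p, q in zip(dists_x, dists_y)]
    return sum(terms) / len(terms)

def distance_matrix(individuals):
    """Symmetric matrix of pairwise distances, ready for a clustering routine."""
    n = len(individuals)
    return [[lsm_distance(individuals[i], individuals[j]) for j in range(n)]
            for i in range(n)]

# Each individual is a list of predicted distributions, one per observable.
x = [[0.9, 0.1], [0.8, 0.2]]
y = [[0.1, 0.9], [0.2, 0.8]]
z = [[0.85, 0.15], [0.8, 0.2]]
D = distance_matrix([x, y, z])
print([[round(d, 3) for d in row] for row in D])
```

The resulting matrix can be handed to any standard distance-based clustering method; here `x` and `z` land close together while `y` sits far from both.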
conservative
liberal
poles:
partial states aligning with extreme opposing worldviews
Predict 2016 votes using ideology index
Emergent global structure
Define Lagrangian*
Via the Euler-Lagrange Equations\(^\dag\):
Over-damped Gradient flow Equation*
where \(-g^{km}\) is the inverse metric tensor
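The equations themselves are not reproduced in this text. As a reference point only, the generic textbook chain from a Lagrangian to an over-damped gradient flow is sketched below in Einstein notation. The coordinates \(x^k\), the metric \(g_{km}\), and the potential \(V\) are assumed symbols, and the specific forms used in the talk may differ.

```latex
\begin{aligned}
\mathcal{L}(x,\dot{x}) &= \tfrac{1}{2}\, g_{km}(x)\,\dot{x}^k \dot{x}^m - V(x)
  && \text{(kinetic minus potential energy)} \\
\frac{d}{dt}\frac{\partial \mathcal{L}}{\partial \dot{x}^k}
  - \frac{\partial \mathcal{L}}{\partial x^k} &= 0
  && \text{(Euler-Lagrange equations)} \\
\dot{x}^k &= -\, g^{km}(x)\, \frac{\partial V}{\partial x^m}
  && \text{(over-damped gradient flow)}
\end{aligned}
```

In the over-damped limit the inertial term is neglected, so the state moves along the steepest descent of \(V\) as measured by the metric, which is why stable states sit at local extrema of the potential.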
kinetic energy
state collapse
[Figure axis labels: strongly agree, agree, neutral, disagree, strongly disagree]
\(X_i\)
potential energy
* Einstein notation used
\(^\dag\) Goldstein, Herbert, et al. Classical Mechanics. 3rd ed., Pearson, 2002.
Principle of stationary action
Local potential field equation
Stable
(captured by local extrema)
Free to move locally towards extrema
GSS 2018 individuals and neighborhoods
Influenza C : strains and their neighborhoods
Observation: This lineage (Mississippi lineage) has been extinct since 2022/23
stable lineage
Local potential fields can be computed given the LSM and dynamical considerations, which reveal future evolution
The No-cheating Theorem: Generative models cannot cheat on complexity
Kolmogorov Complexity
Optimal Generative Model
compressed data representation
compressed model representation
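Kolmogorov complexity is not computable, but the length of a compressed encoding gives a practical upper bound. The sketch below is a rough illustration using Python's `zlib` as the compressor, comparing a highly structured byte string with pseudo-random bytes of the same length. The choice of compressor and the example data are assumptions, not part of the theorem.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data: an upper-bound proxy for complexity."""
    return len(zlib.compress(data, 9))

n = 10_000
structured = b"ACGT" * (n // 4)                       # highly regular
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(n))   # essentially incompressible

print("structured:", compressed_size(structured), "bytes")
print("noise:     ", compressed_size(noise), "bytes")
```

Regular data shrinks to a small fraction of its length while noise does not shrink at all, which is the sense in which compressed length tracks how much structure a model could exploit.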
Theorem
Conservation Law arising from the continuous symmetry of typicality*
Saturation relation:
Data Sufficiency Statistic \(\mu_0\)
We need LSM-sampling to calculate this
*Noether's Theorem
For every continuous symmetry of a physical system, there exists a corresponding conserved quantity
How much more data do we need?
Data saturation
Data deficient
Needed
Current
Empirical Validation
Do new samples (survey respondents) still conform to the model?
GSS Model drift
ergodic projection (all missing values)
A random belief state (with possibly missing entries)
random variable
normal variate
assess if \(\zeta\) is stationary: if not then new samples are not conforming to model
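One simple way to probe stationarity is to compare the mean of \(\zeta\) over an early and a late window. The sketch below assumes the \(\zeta\) values have already been computed for successive batches of new samples, and flags drift when a two-sample z statistic exceeds a threshold. The statistic, the threshold of 3, and the toy series are illustrative choices, not the validation protocol used in the talk.

```python
import math

def drift_z_score(zeta, split=None):
    """Two-sample z statistic for a shift in the mean of zeta after `split`."""
    split = len(zeta) // 2 if split is None else split
    early, late = zeta[:split], zeta[split:]

    def mean_var(values):
        m = sum(values) / len(values)
        return m, sum((v - m) ** 2 for v in values) / (len(values) - 1)

    m_early, v_early = mean_var(early)
    m_late, v_late = mean_var(late)
    standard_error = math.sqrt(v_early / len(early) + v_late / len(late))
    return (m_late - m_early) / standard_error

def is_drifting(zeta, threshold=3.0):
    """True when the late window's mean departs significantly from the early one."""
    return abs(drift_z_score(zeta)) > threshold

pattern = [0.1, -0.2, 0.05, -0.1, 0.15, -0.05, 0.2, -0.15, 0.0, 0.1]
stable = pattern * 4                                   # same behaviour throughout
shifted = stable[:20] + [v + 1.0 for v in stable[20:]] # mean jumps in the second half
print(is_drifting(stable), is_drifting(shifted))
```

A flagged series would indicate that new samples no longer conform to the model and that re-calibration should be considered.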
Example for GSS LSM inferred for year 2000
\(\checkmark\) 4. Address whether your approach makes assumptions regarding ergodicity, and if so, how these assumptions affect the model's applicability to non-ergodic systems.
No Convergence
(~50% belief mismatch between pairs)
2018 GSS survey belief vectors simulated via LSM sampling
When applied to Social Modeling and Opinion Dynamics
Belief about topic \(i\) is expected to align with beliefs about other topics \(\displaystyle\psi^{-i}\).
Deviations are exponentially improbable \(\Rightarrow \) people/groups seek internal coherence.
Theory Link:
Cognitive consistency theory – Abelson et al. (1968)
Constraint satisfaction in beliefs – Read & Marcus-Newhall (1993)
Beliefs evolve to minimize tension between actual state and “expected” state.
Reflexive gradient flow — system reduces internal contradiction.
Theory Link:
Cognitive Dissonance Theory – Festinger (1957)
Homeostatic belief adjustment – Gawronski & Strack (2004)
Observing a belief changes it and affects all conditionals.
Direct encoding of feedback loops central to human systems.
Theory Link:
Reflexivity in social systems – Giddens (1984), Soros (1994)
Theory of mind / mutual modeling – Premack & Woodruff (1978)
Validation of Social Theory Questions:
Exploratory: Belief systems react measurably to exogenous events and shocks.
Exploratory: Cross-dependencies between beliefs have observable effects on societal resilience.
Is Polarization an Inevitable Attractor?
Social Identity Theory vs. Belief Proximity
A General Framework for modeling Complex Systems
Genomic database: Missing heritability problem
Personalized Clinical Digital Twin, Virtual Patients
Any structured interview, PTSD fabrication
Assess symptom data and co-pathologies
Predict future mutations; which animal strain is closest to jumping to humans
Mental health diagnosis
Microbiome Analysis**
Algorithmic lie detector
Viral emergence
Teomims
Opinion Dynamics
Darkome
Generative model of complex microbial ecosystems, and their impact on health and disease
Data requirements
| Limitation | Mitigation / Response |
|---|---|
| Conventional time series is currently out-of-scope | Focus on cross-sectional interdependencies and belief geometry; time handled via drift |
| LSMs model statistical interdependence, not causal mechanisms | Use perturbation-based simulations to infer plausible influence pathways |
| Limited by observed belief variables | Integrate multiple surveys; use latent proxies and test sensitivity of digital twins |
| Social theory connections and interpretability may be challenging | Anchor dynamics with theory-driven constructs (e.g., ToM, cognitive dissonance) |
LSMs for complex systems
**preliminary study published (https://www.science.org/doi/10.1126/sciadv.adj0400)
By Ishanu Chattopadhyay
DARPA-EA-25-02-05-MAGICS-PA-025