Large Science Models:

Foundation Models for
Generalizable Insights Into Complex Systems

Ishanu Chattopadhyay, PhD

Assistant Professor of Biomedical Informatics & Computer Science

University of Kentucky

Proposed Concept

  • Develop foundation models of complex systems with
    • hundreds to thousands of evolving variables with a priori unknown cross-talk
    • no governing equations known a priori
    • reflexivity: the system changes when observed
  • Learn intrinsic system geometry from data
  • Derive equations of motion from variational principles (stationary action on a Lagrangian)
  • Inference under data sparsity
  • Detect data (in)sufficiency, adapt to model drift
  • Support forward simulation and perturbation analysis
  • Digital twins of individuals & groups of entities

Large Science Models

Large Science Models: Mathematical Framework

\begin{aligned}
\text{Observables:} \quad & \color{yellow}X = \{x^1, \ldots, x^N\}, \overbrace{x^i \in \Sigma^i}^{\text{finite alphabet}} \\
{\color{gray}\text{Notation:} }\quad & \color{gray} x^{-i} = \{x^j : j \ne i\}\\
\text{Crosstalk:} \quad & \forall i \ P(x^i) = \color{red} f_i(x^{-i}) \\
\text{System state:} \quad & \color{Cyan} \psi = \bigotimes_{i=1}^N \psi^i, \quad \psi^i \in \mathscr{D}(\Sigma^i) \cup \varnothing \\
%\textbf{Degenerate case:} \quad & \psi^i \text{ is a delta distribution (fully observed)} \\
{\color{gray}\text{Notation:} }\quad & \color{gray}\psi^{-i} = \bigotimes_{j \ne i} \psi^j
\end{aligned}
Observables: reliten, gunlaw, abany, ..., grass

Samples: Person 1, Person 2, ..., Person m

Distributions over alphabet \(\Sigma^i\)

\phi = \bigotimes_{i=1}^N \phi^i, \quad \phi^i(\psi^{-i}) \in \mathscr{D}(\Sigma^i)

Individual Predictor (CIT)

cross-talk

\phi(\psi) \vert \vert \psi

Tension between predicted and observed distribution drives change

Example

GSS topic: There should be more gun-control

\(\psi^i\)

strongly agree agree neutral disagree strongly disagree
\Sigma^i

Digital Twin

\phi^i(\psi^{-i}) \sim \widetilde{\psi}^i

\(\phi\) estimates \(\psi\)

Examples: GSS, ANES, WVS, ESS, Eurobarometer, Afrobarometer, Asian Barometer, etc.

group

individual

estimate is always a non-empty non-degenerate distribution

missing observation

Large Science Models: Properties

LSM-Distance Metric*

\theta(\psi,\psi') \triangleq \frac{1}{N}\sum_{i=1}^{N} \sqrt{D_{JS}\Bigl(\phi^i(\psi^{-i}) \vert \vert \phi^i(\psi'^{-i})\Bigr)}

 where \(D_{JS}(P\vert \vert Q)\) is the Jensen-Shannon divergence.
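The distance above can be sketched numerically as follows. This is a minimal illustration, assuming the predicted distributions \(\phi^i(\psi^{-i})\) and \(\phi^i(\psi'^{-i})\) have already been computed and are passed in as plain lists of probabilities; the function names `js_divergence` and `lsm_distance` are hypothetical and are not taken from any LSM implementation.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        # Terms with a_k = 0 contribute nothing to the sum.
        return sum(ak * math.log(ak / bk) for ak, bk in zip(a, b) if ak > 0)

    mix = [(pk + qk) / 2 for pk, qk in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def lsm_distance(pred_a, pred_b):
    """Mean over variables of sqrt(D_JS) between two lists of predicted distributions.

    pred_a[i] stands in for phi^i(psi^{-i}) and pred_b[i] for phi^i(psi'^{-i}).
    """
    if len(pred_a) != len(pred_b):
        raise ValueError("both states need one predicted distribution per variable")
    total = 0.0
    for p, q in zip(pred_a, pred_b):
        # Clamp tiny negative rounding errors before taking the square root.
        total += math.sqrt(max(0.0, js_divergence(p, q)))
    return total / len(pred_a)


# Illustrative (made-up) predictions for N = 2 variables.
pred_a = [[0.7, 0.2, 0.1], [0.5, 0.5]]
pred_b = [[0.1, 0.2, 0.7], [0.5, 0.5]]
theta = lsm_distance(pred_a, pred_b)
```

With natural logarithms each term is bounded by \(\sqrt{\ln 2}\), so \(\theta\) in this sketch lies in \([0, \sqrt{\ln 2}]\).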

g_{ij}(\psi) \;=\; \frac{1}{2}\,\frac{\partial^2}{\partial \psi^i\,\partial \psi^j}\,\theta^2(\psi,\psi')\Biggr|_{\psi'=\psi}
\left \lvert \ln \frac{\Pr(\psi\to \psi')}{\Pr(\psi' \rightarrow \psi')}\right \rvert \le \beta\,\theta(\psi,\psi')

Large Deviation Bound*

 Induced  Riemannian metric tensor

This bound connects "closeness" of samples to the odds of perturbing from one to the other, bridging geometry to dynamics
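Written out, the absolute-value bound is equivalent to a two-sided bound on the odds ratio itself, which makes the geometric reading explicit: a small \(\theta\) forces the ratio close to 1. This is only an algebraic restatement of the inequality stated on this slide, using the same symbols.

```latex
\left\lvert \ln \frac{\Pr(\psi \to \psi')}{\Pr(\psi' \to \psi')} \right\rvert \le \beta\,\theta(\psi,\psi')
\;\Longleftrightarrow\;
e^{-\beta\,\theta(\psi,\psi')} \;\le\; \frac{\Pr(\psi \to \psi')}{\Pr(\psi' \to \psi')} \;\le\; e^{\beta\,\theta(\psi,\psi')}
```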

Ergodic Projection

\psi_\star \triangleq \bigotimes_{i=1}^N\phi^i\left (\prod_{1}^{N-1}\varnothing\right )

(Sanov's Theorem, Pinsker's Inequality)

\(\psi\)

\(\psi'\)

\(\theta\)

"spatial average":  average of all plausible worldviews or states

* Sizemore, Nicholas, Kaitlyn Oliphant, Ruolin Zheng, Camilia R. Martin, Erika C. Claud, and Ishanu Chattopadhyay. "A digital twin of the infant microbiome to predict neurodevelopmental deficits." Science Advances 10, no. 15 (2024): eadj0400.  https://www.science.org/doi/full/10.1126/sciadv.adj0400

persistence probability

Ergodic dispersion

\Psi_\star = \theta(\psi,\psi_\star)
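The ergodic projection and dispersion can be sketched in a few lines. This is a toy illustration under stated assumptions: `make_toy_predictor` is a hypothetical stand-in for a trained predictor \(\phi^i\), all variables share one alphabet, and a missing observation is encoded as `None`. Only the two steps shown on the slide (start from the all-missing state, then apply each \(\phi^i\)) come from the source.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        return sum(ak * math.log(ak / bk) for ak, bk in zip(a, b) if ak > 0)
    mix = [(pk + qk) / 2 for pk, qk in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def lsm_distance(pred_a, pred_b):
    """Mean over variables of sqrt(D_JS) between predicted distributions."""
    terms = [math.sqrt(max(0.0, js_divergence(p, q))) for p, q in zip(pred_a, pred_b)]
    return sum(terms) / len(terms)


def make_toy_predictor(base, weight=0.5):
    """Hypothetical phi^i: returns `base` if all other variables are missing,
    otherwise mixes `base` with the mean of the observed distributions."""
    def phi(others):
        observed = [d for d in others if d is not None]
        if not observed:
            return list(base)
        mean = [sum(d[k] for d in observed) / len(observed) for k in range(len(base))]
        return [(1 - weight) * b + weight * m for b, m in zip(base, mean)]
    return phi


def predict_all(predictors, state):
    """Apply each phi^i to the state with entry i left out (psi^{-i})."""
    return [phi(state[:i] + state[i + 1:]) for i, phi in enumerate(predictors)]


def ergodic_projection(predictors):
    """psi_star: apply every phi^i to the state whose entries are all missing."""
    null_state = [None] * len(predictors)
    return predict_all(predictors, null_state)


def ergodic_dispersion(predictors, state):
    """theta(psi, psi_star), comparing predictions made from each state."""
    psi_star = ergodic_projection(predictors)
    return lsm_distance(predict_all(predictors, state),
                        predict_all(predictors, psi_star))


predictors = [make_toy_predictor([0.6, 0.4]), make_toy_predictor([0.3, 0.7])]
psi_star = ergodic_projection(predictors)
dispersion = ergodic_dispersion(predictors, [[1.0, 0.0], [1.0, 0.0]])
```

In this sketch the dispersion of \(\psi_\star\) from itself is zero, and any state whose predictions differ from those of \(\psi_\star\) has positive dispersion.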

Central to Model Drift Quantification

Start with opinion vector with all entries missing

This is a standard physics construct that quantifies the curvature of the underlying latent geometry

Pr(\psi \rightarrow \psi')

Easily computable in LSM framework!

Apply \(\phi^i\)

Random variable quantifying dispersion around the spatial average of worldviews

const. scaling as \(N^2\) 

Digital Twin & Fidelity of Simulation

\mathcal{N}_\epsilon(\psi) \triangleq \big\{ \psi': {\color{red}\forall i \ \psi'^i \sim \phi^i\left ( \psi^{-i}\right )} \wedge {\color{yellow} \theta(\psi,\psi') \leq \epsilon }\big \}

Sample predicted distributions   

perturbed state within \(\epsilon\) of \(\psi\)
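The neighborhood definition suggests a simple rejection sampler: draw each entry from its predicted distribution, then keep the draw only if it lies within \(\epsilon\) of \(\psi\). The sketch below is an illustration under stated assumptions, not the LSM sampler itself: variables are binary, `toy_predictor` is a hypothetical Laplace-smoothed stand-in for \(\phi^i\), and missing entries are encoded as `None`.

```python
import math
import random


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        return sum(ak * math.log(ak / bk) for ak, bk in zip(a, b) if ak > 0)
    mix = [(pk + qk) / 2 for pk, qk in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def lsm_distance(pred_a, pred_b):
    """Mean over variables of sqrt(D_JS) between predicted distributions."""
    terms = [math.sqrt(max(0.0, js_divergence(p, q))) for p, q in zip(pred_a, pred_b)]
    return sum(terms) / len(terms)


def toy_predictor(others):
    """Hypothetical phi^i for binary variables: Laplace-smoothed share of 1s among
    the observed other variables, so the estimate is never degenerate."""
    observed = [v for v in others if v is not None]
    p_one = (1 + sum(observed)) / (2 + len(observed))
    return [1 - p_one, p_one]


def predict_all(state):
    """Predicted distribution for every variable from the remaining ones."""
    return [toy_predictor(state[:i] + state[i + 1:]) for i in range(len(state))]


def sample_neighborhood(state, epsilon, n_samples, rng, max_tries=10000):
    """Rejection-sample states psi' with psi'^i ~ phi^i(psi^{-i}) and theta <= epsilon."""
    base_pred = predict_all(state)
    accepted, tries = [], 0
    while len(accepted) < n_samples and tries < max_tries:
        tries += 1
        candidate = [rng.choices(range(len(d)), weights=d)[0] for d in base_pred]
        if lsm_distance(base_pred, predict_all(candidate)) <= epsilon:
            accepted.append(candidate)
    return accepted


# A partially observed state: the third opinion is missing.
state = [1, 1, None, 1]
samples = sample_neighborhood(state, epsilon=0.3, n_samples=5, rng=random.Random(0))
```

Each accepted sample supplies a value for the missing entry, so the third position of the samples can be read as imputations of the censored opinion. The sampler may return fewer than `n_samples` states when \(\epsilon\) is small and `max_tries` is exhausted.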

Variable | Masked                   | Reconstructed
spkcom   | allowed                  | allowed
colcom   | not fired                | not fired
spkmil   | allowed                  | allowed
colmil   | allowed                  | not allowed
libmil   | not remove               | not remove
libhomo  | not remove               | not remove
reliten  | strong                   | no religion
pray     | once a day               | once a day
bible    | inspired word            | word of god
abhlth   | yes                      | yes
abpoor   | no                       | no
pillok   | agree                    | agree
intmil   | very interested          | very interested
abpoorw  | always wrong             | not wrong at all
godchnge | believe now, always have | believe now, always have
prayfreq | several times a week     | several times a week
religcon | strongly disagree        | disagree
religint | disagree                 | disagree

Variable | Masked                       | Reconstructed
spkcom   | allowed                      | allowed
colcom   | not fired                    | not fired
libmil   | not remove                   | not remove
libhomo  | not remove                   | not remove
gunlaw   | favor                        | favor
reliten  | no religion                  | no religion
prayer   | approve                      | approve
bible    | book of fables               | inspired word
abnomore | yes                          | yes
abhlth   | yes                          | yes
abpoor   | yes                          | yes
abany    | yes                          | yes
owngun   | no                           | no
intmil   | moderately interested        | moderately interested
abpoorw  | not wrong at all             | not wrong at all
godchnge | believe now, didn't used to  | believe now, always have
prayfreq | several times a week         | several times a week

2018 GSS individual samples

Digital Twin

\(\epsilon\)-Neighborhood of state \(\psi\)

Definition

Sample neighborhood to impute missing data


2018 GSS  out-of-sample reconstruction

post-reconstruction error ratio (%)

LSM sampling: sampling the \(\epsilon\)-neighborhood of a state or worldview allows reconstruction of censored opinions

examples

Predictive ability of LSM quantified as ability to reconstruct censored out-of-sample opinions**

{\color{Tomato}\psi_\star }\rightarrow \psi \rightarrow \cdots \rightarrow \psi'

Null state (all missing observations)

Valid perturbations/ simulations

LSM sampling allows simulating opinion perturbations

Both individuals and groups may be modeled as digital twins\(^\dag\)

Global Emergent Structure via Clusters & Poles

2018 GSS

\theta_t(\psi_+,\psi_-)

Polar separation over time

2016 Presidential Election Vote Prediction

2004

Variable | Conservative pole        | Liberal pole
abany    | no                       | yes
abdefctw | always wrong             | not wrong at all
abdefect | no                       | yes
abhlth   | no                       | yes
abnomore | no                       | yes
abpoor   | no                       | yes
abpoorw  | always wrong             | not wrong at all
abrape   | no                       | yes
absingle | no                       | yes
bible    | inspired word            | book of fables
colcom   | fired                    | not fired
colmil   | not fired                | not allowed
comfort  | strongly agree           | strongly disagree
conlabor | hardly any               | a great deal
godchnge | believe now, always have | don't believe now, never have
grass    | not legal                | legal
gunlaw   | oppose                   | favor
intmil   | very interested          | not at all interested
libcom   | remove                   | not remove
libmil   | not remove               | remove
maboygrl | true                     | false
owngun   | yes                      | no
pillok   | agree                    | strongly agree
pilloky  | strongly disagree        | strongly agree
polabuse | no                       | yes
pray     | several times a day      | never
prayer   | disapprove               | approve
prayfreq | several times a day      | never
religcon | strongly disagree        | strongly agree
religint | strongly disagree        | strongly agree
reliten  | strong                   | no religion
rowngun  | yes                      | no
shotgun  | yes                      | no
spkcom   | not allowed              | allowed
spkmil   | allowed                  | not allowed
taxrich  | about right              | much too low

conservative pole

\psi_+

liberal pole

\psi_-

Clustering LSM distance \(\theta(x,y)\) between out-of-sample individuals

conservative

liberal

poles:

partial states aligning with extreme opposing worldviews

  • Compare across time and different GSS surveys
  • Derived features for individuals (ideology index)
I(x) = \frac{\theta(x,\psi_+) - \theta(x,\psi_-)}{\theta(\psi_+,\psi_-)}
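As a worked illustration of the index, the sketch below evaluates \(I(x)\) from three precomputed LSM distances. The function name `ideology_index` and its argument names are hypothetical; the distances themselves would come from the LSM and are simply passed in as numbers here.

```python
def ideology_index(d_plus, d_minus, d_poles):
    """I(x) = (theta(x, psi_plus) - theta(x, psi_minus)) / theta(psi_plus, psi_minus).

    d_plus  : theta(x, psi_plus), distance from x to the conservative pole
    d_minus : theta(x, psi_minus), distance from x to the liberal pole
    d_poles : theta(psi_plus, psi_minus), separation between the two poles
    """
    if d_poles <= 0:
        raise ValueError("the poles must be separated by a positive distance")
    return (d_plus - d_minus) / d_poles


# Made-up distances for illustration only.
at_conservative_pole = ideology_index(0.0, 0.4, 0.4)
at_liberal_pole = ideology_index(0.4, 0.0, 0.4)
equidistant = ideology_index(0.2, 0.2, 0.4)
```

Because \(\theta(x,x) = 0\), the formula assigns \(-1\) to an individual sitting exactly at \(\psi_+\), \(+1\) to one at \(\psi_-\), and \(0\) to one equidistant from both poles.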

Predict 2016 votes using ideology index

Emergent global structure

Reflexivity and State Collapse on Observation

Emergent Equations of Motion

L \triangleq \frac{1}{2} \sum_i g_{kl} P^k_p \dot{\psi}^p_i P^l_n \dot{\psi}^n_i - \theta(\psi, \phi)

Define Lagrangian*

\frac{d}{dt} \left( \frac{\partial L}{\partial \dot{\psi}^m_i} \right) - \frac{\partial L}{\partial \psi^m_i} = 0

Via the Euler-Lagrange Equations\(^\dag\):

\ddot{\psi}^m_i = -g^{km} P^k_m \frac{1}{2N} \sum_j \frac{1}{\sqrt{D_{JS}(\psi^m_j \| \phi^m_j)}} \left[ \ln\left( \frac{2e\psi^m_j}{\psi^m_j + \phi^m_j} \right) - \frac{1}{2(\psi^m_j + \phi^m_j)} \right]

Over-damped Gradient flow Equation*

where \(g^{km}\) is the inverse metric tensor
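The qualitative behavior of an over-damped flow, where the state slides down the potential \(\theta(\psi,\phi)\) until predicted and observed distributions agree, can be illustrated with a deliberately simplified numerical sketch. It is not the equation above: it treats a single variable, holds \(\phi\) fixed, replaces the induced metric with a Euclidean one in softmax (logit) coordinates, and uses a finite-difference gradient. All function names are hypothetical.

```python
import math


def softmax(z):
    """Map unconstrained logits to a probability distribution."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        return sum(ak * math.log(ak / bk) for ak, bk in zip(a, b) if ak > 0)
    mix = [(pk + qk) / 2 for pk, qk in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def tension(z, phi):
    """Potential sqrt(D_JS(psi || phi)) with psi = softmax(z)."""
    return math.sqrt(max(0.0, js_divergence(softmax(z), phi)))


def numerical_gradient(z, phi, h=1e-6):
    """Central-difference gradient of the tension with respect to the logits."""
    grad = []
    for k in range(len(z)):
        up = z[:k] + [z[k] + h] + z[k + 1:]
        down = z[:k] + [z[k] - h] + z[k + 1:]
        grad.append((tension(up, phi) - tension(down, phi)) / (2 * h))
    return grad


def gradient_flow(z0, phi, step=0.05, n_steps=200):
    """Explicit Euler steps of dz/dt = -grad(tension); returns final psi and history."""
    z = list(z0)
    history = [tension(z, phi)]
    for _ in range(n_steps):
        grad = numerical_gradient(z, phi)
        z = [zk - step * gk for zk, gk in zip(z, grad)]
        history.append(tension(z, phi))
    return softmax(z), history


# Start far from the predicted distribution phi and let the state relax.
psi_final, history = gradient_flow([2.0, 0.0, -2.0], [0.1, 0.2, 0.7])
```

In this toy setting the tension at the end of the run is lower than at the start, which is the sense in which tension between predicted and observed distributions drives change.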

kinetic energy

state collapse

strongly agree agree neutral disagree strongly disagree

Query/Observation

\(X_i\)

Non-local Influence propagation on measurement/observation (QM-like)

\phi^i(\psi^{-i})

potential energy

* Einstein notation used

\(^\dag\) Goldstein, Herbert, et al. Classical Mechanics. 3rd ed., Pearson, 2002.

Principle of stationary action

Dynamics

Local potential field eqn

Local Potential Fields

Stable

(captured by local extrema)

Free to move locally towards extrema

Why propaganda works so well

* Bail, Christopher A., et al. "Exposure to opposing views on social media can increase political polarization." PNAS 115, no. 37 (September 2018): 9216–9221. DOI: 10.1073/pnas.1804840115

GSS 2018 individuals and  neighborhoods

Influenza C :  strains and their neighborhoods

Even random perturbations tend to move individuals towards local extrema, increasing polarization*

  • Polarization is "easy", can occur via random perturbations (falling into the local well)

Hypotheses

Observation: This lineage (Mississippi lineage) has been extinct since 2022/23

stable lineage

Implications on Social Theory

The LSM tells the latent opinion "space-time" how to curve, the curved "space-time" tells opinions how to change.

Local potential fields can be computed given the LSM and dynamical considerations, which reveal future evolution

  • De-polarization is "hard", needs specific communication (climbing up from the well)

Data Sufficiency  via Conservation of Complexity

%K(x) = K(S) + K(x \vert S_\star) + O(1) = K(S') + K(x \vert S'_\star) +O(1) K(x \vert S_\star) = O(1) = K(S \vert x_\star)

The No-cheating Theorem: Generative models cannot cheat on complexity

Kolmogorov Complexity

Optimal Generative Model

compressed data representation

compressed model representation

Theorem

K(\textrm{data}) = K(\textrm{LSM}) +O(1)

Conservation Law arising from the continuous symmetry of typicality*

\mu_0(X) \triangleq \frac{\delta(\vert \langle S(X) \rangle \vert)}{\delta(\vert \langle X \rangle \vert)} \leq 1

Saturation relation:

Data Sufficiency Statistic \(\mu_0\)

We need LSM-sampling to calculate this

*Noether's Theorem

For every continuous symmetry of a physical system, there exists a corresponding conserved quantity

\vert \langle X' \rangle \vert \approx \max\{1,\mu_0(X)\} \vert \langle X \rangle \vert

How much more data do we need?

Data saturation

Data deficient

Needed

Current

Empirical Validation

Copy of Copy of LSM

By Ishanu Chattopadhyay

DARPA-EA-25-02-05-MAGICS-PA-025
