Yuan-Sen Ting (丁源森)
The Ohio State University
Expediting Discoveries in Astronomy with A.I. Agents
NSF awarded over $200 million for AI Research Institutes
~ 2 centers
~ 2 centers
Physical Sciences
~ 3 centers
7 centers x 15M ~ 100M
Environmental Sciences
Biological Sciences

Hype, myth, or real deal?

Why hasn't astronomy had its
"AlphaFold" moment yet?

YST, Annual Review of Astronomy and Astrophysics, arXiv: 2510.10713
Most AI in Astronomy focuses on extending statistical methods
[Figure: axes are Dark Matter Density and Growth Amplitude]
E.g., simulation-based inference
Sihao Cheng, YST+, 2020
Applying A.I. to individual tasks
will have limited impacts in astrophysics
The complexity of astronomy is too low for AI

My niece
Highly non-Gaussian

Weakly non-Gaussian
Cosmic large-scale structure
Astronomy is not biology
Data / Observation
Theory / Hypothesis
Analysis Pipelines

True
False


Biology faced fundamental bottlenecks from individual tasks
Data / Observation
Theory / Hypothesis
Analysis Pipelines

True
False



AlphaFold
Most astronomical tasks already have working heuristics
Data / Observation
Theory / Hypothesis
LambdaCDM

True
False



Toward Agentic Research for Astronomy
Data
Theory



State of the research

Making "plans"
Harness reasoning
Beyond just individual task optimizations


A.I. in Math Olympiads
A.I. in Astronomy Olympiads

Pinheiro, ..., YST+, 2025



In an open-world setting, can large language models match human researchers at expediting
data exploration?




??
Can A.I. agents understand spectral data (spectral energy distribution) from JWST?
Real-world reasoning extends far beyond algorithmic formalism

A default fit with
an SED model



Extinction model ?
Real-world reasoning extends far beyond algorithmic formalism


Young stellar population?
Real-world reasoning extends far beyond algorithmic formalism
Many real-world problems aren't simple optimization problems
The objective goes beyond minimizing a single error metric.
Many tasks may require modifying assumptions / physical models, not just optimizing over all parameters
Action spaces are vast and hard to parameterize.
Can a large-language model learn
from its own experience?

Human "intuition" + experience

Introducing Mephisto*
* In the classic tale of Faust, Mephisto is a demon who tempts the scholar Faust with knowledge and power in exchange for his soul.
A collaboration of multiple AI agents (LLM models)

Proposing actions

Execute actions

State evaluation

Knowledge distillation





Enabling AI to collect "knowledge" through exploration

[Diagram: knowledge base with steps 1 to 4]
Proposing Actions - e.g., different physical models / parameter ranges
Execute Actions - write configuration files and run the codes autonomously


State Evaluation - evaluate the results (beyond a single error metric)
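The evaluation step can be illustrated with a small sketch. This is not Mephisto's code: the helper name `evaluate_fit`, the toy fluxes, and the particular set of diagnostics (chi-square, reduced chi-square, number of bands fitted within 1σ, worst-fitting band) are assumptions chosen only to show what "beyond a single error metric" might look like in practice.

```python
import numpy as np

def evaluate_fit(observed, model, sigma, n_params):
    """Summarize a photometric fit with several diagnostics (hypothetical helper)."""
    observed = np.asarray(observed, dtype=float)
    model = np.asarray(model, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    residual = (observed - model) / sigma           # residuals in units of sigma
    chi2 = float(np.sum(residual ** 2))
    dof = max(observed.size - n_params, 1)          # guard against zero or negative dof
    return {
        "chi2": chi2,
        "reduced_chi2": chi2 / dof,
        "bands_within_1sigma": int(np.sum(np.abs(residual) <= 1.0)),
        "worst_band": int(np.argmax(np.abs(residual))),
    }

# Toy photometry in four bands (made-up numbers, not real data).
obs = [1.0, 2.0, 3.0, 4.0]
mod = [1.1, 1.9, 3.0, 5.0]
sig = [0.2, 0.2, 0.2, 0.2]
report = evaluate_fit(obs, mod, sig, n_params=1)
print(report)
```

In this toy case three of the four bands are fitted well, yet the chi-square is dominated by the last band; reporting both lets a reviewer (human or agent) see that the misfit is localized rather than global.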
Knowledge Distillation - summarise useful actions given the previous state
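The four steps (propose, execute, evaluate, distill) can be sketched as a loop. Everything below is a toy stand-in and not the Mephisto implementation: no LLM is called, the "physical model" is just the degree of a polynomial, the action names `add_term` / `remove_term` are hypothetical, and the BIC-like score (chi-square plus a complexity penalty) is an assumed substitute for a richer evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
sigma = 0.05
y = 1.0 + 2.0 * x**2 + rng.normal(0.0, sigma, x.size)  # toy "observations"

ACTIONS = {"add_term": +1, "remove_term": -1}  # hypothetical action space

def propose_actions(knowledge):
    """Stand-in for an LLM proposal: try actions that helped before first."""
    return sorted(ACTIONS, key=lambda name: -knowledge.get(name, 0))

def execute_action(degree):
    """Stand-in for writing a configuration file and running the fitting code."""
    return np.polyval(np.polyfit(x, y, degree), x)

def evaluate_state(model, degree):
    """Chi-square plus a complexity penalty (a BIC-like score)."""
    chi2 = float(np.sum(((y - model) / sigma) ** 2))
    return chi2 + (degree + 1) * float(np.log(x.size))

def distill(knowledge, name, improved):
    """Keep a running tally of which actions improved the fit."""
    knowledge[name] = knowledge.get(name, 0) + (1 if improved else -1)

state = {"degree": 0, "score": evaluate_state(execute_action(0), 0)}
knowledge, history = {}, [state["score"]]
for _ in range(5):
    for name in propose_actions(knowledge):
        degree = max(state["degree"] + ACTIONS[name], 0)
        score = evaluate_state(execute_action(degree), degree)
        improved = score < state["score"]
        distill(knowledge, name, improved)
        if improved:                      # accept the first action that helps
            state = {"degree": degree, "score": score}
            break
    history.append(state["score"])

print(state["degree"], knowledge)
```

The design point is that the knowledge base feeds back into the proposal step, so later iterations spend fewer trials on actions that have not helped. In the real system the tally would be replaced by distilled natural-language rules like the example quoted later in the talk.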
Mephisto - deployed as "walkers" in the action space

LLMs with self-improvement outperform native LLMs
Fitting JWST JADES data
[Figure: Chi-Square of the Fit vs. Number of Learning Iterations (0 to 30); curves: Mephisto, and GPT-4o baseline ("think without knowledge")]
Sun, YST+, 2024
Example of learned "knowledge"
" If the fit is overestimated in the UV and optical bands,
increasing the E_BV_lines parameter may lead to a better fit by accounting for more dust attenuation in these bands. "

Sun, YST+, 2025
Mephisto operates as walkers exploring the "hypothesis space"

With COSMOS2020 SEDs
Mephisto finds better solutions using only 1% of the trials that brute force methods require







https://tingyuansen.github.io/NASA_AI_ML_STIG/


YST+, 2025

Fitting equivalent widths used to require human judgments
YST+, 2025

E.g., deciding whether there's an unresolved blend of lines
YST+, 2025
E.g., adjusting for the continuum
YST+, 2025
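For readers less familiar with the measurement, here is a minimal sketch of an equivalent-width calculation, EW = ∫ (1 − F/F_c) dλ, including the continuum adjustment mentioned above. The spectrum is synthetic (a single Gaussian absorption line on a sloped continuum), and the line-free window of |Δλ| > 1.5 Å is an assumed choice; selecting such windows and spotting blends is exactly the kind of judgment call the slides refer to.

```python
import numpy as np

line_center, depth, width = 5000.0, 0.5, 0.2           # toy line, in Angstrom
wave = np.linspace(4997.0, 5003.0, 601)
offset = wave - line_center
continuum_true = 1.0 + 0.01 * offset                    # gently sloped continuum
flux = continuum_true * (1.0 - depth * np.exp(-0.5 * (offset / width) ** 2))

# Step 1: estimate the continuum from line-free windows (assumed window choice).
line_free = np.abs(offset) > 1.5
slope, intercept = np.polyfit(offset[line_free], flux[line_free], 1)
continuum_fit = slope * offset + intercept

# Step 2: integrate the fractional absorption with the trapezoid rule.
absorbed = 1.0 - flux / continuum_fit
equivalent_width = float(np.sum(0.5 * (absorbed[1:] + absorbed[:-1]) * np.diff(wave)))

# Analytic equivalent width of a Gaussian line, for comparison.
expected = float(depth * width * np.sqrt(2.0 * np.pi))
print(f"measured EW = {equivalent_width:.4f} A, analytic = {expected:.4f} A")
```

On this noise-free toy spectrum the two numbers should agree closely; with real data, a misplaced continuum or an unresolved blend shifts the result, which is where human (or agent) judgment enters.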

What took a trained postdoc six months now costs ~$100 with agents

Liu, YST+ 2024

Zooniverse.org

Agents can sift through hundreds of millions of ASAS-SN
light curves and reason their way to interesting outliers
Pesta & YST, in prep.


[Figure: two phase-folded light curves, Magnitude vs. Phase (-0.5 to 0.5)]
Caught in the brief, unstable evolutionary semi-detached phase
A rare alignment of
a massive Supergiant
in a 13-year orbit
P=13 years
P=2.3 days
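Light curves like these are usually inspected after phase folding. The sketch below shows the standard folding formula with phases in [-0.5, 0.5), matching the axis range of the figure; the helper name `fold`, the zero epoch, and the sample times are assumptions for illustration and are not taken from the ASAS-SN pipeline.

```python
import numpy as np

def fold(times, period, epoch=0.0):
    """Map observation times to orbital phase in [-0.5, 0.5)."""
    times = np.asarray(times, dtype=float)
    return ((times - epoch) / period + 0.5) % 1.0 - 0.5

period = 2.3                                  # days, the short period quoted above
rng = np.random.default_rng(1)
times = np.sort(rng.uniform(0.0, 100.0, 50))  # irregular sampling over 100 days
phase = fold(times, period)

# Observations separated by a whole number of periods land on the same phase.
same_phase = fold([0.575, 0.575 + 7 * period], period)
print(phase.min(), phase.max(), same_phase)
```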

Graduate student / Postdoc
The Plot Twist
A.I. still struggles with many tasks that are easy for humans

Princeton Language and Intelligence Lab, June 2024

Human accuracy: ~80%
GPT-4o: ~47%
Can A.I. reason about scientific charts?
ARC Prize Foundation (ARC-AGI-2, 2025)
Spatial Pattern Reasoning

Human Panel : ~ 100%
GPT-5 : ~10%
Moravec's Paradox (1988)
- Things that seem easy for humans might be hard for computers, and vice versa
Reversing the evolution of "intelligence"


Evolution Timeline: What came first vs. last
Conversational abilities
are the easiest to imitate

A lot of our holistic abilities were developed much earlier

Easy-for-AI
Complex calculations
Easy-for-Human
Logical inference (?)
Memorizing information
Language
Coding
Spatial reasoning
Common sense physics
(water flows downhill)
Basic motor skills
Visual reasoning
Understanding context

A.I. in Astronomy Olympiads


Pinheiro, ..., YST+, 2025
Visual reasoning remains a limiting factor for AI agents

Pinheiro, ..., YST+, 2025

Yang,... YST+, 2025, ICCV
Visualizing the knowledge graph in astronomy
Sun, YST+, 2024b
Visualizing the knowledge graph in astronomy
Li, YST+, 2026


de Haan, YST+, 2025
[Figure: Score (%) vs. Cost per 1 SED Source (USD); highlighted models: AstroSage-8B (de Haan, YST+ 2025a), AstroSage-70B (de Haan, YST+ 2025b)]
For astronomy Q&A, AstroSage-70B delivers GPT-5-level performance while costing 20x less
de Haan, YST+, 2024, 2025
AI-capable tasks are getting exponentially cheaper

de Haan, YST+, 2025
ARC Prize Foundation (ARC-AGI-2, 2025)
Spatial Pattern Reasoning

Human Panel : ~ 100%
GPT-5 : ~10%
Note:
Gemini 3.1, GPT-5.4 : ~ 85%

rethinkingscholarship.org
Epistemology: What counts as knowledge?
Supported by the Alfred P. Sloan Foundation and CCAPP / OSU

YST, Curtis-Trudel & Yao, 2026, Nature Astronomy


casper-osu.com
Easy-for-AI
Complex calculations
Easy-for-Human
Logical inference (?)
Memorizing information
Language
Coding
Spatial reasoning
Common sense physics
Basic motor skills
Visual reasoning
Understanding context

"One particularly useful conception of understanding emphasizes several interconnected capacities: characterizing the features of a system, communicating those characteristics so that others can mentally reconstruct them .... "
"... On this way of thinking, understanding is a matter of making the world intelligible to communities of inquirers. "
YST, Curtis-Trudel & Yao, 2026, Nature Astronomy
Narrative matters, rhetoric matters, context matters

Ernest Hemingway's six-word story
For sale:
Baby shoes,
Never worn.
"Scientific understanding in complex domains shares something of this character.
The ‘knowledge’ encoded in a successful model of galaxy formation
is not fully captured by its equations or even its predictions; it includes the tacit understanding of which features matter, why they matter, and how they connect to the broader enterprise of astronomy."
YST, Curtis-Trudel & Yao, 2026, Nature Astronomy

"We may discover what astronomy has always tacitly known: that understanding the universe is a distinctly human project—even when, especially when, we have non-human collaborators in the endeavour."
YST, Curtis-Trudel & Yao, 2026, Nature Astronomy
Extra Slides
Annotated
Labelled Data
supervised
tasks
Unlabelled Data
foundational models
Interacting with "physical" world
AI
astronomer
Example of learned "knowledge"
" If there is a gross underestimation in the MWIR bands,
consider exploring a wider range of fracAGN values in the agn module to improve the fit in these bands "

LLMs with self-play RL outperform native LLMs
[Figure: Chi-Square of the Fit vs. Number of Learning Iterations (0 to 30); annotated phases: "Exploration", "Exploitation"]
Why this plateau ??
- Number of photometry bands fitted within 1σ
Sun, YST+, 2024
Explaining James Webb's "little red dot" galaxies with Mephisto

Wavelength [micron]
Flux


Sun, YST+, 2025

A seamless and interpretable AI-human collaboration




Learn from the data
Summarize "knowledge"
Examine and include prior knowledge

A seamless and interpretable AI-human collaboration




Expedite discovery
Use the learned knowledge as context
Quantifying the growth of the field -- by groups of concepts
[Figure: Count [thousands] vs. Year (2000 to 2020); panel: Scientific concepts]
[Figure: Count [thousands] vs. Year (2000 to 2020); panels: Numerical simulation, Statistics]
Sun, YST+, 2024b
The number of ML concepts in astronomy has not grown
[Figure: Count [thousands] vs. Year (2000 to 2020); panel: Machine learning (Linear Regression, Gaussian Process, Random Forest, ......); annotated values: 152, 210, 230]
Sun, YST+, 2024b
Quantifying the cross-domain interaction:
How technical concepts inspire scientific ones
Knowledge graph via the literature-citation metric
[Diagram: concepts (Concept A: Dark Matter; Concept B: Plasmon) are contained in papers (Ting et al.; Einstein et al.), and the papers are linked by citation]
Distance between concept A to B =
Paper
averaged over all papers containing concept A
Knowledge graph via the literature-citation metric
Concept
Paper
Technical concept:
Neural Networks
Scientific concept: Large-Scale Structure
Cross-domain linkage shows a two-phase evolution
[Figure: Log Average Linkage vs. Year (2000 to 2020) for Numerical simulation x scientific concepts (N-body simulation, Hydrodynamical simulation); phases labelled Technology development and Technology deployment]
Technology development: simulations being developed; linkage between Numerical Simulations and the scientific concept (Large-Scale Structure) is decoupled
Technology deployment: simulations being deployed to sciences; linkage increases
Sun, YST+, 2024b
Interest in AI x Astronomy outpaces technological development
[Figure: Log Average Linkage vs. Year (2000 to 2020) for ML x Scientific concepts (Gaussian process, multi-layer perceptron)]
We don't understand how people intuitively understand plots

AI is still 20-50 points worse than humans
Brute force fine-tuning can close the gap in simple descriptive tasks, but not in visual reasoning tasks
Yang,... YST+, 2025, ICCV

YST+, 2025d
Our concepts show finer granularity than keywords


YST+, 2025d
The temporal evolution of concept co-occurrences in papers
[Figure: co-occurrence between categories: Cosmology, Galaxy, High-energy, Sun/Star, Exoplanet, Simulation, Instrument, AI/Stat]
Applications of AI in Stats
YST+, 2025d

We also need a capable model that can run cost-efficiently ...

capable model
vs.
cost efficiency
e.g., GPT-5
In the SED case study, we need ~0.1M tokens per source
= USD 1 per source ...




1B sources = $1 billion
e.g., Roman Space Telescope, Euclid Space Telescope
~ approximately the build cost
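The arithmetic behind these numbers can be checked in a few lines. The inputs are the figures quoted on the slide; the implied price per million tokens is a derived quantity, not a quoted vendor price, and the variable names are illustrative.

```python
tokens_per_source = 0.1e6      # ~0.1M tokens per SED source (from the slide)
cost_per_source_usd = 1.0      # ~USD 1 per source (from the slide)
n_sources = 1e9                # 1B sources, the scale of a Roman- or Euclid-class survey

# Price implied by the two slide numbers above (derived, not quoted).
implied_usd_per_million_tokens = cost_per_source_usd / (tokens_per_source / 1e6)
total_cost_usd = cost_per_source_usd * n_sources

print(f"implied price: ${implied_usd_per_million_tokens:.0f} per 1M tokens")
print(f"total for 1B sources: ${total_cost_usd:,.0f}")
```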
Can we improve lightweight
open-weights LLMs to perform well on astronomical tasks?


Natural Language Processing experts





Oak Ridge
National Lab

Argonne
National Lab

AstroMLab (astromlab.org)

Harvard-Smithsonian ADS


U. Illinois
Urbana-Champaign


The first extensive benchmarking effort in astronomy
Knowledge Recall
YST+, 2025a



AstroBench: High quality astronomy QA benchmark dataset


Nguyen, YST+ 2023

Benchmark multiple choice question - example
What is the primary reason for the decline in the number density of luminous quasars at redshifts greater than 5?
(A) A decrease in the overall star formation rate, leading to fewer potential host galaxies for quasars.
(B) An increase in the neutral hydrogen fraction in the intergalactic medium, which obscures the quasars’ light.
(C) A decrease in the number of massive black hole seeds that can form and grow into supermassive black holes.
(D) An increase in the average metallicity of the Universe, leading to a decrease in the efficiency of black hole accretion.
Special thanks to






Beyond just benchmarking astronomical knowledge recall
Warren Buffett :
" The trick is, when there is nothing to do, do nothing "
Still it is not very scalable


LLaMA-3.1 70b throughput on four H100 GPUs


= ~ 100 tokens / second
1 SED source = 15 GPU minutes
1B sources = 10M GPU days
A cluster with 10,000 H100 GPUs
running for 3 years
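As a sanity check on the scaling, the sketch below reproduces the GPU-day and wall-clock estimates from the slide's own inputs (15 GPU minutes per source, 1B sources, 10,000 GPUs). The variable names are illustrative, and perfect parallel efficiency is assumed.

```python
gpu_minutes_per_source = 15    # from the slide
n_sources = 1e9                # 1B sources
cluster_gpus = 10_000          # H100 GPUs in the hypothetical cluster

total_gpu_days = gpu_minutes_per_source * n_sources / (60 * 24)
wall_clock_days = total_gpu_days / cluster_gpus      # assumes perfect scaling
wall_clock_years = wall_clock_days / 365.25

print(f"{total_gpu_days / 1e6:.1f}M GPU days, about {wall_clock_years:.1f} years on the cluster")
```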

= 0.03 USD

= 40 USD


Huang's Law
Compute Power

Year
CPU Moore's Law is plateauing
GPU is
picking up the pace

LLMs are getting very cheap, very quickly

The price drop has an e-folding time of approximately
3 months
YST, AstroMLab+, 2025
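To make the e-folding statement concrete: if the price follows p(t) = p0 · exp(−t/τ) with τ ≈ 3 months, it falls by a factor of e every three months. The sketch below assumes a pure exponential, which is an idealization of the trend rather than a fitted model, and the function name is illustrative.

```python
import math

E_FOLDING_MONTHS = 3.0   # from the slide

def relative_price(months_elapsed, tau=E_FOLDING_MONTHS):
    """Price relative to today after `months_elapsed`, for a pure exponential decline."""
    return math.exp(-months_elapsed / tau)

halving_time_months = E_FOLDING_MONTHS * math.log(2.0)   # time for the price to halve
after_one_year = relative_price(12.0)                    # exp(-4)

print(f"halving time: {halving_time_months:.1f} months")
print(f"price after 12 months: {after_one_year:.1%} of today's")
```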
[Figure sequence: Score (%) vs. Cost per 1 SED Source (USD), starting with models released before July 2024 and advancing in steps of +3 months. Categories: Open-Weight, Proprietary, Proprietary (Experimental / Not Released). Labelled models: Google Gemma-2, Google Gemini-1.5, DeepSeek v2, Alibaba Qwen-2.5, Meta LLaMA 3, Yi 01, X's Grok, Stepfun, Microsoft Phi-3.5, Nvidia's Nemotron, DeepSeek v3 / R1, OpenAI (o3), Google Gemini-2.0, Microsoft Phi-4, MiniMax 01, Gemini-2.5-Pro, Claude-3.7-Sonnet, Meta LLaMA 4]
1B sources = $1 billion
e.g., Roman Space Telescope, Euclid Space Telescope
~ approximately the build cost (July 2024)
3% of the build cost (March 2025)

Mephisto achieves the same success rate with 1/30 of the cost
March 2025
Sun, YST+, 2025
GPT-4o
QwQ-32B

Data-poor, Theory-rich

Collecting
more data
???

Data-poor, Theory-rich
Data-rich, Theory-poor

Roman, HSC, Euclid, DESI, SDSS, PFS

Data-poor, Theory-rich

What A.I. agent
can solve
Interesting astronomy problems
JWST SED Fitting
Summary :
Modern LLMs' reasoning capabilities make AI agents an exciting new paradigm for astronomical research.
Though limited in reasoning, Mephisto analyzes and navigates SED physical models as effectively as humans.
Nonetheless, expectations should be tempered: Moravec's paradox makes AI capabilities too uneven for full autonomy.
Fine-tuning models, building ecosystems, and proper benchmarking to enable cost-effective, well-rounded AI agents are the path forward.
Agentic Astronomy 2026
By Yuan-Sen Ting