[Slide: The Ohio State University — centers in the Physical Sciences, Environmental Sciences, and Biological Sciences (~2–3 each); 7 centers x 15M ~ 100M]
[Figure: constraints in the dark matter density (0.25–0.40) vs. growth amplitude (0.7–0.9) plane, Rubin Observatory]
The goal here is NOT just to solve faster the problems we can already solve,
but to solve astrophysical problems that would otherwise be too complex to tackle
My niece
Highly non-Gaussian
Weakly non-Gaussian
Cosmic large-scale structure
Data / Observation
Theory / Hypothesis
Analysis Pipelines
True
False
AlphaFold
Data / Observation
Theory / Hypothesis
Lambda-CDM
True
False
Data
Theory
State of the research
Making "plans"
Making "hypotheses"
Can A.I. agents understand spectral data (spectral energy distribution) from JWST?
A default fit with an SED model
Extinction model?
Young stellar population?
* In the classic tale of Faust, Mephisto is a demon who tempts the scholar Faust with knowledge and power in exchange for his soul.
Proposing actions
Execute actions
State evaluation
Knowledge distillation
Knowledge base
Proposing Actions - e.g., different physical models / parameter ranges
Execute Actions - write configuration files, run the codes autonomously
State Evaluation - evaluate the results (beyond a single error metric)
Knowledge Distillation - summarise useful actions given the previous state
" If the fit is overestimated in the UV and optical bands,
increasing the E_BV_lines parameter may lead to a better fit by accounting for more dust attenuation in these bands. "
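The four-stage loop above (propose, execute, evaluate, distill) can be sketched as a minimal agent scaffold. Everything here — the class name, the toy chi-square surrogate, the window-narrowing distillation rule — is an illustrative assumption, not the actual Mephisto implementation:

```python
import random

class AgentLoop:
    """Toy propose / execute / evaluate / distill loop over one parameter."""

    def __init__(self):
        self.knowledge = []              # distilled hints from earlier iterations
        self.best = (None, float("inf"))

    def propose(self):
        # Proposing actions: pick a candidate parameter value, biased toward
        # the region the knowledge base last flagged as promising.
        lo, hi = (self.knowledge[-1] if self.knowledge else (0.0, 10.0))
        return random.uniform(lo, hi)

    def execute(self, x):
        # Execute actions: stand-in for writing a config file and running
        # the SED fitter; returns a pretend chi-square with minimum at x = 3.
        return (x - 3.0) ** 2 + 1.0

    def evaluate(self, x, chi2):
        # State evaluation: compare the result against the best so far.
        if chi2 < self.best[1]:
            self.best = (x, chi2)

    def distill(self):
        # Knowledge distillation: record a narrowed search window around
        # the best parameter found, to guide the next proposal.
        x, _ = self.best
        self.knowledge.append((max(0.0, x - 1.0), min(10.0, x + 1.0)))

    def run(self, iterations=30):
        for _ in range(iterations):
            x = self.propose()
            chi2 = self.execute(x)
            self.evaluate(x, chi2)
            self.distill()
        return self.best
```

In the real system each stage is mediated by an LLM (proposing physically motivated parameter ranges, reading fit residuals, writing natural-language hints like the quote above) rather than the numeric rules used in this sketch.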
[Figure: chi-square of the fit vs. number of learning iterations (0–30), fitting JWST JADES data; GPT-4o baseline ("think without knowledge") shown for comparison]
[Figure: chi-square of the fit vs. number of learning iterations for Mephisto on COSMOS2020 SEDs; GPT-4o baseline ("think without knowledge") shown for comparison]
Mephisto finds better solutions using only 1% of the trials that brute-force methods require
[Figure: SED fit — flux vs. wavelength (micron)]
Learn from the data
Summarize "knowledge"
Examine and include prior knowledge
Expedite discovery
Use the learned knowledge as context
https://tingyuansen.github.io/NASA_AI_ML_STIG
Next Monday (4pm ET)
Graduate student
Princeton Language and Intelligence Lab, June 2024
Human accuracy: ~80%
GPT-4o: ~47%
ARC Prize Foundation (ARC-AGI-2, 2025)
Human Panel: ~100%
GPT-5: ~10%
Complex calculations
Logical inference (?)
Memorizing information
Language
Coding
Spatial reasoning
Common sense physics
(water flows downhill)
Basic motor skills
Visual reasoning
Understanding context
AI is still 20–50 points worse than humans
Brute-force fine-tuning can close the gap on simple descriptive tasks, but not on visual reasoning tasks
Cosmology
Galaxy
High-energy
Sun/Star
Exoplanet
Simulation
Instrument
AI/Stat
Applications of AI in Stats
e.g., GPT-5
In the SED case study, we need ~0.1M tokens per source
= USD 1 per source ...
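The per-source cost quoted above follows from simple token arithmetic; the USD 10 per million tokens used here is an assumed, representative frontier-model API price, not a figure from the talk:

```python
tokens_per_source = 0.1e6        # ~0.1M tokens per SED source (from the case study)
usd_per_million_tokens = 10.0    # assumed API price, for illustration only

cost_per_source = tokens_per_source / 1e6 * usd_per_million_tokens
print(f"~USD {cost_per_source:.2f} per source")  # ~USD 1.00 per source
```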
e.g., Roman Space Telescope, Euclid Space Telescope
Natural Language Processing experts
Oak Ridge National Lab
Argonne National Lab
Harvard-Smithsonian ADS
U. Illinois Urbana-Champaign
Knowledge Recall
What is the primary reason for the decline in the number density of luminous quasars at redshifts greater than 5?
A. A decrease in the overall star formation rate, leading to fewer potential host galaxies for quasars.
B. An increase in the neutral hydrogen fraction in the intergalactic medium, which obscures the quasars’ light.
C. A decrease in the number of massive black hole seeds that can form and grow into supermassive black holes.
D. An increase in the average metallicity of the Universe, leading to a decrease in the efficiency of black hole accretion.
[Figure: score (%) vs. cost per 1 SED source (USD); domain experts score ~67%, 20 points below AI; AstroSage-8B (de Haan, YST+ 2025a) and AstroSage-70B (de Haan, YST+ 2025b) marked]
JWST SED Fitting
Making "plans"
Making "hypotheses"
Annotated / Labelled Data
Unlabelled Data
Interacting with "physical" world
" If there is a gross underestimation in the MWIR bands,
consider exploring a wider range of fracAGN values in the agn module to improve the fit in these bands "
[Figure: chi-square of the fit vs. number of learning iterations (0–30)]
Why this plateau?
[Figure: fit quality vs. number of learning iterations, now scoring chi-square of the fit minus the number of photometry bands fitted within 1σ]
"Exploration"
"Exploitation"
300,000 papers → Mistral 7B → 1,000,000 concepts
Mistral 7B: concept merging and pruning
spectra = spectroscopy = spectral analysis
Too granular
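Merging synonymous concepts such as "spectra" = "spectroscopy" = "spectral analysis" amounts to mapping each name to a canonical representative. In this sketch the synonym groups are simply given; in the actual pipeline an LLM proposes them:

```python
def merge_concepts(concepts, synonym_groups):
    """Map each concept to a canonical name and drop duplicates.

    synonym_groups: list of sets of names judged equivalent
    (assumed supplied by an LLM pass; hard-coded here for illustration).
    """
    canonical = {}
    for group in synonym_groups:
        head = sorted(group)[0]          # deterministic representative
        for name in group:
            canonical[name] = head
    merged = {canonical.get(c, c) for c in concepts}
    return sorted(merged)

groups = [{"spectra", "spectroscopy", "spectral analysis"}]
result = merge_concepts(["spectra", "spectroscopy", "dark matter"], groups)
# → ["dark matter", "spectra"]
```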
[Figure: count of scientific concepts (thousands, ~7–11k) per year, 2000–2020]
[Figure: counts (thousands, 0.3–1.5) per year, 2000–2020, for the concepts "numerical simulation" and "statistics"]
[Figure: count (thousands) per year, 2000–2020, of machine-learning concepts — Linear Regression, Gaussian Process, Random Forest, ......]
[Diagram: bipartite concept–paper graph. Papers (e.g., Ting et al., Einstein et al.) contain concepts (e.g., Concept A: Dark Matter; Concept B: Plasmon) and are connected by citations. Distance between concept A and B = (…) averaged over all papers containing concept A. Example pair: technical concept "Neural Networks" x scientific concept "Large-Scale Structure".]
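One plausible reading of the concept-to-concept distance above (averaged over all papers containing concept A) is a shortest-path search over the citation graph; the graph structure and distance function below are illustrative assumptions, not the slide's exact definition:

```python
from collections import deque

def concept_distance(papers_with_a, papers_with_b, citations):
    """Average, over papers containing concept A, of the shortest
    citation-graph distance to any paper containing concept B.
    """
    targets = set(papers_with_b)
    total = 0
    for start in papers_with_a:
        # Breadth-first search over the citation graph (treated as undirected).
        seen, queue = {start}, deque([(start, 0)])
        dist = None
        while queue:
            node, d = queue.popleft()
            if node in targets:
                dist = d
                break
            for nxt in citations.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, d + 1))
        # Unreachable pairs get a large penalty distance.
        total += dist if dist is not None else len(citations) + 1
    return total / len(papers_with_a)

# Toy graph: "ting" cites "einstein", which cites "zwicky".
cites = {"ting": ["einstein"], "einstein": ["ting", "zwicky"], "zwicky": ["einstein"]}
d = concept_distance(["ting"], ["zwicky"], cites)  # → 2.0
```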
[Figure: log average linkage (−4.6 to −4.0) vs. year (2000–2020), numerical simulation x scientific concepts. Technology development phase: while simulations are being developed, their linkage to scientific concepts (e.g., Large-Scale Structure) is decoupled.]
[Figure: same linkage curve. Technology deployment phase: as simulations are deployed to the sciences, the linkage to scientific concepts (e.g., Large-Scale Structure) increases.]
[Figure: log average linkage vs. year (2000–2020), numerical simulation x scientific concepts, split into N-body simulation and hydrodynamical simulation]
[Figure: log average linkage vs. year (2000–2020), ML x scientific concepts, e.g., Gaussian process and multi-layer perceptron]
[Figure: score (%) vs. cost per 1 SED source (USD), July 2024; annotations: "worse than GPT-4o", "cheaper but not as good", "domain experts". Models can vary by three orders of magnitude in "value"!]
[Figure: calibration plot — fraction of correct answers (%) vs. stated confidence (%), 50–100, with under-confident and over-confident regions marked; models pre-summer 2024 vs. after summer 2024]
[Figure: score (%), 60–90, as of July 2024]
LLaMA-3.1 70B throughput on four H100 GPUs = ~100 tokens/second
1 SED source = 15 GPU-minutes
1B sources = 10M GPU-days
= a cluster with 10,000 H100 GPUs running for 3 years
(cost markers on the slide: 0.03 USD, 40 USD)
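The arithmetic behind these numbers checks out directly. Tokens-per-source and throughput are the slide's figures; the rest follows (the ~1,000 s per source is roughly consistent with the quoted 15 GPU-minutes):

```python
tokens_per_source = 1e5      # ~0.1M tokens per SED source (from the case study)
tokens_per_second = 100      # LLaMA-3.1 70B on four H100 GPUs (slide figure)

seconds_per_source = tokens_per_source / tokens_per_second  # 1,000 s
gpu_days = seconds_per_source / 86400 * 1e9                 # ~10M GPU-days for 1B sources
years_on_cluster = gpu_days / 10_000 / 365                  # ~3 years on 10,000 H100s
```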
Compute Power vs. Year
CPU Moore's Law is plateauing
GPU is picking up the pace
The price drop has an e-folding time of approximately 3 months
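An e-folding time of ~3 months means the cost follows cost(t) = cost(0) · exp(−t / 3) with t in months, so over one year the price falls by a factor of e⁴ ≈ 55. A quick check (the exponential form is the standard reading of "e-folding time", not stated on the slide):

```python
import math

def cost_after(months, initial_cost, efold_months=3.0):
    """Exponential price decline with the stated ~3-month e-folding time."""
    return initial_cost * math.exp(-months / efold_months)

drop_per_year = cost_after(0, 1.0) / cost_after(12, 1.0)  # e**4 ≈ 54.6x
```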
[Figure: score (%) vs. cost per 1 SED source (USD), models before July 2024]
[Figure: + 3 months — Google Gemma-2, Google Gemini-1.5, DeepSeek v2; open-weight vs. proprietary marked]
[Figure: + 3 months — Alibaba Qwen-2.5, Meta LLaMA 3, Yi 01, X's Grok, Stepfun, Microsoft Phi-3.5, Nvidia's Nemotron; open-weight vs. proprietary marked]
[Figure: + 3 months — DeepSeek v3 / R1; proprietary (experimental / not released) models marked]
[Figure: + 3 months — OpenAI o3, Google Gemini-2.0; proprietary (experimental / not released) models marked]
[Figure: + 3 months — Microsoft Phi-4, MiniMax 01, Gemini-2.5-Pro, Claude-3.7-Sonnet, Meta LLaMA 4]
e.g., Roman Space Telescope, Euclid Space Telescope
[Diagram: Data-poor, Theory-rich → (collecting more data: Roman, HSC, Euclid, DESI, SDSS, PFS) → Data-rich, Theory-poor; next step marked "???"]