Yuan-Sen Ting (OSU)

AstroMLab 5: Structured Summaries and Concept Extraction for 400,000 Astrophysics Papers

on behalf of AstroMLab

In open-world exploration, can large language model agents match human researchers?

A seamless AI-human collaboration

Learn from the data

Summarize "knowledge"

Examine and include prior knowledge

The Ohio State University

Oak Ridge National Lab

Argonne National Lab

AstroMLab (astromlab.org)

Harvard-Smithsonian ADS

U. Illinois Urbana-Champaign

De Haan, YST+ 2025

Figure axes: Score (%), Cost per 1 SED Source (USD)

AstroSage-8B
(de Haan, YST+ 2025a)

AstroSage-70B
(de Haan, YST+ 2025b)

For astronomy Q&A, AstroSage-70B delivers GPT-5-level performance while costing 20x less

Robust agentic pipelines in astronomy demand a more robust "searchable" database.

Full text

Keywords

Keywords are hopeless... either too generic ...

Fleshing out the hierarchy

Full text

Keywords

Concepts

Summary

High-quality summaries from top-tier models (GPT-4o++)

NSF NAIRR, Microsoft Accelerating Foundation Models Academic Research Grant, OpenAI Research Access Program, NVIDIA Academic Grant Program.

Distilling further to obtain standardized concepts

Our concepts offer finer granularity than keywords
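To illustrate how concept tags could support finer-grained lookup than a handful of broad keywords, here is a minimal sketch of an inverted index from concepts to papers. The paper IDs, concept strings, and the functions `build_index` and `search` are all hypothetical and are not taken from the AstroMLab pipeline.

```python
from collections import defaultdict

# Hypothetical paper IDs and concept lists (illustrative only).
paper_concepts = {
    "paper-A": ["stellar spectroscopy", "chemical abundances"],
    "paper-B": ["chemical abundances", "galactic archaeology"],
    "paper-C": ["exoplanet atmospheres"],
}

def build_index(paper_concepts):
    """Map each concept to the set of paper IDs tagged with it."""
    index = defaultdict(set)
    for paper_id, concepts in paper_concepts.items():
        for concept in concepts:
            index[concept].add(paper_id)
    return index

def search(index, query_concepts):
    """Return sorted IDs of papers tagged with every query concept."""
    sets = [index.get(c, set()) for c in query_concepts]
    return sorted(set.intersection(*sets)) if sets else []

index = build_index(paper_concepts)
print(search(index, ["chemical abundances"]))
print(search(index, ["chemical abundances", "galactic archaeology"]))
```

Requiring every query concept to match narrows the results as more concepts are added, which is one simple way finer-grained tags can help retrieval.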

Concepts spread across the embedding space while summary and abstract embeddings cluster, enabling more diverse retrieval
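One way to quantify "spread" versus "cluster" is the mean pairwise cosine distance within a set of embeddings. The sketch below uses made-up 2-D vectors in place of real embeddings; the metric choice and the function names are assumptions for illustration, not the analysis used in the talk.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_pairwise_distance(vectors):
    """Average cosine distance over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

# Toy vectors standing in for real embeddings (made-up values).
clustered = [(1.0, 0.0), (0.9, 0.1), (0.95, 0.05)]
spread = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]

print(mean_pairwise_distance(clustered) < mean_pairwise_distance(spread))
```

A larger mean distance indicates that the vectors cover more of the embedding space, which is the property the slide attributes to concept embeddings.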

Summary and abstract

Understanding how concepts connect - within domain

Understanding how concepts connect - cross domain

How concept co-occurrences evolve in astro-ph papers

Figure panels and axes: Cosmology, Galaxy, High-energy, Sun/Star, Exoplanet, Simulation, Instrument, AI/Stat
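Tracking concept co-occurrences over time can be sketched as counting, for each publication year, how often each pair of concepts appears in the same paper. The records and the function `cooccurrence_by_year` below are hypothetical and only illustrate the counting step, not the actual AstroMLab analysis.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical records: (publication year, concepts extracted from one paper).
papers = [
    (2015, ["dark matter halo", "weak lensing"]),
    (2015, ["dark matter halo", "n-body simulation"]),
    (2023, ["weak lensing", "neural network"]),
    (2023, ["weak lensing", "neural network", "dark matter halo"]),
]

def cooccurrence_by_year(records):
    """Count concept pairs that share a paper, grouped by year."""
    counts = defaultdict(Counter)
    for year, concepts in records:
        # Sort and deduplicate so each unordered pair has one canonical key.
        for pair in combinations(sorted(set(concepts)), 2):
            counts[year][pair] += 1
    return counts

counts = cooccurrence_by_year(papers)
print(counts[2015][("neural network", "weak lensing")])
print(counts[2023][("neural network", "weak lensing")])
```

Comparing the per-year counts for a given pair shows whether two themes are appearing together more or less often over time.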

Applications of AI in Stats

Visualizing the knowledge graph in astronomy

Sun, YST+, 2024

astrokg.github.io

All concepts and derivatives are publicly available!

https://github.com/tingyuansen/astro-ph_knowledge_graph

Summary:

Autonomous agents need better knowledge bases: full text, abstracts, and keywords alone don't cut it for effective retrieval.

LLM-curated summaries and ~10,000 concepts bridge this gap, enabling finer-grained retrieval and classification.

Concept embeddings capture diverse, cross-cutting themes that abstract and full-text embeddings miss.

Concept co-occurrences and knowledge graphs reveal how research themes in astronomy connect and evolve over time.

Keywords are hopeless... or too specific ...

