Yuan-Sen Ting (OSU)
AstroMLab 5: Structured Summaries and Concept Extraction for 400,000 Astrophysics Papers

on behalf of AstroMLab
In open-world exploration, can large language model agents match human researchers?


A seamless AI-human collaboration




Learn from the data
Summarize "knowledge"
Examine and include prior knowledge


The Ohio State University





Oak Ridge National Lab

Argonne National Lab

AstroMLab (astromlab.org)

Harvard-Smithsonian ADS


U. Illinois Urbana-Champaign


[Figure (de Haan, YST+ 2025): Score (%) versus Cost per 1 SED Source (USD). Highlighted models: AstroSage-8B (de Haan, YST+ 2025a) and AstroSage-70B (de Haan, YST+ 2025b).]
For astronomy Q&A, AstroSage-70B delivers GPT-5-level performance at 20x lower cost.
Robust agentic pipelines in astronomy demand a better "searchable" database.

Full text
Keywords
Keywords are hopeless: either too generic or too specific.


Fleshing out the hierarchy
Full text
Keywords
Concepts
Summary
High-quality summaries from top-tier models (GPT-4o++)
Supported by: NSF NAIRR, Microsoft Accelerating Foundation Models Academic Research Grant, OpenAI Research Access Program, NVIDIA Academic Grant Program.

Distilling further to obtain standardized concepts

Our concepts offer finer granularity than keywords

Concepts spread across the embedding space while summary and abstract embeddings cluster, enabling more diverse retrieval
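
One way to put a number on "spread" versus "cluster" is the mean pairwise cosine distance of a set of embeddings. The sketch below is only an illustration under stated assumptions: the vectors are random toy data rather than the actual concept, summary, or abstract embeddings, and the function name is hypothetical.

```python
import numpy as np

def mean_pairwise_cosine_distance(vectors: np.ndarray) -> float:
    """Average cosine distance over all distinct pairs of rows."""
    # Normalize rows to unit length so dot products equal cosine similarities.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Use the strict upper triangle: each pair once, no self-pairs.
    i, j = np.triu_indices(len(unit), k=1)
    return float(np.mean(1.0 - sims[i, j]))

# Toy data (assumption): one tightly clustered set and one spread-out set.
rng = np.random.default_rng(0)
clustered = np.array([1.0, 0.0, 0.0]) + 0.05 * rng.normal(size=(50, 3))
spread = rng.normal(size=(50, 3))

clustered_score = mean_pairwise_cosine_distance(clustered)
spread_score = mean_pairwise_cosine_distance(spread)
print(f"clustered: {clustered_score:.3f}, spread: {spread_score:.3f}")
```

A higher score means the vectors point in more varied directions, which is the property that would make nearest-neighbour retrieval return more diverse results.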

Summary and abstract
Understanding how concepts connect - within domain

Understanding how concepts connect - cross domain
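
The within-domain and cross-domain views can be sketched as a simple split of concept co-occurrence links. This is a minimal illustration, assuming each paper has been reduced to a set of concepts and each concept carries one domain label; the concepts, labels, and papers below are hypothetical, not taken from the released data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical concept-to-domain labels.
concept_domain = {
    "dark energy": "Cosmology",
    "weak lensing": "Cosmology",
    "stellar mass function": "Galaxy",
    "neural networks": "AI/Stat",
}

# Hypothetical papers, each reduced to its set of extracted concepts.
papers = [
    {"dark energy", "weak lensing"},
    {"weak lensing", "neural networks"},
    {"weak lensing", "stellar mass function", "neural networks"},
]

def link_counts(papers, concept_domain):
    """Count co-occurring concept pairs, split by same vs. different domain."""
    within, cross = Counter(), Counter()
    for concepts in papers:
        # Sort so each unordered pair always maps to the same key.
        for a, b in combinations(sorted(concepts), 2):
            target = within if concept_domain[a] == concept_domain[b] else cross
            target[(a, b)] += 1
    return within, cross

within, cross = link_counts(papers, concept_domain)
print("within-domain links:", dict(within))
print("cross-domain links:", dict(cross))
```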

How concept co-occurrences evolve in astro-ph papers
[Figure: concept co-occurrences between domains. Domain labels: Cosmology, Galaxy, High-energy, Sun/Star, Exoplanet, Simulation, Instrument, AI/Stat.]
Applications of AI in Stats
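
Tracking how co-occurrences evolve amounts to counting domain pairs per publication year and comparing their shares across years. The sketch below assumes each paper is represented by its year and the set of domains its concepts fall into; the records and function names are hypothetical and only illustrate the bookkeeping.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: (publication year, domains of the paper's concepts).
records = [
    (2010, {"Cosmology", "Galaxy"}),
    (2010, {"Galaxy", "Simulation"}),
    (2020, {"Cosmology", "AI/Stat"}),
    (2020, {"Galaxy", "AI/Stat"}),
    (2020, {"Cosmology", "Galaxy"}),
]

def cooccurrence_by_year(records):
    """Return counts[year][(domain_a, domain_b)] with alphabetically ordered pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for year, domains in records:
        for pair in combinations(sorted(domains), 2):
            counts[year][pair] += 1
    return counts

def pair_share(counts, year, pair):
    """Fraction of that year's co-occurrences that belong to the given pair."""
    total = sum(counts[year].values())
    return counts[year][pair] / total if total else 0.0

counts = cooccurrence_by_year(records)
share_2010 = pair_share(counts, 2010, ("Cosmology", "Galaxy"))
share_2020 = pair_share(counts, 2020, ("Cosmology", "Galaxy"))
print(share_2010, share_2020)
```

Normalizing by the yearly total keeps the comparison meaningful when the number of papers grows from year to year.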
Visualizing the knowledge graph in astronomy
Sun, YST+, 2024
astrokg.github.io
All concepts and derivatives are publicly available!

https://github.com/tingyuansen/astro-ph_knowledge_graph
Summary:
Autonomous agents need better knowledge bases: full text, abstracts, and keywords alone don't cut it for effective retrieval.
LLM-curated summaries and ~10,000 concepts bridge this gap, enabling finer-grained retrieval and classification.
Concept embeddings capture diverse, cross-cutting themes that abstract and full-text embeddings miss.
Concept co-occurrences and knowledge graphs reveal how research themes in astronomy connect and evolve over time.
Keywords are hopeless ... or too specific.



Summaries and Concepts in astro-ph
By Yuan-Sen Ting