Grobid
Metadata extraction from PDF documents using machine learning tools in INSPIRE-HEP
Jacopo Notarstefano, Ilias Koutsakis and Jan Åge Lavik
Invenio Developers Forum, 2015-09-28
Use case:
- Only have the PDF of a paper
- Limited metadata, or no easy export from the source
- Save the time spent typing metadata manually
- User / Cataloger perspective
Step 1: Initial interface
- Allow catalogers to upload a PDF
- Automatic extraction of metadata
- Check extracted metadata
- Push to system
Step 2: Editing
- Allow catalogers to edit extracted data
- Perhaps integrate JSONEditor?
Step 3: Refextract killer
- Take over bibliographic reference parsing
- Consolidation against other services (e.g. CrossRef)
- Integrate into automatic ingestion workflows
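As a rough illustration of what consuming parsed references could look like, the sketch below flattens one TEI `biblStruct` element, the structure Grobid emits for a reference, into a plain dictionary. The function name `parse_reference`, the dictionary keys, and the trimmed sample are assumptions made for this example; they are not part of Grobid, refextract, or INSPIRE.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # how ElementTree names xml:id


def parse_reference(bibl):
    """Flatten one TEI <biblStruct> element into a dict (hypothetical layout)."""

    def text(path):
        # findtext returns "" for a missing or empty element; normalize to None.
        value = bibl.findtext(path, default="", namespaces=TEI_NS)
        return " ".join(value.split()) or None

    authors = []
    for pers in bibl.findall("tei:analytic/tei:author/tei:persName", TEI_NS):
        parts = [f.text for f in pers.findall("tei:forename", TEI_NS)]
        parts.append(pers.findtext("tei:surname", default="", namespaces=TEI_NS))
        authors.append(" ".join(p.strip() for p in parts if p and p.strip()))

    date = bibl.find("tei:monogr/tei:imprint/tei:date", TEI_NS)
    return {
        "id": bibl.get(XML_ID),
        "title": text("tei:analytic/tei:title"),
        "authors": authors,
        "journal": text("tei:monogr/tei:title"),
        "volume": text("tei:monogr/tei:imprint/tei:biblScope[@unit='volume']"),
        "page": text("tei:monogr/tei:imprint/tei:biblScope[@unit='page']"),
        "year": date.get("when") if date is not None else None,
    }


# Trimmed sample reference, for illustration only.
SAMPLE_REFERENCE = """<biblStruct xmlns="http://www.tei-c.org/ns/1.0" xml:id="b2">
  <analytic>
    <title level="a" type="main">
      Spectral Properties of the Biconfluent Heun differential equation
    </title>
    <author><persName>
      <forename type="first">E</forename>
      <forename type="middle">R</forename>
      <surname>Arriola</surname>
    </persName></author>
    <author><persName>
      <forename type="first">A</forename>
      <surname>Zarzo</surname>
    </persName></author>
  </analytic>
  <monogr>
    <title level="j">J. Comput. Appl. Math</title>
    <imprint>
      <biblScope unit="volume">37</biblScope>
      <biblScope unit="page">161</biblScope>
      <date type="published" when="1991" />
    </imprint>
  </monogr>
</biblStruct>"""

reference = parse_reference(ET.fromstring(SAMPLE_REFERENCE))
print(reference)
```

A flat dictionary like this would be one possible starting point for matching against an external service, though the consolidation step itself is not shown here.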
Step 4: Submission
- Pre-fill user submission forms from PDF
Future: Feedback and retraining
- System feeds back corrections to Grobid
- Advanced visual interface to correct extraction (e.g. side-by-side view)
Grobid
- Java library
- Based on Conditional Random Fields (CRF)
- Converts raw PDF content to TEI (Text Encoding Initiative) XML
- Widely used (e.g. ResearchGate, Mendeley, HAL)
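A minimal sketch of how a client might read header fields out of Grobid's TEI output, using only Python's standard library. The function name `parse_header`, the shape of the returned dictionary, and the trimmed sample document are illustrative assumptions, not part of Grobid or INSPIRE.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def _clean(element):
    """Return whitespace-normalized element text, or None if absent or empty."""
    if element is None or element.text is None:
        return None
    return " ".join(element.text.split()) or None


def parse_header(tei_xml):
    """Extract title, publication date and authors from a TEI header."""
    root = ET.fromstring(tei_xml)
    header = root.find("tei:teiHeader", TEI_NS)
    if header is None:
        raise ValueError("no <teiHeader> element found")

    # Searching only inside the header keeps reference authors out of the result.
    authors = []
    for pers in header.findall(".//tei:analytic/tei:author/tei:persName", TEI_NS):
        forenames = [_clean(f) for f in pers.findall("tei:forename", TEI_NS)]
        authors.append({
            "forenames": [f for f in forenames if f],
            "surname": _clean(pers.find("tei:surname", TEI_NS)),
        })

    date = header.find(".//tei:publicationStmt/tei:date", TEI_NS)
    return {
        "title": _clean(header.find(".//tei:titleStmt/tei:title", TEI_NS)),
        "date": date.get("when") if date is not None else None,
        "authors": authors,
    }


# Trimmed sample header, for illustration only.
SAMPLE_HEADER = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader xml:lang="en">
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">
          Spectra generated by a confined softcore Coulomb potential
        </title>
      </titleStmt>
      <publicationStmt>
        <date type="published" when="2014-07-28">28 Jul 2014</date>
      </publicationStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author>
              <persName>
                <forename type="first">Richard</forename>
                <forename type="middle">L</forename>
                <surname>Hall</surname>
              </persName>
            </author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""

record = parse_header(SAMPLE_HEADER)
print(record)
```

Fields that Grobid leaves empty come back as `None` or an empty list, which a cataloger-facing interface could flag for manual checking.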
Example
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader xml:lang="en">
<fileDesc>
<titleStmt>
<title level="a" type="main">
Spectra generated by a confined softcore Coulomb potential
</title>
</titleStmt>
<publicationStmt>
<publisher/>
<availability status="unknown">
<licence/>
</availability>
<date type="published" when="2014-07-28">28 Jul 2014</date>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<author>
<persName>
<forename type="first">Richard</forename>
<forename type="middle">L</forename>
<surname>Hall</surname>
</persName>
<affiliation>
<orgName type="department">
Department of Mathematics and Statistics
</orgName>
<orgName type="institution">
Concordia University
</orgName>
<address>
<addrLine>1455 de Maisonneuve Boulevard West</addrLine>
<postCode>H3G 1M8</postCode>
<settlement>Montréal</settlement>
<region>Québec</region>
<country key="CA">Canada</country>
</address>
</affiliation>
</author>
<biblStruct xml:id="b2">
<analytic>
<title level="a" type="main">
Spectral Properties of the Biconfluent Heun differential equation
</title>
<author>
<persName>
<forename type="first">E</forename>
<forename type="middle">R</forename>
<surname>Arriola</surname>
</persName>
</author>
<author>
<persName>
<forename type="first">A</forename>
<surname>Zarzo</surname>
</persName>
</author>
<author>
<persName>
<forename type="first">J</forename>
<forename type="middle">S</forename>
<surname>Dehesa</surname>
</persName>
</author>
</analytic>
<monogr>
<title level="j">J. Comput. Appl. Math</title>
<imprint>
<biblScope unit="volume">37</biblScope>
<biblScope unit="page">161</biblScope>
<date type="published" when="1991" />
</imprint>
</monogr>
</biblStruct>
<titleStmt>
<title level="a" type="main"></title>
</titleStmt>
<publicationStmt>
<publisher>Wu</publisher>
<availability status="unknown">
<p>Copyright Wu</p>
</availability>
<date>19</date>
</publicationStmt>
Grobid Official Docs
Our Grobid fork, tailored for HEP papers (collaboration handling, improved reference handling)
Master's thesis of Joseph Boyd
Demo
Issues
- The PDF-to-XML conversion library is unmaintained and fails occasionally; a fallback to plain text needs to be added
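One way the plain-text fallback could be wired in is sketched below. This is a generic pattern under stated assumptions, not the behaviour of Grobid or of its conversion library: `extract_with_fallback` and both converter callables are hypothetical names introduced only for illustration.

```python
def extract_with_fallback(pdf_path, structured_extractor, text_extractor):
    """Try the structured (XML) converter first; on any failure use plain text."""
    try:
        return {"mode": "xml", "content": structured_extractor(pdf_path), "error": None}
    except Exception as error:  # broad on purpose: the converter can fail in many ways
        return {"mode": "text", "content": text_extractor(pdf_path), "error": str(error)}


# Hypothetical stand-ins for the real converters, to show the control flow.
def failing_converter(pdf_path):
    raise RuntimeError("conversion failed for " + pdf_path)


def plain_text_converter(pdf_path):
    return "raw text of " + pdf_path


result = extract_with_fallback("paper.pdf", failing_converter, plain_text_converter)
print(result["mode"], result["error"])
```

Recording which mode produced the content, along with the original error, would let later steps treat text-only results with more caution.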