Grobid

Metadata extraction from PDF documents using machine learning tools in INSPIRE-HEP

Jacopo Notarstefano, Ilias Koutsakis and Jan Åge Lavik

Invenio Developers Forum, 2015-09-28

Use case:

  • Only the PDF of a paper is available
  • Limited metadata, or no easy export from the source
  • Save the time spent typing metadata manually
  • User / Cataloger perspective

Step 1: Initial interface

  • Allow catalogers to upload PDF
  • Automatic extraction of metadata
  • Check extracted metadata
  • Push to system

Step 2: Editing

  • Allow catalogers to edit extracted data
  • Perhaps integrate JSONEditor?

Step 3: Refextract killer

  • Take over bibliographic reference parsing
  • Consolidation against other services (e.g. CrossRef)
  • Integrate into automatic ingestion workflows
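The consolidation step could start from a lookup URL built out of a parsed reference. The sketch below is an assumption about how that might look, not how INSPIRE or Grobid does it: it targets the public Crossref REST API `works` endpoint with its `query.bibliographic` and `rows` parameters, and the `reference` dictionary layout and the function name are hypothetical.

```python
from urllib.parse import urlencode

# Public Crossref REST API endpoint (an assumption; the slides only name CrossRef).
CROSSREF_WORKS = "https://api.crossref.org/works"


def crossref_query_url(reference, rows=1):
    """Build a Crossref lookup URL from a parsed reference (hypothetical dict layout)."""
    parts = [
        reference.get("title", ""),
        reference.get("journal", ""),
        reference.get("volume", ""),
        reference.get("year", ""),
    ]
    # Skip empty fields so the free-text query only carries what was extracted.
    bibliographic = " ".join(part for part in parts if part)
    query = urlencode({"query.bibliographic": bibliographic, "rows": rows})
    return CROSSREF_WORKS + "?" + query


url = crossref_query_url({"title": "Heun equation", "year": "1991"})
```

The function only builds the URL; fetching it and deciding whether the best match is good enough to overwrite the extracted fields is left open here.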

Step 4: Submission

  • Pre-fill user submission forms from PDF

Future: Feedback and retraining

  • System feeds back corrections to Grobid
  • Advanced visual interface to correct extraction (e.g. side-by-side view)

Grobid

  • Java library
  • Uses Conditional Random Fields (CRF) models
  • Converts raw PDF data to TEI (Text Encoding Initiative) XML
  • Widely used (e.g. ResearchGate, Mendeley, HAL)
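Since Grobid returns TEI, a cataloging interface first has to flatten that XML into fields it can display. A minimal sketch using only the Python standard library follows; the sample is an abbreviated, well-formed TEI header of the kind shown in the example, and the function names and the output dictionary layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# TEI namespace, as declared on the root element of the TEI output.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Abbreviated, well-formed sample header.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader xml:lang="en">
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">
          Spectra generated by a confined softcore Coulomb potential
        </title>
      </titleStmt>
      <publicationStmt>
        <date type="published" when="2014-07-28">28 Jul 2014</date>
      </publicationStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author>
              <persName>
                <forename type="first">Richard</forename>
                <forename type="middle">L</forename>
                <surname>Hall</surname>
              </persName>
            </author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def text_of(element):
    """Return the whitespace-normalised text of an element, or '' if it is missing."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def header_to_dict(tei_xml):
    """Flatten title, publication date and author names from a TEI header."""
    root = ET.fromstring(tei_xml)
    file_desc = root.find("tei:teiHeader/tei:fileDesc", NS)
    date = file_desc.find("tei:publicationStmt/tei:date", NS)
    authors = []
    for pers in file_desc.iterfind(".//tei:author/tei:persName", NS):
        given = " ".join(text_of(f) for f in pers.findall("tei:forename", NS))
        authors.append({"given": given, "family": text_of(pers.find("tei:surname", NS))})
    return {
        "title": text_of(file_desc.find("tei:titleStmt/tei:title", NS)),
        # Prefer the machine-readable 'when' attribute over the printed date.
        "date": date.get("when", "") if date is not None else "",
        "authors": authors,
    }


record = header_to_dict(SAMPLE)
```

Reading the `when` attribute rather than the element text gives a normalised date whenever Grobid managed to produce one.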

Example

<TEI
    xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader xml:lang="en">
        <fileDesc>
            <titleStmt>
                <title level="a" type="main">
                    Spectra generated by a confined softcore Coulomb potential
                </title>
            </titleStmt>
            <publicationStmt>
                <publisher/>
                <availability status="unknown">
                    <licence/>
                </availability>
                <date type="published" when="2014-07-28">28 Jul 2014</date>
            </publicationStmt>
<sourceDesc>
    <biblStruct>
        <analytic>
            <author>
                <persName>
                    <forename type="first">Richard</forename>
                    <forename type="middle">L</forename>
                    <surname>Hall</surname>
                </persName>
                <affiliation>
                    <orgName type="department">
                        Department of Mathematics and Statistics
                    </orgName>
                    <orgName type="institution">
                        Concordia University
                    </orgName>
                    <address>
                        <addrLine>1455 de Maisonneuve Boulevard West</addrLine>
                        <postCode>H3G 1M8</postCode>
                        <settlement>Montréal</settlement>
                        <region>Québec</region>
                        <country key="CA">Canada</country>
                    </address>
                </affiliation>
            </author>

<biblStruct xml:id="b2">
    <analytic>
        <title level="a" type="main">
            Spectral Properties of the Biconfluent Heun differential equation
        </title>
        <author>
            <persName>
                <forename type="first">E</forename>
                <forename type="middle">R</forename>
                <surname>Arriola</surname>
            </persName>
        </author>
        <author>
            <persName>
                <forename type="first">A</forename>
                <surname>Zarzo</surname>
            </persName>
        </author>
        <author>
            <persName>
                <forename type="first">J</forename>
                <forename type="middle">S</forename>
                <surname>Dehesa</surname>
            </persName>
        </author>
    </analytic>
    <monogr>
        <title level="j">J. Comput. Appl. Math</title>
        <imprint>
            <biblScope unit="volume">37</biblScope>
            <biblScope unit="page">161</biblScope>
            <date type="published" when="1991" />
        </imprint>
    </monogr>
</biblStruct>
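A parsed reference like the `biblStruct` above can be flattened into a simple record before it is matched or stored. The following is a sketch under assumptions: the record layout and function names are hypothetical, and the namespace is declared on the fragment itself so that it parses on its own (in full output it would be inherited from the enclosing `TEI` element).

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

REFERENCE = """<biblStruct xmlns="http://www.tei-c.org/ns/1.0" xml:id="b2">
  <analytic>
    <title level="a" type="main">
      Spectral Properties of the Biconfluent Heun differential equation
    </title>
    <author><persName>
      <forename type="first">E</forename>
      <forename type="middle">R</forename>
      <surname>Arriola</surname>
    </persName></author>
    <author><persName>
      <forename type="first">A</forename>
      <surname>Zarzo</surname>
    </persName></author>
    <author><persName>
      <forename type="first">J</forename>
      <forename type="middle">S</forename>
      <surname>Dehesa</surname>
    </persName></author>
  </analytic>
  <monogr>
    <title level="j">J. Comput. Appl. Math</title>
    <imprint>
      <biblScope unit="volume">37</biblScope>
      <biblScope unit="page">161</biblScope>
      <date type="published" when="1991" />
    </imprint>
  </monogr>
</biblStruct>"""


def clean(element):
    """Return the whitespace-normalised text of an element, or '' if it is missing."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def parse_reference(bibl):
    """Flatten one TEI biblStruct element into a plain dictionary."""
    # Map each biblScope unit (volume, page, ...) to its text.
    scopes = {s.get("unit"): clean(s) for s in bibl.iterfind(".//tei:biblScope", NS)}
    date = bibl.find(".//tei:imprint/tei:date", NS)
    authors = []
    for pers in bibl.iterfind("tei:analytic/tei:author/tei:persName", NS):
        names = [clean(f) for f in pers.findall("tei:forename", NS)]
        names.append(clean(pers.find("tei:surname", NS)))
        authors.append(" ".join(names))
    return {
        "id": bibl.get(XML_ID),
        "title": clean(bibl.find("tei:analytic/tei:title", NS)),
        "authors": authors,
        "journal": clean(bibl.find("tei:monogr/tei:title", NS)),
        "volume": scopes.get("volume", ""),
        "page": scopes.get("page", ""),
        "year": date.get("when", "") if date is not None else "",
    }


reference = parse_reference(ET.fromstring(REFERENCE))
```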
<titleStmt>
    <title level="a" type="main"></title>
</titleStmt>
<publicationStmt>
    <publisher>Wu</publisher>
    <availability status="unknown">
        <p>Copyright Wu</p>
    </availability>
    <date>19</date>
</publicationStmt>
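The last fragment appears to show an extraction that went wrong: the title is empty and the date is just "19". Cases like this are why a "check extracted metadata" step is useful. Below is a minimal sketch of automatic flags that could be shown to a cataloger; the two rules, the flag wording and the function name are illustrative assumptions, and the fragment is wrapped in a `fileDesc` element so that it parses on its own.

```python
import re
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# The suspicious fragment, wrapped in <fileDesc> so it is well-formed on its own.
SUSPECT = """<fileDesc xmlns="http://www.tei-c.org/ns/1.0">
  <titleStmt>
    <title level="a" type="main"></title>
  </titleStmt>
  <publicationStmt>
    <publisher>Wu</publisher>
    <availability status="unknown">
      <p>Copyright Wu</p>
    </availability>
    <date>19</date>
  </publicationStmt>
</fileDesc>"""

# Accepts YYYY, YYYY-MM or YYYY-MM-DD in the 'when' attribute.
ISO_DATE = re.compile(r"\d{4}(-\d{2}){0,2}")


def review_flags(file_desc):
    """Return human-readable warnings for a TEI fileDesc element."""
    flags = []
    title = file_desc.find("tei:titleStmt/tei:title", NS)
    if title is None or not "".join(title.itertext()).strip():
        flags.append("empty title")
    date = file_desc.find("tei:publicationStmt/tei:date", NS)
    if date is None:
        flags.append("missing date")
    elif not ISO_DATE.fullmatch(date.get("when", "")):
        flags.append("date has no normalised 'when' value")
    return flags


flags = review_flags(ET.fromstring(SUSPECT))
```

For this fragment both rules fire, so the record would be routed to manual review rather than pushed directly.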

Grobid Official Docs

Our Grobid fork, tailored* for HEP papers

*collaboration handling, improved reference handling

Master's thesis of Joseph Boyd

Demo

Issues

  • The PDF-to-XML conversion library is unmaintained and fails occasionally; a fallback to plain-text extraction is needed
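The fallback idea can be sketched as a small wrapper that tries the primary conversion and switches to text extraction when it fails. Everything here is hypothetical: the two converter functions are stand-ins, not Grobid calls, and the sketch assumes that a failed conversion raises an exception.

```python
def extract_with_fallback(pdf_path, primary, fallback):
    """Run the primary converter; if it raises, use the fallback instead.

    Returns a (source, result) pair so callers know which path produced the data.
    """
    try:
        return "xml", primary(pdf_path)
    except Exception:
        # Assumption: converter failures surface as exceptions.
        return "text", fallback(pdf_path)


def broken_converter(path):
    """Stand-in for a PDF-to-XML conversion that fails on this file."""
    raise RuntimeError("conversion failed for " + path)


def plain_text_converter(path):
    """Stand-in for a plain-text extraction."""
    return "plain text of " + path


source, result = extract_with_fallback("paper.pdf", broken_converter, plain_text_converter)
```

Returning the source alongside the result lets later steps treat text-only extractions with more caution, for example by always sending them to manual review.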
