GROBID Camp

Spring 2018

   Patrice Lopez

 

Development environment 1/2

  •  Switch to gradle for build and dependency management

      + speed (tests run 5 times faster, including Travis)

      + more readability

      + more control

      + more stable (damn mvn plugins!)

      - but no war generated for Tomcat/JBoss deployment

        (issue #272)

Development environment 2/2

  • Dropwizard for implementing the RESTful web services
    • framework that bundles all the required dependencies (jetty/jackson/jersey/metrics/guava/logback/etc.)
    • get directly a complete stable framework covering all aspects for RESFTful services (including tracking the application performance)

Enhancement of bibliographical reference processing

 

  • Effort with CERN last summer, with Andreas la Roi
  • Addition of a total of ~2500 training examples from INSPIRE-HEP (bib. ref or ref. bib. zones)
  • Reliable DOI and arXiv extraction and normalization
  • Better URL extraction
  • Identification of collaborations (ATLAS, LHC, etc.) (<orgName type="collaboration">)

Improvement on PubMed Central sample (1942 PDF)

 

Ratcliff/Obershelp Matching =

(Minimum Ratcliff/Obershelp similarity at 0.95)

 

all fields           98.69        88.45        77.69        82.72   (micro average)
                          98.69        88.03        77.8          82.57   (macro average)

 

all fields           98.58        89.83        80.1          84.68   (micro average)
                          98.58        89.85        80.27        84.77   (macro average)

 

Recent comparison

 

Test with 3,185 documents 64,495 references in total, mostly from chemical domains

 

 

 

 

 

 

D Tkaczyk, A Collins, P Sheridan, J Beel - arXiv preprint arXiv:1802.01168, 2018 - arxiv.org
(author of CERMINE)

Support of CrossRef REST API

  • We were supporting the CrossRef OpenURL web API
    • very high quality, specialized in metadata look-up
    • but super slow ~1 query per second
    • CrossRef library account necessary
    • not so good with titles (relatively strict matching)
  • New CrossRef REST API (thanks to Vincent Kaestle!)
    • up to 50 queries per second
    • no special account necessary
    • but more a search interface than a metadata look-up service: false positive, not possible to query with volume, issue, ...
    • no SLA, it can go down to a one query per second during several hours

 

Improvement on Pubmed Central API - Header model

 

Ratcliff/Obershelp Matching, similarity at 0.95)
===== Field-level results =====     end 2015

                       label                accuracy     precision    recall       f1      

all fields        95.2        78.03        71.45        74.59      (micro average)
                       95.2        77.98        70.86        74.17      (macro average)
===== Field-level results =====     version 0.4.1

all fields        95.69       80.31       74.71        77.41      (micro average)
                       95.69       80.57       74.32        77.24      (macro average)

===== Field-level results =====   with current with new CrossRef API - v0.5.1

all fields        96.56        87.39      82.98        85.13      (micro average)
                96.56       86.94       82.08        84.32      (macro average)

 

Support of CrossRef REST API

  • Solution for title/author look-up in particular header model, not supporting the usual journal name/volume/first page query so weak for bib. references
  • A "slow" consolidation based on old CrossRef OpenURL web API would still be necessary for bibliographical reference
  • ... but we might simply implement our own resolver service by acquiring the complete CrossRef repo, subscribing to the new CrossRef Metadata APIs Plus Service (proposed since January 2018, with a SLA)

 

 

TEI-based full-text benchmarking

 

  • We routinely evaluate GROBID with the PubMed Central sample, 1942 PDF + nlm files, in particular for each release to monitor accuracy, runtime, etc.
  • Following an evaluation exercice of ISTEX R&D on automatic full text structuring with GROBID, we extended the benchmarking to PDF + TEI files, with TEI generated by Pub2TEI from the native publisher XML file

 

Current work in progress

 

  • Better PDF parsing: pdfalto (composed and special characters, reading order, spacing, etc.)

  • Structuring ebook (pdf/ALTO): training based on embedded "outline" (project Opaline)

  • Long due new header model: regenerate and reformat training data, new features, etc. targeting 0.90 f1 instance-based

  • New DL models for sequence labelling, and for text classification (Keras, efficient java embeddings)

    • it is challenging to make it production ready (loading of resources, native integration, memory usage)

       

Thanks !

   Patrice Lopez

 

GROBID Camp 27.03.2018

By kermitt2

GROBID Camp 27.03.2018

  • 859