GROBID Camp

Spring 2018

Patrice Lopez

Development environment 1/2

Switch to gradle for build and dependency management

+ speed (tests run 5 times faster, including Travis)

+ more readability

+ more control

+ more stable (damn mvn plugins!)

- but no war generated for Tomcat/JBoss deployment

(issue #272)

Development environment 2/2

Dropwizard for implementing the RESTful web services
- framework that bundles all the required dependencies (jetty/jackson/jersey/metrics/guava/logback/etc.)
- get directly a complete stable framework covering all aspects for RESFTful services (including tracking the application performance)

Enhancement of bibliographical reference processing

Effort with CERN last summer, with Andreas la Roi
Addition of a total of ~2500 training examples from INSPIRE-HEP (bib. ref or ref. bib. zones)
Reliable DOI and arXiv extraction and normalization
Better URL extraction
Identification of collaborations (ATLAS, LHC, etc.) (<orgName type="collaboration">)

Improvement on PubMed Central sample (1942 PDF)

Ratcliff/Obershelp Matching =

(Minimum Ratcliff/Obershelp similarity at 0.95)

all fields 98.69 88.45 77.69 82.72 (micro average)
98.69 88.03 77.8 82.57 (macro average)

all fields 98.58 89.83 80.1 84.68 (micro average)
98.58 89.85 80.27 84.77 (macro average)

Recent comparison

Test with 3,185 documents 64,495 references in total, mostly from chemical domains

D Tkaczyk, A Collins, P Sheridan, J Beel - arXiv preprint arXiv:1802.01168, 2018 - arxiv.org
(author of CERMINE)

Support of CrossRef REST API

We were supporting the CrossRef OpenURL web API
- very high quality, specialized in metadata look-up
- but super slow ~1 query per second
- CrossRef library account necessary
- not so good with titles (relatively strict matching)
New CrossRef REST API (thanks to Vincent Kaestle!)
- up to 50 queries per second
- no special account necessary
- but more a search interface than a metadata look-up service: false positive, not possible to query with volume, issue, ...
- no SLA, it can go down to a one query per second during several hours

Improvement on Pubmed Central API - Header model

Ratcliff/Obershelp Matching, similarity at 0.95)
===== Field-level results ===== end 2015

label accuracy precision recall f1

all fields 95.2 78.03 71.45 74.59 (micro average)
95.2 77.98 70.86 74.17 (macro average)
===== Field-level results ===== version 0.4.1

all fields 95.69 80.31 74.71    77.41    (micro average)
   95.69 80.57 74.32    77.24    (macro average)
===== Field-level results =====   with current with new CrossRef API - v0.5.1

all fields 96.56 87.39 82.98 85.13 (micro average)
96.56 86.94 82.08 84.32 (macro average)

Support of CrossRef REST API

Solution for title/author look-up in particular header model, not supporting the usual journal name/volume/first page query so weak for bib. references
A "slow" consolidation based on old CrossRef OpenURL web API would still be necessary for bibliographical reference
... but we might simply implement our own resolver service by acquiring the complete CrossRef repo, subscribing to the new CrossRef Metadata APIs Plus Service (proposed since January 2018, with a SLA)

TEI-based full-text benchmarking

We routinely evaluate GROBID with the PubMed Central sample, 1942 PDF + nlm files, in particular for each release to monitor accuracy, runtime, etc.
Following an evaluation exercice of ISTEX R&D on automatic full text structuring with GROBID, we extended the benchmarking to PDF + TEI files, with TEI generated by Pub2TEI from the native publisher XML file

Current work in progress

Better PDF parsing: pdfalto (composed and special characters, reading order, spacing, etc.)
Structuring ebook (pdf/ALTO): training based on embedded "outline" (project Opaline)
Long due new header model: regenerate and reformat training data, new features, etc. targeting 0.90 f1 instance-based
New DL models for sequence labelling, and for text classification (Keras, efficient java embeddings)
- it is challenging to make it production ready (loading of resources, native integration, memory usage)

GROBID Camp

Spring 2018

Patrice Lopez

Development environment 1/2

Development environment 2/2

Enhancement of bibliographical reference processing

Improvement on PubMed Central sample (1942 PDF)

Recent comparison

Support of CrossRef REST API

Improvement on Pubmed Central API - Header model

Support of CrossRef REST API

TEI-based full-text benchmarking

Current work in progress

Thanks !

Patrice Lopez