➡ Perfectly recognized references: f-score up to 75%
➡ Estimated 90% for recent articles
➡ 1 million PDFs processed in 24h (10-CPU Xeon, 10 GB RAM, ~3 GB used on average, 9 threads), i.e. 11.5 PDF/s
➡ Very large open citation dataset (at least ~10 times the Open Citation corpus)
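As a sanity check on the throughput figure above, the claimed rate follows directly from the stated volume and time window:

```python
# Sanity check for the processing-rate claim: 1 million PDFs in 24 hours.
pdfs = 1_000_000
seconds = 24 * 60 * 60          # 86,400 seconds in 24 hours
rate = pdfs / seconds           # PDFs per second
print(f"{rate:.2f} PDF/s")      # → 11.57 PDF/s
```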
The legendary GROBID CRF cascade
133 PMC/HAL documents
Addition of 91 ISTEX documents
Evaluation on 19 ISTEX documents
80-20% split on 97 documents
20 documents from PubMed Central and HAL
Addition of 59 ISTEX full texts
Evaluation on 18 ISTEX documents
80-20% split on 97 documents
(Ratcliff/Obershelp matching, similarity threshold at 0.95)
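The matching criterion above can be illustrated with Python's `difflib`, whose `SequenceMatcher` implements the same Ratcliff/Obershelp algorithm (a sketch only: GROBID's own matcher is in Java, and the sample strings here are made up):

```python
from difflib import SequenceMatcher

def is_match(ref_a: str, ref_b: str, threshold: float = 0.95) -> bool:
    """Ratcliff/Obershelp similarity between two reference strings,
    accepted as a match when the ratio reaches the threshold."""
    return SequenceMatcher(None, ref_a, ref_b).ratio() >= threshold

# Near-identical reference strings (hypothetical examples) pass at 0.95.
a = "Lopez P. GROBID: Combining Automatic Bibliographic Data Recognition."
b = "Lopez P. GROBID: Combining Automatic Bibliographic Data Recognition"
print(is_match(a, b))   # → True
print(is_match(a, "Unrelated reference string"))   # → False
```

A high threshold like 0.95 tolerates small OCR or punctuation differences while rejecting genuinely different references.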
===== Field-level results ===== end 2015

label       accuracy  precision  recall  f1
all fields  95.2      78.03      71.45   74.59  (micro average)
all fields  95.2      77.98      70.86   74.17  (macro average)

===== Field-level results ===== version 0.4.1

label       accuracy  precision  recall  f1
all fields  95.69     80.31      74.71   77.41  (micro average)
all fields  95.69     80.57      74.32   77.24  (macro average)

===== Field-level results ===== with new CrossRef API - v0.5.1

label       accuracy  precision  recall  f1
all fields  96.56     87.39      82.98   85.13  (micro average)
all fields  96.56     86.94      82.08   84.32  (macro average)
Better PDF parsing: pdfalto (composed and special characters, reading order, spacing, etc.)
Structuring ebooks (PDF/ALTO): training based on the embedded "outline" (Opaline project)
Long-overdue new header model: regenerate and reformat training data, new features, etc., targeting 0.90 instance-based f1
New DL models for sequence labelling and for text classification (Keras, efficient Java embeddings)
Making them production-ready is challenging (loading of resources, native integration, memory usage)
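One common way to address the memory-usage concern is to memory-map the embedding matrix rather than loading it onto the heap; the OS then pages in only the vectors actually touched. This is a generic sketch in Python (not GROBID's actual Java embedding loader; file name, dimension, and helpers are all hypothetical):

```python
import mmap
import struct

DIM = 4  # embedding dimension (toy value for this sketch)

def write_embeddings(path: str, vectors: list[list[float]]) -> None:
    """Serialize embeddings as a flat array of little-endian float32."""
    with open(path, "wb") as f:
        for vec in vectors:
            f.write(struct.pack(f"<{DIM}f", *vec))

def lookup(path: str, index: int) -> list[float]:
    """Read one vector via mmap: resident memory stays small even
    when the embedding file is several GB."""
    record = DIM * 4  # bytes per float32 vector
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            raw = mm[index * record:(index + 1) * record]
            return list(struct.unpack(f"<{DIM}f", raw))

write_embeddings("emb.bin", [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]])
print(lookup("emb.bin", 1))  # → [4.0, 5.0, 6.0, 7.0]
```

The same idea carries over to the JVM via `FileChannel.map`, avoiding a multi-GB heap just to hold rarely-used word vectors.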