What do you do with 2457 theses?
John Dingle
Brock University
jdingle@brocku.ca
Topics
- Extracting structured data from text
- Value-added services for digital repositories
- Usage statistics / collection assessment
Starting Point
We have all the graduate theses produced at Brock each year, open-access, available via an API.
What could we do with them?
Cited references from theses
in machine-readable format
How hard could it be?

TXT → XML → JSON → DOIs
10.01439/8632452352357
10.13039/501100006895
10.15014/763223432244
Available Tools: 3
Needed Functionality
- Accept .txt or .pdf, return TEI XML
- Conditional Random Fields used to parse text into sections and identify components, including references
- Open source, actively maintained
- Demo site available or easy install
ParsCit
CERMINE
Grobid
But which one is the BEST?
Grobid
Grobid
Java, Apache license
Run as .jar, accessed via a RESTful API
In production at ResearchGate, Mendeley, and the CERN digital library (Invenio)
Extracts 55 elements from scholarly documents
- Sections, figures, references
Sample request (HTTP POST)
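A request of that shape can be sketched with curl, assuming a Grobid server on its default port 8070 and a local `thesis.pdf` (check the Grobid service documentation for the endpoint names in your version):

```shell
# POST a PDF to a local Grobid server and save the TEI XML response.
# processFulltextDocument parses the whole document; processReferences
# would return only the parsed bibliography.
curl --form input=@thesis.pdf \
     localhost:8070/api/processFulltextDocument \
     -o thesis.tei.xml
```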
Grobid Server
- public demo
- local install
- Docker image
HTTP request → TEI XML response
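The local-install and Docker options can be sketched in one line each; the image name is the community-maintained one on Docker Hub, and the tag here is an assumption — use whatever release is current:

```shell
# Run a Grobid server in Docker, exposing the default port 8070.
# (Tag is illustrative; check Docker Hub for the current release.)
docker run --rm -p 8070:8070 lfoppiano/grobid:0.7.3
```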
CrossRef
<biblStruct xml:id="b5">
<analytic>
<title level="a" type="main">Domestication by Cappuccino or a
Revenge on Urban Space? Control and Empowerment in the Management of Public Spaces</title>
<author>
<persName xmlns="http://www.tei-c.org/ns/1.0" coords="">
<forename type="first">R</forename>
<surname>Atkinson</surname>
</persName>
</author>
</analytic>
<monogr>
<title level="j">Urban Studies</title>
<title level="j" type="abbrev">CURS</title>
<idno type="ISSN">0042-0980</idno>
<imprint>
<biblScope unit="volume">40</biblScope>
<biblScope unit="issue">9</biblScope>
<biblScope unit="page" from="1829" to="1843" />
<date type="published" when="2003" />
</imprint>
</monogr>
<idno type="doi">10.1080/0042098032000106627</idno>
</biblStruct>
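Pulling fields out of that TEI is straightforward with the standard library. A minimal sketch, using a trimmed copy of the record above (a real Grobid response wraps every element in the TEI namespace `http://www.tei-c.org/ns/1.0`, stripped here for brevity):

```python
import xml.etree.ElementTree as ET

# Trimmed <biblStruct> as returned by Grobid (namespace omitted).
tei = """<biblStruct xml:id="b5">
  <analytic>
    <title level="a" type="main">Domestication by Cappuccino...</title>
    <author>
      <persName>
        <forename type="first">R</forename>
        <surname>Atkinson</surname>
      </persName>
    </author>
  </analytic>
  <monogr>
    <title level="j">Urban Studies</title>
    <imprint>
      <date type="published" when="2003"/>
    </imprint>
  </monogr>
  <idno type="doi">10.1080/0042098032000106627</idno>
</biblStruct>"""

ref = ET.fromstring(tei)
doi = ref.findtext('.//idno[@type="doi"]')
surname = ref.findtext('.//surname')
year = ref.find('.//date[@type="published"]').get('when')
print(doi, surname, year)
# → 10.1080/0042098032000106627 Atkinson 2003
```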

How accurate?
10 sample theses, each from a different discipline
Extracted list of references (as DOIs) with Grobid
Compared with a manually generated list
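The scoring behind the per-thesis numbers can be sketched as set arithmetic over the two DOI lists (the DOIs below are made-up placeholders, not real results):

```python
# Hypothetical results for one thesis: DOIs Grobid extracted vs. the
# manually compiled reference list.
grobid_dois = {"10.1000/a", "10.1000/b", "10.1000/c", "10.1000/wrong"}
manual_dois = {"10.1000/a", "10.1000/b", "10.1000/c", "10.1000/missed"}

correct = grobid_dois & manual_dois    # in both lists
false_pos = grobid_dois - manual_dois  # Grobid found it, but it isn't a real reference
false_neg = manual_dois - grobid_dois  # real reference that Grobid missed

total = len(correct) + len(false_pos) + len(false_neg)
for label, s in [("Correct", correct),
                 ("False Positive", false_pos),
                 ("False Negative", false_neg)]:
    print(f"{label}: {len(s) / total:.0%}")
```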
How accurate?
Best - MA Psychology
Correct: 77%
False Negative: 9%
False Positive: 14%

How accurate?
Average - M.Ed.
Correct: 63%
False Negative: 22%
False Positive: 15%

How accurate?
Worst - M.Sc. Chem.
Correct: 0%
False Negative: 100%
False Positive: 0%

How accurate?
Grobid claims ~80% accuracy
tested on PubMed Central articles
https://grobid.readthedocs.io/en/latest/grobid-04-2015.pdf
What to do?
Extract citation data for Library purposes
Make available in repository to users
Use in broader text mining workflows
- Extracting structured data from text
- Value-added services for digital repositories
- Usage statistics / collection assessment
Questions?
jdingle@brocku.ca
slides.com/jdingle/c4ln2017