What do you do with 2457 theses?

John Dingle

Brock University

jdingle@brocku.ca

Topics

Extracting structured data from text
Value-added services for digital repositories
Usage statistics / collection assessment

Starting Point

We have all the graduate theses produced at Brock each year, open-access, available via an API.

What could we do with them?

Cited references from theses

in machine-readable format

How hard could it be?

PDF

TXT

XML

JSON

DOIs

10.01439/8632452352357

10.13039/501100006895

10.15014/763223432244

Available Tools = 3

Needed Functionality
- Accept .txt or .pdf, return TEI XML
- Use Conditional Random Fields to parse text into sections and identify components, including references
- Open source, actively maintained
- Demo site available or easy install

ParsCit

Cermine

Grobid

But which one is the BEST?

Grobid

Java, Apache license

Run as .jar, accessed via a RESTful API

In production at ResearchGate, Mendeley, and the CERN digital library (Invenio)

Extracts 55 elements from scholarly documents

- Sections, figures, references

Sample request

.pdf -> http -> Grobid Server (public demo, local install, or Docker image) -> http -> .xml

CrossRef
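The request in the diagram above can be sketched in Python with the standard library alone. This is a minimal sketch, not the code used for the talk. It assumes a local Grobid server on port 8070, the `/api/processFulltextDocument` endpoint, the `input` form field, and the `consolidateCitations` parameter described in current Grobid documentation; the release available in 2017 may have used a different port or path. The function names and the `thesis.pdf` filename are hypothetical.

```python
import urllib.request
import uuid


def build_grobid_request(pdf_bytes, base_url="http://localhost:8070", consolidate=True):
    """Build (but do not send) a multipart POST carrying one PDF."""
    boundary = uuid.uuid4().hex
    # Part 1: the PDF itself, in the form field named "input".
    pdf_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="input"; filename="thesis.pdf"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("ascii") + pdf_bytes + b"\r\n"
    # Part 2: ask the server to consolidate references against an external
    # service (assumed parameter name: consolidateCitations).
    flag_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="consolidateCitations"\r\n\r\n'
        f"{1 if consolidate else 0}\r\n"
    ).encode("ascii")
    closing = f"--{boundary}--\r\n".encode("ascii")
    return urllib.request.Request(
        f"{base_url}/api/processFulltextDocument",
        data=pdf_part + flag_part + closing,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def fetch_tei(pdf_path, base_url="http://localhost:8070"):
    """Send one PDF to the server and return the TEI XML response as text."""
    with open(pdf_path, "rb") as handle:
        request = build_grobid_request(handle.read(), base_url)
    with urllib.request.urlopen(request, timeout=300) as response:
        return response.read().decode("utf-8")
```

Building the request separately from sending it makes the payload easy to inspect without a running server.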

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Domestication by Cappuccino or a Revenge on Urban Space? Control and Empowerment in the Management of Public Spaces</title>
		<author>
			<persName xmlns="http://www.tei-c.org/ns/1.0" coords="">
				<forename type="first">R</forename>
				<surname>Atkinson</surname>
			</persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Urban Studies</title>
		<title level="j" type="abbrev">CURS</title>
		<idno type="ISSN">0042-0980</idno>
		<imprint>
			<biblScope unit="volume">40</biblScope>
			<biblScope unit="issue">9</biblScope>
			<biblScope unit="page" from="1829" to="1843" />
			<date type="published" when="2003" />
		</imprint>
	</monogr>
	<idno type="doi">10.1080/0042098032000106627</idno>
</biblStruct>
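A record like the one above can be flattened into a plain dictionary, ready for JSON, with Python's built-in XML parser. This is a rough sketch under a few assumptions: the function name and the output keys are hypothetical, and elements are matched by local name so that it works whether or not the TEI namespace is declared on each element, as in the sample.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def parse_bibl_struct(xml_text):
    """Pull title, journal, authors, year, and DOI out of one biblStruct."""
    root = ET.fromstring(xml_text)
    record = {"title": None, "journal": None, "authors": [], "year": None, "doi": None}
    for el in root.iter():
        name = local(el.tag)
        if name == "title" and el.get("level") == "a":
            # Collapse line breaks and runs of spaces inside the title.
            record["title"] = " ".join("".join(el.itertext()).split())
        elif name == "title" and el.get("level") == "j" and el.get("type") is None:
            # Skip the abbreviated journal title (type="abbrev").
            record["journal"] = (el.text or "").strip()
        elif name == "persName":
            parts = [
                (child.text or "").strip()
                for child in el
                if local(child.tag) in ("forename", "surname")
            ]
            record["authors"].append(" ".join(p for p in parts if p))
        elif name == "date" and el.get("type") == "published":
            record["year"] = el.get("when")
        elif name == "idno" and el.get("type") == "doi":
            # DOIs are case-insensitive, so normalise to lower case.
            record["doi"] = (el.text or "").strip().lower()
    return record
```

Passing the result to `json.dumps` gives the JSON step of the pipeline, and collecting the `doi` values across all records gives the reference list for a thesis.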

How accurate?

10 sample theses, each from a different discipline

Extracted list of references (as DOIs) with Grobid

Compared with a manually generated list
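The comparison can be sketched with set arithmetic. This is an illustration, not the script used for the talk. It assumes the percentages are taken over the union of the extracted and manual lists, which is an inference from the reported figures for each thesis summing to 100%. The function name and the result keys are hypothetical.

```python
def compare_dois(extracted, manual):
    """Return percent correct, false negative, and false positive.

    extracted: DOIs found by the tool; manual: DOIs from the hand-built list.
    Percentages are relative to the union of both lists (an assumption).
    """
    # Normalise: DOIs are case-insensitive, and blank entries are ignored.
    extracted = {d.strip().lower() for d in extracted if d and d.strip()}
    manual = {d.strip().lower() for d in manual if d and d.strip()}
    union = extracted | manual
    counts = {
        "correct": len(extracted & manual),         # in both lists
        "false_negative": len(manual - extracted),  # missed by the tool
        "false_positive": len(extracted - manual),  # found only by the tool
    }
    if not union:
        return {key: 0.0 for key in counts}
    return {key: round(100 * value / len(union), 1) for key, value in counts.items()}
```

With this definition, a thesis where the tool extracts nothing scores 0% correct, 0% false positive, and 100% false negative.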

How accurate?

Best - MA Psychology

Correct: 77%

False Negative: 9%

False Positive: 14%

How accurate?

Average - M.Ed.

Correct: 63%

False Negative: 22%

False Positive: 15%

How accurate?

Worst - M.Sc. Chem.

Correct: 0%

False Negative: 100%

False Positive: 0%

How accurate?

Grobid claims ~80% accuracy

tested on PubMed Central articles

https://grobid.readthedocs.io/en/latest/grobid-04-2015.pdf

What to do?

Extract citation data for Library purposes

Make available in repository to users

Use in broader text mining workflows

Extracting structured data from text
Value-added services for digital repositories
Usage statistics / collection assessment

Questions?

jdingle@brocku.ca

slides.com/jdingle/c4ln2017
