What do you do with 2457 theses?

John Dingle

Brock University

jdingle@brocku.ca

Topics

  • Extracting structured data from text
  • Value-added services for digital repositories
  • Usage statistics / collection assessment

Starting Point

We have all the graduate theses produced at Brock each year, open-access, available via an API.

 

What could we do with them?

Cited references from theses

in machine-readable format

How hard could it be?

PDF

TXT

XML

JSON

DOIs

10.01439/8632452352357

10.13039/501100006895

10.15014/763223432244

Available Tools = 3

  • Needed Functionality
    • Accept .txt or .pdf, return TEI XML
    • Conditional Random Fields used to parse text into sections and identify components, including references
  • Open source, actively maintained
  • Demo site available or easy install

ParsCit

 

Cermine

 

Grobid

 

But which one is the BEST?

Grobid

Grobid

Java, Apache license

 

Run as .jar, accessed via a RESTful API

 

In production at ResearchGate, Mendeley, and the CERN digital library (Invenio)

 

Extracts 55 elements from scholarly documents

- Sections, figures, references

 

Sample request

.pdf

http

Grobid Server

  • public demo
  • local install
  • Docker image

http

.xml

CrossRef
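
The request flow above could be sketched roughly as follows. This is a minimal illustration, not the code used for the talk: it assumes a Grobid server reachable at `http://localhost:8070` (a typical local or Docker setup), uses Grobid's `/api/processReferences` endpoint with the PDF in the `input` form field, and sets `consolidateCitations=1` so that Grobid tries to match each reference against CrossRef. The function names and the file name are hypothetical.

```python
import uuid
import urllib.request

# Assumed address of a local install or Docker container; adjust as needed.
GROBID_URL = "http://localhost:8070"


def build_references_request(pdf_bytes, filename="thesis.pdf", base_url=GROBID_URL):
    """Build (but do not send) a multipart POST for Grobid's processReferences."""
    boundary = uuid.uuid4().hex
    # consolidateCitations=1 asks Grobid to look each reference up in CrossRef.
    fields = {"consolidateCitations": "1"}
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    # The PDF itself goes in the form field named "input".
    file_header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="input"; '
        f'filename="{filename}"\r\nContent-Type: application/pdf\r\n\r\n'
    )
    parts.append(file_header.encode() + pdf_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        f"{base_url}/api/processReferences",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def fetch_references_tei(pdf_path, base_url=GROBID_URL):
    """Send one PDF to a running Grobid server and return the TEI XML as text."""
    with open(pdf_path, "rb") as fh:
        request = build_references_request(fh.read(), base_url=base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a server running, `fetch_references_tei("thesis.pdf")` would return TEI XML containing one `biblStruct` per parsed reference.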

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Domestication by Cappuccino or a Revenge on Urban Space? Control and Empowerment in the Management of Public Spaces</title>
		<author>
			<persName xmlns="http://www.tei-c.org/ns/1.0" coords="">
				<forename type="first">R</forename>
				<surname>Atkinson</surname>
			</persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Urban Studies</title>
		<title level="j" type="abbrev">CURS</title>
		<idno type="ISSN">0042-0980</idno>
		<imprint>
			<biblScope unit="volume">40</biblScope>
			<biblScope unit="issue">9</biblScope>
			<biblScope unit="page" from="1829" to="1843" />
			<date type="published" when="2003" />
		</imprint>
	</monogr>
	<idno type="doi">10.1080/0042098032000106627</idno>
</biblStruct>
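
Pulling the DOIs out of output like this takes only a few lines. The sketch below is one possible approach using Python's standard library, not necessarily what was done for the talk. It matches elements by local name so it works whether or not the TEI namespace is declared on a given element; `extract_dois` and `SAMPLE_TEI` are hypothetical names, and the sample is a trimmed copy of the record above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the record above, used only to exercise the function.
SAMPLE_TEI = """
<biblStruct xml:id="b5">
  <monogr>
    <title level="j">Urban Studies</title>
    <idno type="ISSN">0042-0980</idno>
  </monogr>
  <idno type="doi">10.1080/0042098032000106627</idno>
</biblStruct>
"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def extract_dois(tei_xml):
    """Return every DOI found in <idno type="doi"> elements, lowercased."""
    root = ET.fromstring(tei_xml)
    dois = []
    for element in root.iter():
        is_doi = (
            local_name(element.tag) == "idno"
            and element.get("type", "").lower() == "doi"
        )
        if is_doi and element.text and element.text.strip():
            dois.append(element.text.strip().lower())
    return dois


print(extract_dois(SAMPLE_TEI))  # ['10.1080/0042098032000106627']
```

DOIs are lowercased here because they are case-insensitive, which makes later comparison against a manual list simpler.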

How accurate?

10 sample theses, each from a different discipline

 

Extracted list of references (as DOIs) with Grobid

 

Compared with a manually generated list
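
A comparison like this can be scored with simple set arithmetic. The slides do not say how the percentages were calculated; the sketch below assumes each figure is a share of all DOIs appearing in either list, which is consistent with the reported results adding up to 100%. `compare_dois` is a hypothetical helper.

```python
def compare_dois(extracted, manual):
    """Score an extracted DOI list against a manually compiled one.

    Assumption: percentages are taken over the union of both lists,
    so correct + false_negative + false_positive = 100.
    """
    found = {doi.strip().lower() for doi in extracted}
    truth = {doi.strip().lower() for doi in manual}
    union = found | truth
    if not union:
        return {"correct": 0.0, "false_negative": 0.0, "false_positive": 0.0}
    return {
        # In both lists.
        "correct": 100 * len(found & truth) / len(union),
        # In the manual list but missed by the extraction.
        "false_negative": 100 * len(truth - found) / len(union),
        # Extracted but not in the manual list.
        "false_positive": 100 * len(found - truth) / len(union),
    }
```

For example, if two of three manually listed DOIs are extracted along with one DOI that is not in the manual list, the union has four entries and the result is 50% correct, 25% false negative, and 25% false positive.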

 

How accurate?

Best - MA Psychology

 

Correct: 77%

False Negative: 9%

False Positive: 14%

 

How accurate?

Average - M.Ed.

Correct: 63%

False Negative: 22%

False Positive: 15%

How accurate?

Worst - M.Sc. Chem.

Correct: 0%

False Negative: 100%

False Positive: 0%

How accurate?

Grobid claims ~80% accuracy

tested on PubMed Central articles

https://grobid.readthedocs.io/en/latest/grobid-04-2015.pdf

What to do?

Extract citation data for Library purposes

Make available in repository to users

Use in broader text mining workflows

  • Extracting structured data from text
  • Value-added services for digital repositories
  • Usage statistics / collection assessment

Questions?

jdingle@brocku.ca

slides.com/jdingle/c4ln2017
