OBJECTIVES

  • Use the full text of the journal Éire-Ireland for 1994-2017, recent public-use releases of Library of Congress MARC XML files, and the publicly available HathiTrust Digital Library Hathi Files to supplement our understanding of trends in historiography and scholarship in Irish Studies

  • Who is being cited? Who is shaping the field (broadly defined)?

  • Measures of heterogeneity in citation in particular (are we citing the same people again and again? Is that bad?)

  • Are libraries prone to over-collecting certain scholars?

NON-OBJECTIVES

  • Focus on scholarly investigation of Ireland-as-subject

  • Not a look at which Irish authors were cited or collected (cf. Brian Lavoie and Lorcan Dempsey, "An Exploration of the Irish Presence in the Published Record," OCLC, 2018), at least not for library holdings

  • However, we can get a sense of heterogeneity of literary subjects and authors via citations in Éire-Ireland

Corpus Overview

  • Published by the Irish American Cultural Institute (founded 1962); first issue 1966
  • Moved from St. Paul, MN, to Morristown, NJ, in 1995
  • Special themed issues (c. early 2000s on), with guest editors starting with vol. 36, nos. 1 & 2 (~28 guest editors, 2000-17)

Éire-Ireland

Corpus Overview

  • 24 volumes, 1994-2017 (vols. 29-52)
  • Yields 11,011 pages of main essay texts:
    • 696 Essays/Poems/Translations
    • 7,405 pages with footnotes or works cited
    • 1.9 million words
    • 454 unique authors
  • Online publishers:
    • Project Muse, JHU (1994 - present)
    • EBSCOhost and Gale Cengage (via print subscription publisher; 1998 - present)

Back Archive

  • Total number of unique authors: 454
  • Average number of appearances: 1.19
  • Gender ratio: 282 males (62%) to 172 females (38%)

Corpus Overview

Corpus Prep Workflow

  • Extract embedded text and HTML using the pdftotext and pdftohtml utilities from Xpdf (an open-source PDF suite).

  • Assemble metadata for each issue from its filename, and for each essay from the file contents
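
A minimal sketch of this step, under stated assumptions: the filename pattern (`eire-ireland_v36_n1-2_2001.pdf`) and the helper names are hypothetical, since the slides do not say how the files were named. The `-layout` and `-enc` options are standard pdftotext flags. The pdftohtml call is left out because its arguments differ between distributions.

```python
import re
import subprocess
from pathlib import Path

# Hypothetical filename convention: <journal>_v<volume>_n<issue>_<year>.pdf
FILENAME_RE = re.compile(r"_v(?P<volume>\d+)_n(?P<issue>[\d-]+)_(?P<year>\d{4})\.pdf$")


def pdftotext_command(pdf_path, txt_path):
    """Build the pdftotext call; -layout preserves the physical page layout."""
    return ["pdftotext", "-layout", "-enc", "UTF-8", str(pdf_path), str(txt_path)]


def extract_text(pdf_path):
    """Run pdftotext on one issue PDF (requires pdftotext on the PATH)."""
    txt_path = Path(pdf_path).with_suffix(".txt")
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    return txt_path


def issue_metadata(pdf_path):
    """Read volume, issue, and year out of an issue filename."""
    match = FILENAME_RE.search(Path(pdf_path).name)
    if match is None:
        raise ValueError(f"unrecognized filename: {pdf_path}")
    return {
        "volume": int(match["volume"]),
        "issue": match["issue"],
        "year": int(match["year"]),
    }


print(issue_metadata("eire-ireland_v36_n1-2_2001.pdf"))
```

Essay-level metadata (author, title) would come from the extracted text itself and is not sketched here.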

Text Treatment

Corpus Prep Workflow

  • Scrape page text from the body of the plain-text files (the HTML main text is of inferior quality); pull footnotes from the HTML

  • Marry footnotes to page text; join essay and issue metadata to pages
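
The join described above might look like the following sketch. The record shapes and field names (`essay_id`, `page`, `note`, `issue_id`) are assumptions made for illustration, not the project's actual schema.

```python
def marry_footnotes(pages, footnotes, essay_meta, issue_meta):
    """Attach footnotes and metadata to each page record.

    pages:      list of {"essay_id", "page", "text"} from the plain-text files
    footnotes:  list of {"essay_id", "page", "note", "text"} from the HTML files
    essay_meta: {essay_id: {"author", "title", "issue_id"}}
    issue_meta: {issue_id: {"volume", "year"}}
    """
    # Index footnotes by the (essay, page) they appear on.
    notes_by_page = {}
    for fn in footnotes:
        notes_by_page.setdefault((fn["essay_id"], fn["page"]), []).append(fn)

    joined = []
    for page in pages:
        essay = essay_meta[page["essay_id"]]
        issue = issue_meta[essay["issue_id"]]
        notes = notes_by_page.get((page["essay_id"], page["page"]), [])
        notes = sorted(notes, key=lambda fn: fn["note"])
        # Page fields first, then essay and issue metadata, then the notes.
        joined.append({**page, **essay, **issue, "footnotes": notes})
    return joined
```

Keying on the essay and page pair keeps the page text (from plain text) and the notes (from HTML) aligned even though they come from different extractions.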

Text Treatment

Corpus Prep Workflow

  • The initial preparation workflow struggled to detect where components begin within footnotes: the start of a note was easy to identify, but subsequent citations in the same note were not.

  • It also could not label citation components beyond the first two to three tokens. In the initial 2017 report, only these tokens, where identified as personal names, could be used

Updates, 2017-2019

Corpus Prep Workflow

  • Implemented conditional random fields (CRF) in 2018: created training data in which the components of footnotes were labeled (~20 footnotes, or ~700 components)

  • "Score" unlabeled footnote components based on token features, e.g.:
    • Is it a number? Does it have four digits and start with a viable century?
    • Is it a delimiter (, ; : .)?
    • Is it capitalized?
    • Does it have quotation marks around it?
    • What types of tokens surround it?
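
The questions above translate naturally into a per-token feature function. This is a sketch only: the range of "viable" centuries (1500s to 2000s), the feature names, and the coarse token types are assumptions, and tokens are assumed to be non-empty strings.

```python
import re

DELIMITERS = {",", ";", ":", "."}
QUOTES = "\"'\u201c\u201d\u2018\u2019"
# Assumption: a "viable" year has four digits and starts with 15-20.
YEAR_RE = re.compile(r"(1[5-9]|20)\d{2}")


def token_type(token):
    """Coarse type used to describe a token's neighbours."""
    core = token.strip(QUOTES)
    if token in DELIMITERS:
        return "delimiter"
    if YEAR_RE.fullmatch(core):
        return "year"
    if core.isdigit():
        return "number"
    if core[:1].isupper():
        return "capitalized"
    return "other"


def token_features(tokens, i):
    """Answer the checklist questions for the token at position i."""
    token = tokens[i]
    core = token.strip(QUOTES)
    return {
        "is_number": core.isdigit(),
        "is_year": bool(YEAR_RE.fullmatch(core)),
        "is_delimiter": token in DELIMITERS,
        "is_capitalized": core[:1].isupper(),
        "is_quoted": len(token) > 1 and (token[0] in QUOTES or token[-1] in QUOTES),
        "prev_type": token_type(tokens[i - 1]) if i > 0 else "START",
        "next_type": token_type(tokens[i + 1]) if i + 1 < len(tokens) else "END",
    }


tokens = ["Foster", ",", "\u201cModern", "Ireland\u201d", "1988"]
print(token_features(tokens, 4))
```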

Updates, 2017-2019

Corpus Prep Workflow

  • Use probabilistic modeling (specifically, Python CRFSuite) to assign each microcomponent a label based on its similarity to components of the labeled training data.
  • Results: some improvement:
    • Good at separating journal articles from books
    • Enables us to grab subsequent citations under a single note number
  • But continued problems:
    • Separating and identifying constituents in multi-author works
    • Ignoring non-source components, e.g. direct quotations from sources being confused for journal article titles (needs further training data)
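
As a rough illustration of the labeling step, the sketch below uses the python-crfsuite package (`pycrfsuite`). The label names (AUTHOR, TITLE, and so on), the feature set, the regularization values, and the model filename are all assumptions for illustration, not the project's actual configuration.

```python
def to_sequences(labeled_note):
    """Split [(token, label), ...] into parallel feature and label sequences."""
    tokens = [tok for tok, _ in labeled_note]
    xseq = []
    for i, tok in enumerate(tokens):
        xseq.append({
            "lower": tok.lower(),
            "is_digit": tok.isdigit(),
            "is_capitalized": tok[:1].isupper(),
            "prev": tokens[i - 1].lower() if i > 0 else "<START>",
            "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
        })
    yseq = [label for _, label in labeled_note]
    return xseq, yseq


def train_and_tag(training_notes, unlabeled_tokens, model_path="footnotes.crfsuite"):
    """Train a CRF on hand-labeled notes, then label one unlabeled note."""
    import pycrfsuite  # third-party: pip install python-crfsuite

    trainer = pycrfsuite.Trainer(verbose=False)
    for note in training_notes:
        trainer.append(*to_sequences(note))
    # Illustrative regularization settings, not tuned values.
    trainer.set_params({"c1": 0.1, "c2": 0.01, "max_iterations": 100})
    trainer.train(model_path)

    tagger = pycrfsuite.Tagger()
    tagger.open(model_path)
    xseq, _ = to_sequences([(tok, "") for tok in unlabeled_tokens])
    return tagger.tag(xseq)


# One hand-labeled footnote; the label set is hypothetical.
example_note = [("Foster", "AUTHOR"), (",", "DELIM"), ("Modern", "TITLE"),
                ("Ireland", "TITLE"), ("1988", "YEAR")]
xseq, yseq = to_sequences(example_note)
print(yseq)
```

Because a CRF labels the whole token sequence jointly, a second AUTHOR span later in the same note can mark the start of a subsequent citation, which is what makes the "subsequent citations" result possible.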

Updates, 2017-2019

Recap: 2017 Findings

Newspapers & Periodicals

"First-position," full-citation footnotes, name-like tokens

n = 21,850 footnotes

Name Number
Irish Times 100
Irish Independent 54
Freeman's Journal 50
The Nation 43
United Irishman 32
The Toiler 32

Recap: 2017 Findings

Name Number
Maria Edgeworth 30
Sean O'Casey 27
Ernie O'Malley 23
Eamon de Valera 23
John Mitchel 17
Seamus Heaney 14
James Joyce 14

Primary Source Names

Name Number Average Fn Location
Garret FitzGerald 23 70.0
David Fitzpatrick 13 42.9
Seamus Deane 13 21.23
Tom Garvin 11 35.5
Terence Brown 10 32.4
James Kelly 9 50.3

Scholars: By Number of Full-Citation, Head-of-Note

Recap: 2017 Findings

Name Number Average Fn Location
Edward Said 4 5.75
Joel Mokyr 4 6.0
Joanna Bourke 5 6.6
Michel Foucault 5 7.0
Kerby Miller 4 11.5
Kevin Whelan 6 12.1

Scholars: By Average Placement of First Footnote

Recap: 2017 Findings

Name Number
The Nation 83
Times (Irish?) 42
Cork Examiner 39
United Irishmen 31
The Toiler 31
Belfast News-Letter 20

Newspapers & Periodicals

CRF-Labeled Data

Citation components labeled "authors"

n = 21,850 footnotes; ~9,000 "authors"

Name Number
John Mitchel 51
W B Yeats 41
Jonathan Swift 39
Seamus Heaney 30
James Joyce 26
Brendan Behan 17

Subject-Authors

CRF-Labeled Data

Name Number
Roy Foster 51
Seamus Deane 45
Alvin Jackson 40
Garret FitzGerald 31
David Fitzpatrick 24
FSL Lyons 20

Scholars

CRF-Labeled Data

Library of Congress

  • MARC XML "Open Access" Distribution, "Book Files" (see http://loc.gov/cds/products/marcDist.php)
  • Number of records: 10 million+
  • LoC notes that this is an undercount
  • Extract built on any title field (MARC 245) with "Irish" or "Ireland", n = 18,576
  • Considered only scholar-authors here
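
A sketch of how such an extract could be built with the standard library. It streams a MARC XML file, reads subfields `a` and `b` of field 245, and keeps titles containing "Irish" or "Ireland". The whole-word, case-sensitive match and the choice of subfields are assumptions, and the sample records are invented for the demonstration.

```python
import io
import re
import xml.etree.ElementTree as ET

MARC_NS = "{http://www.loc.gov/MARC21/slim}"
# Assumption: whole-word, case-sensitive match on the two search terms.
TITLE_RE = re.compile(r"\b(Irish|Ireland)\b")


def irish_titles(source):
    """Yield the 245 title of each record whose title matches TITLE_RE."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != MARC_NS + "record":
            continue
        for field in elem.iter(MARC_NS + "datafield"):
            if field.get("tag") != "245":
                continue
            title = " ".join(
                (sf.text or "").strip()
                for sf in field.iter(MARC_NS + "subfield")
                if sf.get("code") in ("a", "b")
            )
            if TITLE_RE.search(title):
                yield title
            break  # field 245 is not repeatable
        elem.clear()  # free memory; the distribution files are large


# Invented sample records, for illustration only.
SAMPLE = b"""<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">A history of Ireland :</subfield>
      <subfield code="b">from earliest times</subfield>
    </datafield>
  </record>
  <record>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">Welsh folk tales</subfield>
    </datafield>
  </record>
</collection>"""

print(list(irish_titles(io.BytesIO(SAMPLE))))
```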

 

Library of Congress

Name Number of Books
Edward MacLysaght 18
Peter Harbison 18
Michael C. O'Laughlin 17
Padraic O'Farrell 15
Morgan Llywelyn 14
Donald Akenson 14

Book Holdings w/ title "Irish" or "Ireland", Scholar-Authors

HathiTrust Digital Library

  • Dominated by holdings of R1 institutions (especially U Michigan and the University of California system)
  • Disproportionate representation of pre-1924 holdings (libraries were initially wary of digitizing in-copyright holdings)
  • Number of records: 16 million+
  • No authors(!) in the public-use Hathi files
  • Extract built on any title field (HT takes from MARC 245) with "Irish" or "Ireland", n = 35,937
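
A comparable sketch for the Hathi files, which are tab-delimited text. The title column index used here (`TITLE_COLUMN = 11`) is an assumption to check against the published Hathi file field list, and the whole-word match is likewise assumed. Counting rows per title gives a volumes-per-title tally of the kind shown on the next slide.

```python
import csv
import re
from collections import Counter

# Assumption: zero-based position of the title field in each row.
TITLE_COLUMN = 11
TITLE_RE = re.compile(r"\b(Irish|Ireland)\b")


def count_irish_titles(lines, title_column=TITLE_COLUMN):
    """Count volumes (rows) per matching title in tab-delimited Hathi rows."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        if len(row) <= title_column:
            continue  # skip short or malformed rows
        title = row[title_column].strip()
        if TITLE_RE.search(title):
            counts[title] += 1
    return counts


# Invented rows: eleven filler fields followed by a title.
rows = [
    "\t".join(["x"] * 11 + ["Leaders of public opinion in Ireland"]),
    "\t".join(["x"] * 11 + ["Leaders of public opinion in Ireland"]),
    "\t".join(["x"] * 11 + ["Welsh folk tales"]),
]
print(count_irish_titles(rows).most_common(5))
```

For a real file, `lines` could be a handle from `gzip.open(path, "rt", encoding="utf-8")`, since the files are distributed compressed.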

 

Name # Volumes
Leaders of Public Opinion in Ireland, W.E.H. Lecky (1903) 12
History of Ireland : from the Anglo-Norman invasion till the union of the country with Great Britain, W.C. Taylor (1833) 8
Outlines of the history of Ireland from the earliest times..., P.W. Joyce (1904) 5
The Course of Irish History, ed. Moody/Martin 5

Book Holdings w/ title "Irish" or "Ireland", by Title

HathiTrust Digital Library

Takeaways

  • Scholar-authors are not as diverse, citation-wise, as library holdings would enable.
  • Citations are dominated by a short list of male historians and literary critics
  • ...but the proportion of footnotes covered by each of these authors is still small (~1.25%)
  • Generalist histories and key edited collections (e.g. Field Day Anthology) predominate
  • Over-reliance on Dublin-based national newspapers
  • Among author-subjects, the usual suspects predominate (Yeats, Swift, Joyce), along with some who overlap with historians' interests (Mitchel)
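
One way to put numbers on citation heterogeneity is sketched below: each author's share of all citations, plus a Herfindahl-style concentration index (the sum of squared shares). The index is offered here as one possible measure, not the one used in this study, and the counts are toy values rather than the corpus figures.

```python
def citation_shares(author_counts, total_citations):
    """Share of all citations that go to each author."""
    return {author: n / total_citations for author, n in author_counts.items()}


def herfindahl(author_counts):
    """Sum of squared shares: 1.0 means one author receives every citation;
    values near 0 mean citations are spread thinly across many authors."""
    total = sum(author_counts.values())
    return sum((n / total) ** 2 for n in author_counts.values())


# Toy counts for illustration only.
toy_counts = {"Author A": 2, "Author B": 2}
print(citation_shares(toy_counts, total_citations=8))
print(herfindahl(toy_counts))
```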

 
