Dr James Cummings
 

Learning How To Fail Better: Resilience in Digital Humanities Projects

James.Cummings@newcastle.ac.uk

@jamescummings

CC+BY

Overview

  • A high-level overview of the TEI
  • Case studies of (excellent) TEI projects:
    • CatCor: Correspondence of Catherine the Great
    • William Godwin’s Diary
    • CURSUS: An Online Resource of Medieval Liturgical Texts
    • Poetic Forms Online: Renaissance to Modern
    • LEAP: Livingstone Online Enhancement and Access Project
    • SRO: Stationers Register Online
  • Building Resilience: Some common sense lessons

Thesis: We can learn from the problems of DH projects to mitigate challenges they face

About the TEI:
A quick high-level refresher

The TEI (The Text Encoding Initiative) is:

  • An international consortium of institutions, projects and individual members
  • A freely available manual containing a set of regularly maintained and updated recommendations: 'The Guidelines', with definitions, examples, and discussion of over 575 markup distinctions
  • A mechanism for producing customized schemas for validating your project's digital texts or metadata
  • A set of free and openly licensed, customizable tools and stylesheets for transformations to many formats (e.g. HTML, Word, PDF, Databases, RDF/LinkedData, Slides, ePub, etc.)
  • A simple consensus-based way of organizing and structuring textual (and other) resources
  • An archival, well-understood, format for long-term preservation of digital data and metadata
  • Whatever you make it! It is a community of users and volunteers
  • The TEI Consortium manages the development of the TEI Guidelines (and associated software) 
  • It is overseen by an elected Executive Board and its outputs developed by an elected Technical Council
  • People don't have to be members to use the standard, but membership gives them a vote in elections
  • The TEI Guidelines are updated about every 6 months

Why do we use italic fonts?

Think about the uses for an italic font in any form of printed publication. Why might an author/publisher put some text into italics? What are they signalling about that text?

We can usually tell these types of things apart from context. If we want to use these categories, computers need to be told these things are different.

Some common uses include:

  • titles
  • emphasis
  • foreign phrases
  • technical terms
  • editorial apparatus, captions, cross references
  • quotations, speaker labels in drama
  • speech and thought 

... and many more
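As a minimal sketch of what "telling the computer" can look like, the fragment below marks up four different reasons for italics with four different TEI elements. The sentence itself is invented for illustration; the element names (<title>, <emph>, <foreign>, <term>) are standard TEI.

```xml
<!-- Invented example sentence: each italicised span gets an element saying WHY it is italic -->
<p>She had been reading <title>Middlemarch</title> and was
  <emph>quite</emph> sure that a certain
  <foreign xml:lang="fr">je ne sais quoi</foreign> was missing from her own
  <term>marginalia</term>.</p>
```

All four spans might be printed in the same italic font, but software can now list every title or every foreign phrase separately.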

About XML

<element> Text </element>

<element attribute="value">      ("Opening Tag")
Text or child elements here
</element>                       ("Closing Tag")

<element attribute="value"/>     ("Empty Element")

Basic TEI Structure
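A minimal sketch of that structure, assuming a very simple prose document: a metadata header (<teiHeader>) followed by the text itself. All titles and descriptions here are placeholders, not taken from any project discussed in this talk.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <!-- Metadata about the electronic file and its source -->
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Placeholder title</title>
      </titleStmt>
      <publicationStmt>
        <p>Placeholder publication information</p>
      </publicationStmt>
      <sourceDesc>
        <p>Placeholder description of the source</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <!-- The encoded text itself -->
  <text>
    <body>
      <p>The transcribed text goes here.</p>
    </body>
  </text>
</TEI>
```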

Usually Encoded as TEI XML

The TEI Guidelines

TEI Customisation

(But I don't need everything the TEI provides, or want something it doesn't give me)

[Diagram: Possibilities of the TEI Framework (labels: Project A, Project B, New Elements)]

  • The ability to interchange many documents improves significantly with a common interchange format
  • Customisation can document the differences in a machine processable format so tools can compare different corpora of texts

- @louburnard
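A TEI customisation is itself written in TEI, as an ODD file. The fragment below is a hedged sketch of what a small customisation might look like: it selects a subset of elements and adds one new element. The schema name "projectA", the <meeting> element, and the example.org namespace are hypothetical and chosen only for illustration.

```xml
<schemaSpec ident="projectA" start="TEI">
  <!-- Infrastructure and header modules taken as they are -->
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <!-- Only the listed elements are included from these modules -->
  <moduleRef key="core" include="p title emph foreign term name date"/>
  <moduleRef key="textstructure" include="TEI text body div"/>
  <!-- A hypothetical new, project-specific element in its own namespace -->
  <elementSpec ident="meeting" ns="http://example.org/ns/projectA" mode="add">
    <desc>Marks a record of a meeting (illustrative only).</desc>
    <classes>
      <memberOf key="model.phrase"/>
    </classes>
    <content>
      <macroRef key="macro.phraseSeq"/>
    </content>
  </elementSpec>
</schemaSpec>
```

Because the differences from the full TEI are recorded in a machine-processable form, two projects' ODD files can be compared to see where their encodings diverge.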

A Mental Exercise We Often Give Students

Thinking about this material, and indeed your own, what do you think are the things you would like to mark up?

  • Make a list of textual phenomena and metadata that are important to capture
  • How likely is it that you can mark these up reliably and consistently?
  • Could any of these potentially be marked up automatically by a cleverly crafted bit of software you had someone build?

Pretend an authoritarian anti-intellectual government has come to power and, through a series of bad decisions, has to slash your project funding by 50%.  What do you do?

  • Do you do half the amount of material in the same depth?
  • Mark up less?
  • Invest in more semi-automatic markup?
  • Something else?

Repeat the exercise.

Interested In Learning More About The TEI?

 

SAVE THE DATE:

  • Monday 30 March - Friday 3 April 2020
  • Second annual "Textual Editing in the Digital Age" Workshop
  • Registration to open in February
  • Very low registration charge to cover travel of visiting speakers
  • Collaboration between IES and ATNU

CatCor: Correspondence of Catherine the Great

http://catcor-dev.oucs.ox.ac.uk

(password protected)

About the CatCor Project

  • Catherine the Great: Empress of Russia 1762 – 1796, prolific letter writer in Russian, French, German, and English
  • University of Oxford project (2013-15) led by Professor Andrew Kahn and Dr Kelsey Rubin-Detlev, internally funded from a research pump-priming fund to create a proof-of-concept site in the hope of securing large research council funding
  • 300 letters edited and translated (of around 5000 possible) in TEI
  • Detailed editorial links from any person / place / work / event to local metadata about these
  • TEI customization, consultation, DOCX to TEI conversion scripts, etc. provided free; web developers charged a low rate to produce the proof-of-concept site, doing most of the work in front-end JavaScript

"The pilot database (still behind a firewall) also provides new annotations on the letters; gathers statistics and data tables, and includes a timeline, that makes it possible to browse and filter letters by people, places, events, and objects mentioned in them; sorts letters and sections of letters by theme to reveal new and unexpected connections between various letters; and permits scholars to search the whole corpus or subsets thereof. [...]  Would-be users and browsers are very welcome to get in touch!"

-- Professor Andrew Kahn (2019)

Sounds Great! But...

CatCor Challenges

  • Although pilot project was successful, it did not receive full AHRC funding
  • Never fully launched: website behind username / password, access given only to friends of the project
  • Website code not available since it was stored in private GitHub repository
  • Web developers moved on to different projects, no active support for the project
  • Lots of bugs which would have been fixed in a full project; no planning for sustainability if not funded
  • Did not use new TEI Correspondence <correspDesc> element because this was just under development at the time. (This would have been corrected in a full project)
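For reference, a minimal sketch of how <correspDesc> records who sent a letter to whom, from where, and when. It sits in the <profileDesc> of the TEI header. The place and date below are placeholders for illustration, not data from the CatCor project.

```xml
<correspDesc>
  <!-- Who sent the letter, from where, and when (placeholder values) -->
  <correspAction type="sent">
    <persName>Catherine II</persName>
    <placeName>St Petersburg</placeName>
    <date when="1770-01-01"/>
  </correspAction>
  <!-- Who received it -->
  <correspAction type="received">
    <persName>Voltaire</persName>
  </correspAction>
</correspDesc>
```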

William Godwin’s Diary

About the Godwin's Diary Project

  • Godwin was a political philosopher and writer, Mary Wollstonecraft’s husband and Mary Shelley’s father
  • University of Oxford project (2007-2010) to create digital edition of William Godwin’s Diary with funding for project from Leverhulme Trust
  • Diaries purchased as part of the Abinger Collection, funded by the National Heritage Memorial Fund and donations
  • 48 years of diaries in 32 octavo notebooks, written in highly abbreviated daily entries
    • People’s names often given as initials
    • little detail of substance of meetings
    • networks of relationships with people, and aggregate lists of information, able to be extracted from richly encoded TEI

How did the project get done?

  • I trained PI, RA, and 2 PhD students in TEI in 1.5 days
    • But had customised the TEI to be about 15 custom elements total
    • These were automatically converted back to 'pure' TEI for display and dissemination
  • Bespoke website built on top of early version of eXist-DB (a native XML Database)
  • Encoders worked in phases adding structural markup, then meetings, then names, etc.
  • Each phase they started with a year they had not seen before, proofreading each other's work
  • As technical consultant I was on hand to help with any and all technical problems
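The conversion of simplified custom elements back to 'pure' TEI can be sketched as a small XSLT stylesheet. This is an illustration of the general technique only, not the project's actual code: the ex: namespace, the <ex:meeting> element, and its mapping to <seg type="meeting"> are all hypothetical.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns:ex="http://example.org/ns/diary"
    exclude-result-prefixes="ex">

  <!-- Identity template: copy everything unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Hypothetical custom element rewritten as a standard TEI element -->
  <xsl:template match="ex:meeting">
    <tei:seg type="meeting">
      <xsl:apply-templates select="@*|node()"/>
    </tei:seg>
  </xsl:template>

</xsl:stylesheet>
```

The design choice is that encoders type short, memorable element names, while the published files use only standard TEI and so remain interchangeable.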

Godwin Diary Project Challenges

  • No direct funding to the Bodleian Library for support, only funding/donations to purchase the Abinger Collection
  • Not adopted into Bodleian virtual machine infrastructure during project development 
  • Single developer (me) who continued to support on best-effort basis after project ended in 2010
  • Developer, PI, Research Associates, etc. all now at other institutions
  • Hosted on old virtual machine infrastructure, software needs occasional restart, potential security problems
  • Did not use IIIF (or related standards) but created bespoke pan/zoom image browser using dated Google Maps API
  • Until November 2019, underlying data and code not in open repository.

CURSUS: An Online Resource of Medieval Liturgical Texts

About the Cursus Project

  • AHRB-funded project (2000-2003) at University of East Anglia to produce resource of medieval liturgical texts and explore XML publication possibilities
  • Principal Investigator Professor David Chadd and Research Assistant Dr James Cummings produced editions of 12 medieval manuscripts
  • Desire of research project to investigate and compare order of antiphons, responds, and prayers in these manuscripts which detail order of service in different places in England
  • Project produced full copy of Corpus Antiphonalium Officii, Vulgate Bible, and other supplementary information 

Cursus Project Challenges

  • 2000-3 – Main Cursus project completed at UEA Music Dept.
  • 2003 – I moved to Oxford, project continues with Richard Lewis taking over technical development for 3 years
  • 2006 – Sadly, in November 2006 the Principal Investigator Professor David Chadd died
  • 2009 – ‘Climategate’ (hacking of emails relating to climate change data) caused UEA to close all off-campus access
  • 2010 – Richard and I unable to access departmental server; it is later replaced without the Cursus project website. Eventually the university department itself is closed.
  • 2016 – After 6 years of negotiation I get confirmation of CC+BY+NC license of data, allowing Richard and me to put it up elsewhere
  • TEI P4 XML data was always safe but (until 2016) not stored in open repository; although declared as 'freely available' on the original site, it had not been explicitly licensed

Poetic Forms Online: Renaissance to Modern

About the Poetic Forms Online Project

  • University of Oxford minimally funded pilot project by Dr Elizabeth Scott-Baumann and Dr Ben Burton
  • Repurposed EEBO-TCP texts converted to TEI P5 XML
  • Produced a browsable, searchable, database of verse focusing on poetic form, especially:
    • rhyme (including rhyme scheme, rhyme words, rhyme type)
    • metre and syllabification 
    • overall genre
  • Starting with Renaissance texts it planned to cover exemplary texts from Renaissance to Modern Day
  • In production view of XML, every line is tagged with detailed information about rhyme and metrical structure enabling a powerful faceted search
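As a hedged sketch of what line-level tagging of rhyme and metre can look like in TEI, the fragment below encodes the opening quatrain of Shakespeare's Sonnet 18 using the standard @rhyme and @met attributes and the <rhyme> element. It is not the project's actual encoding, and the metrical notation (- for unstressed, + for stressed, | for foot boundaries) is an assumed project-defined convention.

```xml
<!-- rhyme: scheme for the group; met: assumed notation for iambic pentameter -->
<lg type="quatrain" rhyme="abab" met="-+|-+|-+|-+|-+">
  <l>Shall I compare thee to a summer's <rhyme label="a">day</rhyme>?</l>
  <l>Thou art more lovely and more <rhyme label="b">temperate</rhyme>:</l>
  <l>Rough winds do shake the darling buds of <rhyme label="a">May</rhyme>,</l>
  <l>And summer's lease hath all too short a <rhyme label="b">date</rhyme>:</l>
</lg>
```

With markup of this kind, a faceted search can filter by rhyme scheme, list rhyme words, or find lines whose metre departs from that of the group.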

EEBO-TCP Text

"Automatic" Up-Converted TEI P5

Poetic Forms Online Challenges

  • Proof-of-concept internal pump-priming funding meant limited time/resources/support
  • One of the researchers departed to another institution, the other departed from higher education
  • Although developers offered to move hosting, no further work has been done on it, so it contains only two texts: Shakespeare's Sonnets and Venus and Adonis.
  • Given limited funding, team used Drupal for frontend presentation rather than an XML-based solution
  • Data not stored in public repository, but private repository owned by individuals in the institution (now departed)
  • Use of Drupal Feeds module for reading XML files meant redundant generation of large files duplicating all possible information for every single line

LEAP: Livingstone Online Enhancement and Access Project

About the LEAP Project

  • Award-winning project (2013-2017), led by Dr Adrian Wisnicki (UNL), to:
    • re-develop the Livingstone Online website,
    • update all underlying materials to TEI P5 XML under a single TEI customization,
    • produce critical edition of David Livingstone’s final manuscripts (1865-73), including multi-witness texts,
    • create detailed project documentation, including full TEI P5 ODD customization, information about funding, and project difficulties and lessons learned, and
    • carry out multi-spectral imaging of difficult-to-read texts
  • All materials released openly: not just a digital edition, but an archive of all related material including project materials and reports

LEAP Project Challenges

  • Planned alpha launch (March 2015) plagued with problems (UCLA developers had difficulties implementing their chosen solution of Islandora in conjunction with a Fedora backend)
  • Other project partners did additional work before beta launch; development proceeded in halting fashion, with lots of missed deadlines, and failed to meet expectations of the agreed specification
  • After beta launch LEAP team made the hard decision to ask UCLA to leave the project, and negotiated their departure over the end of 2015
  • LEAP reached agreement for hosting with MITH (Maryland Institute for Technology in the Humanities) at University of Maryland and additional developers
  • Transfer of project in 2016 only possible because of detailed documentation of materials, project specifications, and project reports mentioning these problems

SRO: Stationers Register Online

About the SRO Project

  • A project from University of Oxford and Bath Spa University led by Giles Bergel and Ian Gadd to transcribe the first three Stationers' Registers (1557 – 1620)
  • These are an invaluable source for English book history and central to the development of copyright.
  • They record the right to print from 1557 until the modern day
  • Minimal funding to create the underlying data in phase 1 (2013) meant the keying company introduced many inconsistencies in creating the TEI files
  • A phase 2 project (2016) sought major AHRC funding but was unsuccessful; it scraped together minimal funding from CREATe: the RCUK Centre for Copyright and New Business Models in the Creative Economy
  • CREATe also provided in-kind contributions of a developer responsible for creating a new website. 

SRO Project Challenges

  • SRO has had a number of problems, mostly relating to under-funding. Having only a minimal budget meant it could not pay for proper quality control in Phase 1
  • The Phase 2 project only employed editors & proofreaders for short period to correct this
  • Web development was provided by CREATe (UK Copyright and Creative Economy Centre, based at the University of Glasgow) as an in-kind contribution by a PhD student
  • Unfortunately they did not have experience of the requested technology (eXist-db XML Database) and so did not fully exploit its potential; thus the JavaScript for faceted browsing is painfully slow to use.
  • The developer has now completed their PhD and so is no longer working for CREATe
  • Website eventually launched a couple of years late

Learning to Fail Better

All of old. Nothing else ever. Ever tried. Ever failed. No matter. Try again. Fail again. Fail better.

First the body. No. First the place. No. First both. Now either. Now the other. Sick of the either try the other. Sick of it back sick of the either. So on. Somehow on. Till sick of both. Throw up and go. Where neither. Till sick of there. Throw up and back. The body again. Where none. The place again. Where none. Try again. Fail again. Better again. Or better worse. Fail worse again. Still worse again. Till sick for good. Throw up for good. Go for good. Where neither for good. Good and all.

Samuel Beckett -- Worstward Ho

Building Resilience: Problems

  • CatCor: Closed development, minimal pilot funding, no sustainability plan, limitations of technology choices, years later still closed website
  • William Godwin’s Diary: Lack of integration of support by institution, lack of sustainability funding, closed development, departure of staff
  • CURSUS: Death of PI then server, Climategate, lack of clear licensing or institutional support
  • Poetic Forms Online: Closed development, minimal pilot funding, no sustainability plan, limitations of tech (Drupal) choices, departure of staff
  • LEAP: Multi-institutional project difficulties, saved by in-depth documentation and specifications
  • SRO: Minimal pilot funding for phase 1, almost no funding for phase 2, in-kind contributions dictated technology choices, departure of staff

Building Resilience: Documentation

  • Project Documentation: create detailed internal project documentation and share this openly including documenting all project working practices and assumptions, desired technical specifications, agendas/minutes of meetings
  • Technical Documentation: Document use of international standards and variation from them (cf. TEI ODD), technical frameworks, software dependencies
  • MoU: Always have memorandum of understanding with institutions and other partners (such as developers) with clear milestones and responsibilities on both sides
  • Plan for failure: Make open records of worst-case scenario planning and ensure all partner institutions understand them (i.e. the institution understands them, not just the partner representative)

Building Resilience: Work In The Light

Projects tend to hide away their work, not wanting to show work-in-progress until it is finished. It is better in the long run if they work in the light: working openly and making as many internal project materials as possible available to the wider community. Where feasible, the minimal requirements are:

  • Always give access to the underlying data 
  • Pre-license all outputs with open licenses
  • Provide latest data and website code in open repositories 
  • Work in inter-institutional collaborative manner, not relying on one institution’s policies but joint agreements
  • Give the community a way to provide feedback, suggest improvements, or make derivative works of the data
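One way to pre-license outputs is to state the licence inside each TEI file's header, so that it travels with the data. The fragment below is a minimal sketch using the standard <availability> and <licence> elements; the publisher name is a placeholder and the choice of CC BY 4.0 is only an example.

```xml
<publicationStmt>
  <publisher>Placeholder project name</publisher>
  <!-- Explicit, machine-readable licence statement -->
  <availability status="free">
    <licence target="https://creativecommons.org/licenses/by/4.0/">
      Distributed under a Creative Commons Attribution 4.0 International licence.
    </licence>
  </availability>
</publicationStmt>
```

An explicit statement like this avoids the situation where data is described as 'freely available' but nobody can later confirm what reuse is permitted.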

Building Resilience: Infrastructure

  • Standards: Ensure technical decisions use appropriate open international standards and community-supported open source software and APIs
  • Workarounds: Do not implement quick workarounds (or if you must, document them in detail)  
  • Many eyes: Have multiple technical partners overseeing / validating each other's work (through review of pull-requests, regular reporting)
  • Infrastructure: Integrate into institutional (or multi-institutional) infrastructural support so servers will go on running, and be updated, for many years
  • Institution: Ensure the institution understands its commitments to maintain outputs for X years
  • Partners: If feasible, research software engineers should be partners in project, contributing best practice, not just solution providers

But, so many other possible lessons and ways to build resilience...

Happy to answer questions now!

Or later:
james.cummings@newcastle.ac.uk
tw: @jamescummings
http://slides.com/jamescummings/cdcs2020

Learning How To Fail Better: Resilience in Digital Humanities Projects; A talk given on 17 January 2020 at the Centre for Data, Culture, and Society at University of Edinburgh.