Dr James Cummings

http://slides.com/jamescummings/cdcs2020

Learning How To Fail Better: Resilience in Digital Humanities Projects

James.Cummings@newcastle.ac.uk

@jamescummings

CC+BY

Overview

A high-level overview of the TEI
Case studies of (excellent) TEI projects:
- CatCor: Correspondence of Catherine the Great
- William Godwin’s Diary
- CURSUS: An Online Resource of Medieval Liturgical Texts
- Poetic Forms Online: Renaissance to Modern
- LEAP: Livingstone Online Enhancement and Access Project
- SRO: Stationers Register Online
Building Resilience: Some common sense lessons

Thesis: We can learn from the problems of DH projects to mitigate challenges they face

About the TEI:
A quick high-level refresher

The TEI (The Text Encoding Initiative) is:

An international consortium of institutions, projects and individual members
A freely available manual: a set of regularly maintained and updated recommendations, 'The Guidelines', with definitions, examples, and discussion of over 575 markup distinctions
A mechanism for producing customized schemas for validating your project's digital texts or metadata
A set of free and openly licensed, customizable tools and stylesheets for transformations to many formats (e.g. HTML, Word, PDF, Databases, RDF/LinkedData, Slides, ePub, etc.)
A simple consensus-based way of organizing and structuring textual (and other) resources
An archival, well-understood, format for long-term preservation of digital data and metadata
Whatever you make it! It is a community of users and volunteers

The TEI Consortium manages the development of the TEI Guidelines (and associated software)
It is overseen by an elected Executive Board and its outputs developed by an elected Technical Council
People don't have to be members to use the standard, but membership gives them a vote in elections
The TEI Guidelines are updated about every 6 months

Why do we use italic fonts?

Think about the uses for an italic font in any form of printed publication. Why might an author/publisher put some text into italics? What are they signalling about that text?

We can usually tell these types of things apart from context. If we want to use these categories, computers need to be told these things are different.

Some common uses include:

titles
emphasis
foreign phrases
technical terms
editorial apparatus, captions, cross references
quotations, speaker labels in drama
speech and thought

... and many more
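As a rough illustration of how an encoder can tell the computer these things are different, the fragment below marks up several kinds of italic text with different TEI elements. The sentences are invented for this example, and the choice of elements is one reasonable reading rather than the only possible encoding.

```xml
<p xmlns="http://www.tei-c.org/ns/1.0">
  <!-- a title -->
  She was reading <title>Caleb Williams</title> that evening.
  <!-- emphasis -->
  I <emph>never</emph> said that.
  <!-- a foreign phrase, with its language recorded -->
  It had a certain <foreign xml:lang="fr">je ne sais quoi</foreign>.
  <!-- a technical term -->
  This is called a <term>customization</term>.
  <!-- italic text whose purpose the encoder has not determined -->
  The sign read <hi rend="italic">closed</hi>.
</p>
```

All five would look the same in print, but a search for titles or foreign phrases can now separate them.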

About XML

<element attribute="value">
Text or child elements here
</element>

"Opening Tag"

"Closing Tag"

"Empty Element"

Basic TEI Structure

Usually Encoded as TEI XML
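A minimal sketch of that basic structure: a metadata header followed by the text itself. The title and descriptions are hypothetical placeholders; the element names and namespace are those of TEI P5.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <!-- metadata about the document -->
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A hypothetical sample document</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished example.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Born digital; no source.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <!-- the document content -->
  <text>
    <body>
      <!-- pb is an empty element: it marks a page beginning and has no content -->
      <pb n="1"/>
      <p>The text itself goes here.</p>
    </body>
  </text>
</TEI>
```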

http://www.tei-c.org/release/doc/tei-p5-doc/en/html/

The TEI Guidelines

TEI Customisation

(But I don't need everything the TEI provides, or want something it doesn't give me)

[Figure: Possibilities of the TEI Framework, showing Project A, Project B, and New Elements]

http://romabeta.tei-c.org/

The ability to interchange many documents improves significantly with a common interchange format
Customisation can document the differences in a machine processable format so tools can compare different corpora of texts

- @louburnard
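A customisation is itself written in TEI, in a format called ODD. The fragment below is a minimal sketch of what one project's schema specification might look like; the identifier `projectA` and the particular selection of elements are assumptions made for illustration, and the fragment would sit inside the body of a full TEI ODD document.

```xml
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0" ident="projectA" start="TEI">
  <!-- the infrastructure, header, and text structure modules -->
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="textstructure"/>
  <!-- take only a handful of elements from the core module -->
  <moduleRef key="core" include="p hi title name date"/>
  <!-- take the names and dates module, dropping elements the project does not use -->
  <moduleRef key="namesdates" except="climate terrain"/>
</schemaSpec>
```

Because the selection is recorded in a machine-processable form, a tool can compare two such files to see where Project A and Project B differ.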

A Mental Exercise We Often Give Students

Thinking about this material, and indeed your own, what do you think are the things you would like to mark up?

Make a list of textual phenomena and metadata that are important to capture
How likely is it that you can mark these up reliably and consistently?
Could any of these potentially be marked up automatically by a cleverly crafted bit of software you had someone build?

Pretend an authoritarian anti-intellectual government has come to power and, through a series of bad decisions, has to slash your project funding by 50%. What do you do?

Do you do half the amount of material in the same depth?
Mark up less?
Invest in more semi-automatic markup?
Something else?

Repeat the exercise.

Interested In Learning More About The TEI?

SAVE THE DATE:

Monday 30 March - Friday 3 April 2020
Second annual "Textual Editing in the Digital Age" Workshop
Registration to open in February
Very low registration charge to cover travel of visiting speakers
Collaboration between IES and ATNU

CatCor: Correspondence of Catherine the Great

http://catcor-dev.oucs.ox.ac.uk

(password protected)

About the CatCor Project

Catherine the Great: Empress of Russia 1762 – 1796, prolific letter writer in Russian, French, German, and English
University of Oxford project by Professor Andrew Kahn and Dr Kelsey Rubin-Detlev (2013-15), internally funded from a research pump-priming fund to create a proof-of-concept site, in the hope of attracting large research council funding
300 letters edited and translated (of around 5000 possible) in TEI
Detailed editorial links from any person / place / work / event to local metadata about these
TEI customization, consultation, DOCX to TEI conversion scripts, etc. provided free; web developers charged a low rate to produce the proof-of-concept site, doing most of the work in front-end JavaScript

"The pilot database (still behind a firewall) also provides new annotations on the letters; gathers statistics and data tables, and includes a timeline, that makes it possible to browse and filter letters by people, places, events, and objects mentioned in them; sorts letters and sections of letters by theme to reveal new and unexpected connections between various letters; and permits scholars to search the whole corpus or subsets thereof. [...] Would-be users and browsers are very welcome to get in touch!"

-- Professor Andrew Kahn (2019)

Sounds Great! But...

CatCor Challenges

Although pilot project was successful, it did not receive full AHRC funding
Never fully launched, Website behind username / password, access only given to friends of project
Website code not available since it was stored in private GitHub repository
Web developers moved on to different projects, no active support for the project
Lots of bugs which would have been fixed in a full project; No planning for sustainability if not funded
Did not use new TEI Correspondence <correspDesc> element because this was just under development at the time. (This would have been corrected in a full project)
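For reference, a minimal sketch of what correspondence metadata in `<correspDesc>` can look like. This is not the project's encoding: the place, date, and recipient are bracketed placeholders, and the `when` value is a dummy date, not taken from a real letter.

```xml
<profileDesc xmlns="http://www.tei-c.org/ns/1.0">
  <correspDesc>
    <!-- who sent the letter, from where, and when -->
    <correspAction type="sent">
      <persName>Catherine the Great</persName>
      <!-- placeholder values, not from a real letter -->
      <placeName>[place of writing]</placeName>
      <date when="1770-01-01">[date of writing]</date>
    </correspAction>
    <!-- who received it -->
    <correspAction type="received">
      <persName>[recipient]</persName>
    </correspAction>
  </correspDesc>
</profileDesc>
```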

William Godwin’s Diary

http://godwindiary.bodleian.ox.ac.uk

About the Godwin's Diary Project

Godwin was a political philosopher and writer, Mary Wollstonecraft’s husband and Mary Shelley’s father
University of Oxford project (2007-2010) to create digital edition of William Godwin’s Diary with funding for project from Leverhulme Trust
Diaries purchased with Abinger Collection based on National Heritage Memorial Fund and donations
48 years of diaries in 32 octavo notebooks, written in highly abbreviated daily entries
- People’s names often given as initials
- little detail of substance of meetings
- networks of relationships with people, and aggregate lists of information, able to be extracted from richly encoded TEI

How did the project get done?

I trained PI, RA, and 2 PhD students in TEI in 1.5 days
- But had customised the TEI to be about 15 custom elements total
- These were automatically converted back to 'pure' TEI for display and dissemination
Bespoke website built on top of early version of eXist-DB (a native XML Database)
Encoders worked in phases adding structural markup, then meetings, then names, etc.
Each phase they started with a year they had not seen before, proofreading each other's work
As technical consultant I was on hand to help with any and all technical problems

Godwin Diary Project Challenges

No funding direct to Bodleian library to support, only funding/donations to purchase Abinger Collection
Not adopted into Bodleian virtual machine infrastructure during project development
Single developer (me) who continued to support on best-effort basis after project ended in 2010
Developer, PI, Research Associates, etc. all now at other institutions
Hosted on old virtual machine infrastructure, software needs occasional restart, potential security problems
Did not use IIIF (or related standards) but created bespoke pan/zoom image browser using dated Google Maps API
Until November 2019, underlying data and code not in open repository.

CURSUS:
An Online Resource of Medieval Liturgical Texts

Original URL: http://www.cursus.uea.ac.uk/
Working URL: http://www.cursus.org.uk/

About the Cursus Project

AHRB-funded project (2000-2003) at University of East Anglia to produce resource of medieval liturgical texts and explore XML publication possibilities
Principal Investigator Professor David Chadd and Research Assistant Dr James Cummings produced editions of 12 medieval manuscripts
Desire of research project to investigate and compare order of antiphons, responds, and prayers in these manuscripts which detail order of service in different places in England
Project produced full copy of Corpus Antiphonalium Officii, Vulgate Bible, and other supplementary information

Cursus Project Challenges

2000-3 – Main Cursus project completed at UEA Music Dept.
2003 – I moved to Oxford, project continues with Richard Lewis taking over technical development for 3 years
2006 – Sadly, in November 2006 the Principal Investigator Professor David Chadd died
2009 – ‘Climategate’ (hacking of emails relating to climate change data) caused UEA to close all off-campus access
2010 – Richard and I unable to access departmental server; it is later replaced without Cursus project website. Eventually university department is removed.
2016 – After 6 years of negotiation I get confirmation of CC+BY+NC license of data, allowing Richard and me to put it up elsewhere
TEI P4 XML data was always safe but (until 2016) not stored in open repository; although declared as 'freely available' on original site, it had not been explicitly licensed

Poetic Forms Online:
Renaissance to Modern

http://www.poeticformsonline.org/

About the Poetic Forms Online Project

University of Oxford minimally funded pilot project by Dr Elizabeth Scott-Baumann and Dr Ben Burton
Repurposed EEBO-TCP texts converted to TEI P5 XML
Produced a browsable, searchable, database of verse focusing on poetic form, especially:
- rhyme (including rhyme scheme, rhyme words, rhyme type)
- metre and syllabification
- overall genre
Starting with Renaissance texts it planned to cover exemplary texts from Renaissance to Modern Day
In production view of XML, every line is tagged with detailed information about rhyme and metrical structure enabling a powerful faceted search
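A minimal sketch of line-level rhyme markup in TEI P5, using the opening quatrain of Shakespeare's Sonnet 18. This illustrates the general technique with standard TEI verse elements and is an assumption about the style of encoding, not a copy of the project's own files; metre could be recorded in a similar way with the `met` attribute.

```xml
<lg xmlns="http://www.tei-c.org/ns/1.0" type="quatrain" rhyme="abab">
  <!-- each rhyme word is tagged and labelled with its place in the scheme -->
  <l n="1">Shall I compare thee to a summer's <rhyme label="a">day</rhyme>?</l>
  <l n="2">Thou art more lovely and more <rhyme label="b">temperate</rhyme>:</l>
  <l n="3">Rough winds do shake the darling buds of <rhyme label="a">May</rhyme>,</l>
  <l n="4">And summer's lease hath all too short a <rhyme label="b">date</rhyme>:</l>
</lg>
```

With markup like this, a faceted search can filter by rhyme scheme or list every word that rhymes with a given word.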

EEBO-TCP Text

"Automatic" Up-Converted TEI P5

Poetic Forms Online Challenges

Proof-of-concept internal pump-priming funding meant limited time/resources/support
One of the researchers departed to another institution, the other departed from higher education
Although developers offered to move hosting, no further work has been done on it, so it contains only two texts: Shakespeare's Sonnets and Venus and Adonis.
Given limited funding, team used Drupal for frontend presentation rather than an XML-based solution
Data not stored in public repository, but private repository owned by individuals in the institution (now departed)
Use of Drupal Feeds module for reading XML files meant redundant generation of large files duplicating all possible information for every single line

LEAP: Livingstone Online Enhancement and Access Project

http://www.livingstoneonline.org/

About the LEAP Project

Award-winning project (2013-2017), led by Dr Adrian Wisnicki (UNL), to:
- re-develop the Livingstone Online website,
- update all underlying materials to TEI P5 XML under a single TEI customization,
- produce critical edition of David Livingstone’s final manuscripts (1865-73), including multi-witness texts,
- create detailed project documentation, including full TEI P5 ODD customization, information about funding, and project difficulties and lessons learned, and
- carry out multi-spectral imaging of difficult-to-read texts
All materials released openly: more than just a digital edition, an archive of all related material including project materials and reports

LEAP Project Challenges

Planned alpha launch (March 2015) plagued with problems (UCLA developers' difficulties in implementing their chosen solution of Islandora in conjunction with a Fedora backend)
Other project partners did additional work before beta launch, development proceeded in halting fashion, lots of missed deadlines, failed to meet expectations of agreed specification
After beta launch LEAP team made hard decision to ask UCLA to leave the project, negotiated departure over end of 2015
LEAP reached agreement for hosting with MITH (Maryland Institute for Technology in the Humanities) at University of Maryland and additional developers
Transfer of project in 2016 only possible because of detailed documentation of materials, project specifications, and project reports mentioning these problems

SRO: Stationers Register Online

http://stationersregister.online

About the SRO Project

A project from University of Oxford and Bath Spa University led by Giles Bergel and Ian Gadd to transcribe first three stationers registers (1557 – 1620)
These are an invaluable source for English book history and central to the development of copyright.
They record the right to print from 1557 until modern day
Minimal funding to create the underlying data in phase 1 (2013) meant keying company introduced many inconsistencies in creating the TEI files
A phase 2 project (2016) sought major AHRC funding but was unsuccessful; it scraped together minimal funding from CREATe: the RCUK Centre for Copyright and New Business Models in the Creative Economy
CREATe also provided in-kind contributions of a developer responsible for creating a new website.

SRO Project Challenges

SRO has had a number of problems, mostly relating to under-funding. Only having a minimal budget meant it could not pay for proper quality control in Phase 1
The Phase 2 project only employed editors & proofreaders for short period to correct this
Web development was provided by CREATe (UK Copyright and Creative Economy Centre, based at the University of Glasgow) as an in-kind contribution by a PhD student
Unfortunately they did not have experience of requested technology (eXist-db XML Database) and so did not fully exploit its potential; thus JavaScript for faceted browsing is painfully slow to use.
The developer has now completed their PhD and so is no longer working for CREATe
Website eventually launched a couple of years late

Learning to Fail Better

All of old. Nothing else ever. Ever tried. Ever failed. No matter. Try again. Fail again. Fail better.

First the body. No. First the place. No. First both. Now either. Now the other. Sick of the either try the other. Sick of it back sick of the either. So on. Somehow on. Till sick of both. Throw up and go. Where neither. Till sick of there. Throw up and back. The body again. Where none. The place again. Where none. Try again. Fail again. Better again. Or better worse. Fail worse again. Still worse again. Till sick for good. Throw up for good. Go for good. Where neither for good. Good and all.

Samuel Beckett -- Worstward Ho

Building Resilience: Problems

CatCor: Closed development, minimal pilot funding, no sustainability plan, limitations of technology choices, years later still closed website
William Godwin’s Diary: Lack of integration of support by institution, lack of sustainability funding, closed development, departure of staff
CURSUS: Death of PI then server, Climategate, lack of clear licensing or institutional support
Poetic Forms Online: Closed development, minimal pilot funding, no sustainability plan, limitations of tech (Drupal) choices, departure of staff
LEAP: Multi-institutional project difficulties, saved by in-depth documentation and specifications
SRO: Minimal pilot funding for phase 1, almost no funding for phase 2, in-kind contributions dictated technology choices, departure of staff

Building Resilience: Documentation

Project Documentation: create detailed internal project documentation and share this openly including documenting all project working practices and assumptions, desired technical specifications, agendas/minutes of meetings
Technical Documentation: Document use of international standards and variation from them (cf. TEI ODD), technical frameworks, software dependencies
MoU: Always have memorandum of understanding with institutions and other partners (such as developers) with clear milestones and responsibilities on both sides
Plan for failure: Make open records of worst-case scenario planning and ensure all partner institutions understand them (i.e. the institution understands them, not just the partner representative)

Building Resilience: Work In The Light

Projects tend to hide away their work, not wanting to show work-in-progress until it is finished. It is better in the long run if they work in the light: working openly and making as many internal project materials as possible available to the greater community. Where feasible, the minimal requirements are:

Always give access to the underlying data
Pre-license all outputs with open licenses
Provide latest data and website code in open repositories
Work in inter-institutional collaborative manner, not relying on one institution’s policies but joint agreements
Give a method for community to provide feedback, improvements, or make derivative works of the data

Building Resilience: Infrastructure

Standards: Ensure technical decisions use appropriate open international standards and community-supported open source software and APIs
Workarounds: Do not implement quick workarounds (or if you must, document them in detail)
Many eyes: Have multiple technical partners overseeing / validating each other's work (through review of pull-requests, regular reporting)
Infrastructure: Integrate into institutional (or multi-institutional) infrastructural support so servers will go on running, and be updated, for many years
Institution: Ensure the institution understands its commitments to maintain outputs for X years
Partners: If feasible, research software engineers should be partners in project, contributing best practice, not just solution providers

But, so many other possible lessons and ways to build resilience...

Happy to answer questions now!

Or later:
james.cummings@newcastle.ac.uk
tw: @jamescummings
http://slides.com/jamescummings/cdcs2020
