The Text Encoding Initiative and best practices for encoding text in digital documentary editions

Dr James Cummings

@jamescummings



http://slides.com/jamescummings/icedd2017


What is the TEI?

  • The 'TEI' is often used to refer to the Guidelines of the Text Encoding Initiative:
    • 23 chapters of in-depth recommendations for encoding textual phenomena,
    • from all sorts of text,
    • of all historical periods,
    • in all languages,
    • and all writing systems
  • This freely available manual, 'The Guidelines', is regularly maintained and updated, and provides definitions, examples, and discussion of over 560 markup distinctions

But the TEI is also:

  • An international consortium of institutions, projects and individual members
  • A community of users and volunteers
  • A mechanism for producing customized schemas for validating your project's digital texts
  • A set of free and openly licensed, customizable tools and stylesheets for transformations to/from many formats (e.g. HTML, Word, PDF, Databases, RDF/LinkedData, Slides, ePub, etc.)
  • A simple consensus-based way of organizing and structuring textual (and other) resources
  • An archival, well-understood, format for long-term preservation of digital data and metadata
  • Whatever you make it! (It is a community-driven standard)

Some TEI History

  • International membership consortium re-established 2000 (see http://www.tei-c.org/)
  • The TEI Guidelines were originally envisaged as a single (large) reference manual but are now used as a modular resource
  • The Guidelines embody a broad consensus about the significant particularities of a huge range of textual materials, expressed both in prose and by means of formal definitions expressible as technical grammars or schemas:
    • TEI P1-P3 (1991-1999) : SGML DTD
    • TEI P4 (2000) : SGML or XML DTD
    • TEI P5 (2007-) : XML DTD, W3C Schema, or RelaxNG
  • Since TEI P5 Version 1.0.0 we've moved to more frequent updates (every 6 months or so)

The TEI Guidelines

1. The TEI Infrastructure
2. The TEI Header
3. Elements Available in All TEI Documents
4. Default Text Structure
5. Characters, Glyphs, and Writing Modes
6. Verse
7. Performance Texts
8. Transcriptions of Speech
9. Dictionaries
10. Manuscript Description
11. Representation of Primary Sources
12. Critical Apparatus
13. Names, Dates, People, and Places
14. Tables, Formulæ, Graphics and Notated Music
15. Language Corpora
16. Linking, Segmentation, and Alignment
17. Simple Analytic Mechanisms
18. Feature Structures
19. Graphs, Networks, and Trees
20. Non-hierarchical Structures
21. Certainty, Precision, and Responsibility
22. Documentation Elements
23. Using the TEI

But what does it look like?
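
As a rough illustration, a very small TEI document can be read with Python's standard library. The sample text, title and date below are invented for this sketch; only the element names and the TEI namespace come from the Guidelines.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample: every TEI document pairs a metadata header with the text.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A sample dispatch</title></titleStmt>
      <publicationStmt><p>Unpublished example.</p></publicationStmt>
      <sourceDesc><p>Invented for illustration.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>The envoy arrived on <date when="1817-04-18">18 April</date>.</p>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(SAMPLE)

# The title lives in the header; the dates are marked up inside the text.
title = root.find("tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", TEI_NS).text
dates = [d.get("when") for d in root.iterfind(".//tei:date", TEI_NS)]

print(title, dates)
```

The header carries the metadata, while the body carries the transcription, with machine-readable values (here the date) held in attributes.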

Names and Pointing

  • The names of people in documents appear in many forms; in the TEI we mark up each name and point to more information about the named entity
  • This helps disambiguate as well: is 'Nancy' a woman or a place name in France? 
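
A minimal sketch of this pointing mechanism, assuming a fragment in which the entity records sit in the same file as the text. The identifiers `nancy_smith` and `nancy_fr` and the sentence itself are invented; a real project might keep its person and place lists in a separate file.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented fragment: two mentions of 'Nancy', each pointing at a different entity.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <listPerson>
    <person xml:id="nancy_smith"><persName>Nancy Smith</persName></person>
  </listPerson>
  <listPlace>
    <place xml:id="nancy_fr"><placeName>Nancy</placeName><country>France</country></place>
  </listPlace>
  <p><persName ref="#nancy_smith">Nancy</persName> wrote from
     <placeName ref="#nancy_fr">Nancy</placeName>.</p>
</body>"""

root = ET.fromstring(SAMPLE)

# Index every element that carries an xml:id so that pointers can be resolved.
by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}

def resolve(name_el):
    """Return the name as written and the kind of entity it points to."""
    target = by_id[name_el.get("ref").lstrip("#")]
    return name_el.text, target.tag.split("}")[1]

mentions = [resolve(el) for el in root.iterfind(".//tei:p/*[@ref]", NS)]
print(mentions)
```

Both mentions read 'Nancy' in the text, but following each `ref` pointer shows that one is a person and the other a place.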

Simple Editorial Changes

  • There are many elements for simple editorial changes (such as abbreviations, expansions, corrections, regularisation) and for transcription of primary sources
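
One way to see why this matters: when the source reading and the editorial intervention are both recorded, typically paired inside a `<choice>` element, a single file can yield either a diplomatic or a regularised text. The sentence below is invented, and the `render` function is only a sketch of the idea, not part of any TEI tool.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented sentence with an abbreviation, an error and an old spelling.
SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">'
    "<choice><abbr>Mr</abbr><expan>Mister</expan></choice> Jones "
    "<choice><sic>recieved</sic><corr>received</corr></choice> the "
    "<choice><orig>publick</orig><reg>public</reg></choice> letter.</p>"
)

# Which children of <choice> to drop for each kind of output.
DROP = {
    "diplomatic": {"expan", "corr", "reg"},  # keep what the source has
    "reading": {"abbr", "sic", "orig"},      # keep the editor's version
}

def render(xml, view):
    """Return the plain text of the paragraph for the requested view."""
    root = ET.fromstring(xml)
    for choice in list(root.iter(TEI + "choice")):
        for child in list(choice):
            if child.tag[len(TEI):] in DROP[view]:
                choice.remove(child)
    return "".join(root.itertext())

print(render(SAMPLE, "diplomatic"))  # Mr Jones recieved the publick letter.
print(render(SAMPLE, "reading"))     # Mister Jones received the public letter.
```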

Customising the TEI

  • The TEI is different from other standards in that you are able to modify it for your own encoding project
  • You can constrain it to be tighter and give encoders fewer choices, or extend it to deal with areas the TEI hasn't dealt with yet.
  • There is a meta-schema vocabulary of the TEI that enables projects to describe the changes they want to make in a way the TEI processing tools can understand
  • With the TEI a project describes what they want the schema to be and (in a literate programming way) that description is processed into formal schema languages
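
A customisation of this kind is written in TEI ODD. The sketch below embeds a small, hypothetical ODD fragment (the schema name `myProject` and the permitted values `letter` and `enclosure` are invented) and summarises what it changes. Because the customisation is itself XML, this sort of summary is easy to produce; generating the actual schema would be left to the TEI's own tools.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical customisation: pick modules, delete <hi>, and require a
# @type on <div> drawn from a closed list of two values.
ODD = """<schemaSpec xmlns="http://www.tei-c.org/ns/1.0" ident="myProject" start="TEI">
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"/>
  <moduleRef key="textstructure"/>
  <moduleRef key="namesdates" include="persName placeName"/>
  <elementSpec ident="hi" module="core" mode="delete"/>
  <elementSpec ident="div" module="textstructure" mode="change">
    <attList>
      <attDef ident="type" mode="change" usage="req">
        <valList type="closed" mode="replace">
          <valItem ident="letter"/>
          <valItem ident="enclosure"/>
        </valList>
      </attDef>
    </attList>
  </elementSpec>
</schemaSpec>"""

def summarise(odd_xml):
    """List the modules used, elements deleted, and closed value lists."""
    spec = ET.fromstring(odd_xml)
    modules = [m.get("key") for m in spec.iterfind("tei:moduleRef", NS)]
    deleted = [e.get("ident")
               for e in spec.iterfind("tei:elementSpec[@mode='delete']", NS)]
    closed = {}
    for el in spec.iterfind("tei:elementSpec", NS):
        for att in el.iterfind(".//tei:attDef", NS):
            values = [v.get("ident") for v in
                      att.iterfind("tei:valList[@type='closed']/tei:valItem", NS)]
            if values:
                closed[f"{el.get('ident')}/@{att.get('ident')}"] = values
    return modules, deleted, closed

modules, deleted, closed = summarise(ODD)
print(modules, deleted, closed)
```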

[Diagram: Possibilities of the TEI Framework, showing Project A, Project B, and New Elements]

  • The ability to interchange many documents improves significantly with a common interchange format
  • Customisation can document the differences in a machine processable format so tools can compare different corpora

- @louburnard

Publishing TEI

  • There are many tools available e.g.:
    • Edition Visualization Technology
    • TEI Boilerplate
    • TEI Critical Apparatus Toolbox
    • TEI-C Stylesheets
    • OxGarage
    • CETEIcean
    • eXist-db 
    • TAPAS project
  • The tools you use may affect the features you can display to those reading your research and you may have more or less ability to customise

Edition Visualization Technology

  • Easy publication for multi-witness critical editions
  • Critical Edition support: rich and expandable critical apparatus, variant heat map, witnesses collation and variant filtering
  • Bookmark: direct reference to the current view of the web application, page and edition level, collated witnesses and selected apparatus entry
  • High level of customization: the editor can customize both the user interface layout and the appearance of the graphical components
  • https://visualizationtechnology.wordpress.com/

TEI Boilerplate

  • TEI Boilerplate gives in-browser conversion of TEI P5 XML using a simple XSL stylesheet processing instruction
  • It transforms to HTML only those elements that need it, e.g. to display images or make links clickable
  • Works in all major browsers
  • Works well for small, simple, individual web pages
  • Uses standard customisable CSS but also pays attention to CSS in TEI <rendition> elements
  • Viewing the web page source gives access to your TEI
  • http://teiboilerplate.org/
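
The processing instruction mentioned above is a single line placed before the root element. The helper below is a minimal sketch that adds one to a TEI file held as a string; the stylesheet path `teibp.xsl` is an assumption and depends on where the Boilerplate files are installed.

```python
def add_stylesheet_pi(tei_xml, href):
    """Insert an xml-stylesheet processing instruction before the root element."""
    pi = f'<?xml-stylesheet type="text/xsl" href="{href}"?>'
    if tei_xml.startswith("<?xml"):
        # Keep the XML declaration first and put the instruction right after it.
        end = tei_xml.index("?>") + 2
        return tei_xml[:end] + "\n" + pi + tei_xml[end:]
    return pi + "\n" + tei_xml

# Invented one-line document; "teibp.xsl" is an assumed path.
doc = '<?xml version="1.0"?>\n<TEI xmlns="http://www.tei-c.org/ns/1.0"/>'
styled = add_stylesheet_pi(doc, "teibp.xsl")
print(styled)
```

A browser opening the resulting file would fetch the stylesheet and apply it, while the file on disk remains plain TEI.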

TEI Critical Apparatus Toolbox

  • Based on TEI Boilerplate
  • The toolbox lets you:
    • Check your encoding: offers facilities to display your edition while it is still in the making, and check the consistency of your encoding
    • Display parallel versions: choose the sigla of the witnesses, and the different versions of the text, following each chosen witness, will be displayed in parallel columns.
  • http://ciham-digital.huma-num.fr/teitoolbox/

TEI-C Stylesheets

  • Freely available, generalised XSLT stylesheets
  • Transformations to and/or from around 40 formats such as:
    • BibTeX, COCOA, CSV, DocBook, DocX (MS Word), DTD, EPub, XSL-FO, HTML, JSON, LaTeX, Markdown, NLM, ODT, PDF, RDF, RelaxNG, RNC, Schematron, Slides, TEI Lite, TEI ODD, TEI P4, TEI simplePrint, TCP, Text, Wordpress, XLSX (MS Excel), XSD
  • Customisable through importing and overwriting templates; Stylesheets repository allows for local 'profiles'
  • TEI-C offers services such as OxGarage which enable pipelined conversion to/from many more formats
  • https://github.com/TEIC/Stylesheets

CETEIcean

  • CETEIcean is a JavaScript (ES6) library that enables TEI P5 XML to be displayed in a web browser without transforming it to conventional HTML
  • Instead it registers the TEI elements with the browser as Custom Elements
  • Because the elements are treated as HTML, the HTML it produces is valid, and there are no element name collisions (like HTML <p> vs. TEI <p>)
  • http://github.com/TEIC/CETEIcean
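
The renaming idea can be illustrated outside the browser. The sketch below is not CETEIcean's code, which is JavaScript and does considerably more; it only shows, on an invented paragraph, how prefixing each TEI element name avoids clashes with HTML names.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

def to_custom_elements(tei_xml):
    """Rename every TEI element to a 'tei-' prefixed, lower-case name."""
    root = ET.fromstring(tei_xml)
    for el in root.iter():
        if el.tag.startswith(TEI):
            # Custom element names must be lower-case and contain a hyphen.
            el.tag = "tei-" + el.tag[len(TEI):].lower()
    return ET.tostring(root, encoding="unicode")

# Invented paragraph: TEI <p> would otherwise collide with HTML <p>.
SAMPLE = '<p xmlns="http://www.tei-c.org/ns/1.0">A <hi rend="italic">short</hi> paragraph.</p>'
html_fragment = to_custom_elements(SAMPLE)
print(html_fragment)
```

The structure and attributes of the source survive unchanged, so CSS and scripts on the page can address the TEI elements directly.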

[Diagram: Your Edition or Web Page Template, containing embedded divisions of custom HTML elements produced by the CETEIcean JavaScript]

eXist-db TEI Publisher

  • The "instant publishing toolbox" based on the eXist-db XML database
  • Provides easy browsing and search of TEI XML documents; initially built for TEI simplePrint
  • The default is a clean, sophisticated page-by-page display
  • Control of element display is by editing the processing model documentation embedded in the TEI ODD (the TEI customisation format)
  • http://showcases.exist-db.org

TAPAS Project

  • The TAPAS project: TEI Archiving, Publishing, and Access Service hosted by Northeastern University Library's Digital Scholarship Group
  • With a free account, users can contribute to projects and collections in TAPAS to archive, publish, discover or share their TEI files
  • Built-in XSLT transformations
  • TEI Members (or paid TAPAS membership) can create collections and projects
  • 1GB of XML file storage for TEI files, TEI ODD Customisations
  • http://tapasproject.org/

TEI Training

Conclusions

  • TEI is a mature, flexible standard suitable for careful documentary editing of primary sources (and many other uses)
  • It is a good format for lots of rich data encoding which can then be exploited, extracted, analysed, or published in many different ways
  • TEI is a standard that you can modify:
    • constrain for consistency between multiple editors
    • extend for areas the TEI hasn't covered yet
    • submit feature requests with your proposed extensions
  • There are many ways to publish your TEI files; the correct way is the one that enables you to express your research and share it with others
  • There is lots of TEI training available

A paper for the International Conference of Editors of Diplomatic Documents, 2017-04-18
