TEI Structure
&
Basic Core Components

James Cummings
@jamescummings

http://slides.com/jamescummings/teistructure-core

Thanks as ever to many members of the TEI Community

TEI Structure

(How TEI documents are structured)

Where can I read about this topic?

  • Chapters:
    • 4 -- Default Text Structure (textstructure)
    • 3 -- Elements Available in all TEI documents (core)
  • The TEI takes a generalistic approach and should be able to cope with texts
    • … of any size
    • … language and writing system
    • … complexity
    • … on all media
    • … from every time and place
  • such as books, journals, manuscripts, letters, rolls of papyrus, coins, notebooks, postcards, inscription tablets, web pages, etc.

TEI Document Structure(s)

A TEI document is represented by means of:

  • the root <TEI> element, which contains both data and metadata, or
  • a <teiCorpus> element, which combines a sequence of <TEI> elements.

 

  • The TEI file may represent a document of any size, form, or complexity -- from a postcard to a multi-volume encyclopedia

Each <TEI> element could represent a collection of encoded texts, versions of a text, or samples from a language corpus, etc.
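
As a rough illustration of "data and metadata", here is a minimal sketch of a single TEI document. It assumes the usual arrangement in which the metadata sits in a <teiHeader> (an element not named on this slide) and the data in <text>; all titles and text are placeholders invented for the example.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- metadata: describes the file and its source -->
    <fileDesc>
      <titleStmt>
        <title>A placeholder title</title>
      </titleStmt>
      <publicationStmt>
        <p>Placeholder publication details</p>
      </publicationStmt>
      <sourceDesc>
        <p>Placeholder description of the source</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <!-- data: the encoded text itself -->
    <body>
      <p>Placeholder paragraph.</p>
    </body>
  </text>
</TEI>
```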

<text> can be unitary or composite

  • A <text> may be
    • Unitary, forming an organic whole
    • Composite, consisting of several components which are in some important sense independent of each other
  • A unitary text contains:
    • <front> optional, contains any prefatory matter found at the start of a document (title page, preface, etc.)
    • <body> mandatory, contains the whole body of a single text
    • <back> optional, contains back matter such as appendixes following the main part
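
A unitary text might be sketched as follows. The division types and the placeholder content are assumptions made for illustration only.

```xml
<text>
  <front>
    <!-- optional prefatory matter -->
    <div type="preface">
      <p>Placeholder preface.</p>
    </div>
  </front>
  <body>
    <!-- mandatory: the text proper -->
    <p>Placeholder main text.</p>
  </body>
  <back>
    <!-- optional back matter -->
    <div type="appendix">
      <p>Placeholder appendix.</p>
    </div>
  </back>
</text>
```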

Composite <text>

  • A composite text contains:
    • optional <front>, contains any prefatory matter relating to the composite
    • <group> contains at least one text, grouping together distinct texts
    • optional <back>, back matter relating to the composite

 

 

A group may contain sub-groups, represented by nested <group> elements.
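
A composite text could look roughly like this sketch, in which a <group> holds two independent texts (for example, two poems in an anthology). The content is placeholder text, not taken from any real edition.

```xml
<text>
  <front>
    <!-- prefatory matter for the collection as a whole -->
    <div type="preface">
      <p>Placeholder preface to the collection.</p>
    </div>
  </front>
  <group>
    <text>
      <body>
        <p>Placeholder first text.</p>
      </body>
    </text>
    <text>
      <body>
        <p>Placeholder second text.</p>
      </body>
    </text>
  </group>
</text>
```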

Front Matter

  • The front matter <front> may contain distinct sections of a text, typically encoded as <div> elements with a @type such as:
    • 'preface' a foreword or preface addressed to the reader
    • 'ack' a declaration of acknowledgement by the author
    • 'dedication' a formal dedication to one or more persons
    • 'abstract' a summary of the content
    • 'contents' a table of contents
    • 'frontispiece' pictorial frontispiece, possibly including a text

Because cultural conventions differ as to which elements are grouped as front matter and which as back matter, the content models for the front and back elements are identical.

  • <titlePage> the title page of a text, appearing within the front or back matter
  • <docTitle> the title of the document
  • <titlePart> subsection or division of the title of a work
    • @type specifies the role: e.g. main, sub, alt, short, desc
  • <docAuthor> the name of the author
  • <docImprint> the imprint statement (place, date, publisher)
  • <docDate> the date of the document
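
Put together, a title page might be encoded along these lines. This is only a sketch: the title, author, and imprint details are invented placeholders.

```xml
<front>
  <titlePage>
    <docTitle>
      <titlePart type="main">A Placeholder Title</titlePart>
      <titlePart type="sub">With an Explanatory Subtitle</titlePart>
    </docTitle>
    <docAuthor>A. N. Author</docAuthor>
    <docImprint>
      <pubPlace>London</pubPlace>:
      <publisher>Example Press</publisher>,
      <docDate>1850</docDate>
    </docImprint>
  </titlePage>
</front>
```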

Back Matter

  • 'appendix' an ancillary self-contained section of a work

  • 'glossary' a list of terms associated with definition texts
    (<list type="gloss">)

  • 'notes' a section where notes are gathered together

  • 'bibliogr' list of bibliographic citations (<listBibl>)

  • 'index' any form of index of the work

  • 'colophon' statement describing the physical production of the work

  • and many more...
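
A sketch of back matter using two of these division types, with placeholder entries invented for illustration:

```xml
<back>
  <div type="glossary">
    <head>Glossary</head>
    <list type="gloss">
      <label>placeholder term</label>
      <item>placeholder definition</item>
    </list>
  </div>
  <div type="bibliogr">
    <head>Bibliography</head>
    <listBibl>
      <bibl>A. N. Author, A Placeholder Title (1850)</bibl>
    </listBibl>
  </div>
</back>
```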

Previously Printed Index

Of course, there are elements like <index> for marking index entries in the body of a text to enable auto-generation of a detailed index
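
For instance, an index entry might be embedded in running text as in this sketch. The index name "subjects" and the term are hypothetical choices for the example.

```xml
<p>Placeholder sentence about gatherings
  <index indexName="subjects">
    <term>gatherings</term>
  </index>
  and how they are bound.</p>
```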

Global Attributes

Some features (potentially) apply to everything, therefore members of the attribute class att.global can appear in every TEI element:

  • @xml:id provides a unique identifier for any element
  • @n provides a number or name for an element (not unique)
  • @xml:base provides a base URI reference for resolving relative URIs
  • @xml:lang specifies the language of any element, using a BCP 47 language tag (built from ISO 639 codes, e.g. "en", "fr")
  • @xml:space specifies how whitespace should be managed by applications
  • @rend, @style and @rendition provide ways of specifying the visual appearance (rendition) of any element (att.global.rendition)
  • @resp points to the agency responsible; @cert for certainty
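
Several global attributes on one element, as a sketch. The identifier, number, and @rend value are project-specific choices invented here.

```xml
<!-- xml:id must be unique within the document; n need not be -->
<p xml:id="letter1-p1" n="1" xml:lang="fr" rend="indent">Bonjour tout le monde.</p>
```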

Inside the <body>

Hierarchical grouping of text sequences into textual divisions and subdivisions by means of nested <div> elements.

  • Use of the @type attribute to distinguish different kinds of divisions, e.g.
    • Epic, Bible → book
    • Report → part, section 
    • Novel → chapter
    • Drama → acts, scenes
    • Reference book → sections
    • Diary → entries
    • Newspaper → sections, issues
  • and possibly @n to provide a name or number of any kind
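
Nested divisions might be sketched like this; the @type and @n values are examples, since each project chooses its own.

```xml
<body>
  <div type="chapter" n="1">
    <head>Chapter One</head>
    <div type="section" n="1.1">
      <head>First Section</head>
      <p>Placeholder paragraph.</p>
    </div>
    <div type="section" n="1.2">
      <head>Second Section</head>
      <p>Placeholder paragraph.</p>
    </div>
  </div>
</body>
```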

Components of a <div>

What do divisions contain (apart from other divisions)?

  • Headings, tagged with <head>

  • Prose, which may be organized as a sequence of
    paragraphs <p>

  • Poetry, divided into metrical lines <l>, optionally grouped into stanzas <lg>

  • Drama, divided into speeches <sp>, containing an
    optional speaker label <speaker>, followed by a mix of <p> or <l> elements, optionally interspersed with stage directions <stage>

Original Layout Information

Within the <text> element the logical view is privileged, but the physical view can be encoded as well through 'empty' elements:

  • <pb /> marks the start of a new page

  • <cb /> marks the start of a new column

  • <lb /> marks the start of a new line

  • <gb /> marks the start of a new gathering

 

and for other forms of milestone:

  • <milestone /> marks a boundary point of any other kind.
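
A sketch of how these empty elements sit inside the logical structure. The page number and the milestone's @unit and @n values are invented for the example.

```xml
<p>Placeholder text at the foot of the first page
  <pb n="2"/>
  continues at the top of the second page, <lb/>then runs onto a new line.
  <milestone unit="section" n="2"/>
  Placeholder text after the boundary.</p>
```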

 

Basic Core Components

(Things lots of documents have)

What is common to most materials?

  • Identification information

    • e.g. shelfmark, inventory number, page number, titles…

  • Divisions and subdivisions

  • Pictures, diagrams, some kind of graphical information

  • A number of writing modes or registers

    • e.g. prose, verse, drama…

  • With formal structural units

    • e.g. paragraphs, lists, stanzas, lines, speeches

  • Containing textual distinctions (sometimes signalled by rendition)

    • e.g. titles, headings, quotes, names…

  • Metatextual indications/interventions

    • e.g. deletions, additions, annotations, revisions…

The TEI core module can cope with these phenomena and more!

Paragraphs

A paragraph is a significant organizational unit for all prose texts

  • <p> marks paragraphs in prose
  • <p> can contain all the phrase-level elements in the core module
    • Phrase-level elements must be entirely contained within a paragraph
    • Inter-level elements can appear either within a paragraph or between paragraphs (e.g. lists, bibliographic citations, etc.)
    • Chunks (e.g. paragraphs, anonymous blocks) appear only between, not within, paragraphs

Highlighting

Typographic features are used to distinguish a passage from its surroundings, marking it as:

  • distinct in some way (e.g. foreign, dialectal, technical, etc.)
  • emphatic or stressed when spoken
  • not part of the body of the text (e.g. title, head, label, etc.)
  • distinct narrative stream (e.g. monologue, commentary, etc.)
  • attributed by the narrator to some other agency (e.g. direct speech, quotation, etc.)
  • set apart from the text in some other way (e.g. individual names in older texts, editorial corrections or additions, etc.)

Highlighting

<hi> word or phrase which is graphically distinct from the surrounding text

  • @rend specifies the visual appearance; the values are defined by each project
  • @style and @rendition describe the rendition using an external standard, such as CSS
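
A small sketch of both approaches. The @rend value "italic" is a hypothetical project convention, and the sentence is invented.

```xml
<p>The word <hi rend="italic">quire</hi> appears in italics,
  while <hi style="font-variant: small-caps">this phrase</hi> is in small capitals.</p>
```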

Foreign Phrases

  • <foreign> word or phrase in a different
    language from the surrounding text

    • @xml:lang global attribute to specify the language, using a BCP 47 language tag (built from ISO 639 codes, e.g. "fr")

You may disagree that 'croissant' is a foreign word.
Markup is never neutral.
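
The slide's own example is not preserved here; a sketch of the kind of markup the remark refers to, with an invented sentence, might be:

```xml
<p>She ordered a <foreign xml:lang="fr">croissant</foreign> and a coffee.</p>
```

Choosing to tag the word at all is an editorial judgement, which is the point of the remark above.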

Emphasis

  • <emph> words or phrases which are emphasized for
    linguistic or rhetorical effect

    • original rendition recorded with: @rend, @rendition and @style
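
For example (an invented sentence, with a hypothetical @rend value):

```xml
<p>You <emph rend="italic">must</emph> return it by Friday.</p>
```

Unlike <hi>, which only records that the text looks different, <emph> asserts why it looks different.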

Quotation

The TEI distinguishes a variety of 'distinct' text enclosed in quotation marks (or indicated by other means):

  • <q> separated from the surrounding text with quotation marks, e.g. direct speech, technical term, slang etc.
  • <said> passages thought or spoken aloud
    • @direct direct or indirect speech
    • @aloud vocalized or signed speech
  • <quote> passages attributed to an external source
  • <cit> quotation from some other document, together with a bibliographic reference
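
A sketch contrasting these elements. The sentences and the bibliographic details are placeholders invented for illustration.

```xml
<p><said direct="true" aloud="true">Come in,</said> she said,
  and then thought <said direct="true" aloud="false">why is he here?</said></p>

<p>The so-called <q>standard</q> edition opens with
  <cit>
    <quote>a placeholder quoted passage</quote>
    <bibl>A. N. Author, A Placeholder Title, p. 1</bibl>
  </cit>.</p>
```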

Simple Editorial Changes

  • The core module provides some phrase-level elements which may be used to record simple editorial interventions.
  • <choice> groups alternative encodings for the same point in a text
    • Abbreviations:
      • <abbr> abbreviated form
      • <expan> expanded form
    • Errors:
      • <sic> apparent error
      • <corr> corrected error
    • Regularization:
      • <orig> original form
      • <reg> regularized form

Abbreviation and Expansion

You can also mark abbreviation markers (<am>) and the letters supplied in an expansion (<ex>); both elements come from the transcr module rather than core.
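
A sketch of both levels of detail, using an invented example. The second form assumes the transcr module is included in the schema, since <am> and <ex> are defined there.

```xml
<!-- simple pairing of abbreviated and expanded forms -->
<choice>
  <abbr>Mr.</abbr>
  <expan>Mister</expan>
</choice>

<!-- the same, with the marker and the supplied letters made explicit -->
<choice>
  <abbr>Mr<am>.</am></abbr>
  <expan>M<ex>iste</ex>r</expan>
</choice>
```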

Emendation and Correction
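
A sketch of an apparent error paired with its correction (the sentence is invented):

```xml
<p>He read
  <choice>
    <sic>teh</sic>
    <corr>the</corr>
  </choice>
  letter twice.</p>
```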

Regularisation
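
A sketch of an original spelling paired with a regularized form (the example word is invented for illustration):

```xml
<p>A
  <choice>
    <orig>vniuersall</orig>
    <reg>universal</reg>
  </choice>
  remedy.</p>
```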

Addition, Deletion, and Omissions

  • <add> addition to the text

  • <del> letter, word or phrase marked as deleted in the text

  • <unclear> illegible or inaudible passage which cannot be read with confidence

  • <gap> indicates a point where material is omitted
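
These four elements might be combined as in this sketch. The sentence and all attribute values are invented for the example.

```xml
<p>The <del rend="strikethrough">quick</del>
  <add place="above">slow</add> fox
  <unclear>jumps</unclear> over the
  <gap reason="illegible" quantity="4" unit="chars"/> dog.</p>
```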

Names

  • <name> a proper noun or noun phrase

  • <rs> a string referring to some person, place, object, etc.

    • @type attribute specifies the type of the name in more detail

Note: Including the namesdates module gives many more name elements (for personal, place, organisational, and geographic names).
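
A sketch using only the core elements. The names and @type values are hypothetical.

```xml
<p><name type="person">Jane Example</name> wrote to
  <rs type="person">her mother</rs> from
  <name type="place">Exampleton</name>.</p>
```

Here <name> marks an actual proper name, while <rs> marks a phrase that refers to someone without naming them.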

Addresses

Elements to distinguish postal and electronic addresses

  • <address> contains a postal address

  • <email> contains an email address

  • <addrLine> a non-specific address line

  • <street> a full street address

  • <postCode> a postal or zip code

  • <postBox> a postal box number

  • <name> can also be used within <address>
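
A sketch of a postal and an electronic address. Every detail is an invented placeholder.

```xml
<address>
  <name>Jane Example</name>
  <street>10 Placeholder Street</street>
  <addrLine>Exampleton</addrLine>
  <postCode>EX1 2MP</postCode>
</address>

<p>Write to <email>editor@example.org</email>.</p>
```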

Numbers and Measures

  • <num> a number of any sort, written in any form
    • @type and @value
  • <measure> marks a quantity and/or commodity
    • @type, @unit, @quantity, @commodity
  • <measureGrp> a group of dimensional specifications
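
For example, with invented text and hypothetical @type values:

```xml
<p>She bought
  <num type="cardinal" value="21">twenty-one</num> eggs and
  <measure type="weight" quantity="2" unit="kg" commodity="flour">two kilos of flour</measure>.</p>
```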

Dates and Times

  • <date> contains a date in any format
    • @when contains the regularized form: YYYY-MM-DD
    • @calendar to specify the calendar system
  • <time> contains a time of day in any format
    • @when contains the regularized form: HH:MM:SS

 

(More attributes added if the namesdates module is loaded)  
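
A sketch with an invented sentence, showing the text as written alongside the regularized @when value:

```xml
<p>We met on <date when="1850-03-14">the 14th of March 1850</date>
  at <time when="14:30:00">half past two in the afternoon</time>.</p>
```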

Links and Cross References

  • <ptr> defines a pointer to another location

  • <ref> defines a reference to another location with an
    optional linking text

    • @target taking a URI reference

While <ref> provides link text (though not all references are hyperlinks), <ptr/> is only used for pointers. 
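
The difference might be sketched as follows. The URL and the "#ch2" identifier are hypothetical targets.

```xml
<!-- ref carries its own link text -->
<p>See the <ref target="http://www.example.org/guide">project guide</ref>.</p>

<!-- ptr is empty: it only points -->
<p>This is discussed later <ptr target="#ch2"/>.</p>
```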

Lists

  • <list>  (a sequence of items forming a list)
  • <item>  (one component of a list)
  • <label>  (label associated with an item)
  • <headLabel>  (heading for a column of labels)
  • <headItem>  (heading for a column of items)
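
A sketch of a simple list and a labelled (glossary-style) list, with placeholder entries:

```xml
<list>
  <item>First placeholder item</item>
  <item>Second placeholder item</item>
</list>

<list type="gloss">
  <headLabel>Term</headLabel>
  <headItem>Meaning</headItem>
  <label>recto</label>
  <item>the front of a leaf</item>
  <label>verso</label>
  <item>the back of a leaf</item>
</list>
```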

Graphics

  • <graphic> location of an inline graphic, illustration or figure
  • <binaryObject> binary data embedding graphics or other objects
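
A minimal sketch of <graphic>, assuming a hypothetical image path:

```xml
<p>The device is shown here:
  <graphic url="images/device.png"/></p>
```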

Bibliographies

  • <bibl> a structured or unstructured bibliographic entry

    • <author>, <editor>, <title>, <pubPlace>, <publisher>, <date>, etc. for further structuring

  • <biblStruct> a structured bibliographic entry
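
The same invented reference, sketched first as a loosely structured <bibl> and then as a <biblStruct>. None of the bibliographic details refer to a real publication.

```xml
<bibl><author>A. N. Author</author>,
  <title>A Placeholder Title</title>
  (<pubPlace>London</pubPlace>: <publisher>Example Press</publisher>,
  <date>1850</date>).</bibl>

<biblStruct>
  <monogr>
    <author>A. N. Author</author>
    <title>A Placeholder Title</title>
    <imprint>
      <pubPlace>London</pubPlace>
      <publisher>Example Press</publisher>
      <date>1850</date>
    </imprint>
  </monogr>
</biblStruct>
```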

     

     

Verse

  • <lg> a formal unit (e.g. stanza) containing one or more verse lines

  • <l> contains a single verse line

The verse module extends this with more elements for metrical analysis.
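
A sketch of a stanza using only the core elements. The lines are the opening of Blake's "The Tyger"; the @type value is a project choice.

```xml
<lg type="stanza">
  <l>Tyger Tyger, burning bright,</l>
  <l>In the forests of the night;</l>
  <l>What immortal hand or eye,</l>
  <l>Could frame thy fearful symmetry?</l>
</lg>
```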

Drama

  • <sp> an individual speech in a performance text, or passages presented as such in prose or verse text
  • <speaker> provides the name of one or more speakers in a dramatic text
  • <stage> provides stage directions within a dramatic text

The drama module extends this with more elements for dramatic structures like cast lists.
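
A sketch of a short exchange using only the core elements. The scene, speakers, and lines are invented for illustration.

```xml
<div type="scene" n="1">
  <head>Scene 1</head>
  <stage>Enter two travellers.</stage>
  <sp>
    <speaker>First Traveller</speaker>
    <p>Is this the road to town?</p>
  </sp>
  <sp>
    <speaker>Second Traveller</speaker>
    <l>It is the road, but not the road you seek.</l>
    <stage>Points offstage.</stage>
  </sp>
</div>
```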

Elements in TextStructure and Core