TEI Structure & Basic Core Components

Dr James Cummings


@jamescummings

http://slides.com/jamescummings/tei-structure-core-mca

Thanks as ever to many members of the TEI Community

TEI Document Structure(s)

A TEI document is represented by means of:

  • the root <TEI> element, which contains both data and metadata, or
  • a sequence of <TEI> elements, which may be combined to form a <teiCorpus> element.

 

 

  • A TEI file may represent a document of any size, form, or complexity, from a postcard to a multi-volume encyclopedia
  • A TEI Corpus may be a collection of TEI files, a library catalogue, or a linguistic corpus

A <teiCorpus> element could represent a collection of encoded texts, versions of a text, samples from a language corpus, etc.
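
To make the two shapes concrete, here is a minimal sketch of a single TEI document. The <teiHeader> holds the metadata and <text> holds the data; the title and all text content are placeholders, not taken from any real file.

```xml
<!-- Illustrative skeleton only: every string below is a placeholder -->
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A placeholder title</title></titleStmt>
      <publicationStmt><p>Unpublished example</p></publicationStmt>
      <sourceDesc><p>Born digital</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>The encoded text goes here.</p>
    </body>
  </text>
</TEI>
```

A corpus wraps several such documents and adds a header of its own. The comments below stand in for content that a real corpus would have to supply.

```xml
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><!-- metadata describing the corpus as a whole --></teiHeader>
  <TEI><!-- first text, with its own teiHeader and text --></TEI>
  <TEI><!-- second text, with its own teiHeader and text --></TEI>
</teiCorpus>
```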

Front Matter

  • The front matter <front> groups the distinct sections that precede the main text, typically as <div> elements with a @type such as:
    • 'preface' a foreword or preface addressed to the reader
    • 'ack' a declaration of acknowledgement by the author
    • 'dedication' a formal dedication to one or more persons
    • 'abstract' a summary of the content
    • 'contents' a table of contents
    • 'frontispiece' a pictorial frontispiece, possibly including some text

Because cultural conventions differ as to which elements are grouped as front matter and which as back matter, the content models for the front and back elements are identical.

  • <titlePage> the title page of a text, appearing within the front or back matter
  • <docTitle> the title of the document
  • <titlePart> subsection or division of the title of a work
    • @type specifies the role: e.g. main, sub, alt, short, desc
  • <docAuthor> the name of the author
  • <docImprint> the imprint statement (place, date, publisher)
  • <docDate> the date of the document
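
As a rough illustration of how these elements fit together, the fragment below sketches a <front> with a title page and a preface. The title, author, press, and date are invented for the example.

```xml
<front>
  <titlePage>
    <docTitle>
      <titlePart type="main">A Placeholder Main Title</titlePart>
      <titlePart type="sub">with an invented subtitle</titlePart>
    </docTitle>
    <docAuthor>A. N. Author</docAuthor>
    <docImprint>
      <pubPlace>Exampletown</pubPlace>: <publisher>Example Press</publisher>,
      <docDate when="1900">1900</docDate>
    </docImprint>
  </titlePage>
  <div type="preface">
    <head>Preface</head>
    <p>To the reader, a few words of introduction.</p>
  </div>
</front>
```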

Back Matter

  • 'appendix' an ancillary self-contained section of a work

  • 'glossary' a list of terms associated with definition texts
    (<list type="gloss">)

  • 'notes' a section where notes are gathered together

  • 'bibliogr' list of bibliographic citations (<listBibl>)

  • 'index' any form of index of the work

  • 'colophon' statement describing the physical production of the work

  • and many more...

For Example: Previously Printed Index
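
One possible encoding of a printed index in the back matter is sketched here. The entries and page numbers are invented, and a real project might well choose richer markup for the references.

```xml
<back>
  <div type="index">
    <head>Index</head>
    <list>
      <item>Abbreviations, 12</item>
      <item>Corrections, 14, 27</item>
      <item>Deletions, 31</item>
    </list>
  </div>
</back>
```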

Global Attributes

Some features (potentially) apply to everything, so the attributes of the class att.global can appear on every TEI element:

  • @xml:id provides a unique identifier for any element
  • @n provides a number or name for an element (not unique)
  • @xml:base provides a base URI reference for resolving relative URIs
  • @xml:lang specifies the language of any element, using a BCP 47 language tag (usually built from an ISO 639 code, e.g. 'fr' or 'la')
  • @xml:space specifies how whitespace should be managed by applications
  • @rend, @style and @rendition provide ways of specifying the visual appearance (rendition) of any element (att.global.rendition)
  • @resp points to the agency responsible for an intervention or interpretation; @cert records the degree of certainty
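
A small sketch of several global attributes in use follows. The identifier values and the "#ed1" pointer are hypothetical; a real document would define the editor it points to in the header.

```xml
<div type="chapter" xml:id="ch1" n="1" xml:lang="en">
  <p xml:id="ch1-p1" n="1">A paragraph with a
    <hi rend="italic">highlighted</hi> phrase and a
    <foreign xml:lang="fr">mot juste</foreign>.</p>
  <p xml:id="ch1-p2" n="2">A reading the encoder could only partly
    <unclear resp="#ed1" cert="low">decipher</unclear>.</p>
</div>
```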

Inside the <body>

Hierarchical grouping of text sequences into textual divisions and subdivisions by means of nested <div> elements.

  • Use the @type attribute to distinguish different kinds of <div> divisions
    • Epic, Bible → book
    • Report → part, section
    • Novel → chapter
    • Drama → act, scene
    • Reference book → section
    • Diary → entry
    • Newspaper → section, issue
  • Optionally use @n to provide a name or number
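
The nesting described above might look like the following sketch, which uses @type and @n on each <div>. The headings and paragraph text are placeholders.

```xml
<body>
  <div type="part" n="1">
    <head>Part One</head>
    <div type="chapter" n="1">
      <head>Chapter 1</head>
      <p>Text of the first chapter.</p>
    </div>
    <div type="chapter" n="2">
      <head>Chapter 2</head>
      <p>Text of the second chapter.</p>
    </div>
  </div>
</body>
```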

Components of a <div>

What do divisions contain (apart from other divisions)?

  • Headings, tagged with <head>

  • Prose, which may be organized as a sequence of paragraphs <p>

  • Poetry, divided into metrical lines <l>, optionally grouped into stanzas <lg>

  • Drama, divided into speeches <sp>, each containing an optional speaker label <speaker> followed by <p> or <l> elements, optionally interspersed with stage directions <stage>

Original Layout Information

Within the <text> element the logical view is privileged, but the physical view can be encoded as well through 'empty' elements:

  • <pb/> marks the start of a new page

  • <cb/> marks the start of a new column

  • <lb/> marks the start of a new line

  • <gb/> marks the start of a new gathering

 

and for other forms of milestone:

  • <milestone/> marks a boundary point between sections of a text that are not represented by a structural element.
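
The fragment below sketches how these empty elements sit inside the logical structure. The page number and the @unit value "section" are assumptions chosen for the example.

```xml
<p>The sentence runs to the end of one page
  <pb n="13"/>and carries on at the top of the next, where the first
  <lb/>line break of the source is also recorded.</p>
<!-- a change of section that no structural element captures -->
<milestone unit="section" n="2"/>
<p>The text continues after the boundary.</p>
```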

 

Paragraphs

A paragraph is a significant organizational unit for all prose texts.

  • <p> marks paragraphs in prose
  • <p> can contain all the phrase-level elements in the core module
    • Phrase-level elements must be entirely contained within a paragraph
    • Inter-level elements can appear either within a paragraph or between paragraphs (e.g. lists, bibliographic citations, etc.)
    • Chunks (e.g. paragraphs, anonymous blocks) appear between paragraphs, not within them
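
The difference between phrase-level and inter-level elements can be seen in this sketch, where <hi> must sit inside the paragraph while <list> appears both inside a paragraph and between paragraphs. All content is invented.

```xml
<div>
  <p>A paragraph with <hi rend="bold">phrase-level</hi> markup and an
    embedded list:
    <list>
      <item>first point</item>
      <item>second point</item>
    </list>
  </p>
  <!-- the same inter-level element, this time between paragraphs -->
  <list>
    <item>a free-standing item</item>
  </list>
  <p>A following paragraph.</p>
</div>
```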

Highlighting

<hi> word or phrase which is graphically distinct from the surrounding text

  • @rend specifies the visual appearance; the values are defined by each project
  • @style and @rendition describe the rendition using an external standard such as CSS
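
A brief sketch of both approaches: the @rend value "italic" is a project-defined label, while the @style value is an ordinary CSS declaration. Neither value is prescribed by the TEI.

```xml
<p>The word <hi rend="italic">slanted</hi> is printed in italics, and
  <hi style="font-variant: small-caps">this phrase</hi> in small capitals.</p>
```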

Foreign Phrases

  • <foreign> word or phrase written in a different language from the surrounding text

    • @xml:lang global attribute to specify the language, using a BCP 47 language tag (usually built from an ISO 639 code, e.g. 'fr' or 'la')
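
For instance, a Latin phrase in an English sentence could be marked up as in this sketch:

```xml
<p xml:lang="en">The rule applies,
  <foreign xml:lang="la">mutatis mutandis</foreign>, to the second case.</p>
```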

Emphasis

  • <emph> words or phrases which are emphasized for linguistic or rhetorical effect

    • original rendition recorded with: @rend, @rendition and @style
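
In contrast to <hi>, which only records that text looks different, <emph> makes a claim about why. A minimal sketch, with an invented sentence and a project-defined @rend value:

```xml
<p>You must <emph rend="italic">never</emph> open that door.</p>
```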

Quotation

The TEI distinguishes several kinds of text that are set off by quotation marks (or by other means):

  • <q> separated from the surrounding text with quotation marks, e.g. direct speech, technical term, slang etc.
  • <said> passages thought or spoken aloud
    • @direct direct or indirect speech
    • @aloud vocalized or signed speech
  • <quote> passages attributed to an external source
  • <cit> quotation from some other document, together with a bibliographic reference
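
The sketch below contrasts reported speech with a quotation tied to its source. The speaker pointer "#anna", the quoted sentence, and the bibliographic details are all invented for illustration.

```xml
<p><said who="#anna" direct="true" aloud="true">I will come tomorrow,</said>
  she replied.</p>
<p>As the report puts it:
  <cit>
    <quote>an invented sentence standing in for a real quotation</quote>
    <bibl>A. N. Author, <title>An Example Report</title>, p. 3</bibl>
  </cit>
</p>
```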

Simple Editorial Changes

  • The core module provides some phrase-level elements which may be used to record simple editorial interventions.
  • <choice> groups alternative encodings for the same point in a text
    • Abbreviations:
      • <abbr> abbreviated form
      • <expan> expanded form
    • Errors:
      • <sic> apparent error
      • <corr> corrected error
    • Regularization:
      • <orig> original form
      • <reg> regularized form
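
Each of the three pairs can be sketched inside <choice> as follows. The sentence is invented; the point is that both the source reading and the editorial reading are kept side by side.

```xml
<p>The parcel was sent by
  <choice><abbr>Mr</abbr><expan>Mister</expan></choice> Smith, who
  <choice><sic>recieved</sic><corr>received</corr></choice> it at the
  <choice><orig>olde</orig><reg>old</reg></choice> house.</p>
```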

Regularisation

Emendation and Correction

versus

Addition, Deletion, and Omissions

  • <add> addition to the text

  • <del> letter, word or phrase marked as deleted in the text

  • <unclear> illegible or inaudible passage which cannot be read with confidence

  • <gap> indicates a point where material is omitted
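
A sketch of these four elements in one invented sentence follows. The attribute values ("strikethrough", "above", "faded", "illegible") are plausible project choices rather than fixed vocabulary.

```xml
<p>The <del rend="strikethrough">first</del>
  <add place="above">second</add> draft was
  <unclear reason="faded">finished</unclear> in
  <gap reason="illegible" quantity="4" unit="char"/>.</p>
```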

Names

  • <name> a proper noun or noun phrase

  • <rs> a string referring to some person, place, object, etc.

    • @type attribute specifies the type of the name in more detail

Note: Including the namesdates module gives many more name elements (for personal, place, organisational, and geographic names).
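
Using only the core module, names and referring strings might be tagged as in this sketch. The people and place are invented, and the @type values are project-defined.

```xml
<p><name type="person">Mary Smith</name> wrote to
  <rs type="person">her former tutor</rs> from
  <name type="place">Exampletown</name>.</p>
```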

Addresses

Elements to distinguish postal and electronic addresses

  • <address> contains a postal address

  • <email> contains an email address

  • <addrLine> a non-specific address line

  • <street> a full street address

  • <postCode> a postal or zip code

  • <postBox> a postal box number

  • <name> can also be used within address
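
Put together, an address could be encoded roughly as below. Every detail, including the email address on the reserved example.org domain, is invented.

```xml
<address>
  <name>Example Press</name>
  <street>12 Invented Street</street>
  <postBox>P.O. Box 100</postBox>
  <addrLine>Exampletown</addrLine>
  <postCode>EX1 2MP</postCode>
</address>
<p>Enquiries: <email>info@example.org</email></p>
```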

Dates and Times

  • <date> contains a date in any format
    • @when contains the regularized form: YYYY-MM-DD
    • @calendar to specify the calendar system
  • <time> contains a time of day in any format
    • @when contains the regularized form: HH:MM:SS

 

(More attributes added if the namesdates module is loaded)  
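
A short sketch of how the text of the source and its regularized form coexist. The date and time are invented.

```xml
<p>Written on <date when="1857-03-15">the 15th of March 1857</date>
  at <time when="14:30:00">half past two in the afternoon</time>.</p>
```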

Lists

  • <list>  (a sequence of items forming a list)
  • <item>  (one component of a list)
  • <label>  (label associated with an item)
  • <headLabel>  (heading for a column of labels)
  • <headItem>  (heading for a column of items)
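
These five elements combine naturally in a glossary-style list, sketched here with definitions paraphrased from this deck:

```xml
<list type="gloss">
  <headLabel>Element</headLabel>
  <headItem>Purpose</headItem>
  <label>pb</label>
  <item>marks the start of a new page</item>
  <label>lb</label>
  <item>marks the start of a new line</item>
</list>
```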

Bibliographies

  • <bibl> a structured or unstructured bibliographic entry

    • <author>, <editor>, <title>, <pubPlace>, <publisher>, <date>, etc. for further structuring

  • <biblStruct> a structured bibliographic entry
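
The same invented reference is sketched below in both forms: first as a loosely structured <bibl>, then as a <biblStruct> with its fixed internal order. The author, title, and publisher are placeholders, not a real publication.

```xml
<listBibl>
  <bibl><author>A. N. Author</author>, <title>An Example Title</title>
    (<pubPlace>Exampletown</pubPlace>: <publisher>Example Press</publisher>,
    <date>1900</date>).</bibl>
  <biblStruct>
    <monogr>
      <author>A. N. Author</author>
      <title>An Example Title</title>
      <imprint>
        <pubPlace>Exampletown</pubPlace>
        <publisher>Example Press</publisher>
        <date>1900</date>
      </imprint>
    </monogr>
  </biblStruct>
</listBibl>
```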

     

     

Verse

  • <lg> a formal unit (e.g. stanza) containing one or more verse lines

  • <l> contains a single verse line

The verse module extends this with more elements for metrical analysis.
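
As an illustration using only the core elements, the opening stanza of William Blake's "The Tyger" could be encoded like this (the @type value "stanza" is a common but project-defined choice):

```xml
<lg type="stanza">
  <l>Tyger Tyger, burning bright,</l>
  <l>In the forests of the night;</l>
  <l>What immortal hand or eye,</l>
  <l>Could frame thy fearful symmetry?</l>
</lg>
```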

Drama

  • <sp> an individual speech in a performance text, or passages presented as such in prose or verse text
  • <speaker> provides the name of one or more speakers in a dramatic text
  • <stage> provides stage directions within a dramatic text

The drama module extends this with more elements for dramatic structures like cast lists.
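
A minimal sketch of a scene built from the core elements alone follows. The characters and dialogue are invented, and the @type value on <stage> is an assumption about how a project might classify directions.

```xml
<div type="scene" n="1">
  <stage type="entrance">Enter Anna and the Doctor.</stage>
  <sp>
    <speaker>Anna</speaker>
    <p>Has the letter arrived?</p>
  </sp>
  <sp>
    <speaker>Doctor</speaker>
    <p>Not yet. <stage>He looks at the door.</stage> But someone is coming.</p>
  </sp>
</div>
```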

TEI Text Structure Module Elements:

TEI Core Module Elements: