TEI Structure
&
Basic Core Components

James Cummings
@jamescummings

http://slides.com/jamescummings/teistructure-core

Thanks as ever to many members of the TEI Community

TEI Structure

(How TEI documents are structured)

Where can I read about this topic?

  • Chapters:
    • 4 -- Default Text Structure (textstructure)
    • 3 -- Elements Available in all TEI documents (core)
  • The TEI takes a generalist approach and should be able to cope with texts
    • … of any size
    • … in any language and writing system
    • … of any complexity
    • … in any medium
    • … from any time and place
  • such as books, journals, manuscripts, letters, rolls of papyrus, coins, notebooks, postcards, inscription tablets, web pages, etc.

TEI Document Structure(s)

A TEI document is represented by either:

  • a single root <TEI> element, which contains both data and metadata, or
  • a <teiCorpus> element, which combines a sequence of <TEI> elements.

 

  • A TEI file may represent a document of any size, form, or complexity, from a postcard to a multi-volume encyclopedia

Each <TEI> element could represent a collection of encoded texts, versions of a text, samples from a language corpus, etc.
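
A minimal sketch of a single-text TEI document may make the split between metadata and data concrete. The <teiHeader> element, which carries the metadata, is not named on this slide, and the title and text are invented for illustration.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <!-- metadata: describes the electronic text and its source -->
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A hypothetical postcard</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished teaching example.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Invented for illustration; there is no real source.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <!-- data: the encoded text itself -->
  <text>
    <body>
      <p>Wish you were here.</p>
    </body>
  </text>
</TEI>
```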

<text> can be unitary or composite

  • A <text> may be
    • Unitary, forming an organic whole
    • Composite, consisting of several components which are in some important sense independent of each other
  • A unitary text contains:
    • <front> optional, contains any prefatory matter found at the start of a document (title page, preface, etc.)
    • <body> mandatory, contains the whole body of a single text
    • <back> optional, contains any back matter (appendixes, etc.) following the main part
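
As a rough sketch, a unitary text with all three parts might look like this. The division types, headings and elided paragraph content are placeholders, not taken from a real document.

```xml
<text>
  <!-- optional prefatory matter -->
  <front>
    <div type="preface">
      <head>Preface</head>
      <p>...</p>
    </div>
  </front>
  <!-- mandatory body of the single text -->
  <body>
    <div type="chapter" n="1">
      <head>Chapter 1</head>
      <p>...</p>
    </div>
  </body>
  <!-- optional back matter -->
  <back>
    <div type="appendix">
      <head>Appendix</head>
      <p>...</p>
    </div>
  </back>
</text>
```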

Composite <text>

  • A composite text contains:
    • optional <front>, contains any prefatory matter relating to the composite
    • <group> contains at least one text, grouping together distinct texts
    • optional <back>, back matter relating to the composite

 

 

A group may contain sub-groups, represented by nested <group> elements.
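
The following sketch shows one possible composite text: an outer <group> holding one text directly and two more inside a nested <group>. All headings and paragraph content are invented placeholders.

```xml
<text>
  <front>
    <div type="preface">
      <head>Preface to the collection</head>
      <p>...</p>
    </div>
  </front>
  <group>
    <!-- first independent text -->
    <text>
      <body><p>First text ...</p></body>
    </text>
    <!-- a sub-group of two further texts -->
    <group>
      <text>
        <body><p>Second text ...</p></body>
      </text>
      <text>
        <body><p>Third text ...</p></body>
      </text>
    </group>
  </group>
</text>
```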

Front Matter

  • The front matter <front> is made up of distinct sections, typically <div> elements whose @type names the kind of section, e.g.:
    • 'preface' a foreword or preface addressed to the reader
    • 'ack' a declaration of acknowledgement by the author
    • 'dedication' a formal dedication to one or more persons
    • 'abstract' a summary of the content
    • 'contents' a table of contents
    • 'frontispiece' pictorial frontispiece, possibly including a text

Because cultural conventions differ as to which elements are grouped as front matter and which as back matter, the content models for the front and back elements are identical.

  • <titlePage> the title page of a text, appearing within the front or back matter
  • <docTitle> the title of the document
  • <titlePart> subsection or division of the title of a work
    • @type specifies the role: e.g. main, sub, alt, short, desc
  • <docAuthor> the name of the author
  • <docImprint> the imprint statement (place, date, publisher)
  • <docDate> the date of the document
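
A title page encoded with these elements could look roughly like the sketch below. The title, author, place and publisher are all invented for illustration.

```xml
<front>
  <titlePage>
    <docTitle>
      <titlePart type="main">An Invented Title</titlePart>
      <titlePart type="sub">Being a Hypothetical Example</titlePart>
    </docTitle>
    <docAuthor>A. N. Author</docAuthor>
    <docImprint>
      <pubPlace>Oxford</pubPlace>
      <publisher>Example Press</publisher>
      <docDate when="1850">1850</docDate>
    </docImprint>
  </titlePage>
</front>
```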

Back Matter

  • 'appendix' an ancillary self-contained section of a work

  • 'glossary' a list of terms associated with definition texts
    (<list type="gloss">)

  • 'notes' a section where notes are gathered together

  • 'bibliogr' list of bibliographic citations (<listBibl>)

  • 'index' any form of index of the work

  • 'colophon' statement describing the physical production of the work

  • and many more...
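
As an illustration, the sketch below assumes these labels are used as @type values on <div> elements inside <back>. The glossary entry and bibliographic citation are invented.

```xml
<back>
  <div type="glossary">
    <head>Glossary</head>
    <list type="gloss">
      <label>codex</label>
      <item>a book made of folded and bound leaves</item>
    </list>
  </div>
  <div type="bibliogr">
    <head>Bibliography</head>
    <listBibl>
      <bibl>A. N. Author, <title>An Invented Title</title>, 1850.</bibl>
    </listBibl>
  </div>
</back>
```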

Previously Printed Index

There are also elements such as <index> for marking index entries in the body of a text, so that a detailed index can be generated automatically

Global Attributes

Some features (potentially) apply to everything, so the attributes of the att.global class can appear on every TEI element:

  • @xml:id provides a unique identifier for any element
  • @n provides a number or name for an element (not unique)
  • @xml:base provides a base URI reference for resolving relative URIs
  • @xml:lang specifies the language of any element, using a BCP 47 language tag built from ISO 639 codes (e.g. 'fr' for French)
  • @xml:space specifies how whitespace should be managed by applications
  • @rend, @style and @rendition provide ways of specifying the visual appearance (rendition) of any element (att.global.rendition)
  • @resp points to the agency responsible; @cert records the degree of certainty
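
A short sketch of several global attributes in use follows. The identifiers (ch01, ed1) and the text are hypothetical; #ed1 is assumed to point at an editor declared elsewhere in the document.

```xml
<div xml:id="ch01" n="1" xml:lang="en">
  <!-- @xml:lang is inherited from the div unless overridden -->
  <p xml:id="ch01-p1" n="1">
    A paragraph in English containing
    <foreign xml:lang="la" rend="italic">ipso facto</foreign>.
  </p>
  <!-- @resp and @cert record who vouches for the encoding, and how confidently -->
  <p xml:id="ch01-p2" n="2" resp="#ed1" cert="medium">
    A paragraph whose reading is attributed to a hypothetical editor.
  </p>
</div>
```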

Inside the <body>

Hierarchical grouping of text sequences into textual divisions and subdivisions by means of nested <div> elements.

  • Use of the @type attribute to distinguish different kinds of divisions, e.g.
    • Epic, Bible → book
    • Report → part, section 
    • Novel → chapter
    • Drama → acts, scenes
    • Reference book → sections
    • Diary → entries
    • Newspaper → sections, issues
  • and possibly @n to provide a name or number of any kind
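
For instance, a play could be divided along these lines. The @type values follow the slide's suggestions and the content is elided.

```xml
<body>
  <div type="act" n="1">
    <head>Act 1</head>
    <div type="scene" n="1">
      <head>Scene 1</head>
      <p>...</p>
    </div>
    <div type="scene" n="2">
      <head>Scene 2</head>
      <p>...</p>
    </div>
  </div>
</body>
```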

Components of a <div>

What do divisions contain (apart from other divisions)?

  • Headings, tagged with <head>

  • Prose, which may be organized as a sequence of
    paragraphs <p>

  • Poetry, divided into metrical lines <l>, optionally grouped into stanzas <lg>

  • Drama, divided into speeches <sp>, each containing an optional speaker label <speaker> followed by <p> or <l> elements, optionally interspersed with stage directions <stage>
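
A small sketch of a division mixing a heading, prose and verse is given below. The chapter number and wording are invented.

```xml
<div type="chapter" n="3">
  <head>Chapter 3</head>
  <p>A paragraph of prose introduces the song.</p>
  <lg type="stanza">
    <l>First invented line of verse,</l>
    <l>second invented line of verse.</l>
  </lg>
</div>
```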

Original Layout Information

Within the <text> element the logical view is privileged, but the physical view can be encoded as well through 'empty' elements:

  • <pb /> marks the start of a new page

  • <cb /> marks the start of a new column

  • <lb /> marks the start of a new line

  • <gb /> marks the start of a new gathering

 

and for other forms of milestone:

  • <milestone /> marks a boundary point between sections of any other kind.
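
The sketch below shows how these empty elements might sit inside a paragraph. The page, line and section numbers are invented, and the @unit value on <milestone/> is a project-specific assumption.

```xml
<p>
  <pb n="12"/>
  <lb n="1"/>The first line of the page as printed,
  <lb n="2"/>followed by the second line.
  <!-- a boundary in some other reference system -->
  <milestone unit="section" n="2"/>
  <lb n="3"/>A new section begins on the third line.
</p>
```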

 

Basic Core Components

(Things lots of documents have)

What is common to most materials?

  • Identification information

    • e.g. shelfmark, inventory number, page number, titles…

  • Divisions and subdivisions

  • Pictures, diagrams, some kind of graphical information

  • A number of writing modes or registers

    • e.g. prose, verse, drama…

  • With formal structural units

    • e.g. paragraphs, lists, stanzas, lines, speeches

  • Containing textual distinctions (sometimes signalled by rendition)

    • e.g. titles, headings, quotes, names…

  • Metatextual indications/interventions

    • e.g. deletions, additions, annotations, revisions…

The TEI core module can cope with these phenomena and more!

Paragraphs

A paragraph is a significant organizational unit for all prose texts

  • <p> marks paragraphs in prose
  • <p> can contain all the phrase-level elements in the core module
    • Phrase-level elements must be entirely contained within a paragraph
    • Inter-level elements can appear either within a paragraph or between paragraphs (e.g. lists, bibliographic citations, etc.)
    • Chunks (e.g. paragraphs, anonymous blocks) appear directly within divisions, not inside other paragraphs

Highlighting

Typographic features are used to distinguish a passage from its surroundings because it is:

  • distinct in some way (e.g. foreign, dialectal, technical, etc.)
  • emphatic or stressed when spoken
  • not part of the body of the text (e.g. title, head, label, etc.)
  • distinct narrative stream (e.g. monologue, commentary, etc.)
  • attributed by the narrator to some other agency (e.g. direct speech, quotation, etc.)
  • set apart from the text in some other way (e.g. individual names in older texts, editorial corrections or additions, etc.)

Highlighting

<hi> word or phrase which is graphically distinct from the surrounding text

  • @rend specifies the visual appearance; the values are defined by each project
  • @style and @rendition describe the rendition using external standards such as CSS
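
A minimal sketch with invented text follows. The @rend value "italic" is an example of a project-defined value, not a fixed TEI vocabulary.

```xml
<p>The word <hi rend="italic">ship</hi> is italicised in the source, and
   <hi style="font-variant: small-caps">London</hi> is set in small capitals.</p>
```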

Foreign Phrases

  • <foreign> word or phrase in a different language from the surrounding text

    • @xml:lang global attribute to specify the language, using a BCP 47 language tag built from ISO 639 codes (e.g. 'fr' for French)

You may disagree that 'croissant' is a foreign word.
Markup is never neutral.
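
One way the 'croissant' case might be encoded is sketched here. The sentence is invented, and tagging the word at all is exactly the kind of editorial decision at issue.

```xml
<p xml:lang="en">She bought a <foreign xml:lang="fr">croissant</foreign> on the way to work.</p>
```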

Emphasis

  • <emph> words or phrases which are emphasized for
    linguistic or rhetorical effect

    • original rendition recorded with: @rend, @rendition and @style

Quotation

The TEI distinguishes a variety of 'distinct' text enclosed in quotation marks (or indicated by other means):

  • <q> separated from the surrounding text with quotation marks, e.g. direct speech, technical term, slang etc.
  • <said> passages thought or spoken aloud
    • @direct direct or indirect speech
    • @aloud vocalized or signed speech
  • <quote> passages attributed to an external source
  • <cit> quotation from some other document, together with a bibliographic reference
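
The distinctions can be sketched as follows, using invented sentences and an invented source.

```xml
<div>
  <!-- direct speech, spoken aloud -->
  <p><said direct="true" aloud="true">Where are you going?</said> she asked.</p>
  <!-- set off by quotation marks, but not attributed speech -->
  <p>He claimed to be a <q>free spirit</q>.</p>
  <!-- a quotation together with its bibliographic reference -->
  <cit>
    <quote>An invented line quoted from elsewhere.</quote>
    <bibl>A. N. Author, <title>An Invented Title</title></bibl>
  </cit>
</div>
```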

Simple Editorial Changes

  • The core module provides some phrase-level elements which may be used to record simple editorial interventions.
  • <choice> groups alternative encodings for the same point in a text
    • Abbreviations:
      • <abbr> abbreviated form
      • <expan> expanded form
    • Errors:
      • <sic> apparent error
      • <corr> corrected error
    • Regularization:
      • <orig> original form
      • <reg> regularized form

Abbreviation and Expansion

You can also mark the abbreviation marker itself (<am>) and the letters supplied in an expansion (<ex>); both elements come from the transcr module rather than core.
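
A minimal sketch with an invented sentence: <choice> pairs the abbreviated and expanded forms, and <ex> marks the letters the editor supplied. Using <ex> assumes a schema that includes the transcr module.

```xml
<p>A letter from
   <choice><abbr>Dr</abbr><expan>D<ex>octo</ex>r</expan></choice>
   Smith arrived today.</p>
```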

Emendation and Correction
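
As a sketch with an invented sentence, an apparent error and its correction can be recorded side by side inside <choice>:

```xml
<p>We were glad to
   <choice><sic>recieve</sic><corr>receive</corr></choice>
   your letter.</p>
```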

Regularisation
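
Similarly, an original spelling and a regularized form can be paired inside <choice>. The spelling below is an invented early-modern-style example.

```xml
<p>He studied at the
   <choice><orig>vniuersitie</orig><reg>university</reg></choice>
   for three years.</p>
```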

Addition, Deletion, and Omissions

  • <add> addition to the text

  • <del> letter, word or phrase marked as deleted in the text

  • <unclear> illegible or inaudible passage which cannot be read with confidence

  • <gap> indicates a point where material is omitted
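
A sketch of these four elements in one invented sentence follows. The @rend, @place and @reason values are illustrative choices, not prescribed ones.

```xml
<p>I will come on <del rend="strikethrough">Monday</del>
   <add place="above">Tuesday</add> with my
   <unclear reason="faded">brother</unclear> and
   <gap reason="illegible" quantity="4" unit="chars"/>.</p>
```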

Names

  • <name> a proper noun or noun phrase

  • <rs> a string referring to some person, place, object, etc.

    • @type attribute specifies the type of the name in more detail

Note: Including the namesdates module gives many more name elements (for personal, place, organisational, and geographic names).
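
Using only the core elements, a sentence with invented names might be tagged like this. The @type values are project-defined assumptions.

```xml
<p><name type="person">Mary Smith</name> wrote to
   <rs type="person">her old tutor</rs> from
   <name type="place">Anytown</name>.</p>
```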

Addresses

Elements to distinguish postal and electronic addresses

  • <address> contains a postal address

  • <email> contains an email address

  • <addrLine> a non-specific address line

  • <street> a full street address

  • <postCode> a postal or zip code

  • <postBox> a postal box number

  • <name> can also be used within address
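
An entirely fictional address, sketched with these elements:

```xml
<p>Write to
  <address>
    <name>Example Press</name>
    <street>1 Imaginary Lane</street>
    <postBox>P.O. Box 42</postBox>
    <addrLine>Anytown</addrLine>
    <postCode>AB1 2CD</postCode>
  </address>
  or email <email>editor@example.org</email>.</p>
```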

Numbers and Measures

  • <num> a number of any sort, written in any form
    • @type and @value
  • <measure> marks a quantity and/or commodity
    • @type, @unit, @quantity, @commodity
  • <measureGrp> a group of dimensional specifications
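
A sketch with invented quantities shows how the attributes carry a normalized value alongside the wording of the source:

```xml
<p>He bought <num type="cardinal" value="21">twenty-one</num> books and
   <measure type="weight" quantity="2" unit="kg" commodity="flour">two kilos of flour</measure>.</p>
```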

Dates and Times

  • <date> contains a date in any format
    • @when contains the regularized form: YYYY-MM-DD
    • @calendar to specify the calendar system
  • <time> contains a time of day in any format
    • @when contains the regularized form: HH:MM:SS

 

(More attributes added if the namesdates module is loaded)  
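
For example, with an invented date and time, the element content keeps the wording of the source while @when holds the regularized form:

```xml
<p>Written on <date when="1850-03-14">the 14th of March 1850</date>
   at <time when="15:30:00">half past three in the afternoon</time>.</p>
```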

Links and Cross References

  • <ptr> defines a pointer to another location

  • <ref> defines a reference to another location with an
    optional linking text

    • @target taking a URI reference

While <ref> provides link text (though not all references are hyperlinks), <ptr/> is only used for pointers. 
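
The contrast can be sketched as follows. The internal target #ch01 is a hypothetical identifier assumed to exist elsewhere in the same document.

```xml
<p>See <ref target="#ch01">the first chapter</ref> and the slides
   at <ptr target="http://slides.com/jamescummings/teistructure-core"/>.</p>
```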

Lists

  • <list>  (a sequence of items forming a list)
  • <item>  (one component of a list)
  • <label>  (label associated with an item)
  • <headLabel>  (heading for a column of labels)
  • <headItem>  (heading for a column of items)
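
A glossary-style list using all five elements might look like this sketch. The column headings are invented and the descriptions reuse the slide's own wording.

```xml
<list type="gloss">
  <headLabel>Element</headLabel>
  <headItem>Purpose</headItem>
  <label>list</label>
  <item>a sequence of items forming a list</item>
  <label>item</label>
  <item>one component of a list</item>
</list>
```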

Graphics

  • <graphic> location of an inline graphic, illustration or figure
  • <binaryObject> binary data embedding graphics or other objects
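
A minimal sketch of an inline graphic follows. The file path is hypothetical.

```xml
<p>The chapter opens with a decorated initial
   <graphic url="images/initial-A.png"/> in red ink.</p>
```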

Bibliographies

  • <bibl> a structured or unstructured bibliographic entry

    • <author>, <editor>, <title>, <pubPlace>, <publisher>, <date>, etc. for further structuring

  • <biblStruct> a structured bibliographic entry
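
The same invented citation is sketched below twice: once as a loosely structured <bibl>, which keeps the punctuation of the source, and once as a <biblStruct>, which imposes a fixed structure.

```xml
<listBibl>
  <bibl>
    <author>A. N. Author</author>, <title>An Invented Title</title>
    (<pubPlace>Oxford</pubPlace>: <publisher>Example Press</publisher>,
    <date when="1850">1850</date>).
  </bibl>
  <biblStruct>
    <monogr>
      <author>A. N. Author</author>
      <title>An Invented Title</title>
      <imprint>
        <pubPlace>Oxford</pubPlace>
        <publisher>Example Press</publisher>
        <date when="1850">1850</date>
      </imprint>
    </monogr>
  </biblStruct>
</listBibl>
```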

     

     

Verse

  • <lg> a formal unit (e.g. stanza) containing one or more verse lines

  • <l> contains a single verse line

The verse module extends this with more elements for metrical analysis.

Drama

  • <sp> an individual speech in a performance text, or passages presented as such in prose or verse text
  • <speaker> provides the name of one or more speakers in a dramatic text
  • <stage> provides stage directions within a dramatic text

The drama module extends this with more elements for dramatic structures like cast lists.
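
Using only the core elements, a short invented scene could be sketched like this. The speaker names and the @type value on <stage> are assumptions made for illustration.

```xml
<div type="scene" n="1">
  <stage type="entrance">Enter two hypothetical characters.</stage>
  <sp>
    <speaker>First Speaker</speaker>
    <l>An invented line of verse spoken aloud.</l>
  </sp>
  <sp>
    <speaker>Second Speaker</speaker>
    <p>A reply in prose.</p>
  </sp>
</div>
```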

Elements in TextStructure and Core

A workshop presentation on the TEI Structure and Basic Core Components
