TEI Structure & Basic Core Components
Dr James Cummings
@jamescummings
http://slides.com/jamescummings/tei-structure-core-mca
Thanks as ever to many members of the TEI Community
TEI Document Structure(s)
A TEI document is represented by means of:
- the root <TEI> element, which contains both data and metadata, or
- a sequence of <TEI> elements combined to form a <teiCorpus> element.
- A TEI file may represent a document of any size, form, or complexity, from a postcard to a multi-volume encyclopedia
- A TEI corpus may be a collection of TEI files, a library catalogue, or a linguistic corpus
Each element could represent a collection of encoded texts, versions of a text, samples from a language corpus, etc.
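As a rough illustration of the two shapes described above, here is a minimal sketch of a <teiCorpus> wrapping two <TEI> documents. The titles and text are invented placeholders; the metadata lives in <teiHeader> and the data in <text>.

```xml
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <!-- metadata shared by the whole corpus -->
  <teiHeader>
    <fileDesc>
      <titleStmt><title>An invented sample corpus</title></titleStmt>
      <publicationStmt><p>Unpublished workshop example.</p></publicationStmt>
      <sourceDesc><p>Born digital.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <!-- each TEI element carries its own metadata (teiHeader) and data (text) -->
  <TEI>
    <teiHeader>
      <fileDesc>
        <titleStmt><title>First sample text</title></titleStmt>
        <publicationStmt><p>Unpublished workshop example.</p></publicationStmt>
        <sourceDesc><p>Born digital.</p></sourceDesc>
      </fileDesc>
    </teiHeader>
    <text>
      <body><p>Text of the first document.</p></body>
    </text>
  </TEI>
  <TEI>
    <teiHeader>
      <fileDesc>
        <titleStmt><title>Second sample text</title></titleStmt>
        <publicationStmt><p>Unpublished workshop example.</p></publicationStmt>
        <sourceDesc><p>Born digital.</p></sourceDesc>
      </fileDesc>
    </teiHeader>
    <text>
      <body><p>Text of the second document.</p></body>
    </text>
  </TEI>
</teiCorpus>
```

A single document is simply one of the inner <TEI> elements used as the root, with the namespace declaration moved onto it.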
Front Matter
- The front matter <front> contains distinct sections of a text, typically <div> elements distinguished by @type, e.g.:
- 'preface' a foreword or preface addressed to the reader
- 'ack' a declaration of acknowledgement by the author
- 'dedication' a formal dedication to one or more persons
- 'abstract' a summary of the content
- 'contents' a table of contents
- 'frontispiece' a pictorial frontispiece, possibly including some text
Because cultural conventions differ as to which elements are grouped as front matter and which as back matter, the content models for the front and back elements are identical.
- <titlePage> the title page of a text, appearing within the front or back matter
- <docTitle> the title of the document
- <titlePart> subsection or division of the title of a work
- @type specifies the role: e.g. main, sub, alt, short, desc
- <docAuthor> the name of the author
- <docImprint> the imprint statement (place, date, publisher)
- <docDate> the date of the document
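A possible front matter combining the elements above might look like the following sketch. The title, author, and imprint are invented for illustration, and the @type values on <div> are taken from the list of front matter sections.

```xml
<front>
  <titlePage>
    <docTitle>
      <titlePart type="main">An Invented Title</titlePart>
      <titlePart type="sub">Being an Example for a Workshop</titlePart>
    </docTitle>
    <docAuthor>A. N. Author</docAuthor>
    <docImprint>Oxford: Example Press,
      <docDate when="1900">1900</docDate></docImprint>
  </titlePage>
  <div type="dedication">
    <p>To the patient reader.</p>
  </div>
  <div type="preface">
    <head>Preface</head>
    <p>A few words addressed to the reader.</p>
  </div>
</front>
```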
Back Matter
- 'appendix' an ancillary self-contained section of a work
- 'glossary' a list of terms associated with definition texts (<list type="gloss">)
- 'notes' a section where notes are gathered together
- 'bibliogr' list of bibliographic citations (<listBibl>)
- 'index' any form of index of the work
- 'colophon' statement describing the physical production of the work
- and many more...
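As a sketch of how these section types might be used, the back matter below holds a glossary and a bibliography. The content is invented; the @type values come from the list above.

```xml
<back>
  <div type="glossary">
    <head>Glossary</head>
    <list type="gloss">
      <label>codex</label>
      <item>a manuscript in book form</item>
      <label>quire</label>
      <item>a gathering of folded leaves</item>
    </list>
  </div>
  <div type="bibliogr">
    <head>Bibliography</head>
    <listBibl>
      <!-- an invented entry, for illustration only -->
      <bibl>A. N. Author, <title>An Invented Title</title>, 1900.</bibl>
    </listBibl>
  </div>
</back>
```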
For Example: Previously Printed Index
Global Attributes
Some features (potentially) apply to everything, so the attributes of the class att.global can appear on every TEI element:
- @xml:id provides a unique identifier for any element
- @n provides a number or name for an element (not unique)
- @xml:base provides a base URI reference for resolving relative URIs
- @xml:lang specifies the language of any element, using a BCP 47 language tag, typically an ISO 639 code (e.g. 'en', 'la')
- @xml:space specifies how whitespace should be managed by applications
- @rend, @style and @rendition provide ways of specifying the visual appearance (rendition) of any element (att.global.rendition)
- @resp points to the agency responsible; @cert for certainty
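A small sketch of several global attributes in use follows. The identifier values and the pointer #ed1 are hypothetical; #ed1 is assumed to point at a description of the responsible editor elsewhere in the document, and the @rend value is a project-defined label.

```xml
<div xml:id="letter-001" n="1" type="letter" xml:lang="en">
  <!-- rend holds a project-defined description of appearance -->
  <p rend="indent">I remain, as ever,
    <foreign xml:lang="la">semper fidelis</foreign>.</p>
  <!-- resp and cert record who made the correction and how sure they are -->
  <p>We <corr resp="#ed1" cert="high">received</corr> your note.</p>
</div>
```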
Inside the <body>
Text sequences are grouped hierarchically into textual divisions and subdivisions by means of nested <div> elements.
- Use the @type attribute to distinguish different kinds of <div> divisions
- Epic, Bible → book
- Report → part, section
- Novel → chapter
- Drama → acts, scenes
- Reference book → sections
- Diary → entries
- Newspaper → sections, issues
- and possibly @n to provide a name or number
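The nesting described above could be sketched as follows, using an invented two-level structure with @type and @n on each division.

```xml
<body>
  <div type="book" n="1">
    <head>Book the First</head>
    <div type="chapter" n="1">
      <head>Chapter 1</head>
      <p>The first chapter begins here.</p>
    </div>
    <div type="chapter" n="2">
      <head>Chapter 2</head>
      <p>The second chapter begins here.</p>
    </div>
  </div>
</body>
```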
Components of a <div>
What do divisions contain (apart from other divisions)?
- Headings, tagged with <head>
- Prose, which may be organized as a sequence of paragraphs <p>
- Poetry, divided into metrical lines <l>, optionally grouped into stanzas <lg>
- Drama, divided into speeches <sp>, containing an optional speaker label <speaker>, followed by a mix of <p> or <l> elements, optionally mixed up with stage directions <stage>
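To make the components concrete, here is a sketch of one division holding a heading, a prose paragraph, and a short stanza. All of the text is invented.

```xml
<div type="section" n="1">
  <head>A Heading</head>
  <p>A prose paragraph introducing the verse that follows.</p>
  <lg type="stanza">
    <l>The first line of an invented stanza,</l>
    <l>the second line that follows on,</l>
    <l>a third to keep the others company,</l>
    <l>and then a fourth, and it is gone.</l>
  </lg>
</div>
```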
Original Layout Information
Within the <text> element the logical view is privileged, but the physical view can be encoded as well through 'empty' elements:
- <pb/> marks the start of a new page
- <cb/> marks the start of a new column
- <lb/> marks the start of a new line
- <gb/> marks the start of a new gathering
and for other forms of milestone:
- <milestone/> marks a boundary point between sections of a text that are not represented by structural elements.
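A sketch of these empty elements inside a logical paragraph is given below. The page and line numbers and the @unit value are invented; note that <milestone/> needs a @unit to say what kind of boundary it marks.

```xml
<p>
  <pb n="12"/>
  <lb n="1"/>The first line as it appears on the page
  <lb n="2"/>and the second line, which continues
  <pb n="13"/>
  <lb n="1"/>over the page break.
  <!-- a boundary in some other reference system, here a hypothetical "section" -->
  <milestone unit="section" n="2"/>
  <lb n="2"/>The paragraph carries on regardless.
</p>
```

The paragraph stays intact as a logical unit even though it spans two pages.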
Paragraphs
A paragraph is a significant organizational unit for all prose texts
- <p> marks paragraphs in prose
- <p> can contain all the phrase-level elements in the core module
- Phrase-level elements must be entirely contained within a paragraph
- Inter-level elements can appear either within a paragraph or between paragraphs (e.g. lists, bibliographic citations, etc.)
- Chunks (e.g. paragraphs, anonymous blocks) appear directly within divisions
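The difference between the levels can be sketched with <list>, an inter-level element, appearing once inside a paragraph and once between paragraphs. The content is invented.

```xml
<div>
  <p>The basket held
    <list>
      <item>bread,</item>
      <item>cheese, and</item>
      <item>a bottle of wine.</item>
    </list>
    Nothing else was needed.</p>
  <!-- the same element used between paragraphs -->
  <list>
    <item>First point</item>
    <item>Second point</item>
  </list>
  <p>A closing paragraph.</p>
</div>
```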
Highlighting
<hi> word or phrase which is graphically distinct from the surrounding text
- @rend specifies the visual appearance; the values are defined by each project
- @style and @rendition describe the rendition using external standards such as CSS
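For instance, highlighting might be recorded in either of the two ways sketched here. The @rend value "italic" is an assumed project-defined label, while @style holds a CSS declaration.

```xml
<p>The word <hi rend="italic">graphically</hi> is set in italics,
  and <hi style="font-variant: small-caps">this phrase</hi> in small capitals.</p>
```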
Foreign Phrases
- <foreign> word or phrase in a language different from that of the surrounding text
- @xml:lang global attribute to specify the language, using a BCP 47 language tag, typically an ISO 639 code (e.g. 'fr', 'la')
Emphasis
- <emph> words or phrases which are emphasized for linguistic or rhetorical effect
- original rendition recorded with @rend, @rendition and @style
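A short sketch of <foreign> and <emph> in running prose follows. The sentences are invented and the @rend value is an assumed project-defined label.

```xml
<p>She had a certain <foreign xml:lang="fr">je ne sais quoi</foreign>
  that nobody could explain.</p>
<p>I <emph rend="italic">really</emph> do mean it.</p>
```

The choice between <hi> and <emph> depends on whether the encoder only records that text looks different or also asserts why.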
Quotation
The TEI distinguishes several kinds of text set off by quotation marks (or indicated by other means):
- <q> separated from the surrounding text with quotation marks, e.g. direct speech, technical term, slang etc.
- <said> passages thought or spoken aloud
- @direct direct or indirect speech
- @aloud vocalized or signed speech
- <quote> passages attributed to an external source
- <cit> quotation from some other document, together with a bibliographic reference
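These elements might be combined as in the sketch below. The narrative sentences are invented; the quoted sentence is the opening of Pride and Prejudice.

```xml
<p>She looked up.
  <said direct="true" aloud="true">Is anyone there?</said>
  <said direct="true" aloud="false">Probably just the wind</said>,
  she thought.</p>
<p>It was what the neighbours called a <q>proper</q> storm.</p>
<cit>
  <quote>It is a truth universally acknowledged, that a single man in
    possession of a good fortune, must be in want of a wife.</quote>
  <bibl><author>Jane Austen</author>, <title>Pride and Prejudice</title></bibl>
</cit>
```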
Simple Editorial Changes
- The core module provides some phrase-level elements which may be used to record simple editorial interventions.
- <choice> groups alternative encodings for the same point in a text
- Abbreviations: <abbr> abbreviated form, <expan> expanded form
- Errors: <sic> apparent error, <corr> corrected error
- Regularization: <orig> original form, <reg> regularized form
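Each pair above can be wrapped in <choice> so that both readings are kept, as in this sketch with invented text.

```xml
<p>
  <choice><abbr>Dr</abbr><expan>Doctor</expan></choice> Smith
  <choice><sic>recieved</sic><corr>received</corr></choice> the
  <choice><orig>lettre</orig><reg>letter</reg></choice> yesterday.
</p>
```

Software processing the file can then choose to display either the source form or the editorially preferred one.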
Regularisation
Emendation and Correction
versus
Addition, Deletion, and Omissions
- <add> addition to the text
- <del> letter, word or phrase marked as deleted in the text
- <unclear> illegible or inaudible passage which cannot be read with confidence
- <gap> indicates a point where material is omitted
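A sketch of these four elements in one invented sentence is given below. The @rend, @place, and @reason values are assumptions of the kind a project would define or select for itself.

```xml
<p>The <del rend="strikethrough">quick</del>
  <add place="above">slow</add> brown fox
  <unclear reason="faded">jumps</unclear> over
  <gap reason="illegible" quantity="2" unit="words"/> dog.</p>
```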
Names
- <name> a proper noun or noun phrase
- <rs> a string referring to some person, place, object, etc.
- @type attribute specifies the type of the name in more detail
- Note: Including the namesdates module gives many more name elements (for personal, place, organisational, and geographic names).
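The distinction between <name> and <rs> could be sketched like this. The @type values "person" and "place" are assumed project conventions.

```xml
<p><name type="person">Jane Austen</name> was born in
  <name type="place">Steventon</name>.
  <rs type="person">The novelist</rs> later moved to
  <rs type="place">a cottage in the same county</rs>.</p>
```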
Addresses
Elements to distinguish postal and electronic addresses
- <address> contains a postal address
- <email> contains an email address
- <addrLine> a non-specific address line
- <street> a full street address
- <postCode> a postal or zip code
- <postBox> a postal box number
- <name> can also be used within address
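An address may be encoded loosely, line by line, or with more specific elements, as in this sketch. All address details, including the email address, are invented.

```xml
<!-- loosely structured: one addrLine per line of the address -->
<address>
  <addrLine>Example House</addrLine>
  <addrLine>1 Sample Street</addrLine>
  <addrLine>Oxford</addrLine>
</address>

<!-- more specific elements, with name for the town and country -->
<address>
  <street>1 Sample Street</street>
  <postBox>PO Box 42</postBox>
  <name type="place">Oxford</name>
  <postCode>OX0 0XX</postCode>
  <name type="country">United Kingdom</name>
</address>

<p>Write to <email>editor@example.org</email> for details.</p>
```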
Dates and Times
- <date> contains a date in any format
- @when contains the regularized form: YYYY-MM-DD
- @calendar to specify the calendar system
- <time> contains a time of day in any format
- @when contains the regularized form: HH:MM:SS
(More attributes added if the namesdates module is loaded)
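For example, a date and time written out in words can still carry a machine-readable value in @when. The sentence is invented.

```xml
<p>Written on <date when="1850-04-07">the seventh of April, 1850</date>
  at <time when="14:30:00">half past two in the afternoon</time>.</p>
```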
Lists
- <list> (a sequence of items forming a list)
- <item> (one component of a list)
- <label> (label associated with an item)
- <headLabel> (heading for a column of labels)
- <headItem> (heading for a column of items)
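The list elements above could be combined as in this sketch of a plain list followed by a labelled list with column headings. The content is invented.

```xml
<list>
  <item>a first item</item>
  <item>a second item</item>
</list>

<list type="gloss">
  <headLabel>Element</headLabel>
  <headItem>Meaning</headItem>
  <label>pb</label>
  <item>start of a new page</item>
  <label>lb</label>
  <item>start of a new line</item>
</list>
```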
Bibliographies
- <bibl> a structured or unstructured bibliographic entry
- <title>, <editor>, <pubPlace>, <publisher>, <date>, etc. for further structuring
- <biblStruct> a structured bibliographic entry
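The same invented reference is sketched below first as a loosely structured <bibl> and then as a <biblStruct>, where the parts must appear in a fixed arrangement inside <monogr> and <imprint>.

```xml
<listBibl>
  <bibl>A. N. Author, <title>An Invented Title</title>
    (<pubPlace>Oxford</pubPlace>: <publisher>Example Press</publisher>,
    <date when="1900">1900</date>).</bibl>
  <biblStruct>
    <monogr>
      <author>A. N. Author</author>
      <title>An Invented Title</title>
      <imprint>
        <pubPlace>Oxford</pubPlace>
        <publisher>Example Press</publisher>
        <date when="1900">1900</date>
      </imprint>
    </monogr>
  </biblStruct>
</listBibl>
```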
Verse
- <lg> a formal unit (e.g. stanza) containing one or more verse lines
- <l> contains a single verse line
The verse module extends this with more elements for metrical analysis.
Drama
- <sp> an individual speech in a performance text, or passages presented as such in prose or verse text
- <speaker> provides the name of one or more speakers in a dramatic text
- <stage> provides stage directions within a dramatic text
The drama module extends this with more elements for dramatic structures like cast lists.
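A short scene using only the core elements might be sketched as follows. The dialogue is invented and the @type value on <stage> is an assumed label.

```xml
<div type="scene" n="1">
  <head>Scene 1</head>
  <stage type="entrance">Enter two servants.</stage>
  <sp>
    <speaker>First Servant</speaker>
    <p>Who goes there?</p>
  </sp>
  <sp>
    <speaker>Second Servant</speaker>
    <l>A friend, who brings you news from town,</l>
    <stage>Aside</stage>
    <l>though none of it is good.</l>
  </sp>
</div>
```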
TEI Text Structure Module Elements:
TEI Core Module Elements:
abbr add addrLine address analytic author bibl biblScope biblStruct binaryObject cb choice cit citedRange corr date del desc distinct divGen
editor email emph expan foreign gap gb gloss graphic head headItem
headLabel hi imprint index item l label lb lg list listBibl measure measureGrp
media meeting mentioned milestone monogr name note num orig p pb
postBox postCode ptr pubPlace publisher q quote ref reg relatedItem resp
respStmt rs said series sic soCalled sp speaker stage street teiCorpus term
textLang time title unclear
TEI Structure and Basic Core Components - MCA
By James Cummings
A workshop presentation on the TEI Structure and Basic Core Components