TEI Structure
&
Basic Core Components
James Cummings
@jamescummings
http://slides.com/jamescummings/teistructure-core
Thanks as ever to many members of the TEI Community
TEI Structure
(How TEI documents are structured)
Where can I read about this topic?
- Chapters:
- 4 -- Default Text Structure (textstructure)
- 3 -- Elements Available in all TEI documents (core)
- The TEI takes a generalist approach and should be able to cope with texts
- … of any size
- … in any language and writing system
- … of any complexity
- … on all media
- … from every time and place
- such as books, journals, manuscripts, letters, rolls of papyrus, coins, notebooks, postcards, inscription tablets, web pages, etc.
TEI Document Structure(s)
A TEI document is represented by means of:
- the root <TEI> element, which contains both data and metadata, or
- a <teiCorpus> element, which combines a sequence of <TEI> elements.
- The TEI file may represent a document of any size, form, or complexity, from a postcard to a multi-volume encyclopedia
Each <TEI> element could represent a collection of encoded texts, versions of a text, samples from a language corpus, etc.
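A minimal sketch of a single <TEI> document, with the metadata in <teiHeader> and the data in <text>. The title and all text content are invented for illustration.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <!-- metadata about the encoded text -->
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A hypothetical postcard</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished sample, for illustration only.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Invented example; there is no real source.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <!-- the data: the encoded text itself -->
  <text>
    <body>
      <p>Wish you were here.</p>
    </body>
  </text>
</TEI>
```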
<text> can be unitary or composite
- A <text> may be
- Unitary, forming an organic whole
- Composite, consisting of several components which are in some important sense independent of each other
- A unitary text contains:
- <front> optional, contains any prefatory matter, found at the start of a document (titlepage, preface, etc.)
- <body> mandatory, contains the whole body of a single text
- <back> optional, back matter containing appendixes following the main part
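A unitary text might be sketched like this. The @type values and all content are placeholders, not taken from a real document.

```xml
<text>
  <!-- optional prefatory matter -->
  <front>
    <div type="preface">
      <head>Preface</head>
      <p>Prefatory matter goes here.</p>
    </div>
  </front>
  <!-- mandatory body of the single text -->
  <body>
    <div type="chapter" n="1">
      <head>Chapter 1</head>
      <p>The main text goes here.</p>
    </div>
  </body>
  <!-- optional back matter -->
  <back>
    <div type="appendix">
      <head>Appendix</head>
      <p>Ancillary material goes here.</p>
    </div>
  </back>
</text>
```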
Composite <text>
- A composite text contains:
- optional <front>, contains any prefatory matter relating to the composite
- <group> contains at least one text, grouping together distinct texts
- optional <back>, back matter relating to the composite
A group may contain sub-groups
represented by nested <group> elements.
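A composite text with a nested sub-group could look roughly like this. The content is invented and kept deliberately tiny.

```xml
<text>
  <front>
    <div type="preface">
      <p>Preface to the whole collection.</p>
    </div>
  </front>
  <group>
    <!-- a distinct text within the collection -->
    <text>
      <body>
        <p>First independent text.</p>
      </body>
    </text>
    <!-- a sub-group of related texts -->
    <group>
      <text>
        <body>
          <p>Second text, first of a pair.</p>
        </body>
      </text>
      <text>
        <body>
          <p>Third text, second of the pair.</p>
        </body>
      </text>
    </group>
  </group>
  <back>
    <div type="notes">
      <p>Notes relating to the whole collection.</p>
    </div>
  </back>
</text>
```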
Front Matter
- The front matter <front> may contain distinct sections of a text, typically <div> elements with a @type such as:
- 'preface' a foreword or preface addressed to the reader
- 'ack' a declaration of acknowledgement by the author
- 'dedication' a formal dedication to one or more persons
- 'abstract' a summary of the content
- 'contents' a table of contents
- 'frontispiece' pictorial frontispiece, possibly including a text
Because cultural conventions differ as to which elements are grouped as front matter and which as back matter, the content models for the front and back elements are identical.
- <titlePage> the title page of a text, appearing within the front or back matter
- <docTitle> the title of the document
- <titlePart> subsection or division of the title of a work
- @type specifies the role: e.g. main, sub, alt, short, desc
- <docAuthor> the name of the author
- <docImprint> the imprint statement (place, date, publisher)
- <docDate> the date of the document
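A sketch of a title page using these elements. Title, author, and imprint details are all invented.

```xml
<front>
  <titlePage>
    <docTitle>
      <titlePart type="main">An Invented Treatise</titlePart>
      <titlePart type="sub">With Hypothetical Observations</titlePart>
    </docTitle>
    <docAuthor>A. N. Author</docAuthor>
    <docImprint>
      <pubPlace>Oxford</pubPlace>
      <publisher>Example Press</publisher>
      <docDate when="1850">1850</docDate>
    </docImprint>
  </titlePage>
</front>
```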
Back Matter
- 'appendix' an ancillary self-contained section of a work
- 'glossary' a list of terms associated with definition texts (<list type="gloss">)
- 'notes' a section where notes are gathered together
- 'bibliogr' list of bibliographic citations (<listBibl>)
- 'index' any form of index of the work
- 'colophon' statement describing the physical production of the work
- and many more...
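Back matter sections of this kind might be encoded as typed <div> elements, for example as below. The appendix text and the bibliographic entry are invented.

```xml
<back>
  <div type="appendix">
    <head>Appendix A</head>
    <p>An ancillary, self-contained section.</p>
  </div>
  <div type="bibliogr">
    <head>Bibliography</head>
    <listBibl>
      <bibl>A. N. Author, <title>An Invented Monograph</title>, 1999.</bibl>
    </listBibl>
  </div>
</back>
```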
Previously Printed Index
Of course, there are elements like <index> for marking index entries in the body of a text to enable auto-generation of a detailed index
Global Attributes
Some features (potentially) apply to everything, therefore members of the attribute class att.global can appear in every TEI element:
- @xml:id provides a unique identifier for any element
- @n provides a number or name for an element (not unique)
- @xml:base provides a base URI reference for resolving relative URIs
- @xml:lang specifies the language of any element, using a BCP 47 language tag (typically built from ISO 639 codes, e.g. en, fr)
- @xml:space specifies how whitespace should be managed by applications
- @rend, @style and @rendition provide ways of specifying the visual appearance (rendition) of any element (att.global.rendition)
- @resp points to the agency responsible; @cert for certainty
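A small sketch showing several global attributes together. The identifiers, the rend value, and the #ed1 pointer are hypothetical; #ed1 would have to be defined elsewhere in the document.

```xml
<div xml:id="ch1" n="1" type="chapter" xml:lang="en">
  <!-- rend values are defined by each project -->
  <head rend="centred">Chapter One</head>
  <!-- resp points at whoever is responsible; cert records certainty -->
  <p xml:id="ch1-p1" n="1" resp="#ed1" cert="high">An invented paragraph in English.</p>
  <!-- xml:lang can be overridden on any descendant -->
  <p xml:id="ch1-p2" n="2" xml:lang="fr">Un paragraphe inventé en français.</p>
</div>
```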
Inside the <body>
Hierarchical grouping of text sequences into textual divisions and subdivisions by means of nested <div> elements.
- Use of the @type attribute to distinguish different kinds of divisions, e.g.
- Epic, Bible → book
- Report → part, section
- Novel → chapter
- Drama → acts, scenes
- Reference book → sections
- Diary → entries
- Newspaper → sections, issues
- and possibly @n to provide a name or number of any kind
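Nested divisions for a play might be sketched as follows, with @type naming the kind of division and @n numbering it. The content is a placeholder.

```xml
<body>
  <div type="act" n="1">
    <head>Act 1</head>
    <div type="scene" n="1">
      <head>Scene 1</head>
      <p>Text of the first scene.</p>
    </div>
    <div type="scene" n="2">
      <head>Scene 2</head>
      <p>Text of the second scene.</p>
    </div>
  </div>
</body>
```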
Components of a <div>
What do divisions contain (apart from other divisions)?
- Headings, tagged with <head>
- Prose, which may be organized as a sequence of paragraphs <p>
- Poetry, divided into metrical lines <l>, optionally grouped into stanzas <lg>
- Drama, divided into speeches <sp>, containing an optional speaker label <speaker>, followed by a mix of <p> or <l> elements, optionally mixed up with stage directions <stage>
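A division mixing a heading, prose, and verse could look like this. All of the text is invented.

```xml
<div type="chapter" n="1">
  <head>An Invented Chapter</head>
  <p>A paragraph of prose introducing a short poem.</p>
  <lg type="stanza">
    <l>The first invented line of verse,</l>
    <l>followed closely by a second.</l>
  </lg>
</div>
```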
Original Layout Information
Within the <text> element the logical view is privileged, but the physical view can be encoded as well through 'empty' elements:
- <pb /> marks the start of a new page
- <cb /> marks the start of a new column
- <lb /> marks the start of a new line
- <gb /> marks the start of a new gathering
and for other forms of milestone:
- <milestone /> marks a boundary point between sections of any other kind.
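A sketch of physical breaks recorded inside a logical paragraph. The page and line numbers and the milestone unit are made up.

```xml
<p>
  <pb n="2"/>
  <lb n="1"/>The first line on the second page
  <lb n="2"/>of an invented source,
  <!-- a boundary that is neither page, column, line, nor gathering -->
  <milestone unit="section" n="B"/>
  <lb n="3"/>where a new section begins.
</p>
```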
Basic Core Components
(Things lots of documents have)
What is common to most materials?
- Identification information
- e.g. shelfmark, inventory number, page number, titles…
- Divisions and subdivisions
- Pictures, diagrams, some kind of graphical information
- A number of writing modes or registers
- e.g. prose, verse, drama…
- With formal structural units
- e.g. paragraphs, lists, stanzas, lines, speeches
- Containing textual distinctions (sometimes signalled by rendition)
- e.g. titles, headings, quotes, names…
- Metatextual indications/interventions
- e.g. deletions, additions, annotations, revisions…
The TEI core module can cope with these phenomena and more!
Paragraphs
A paragraph is a significant organizational unit for all prose texts
- <p> marks paragraphs in prose
- <p> can contain all the phrase-level elements in the core module
- Phrase-level elements must be entirely contained within a paragraph
- Inter-level elements can appear either within a paragraph or between paragraphs (e.g. lists, bibliographic citations, etc.)
- Chunks (e.g. paragraphs, anonymous blocks) appear directly within divisions, not inside other chunks
Highlighting
Typographic features are used to distinguish a passage from its surroundings, marking it as:
- distinct in some way (e.g. foreign, dialectal, technical, etc.)
- emphatic or stressed when spoken
- not part of the body of the text (e.g. title, head, label, etc.)
- distinct narrative stream (e.g. monologue, commentary, etc.)
- attributed by the narrator to some other agency (e.g. direct speech, quotation, etc.)
- set apart from the text in some other way (e.g. individual names in older texts, editorial corrections or additions, etc.)
Highlighting
<hi> word or phrase which is graphically distinct from the surrounding text
- @rend specifies the visual appearance; the values are defined by each project
- @style and @rendition describe the rendition using external standards, like CSS
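A short sketch of <hi> in use. The rend value "italic" is a hypothetical project-defined value, and the CSS in @style is only an illustration.

```xml
<p>The word <hi rend="italic">italicised</hi> and the phrase
<hi style="font-variant: small-caps">in small capitals</hi>
are graphically distinct from the surrounding text.</p>
```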
Foreign Phrases
- <foreign> word or phrase not written in the same language as the surrounding text
- @xml:lang global attribute to specify the language, using a BCP 47 language tag (typically built from ISO 639 codes)
- You may disagree that 'croissant' is a foreign word.
Markup is never neutral.
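The 'croissant' case might be encoded along these lines. The sentence is invented, and whether to tag the word at all is an editorial decision.

```xml
<p>She ordered a <foreign xml:lang="fr">croissant</foreign> and a coffee.</p>
```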
Emphasis
- <emph> words or phrases which are emphasized for linguistic or rhetorical effect
- original rendition recorded with: @rend, @rendition and @style
Quotation
The TEI distinguishes a variety of 'distinct' text enclosed in quotation marks (or indicated by other means):
- <q> separated from the surrounding text with quotation marks, e.g. direct speech, technical term, slang etc.
- <said> passages thought or spoken aloud
- @direct direct or indirect speech
- @aloud vocalized or signed speech
- <quote> passages attributed to an external source
- <cit> quotation from some other document, together with a bibliographic reference
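A sketch of the different quotation elements side by side. The sentences and the cited source are invented.

```xml
<div>
  <p><said direct="true" aloud="true">I never agreed to that,</said> she replied.
  The report calls the scheme <q>innovative</q>.</p>
  <!-- a quotation bundled with its bibliographic reference -->
  <cit>
    <quote>An invented sentence quoted from elsewhere.</quote>
    <bibl>A. N. Author, <title>An Invented Source</title>, p. 12.</bibl>
  </cit>
</div>
```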
Simple Editorial Changes
- The core module provides some phrase-level elements which may be used to record simple editorial interventions.
- <choice> groups alternative encodings for the same point in a text
- Abbreviations:
- <abbr> abbreviated form
- <expan> expanded form
- Errors:
- <sic> apparent error
- <corr> corrected error
- Regularization:
- <orig> original form
- <reg> regularized form
Abbreviation and Expansion
You can also show abbreviation markers (<am>) and expanded text (<ex>)
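A sketch of an abbreviation paired with its expansion, first plainly and then with the marker and the supplied letters picked out. The sentence is invented; note that <am> and <ex> may need a module beyond core to be available in your schema.

```xml
<p>Addressed to
<choice>
  <abbr>Dr.</abbr>
  <expan>Doctor</expan>
</choice> Smith, and again to
<choice>
  <!-- the full stop acts as the abbreviation marker -->
  <abbr>Dr<am>.</am></abbr>
  <!-- the letters supplied by the editor -->
  <expan>D<ex>octo</ex>r</expan>
</choice> Jones.</p>
```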
Emendation and Correction
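A minimal sketch of an apparent error paired with its correction. The sentence is invented.

```xml
<p>The cat sat on the
<choice>
  <sic>matt</sic>
  <corr>mat</corr>
</choice>.</p>
```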
Regularisation
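A minimal sketch of an original spelling paired with a regularized form. The phrase is invented.

```xml
<p>A
<choice>
  <orig>vertuous</orig>
  <reg>virtuous</reg>
</choice> knight rode by.</p>
```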
Addition, Deletion, and Omissions
- <add> addition to the text
- <del> letter, word or phrase marked as deleted in the text
- <unclear> illegible or inaudible passage which cannot be read with confidence
- <gap> indicates a point where material is omitted
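These four elements might appear together like this. The sentence and all attribute values are illustrative assumptions.

```xml
<p>The meeting is on <del rend="strikethrough">Monday</del>
<add place="above">Tuesday</add>
at <unclear reason="faded">noon</unclear>.
<!-- one line could not be transcribed at all -->
<gap reason="illegible" quantity="1" unit="line"/></p>
```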
Names
- <name> a proper noun or noun phrase
- <rs> a string referring to some person, place, object, etc.
- @type attribute specifies the type of the name in more detail
- Note: Including the namesdates module gives many more name elements (for personal, place, organisational, and geographic names).
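A sketch contrasting <name> with <rs>. The person, place, and @type values are invented.

```xml
<p><name type="person">Mary Smith</name> wrote to
<rs type="person">her old tutor</rs> from
<name type="place">Sampletown</name>.</p>
```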
Addresses
Elements to distinguish postal and electronic addresses
- <address> contains a postal address
- <email> contains an email address
- <addrLine> a non-specific address line
- <street> a full street address
- <postCode> a postal or zip code
- <postBox> a postal box number
- <name> can also be used within address
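A sketch of a postal and an electronic address. Every detail here is invented.

```xml
<p>
  <address>
    <name type="org">Example Library</name>
    <street>1 Invented Street</street>
    <postBox>PO Box 123</postBox>
    <name type="place">Sampletown</name>
    <postCode>AB1 2CD</postCode>
  </address>
  <email>reader@example.org</email>
</p>
```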
Numbers and Measures
- <num> a number of any sort, written in any form
- @type and @value
- <measure> marks a quantity and/or commodity
- @type, @unit, @quantity, @commodity
- <measureGrp> a group of dimensional specifications
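A sketch of a number and a measure with normalized values in attributes. The sentence and values are invented.

```xml
<p>She bought <num type="cardinal" value="12">a dozen</num> eggs and
<measure type="weight" quantity="2" unit="kg" commodity="flour">two kilos of flour</measure>.</p>
```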
Dates and Times
- <date> contains a date in any format
- @when contains the regularized form: YYYY-MM-DD
- @calendar to specify the calendar system
- <time> contains a time of day in any format
- @when contains the regularized form: HH:MM:SS
(More attributes added if the namesdates module is loaded)
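A sketch of a date and a time written out in words, with the regularized forms in @when. The date and time are invented.

```xml
<p>We met on <date when="1850-03-04">the fourth of March 1850</date>
at <time when="14:30:00">half past two in the afternoon</time>.</p>
```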
Links and Cross References
- <ptr> defines a pointer to another location
- <ref> defines a reference to another location with an optional linking text
- @target taking a URI reference
- While <ref> provides link text (though not all references are hyperlinks), <ptr/> is only used for pointers.
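A sketch contrasting the two. The #ch1 target is a hypothetical xml:id assumed to exist elsewhere in the same document.

```xml
<p>See <ref target="#ch1">the first chapter</ref> for details,
or consult the TEI website <ptr target="https://tei-c.org/"/>.</p>
```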
Lists
- <list> (a sequence of items forming a list)
- <item> (one component of a list)
- <label> (label associated with an item)
- <headLabel> (heading for a column of labels)
- <headItem> (heading for a column of items)
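A labelled list with column headings might be sketched as follows. The terms and definitions are only illustrative.

```xml
<list type="gloss">
  <headLabel>Term</headLabel>
  <headItem>Meaning</headItem>
  <label>recto</label>
  <item>The front of a leaf.</item>
  <label>verso</label>
  <item>The back of a leaf.</item>
</list>
```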
Graphics
- <graphic> location of an inline graphic, illustration or figure
- <binaryObject> binary data embedding graphics or other objects
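A minimal sketch of an inline graphic. The file path is hypothetical.

```xml
<p>The device on the title page
<graphic url="images/device.png"/>
is repeated at the end of the volume.</p>
```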
Bibliographies
- <bibl> a structured or unstructured bibliographic entry
- <author>, <editor>, <title>, <pubPlace>, <publisher>, <date>, etc. for further structuring
- <biblStruct> a structured bibliographic entry
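The same invented reference might be sketched both ways: loosely in <bibl>, and with fixed structure in <biblStruct>.

```xml
<listBibl>
  <!-- loose: elements and punctuation mixed freely -->
  <bibl><author>A. N. Author</author>, <title>An Invented Monograph</title>
    (<pubPlace>Oxford</pubPlace>: <publisher>Example Press</publisher>,
    <date when="1999">1999</date>).</bibl>
  <!-- structured: a fixed hierarchy and no free punctuation -->
  <biblStruct>
    <monogr>
      <author>A. N. Author</author>
      <title>An Invented Monograph</title>
      <imprint>
        <pubPlace>Oxford</pubPlace>
        <publisher>Example Press</publisher>
        <date when="1999">1999</date>
      </imprint>
    </monogr>
  </biblStruct>
</listBibl>
```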
Verse
- <lg> a formal unit (e.g. stanza) containing one or more verse lines
- <l> contains a single verse line
The verse module extends this with more elements for metrical analysis.
Drama
- <sp> an individual speech in a performance text, or passages presented as such in prose or verse text
- <speaker> provides the name of one or more speakers in a dramatic text
- <stage> provides stage directions within a dramatic text
The drama module extends this with more elements for dramatic structures like cast lists.
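A short scene using only these core elements could look like this. The speakers, lines, and stage directions are invented.

```xml
<div type="scene" n="1">
  <stage type="entrance">Enter two invented characters.</stage>
  <sp>
    <speaker>First Character</speaker>
    <l>An invented line of verse to open the scene,</l>
    <l>and a second line to close the speech.</l>
  </sp>
  <sp>
    <speaker>Second Character</speaker>
    <stage type="delivery">Aside</stage>
    <p>A reply in prose.</p>
  </sp>
</div>
```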
Elements in TextStructure and Core
Core Module:
abbr, add, addrLine, address, analytic, author, bibl, biblScope, biblStruct, binaryObject, cb, choice, cit, citedRange, corr, date, del, desc, distinct, divGen, editor, email, emph, expan, foreign, gap, gb, gloss, graphic, head, headItem, headLabel, hi, imprint, index, item, l, label, lb, lg, list, listBibl, measure, measureGrp, media, meeting, mentioned, milestone, monogr, name, note, num, orig, p, pb, postBox, postCode, ptr, pubPlace, publisher, q, quote, ref, reg, relatedItem, resp, respStmt, rs, said, series, sic, soCalled, sp, speaker, stage, street, teiCorpus, term, textLang, time, title, unclear