TEI for Transcription, Editing, and Representing Primary Sources
Dr James Cummings
@jamescummings
http://slides.com/jamescummings/tei-for-primary-sources
CC+by; Thanks to TEI Community
Transcription: A special kind of reading?
- Goals of transcription:
- to make a primary source accessible
- and comprehensible
- which may entail adding and/or using additional information
- Hence:
- transcription is selective
- transcription is interpretative
- transcription is subjective and depends on the editors' decisions
Just like the application of markup to a text!
The act of transcribing
- Inspect the original document or an image (facsimile) of it
- Identify areas on the document like text, graphics, etc.
- Find the first line
- Start typing the first line, identify special characters
- Record textual modifications/interventions like highlighting, (interlinear) additions, deletions, transpositions, etc.
- Identify text-structures
- Recognize writing activities
- Identify named entities like persons, places, etc.
At what point does transcription become editing?
The act of transcribing
- Inspect the original document or an image (facsimile) of it → <facsimile> or <sourceDoc>
- Identify areas on the document like text, graphics, etc. → <surface> or <zone> etc.
- Find the first line → <line> etc.
- Start typing the first line, identify special characters → <g> or <hi> etc.
- Record textual modifications/interventions like highlighting, (interlinear) additions, deletions, transpositions, etc. → <hi> or <mod> etc.
- Identify text-structures → <div>, <head>, <p>, <list> etc.
- Recognise writing activities → <add>, <del>, or <abbr> etc.
- Identify named entities like persons, places, etc. → <name>, <persName>, or <placeName> etc.
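As a rough sketch of how several of these steps can meet in one short passage (the sentence, the names and the choice of elements are invented for illustration and do not come from a real source):

<p>
  <lb/>Letter from <persName>Jane Doe</persName> at
  <lb/><placeName>Oxford</placeName>, written in
  <del rend="strikethrough">haste</del>
  <add place="above">great haste</add>.
</p>

Here <lb/> records the line breaks of the source, <del> and <add> the writing activity, and <persName> and <placeName> the named entities. How far to go beyond this is exactly the editorial decision discussed above.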
What’s in a transcription?
When transcribing primary sources for the creation of a digital edition, we usually encounter a range of textual phenomena:
- original layout information
- abbreviations or other "arcana"
- errors which invite correction or conjecture
- scribal additions, deletions, substitutions, restorations, transpositions
- passages which are damaged or illegible
- non-standard orthography which invites normalization
- …
Transcription of primary sources
- <teiHeader> provides descriptive and declarative metadata of a digital resource, possibly including a <msDesc>
- <text> contains a text-structural representation of a document's intellectual content (the text itself).
- <facsimile> organizes a set of page images representing a set of <surface>s
- <sourceDoc> a non-interpretive representation of a document considered purely as a physical object.
Note: The elements are invalid here.
Make sure you know why!
Abbreviations
Abbreviations are used in handwritten materials to shorten the scribal labour by using significant marks to replace:
- single letters
- groups of letters
- words
- whole phrases
Types of Abbreviations
- A suspension consists only of the first letter of a word or phrase, followed by a point. ('p.' for 'page')
- A contraction is a form of abbreviation where the letters in the middle of the word are omitted ('Dr' for 'Doctor'); the abbreviation can also come from more than one word.
- A brevigraph is a character representing two or more letters ('p with a bar through the descender' for 'per').
- Superscript letters are used to indicate various kinds of contraction ('po' with superscript 'r' for 'pour').
Simple Editorial Changes
- The core module provides some phrase-level elements which may be used to record simple editorial interventions.
- <choice> groups alternative encodings for the same point in a text
- Abbreviations:
- <abbr> abbreviated form
- <expan> expanded form
- Errors:
- <sic> apparent error
- <corr> corrected error
- Regularization:
- <orig> original form
- <reg> regularized form
Two Levels of Encoding Abbreviations
Abbreviations can be viewed in two different ways:
- As a representation of a particular sequence of letters or marks on the page:
a 'p with a bar through the descender'
- As a representation of the letters it is believed to stand for:
persone
- The nice thing is that we don’t have to decide which view we want to represent – the TEI can handle both in one document!
Two levels of encoding abbreviations
- <abbr> (abbreviation) and <expan> (expansion) encode the whole of an abbreviated word and the whole of its expansion
- <am> (abbreviation marker) contains a sequence of letters or signs present in an abbreviation which are omitted or replaced in the expanded form of the abbreviation
- <ex> (editorial expansion) contains a sequence of letters added by an editor or transcriber when expanding an abbreviation
Abbreviation and Expansion
Mr <expan>William</expan>
<lb />
<expan>Shakespeare</expan>
Mr <choice>
<abbr>W<am rend="abbr-sup">m</am></abbr>
<expan>W<ex>illia</ex>m</expan>
</choice>
<lb />
<choice>
<abbr>Shakes<am rend="abbr-per">p</am>e</abbr>
<expan>Shakes<ex>pear</ex>e</expan>
</choice>
Corrections and emendation
- Apparent errors in the text can be recorded in their original state, as corrected text, or combined with the <choice> element
- Processing software can present either the original or the correction.
- <sic> contains apparently incorrect or inaccurate text
- <corr> provides the correct reading of the text
William Shakespeare died in
<choice>
<sic>1614</sic>
<corr>1616</corr>
</choice>
Regularisation
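A minimal sketch of regularisation, assuming an invented early modern spelling: <orig> keeps the spelling found in the source and <reg> gives the editor's regularised form, paired inside <choice> in the same way as <sic> and <corr>.

<p>the <choice>
    <orig>vniuersall</orig>
    <reg>universal</reg>
  </choice> history</p>

As with corrections, processing software can then present either the original or the regularised spelling.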
Modifications
- <mod> represents any kind of general modification without specific interpretation, often used with a @type attribute for further specification
- <add> addition to the text
- <del> letter, word or phrase marked as deleted in the text
- <subst> groups additions and deletions as a single intervention
- <supplied> marks editorially supplied text
- <unclear> marks where text is illegible, containing best guess
<l>And towards our distant rest began to trudge,</l>
<l>
<subst>
<del rend="strikethrough">Helping the worst amongst us</del>
<add>Dragging the worst amongt us</add>
</subst>, who'd no boots
</l>
<l>But limped on, blood-shod. All went lame;
<subst>
<del rend="strikethrough">half-</del>
<add>all</add>
</subst> blind;</l>
<l>Drunk with fatigue ; deaf even to the hoots</l>
<l>Of tired, outstripped <del rend="strikethrough">fif</del>
five-nines that dropped behind.</l>
Partly Legible Text
- <unclear> marks where text is illegible, containing best guess
- @reason states the cause of the uncertainty in the transcription
- @resp indicates the party responsible for the interpretation
- @cert signifies the degree of certainty of the interpretation
- @agent categorises the cause of any damage
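A small sketch of these attributes in use. The line of verse, the attribute values and the #ed1 pointer are all hypothetical; in a real document #ed1 would have to point at something declared elsewhere, for example a responsibility statement in the header.

<l>and from your
  <unclear reason="faded" cert="medium" resp="#ed1">pale</unclear>
  hands</l>

The content of <unclear> is the transcriber's best guess at the reading, and @cert says how much weight it should be given.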
Omitted or Damaged Material
- <gap> indicates a point where material is omitted
- <damage> contains an area of damage to the text witness
<l>The Moving Finger
wri<damage agent="water" group="1">es; and</damage>
having writ,</l>
<l>Moves <damage agent="water" group="1">
<supplied>on: nor all your</supplied>
</damage> Piety nor Wit</l>
<gap unit="lines" quantity="8" reason="sampling"/>
Multiple Witnesses -- Critical Apparatus
- <app> an entry in a critical apparatus
- <lem> (optional) a lemma or base text
- <rdg> a single reading within a textual variation
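A minimal sketch of one apparatus entry, assuming three invented witnesses. The sigla #A, #B and #C and the readings are hypothetical; the sigla would need to be declared elsewhere, typically as <witness> elements in a <listWit>.

<l>The quick brown <app>
    <lem wit="#A">fox</lem>
    <rdg wit="#B">dog</rdg>
    <rdg wit="#C">hound</rdg>
  </app> jumps</l>

Here the editor has chosen the reading of witness A as the lemma. Without <lem>, the entry would simply list three readings without privileging any of them.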
Digital Facsimiles
About Digital Facsimiles
- A digital facsimile is composed of digital images of the original source. A digitised source document may contain nothing more than page images and a small amount of metadata, but it may also include an encoded transcription of the represented pages.
- <facsimile> contains the representation of a written source as a set of images
- <graphic> indicates the location of any graphic using the @url attribute
Example of Digital Facsimile
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<!-- metadata describing the edition -->
</teiHeader>
<facsimile>
<graphic url="page01.jpg" />
<graphic url="page02.jpg" />
<graphic url="page03.jpg" />
<graphic url="page04.jpg" />
</facsimile>
</TEI>
Referencing a Digital Facsimile
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<!--...-->
</teiHeader>
<text>
<pb facs="page1.png"/>
<!-- text contained on
page 1 is encoded here -->
<pb facs="page2.png"/>
<!-- text contained on
page 2 is encoded here -->
</text>
</TEI>
Combining Transcription with a Facsimile
- Transcriptions may either be supplied in parallel to a facsimile, or be documentary (embedded or non-interpretative)
- If the transcription is regarded as a text in its own right and organized independently of its physical realization in the document, the recommended practice is to use the <text> element to contain such a structured representation, and to present it in parallel.
- If the transcription is intended to prioritize the process by which the document came to take its present form over its textual representation, it may be preferable to present it as a documentary (embedded) transcription within a <sourceDoc> element.
Parallel Transcription with Facsimile
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<!--...-->
</teiHeader>
<facsimile>
<surface xml:id="p1">
<graphic url="p1.jpg"/>
</surface>
<!-- ... -->
</facsimile>
<text>
<pb facs="#p1" />
<!-- text contained on page 1 is encoded here -->
<pb facs="#p2" />
<!-- text contained on page 2 is encoded here -->
</text>
</TEI>
Surface and Zone
- <surface> defines a written surface as a two-dimensional coordinate space
- <zone> defines a single area on the surface using coordinates
- @points a list of point-pairs which build the text area (x,y).
- the @ulx, @uly, @lrx, @lry define the upper left corner and the lower right corner of a rectangle.
- to define these coordinates you can for example use the Oxygen facsimile plugin, the Image Markup Tool, etc.
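A rough sketch of both ways of describing a zone. The file name, the xml:id values and all coordinates are made up for illustration; in practice they would come from measuring the image, for instance with one of the tools mentioned above.

<facsimile>
  <surface xml:id="page1" ulx="0" uly="0" lrx="2000" lry="3000">
    <graphic url="page01.jpg"/>
    <!-- rectangular zone given by its upper left and lower right corners -->
    <zone xml:id="page1-head" ulx="200" uly="150" lrx="1800" lry="400"/>
    <!-- irregular zone given as a list of x,y points -->
    <zone xml:id="page1-note"
          points="250,1200 900,1180 920,1500 260,1520"/>
  </surface>
</facsimile>

A transcription can then point at a zone, for example with facs="#page1-head", in the same way that <pb> points at a whole surface.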
A Documentary, Embedded or Non-Interpretative Transcription
<sourceDoc> contains a transcription of a single document, representing the physical surface of a document without interpretation of the text, e.g. for building a dossier génétique.
- Similar to <facsimile>, a <sourceDoc> usually contains one or more <surface> elements with <zone> and <line> elements.
- <line> contains the transcription of a topographic line in the source document
- Some editorial markup is allowed (<add>, <del>, <unclear>, etc.) but you should only use elements here that are not interpretative.
A Documentary, Embedded or Non-Interpretative Transcription
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<!-- metadata -->
</teiHeader>
<sourceDoc>
<surface>
<zone>
<line><!-- Transcription of the
first line --></line>
<line><!-- Transcription of the
second line --></line>
</zone>
</surface>
</sourceDoc>
</TEI>
Genetic Editing: Marking up the Process
- <mod> generic tag for marking any kind of modification in the document, without attributing a specific function to it
- <metamark> any kind of written mark intended to determine how the document should be read
- <retrace> writing which has been rewritten or otherwise 'fixed' (e.g. replacing pencil with ink)
- <undo>, <redo> written modifications which have been reversed or reinstated
- <transpose>, <transposeGrp> transposed sequences
Metamarks
- The <metamark> element annotates marks such as numbers, arrows, crosses or other symbols, indicating how the text is to be read. These symbols are a kind of metatext, rather than forming part of the text
- @function specifies the function of the metamark (e.g. status, insertion, deletion, transposition, used)
- @target identifies one or more elements to which the function indicated by the metamark applies
<surface>
<metamark function="used" rend="line"
target="#X2"/>
<zone xml:id="X2">
<line>I am that halfgrown <add>angry</add>
boy, fallen asleep</line>
<line>The tears of foolish passion yet
undried</line>
<line>upon my cheeks.</line>
<!-- ... -->
<line>I pass through <add>the</add> travels
and <del>fortunes</del> of
<retrace>thirty</retrace>
</line>
<!-- ... -->
</zone>
<metamark function="used" target="#X2">
Entered - Yes</metamark>
</surface>
<undo> and <redo>
An alteration that has to be altered:
- <undo> indicates one or more marked-up interventions in a document which have subsequently been marked for cancellation
- @target points to the element(s) which are to be reverted
- <redo> indicates one or more cancelled interventions in a document which have subsequently been marked as reaffirmed or repeated
- @target points to the element(s) which are to be reasserted
<line>
This is
<del change="#s2" rend="overstrike">
<seg xml:id="X-a">just some</seg>
sample
<seg xml:id="X-b">text</seg>,
we need
</del>
<add change="#s2">not</add>
a real example.
</line>
<undo target="#X-a #X-b" rend="dotted" change="#s3"/>
<line>
<redo target="#redo-1" cause="fix"/>
<mod xml:id="redo-1" rend="strikethrough" spanTo="#anchor-1" />
Ihr hagren, triſten, krummgezog nen Nacken
</line>
<line>Wenn ihr nur piepſet iſt die Welt ſchon matt.
<anchor xml:id="anchor-1"/></line>
Transpositions
Transpositions are passages that should be moved to a
different position. Metamarks (arrows, asterisks,
numbers…) often indicate the changes.
- The element <transpose> describes a single textual transposition as an ordered list of at least two pointers (<ptr>) specifying the order in which the elements indicated should be re-combined
- The element <listTranspose> supplies a list of transpositions, each of which is indicated at some point in a document, typically by means of metamarks. The list can be part of the <profileDesc>
<line>
<seg xml:id="ib01">bör</seg>
<metamark rend="underline"
function="transposition" target="#ib01"
place="above"> 2. </metamark>
og <seg xml:id="ib02">hör</seg>
<metamark rend="underline"
function="transposition" target="#ib02"
place="above">1. </metamark>
</line>
<!-- ... -->
<listTranspose>
<transpose>
<ptr target="#ib02"/>
<ptr target="#ib01"/>
</transpose>
</listTranspose>
Recording the Genesis of a Text
Writing process ("revision history")
- <listChange> groups a list of revision phases
- @ordered indicates whether the order of the <change> elements is significant
- <change> describes a single revision phase
- @xml:id to identify the stages
- The list of revision phases is placed in the <creation> element within <profileDesc> in the TEI Header
Let us hypothesize that the different colors of ink here are
associated with different layers (stages, phases...)
Documenting the Layers
<profileDesc>
<creation>
<listChange ordered="true">
<change xml:id="stage-1">First layer, in black ink</change>
<change xml:id="stage-2">Second layer, in red</change>
<change xml:id="stage-3">Corrections and revisions,
in blue</change>
<change xml:id="stage-4">Deletions and usage notes
in green</change>
</listChange>
</creation>
</profileDesc>
Associating the Layers
<change xml:id="stage-1">First layer, in black ink</change>
<change xml:id="stage-2">Second layer, in red</change>
<!-- in a surface in sourceDoc -->
<zone xml:id="zone1" change="#stage-1">
<line> 28) le court de tennis. Les tribunes sont ... Deux joueurs</line>
<!-- ... -->
<line>30) l’un des joueurs de tennis se tient ... trois</line>
<line>fois sur le sol</line>
<zone change="#stage-2">
<line>31) </line>
<line>Vue de face</line>
<line> à contre jour</line>
<metamark function="add"/>
<line>la vieille dame ... dans le vestibule (contre-jour)</line>
</zone>
<!-- ... -->
</zone>
Elements in 'transcr' Module
- Elements defined in the 'transcr' (Representation of Primary Sources) module: