TEI for Transcription, Editing, and Representing Primary Sources

Dr James Cummings

@jamescummings

http://slides.com/jamescummings/tei-for-primary-sources

CC+by; Thanks to TEI Community

Transcription: A special kind of reading?

  • Goals of transcription:
    • to make a primary source accessible
    • and comprehensible
    • which may entail adding and/or using additional information
  • Hence:
    • transcription is selective
    • transcription is interpretative
    • transcription is subjective and depends on the editors' decisions

Just like the application of markup to a text! 

The act of transcribing

  • Inspect the original document or an image (facsimile) of it
  • Identify areas on the document like text, graphics, etc.
  • Find the first line
  • Start typing the first line, identify special characters
  • Record textual modifications/interventions like highlighting, (interlinear) additions, deletions, transpositions, etc.
  • Identify text-structures
  • Recognize writing activities
  • Identify named entities like persons, places, etc. 

At what point does transcription become editing?

The act of transcribing

  • Inspect the original document or an image (facsimile) of it → <facsimile> or <sourceDoc>
  • Identify areas on the document like text, graphics, etc.
    <surface> or <zone> etc.
  • Find the first line <line> etc.
  • Start typing the first line, identify special characters <g> or <hi> etc.
  • Record textual modifications/interventions like highlighting, (interlinear) additions, deletions, transpositions, etc.  <hi> or <mod> etc.
  • Identify text-structures  <div>, <head>, <p>, <list> etc.
  • Recognise writing activities  <add>, <del>, or <abbr> etc.
  • Identify named entities like persons, places, etc.  <name>, <persName>, or <placeName> etc.
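
As a rough illustration of the last few steps, a fragment might combine text structure, a scribal addition, and named entities as sketched below. The heading and the sentence are invented for this example and do not come from any particular source.

```xml
<div>
 <head>A note on the author</head>
 <!-- text structure: a division with a heading and a paragraph -->
 <p><persName>William Shakespeare</persName> was born in
  <placeName>Stratford-upon-Avon</placeName>
  <!-- writing activity: an interlinear addition -->
  <add place="above">in 1564</add>.</p>
</div>
```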

What's in a transcription?

When transcribing primary sources for the creation of a digital edition, we usually encounter a range of textual phenomena:

  • original layout information
  • abbreviations or other "arcana"
  • errors which invite correction or conjecture
  • scribal additions, deletions, substitutions, restorations, transpositions
  • passages which are damaged or illegible
  • non-standard orthography which invites normalization
  • … ​

Transcription of primary sources

  • <teiHeader> provides descriptive and declarative metadata of a digital resource, possibly including a <msDesc>
  • <text> contains a text-structural representation of a document's intellectual content (the text itself).
  • <facsimile> organizes a set of page images representing a set of <surface>s
  • <sourceDoc> a non-interpretive representation of a document considered purely as a physical object. 

Note: The elements are invalid here.

Make sure you know why!
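
For orientation, here is a minimal sketch of one arrangement in which all four elements can sit together in a single document: the <msDesc> goes inside <sourceDesc> in the header, and <facsimile>, <sourceDoc> and <text> follow the header as siblings. The title, repository name, shelfmark and image file name are placeholders, not real data.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader>
  <fileDesc>
   <titleStmt><title>Placeholder title</title></titleStmt>
   <publicationStmt><p>Unpublished sketch</p></publicationStmt>
   <sourceDesc>
    <!-- manuscript description of the primary source -->
    <msDesc>
     <msIdentifier>
      <repository>Hypothetical Library</repository>
      <idno>MS 1</idno>
     </msIdentifier>
    </msDesc>
   </sourceDesc>
  </fileDesc>
 </teiHeader>
 <facsimile>
  <surface><graphic url="page01.jpg"/></surface>
 </facsimile>
 <sourceDoc>
  <surface><zone><line>First line as written</line></zone></surface>
 </sourceDoc>
 <text>
  <body><p>First line, as part of a structured text</p></body>
 </text>
</TEI>
```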

Abbreviations

Abbreviations are used in handwritten materials to shorten the scribal labour by using significant marks to replace:

  • single letters
  • groups of letters
  • words
  • whole phrases                                          

Types of Abbreviations

  • A suspension consists only of the first letter of a word or phrase, followed by a point. ('p.' for 'page')
  • A contraction is a form of abbreviation where the letters in the middle of the word are omitted ('Dr' for 'Doctor'); the abbreviation can also come from more than one word.
  • A brevigraph is a character representing two or more letters ('p with a bar through the descender' for 'per').
  • Superscript letters can indicate various kinds of contraction ('po' with superscript 'r' for 'pour').

Simple Editorial Changes

  • The core module provides some phrase-level elements which may be used to record simple editorial interventions.
  • <choice> groups alternative encodings for the same point in a text
    • Abbreviations:
      • <abbr> abbreviated form
      • <expan> expanded form
    • Errors:
      • <sic> apparent error
      • <corr> corrected error
    • Regularization:
      • <orig> original form
      • <reg> regularized form

Two Levels of Encoding Abbreviations

Abbreviations can be viewed in two different ways:

  • As a representation of a particular sequence of letters or marks on the page:
    a 'p with a bar through the descender'
  • As a representation of the letters it is believed to stand for:
    persone
  • The nice thing is that we don't have to decide which view we want to represent – the TEI can handle both in one document!

Two levels of encoding abbreviations

  • <abbr> (abbreviation) and <expan> (expansion) encode the whole of an abbreviated word and the whole of its expansion
  • <am> (abbreviation marker) contains a sequence of letters or signs present in an abbreviation which are omitted or replaced in the expanded form of the abbreviation
  • <ex> (editorial expansion) contains a sequence of letters added by an editor or transcriber when expanding an abbreviation

Abbreviation and Expansion

<!-- expansion only -->
Mr <expan>William</expan>
<lb />
<expan>Shakespeare</expan>

<!-- abbreviation and expansion recorded together -->
Mr <choice>
 <abbr>W<am rend="abbr-sup">m</am></abbr>
 <expan>W<ex>illia</ex>m</expan>
</choice>
<lb />
<choice>
 <abbr>Shakes<am rend="abbr-per">p</am>e</abbr>
 <expan>Shakes<ex>pear</ex>e</expan>
</choice>

Corrections and emendation

  • Apparent errors in the text can be recorded in their original state, as corrected text, or combined with the <choice> element
  • Processing software can present either the original or the correction.
    • <sic> contains apparently incorrect or inaccurate text
    • <corr> provides the correct reading of the text
William Shakespeare died in
<choice>
 <sic>1614</sic> 
 <corr>1616</corr>
</choice>

Regularisation
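
Regularisation follows the same <choice> pattern as abbreviations and corrections: <orig> keeps the spelling of the source and <reg> gives the regularized form. A minimal sketch, using an invented sentence with early modern-style spellings chosen purely for illustration:

```xml
<p>He studied at the
 <choice>
  <orig>vniuersitie</orig>
  <reg>university</reg>
 </choice>
 for seven
 <choice>
  <orig>yeeres</orig>
  <reg>years</reg>
 </choice>.</p>
```

Processing software can then present either the original or the regularized reading, just as with <sic> and <corr>.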

Modifications

  • <mod> represents any kind of general modification without specific interpretation; it is often used with a @type attribute for further specification

  • <add> addition to the text

  • <del> letter, word or phrase marked as deleted in the text

  • <subst> groups additions and deletions as a single intervention

  • <supplied> marks editorially supplied text

  • <unclear> marks text that cannot be read with certainty, containing the transcriber's best guess

<l>And towards our distant rest began to trudge,</l>
<l>
 <subst>
 <del rend="strikethrough">Helping the worst amongst us</del>
 <add>Dragging the worst amongt us</add>
 </subst>, who'd no boots
</l>
<l>But limped on, blood-shod. All went lame;
<subst>
 <del rend="strikethrough">half-</del>
 <add>all</add>
</subst> blind;</l>
<l>Drunk with fatigue ; deaf even to the hoots</l>
<l>Of tired, outstripped <del rend="strikethrough">fif</del> 
five-nines that dropped behind.</l>

Partly Legible Text

  • <unclear> marks text that cannot be read with certainty, containing the transcriber's best guess

    • @reason states the cause of the uncertainty in the transcription

    • @resp indicates the party responsible for the interpretation

    • @cert signifies the degree of certainty of the interpretation

    • @agent categorises the cause of any damage
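
A minimal sketch of how these attributes might be combined on a single <unclear>. The damage described here is invented for illustration, and #ed1 is a hypothetical pointer to an editor who would need to be declared elsewhere, for example in the header.

```xml
<l>Moves on: nor all your
 <!-- best guess at a word obscured by (hypothetical) water damage -->
 <unclear reason="damage" agent="water" cert="medium" resp="#ed1">Piety</unclear>
 nor Wit</l>
```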

Omitted or Damaged Material

  • <gap> indicates a point where material is omitted

  • <damage> contains an area of damage to the text witness

<l>The Moving Finger 
wri<damage agent="water" group="1">es; and</damage> 
having writ,</l>
<l>Moves <damage agent="water" group="1">
  <supplied>on: nor all your</supplied>
 </damage> Piety nor Wit</l>
<gap unit="lines" quantity="8" reason="sampling"/>

Multiple Witnesses -- Critical Apparatus

  • <app> an entry in a critical apparatus

  • <lem> (optional) a lemma or base text

  • <rdg> a single reading within a textual variation
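
A minimal sketch of an apparatus entry, assuming two hypothetical witnesses with the sigla A and B that would need to be declared in a <listWit>. The variant is invented for illustration and mirrors the date in the correction example earlier.

```xml
<p>William Shakespeare died in
 <app>
  <!-- reading adopted as the base text, found in witness A -->
  <lem wit="#A">1616</lem>
  <!-- variant reading found in witness B -->
  <rdg wit="#B">1614</rdg>
 </app>.</p>
```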

Digital Facsimiles

About Digital Facsimiles

  • A digital facsimile is composed of digital images of the original source. A digitised source document may contain nothing more than page images and a small amount of metadata, but it may also include an encoded transcription of the pages represented.
  • <facsimile> contains the representation of a written source as a set of images
  • <graphic> indicates the location of any graphic using the @url attribute

Example of Digital Facsimile

<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader>
 <!-- metadata describing the edition -->
 </teiHeader>
 <facsimile>
 <graphic url="page01.jpg" />
 <graphic url="page02.jpg" />
 <graphic url="page03.jpg" />
 <graphic url="page04.jpg" />
 </facsimile>
</TEI>

Referencing a Digital Facsimile

<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader>
 <!--...-->
 </teiHeader>
 <text>
 <pb facs="page1.png"/>
 <!-- text contained on 
    page 1 is encoded here -->
 <pb facs="page2.png"/>
 <!-- text contained on 
    page 2 is encoded here -->
 </text>
</TEI>

Combining Transcription with a Facsimile

  • Transcriptions may either be supplied in parallel to a facsimile, or be documentary (embedded or non-interpretative)
  • If the transcription is regarded as a text in its own right and organized independently of its physical realization in the document, the recommended practice is to use the <text> element to contain such a structured representation, and to present it in parallel.
  • If the transcription is intended to prioritize the process by which the document came to take its present form over its textual representation, it may be preferable to present it as a documentary (embedded) transcription within a <sourceDoc> element.

Parallel Transcription with Facsimile

<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader>
 <!--...-->
 </teiHeader>
 <facsimile>
 <surface xml:id="p1">
  <graphic url="p1.jpg"/>
 </surface>
  <!-- ... -->
 </facsimile>
 <text>
 <pb facs="#p1" />
 <!-- text contained on page 1 is encoded here -->
 <pb facs="#p2" />
 <!-- text contained on page 2 is encoded here -->
 </text>
</TEI>

Surface and Zone

  • <surface> defines a written surface as a two-dimensional coordinate space
  • <zone> defines a single area on the surface using coordinates
  • @points a list of point-pairs which build the text area (x,y).
  • the @ulx, @uly, @lrx, @lry define the upper left corner and the lower right corner of a rectangle.
  • to define these coordinates you can for example use the Oxygen facsimile plugin, the Image Markup Tool, etc.
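
A minimal sketch of both ways of defining a zone, assuming a page image of 2000 by 3000 pixels. The file name, identifiers and all coordinate values are made up for illustration.

```xml
<facsimile>
 <!-- the surface sets up the coordinate space for the page -->
 <surface xml:id="p1" ulx="0" uly="0" lrx="2000" lry="3000">
  <graphic url="p1.jpg"/>
  <!-- rectangular zone: upper left and lower right corners -->
  <zone xml:id="p1-z1" ulx="150" uly="200" lrx="1850" lry="900"/>
  <!-- irregular zone: a polygon given as a list of x,y pairs -->
  <zone xml:id="p1-z2"
   points="150,1000 1850,1000 1850,2800 900,2800 150,2400"/>
 </surface>
</facsimile>
```

Elements in the transcription can then point at a zone with @facs (for example facs="#p1-z1"), in the same way that <pb> points at a whole surface.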

A Documentary, Embedded or Non-Interpretative Transcription

<sourceDoc> contains a transcription of a single document, representing the physical surface of a document without interpretation of the text, e.g. for building a dossier génétique.

  • Similar to <facsimile>, a <sourceDoc> usually contains one or more <surface> elements with <zone> and <line> elements.
  • <line> contains the transcription of a topographic line in the source document
  • Some editorial markup is allowed (<add>, <del>, <unclear>, etc.) but you should only use elements here that are not interpretative.

A Documentary, Embedded or Non-Interpretative Transcription

<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
 <!-- metadata -->
 </teiHeader>
 <sourceDoc>
 <surface>
 <zone>
 <line><!-- Transcription of the 
         first line --></line>
 <line><!-- Transcription of the 
         second line --></line>
 </zone>
 </surface>
 </sourceDoc>
</TEI>

Genetic Editing: Marking up the Process

  • <mod> generic tag for marking any kind of modification in the document, without attributing a specific function to it
  • <metamark> any kind of written mark intended to determine how the document should be read
  • <retrace> writing which has been rewritten or otherwise 'fixed' (e.g. replacing pencil with ink)
  • <undo>, <redo> written modifications which have been reversed or reinstated
  • <transpose>, <transposeGrp> transposed sequences

Metamarks

  • The <metamark> element annotates marks such as numbers, arrows, crosses or other symbols, indicating how the text is to be read. These symbols are a kind of metatext, rather than forming part of the text.
     
  • @function specifies the function of the metamark (e.g. status, insertion, deletion, transposition, used)
     
  • @target identifies one or more elements to which the function indicated by the metamark applies
     
<surface>
 <metamark function="used" rend="line"
target="#X2"/>
 <zone xml:id="X2">
 <line>I am that halfgrown <add>angry</add>
 boy, fallen asleep</line>
 <line>The tears of foolish passion yet
 undried</line>
 <line>upon my cheeks.</line>
 <!-- ... -->
 <line>I pass through <add>the</add> travels
 and <del>fortunes</del> of
 <retrace>thirty</retrace>
 </line>
 <!-- ... -->
 </zone>
 <metamark function="used" target="#X2">
 Entered - Yes</metamark>
</surface>

<undo> and <redo>

An alteration that has to be altered:

  • <undo> indicates one or more marked-up interventions in a document which have subsequently been marked for cancellation
    • @target points to the element(s) which are to be reverted
  • <redo> indicates one or more cancelled interventions in a document which have subsequently been marked as reaffirmed or repeated.
    • @target points to the element(s) which are to be reasserted
<line>
This is 
<del change="#s2" rend="overstrike">
 <seg xml:id="X-a">just some</seg> 
  sample 
 <seg xml:id="X-b">text</seg>,
  we need
</del>
<add change="#s2">not</add> 
  a real example.
</line>


<undo target="#X-a #X-b" rend="dotted" change="#s3"/>
<line>
 <redo target="#redo-1" cause="fix"/>

 <mod xml:id="redo-1" rend="strikethrough" spanTo="#anchor-1" />
 Ihr hagren, triſten, krummgezog nen Nacken
</line>

<line>Wenn ihr nur piepſet iſt die Welt ſchon matt. 
 <anchor xml:id="anchor-1"/></line>

Transpositions

Transpositions are passages that should be moved to a different position. Metamarks (arrows, asterisks, numbers…) often indicate the changes.

  • The element <transpose> describes a single textual transposition as an ordered list of at least two pointers (<ptr>) specifying the order in which the elements indicated should be re-combined
  • The element <listTranspose> supplies a list of transpositions, each of which is indicated at some point in a document, typically by means of metamarks. The list can be part of the <profileDesc>
<line>
 <seg xml:id="ib01">bör</seg>
 <metamark rend="underline" 
   function="transposition" target="#ib01"
   place="above"> 2. </metamark>
 og <seg xml:id="ib02">hör</seg>
 <metamark rend="underline" 
   function="transposition" target="#ib02"
   place="above">1. </metamark>
</line>
<!-- ... -->
<listTranspose>
 <transpose>
 <ptr target="#ib02"/>
 <ptr target="#ib01"/>
 </transpose>
</listTranspose>

Recording the Genesis of a Text

Writing process ("revision history")

  • <listChange> groups a list of revision phases
    • @ordered whether the order of the change elements is significant or not
  • <change> describes a single revision phase
    • @xml:id to identify the stages
  • The list of revision phases is part of the <creation> element within <profileDesc> in the TEI Header

Let us hypothesize that the different colors of ink here are associated with different layers (stages, phases...)

Documenting the Layers

<profileDesc>
 <creation>
 <listChange ordered="true">
 <change xml:id="stage-1">First layer, in black ink</change>
 <change xml:id="stage-2">Second layer, in red</change>
 <change xml:id="stage-3">Corrections and revisions, 
  in blue</change>
 <change xml:id="stage-4">Deletions and usage notes 
  in green</change>
 </listChange>
 </creation>
</profileDesc>

Associating the Layers

<change xml:id="stage-1">First layer, in black ink</change>
<change xml:id="stage-2">Second layer, in red</change>


<!-- in a surface in sourceDoc -->

<zone xml:id="zone1" change="#stage-1">
 <line> 28) le court de tennis. Les tribunes sont ... Deux joueurs</line>
 <!-- ... -->
 <line>30) l’un des joueurs de tennis se tient ... trois</line>
 <line>fois sur le sol</line>
 <zone change="#stage-2">
 <line>31) </line>
 <line>Vue de face</line>
 <line> à contre jour</line>
 <metamark function="add"/>
 <line>la vieille dame ... dans le vestibule (contre-jour)</line>
 </zone>
 <!-- ... -->
</zone>

Elements in 'transcr' Module

TEI for Transcription

A TEI workshop presentation on TEI for Transcription, Editing, and Representing Primary Sources
