Thinking about Collation

Workshop at TEI - MEC 2023,  Paderborn

Elisa, Hugh, Torsten, Raff

Thinking about Collation

Workshop Notes!

Part 1: Collation: an introduction

Comparing different versions of a text

Carter "Hailey’s Comet" Portable Optical Collator.



Tracing version history of a work

Think before you collate

What is your "theory of the text"?

How many texts are you comparing, and on what basis? 

What tools can best help you with this comparison? 

What constitutes a variant?

(and when collating music notation?)

Automated collation is
an exception

A few things that get weird

Friedrich Rochlitz

*1769 in Leipzig

music literate, critic, publisher


witnessed the Battle of Leipzig
in October 1813 (Napoleon vs. most of Europe)


author of “Tage der Gefahr ”

(„days of peril“), 1813 – 1822

Genesis of Tage der Gefahr

  • in 1813 Rochlitz started a letter to a friend in Dresden
  • the war in Leipzig interrupted all postal services
    • Rochlitz continued writing every day
    • the letter became a war diary
  • after the war he sent the letter/diary to his friend
    • the letter was read to many persons
    • Rochlitz decided to elaborate and to publish it
    • the letter became a biographic novel
  • first edition: 1816
    • the book was revised and extended
    • it became partially fictional
  • second edition: 1822

Rochlitz to Böttiger, 1813

„Ich habe in diesen Tagen stündlichen Schreckens und einer Noth, wie wir sie hier noch nicht gekannt, nicht schreiben können …“

First Edition, 1816

Revised Edition, 1822

1814/1822: “A Dream”

a text published in 1814 which was included in the 1822 edition


multiple changes of genre → no ideal base text

high degree of variance amongst “versions”


1816 / 1822: using apparatus seems convenient


1814 / 1816 / 1822: is an apparatus still the appropriate method?

Some TEI Code

two-level apparatus

level 1: alignment of chunks → focus on similarities

level 2: collation of tokens → focus on differences

chunk level

token level

Digital Edition

one-column presentation, marginal apparatus

→ eternal beta:

Planning a collation project (1/3)

  • What source documents will be used? Are there many witnesses, few, or one? Are the sources relatively close copies or not? (Music: what role will authorial adaptations like piano reductions or orchestrations have in determining variation?)
  • Will there be a single ‘base’ text? Or will witnesses be separately transcribed?
  • If a single base text will be used, will it be that of a particular witness, or will the editor attempt to reconstruct an ideal or original text?

about the source documents:

(from the TEI Guidelines Ch. 12 Critical Apparatus)

Planning a collation project (2/3)

  • Will each reading in an apparatus entry record every attestation (a ‘positive’ apparatus), or merely witnesses that deviate from the base text (a ‘negative’ apparatus)?
  • Will the readings of most or all witnesses be represented in the apparatus, or only a selection the editor deems relevant?

about the identification of reading "witnesses" 

(from the TEI Guidelines Ch. 12 Critical Apparatus)

Planning a collation project (3/3)

  • What level of variation will require distinguishing one witness reading from another? For example, will the editor consider an abbreviated word in a witness as agreeing with the base text, or not?
  • Will conjectures (variant readings suggested by an editor) be treated differently than readings found in witnesses?
  • Will there be a need to distinguish different types of variation, for example orthographic vs. morphological or lexical variants?

about the handling of variation

(from the TEI Guidelines Ch. 12 Critical Apparatus)


How can the Guidelines Critical Apparatus chapter better discuss the challenges of alignment?



How is alignment related to describing variants? 

Alignment: What constitutes a "same" starting point or ending point for a passage?

Do we need a theory of alignment for our texts?

An alignment challenge in the Frankenstein Variorum

An alignment challenge in the Frankenstein Variorum

To view the edition, see and click on Variorum Viewer

Some methods of collation

"Chunking up" the texts in small passages that align via...


  • Observation and direct encoding in a critical apparatus
  • Chunk/block the texts side by side in columns on a spread sheet
  • Get help from software like collateX

Alignment of large units

Alignment of large units

Alignment: smaller scale

			n="['&lt;del&gt;and this makes us all very wretched, as much so nearly as after the death 
            of your dear mother.&lt;/del&gt;', 'and this suspicion fills us with anguish. 
            i perceive that your father &lt;del&gt;conceals&lt;/del&gt; attempts to conceal his fears from me;
			but cheerfulness has flown from our little circle, only to be restored by a certain assuranance 
			that there is no foundation for our anxiety. at one time']">
			<rdg wit="fThomas">&lt;del rend="strikethrough"&gt;and this makes us all very wretched,
				as much so nearly as after the death of your dear mother.&lt;/del&gt; &lt;add
				place="bottom"&gt;and this suspicion fills us with anguish. I perceive that your
				father &lt;del-INNER&gt;conceals&lt;/del-INNER&gt; attempts to conceal his fears
				from me; but cheerfulness has flown from our little circle, only to be restored by a
				certain assuranance that there is no foundation for our anxiety. At one
		<rdgGrp n="['and', 'this', 'makes', 'us']">
			<rdg wit="f1818">and this makes us</rdg>
			<rdg wit="f1823">and this makes us</rdg>
			<rdg wit="fMS">&lt;mod sID="c56-0058__main__d5e11820"/&gt;&lt;sga-add
				place="superlinear" sID="c56-0058__main__d5e11822"/&gt;and this makes us</rdg>
		<rdgGrp n="['all']">
			<rdg wit="f1818">all</rdg>
			<rdg wit="f1823">all</rdg>
		<rdgGrp n="['very']">
			<rdg wit="f1818">very</rdg>
			<rdg wit="f1823">very</rdg>
			<rdg wit="fMS">very</rdg>
		<rdgGrp n="['wretched']">
			<rdg wit="fMS">wretched</rdg>
		<rdgGrp n="['wretched,']">
			<rdg wit="f1818">wretched,</rdg>
			<rdg wit="f1823">wretched,</rdg>

When you want to collate markup. . .

About that little problem...

...unless we

  • analyze the markup to determine where it meaningfully intersects with the simpler encoding of the print
  • ignore what markup has no counterpart in the other witnesses

Machine-assisted collation:

theory and practice

Background image created by Elisa from a picture of a machine loom on Reddit and the frontispiece illustration of Frankenstein (1831)

Machine-aided collation is weaving...

  • Source documents supplying “threads”
  • collation software = weaving machine
    • needs to be able to find where the “threads” run together,
    • and where they diverge (and what constitutes divergence)
    • Markup functions as signals to the collation weaving machine

Gothenburg model

algorithm for computer-assisted collation, developed in 2009 workshop of collateX and Juxta developers.

  1. Tokenization :

    • Break down the smallest unit of comparison: (words--with punctuation, or character-by-character):

    • FV tokenizes words and includes punctuation and tags:
      '&lt;del&gt;the', 'frame', 'on', 'whic&lt;del&gt;', 'my', 'man', 'completeed,.'

  2. Normalization  

    • ​​ '&' = 'and'

    • <p xml:id="novel1_letter4_div4_p2"> = <p/>

  3. Alignment

    • Identify comparable divergence: what makes text sequences comparable units?

    • “Chunking” text into comparable passages (chapters/paragraphs that line up with identifiable start and end points). Collation proceeds chunk by chunk.

  4. Analysis  

    • ​​ Study output, correct, and re-align after machine process, AND refine automated processing

  5. Visualization:

    • ​​Critical edition interface, graph displays



Juxta (no longer a web service)

Some machine-assisted methods

automatic alignment by segments + collation within segments

Tool Presentation: LERA

“Locate, Explore, Retrace and Apprehend complex text variants”

step 1: upload documents

feature: segmentation e.g. by TEI elements or line breaks

Tool Presentation: LERA

step 2: choose documents for alignment/collation (“edition”)

feature: automatic alignment of unaligned segments

Tool Presentation: LERA

step 3: alignment revision and collation configuration

feature: dynamically updated visualizations

Tool Presentation: LERA

step 4: TEI export

Tool Presentation: LERA

Machine assistance or hindrance?

Discuss: Troubles with Gothenburg...

 <app xml:id="C11_app11" n="20">
               <rdgGrp xml:id="C11_app11_rg_empty">
                  <rdg wit="#f1831"/>
               <rdgGrp n="['henry', '–', 'surely', 'victor']" xml:id="C11_app11_rg1">
                  <rdg wit="#fMS">
                     <ptr target=""/>
                     <witDetail wit="#fMS" target="sga:c56/#/p58">
                        <ref type="page"
                           <ptr target="[@xml:id='ox-ms_abinger_c56-0058']/tei:zone[@type='main']//tei:line[8],0,22)"/>
                           <fv:line_text>Henry – Surely Victor</fv:line_text>
                           <fv:resolved_text>Henry – Surely Victor</fv:resolved_text>
               <rdgGrp n="['henry.', 'surely,', 'victor,']" xml:id="C11_app11_rg2">
                  <rdg wit="#f1818">
                     <ptr target=""/>
                  <rdg wit="#f1823">
                     <ptr target=""/>
                  <rdg wit="#fThomas">
                     <ptr target=""/>

Critical apparatus with pointers

Jupyter Notebook exercise

Thinking about Collation

Now, Discussion! Go to the Workshop Notes!

Discussion Topics

1.  What special challenges do we encounter in the collation of music—in identification of meaningful variation, determining alignment, etc? 


2. When would you want to collate markup or "pseudomarkup" in your projects? Or what kinds of projects might benefit from this?


3. Try out the Jupyter notebook linked in these slides.  Where might "machine-assisted collation" be beneficial? Where might it cause problems?


4. What's wrong with machine-assisted collation now? How should it be better? What would it need to do to better work with humanities text / music collation challenges?



Part 2: Modeling the Critical Apparatus

Guidelines Chapter 12

<app>, <lem>, <rdg>, <rdgGrp>, <wit>, <witDetail>, @wit


<app> started out as phrase level, but can contain structures (as of v. 2.9.1, fall 2015)


<app> can nest, so multi-level variation can be handled, but is that enough?

Overview of approaches used in TEI and MEI

  • Lachmanian
 <tei:rdg wit="#El">Experience though noon Auctoritee</rdg>
 <tei:rdg wit="#La">Experiment thouh noon Auctoritee</rdg>
 <tei:rdg wit="#Ra2">Eryment though none auctorite</rdg>
  • "Multitext variorum"
<!-- In the document content: -->
   <mei:rdg source="#critApp.source1">
      <!-- reading of source 1 -->
   <mei:rdg source="#critApp.source2 #critApp.source3">
      <!-- reading of sources 2 *and* 3 -->

TEI Guidelines Chapter 12

MEI Guidelines Chapter 11

Overview of approaches used in TEI and MEI

  • "Multitext variorum"
  • What do we mean?
  • No official recommendation from the Guidelines

multiple witnesses to largely the same text

multiple versions

(new editions, reworked material)

text re-use

"Multitext variorum" Stand-off collation

From Early Modern Songscapes:

"Multitext variorum" Stand-off collation

Modeling collation (critical apparatus)



  • for scholarly research questions: what do you need to investigate?
  • for interface development: What can you build with the critical apparatus?
    • What can a critical apparatus help you to build and visualize in an edition?

Critical Apparatus Challenges

too many witnesses...

working with data from the critical apparatus...mapping it back to the editions


Choose a topic to discuss in small groups for ~30 minutes.

  1. Discuss your projects! What are you doing with the critical apparatus?
  2. What would you like to see in the critical apparatus?
  3. What would you like to see the TEI provide for expressing variation inline?
  4. What tooling do we need for document variance modeling?
  5. What tooling do we need for interface development?


Thinking about collation

By Elisa Beshero-Bondar

Thinking about collation

Slides to accompany a workshop at TEI - MEC 2023 Paderborn

  • 208