Elisa Beshero-Bondar
Professor of Digital Humanities and Chair of the Digital Media, Arts, and Technology Program at Penn State Erie, The Behrend College.
Twitter: @epyllia | GitHub: @ebeshero
Balisage: The Markup Conference
Monday August 1, 2022 @ 11am
Link to these slides: https://bit.ly/untangle-fv
Background image created by the author from a photo of a loom posted on Reddit and the frontispiece illustration of Frankenstein (1831)
Our use of the term: A digital edition that investigates change to a work by comparing distinct versions of it.
1818 Edition published anonymously (3 volumes)
1823 Edition printed by MWS's father William Godwin, the first to include her name as the author. (2 volumes)
1831 Edition extensively revised by MWS, bound with Friedrich von Schiller's The Ghost Seer in Bentley's Standard Novels series (1/2 of a volume)
Thomas Copy made sometime between 1818 and 1822: MWS's marginal comments on a print copy of 1818
Legend: print edition | digital edition
1974: James Rieger, ed., first new edition of the 1818 text in 141 years: inline collation of the "Thomas" copy with 1818; 1831 variants in endnotes (print edition)
~mid-1990s: Stuart Curran and Jack Lynch, Pennsylvania Electronic Edition (PAEE), collation of 1818 and 1831: HTML (digital edition)
1996: Nora Crook, critical edition of 1818, variants of "Thomas", 1823, and 1831 in endnotes (P&C MWS collected works) (print edition)
1996: Charles Robinson, The Frankenstein Notebooks (Garland): print facsimile of the 1816 ms drafts (print edition)
2007: Romantic Circles TEI conversion of PAEE; separates the texts of 1818 and 1831; collation via Juxta (digital edition)
2013: Shelley-Godwin Archive publishes diplomatic/documentary edition of the 1816 ms drafts (digital edition)
2017: Frankenstein Variorum Project: assembly/proof-correcting of PAEE files; OCR/proof-correcting of 1823; "bridge" TEI edition of S-GA notebook files; automated collation; incorporating the "Thomas" copy text (digital edition)
Can we make an edition that conveniently compares the manuscripts to the print publications?
Can we make a comprehensive variorum to show changes to the novel over time, from 1816 to 1831?
Which editorial interventions persist from 1816 to 1831?
MWS's revisions in the "Thomas" copy: how much of this persists into 1831?
PBS's additions: which/how many of these persist to 1831?
What parts of the novel were most mutable?
1. Share a nonlinear, divergent edition history
2. Introduce textual scholarship to students and fans of Frankenstein, as well as to textual scholars and 19th-century specialists:
Prepared new XML of the 1823 edition from OCR
Well, mostly...
...But let's take it one stage at a time...
...unless we
Shelley-Godwin Archive’s diplomatic edition of the 1816 Notebooks at http://shelleygodwinarchive.org
collection of TEI files, one file per notebook page
<surface lrx="3847" lry="5342"
partOf="#ox-frankenstein_volume_i"
ulx="0" uly="0" folio="21r" shelfmark="MS. Abinger c. 56" base="ox-ms_abinger_c56/ox-ms_abinger_c56-0045.xml"
id="ox-ms_abinger_c56-0045" sID="ox-ms_abinger_c56-0045"/>
<graphic url="http://shelleygodwinarchive.org/images/ox/ms_abinger_c56/ms_abinger_c56-0045.jp2"/>
<zone type="main" sID="c56-0045__main"/>
<lb n="c56-0045__main__17"/>
<del rend="strikethrough" sID="c56-0045__main__d2e9811"/>But how<del eID="c56-0045__main__d2e9811"/> How can I describe
my <lb n="c56-0045__main__18"/> emotion at this catastrophe; or how
<w ana="start"/>deli<lb n="c56-0045__main__19"/>neate<w ana="end"/>
the wretch whom with such <lb n="c56-0045__main__20"/> infinite pains and care I had endeavoured <lb n="c56-0045__main__21"/> to form. His limbs were in proportion <lb n="c56-0045__main__22"/> and I had selected his features <del rend="strikethrough" sID="c56-0045__main__d2e9830"/>h<del eID="c56-0045__main__d2e9830"/> as <lb n="c56-0045__main__23"/>
<mod sID="c56-0045__main__d2e9835"/>
<del rend="strikethrough" sID="c56-0045__main__d2e9837"/>handsome<del eID="c56-0045__main__d2e9837"/>
<mdel>.</mdel>
<anchor xml:id="c56-0045.01"/>
<zone corresp="#c56-0045.01" type="left_margin" sID="c56-0045__left_margin"/>
<lb n="c56-0045__left_margin__1"/>
<add sID="c56-0045__left_margin__d2e9849"/>
<mod sID="c56-0045__left_margin__d2e9851"/>
<del rend="strikethrough" sID="c56-0045__left_margin__d2e9853"/>handsome<del eID="c56-0045__left_margin__d2e9853"/>
<add hand="#pbs" place="superlinear" sID="c56-0045__left_margin__d2e9856"/>beautiful.<add eID="c56-0045__left_margin__d2e9856"/>
<mod eID="c56-0045__left_margin__d2e9851"/>
<add eID="c56-0045__left_margin__d2e9849"/>
<zone eID="c56-0045__left_margin"/>
<mod eID="c56-0045__main__d2e9835"/>
<mod sID="c56-0045__main__d2e9863"/>
<del rend="strikethrough" sID="c56-0045__main__d2e9865"/>Handsome<del eID="c56-0045__main__d2e9865"/>
<add hand="#pbs" place="superlinear" sID="c56-0045__main__d2e9868"/>Beautiful<add eID="c56-0045__main__d2e9868"/>
<mod eID="c56-0045__main__d2e9863"/>; Great God! His <lb n="c56-0045__main__24"/>
Gothenburg model: a model for computer-assisted collation, developed at a 2009 workshop of collateX and Juxta developers.
Tokenization :
Break the text down into the smallest units of comparison (words, with punctuation, or character by character):
FV tokenizes words and includes punctuation and tags:
'<del>the', 'frame', 'on', 'whic<del>', 'my', 'man', 'completeed,.'
Normalization
'&' = 'and'
<p xml:id="novel1_letter4_div4_p2"> = <p/>
Alignment
Identify comparable divergence: what makes text sequences comparable units?
“Chunking” text into comparable passages (chapters/paragraphs that line up with identifiable start and end points). Collation proceeds chunk by chunk (see the sketch after this list).
Analysis
Study the output, correct it, and re-align after the machine process, AND refine the automated processing
Visualization:
Critical edition interface, graph displays
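To make the alignment stage concrete, here is a minimal sketch of chunk-by-chunk collation. All names and the tiny witness texts are invented for illustration; this is not the project's actual code.

# Invented illustration of chunk-by-chunk alignment: each witness is
# pre-divided at matching structural boundaries (here, a chapter id),
# so the collator only compares passages already known to correspond.
collation_units = {
    'C10': {'f1818': 'the spot, and endeavoured',
            'fMS': 'the spot & endeavoured'},
}

def collate_unit(unit_id, witnesses):
    # Stand-in for a real collation run (e.g. handing the chunk to
    # collateX); here we only report what would be compared.
    print(unit_id, '->', sorted(witnesses))

for unit_id, witnesses in collation_units.items():
    collate_unit(unit_id, witnesses)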
With witnesses prepared and inspected, we turn to the Python script that tokenizes and normalizes the source files, preparing the witnesses to be compared by collation software.
warp: sets tension or looseness of weave:
normalizing the text "thread" to tell us how to "pull" it
weft: moves through the warp threads cross-wise:
establishes moments of alignment across the text threads
You need to be able to modify this
def tokenize(inputFile):
    return regexLeadingBlankLine.sub('\n', regexBlankLine.sub('\n',
        extract(inputFile))).split('\n')
replaces runs of `\n` with a single `\n` and separates word and element tokens with newlines. The actual tokens are built up by the extract() function...
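The two regex constants are defined elsewhere in the script; a plausible minimal reconstruction (an assumption, not FV's exact definitions):

import re

# Assumed definitions (not shown on the slide): collapse runs of blank
# lines so that split('\n') yields exactly one token per line.
regexBlankLine = re.compile(r'\n{2,}')        # runs of newlines
regexLeadingBlankLine = re.compile(r'^\n+')   # blank line(s) at the start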
from xml.dom import pulldom

def extract(input_xml):
    """Process entire input XML document, firing on events"""
    # Start pulling; it continues automatically
    doc = pulldom.parse(input_xml)
    output = ''
    for event, node in doc:
        # elements to ignore: xml
        if event == pulldom.START_ELEMENT and node.localName in ignore:
            continue
        if event == pulldom.START_ELEMENT and node.localName in inlineVariationEvent:
            doc.expandNode(node)
            output += '\n' + node.toxml() + '\n'
        elif event == pulldom.START_ELEMENT and node.localName in blockEmpty:
            output += '\n' + node.toxml() + '\n'
        # ebb: empty inline elements that do not take surrounding white spaces:
        elif event == pulldom.START_ELEMENT and node.localName in inlineEmpty:
            output += node.toxml()
        elif event == pulldom.START_ELEMENT and node.localName in inlineContent:
            output += '\n' + regexEmptyTag.sub('>', node.toxml())
        elif event == pulldom.END_ELEMENT and node.localName in inlineContent:
            output += '</' + node.localName + '>' + '\n'
        elif event == pulldom.CHARACTERS:
            output += normalizeSpace(node.data)
        else:
            continue
    return output
ignore = ['mod', 'sourceDoc', 'xml', 'comment', 'anchor', 'include',
'delSpan', 'addSpan','handShift', 'damage',
'restore', 'zone', 'surface',
'graphic', 'unclear', 'retrace']
blockEmpty = ['pb', 'p', 'div', 'milestone',
'lg', 'l', 'cit', 'quote',
'bibl', 'ab', 'head']
inlineEmpty = ['sga-add', 'lb', 'gap', 'hi', 'w']
inlineContent = ['del-INNER', 'add-INNER', 'metamark',
'mdel', 'shi']
inlineVariationEvent = ['del', 'add', 'note']
Prepare the warp: create lists of XML element names for special treatment
How should the collation software read the tokens and understand sameness?
<!-- Should punctuation be ignored? -->
<xsl:param name="tan:ignore-punctuation-differences" as="xs:boolean" select="false()"/>
<xsl:param name="additional-batch-replacements" as="element()*">
<!--ebb: normalizations to batch process for collation. NOTE: We want to do these to preserve some markup
    in the output for post-processing to reconstruct the edition files.
    Remember, these will be processed in order, so watch out for conflicts. -->
<replace pattern="(<.+?>\s*)>" replacement="$1"
message="normalizing away extra right angle brackets"/>
<replace pattern="&" replacement="and"
message="ampersand batch replacement"/>
<replace pattern="</?xml>" replacement=""
message="xml tag replacement"/>
<replace pattern="(<p)\s+.+?(/>)" replacement="$1$2"
message="p-tag batch replacement"/>
<replace pattern="(<)(metamark).*?(>).+?\1/\2\3" replacement=""
message="metamark batch replacement"/>
<!--ebb: metamark contains a text node, and we don't want its
contents processed in the collation, so this captures the entire element. -->
<replace pattern="(</?)m(del).*?(>)"
replacement="$1$2$3" message="mdel-SGA batch replacement"/>
<!--ebb: mdel contains a text node, so this catches both start and end tag.
We want mdel to be processed as <del>...</del>-->
<replace pattern="</?damage.*?>"
replacement="" message="damage-SGA batch replacement"/>
<!--ebb: damage contains a text node, so this catches both start and end tag. -->
<replace pattern="</?unclear.*?>" replacement=""
message="unclear-SGA batch replacement"/>
<!--ebb: unclear contains a text node, so this catches both start and end tag. -->
<replace pattern="</?retrace.*?>" replacement=""
message="retrace-SGA batch replacement"/>
<!--ebb: retrace contains a text node, so this catches both start and end tag. -->
See Joel Kalvesmaki's Balisage papers on tan:diff (2021) and tan:collate (2022, this week).
import re

RE_PARA = re.compile(r'<p\s.+?/>')
RE_INCLUDE = re.compile(r'<include.*?/>')
RE_HEAD = re.compile(r'<head.*?/>')
RE_AB = re.compile(r'<ab.*?/>')
RE_ADDEND = re.compile(r'</add>')
RE_NOTE_START = re.compile(r'<note.*?>')
RE_NOTE_END = re.compile(r'</note>')
RE_DELSTART = re.compile(r'<del.*?>')
RE_DELEND = re.compile(r'</del>')
RE_SGA_ADDSTART = re.compile(r'<sga-add.+?sID.+?>')
RE_SGA_ADDEND = re.compile(r'<sga-add.+?eID.+?>')
RE_MDEL = re.compile(r'<mdel.*?>.+?</mdel>')
RE_SHI = re.compile(r'<shi.*?>.+?</shi>')
RE_METAMARK = re.compile(r'<metamark.*?>.+?</metamark>')
RE_HI = re.compile(r'<hi\s.+?/>')
RE_PB = re.compile(r'<pb.*?/>')
RE_LB = re.compile(r'<lb.*?/>')
RE_LG = re.compile(r'<lg[^<]*/>')
RE_L = re.compile(r'<l\s[^<]*/>')
RE_CIT = re.compile(r'<cit\s[^<]*/>')
RE_QUOTE = re.compile(r'<quote\s[^<]*/>')
RE_OPENQT = re.compile(r'“')
RE_CLOSEQT = re.compile(r'”')
RE_GAP = re.compile(r'<gap\s[^<]*/>')
RE_sgaP = re.compile(r'<milestone[^<]+?unit="tei:p.+?/>')
RE_MILESTONE = re.compile(r'<milestone.+?>')
RE_MULTI_LEFTANGLE = re.compile(r'<{2,}')
RE_MULTI_RIGHTANGLE = re.compile(r'>{2,}')
# Not on the original slide, but needed by normalize() below; these
# three definitions are plausible reconstructions, not FV's exact code:
RE_AMP = re.compile(r'&')
RE_ADDSTART = re.compile(r'<add(?:\s[^<]*)?>')
RE_MOD = re.compile(r'</?mod.*?>')
def normalize(inputText):
    # Rewrite or strip markup tokens, collapse doubled angle brackets,
    # normalize quotes and ampersands, then lowercase everything.
    return RE_MULTI_LEFTANGLE.sub('<', \
        RE_MULTI_RIGHTANGLE.sub('>', \
        RE_INCLUDE.sub('', \
        RE_AB.sub('', \
        RE_HEAD.sub('', \
        RE_AMP.sub('and', \
        RE_MDEL.sub('', \
        RE_SHI.sub('', \
        RE_HI.sub('', \
        RE_LB.sub('', \
        RE_PB.sub('', \
        RE_PARA.sub('<p/>', \
        RE_sgaP.sub('<p/>', \
        RE_MILESTONE.sub('', \
        RE_LG.sub('<lg/>', \
        RE_L.sub('<l/>', \
        RE_CIT.sub('', \
        RE_QUOTE.sub('', \
        RE_OPENQT.sub('"', \
        RE_CLOSEQT.sub('"', \
        RE_GAP.sub('', \
        RE_DELSTART.sub('<delstart/>', \
        RE_DELEND.sub('<delend/>', \
        RE_ADDSTART.sub('<addstart/>', \
        RE_ADDEND.sub('<addend/>', \
        RE_MOD.sub('', \
        RE_METAMARK.sub('', inputText))))))))))))))))))))))))))).lower()
Find and replace the regex patterns before feeding to collateX
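For context, a minimal sketch of handing tokens to collateX via the Python collatex package. The two-witness input is invented for illustration: each token carries its literal form "t" and its normalized form "n", and collateX aligns on "n" while reporting "t".

from collatex import collate

# Invented two-witness example: collateX matches the witnesses on the
# normalized "n" values while preserving the literal "t" forms.
json_input = {"witnesses": [
    {"id": "f1818",
     "tokens": [{"t": "spot, ", "n": "spot"}, {"t": "and ", "n": "and"}]},
    {"id": "fMS",
     "tokens": [{"t": "spot ", "n": "spot"}, {"t": "& ", "n": "and"}]},
]}

print(collate(json_input, segmentation=False))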
<app>
<rdgGrp n="['spot,', 'and', 'endeavoured,']">
<rdg wit="f1818">spot, and endeavoured, </rdg>
<rdg wit="f1823">spot, and endeavoured, </rdg>
<rdg wit="fThomas">spot, and endeavoured, </rdg>
<rdg wit="f1831">spot, and endeavoured, </rdg>
</rdgGrp>
<rdgGrp n="['spotand', 'endeavoured']">
<rdg wit="fMS">spot& endeavoured </rdg>
</rdgGrp>
</app>
. . . the spot<add eID="c57-0117__main__d3e21951"/> & endeavoured . . .
In the fMS source:
if event == pulldom.START_ELEMENT and node.localName in inlineEmpty:
    output += '\n' + node.toxml() + '\n'
Adding newline characters around markup nodes corrected this output...
<app>
<rdgGrp n="['spot', '<addend/>']">
<rdg wit="fMS">spot
<add eID="c57-0117__main__d3e21951"/> </rdg>
</rdgGrp>
<rdgGrp n="['spot,']">
<rdg wit="f1818">spot, </rdg>
<rdg wit="f1823">spot, </rdg>
<rdg wit="fThomas">spot, </rdg>
<rdg wit="f1831">spot, </rdg>
</rdgGrp>
</app>
<app>
<rdgGrp n="['and']">
<rdg wit="f1818">and </rdg>
<rdg wit="f1823">and </rdg>
<rdg wit="fThomas">and </rdg>
<rdg wit="f1831">and </rdg>
<rdg wit="fMS">& </rdg>
</rdgGrp>
</app>
but created a new problem...
<app>
<rdgGrp n="['for', 'there']">
<rdg wit="f1818">for there </rdg>
<rdg wit="f1823">for there </rdg>
<rdg wit="fThomas">for there </rdg>
<rdg wit="f1831">for there </rdg>
<rdg wit="fMS">for there </rdg>
</rdgGrp>
</app>
<app>
<rdgGrp n="['', '', '<addstart/>']">
<rdg wit="fMS"><lb n="c57-0118__main__4"/>
<lb n="c57-0118__left_margin__1"/> <add hand="#pbs"
sID="c57-0118__left_margin__d3e21996"/> </rdg>
</rdgGrp>
</app>
<app>
<rdgGrp n="['was']">
<rdg wit="f1818">was </rdg>
<rdg wit="f1823">was </rdg>
<rdg wit="fThomas">was </rdg>
<rdg wit="f1831">was </rdg>
<rdg wit="fMS">was </rdg>
</rdgGrp>
</app>
<app>
<rdgGrp n="['<addend/>']">
<rdg wit="fMS"><add eID="c57-0118__left_margin__d3e21996"/> </rdg>
</rdgGrp>
</app>
<app>
<rdgGrp n="['no', 'sign', 'of', 'any']">
<rdg wit="f1818">no sign of any </rdg>
<rdg wit="f1823">no sign of any </rdg>
<rdg wit="fThomas">no sign of any </rdg>
<rdg wit="f1831">no sign of any </rdg>
<rdg wit="fMS">no sign of any </rdg>
</rdgGrp>
</app>
<app>
<rdgGrp n="['violence']">
<rdg wit="fMS">violence </rdg>
</rdgGrp>
<rdgGrp n="['violence,']">
<rdg wit="f1818">violence, </rdg>
<rdg wit="f1823">violence, </rdg>
<rdg wit="fThomas">violence, </rdg>
<rdg wit="f1831">violence, </rdg>
</rdgGrp>
</app>
When to make the interventions?
Making a “stand-off” Spine (info + pointers to collation data)
Generating the edition files with collation data marked “inline”
Tokenization and alignment are a research method.
There will be more than one way to express it.
Your alignment decisions express your theory of your text.
Reflections on machine-assisted collation and the TEI critical apparatus
We (scholars and programmers, and scholarly programmers) need to share our algorithms for tokenization and alignment, together with the editions and interfaces we design with them.
A heavily revised passage, showing the MS notebook view
“Heatmap” view, showing variation intensity as blocks with circles color-coded by edition. Selecting a circle on the heatmap view displays the edition and its variants.
Legend
MS
1818
Thm
1823
1831
Alignments, gaps, and comparative lengths of each collation unit
Black box: chapter heading or other structural boundary
Mouse over a black box...
Strengths
Weaknesses
Solutions
By Elisa Beshero-Bondar
The word by word, comma by comma, and sometimes tag by tag comparison of manuscripts and editions (called “collation”) is notoriously tedious and error-prone. But computer-aided collation is like a power loom that inevitably tangles up threads caught in the machinery. We need new tooling to help us unsnarl the threads. To this point, we have aligned variant passages in the Frankenstein Variorum project using a Python script to feed collateX. Now we are experimenting with the Text Alignment Network's tan:diff XSLT to handle the string comparison completely with XPath and XSLT. How far can we take XSLT and Schematron in automating the preparation, collation, and correction of electronic editions?