Declarative markup in the time of “AI”

Controlling the semantics of tokenized strings

Elisa Beshero-Bondar

GitHub: @ebeshero | Mastodon: epyllia@indieweb.social

Balisage 2023

Link to these slides: https://bit.ly/declare-string

(a paper responding to a prediction that large language models will render descriptive markup unnecessary.)

Natural Language Processing
- process text as sequence of tokens
- measure clusters, co-occurrences
Text Encoding
- OHCO (express structure)
- overt declaration
- usually structural nested containers
- can mark tokenized grams
  - linguistics markup (word-by-word with attributes)
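The NLP side of this contrast, treating a text as a token sequence and counting which tokens appear near each other, can be sketched in a few lines. This is a minimal illustration only: the function name `cooccurrences` and the window size are assumptions made for this example, not part of any project code.

```python
from collections import Counter


def cooccurrences(tokens, window=2):
    """Count unordered token pairs that occur within `window` positions of each other."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # Pair each token with the next `window` tokens that follow it.
        for right in tokens[i + 1 : i + 1 + window]:
            counts[tuple(sorted((left, right)))] += 1
    return counts


tokens = "it was on a dreary night of november that i beheld".split()
pairs = cooccurrences(tokens, window=2)
print(pairs.most_common(5))
```

Nothing here is declared about the text: the only structure is token order and proximity, which is the point of the comparison with overt markup.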

How do we study texts in Digital Humanities?

John Tenniel's illustration of Humpty Dumpty talking to Alice about words

Text generation by approximation
- moving context windows
- training data
- stochastic process (random sampling from probabilities learned from the training data)
- stats-based "word math": word embeddings, vector distance
Some intrepid digital humanists learn how to train their own models.
All of us tinker with an enormous one from Google, Meta, or OpenAI, or one built into our code editor or word processor:
- good for code correction
- content completion
- writing stuff for you
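The "word math" mentioned above usually comes down to measuring the angle between word vectors. Here is a minimal sketch of cosine similarity, assuming tiny made-up three-dimensional vectors: the values in `toy_embeddings` are invented for illustration and do not come from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: real models use hundreds of dimensions.
toy_embeddings = {
    "dreary": [0.9, 0.1, 0.3],
    "gloomy": [0.8, 0.2, 0.4],
    "november": [0.1, 0.9, 0.2],
}

near = cosine_similarity(toy_embeddings["dreary"], toy_embeddings["gloomy"])
far = cosine_similarity(toy_embeddings["dreary"], toy_embeddings["november"])
print(near, far)
```

With these invented numbers, "dreary" sits closer to "gloomy" than to "november", which is the kind of distance judgment that stands in for semantics in these models.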

NLP and Large Language Models

John Tenniel's illustration of Humpty Dumpty talking to Alice about word semantics

Can the new “AI” collate texts?

Maybe it's optimal for this...
- An LLM operates on word tokens and word-embedding data
- Can it make pair-wise comparisons, based on its tokenizing algorithm?
  - Can it reliably identify differences in pairs of strings? (A to B, A to C, A to D, B to C, B to D, etc.)
  - Can it organize same/similar versions of a text in a group?
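For contrast, the pair-wise comparison being asked of the LLM can be done deterministically with Python's standard library. This sketch is not the Frankenstein Variorum's collation method: the witness readings are simplified for illustration, and the helper name `pairwise_differences` is an assumption.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Simplified readings (whitespace-tokenized) standing in for full witnesses.
witnesses = {
    "MS": "It was on a dreary night of November that I beheld my man compleated".split(),
    "1818": "It was on a dreary night of November, that I beheld the accomplishment of my toils.".split(),
    "1831": "It was on a dreary night of November, that I beheld the accomplishment of my toils.".split(),
}


def pairwise_differences(witnesses):
    """Compare every pair of witnesses and keep only the non-matching spans."""
    report = {}
    for (name_a, tok_a), (name_b, tok_b) in combinations(witnesses.items(), 2):
        matcher = SequenceMatcher(a=tok_a, b=tok_b, autojunk=False)
        report[(name_a, name_b)] = [
            (tag, tok_a[i1:i2], tok_b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"
        ]
    return report


report = pairwise_differences(witnesses)
for pair, differences in report.items():
    print(pair, differences)
```

Pairs with an empty difference list are candidates for the same group, which is the second question in the list above.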

Let's test this...

	<lb n="c56-0045__main__2"/>It was on a dreary night of November
	<lb n="c56-0045__main__3"/>that I beheld <del rend="strikethrough"
	xml:id="c56-0045__main__d5e9572">
	<add hand="#pbs" place="superlinear" xml:id="c56-0045__main__d5e9574">the frame on
	whic</add></del> my man comple<del>at</del>
	<add place="intralinear" xml:id="c56-0045__main__d5e9582">te</add>
	<add xml:id="c56-0045__main__d5e9585">ed</add>

	["It", "was", "on", "a", "dreary",
	"night", "of", "November", "that",
	"I", "beheld",
	"<del>the frame on whic</del>",
	"my", "man",
	"comple", "<del>at</del>", "teed"]

How do we study texts in Digital Humanities?

NLP and Large Language Models

Text Encoding + NLP methods

(a little unusual, but not new)

Special case: Collating XML documents for the Frankenstein Variorum

Including some markup in the comparison

Normalized strings to compare

Tokenize them!

Nodes on the other side of collation

Real output from the project

Achieving the perfect collation is so. much. work.

How would the language-model AIs handle string collation?

Isn't this what AI is supposed to help us with?

Can the new “AI” collate texts?

Let's test this...

Short test

Successful result should identify four differences

First result: a nice data table

Summary

Most promising results: Claude.ai

Remember...
Successful result should identify four differences

Claude's collation performance

Troubling things about LLMs attempting collation

Q: So, can the new “AI” collate texts?

...But that wasn't really the point of this paper!

A: not really, not validly, not reasonably, not accurately, like a drunk person...

Conjectures about text-generative AI

More conjectures. . .

Let's take a moment to trace how some declarative markup moves through imperative processing...

How declarative methods take control in our project’s XML and Python (1)

How declarative methods take control in our project’s XML and Python (2)

How declarative methods take control in our project’s XML and Python (3)

The editors of the Frankenstein Variorum edition declare

Output declaration

What I learned...

A call for declarative methods in today's AI

Is it possible?

	<p xml:id="novel1_letter4_chapter4_div4_div4_p1">I<hi>T</hi> was on a dreary
	night of November, that I beheld the accomplishment of my toils.</p>

	It was on a dreary night of November that I beheld
	<del>the frame on whic</del> my man
	comple<del>at</del>teed

	<p>IT was on a dreary
	night of November, that I beheld
	the accomplishment of my toils.</p>

	["<p>", "IT", "was", "on", "a", "dreary",
	"night", "of", "November,", "that", "I", "beheld",
	"the", "accomplishment", "of", "my", "toils.", "</p>"]
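The step from the normalized strings to the token lists shown above can be sketched with a single regular expression. This is a minimal illustration, assuming that only `<del>` needs to be kept whole as a long token: the function name `tokenize` and the pattern are assumptions for this example, not the project's tokenizer.

```python
import re

# Order matters: try a whole <del>...</del> element first, then any single tag,
# then a run of characters that is neither whitespace nor the start of a tag.
TOKEN_PATTERN = re.compile(r"<del>.*?</del>|<[^>]+>|[^\s<]+")


def tokenize(normalized):
    """Split a normalized string into word tokens, tag tokens, and whole <del> tokens."""
    single_line = " ".join(normalized.split())  # collapse line breaks and runs of spaces
    return TOKEN_PATTERN.findall(single_line)


manuscript = """It was on a dreary night of November that I beheld
<del>the frame on whic</del> my man
comple<del>at</del>teed"""

print_edition = """<p>IT was on a dreary
night of November, that I beheld
the accomplishment of my toils.</p>"""

ms_tokens = tokenize(manuscript)
print_tokens = tokenize(print_edition)
print(ms_tokens)
print(print_tokens)
```

Because the deleted passage stays in one token, a collation tool compares it as a single unit instead of aligning its words against unrelated words in the print editions.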

	<app>
	<rdgGrp n="['that', 'i', 'beheld']">
	<rdg wit="f1818">that I beheld</rdg>
	<rdg wit="f1823">that I beheld</rdg>
	<rdg wit="fThomas">that I beheld</rdg>
	<rdg wit="f1831">that I beheld</rdg>
	<rdg wit="fMS"><lb n="c56-0045__main__3"/>that I beheld</rdg>
	</rdgGrp>
	</app>
	<app>
	<rdgGrp n="['<del> the frame on whic</del>',
	'my', 'man', 'comple',
	'', '<mdel>at</mdel>', 'te', 'ed',
	',', '.', '<del>and</del>']">
	<rdg wit="fMS"><del rend="strikethrough"
	xml:id="c56-0045__main__d5e9572">
	<sga-add hand="#pbs" place="superlinear"
	sID="c56-0045__main__d5e9574"/>the
	frame on whic <sga-add eID="c56-0045__main__d5e9574"/> </del> my man
	comple <mod sID="c56-0045__main__d5e9578"/>
	<mdel>at</mdel>
	<sga-add place="intralinear" sID="c56-0045__main__d5e9582"/>te
	<sga-add eID="c56-0045__main__d5e9582"/>
	<sga-add sID="c56-0045__main__d5e9585"/>ed
	<sga-add eID="c56-0045__main__d5e9585"/>
	<mod eID="c56-0045__main__d5e9578"/>
	<sga-add hand="#pbs" place="intralinear" sID="c56-0045__main__d5e9588"/>,
	<sga-add eID="c56-0045__main__d5e9588"/>.
	<del rend="strikethrough"
	xml:id="c56-0045__main__d5e9591">And</del></rdg>
	</rdgGrp>
	<rdgGrp n="['the', 'accomplishment', 'of', 'my', 'toils.']">
	<rdg wit="f1818">the accomplishment of my toils.</rdg>
	<rdg wit="f1823">the accomplishment of my toils.</rdg>
	<rdg wit="fThomas">the accomplishment of my toils.</rdg>
	<rdg wit="f1831">the accomplishment of my toils.</rdg>
	</rdgGrp>
	</app>
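The grouping logic behind the `<rdgGrp>` elements above can be sketched as follows: witnesses whose readings normalize to the same token sequence share a group. The normalization rules here (drop empty milestone tags, lowercase, split on whitespace) are assumptions for illustration, and the project's actual normalization is more elaborate.

```python
import re
from collections import defaultdict


def normalize(reading):
    """Drop empty milestone tags such as <lb .../>, lowercase, and split on whitespace."""
    without_milestones = re.sub(r"<[^>]+/>", " ", reading)
    return tuple(without_milestones.lower().split())


def group_readings(readings):
    """Map each normalized token sequence to the witnesses that share it."""
    groups = defaultdict(list)
    for witness, reading in readings.items():
        groups[normalize(reading)].append(witness)
    return dict(groups)


readings = {
    "f1818": "that I beheld",
    "f1823": "that I beheld",
    "fThomas": "that I beheld",
    "f1831": "that I beheld",
    "fMS": '<lb n="c56-0045__main__3"/>that I beheld',
}

groups = group_readings(readings)
print(groups)
```

The manuscript reading keeps its `<lb/>` milestone in the output, but the milestone is declared irrelevant to the comparison, so all five witnesses land in one group.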

	+---------------------+---------------------------+-------------------------+
	| Manuscript          | 1818 edition              | 1831 edition            |
	+---------------------+---------------------------+-------------------------+
	| It was on a dreary  | It was on a dreary        | It was on a dreary      |
	| night of November   | night of November         | night of November       |
	| that I beheld the   | that I beheld the         | that I beheld the       |
	| frame on which my   | accomplishment of my      | accomplishment of my    |
	| man compleated.     | toils.                    | toils.                  |
	| And with an anxiety | With an anxiety that      | With an anxiety that    |
	| that almost amounted| almost amounted to agony, | almost amounted to agony|
	| to agony I collected| I collected the           | I collected the         |
	| instruments of life | instruments of life       | instruments of life     |
	| around me that I    | around me,                | around me,              |
	| might infuse a spark| that I might infuse a     | that I might infuse a   |
	| of being into the   | spark of being into the   | spark of being into the |
	| lifeless thing that | lifeless thing that lay   | lifeless thing that lay |
	| lay at my feet.     | at my feet.               | at my feet.             |
	+---------------------+---------------------------+-------------------------+

	<p>It was on a dreary night of November, that I beheld the
	<app>
	<rdg wit="#MS">frame on which my man compleated</rdg>
	<rdg wit="#1818 #1831">accomplishment of my toils</rdg>.
	</app>
	With an anxiety that almost amounted to agony, I collected the instruments of
	life around me, that I might infuse a spark of being into the lifeless thing that
	lay at my feet.</p>

	<p>It was on a dreary night of November<app>
	<rdg wit="#MS">,</rdg>
	<rdg wit="#1818 #1831" />
	</app> that I beheld the
	<app>
	<rdg wit="#MS">frame on which my man compleated</rdg>
	<rdg wit="#1818 #1831">accomplishment of my toils</rdg>.
	</app>
	With an anxiety that almost amounted to agony,
	I collected the instruments of life around me,
	that I might infuse a spark of being into the
	lifeless thing that lay at my feet.</p>

	inlineVariationEvent = ['head', 'del', 'mdel', 'add', 'note', 'longToken']

	ignore = ['sourceDoc', 'xml', 'comment', 'include',
	'addSpan', 'handShift', 'damage',
	'unclear', 'restore', 'surface', 'zone', 'retrace']

	from xml.dom import pulldom
	# blockEmpty, inlineEmpty, inlineContent, regexEmptyTag, and normalizeSpace are
	# not shown on this slide; they are assumed to be defined elsewhere in the script.
	def extract(input_xml):
	    """Process entire input XML document, firing on events"""
	    doc = pulldom.parse(input_xml)
	    output = ''
	    for event, node in doc:
	        if event == pulldom.START_ELEMENT and node.localName in ignore:
	            continue
	        # ebb: The following handles our longToken and longToken-style elements:
	        # complete element nodes surrounded by newline characters
	        # to make a long complete token:
	        if event == pulldom.START_ELEMENT and node.localName in inlineVariationEvent:
	            doc.expandNode(node)
	            output += '\n' + node.toxml() + '\n'
	            # stops the problem of forming tokens that fuse element tags to words.
	        elif event == pulldom.START_ELEMENT and node.localName in blockEmpty:
	            output += '\n' + node.toxml() + '\n'
	        # ebb: empty inline elements that do not take surrounding white spaces:
	        elif event == pulldom.START_ELEMENT and node.localName in inlineEmpty:
	            output += node.toxml()
	        # non-empty inline elements: mdel, shi, metamark
	        elif event == pulldom.START_ELEMENT and node.localName in inlineContent:
	            output += '\n' + regexEmptyTag.sub('>', node.toxml())
	            # output += '\n' + node.toxml()
	        elif event == pulldom.END_ELEMENT and node.localName in inlineContent:
	            output += '</' + node.localName + '>' + '\n'
	        # elif event == pulldom.START_ELEMENT and node.localName in blockElement:
	        #     output += '\n<' + node.localName + '>\n'
	        # elif event == pulldom.END_ELEMENT and node.localName in blockElement:
	        #     output += '\n</' + node.localName + '>'
	        elif event == pulldom.CHARACTERS:
	            # output += fixToken(normalizeSpace(node.data))
	            output += normalizeSpace(node.data)
	        else:
	            continue
	    return output
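To see the long-token branch of `extract` in isolation, here is a cut-down, runnable variant that handles only that case. It is a sketch under stated assumptions: the name `isolate_long_tokens` and the two-element list `LONG_TOKEN_ELEMENTS` are hypothetical stand-ins for the project's function and its `inlineVariationEvent` list, and the sample input is a simplified fragment.

```python
from xml.dom import pulldom

# Hypothetical subset of the elements the editors declare as long tokens.
LONG_TOKEN_ELEMENTS = ["del", "add"]


def isolate_long_tokens(xml_string):
    """Put each declared element, with all of its content, on a line of its own."""
    doc = pulldom.parseString(xml_string)
    output = ""
    for event, node in doc:
        if event == pulldom.START_ELEMENT and node.localName in LONG_TOKEN_ELEMENTS:
            # expandNode pulls the element's children into the node and consumes
            # the stream through the element's end tag.
            doc.expandNode(node)
            output += "\n" + node.toxml() + "\n"
        elif event == pulldom.CHARACTERS:
            output += node.data
    return output


sample = "<p>that I beheld <del>the frame on whic</del> my man</p>"
result = isolate_long_tokens(sample)
print(result)
```

The newline characters are what later keep the tokenizer from fusing tags to neighboring words: the editors' declaration of which elements count as single events is carried through the imperative loop unchanged.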

	<app>
	<rdgGrp n="['<del>to his statement, which was delivered</del>',
	'to him with interest for he spoke']">
	<rdg wit="fThomas"><del rend="strikethrough">to his statement,
	which was delivered</del> <add>to him with interest
	for he spoke</add></rdg>
	</rdgGrp>
	<rdgGrp n="['to his statement, which was delivered']">
	<rdg wit="f1818"><longToken>to his statement, which was
	delivered</longToken></rdg>
	<rdg wit="f1823"><longToken>to his statement, which was
	delivered</longToken></rdg>
	<rdg wit="f1831"><longToken>to his statement, which was
	delivered</longToken></rdg>
	</rdgGrp>
	</app>