The Internationalization of the Text Encoding Initiative

and our understanding of text

Elisa Beshero-Bondar, PhD

Chair, TEI Technical Council | Professor of Digital Humanities, Penn State Behrend

Keynote for Digital Humanities and the Power of Collaboration: Expanding Connections from Local to Global Symposium at Fukuoka University, Japan, 2025-03-10

Thank you for inviting me to speak!

今日はご招待いただきありがとうございます。

Kyō wa go shōtai itadaki arigatōgozaimasu.

Topics of this presentation

How has multicultural and multilingual research changed the TEI?
Reconciling structural differences in TEI
Writing systems that challenge the limits of computer technology and the TEI

How has multicultural and multilingual research changed the TEI?

encoding of units of measure
encoding of ruby annotations
encoding of sex and gender
encoding of "born-digital" works

Text Encoding Initiative (TEI)

Interoperation: Can texts that are prepared for machine processing in one computer system be understood by other systems?

Interchange: Can the machine-readable parts of the texts be understood by humans, who can work with them as needed, without additional information?

An international community

A set of shared Guidelines for encoding machine-readable texts
Grounded in humanities and social sciences / cultural heritage texts
Originates in 1987, formalized by 1994
Founding priority: Guidelines for Text Encoding and Interchange (another possible meaning for "i" in the TEI)

Text Encoding Initiative (TEI)

Big community around the world; features an annual conference and an academic journal, as well as the TEI Guidelines for text encoding.
- 2024 Conference in Buenos Aires: primary language was Spanish
TEI tags are written by English speakers: Is this a problem for non-English speakers?
- Acceptance around the world of English as the lingua franca for digital communities
- Non-English speaking communities request and contribute explanations of the tags in their own languages...
- ...but nearly always(?) want to just use the same tag names as everyone else.
- This reduces confusion over the interpretation of elements and attributes whose tag names do not translate clearly into other languages.
Lately: new emphasis on internationalizing the Guidelines
Lately: new interest in encoding strategies for vertical and right-to-left languages

ISO: an organization for making international standards, connected to systems that we rely on to be interoperable

An ISO standard for machine-readable dates is important as an option for TEI attribute values
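
A minimal sketch of what a machine-readable date offers. The slide does not name the standard, so this example assumes ISO 8601 calendar dates and the TEI <date> element with a @when attribute. The helper name tei_date is hypothetical, and the TEI namespace is left out for brevity.

```python
import datetime
import xml.etree.ElementTree as ET


def tei_date(iso_value: str, display_text: str) -> ET.Element:
    """Build a <date> element whose @when holds a machine-readable date."""
    # date.fromisoformat raises ValueError when the string is not an
    # ISO 8601 calendar date it can parse, so malformed values are
    # rejected before they reach the encoding.
    datetime.date.fromisoformat(iso_value)
    element = ET.Element("date", {"when": iso_value})
    element.text = display_text
    return element


symposium = tei_date("2025-03-10", "March 10, 2025")
print(ET.tostring(symposium, encoding="unicode"))
```

The human-readable text and the machine-readable value sit side by side, so a processor can sort or compare dates without parsing prose.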

How do we define a measurement of time?

ISO attempts to set precise standards based on measurable physical properties of our universe. According to ISO, a "second" in time is defined thus:

The second is the duration of 9,192,631,770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of the caesium-133 atom.

Contemporary atomic physics prevails in this measurement. Is it applicable to explanations of time duration from past centuries?

An atomic clock in a physics lab can consistently measure the passage of one second by observing caesium atoms. This is more precise than watching a spring-and-weight driven watch or pendulum clock.

A problem with encoding measurements in the TEI

Before 2017, the examples in the TEI Guidelines for encoding units of measure referred them to a defined, universal ISO standard.
Naoki Kokaze, then a graduate student at the University of Tokyo, addressed the TEI Conference in 2017 with a poster presentation.
His project required documenting now-obsolete local Japanese units of measure, from particular regions and villages, and decoding them in relation to other local systems.


Naoki Kokaze's ticket: #1707 (2017), resolved in 2019

Naoki Kokaze asked the TEI for a new outlook:
- The TEI should permit encoding of past knowledge systems associated with historic documents.
- Practicing TEI should mean we step away from what our computer systems "know" by default.
- Our computational processing in TEI should not prioritize only a current Western way of knowing and measuring.
Could the TEI create new data structures to allow for defining nonstandard units of measure?
That would allow processing, and calculating equivalences between different nonstandard measuring systems.

The TEI Technical Council worked together with Naoki on a new data structure.
Since July 2019, TEI can express nonstandard historical and local measuring systems
No longer do we assume that units of measurement rely only on ISO definitions.
TEI encoders now have examples of how to define and work with nonstandard measurement systems, as well as standard ones.

Standard weights and measures from ancient Egypt

4 digits = a palm

Digit-al Humanities

Visualizing New Kingdom units of measure (1500 - 1000 BCE) as they relate to one another
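
A minimal sketch of how units in a local system can be defined in relation to one another and then converted, written in Python rather than TEI. "4 digits = a palm" comes from the slide above; "7 palms = a royal cubit" is an added assumption for illustration, and the names UNIT_DEFS, in_base_units, and convert are hypothetical.

```python
from fractions import Fraction

# Each unit is defined by how many of a smaller unit it contains.
# "4 digits = a palm" is from the slide; the royal cubit entry is an
# added assumption for illustration.
UNIT_DEFS = {
    "digit": None,  # base unit of this local system
    "palm": (Fraction(4), "digit"),
    "royal_cubit": (Fraction(7), "palm"),
}


def in_base_units(unit: str) -> Fraction:
    """Return how many base units (digits) make up one of the given unit."""
    definition = UNIT_DEFS[unit]
    if definition is None:
        return Fraction(1)
    factor, smaller_unit = definition
    return factor * in_base_units(smaller_unit)


def convert(quantity, from_unit: str, to_unit: str) -> Fraction:
    """Convert a quantity between two units declared in the same system."""
    return Fraction(quantity) * in_base_units(from_unit) / in_base_units(to_unit)


print(convert(1, "palm", "digit"))          # 4
print(convert(1, "royal_cubit", "digit"))   # 28
print(convert(14, "digit", "royal_cubit"))  # 1/2
```

Nothing here refers to a modern standard unit: the system is described on its own terms, which is the point of the new data structure.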

"Ruby release" of the TEI Guidelines

TEI P5 release 4.2.0 (February 25, 2021)

New elements have been introduced for the encoding of ruby annotations, a particular method of glossing runs of text which is common in East Asian scripts (#2054, with thanks to Kiyonori Nagasaki, Satoru Nakamura, Kazuhiro Okada, Duncan Paterson, and Martin Holmes):
- The ruby element contains a passage of base text along with its associated ruby gloss(es).
- The rb element contains the base text annotated by a ruby gloss.
- The rt element contains a ruby text, an annotation closely associated with a passage of the main text.
- A subsection on Ruby Annotations has been added, which is also referenced from several suitable places in the Guidelines.
With this first take we hope to initiate further discussion and the implementation of additional use cases.
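
A minimal sketch of the three elements working together, built with Python's standard library. The base text 今日 and its reading きょう are taken from the greeting at the start of this talk; the helper name make_ruby is hypothetical, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def make_ruby(base_text: str, gloss: str) -> ET.Element:
    """Wrap a base text (<rb>) and its ruby gloss (<rt>) in a <ruby> element."""
    ruby = ET.Element("ruby")
    rb = ET.SubElement(ruby, "rb")
    rb.text = base_text
    rt = ET.SubElement(ruby, "rt")
    rt.text = gloss
    return ruby


greeting = make_ruby("今日", "きょう")
print(ET.tostring(greeting, encoding="unicode"))
# <ruby><rb>今日</rb><rt>きょう</rt></ruby>
```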

Broad applications of Ruby TEI encoding

Encoding of texts that provide pronunciation guides
Pair logographic signs with phonemes in multiple language systems
Ongoing discussion: Possibly need examples for long annotations?
- See TEI ticket #2601 (opened October 2024)

Revising sex and gender in the TEI Guidelines

An old problem in the TEI and for the digital humanities scholarly community
Another problem with over-reliance on ISO.
TEI used to conform to a now-dated standard, ISO/IEC 5218, which provided numerical values for sex:

0 = Not known;
1 = Male;
2 = Female;
9 = Not applicable.
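
A minimal sketch of what a closed numeric code list looks like to a program. The four codes are the ones listed above; the function name decode_sex_code is hypothetical.

```python
# The four values listed above for ISO/IEC 5218.
ISO_5218 = {
    0: "Not known",
    1: "Male",
    2: "Female",
    9: "Not applicable",
}


def decode_sex_code(code: int) -> str:
    """Look up a code; anything outside the closed list is rejected."""
    if code not in ISO_5218:
        raise ValueError(f"{code!r} is not one of the permitted codes")
    return ISO_5218[code]


print(decode_sex_code(9))  # Not applicable
```

The lookup is simple for a machine, but the vocabulary cannot grow: there is no room for a description that the list did not anticipate.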

Prioritizing simple machine-encoding led to problems for many parts of the TEI Guidelines
TEI's early reliance on the ISO led to problems in the TEI Guidelines over:
- Representation of sex / lack of gender
- Confusion with physical states/traits
- Language of Names/Dates chapter and related elements + attributes
The TEI and the ISO communities have both evolved over decades, and influence each other

TEI's tensions with ISO

Persistent problems/calls for revision

Assumptions in the language about gender / sex in the Guidelines
Calls in the community for something more than an encoding of linguistic morphological gender
- connected with personography and prosopography structures in the Guidelines.
- <persona> available since 2016 as distinct from person.
  - TEI projects are better able to describe invented, performed identities.
- Review state vs. trait encoding and discussion in Names, Dates, People, and Places chapter

Release 4.5.0: ‘The Release of One’s Own’

New encoding features (October 2022)

Sex and gender have been revised in the Guidelines (#2189, #2190), this includes
- The introduction of the new element gender specifying the gender identity of a person, persona, or character.
- The introduction of the new datatype teidata.gender defining the range of attribute values used to represent the gender of a person, persona, or character.
- The revision of the discussion of traits and states in the subsection on Basic Principles.
- Documentation and guidance in various places of Chapter 13 Names, Dates, People, and Places.
- The revision of the elements sex, person, persona as well as the datatype teidata.sex.

What we learned in the process of revision

We create document data models with TEI
Our encoding decisions express a theory of text
- TEI encoding can theorize about how texts respond to culture
  - about sex and gender
  - about class and ethnic and language differences
  - about nature and classification of life
The detailed curation of prosopography can advance a theory of past lives
- e.g. Digital Dinah Craik and Digital Mitford
- e.g. Henry III Fine Rolls (Taxation documents)
- e.g. Carl Maria von Weber Gesamtausgabe
- e.g. Shibusawa Eiichi Diary / Biographical Materials

"Born digital" texts and Computer Mediated Communication (CMC)

New Guidelines Chapter on CMC (July 2024 Release)
Could help to encode:
- written exchanges in chats and forums
- interactions with artificial intelligence systems
- conversations in internet video meetings
Shared features:
- sequenced interactions
- human-machine interface
- machine transmission over network (usually the internet)
- Could be posts, spoken media, nonverbal interactions
(Ongoing experiment: can CMC apply to encoding email listserv archives?)

CMC data: Interactions of humans OR machines, mediated by machines

The Computable Text and Media special interest group (SIG) seeks to create TEI encoding practices for various forms of digitally created content. It should also cover processable text on physical media. Some examples:

e-literature
e-mail correspondence
social media postings (see also new CMC chapter!)
floppy disk magazines
program code (also in print and manuscript)
punchcards (as "paperware")
forms

The Computable Text and Media SIG will likely intersect with other SIGs at some points (e.g. CMC and Correspondence). One of the challenges will be to elaborate neat and simple extensions and/or practices to generalize or broaden the scope of TEI to include specific aspects of computable text and media.

Convener: Torsten Roeder (Center for Philology and Digitality, University of Würzburg).

Computable Text and Media SIG

TEI encoding “born digital” work

Not just about hand-written manuscripts now
TEI to preserve metadata that represents machinery generating source documents
Archive fragile texts that disappear from old media formats (CD-ROM, 1990s computers and encoding formats)
Complex work for historians of culture and technology

TEI against reductiveness?

Computational culture has a problem with bias and reductiveness (let us count the ways)
- machine learning/AI models trained on Western web media
- "global north" economy / assumptions about who uses computers
- emphasis on the "now" and lack of awareness of other times/places/ways
Standard forms that collect data about people: also reductive
Can TEI intervene?
- As humanities scholarship: applying pressure against reductive paradigms
- As alternative modeling; systematically investigating structures and forms of other times and places.

Reconciling structural differences in TEI

TEI as an encoding system that negotiates

TEI Guidelines can be frustrating because they provide so many options to decide upon for encoding a text.
But the TEI can also construct bridges, harmonize differences between different approaches.
TEI “standoff annotation”: markup that refers to other documents or other portions of a document.
- Can be machine-generated
- Can provide a way
  - to negotiate between different encoding systems
  - to analyze what is lost in translation
  - to mark variation between distinct versions of a text.

Standoff Annotation: links, pointers, commentary

Related to the W3C Web Annotation Model

See TEI ticket #1745: Led to the development of the <standOff> TEI element for expressing linked data

Standoff Annotation: links, pointers, commentary

<standOff> is not the only way to perform "standoff annotation" in TEI.
But it allows for many different kinds of annotation:

<standOff>: “Functions as a container element for linked data, contextual information, and stand-off annotations embedded in a TEI document.”
Added to TEI Guidelines in Release 4.3.0 (August 2021)
Many projects do standoff work without using this element!

An Idea: TEI Standoff is Cosmopolitan

The following slides are from my own (Anglo-European) project history with the TEI, exploring forms of “standoff annotation”
- I began studying the TEI a little over 10 years ago as a scholar of 19th-century English literature
- The TEI led me very quickly to cosmopolitan projects and collaborations.
- The TEI community has shown me the TEI itself as a cosmopolitan coding method.

A cosmopolitan TEI makes itself aware of multiple formats, multiple languages, multiple approaches

Investigating translation history

Garci Rodríguez de Montalvo's Amadis de Gaula, from the 1500s, translated from Castilian Spanish to modern English by Robert Southey (1803):
- three centuries apart
- medieval Catholic imperial Spain vs. Protestant imperial England
We study Southey's "sense-for-sense" rather than "word-for-word" translation and how it changed Montalvo's text.

Amadis in Translation project (https://newtfire.org/amadis/)

Investigating translation history

Amadis in Translation project (https://newtfire.org/amadis/)

This translation study applies the TEI:
- to align Montalvo's and Southey's texts
- to discover: What did Southey compress and remove from Montalvo? And what did he add to Montalvo?

Investigating translation history

Amadis in Translation project (https://newtfire.org/amadis/)

<cl xml:id="M0_p1_c63">
	<milestone unit="said" resp="#Garinter" ana="start"/>No sin causa tiene/>.</cl>
 <cl xml:id="M0_p1_c64">Esto hecho recogida toda la compaña hizo en dos
                        palafrenes cargar el león y el ciervo:</cl>
<cl xml:id="M0_p1_c65">y llevarlos a la villa con gran plazer.</cl>
<cl xml:id="M0_p1_c66">Donde siendo de tal huésped la reina avisada:</cl>
<cl xml:id="M0_p1_c67">los palacios de grandes y ricos atavíos/</cl>
<cl xml:id="M0_p1_c68"><seg xml:id="M0_p1_c68_1">y las mesas puestas</seg> 
       <seg xml:id="M0_p1_c68_2">hallaron:</seg></cl>
<cl xml:id="M0_p1_c69">en la una más alta se sentaron los reyes:</cl>
<cl xml:id="M0_p1_c70">y en otra junto con ella Elisena su hija:</cl>
<cl xml:id="M0_p1_c71">y allí fueron servidos como en casa de tan buen hombre
                        ser devía.</cl>

TEI <cl> and <seg> elements mark units of "sense": clauses, "clause-like" passages, and phrases in Montalvo's Amadis.
This markup is designed to provide locational markers for reference in translated files.
The Spanish structure of this document does not match English sentence structure, but units of "sense" can be traced.


Investigating translation history

Amadis in Translation project (https://newtfire.org/amadis/)

<p> <!-- <s> elements preceding  -->
 <s>
  <anchor ana="start" type="add"/>When Garinter saw him fall,<anchor ana="end"/>
  <anchor ana="start" corresp="#M0_p1_c62"/>he said within himself<anchor ana="end"/>
  <anchor ana="start" corresp="#M0_p1_c63"/>not without cause is that Knight famed
                  to be the best in the world.<anchor ana="end"/>
 </s>
  <s>
   <anchor ana="start" corresp="#M0_p1_c64"/>Meanwhile their train came up, and then
     was their prey and venison laid on two horses<anchor ana="end"/>
 <anchor ana="start" corresp="#M0_p1_c65"/>and carried to the City.<anchor ana="end"/>
 </s>
</p>

Markup of Southey's translation:
- @corresp attribute points to corresponding passages of sense in the Montalvo source.
- Southey's additions are marked with @type="add". (Omissions are demonstrated in skipped numbers—not shown in this short example.)
- relatively "flat" document = standoff annotation of Southey's alignment with Montalvo's text.
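
A minimal sketch of how this flat markup can be read by a program: it collects the @corresp pointers, lists the additions, and looks for skipped clause numbers. It uses Python's standard library rather than the project's own tooling, and the function names align and skipped_clauses are hypothetical. The sample is the passage shown on this slide.

```python
import xml.etree.ElementTree as ET

SOUTHEY = """<p>
 <s>
  <anchor ana="start" type="add"/>When Garinter saw him fall,<anchor ana="end"/>
  <anchor ana="start" corresp="#M0_p1_c62"/>he said within himself<anchor ana="end"/>
  <anchor ana="start" corresp="#M0_p1_c63"/>not without cause is that Knight famed
                  to be the best in the world.<anchor ana="end"/>
 </s>
 <s>
  <anchor ana="start" corresp="#M0_p1_c64"/>Meanwhile their train came up, and then
     was their prey and venison laid on two horses<anchor ana="end"/>
  <anchor ana="start" corresp="#M0_p1_c65"/>and carried to the City.<anchor ana="end"/>
 </s>
</p>"""


def align(translation_xml):
    """Return ({source clause id: English passage}, [added passages])."""
    aligned, added = {}, []
    for anchor in ET.fromstring(translation_xml).iter("anchor"):
        if anchor.get("ana") != "start":
            continue
        # The passage is the text that follows the start anchor.
        passage = " ".join((anchor.tail or "").split())
        if anchor.get("type") == "add":
            added.append(passage)
        elif anchor.get("corresp"):
            aligned[anchor.get("corresp").lstrip("#")] = passage
    return aligned, added


def skipped_clauses(clause_ids):
    """List clause numbers missing between the lowest and highest pointer."""
    numbers = sorted(int(cid.rsplit("_c", 1)[1]) for cid in clause_ids)
    if not numbers:
        return []
    present = set(numbers)
    return [n for n in range(numbers[0], numbers[-1] + 1) if n not in present]


aligned, added = align(SOUTHEY)
print(list(aligned))             # ['M0_p1_c62', 'M0_p1_c63', 'M0_p1_c64', 'M0_p1_c65']
print(added)                     # ['When Garinter saw him fall,']
print(skipped_clauses(aligned))  # []
```

This short sample has no omissions, so the last call returns an empty list. Given pointers to c62 and c65 only, the same helper would report 63 and 64 as skipped.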

Investigating translation history

Amadis in Translation project (https://newtfire.org/amadis/)

Alignment table showing passages added and omitted in Southey's translation

Investigating translation history

Amadis in Translation project (https://newtfire.org/amadis/)

Visualization (XSLT to SVG) as a diagram of aligned content, and proportions unique to Montalvo and Southey

TEI to compare versions encoded differently

Collaborating with scholars of medieval Spain (Stacey Triplette and Helena Bermudez Sabel) led us:
- to write TEI to structure and measure "sense by sense" translation
- to understand TEI as a language that could bridge our different fields (19c England vs 14thc Spain) through digital humanities.
Soon after, I joined a group of scholars to update and revitalize the digital representation of Mary Shelley's novel Frankenstein.
This new project also explored how TEI can construct bridges—this time between very different structural encodings in order to compare them.

新宿御苑橋
Shinjukugyoen-bashi

TEI to compare versions encoded differently

Frankenstein Variorum: https://frankensteinvariorum.org/

Visualizes a collation, or comparison of versions, working with digital editions that were encoded very differently
Designed as a static website for serendipitous browsing and intensive research
Applies the TEI in a JavaScript context to store comparison data and pointers to variant passages

Objectives of the Frankenstein Variorum (FV)

to “upcycle” and connect previous digital editions of Frankenstein:
- PA Electronic Edition: 1990s HTML of two editions
- Shelley-Godwin Archive: complicated genetic edition markup in TEI (page by page encoding of manuscript notebook)
- new editions prepared in simple, structural TEI
to share a nonlinear, divergent edition history
to encourage exploration from one edition to the others

Editions that we compared in the Frankenstein Variorum

FV includes Shelley-Godwin Archive encoding

S-GA diplomatic edition of the 1816 Notebooks,
- encoded surface-by-surface, line-by-line
- To collate this edition with the others, we had to resequence the encoding of the margin notes in the text in reading order
- This was possible thanks to the precise TEI encoding of the S-GA editors: (We could follow the pointers and markers in XSLT processing.)

Shelley-Godwin Archive: sample page surface:

Shelley-Godwin Archive

sample surface encoding from S-GA

<surface xmlns:mith="http://mith.umd.edu/sc/ns1#" lrx="3847" lry="5342" 
partOf="#ox-frankenstein_volume_i" ulx="0" uly="0" 
mith:folio="21r" mith:shelfmark="MS. Abinger c. 56" 
xml:base="https://raw.githubusercontent.com/
umd-mith/sga/master/data/tei/ox/ox-ms_abinger_c56/ox-ms_abinger_c56-0045.xml" 
xml:id="ox-ms_abinger_c56-0045">
  <graphic url="http://shelleygodwinarchive.org/images/ox/ms_abinger_c56/ms_abinger_c56-0045.jp2"/>
  <zone rend="bordered" type="pagination"><line>75</line></zone>
  <zone type="library"><line>21</line></zone>
<!-- lines of text elided here -->
<line>to form. His limbs were in proportion</line>
<line>and I had selected his features <del rend="strikethrough">h</del> as</line>
<line><mod>
        <del rend="strikethrough">handsome</del>
        <del rend="unmarked">.</del>
        <anchor xml:id="c56-0045.01"/>
      </mod>
      <mod>
        <del rend="strikethrough">Handsome</del>
        <add hand="#pbs" place="superlinear">Beautiful</add>
      </mod>; Great God! His</line>

<!-- at the end of the surface encoding, encoding material in a left-margin zone: -->

<zone corresp="#c56-0045.01" type="left_margin">
    <line><add><mod>
          <del rend="strikethrough">handsome</del>
          <add hand="#pbs" place="superlinear">beautiful.</add>
        </mod></add></line>
  </zone>
<!-- other marginal insertions encoded -->
</surface>
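
A minimal sketch of the resequencing idea: follow each margin zone's @corresp pointer back to its <anchor> and place the margin text there, in reading order. The Variorum did this in XSLT; this Python version only illustrates the logic. The sample below is a simplified, hypothetical surface (not the S-GA file), and the function name reading_order is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the xml:id attribute to this qualified name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified, hypothetical sample modelled on the surface encoding above.
SURFACE = """<surface>
  <zone type="main">
    <line>his features as <anchor xml:id="a1"/> Great God!</line>
  </zone>
  <zone type="left_margin" corresp="#a1">
    <line>beautiful.</line>
  </zone>
</surface>"""


def reading_order(surface_xml: str) -> str:
    """Return the surface text with margin zones moved to their anchors."""
    surface = ET.fromstring(surface_xml)
    # Map each anchor id to the zone that points at it.
    margin = {
        zone.get("corresp").lstrip("#"): zone
        for zone in surface.iter("zone")
        if zone.get("corresp")
    }

    def walk(element):
        if element.text:
            yield element.text
        for child in element:
            if child.tag == "zone" and child.get("corresp"):
                pass  # skipped here; its text is emitted at its anchor
            else:
                yield from walk(child)
                target = margin.get(child.get(XML_ID))
                if child.tag == "anchor" and target is not None:
                    yield "".join(target.itertext())
            if child.tail:
                yield child.tail

    return " ".join("".join(walk(surface)).split())


print(reading_order(SURFACE))  # his features as beautiful. Great God!
```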

Collating when the editions are so different (1)

Align and “chunk”

To prevent severe alignment errors, it is best not to collate the entire novel files at once!
We prepared 33 collation units (or "chunk files") sharing common starting and ending points.
Edition files of the same chunk are collated together

Collating when the editions are so different (2)

Prescribe rules to direct the machine-assisted collation

Our Python collation script
- Works with the CollateX library, extensively customized
- Prepares CollateX to work around markup differences
  - (identify and unite words split around line-endings in S-GA)
- Identifies what features can be ignored/skipped over for collation purposes
  - (e.g. markup of pagination, line-by-line encoding in S-GA)
- Normalizes: identifies what apparently different features are the same:
  - <milestone type='paragraph'> is the same as <p>
  - "&" is not different from "and"
- Prescribes output in the form of a TEI critical apparatus:
  - coordinates information on which editions align and what normalized tokens/strings they share at each instance of variation.
  - (See Parallel Segmentation encoding in TEI Guidelines)

Markup of text structure compared across Variorum:
- Volume (print editions only), letter, chapter
- Paragraph, poetry line-groups and lines
- Notes

Markup of manuscript events included in Variorum comparison: deletion, insertion, gap

Normalizing algorithm:
- Decide what marks are equivalent.
- Ignore but preserve other markup in the collation process; likewise abbreviations and capitalization.
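
A minimal sketch of the normalizing step described above, in plain Python. It is not the project's script and does not call CollateX. The two equivalences come from the slide; treating <pb/> as ignorable alongside <lb/> is an assumption, and the names EQUIVALENTS, IGNORED, and normalize are hypothetical.

```python
import re

# Pairs that should count as the same thing during comparison (from the slide).
EQUIVALENTS = {
    "&": "and",
    "<milestone type='paragraph'>": "<p>",
}

# Markup that is skipped for comparison. <lb/> is named on the slides;
# including <pb/> here is an assumption.
IGNORED = re.compile(r"^<(lb|pb)\b[^>]*/?>$")


def normalize(tokens):
    """Return the comparison form of a token list; the originals are untouched."""
    normalized = []
    for token in tokens:
        if IGNORED.match(token):
            continue
        token = EQUIVALENTS.get(token, token)
        normalized.append(token.lower())  # capitalization is not compared
    return normalized


manuscript = ["<lb n='2'/>", "It", "was", "&", "<milestone type='paragraph'>"]
print(normalize(manuscript))  # ['it', 'was', 'and', '<p>']
```

The original tokens are kept separately, so the output can still show each edition's own markup while the comparison runs on the normalized forms.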

Including markup in the comparison

Manuscript (from Shelley-Godwin Archive):

<lb n="c56-0045__main__2"/>It was on a dreary night of November 
<lb n="c56-0045__main__3"/>that I beheld <del rend="strikethrough" 
xml:id="c56-0045__main__d5e9572">
       <add hand="#pbs" place="superlinear" xml:id="c56-0045__main__d5e9574">the frame on
         whic</add></del> my man comple<del>at</del>
<add place="intralinear" xml:id="c56-0045__main__d5e9582">te</add>
<add xml:id="c56-0045__main__d5e9585">ed</add>

1818 (from PA Electronic edition)

<p xml:id="novel1_letter4_chapter4_div4_div4_p1">I<hi>T</hi> was on a dreary 
night of November, that I beheld the accomplishment of my toils.</p>

What matters for meaningful comparison?
- Text nodes
- <del> and <p> markup
What doesn't matter?
- <lb/> elements, attribute nodes
- <hi>? *In real life we include the <hi> elements as meaningful markup because sometimes they are meaningful for emphasis.

Tokenization and alignment of TEI markup

MS (from Shelley-Godwin Archive):

["It", "was", "on", "a", "dreary",
"night", "of", "November", "that",
"I", "beheld",
"&lt;del&gt;the frame on whic&lt;/del&gt;",
"my", "man",
"comple", "&lt;del&gt;at&lt;/del&gt;", "teed"]

1818 (from PA Electronic edition)

["&lt;p&gt;", "IT", "was", "on", "a", "dreary", 
"night", "of", "November,", "that", "I", "beheld",
"the", "accomplishment", "of", "my", "toils.", "&lt;/p&gt;"]

Project decision: Treat a deletion as a complete and indivisible event: a “long token”. This helps to align other witnesses around it.
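
A minimal sketch of "long token" tokenization, assuming a regular expression is enough for un-nested deletions. This is an illustration, not the project's tokenizer; the names TOKEN and tokenize are hypothetical.

```python
import re

# Order matters: a whole <del>...</del> is tried first so it stays one token,
# then any other single tag, then a run of non-space, non-tag characters.
TOKEN = re.compile(r"<del\b[^>]*>.*?</del>|<[^>]+>|[^\s<]+", re.DOTALL)


def tokenize(text: str):
    """Split text into tokens, keeping each deletion as one long token."""
    return TOKEN.findall(text)


print(tokenize("that I beheld <del>the frame on whic</del> my man"))
# ['that', 'I', 'beheld', '<del>the frame on whic</del>', 'my', 'man']
print(tokenize("<p>IT was on a dreary night</p>"))
# ['<p>', 'IT', 'was', 'on', 'a', 'dreary', 'night', '</p>']
```

A deletion nested inside another deletion would need a real XML parser; the pattern above only handles the flat case.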

TEI critical apparatus code

can be a data structure that builds a bridge between differently coded editions

<app>
	<rdgGrp n="['that', 'i', 'beheld']">
		<rdg wit="f1818">that I beheld</rdg>
		<rdg wit="f1823">that I beheld</rdg>
		<rdg wit="fThomas">that I beheld</rdg>
		<rdg wit="f1831">that I beheld</rdg>
		<rdg wit="fMS">&lt;lb n="c56-0045__main__3"/&gt;that I beheld</rdg>
	</rdgGrp>
</app>
<app>
	<rdgGrp n="['&lt;del&gt; the frame on whic&lt;/del&gt;',
               'my', 'man', 'comple', 
               '', '&lt;mdel&gt;at&lt;/mdel&gt;', 'te', 'ed', 
               ',', '.', '&lt;del&gt;and&lt;/del&gt;']">
		<rdg wit="fMS">&lt;del rend="strikethrough" 
          xml:id="c56-0045__main__d5e9572"&gt;
			&lt;sga-add hand="#pbs" place="superlinear" 
          sID="c56-0045__main__d5e9574"/&gt;the
	      frame on whic &lt;sga-add eID="c56-0045__main__d5e9574"/&gt; &lt;/del&gt; my man
		  comple &lt;mod sID="c56-0045__main__d5e9578"/&gt; 
          &lt;mdel&gt;at&lt;/mdel&gt;
		  &lt;sga-add place="intralinear" sID="c56-0045__main__d5e9582"/&gt;te
          &lt;sga-add eID="c56-0045__main__d5e9582"/&gt;
          &lt;sga-add sID="c56-0045__main__d5e9585"/&gt;ed
		  &lt;sga-add eID="c56-0045__main__d5e9585"/&gt;
          &lt;mod eID="c56-0045__main__d5e9578"/&gt;
          &lt;sga-add hand="#pbs" place="intralinear" sID="c56-0045__main__d5e9588"/&gt;, 
          &lt;sga-add eID="c56-0045__main__d5e9588"/&gt;.
		  &lt;del rend="strikethrough"
		  xml:id="c56-0045__main__d5e9591"&gt;And&lt;/del&gt;</rdg>
	</rdgGrp>
	<rdgGrp n="['the', 'accomplishment', 'of', 'my', 'toils.']">
		<rdg wit="f1818">the accomplishment of my toils.</rdg>
		<rdg wit="f1823">the accomplishment of my toils.</rdg>
		<rdg wit="fThomas">the accomplishment of my toils.</rdg>
		<rdg wit="f1831">the accomplishment of my toils.</rdg>
	</rdgGrp>
</app>
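
A minimal sketch of reading this apparatus structure with a program: each <rdgGrp> gathers the witnesses that agree after normalization. The sample is abridged from the apparatus above, and the function name witness_groups is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abridged from the apparatus entry shown above.
APP = """<app>
  <rdgGrp n="['my', 'man']">
    <rdg wit="fMS">my man</rdg>
  </rdgGrp>
  <rdgGrp n="['the', 'accomplishment', 'of', 'my', 'toils.']">
    <rdg wit="f1818">the accomplishment of my toils.</rdg>
    <rdg wit="f1823">the accomplishment of my toils.</rdg>
    <rdg wit="fThomas">the accomplishment of my toils.</rdg>
    <rdg wit="f1831">the accomplishment of my toils.</rdg>
  </rdgGrp>
</app>"""


def witness_groups(app_xml: str):
    """Return one list of witness sigla per <rdgGrp>, in document order."""
    app = ET.fromstring(app_xml)
    return [
        [rdg.get("wit") for rdg in group.findall("rdg")]
        for group in app.findall("rdgGrp")
    ]


print(witness_groups(APP))
# [['fMS'], ['f1818', 'f1823', 'fThomas', 'f1831']]
```

Two groups in one <app> means the editions split two ways at this point: the manuscript on one side, the four print witnesses on the other.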

From collation data to spine

“Spine” = data model (dynamic nerve plexus?) holding the variorum together
- standoff use of TEI critical apparatus
  - coordinates data on variance, including normalized tokens and maximum edit-distance values
- points to specific locations in the variorum edition files
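
A minimal sketch of a "maximum edit-distance value" for one point of variation. The slide does not say which distance measure the project uses, so Levenshtein distance is an assumption here, and the function names are hypothetical.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and substitutions
    needed to turn a into b (standard dynamic-programming form)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def max_edit_distance(readings) -> int:
    """Largest pairwise distance among the readings at one variation point."""
    return max((levenshtein(a, b) for a, b in combinations(readings, 2)), default=0)


print(levenshtein("kitten", "sitting"))  # 3
print(max_edit_distance(["that I beheld", "that I beheld"]))  # 0
```

A single number per variation point is enough to rank passages by how intensely they were revised.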

An interesting variant passage in Frankenstein

https://frankensteinvariorum.org/viewer/Thomas/vol_1_letter_iv#C06_app7

A Thomas copy edit of Letter IV at an early moment of intense revision

Another interesting variation in Frankenstein

https://frankensteinvariorum.org/viewer/Thomas/vol_1_chapter_iv#C10_app59

where the Creature comes to life in MS and Thomas

How did we make this heatmap?

We made it from the "Standoff Spine" collation data:
XSLT transformation of TEI to SVG
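
A minimal sketch of the idea behind such a heatmap, written in Python instead of the project's XSLT: each point of variation becomes one SVG rectangle whose colour deepens with its edit distance. The function name heatmap_svg, the cell size, and the cap value are all hypothetical.

```python
def heatmap_svg(distances, cell: int = 20, cap: int = 50) -> str:
    """Draw one square per variation point; larger distance = deeper red."""
    rects = []
    for index, distance in enumerate(distances):
        intensity = min(distance, cap) / cap    # scale to the range 0..1
        shade = round(255 * (1 - intensity))    # 255 = white, 0 = full red
        rects.append(
            f'<rect x="{index * cell}" y="0" width="{cell}" height="{cell}" '
            f'fill="rgb(255,{shade},{shade})"/>'
        )
    width = cell * len(distances)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{cell}">'
        + "".join(rects)
        + "</svg>"
    )


# Three hypothetical variation points: no change, some change, heavy revision.
print(heatmap_svg([0, 10, 80]))
```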

See our Method page for details:

https://frankensteinvariorum.org/

So far this presentation has praised the TEI as “cosmopolitan”

for being able to mark and study texts in ways that reflect their culture, not just ours
for building bridges between different language and encoding structures

Photo: Tokyo International Forum. Photo credit: David B. Cox, photographylife.com

Writing systems that challenge the limits of computer technology and the TEI

But I want to conclude by introducing a difficult problem:

TEI is severely limited for right-to-left (RTL) scripts

right-to-left scripts for Arabic and Hebrew are problematic when marked in left-to-right TEI XML markup
Till Grallert spoke of this in his keynote address at the 2023 TEI conference in Paderborn, Germany
- See Open Arabic Periodical Editions Project
- Grallert described problems with Arabic typesetting: publishing machinery that cannot easily encompass the variety of symbols and glyphs needed for movable type.
Editing solutions are a compromise.
- oXygen XML Editor, Kate, and VS Code offer some helpful but imperfect custom editing solutions for inline markup
- TEI and XML itself interfere with Unicode directionality of the RTL text contents

<p xml:id="p_23.d1e747" xml:lang="ar">وما برحت الآمال معقودة بأن تبلغ الصحافة عما قليل أشدها ورشدها 
   <lb change="#d2e635 #d2e861" ed="print" edRef="#edition_1" xml:id="lb_9.d2e1448"/>
   لتضاهي صحافة الأمم الراقية في موضوعاتها وتأثيراتها إذ أن العقلاء يذهبون 
    <lb change="#d2e808 #d2e861" ed="print" edRef="#edition_1" xml:id="lb_10.d2e1647"/>
    إلى أن صحافتنا مازالت حالها على ما انتهت إليه غير متناسبة مع عمرها الطويل. 
    <lb change="#d2e808 #d2e861" ed="print" edRef="#edition_1" xml:id="lb_11.d2e1649"/>
   والمعمر في الأعم من حالاته يشتد ساعده وزنده وتقوى ملكة عقله وعلمه 
   <lb change="#d2e808 #d2e861" ed="print" edRef="#edition_1" xml:id="lb_12.d2e1651"/>
    بكثرة تجاربه وأسباب رويته. ولا خير في أمة لا يقوم بشؤونها شيوخ 
    <lb change="#d2e808 #d2e861" ed="print" edRef="#edition_1" xml:id="lb_13.d2e1653"/>
    تفاخر بأعمالهم مفاخرتها بعقولهم وطول أعمارهم.</p>

Sample encoding from Till Grallert, Digital edition (TEI XML) of the Arabic monthly journal *al-Muqtabas* (مجلة المقتبس), published by Muḥammad Kurd ʿAlī in Cairo and Damascus between 1906 and 1917/18
https://github.com/openarabicpe/journal_al-muqtabas
Standoff methods can help, but they are not a complete solution to customizing the TEI.
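
A minimal sketch of why inline markup and RTL text pull against each other: every character carries a Unicode bidirectional class, and the Latin letters of a tag name belong to a different class than Arabic letters. This uses Python's standard unicodedata module; the function name bidi_classes is hypothetical.

```python
import unicodedata


def bidi_classes(text: str):
    """Pair each non-space character with its Unicode bidirectional class."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text if not ch.isspace()]


# A tag name in Latin script wrapped around one Arabic word from the sample.
for character, bidi_class in bidi_classes("<p>وما</p>"):
    print(repr(character), bidi_class)
```

The Latin "p" reports class L (left-to-right) and the Arabic letters report AL (Arabic letter, right-to-left). An editor has to decide how to lay out both kinds in one line, which is where the display of inline markup becomes unstable.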

Can the TEI provide alternatives to its international English tagging?

Hugh Cayless (in a working draft) suggests we have a lot of work to do.
- What if TEI found a way to express its tags and their directionality to match the document characters?
- What if we had software to process it? (Would this be an alternate form of XML? Or pushing XML beyond left-to-right directionality?)
The problem is larger than the TEI, but it is TEI encoders who may find the path to a more inclusive method of encoding!

Thank you for listening! Any questions?

ご清聴ありがとうございました！

何か質問はありますか？

Go seichō arigatōgozaimashita! Nanika shitsumon wa arimasu ka?

The Internationalization of the TEI (March 10, 2025)

By Elisa Beshero-Bondar


Professor of Digital Humanities and Chair of the Digital Media, Arts, and Technology Program at Penn State Erie, The Behrend College.
