The Internationalization of the Text Encoding Initiative

and our understanding of text

Elisa Beshero-Bondar, PhD

Chair, TEI Technical Council | Professor of Digital Humanities, Penn State Behrend

Keynote for Digital Humanities and the Power of Collaboration: Expanding Connections from Local to Global Symposium at Fukuoka University, Japan, 2025-03-10

Thank you for inviting me to speak!

 今日はご招待いただきありがとうございます。

Kyō wa go shōtai itadaki arigatōgozaimasu.

Topics of this presentation

  • How has multicultural and multilingual research changed the TEI?
     
  • Reconciling structural differences in TEI
     
  • Writing systems that challenge the limits of computer technology and the TEI

How has multicultural and multilingual research changed the TEI?
 

 

 

  • encoding of units of measure
  • encoding of ruby annotations
  • encoding of sex and gender
  • encoding of "born-digital" works

Interoperation: Can texts that are prepared for machine processing in one computer system be understood by other systems?

Interchange: Can the machine-readable parts of the texts be understood by humans, who can work with them as needed, without additional information?

An international community

  • A set of shared Guidelines for encoding machine-readable texts
  • Grounded in humanities and social sciences / cultural heritage texts
  • Originated in 1987, formalized by 1994
  • Founding priority: Guidelines for Text Encoding and Interchange (another possible meaning for the "i" in TEI)
  • Big community around the world, with an annual conference and an academic journal as well as the TEI Guidelines for text encoding.
    • 2024 conference in Buenos Aires: primary language was Spanish
  • TEI tags are written by English speakers: Is this a problem for non-English speakers?
    • Acceptance around the world of English as the lingua franca for digital communities
    • Non-English-speaking communities request and contribute explanations of the tags in their own languages...
    • ...but nearly always(?) want to just use the same tag names as everyone else.
    • This reduces confusion over the interpretation of elements and attributes whose tag names do not translate clearly into other languages.
  • Lately: new emphasis on internationalizing the Guidelines
  • Lately: new interest in encoding strategies for vertical and right-to-left languages

ISO: an organization for making international standards, connected to systems that we rely on to be interoperable

An ISO standard for machine-readable dates (ISO 8601) is an important option for TEI attributes

How do we define a measurement of time? 

ISO attempts to set precise standards based on measurable physical properties of our universe. According to ISO, a "second" in time is defined thus:

The second is the duration of 9,192,631,770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of the caesium-133 atom.

Contemporary nuclear science prevails in this measurement. Is it applicable to explanations of time duration from past centuries? 

A nuclear physics lab can consistently measure the passage of one second from observing subatomic particles. This is more precise than watching a spring-and-weight driven watch or pendulum clock.

A problem with encoding measurements in the TEI

  • Before 2017, the examples in the TEI Guidelines for encoding units of measure referred them to a defined, universal ISO standard.

  • Naoki Kokaze, then a graduate student at the University of Tokyo, addressed the TEI Conference in 2017 with a poster presentation.
  • His project required documenting now-obsolete local Japanese units of measure, from particular regions and villages, and decoding them in relation to other local systems.

Naoki Kokaze's ticket: #1707 (2017), resolved in 2019

  • Naoki Kokaze asked the TEI for a new outlook:
    • The TEI should permit encoding of past knowledge systems associated with historic documents.
    • Practicing TEI should mean we step away from what our computer systems "know" by default.
    • Our computational processing in TEI should not prioritize only a current Western way of knowing and measuring.
  • Could the TEI create new data structures to allow for defining nonstandard units of measure?
  • That would allow processing nonstandard measuring systems and calculating equivalences between them.

 

  • The TEI Technical Council worked together with Naoki on a new data structure.
     
  • Since July 2019, TEI can express nonstandard historical and local measuring systems
     
  • No longer do we assume that units of measurement rely only on ISO definitions.
     
  •  TEI encoders now have examples of how to define and work with nonstandard measurement systems, as well as standard ones.

Standard weights and measures from ancient Egypt

4 digits = a palm

Digit-al Humanities 

Visualizing New Kingdom units of measure (1500 - 1000 BCE) as they relate to one another 
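
The idea of declaring nonstandard units and calculating equivalences between them can be sketched outside of TEI as a small lookup table. The Python below is only an illustration, not the TEI data structure itself. The names `CONVERSIONS` and `convert` are hypothetical. The digit-to-palm ratio comes from the slide above; the palm-to-cubit ratio (7 palms to a royal cubit) is a commonly cited value that is assumed here rather than taken from this presentation.

```python
from fractions import Fraction

# Each entry says: 1 <from_unit> = <factor> <to_unit>.
# "digit" -> "palm" follows the slide (4 digits = 1 palm).
# "palm" -> "royal_cubit" is an ASSUMED ratio (7 palms = 1 royal cubit).
CONVERSIONS = {
    ("digit", "palm"): Fraction(1, 4),
    ("palm", "royal_cubit"): Fraction(1, 7),
}


def convert(quantity, from_unit, to_unit, seen=None):
    """Convert by chaining declared conversions; return None if no path exists."""
    if from_unit == to_unit:
        return Fraction(quantity)
    seen = (seen or set()) | {from_unit}
    for (src, dst), factor in CONVERSIONS.items():
        # Follow each declared conversion forwards, or backwards by inverting it.
        for a, b, f in ((src, dst, factor), (dst, src, 1 / factor)):
            if a == from_unit and b not in seen:
                result = convert(Fraction(quantity) * f, b, to_unit, seen)
                if result is not None:
                    return result
    return None


print(convert(28, "digit", "royal_cubit"))  # chained: digits -> palms -> cubits
print(convert(2, "palm", "digit"))          # inverted: palms -> digits
```

Because only local ratios are declared, no unit in the table needs to be anchored to a modern standard: equivalences are computed by walking from one local unit to another.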

"Ruby release" of the TEI Guidelines

  • TEI P5 release 4.2.0 (February 25, 2021)
  • New elements have been introduced for the encoding of ruby annotations, a particular method of glossing runs of text which is common in East Asian scripts (#2054, with thanks to Kiyonori Nagasaki, Satoru Nakamura, Kazuhiro Okada, Duncan Paterson, and Martin Holmes):

    • The ruby element contains a passage of base text along with its associated ruby gloss(es).

    • The rb element contains the base text annotated by a ruby gloss.

    • The rt element contains a ruby text, an annotation closely associated with a passage of the main text.

    • A subsection on Ruby Annotations has been added, which is also referenced from several suitable places in the Guidelines.

     With this first take we hope to initiate further discussion and the implementation of additional use cases.

Broad applications of Ruby TEI encoding

  • Encoding of texts that provide pronunciation guides
  • Pair logographic signs with phonemes in multiple language systems
  • Ongoing discussion: Possibly need examples for long annotations?
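
As a rough illustration of how a program might read ruby markup, the Python sketch below pairs each base text with its gloss. The fragment is a hypothetical example written for this sketch (the TEI namespace is omitted for brevity), and `ruby_pairs` is an illustrative helper name, not part of any TEI tool.

```python
import xml.etree.ElementTree as ET

# A hypothetical fragment using the ruby, rb, and rt elements described above.
# The TEI namespace is left out to keep the sketch short.
FRAGMENT = "<p><ruby><rb>東京</rb><rt>とうきょう</rt></ruby>に行く。</p>"


def ruby_pairs(xml_text):
    """Return (base text, list of ruby glosses) for every ruby element."""
    root = ET.fromstring(xml_text)
    pairs = []
    for ruby in root.iter("ruby"):
        base = "".join(ruby.find("rb").itertext())
        glosses = ["".join(rt.itertext()) for rt in ruby.findall("rt")]
        pairs.append((base, glosses))
    return pairs


print(ruby_pairs(FRAGMENT))
```

Collecting the glosses as a list leaves room for a base text that carries more than one ruby gloss.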

Revising sex and gender in the TEI Guidelines

  • An old problem in the TEI and for the digital humanities scholarly community
     
  • Another problem with over-reliance on ISO.
     
  • TEI used to conform to a now-dated standard, ISO/IEC 5218, which provided numerical values for sex:

 

  • 0 = Not known;
  • 1 = Male;
  • 2 = Female;
  • 9 = Not applicable.
  • Prioritizing simple machine-encoding led to problems for many parts of the TEI Guidelines

     
  • TEI's early reliance on the ISO led to problems in the TEI Guidelines over:
    • Representation of sex / lack of gender
    • Confusion with physical states/traits
    • Language of Names/Dates chapter and related elements + attributes
       
  • The TEI and the ISO communities have both evolved over decades, and influence each other

TEI's tensions with ISO

Persistent problems/calls for revision

 

  • Assumptions in the language about gender / sex in the Guidelines
  • Calls in the community for something more than an encoding of linguistic morphological gender

    • connected with personography and prosopography structures in the Guidelines.

    • <persona> available since 2016 as distinct from person.

      • TEI projects are better able to describe invented, performed identities.

    • Review state vs. trait encoding and discussion in Names, Dates, People, and Places chapter

Release 4.5.0: ‘The Release of One’s Own’

New encoding features (October 2022)

  • Sex and gender have been revised in the Guidelines (#2189, #2190). This includes:
    • The introduction of the new element gender specifying the gender identity of a person, persona, or character.
    • The introduction of the new datatype teidata.gender defining the range of attribute values used to represent the gender of a person, persona, or character.
    • The revision of the discussion of traits and states in the subsection on Basic Principles.
    • Documentation and guidance in various places of Chapter 13 Names, Dates, People, and Places.
    • The revision of the elements sex, person, persona as well as the datatype teidata.sex.

What we learned in the process of revision

  • We create document data models with TEI
  • Our encoding decisions express a theory of text
    • TEI encoding can theorize about how texts respond to culture
      • about sex and gender
      • about class and ethnic and language differences
      • about nature and classification of life
  • The detailed curation of prosopography can advance a theory of past lives

"Born digital" texts and  Computer Mediated Communication  (CMC)

  •  New Guidelines Chapter on CMC (July 2024 Release)
  • Could help to encode:
    • written exchanges in chats and forums
    • interactions with artificial intelligence systems
    • conversations in internet video meetings
  • Shared features:
    • sequenced interactions
    • human-machine interface
    • machine transmission over network (usually the internet)
    • Could be posts, spoken media, nonverbal interactions
  • (Ongoing experiment: can CMC apply to encoding email listserv archives?)

CMC data: Interactions of humans OR machines, mediated by machines

Computable Text and Media SIG

This special interest group seeks to create TEI encoding practices for various forms of digitally created content. It should also cover processable text on physical media. Some examples:

  • e-literature
  • e-mail correspondence
  • social media postings (see also new CMC chapter!)
  • floppy disk magazines
  • program code (also in print and manuscript)
  • punchcards (as "paperware")
  • forms

The Computable Text and Media SIG will likely intersect with other SIGs at some points (e.g. CMC and Correspondence). One of the challenges will be to elaborate neat and simple extensions and/or practices to generalize or broaden the scope of TEI to include specific aspects of computable text and media.

Convener: Torsten Roeder (Center for Philology and Digitality, University of Würzburg).

TEI encoding “born digital” work

  • Not just about hand-written manuscripts now
     
  • TEI to preserve metadata that represents machinery generating source documents
     
  • Archive fragile texts that disappear from old media formats (CD-ROM, 1990s computers and encoding formats)
     
  • Complex work for historians of culture and technology

TEI against reductiveness?

  • Computational culture has a problem with bias and reductiveness (let us count the ways)
    • machine learning/AI models trained on Western web media
    • "global north" economy / assumptions about who uses computers
    • emphasis on the "now" and lack of awareness of other times/places/ways
  • Standard forms that collect data about people: also reductive
  •  Can TEI intervene?
    • As humanities scholarship: applying pressure against reductive paradigms
    • As alternative modeling; systematically investigating structures and forms of other times and places.

Reconciling structural differences in TEI

TEI as an encoding system that negotiates

  • TEI Guidelines can be frustrating because they provide so many options to decide upon for encoding a text.
     
  • But the TEI can also construct bridges and harmonize differences between approaches.

  • TEI “standoff annotation”: markup that refers to other documents or other portions of a document.
    • Can be machine-generated
    • Can provide a way
      • to negotiate between different encoding systems
      • to analyze what is lost in translation
      • to mark variation between distinct versions of a text.

 

Standoff Annotation: links, pointers, commentary

Related to the W3C Web Annotation Model

See TEI ticket #1745, which led to the development of the <standOff> TEI element for expressing linked data

  • <standOff> is not the only way to perform "standoff annotation" in TEI.

  • But it allows for many different kinds of annotation:

    <standOff>: “Functions as a container element for linked data, contextual information, and stand-off annotations embedded in a TEI document.”
     
  • Added to TEI Guidelines in Release 4.3.0 (August 2021)
     
  • Many projects do standoff work without using this element!

 

 

An Idea: TEI Standoff is Cosmopolitan

  • The following slides are from my own (Anglo-European) project history with the TEI, exploring forms of “standoff annotation”
    • I began studying the TEI a little over 10 years ago as a scholar of 19th-century English literature.
    • The TEI led me very quickly to cosmopolitan projects and collaborations.
    • The TEI community has shown me the TEI itself as a cosmopolitan coding method.

 

A cosmopolitan TEI makes itself aware of multiple formats, multiple languages, multiple approaches

Investigating translation history

  • Garci Rodríguez de Montalvo's Amadis de Gaula (1500s), translated from Castilian Spanish into modern English by Robert Southey (1803):
    • three centuries apart
    • medieval Catholic imperial Spain vs. Protestant imperial England
  • We study Southey's "sense-for-sense" rather than "word-for-word" translation and how it changed Montalvo's text.

Amadis in Translation project  (https://newtfire.org/amadis/)

  • This translation study applies the TEI:
    • to align Montalvo's and Southey's texts
    • to discover: What did Southey compress and remove from Montalvo? And what did he add to Montalvo?

<cl xml:id="M0_p1_c63">
	<milestone unit="said" resp="#Garinter" ana="start"/>No sin causa tiene/>.</cl>
 <cl xml:id="M0_p1_c64">Esto hecho recogida toda la compaña hizo en dos
                        palafrenes cargar el león y el ciervo:</cl>
<cl xml:id="M0_p1_c65">y llevarlos a la villa con gran plazer.</cl>
<cl xml:id="M0_p1_c66">Donde siendo de tal huésped la reina avisada:</cl>
<cl xml:id="M0_p1_c67">los palacios de grandes y ricos atavíos/</cl>
<cl xml:id="M0_p1_c68"><seg xml:id="M0_p1_c68_1">y las mesas puestas</seg> 
       <seg xml:id="M0_p1_c68_2">hallaron:</seg></cl>
<cl xml:id="M0_p1_c69">en la una más alta se sentaron los reyes:</cl>
<cl xml:id="M0_p1_c70">y en otra junto con ella Elisena su hija:</cl>
<cl xml:id="M0_p1_c71">y allí fueron servidos como en casa de tan buen hombre
                        ser devía.</cl>
  • TEI <cl> and <seg> elements mark units of "sense": clauses, "clause-like" passages, and phrases in Montalvo's Amadis.
  • This markup is designed to provide locational markers for reference in translated files.
  • The Spanish structure of this document does not match English sentence structure, but units of "sense" can be traced.

<p> <!-- <s> elements preceding  -->
 <s>
  <anchor ana="start" type="add"/>When Garinter saw him fall,<anchor ana="end"/>
  <anchor ana="start" corresp="#M0_p1_c62"/>he said within himself<anchor ana="end"/>
  <anchor ana="start" corresp="#M0_p1_c63"/>not without cause is that Knight famed
                  to be the best in the world.<anchor ana="end"/>
 </s>
  <s>
   <anchor ana="start" corresp="#M0_p1_c64"/>Meanwhile their train came up, and then
     was their prey and venison laid on two horses<anchor ana="end"/>
 <anchor ana="start" corresp="#M0_p1_c65"/>and carried to the City.<anchor ana="end"/>
 </s>
</p>
  • Markup of Southey's translation:
    • @corresp attribute points to corresponding passages of sense in the Montalvo source.
    • Southey's additions are marked with @type="add".  (Omissions are demonstrated in skipped numbers—not shown in this short example.)
    • relatively "flat" document = standoff annotation of Southey's alignment with Montalvo's text.
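
One way to picture how such markup can be processed: read the start anchors in the translation, count the ones typed as additions, and check which source clause ids are never pointed to. The Python sketch below does this on an abbreviated fragment. It is a hypothetical illustration, not the project's own processing. The list `MONTALVO_IDS` is assumed, and clause `c63` is deliberately left unaligned here to simulate an omission (in the real sample above it is aligned).

```python
import xml.etree.ElementTree as ET

# Abbreviated from the Southey sample above. For illustration only, the
# anchor pointing to #M0_p1_c63 is left out to simulate an omitted clause.
SOUTHEY = """<s>
  <anchor ana="start" type="add"/>When Garinter saw him fall,<anchor ana="end"/>
  <anchor ana="start" corresp="#M0_p1_c62"/>he said within himself<anchor ana="end"/>
  <anchor ana="start" corresp="#M0_p1_c64"/>Meanwhile their train came up<anchor ana="end"/>
</s>"""

# Hypothetical list of clause ids taken from the source-language file.
MONTALVO_IDS = ["M0_p1_c62", "M0_p1_c63", "M0_p1_c64"]


def alignment_report(translation_xml, source_ids):
    """Summarize additions, aligned clauses, and omitted clauses."""
    root = ET.fromstring(translation_xml)
    starts = [a for a in root.iter("anchor") if a.get("ana") == "start"]
    additions = sum(1 for a in starts if a.get("type") == "add")
    aligned = {a.get("corresp").lstrip("#") for a in starts if a.get("corresp")}
    omitted = [cid for cid in source_ids if cid not in aligned]
    return {"additions": additions, "aligned": sorted(aligned), "omitted": omitted}


print(alignment_report(SOUTHEY, MONTALVO_IDS))
```

Because the translation file only points outward with @corresp, the source file never has to be edited to record what the translator did to it.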

Alignment table showing passages added and omitted in Southey's translation

Visualization (XSLT to SVG) as a diagram of aligned content, and proportions unique to Montalvo and Southey

TEI to compare versions encoded differently

 

  • Collaborating with scholars of medieval Spain (Stacey Triplette and Helena Bermudez Sabel) led us:
    • to write TEI to structure and measure "sense by sense" translation
    • to understand TEI as a language that could bridge our different fields (19th-century England vs. 14th-century Spain) through digital humanities.
       
  • Soon after, I joined a group of scholars to update and revitalize the digital representation of Mary Shelley's novel Frankenstein.
     
  • This new project also explored how TEI can construct bridges—this time between very different structural encodings in order to compare them.


新宿御苑橋
Shinjukugyoen-bashi

TEI to compare versions encoded differently

Frankenstein Variorum: https://frankensteinvariorum.org/

  • Visualizes a collation, or comparison of versions, working with digital editions that were encoded very differently
     
  • Designed as a static website for serendipitous browsing and intensive research
     
  • Applies the TEI in a JavaScript context to store comparison data and pointers to variant passages

Objectives of the Frankenstein Variorum (FV)

 

  • to “upcycle” and connect previous digital editions of Frankenstein:
    • PA Electronic Edition: 1990s HTML of two editions
    • Shelley-Godwin Archive: complicated genetic edition markup in TEI (page by page encoding of manuscript notebook)
    • new editions prepared in simple, structural TEI
  • to share a nonlinear, divergent edition history
     
  • to encourage exploration from one edition to the others

 

 

Editions that we compared in the Frankenstein Variorum

FV includes Shelley-Godwin Archive encoding

  • S-GA diplomatic edition of the 1816 Notebooks
    • encoded surface-by-surface, line-by-line
    • To collate this edition with the others, we had to resequence the encoded margin notes into reading order within the text.
    • This was possible thanks to the precise TEI encoding of the S-GA editors: we could follow the pointers and markers in XSLT processing.
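
The resequencing step can be pictured with a small sketch. The project did this in XSLT; the Python below is a swapped-in illustration under simplifying assumptions: a much-reduced, hypothetical surface fragment, no namespaces other than `xml:id`, and a helper name (`reading_text`) invented for this example. It follows each margin zone's @corresp pointer and drops the margin text in at the matching anchor.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, heavily simplified surface: one main zone, one margin zone.
SURFACE = """<surface>
  <zone type="main"><line>his features as <anchor xml:id="c56-0045.01"/> Great God!</line></zone>
  <zone corresp="#c56-0045.01" type="left_margin"><line>beautiful.</line></zone>
</surface>"""


def reading_text(surface_xml):
    """Return the surface text with margin zones moved to their anchors."""
    root = ET.fromstring(surface_xml)
    # Map each anchor id to the whitespace-normalized text of its margin zone.
    margins = {
        zone.get("corresp").lstrip("#"): " ".join("".join(zone.itertext()).split())
        for zone in root.iter("zone")
        if zone.get("corresp")
    }
    parts = []

    def walk(element):
        if element.tag == "zone" and element.get("corresp"):
            return  # margin zones are emitted at their anchors instead
        if element.tag == "anchor" and element.get(XML_ID) in margins:
            parts.append(margins[element.get(XML_ID)])
        if element.text:
            parts.append(element.text)
        for child in element:
            walk(child)
            if child.tail:
                parts.append(child.tail)

    walk(root)
    return " ".join("".join(parts).split())


print(reading_text(SURFACE))
```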

Shelley-Godwin Archive: sample page surface:

Shelley-Godwin Archive

sample surface encoding from S-GA

<surface xmlns:mith="http://mith.umd.edu/sc/ns1#" lrx="3847" lry="5342" 
partOf="#ox-frankenstein_volume_i" ulx="0" uly="0" 
mith:folio="21r" mith:shelfmark="MS. Abinger c. 56" 
xml:base="https://raw.githubusercontent.com/
umd-mith/sga/master/data/tei/ox/ox-ms_abinger_c56/ox-ms_abinger_c56-0045.xml" 
xml:id="ox-ms_abinger_c56-0045">
  <graphic url="http://shelleygodwinarchive.org/images/ox/ms_abinger_c56/ms_abinger_c56-0045.jp2"/>
  <zone rend="bordered" type="pagination"><line>75</line></zone>
  <zone type="library"><line>21</line></zone>
<!-- lines of text elided here -->
<line>to form. His limbs were in proportion</line>
<line>and I had selected his features <del rend="strikethrough">h</del> as</line>
<line><mod>
        <del rend="strikethrough">handsome</del>
        <del rend="unmarked">.</del>
        <anchor xml:id="c56-0045.01"/>
      </mod>
      <mod>
        <del rend="strikethrough">Handsome</del>
        <add hand="#pbs" place="superlinear">Beautiful</add>
      </mod>; Great God! His</line>

<!-- at the end of the surface encoding, encoding material in a left-margin zone: -->

<zone corresp="#c56-0045.01" type="left_margin">
    <line><add><mod>
          <del rend="strikethrough">handsome</del>
          <add hand="#pbs" place="superlinear">beautiful.</add>
        </mod></add></line>
  </zone>
<!-- other marginal insertions encoded -->
</surface>

Collating when the editions are so different (1)

Align and “chunk”

  • Best not to collate entire novel files at once: chunking prevents severe alignment errors!
  • We prepared 33 collation units (or "chunk files") sharing common starting and ending points.
  • Edition files of the same chunk are collated together.

Collating when the editions are so different (2)

Prescribe rules to direct the machine-assisted collation
 

  • Our Python collation script
    • works with the CollateX library, extensively customized
    • prepares CollateX to work around markup differences
      • (e.g. identifying and uniting words split around line endings in S-GA)
    • identifies features that can be ignored or skipped over for collation purposes
      • (e.g. markup of pagination, line-by-line encoding in S-GA)
    • normalizes: identifies apparently different features that are the same
      • <milestone type='paragraph'> is the same as <p>
      • "&" is not different from "and"
    • prescribes output in the form of a TEI critical apparatus
  • Markup of text structure compared across the Variorum:
    • Volume (print editions only), letter, chapter
    • Paragraph, poetry line-groups and lines
    • Notes
  • Markup of manuscript events included in Variorum comparison: deletion, insertion, gap
  • Normalizing algorithm:
    • Decide what marks are equivalent
    • Ignore but preserve other markup in the collation process, as well as abbreviations and capitalization
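
The "ignore but preserve" idea can be sketched as a pair of values per token: the original form, kept for output, and a normalized form, used for comparison. The Python below is a minimal illustration, not the project's script. The rule tables `EQUIVALENT` and `IGNORED` are hypothetical and cover only the examples named above. The `t`/`n` keys follow the convention CollateX uses for pre-tokenized input, but no CollateX call is made here.

```python
import re

# Hypothetical rules modeled on the examples in the list above.
EQUIVALENT = {
    "<milestone type='paragraph'/>": "<p/>",
    "&": "and",
}
# Markup that should not count as a difference (line and page beginnings).
IGNORED = re.compile(r"<lb\b[^>]*/>|<pb\b[^>]*/>")


def normalize(token):
    """Return the form of a token used for comparison ('' if nothing is left)."""
    token = IGNORED.sub("", token).strip()
    token = EQUIVALENT.get(token, token)
    return token.lower()


def tokens_for_collation(tokens):
    """Keep the original token as 't' and its normalized form as 'n'.

    Tokens that consist only of ignored markup are dropped.
    """
    pairs = [{"t": tok, "n": normalize(tok)} for tok in tokens]
    return [pair for pair in pairs if pair["n"]]


sample = ['<lb n="3"/>that', "I", "&", "<milestone type='paragraph'/>", '<lb n="4"/>']
for pair in tokens_for_collation(sample):
    print(pair)
```

Comparison then runs on the `n` values only, while the `t` values carry the original markup through to the output.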

Including markup in the comparison

Manuscript (from Shelley-Godwin Archive):

<lb n="c56-0045__main__2"/>It was on a dreary night of November 
<lb n="c56-0045__main__3"/>that I beheld <del rend="strikethrough" 
xml:id="c56-0045__main__d5e9572">
       <add hand="#pbs" place="superlinear" xml:id="c56-0045__main__d5e9574">the frame on
         whic</add></del> my man comple<del>at</del>
<add place="intralinear" xml:id="c56-0045__main__d5e9582">te</add>
<add xml:id="c56-0045__main__d5e9585">ed</add>

1818 (from PA Electronic edition)

<p xml:id="novel1_letter4_chapter4_div4_div4_p1">I<hi>T</hi> was on a dreary 
night of November, that I beheld the accomplishment of my toils.</p>
  • What matters for meaningful comparison?
    • Text nodes
    • <del> and <p> markup
  • What doesn't matter?
    • <lb/> elements, attribute nodes
    • <hi>? *In real life we include the <hi> elements as meaningful markup because sometimes they are meaningful for emphasis.

Tokenization and alignment of TEI markup

MS (from Shelley-Godwin Archive):

["It", "was", "on", "a", "dreary",
"night", "of", "November", "that",
"I", "beheld",
"&lt;del&gt;the frame on whic&lt;/del&gt;",
"my", "man",
"comple", "&lt;del&gt;at&lt;/del&gt;", "teed"]

1818 (from PA Electronic edition)

["&lt;p&gt;", "IT", "was", "on", "a", "dreary", 
"night", "of", "November,", "that", "I", "beheld",
"the", "accomplishment", "of", "my", "toils.", "&lt;/p&gt;"]

Project decision: Treat a deletion as a complete and indivisible event: a “long token”. This helps to align other witnesses around it.
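
A "long token" rule of this kind can be sketched with a regular expression that tries to match a whole deletion before it tries anything smaller. This is a simplified Python illustration, not the project's tokenizer: the pattern `TOKEN` and the function `tokenize` are hypothetical, and nested deletions are not handled.

```python
import re

# Alternatives are tried left to right at each position:
#   1. a whole <del>...</del> span (the "long token"),
#   2. any other single tag,
#   3. a run of characters containing neither whitespace nor "<".
TOKEN = re.compile(r"<del\b[^>]*>.*?</del>|<[^>]+>|[^\s<]+", re.DOTALL)


def tokenize(text):
    """Split text into tokens, keeping each deletion as one indivisible token."""
    return [" ".join(tok.split()) for tok in TOKEN.findall(text)]


manuscript = (
    'that I beheld <del rend="strikethrough">the frame on\n'
    "         whic</del> my man comple<del>at</del>teed"
)
for token in tokenize(manuscript):
    print(token)
```

Keeping the deletion whole means the aligner sees one event in the manuscript rather than four stray words that might be matched against unrelated words in the print editions.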

TEI critical apparatus code can be a data structure that builds a bridge between differently coded editions

<app>
	<rdgGrp n="['that', 'i', 'beheld']">
		<rdg wit="f1818">that I beheld</rdg>
		<rdg wit="f1823">that I beheld</rdg>
		<rdg wit="fThomas">that I beheld</rdg>
		<rdg wit="f1831">that I beheld</rdg>
		<rdg wit="fMS">&lt;lb n="c56-0045__main__3"/&gt;that I beheld</rdg>
	</rdgGrp>
</app>
<app>
	<rdgGrp n="['&lt;del&gt; the frame on whic&lt;/del&gt;',
               'my', 'man', 'comple', 
               '', '&lt;mdel&gt;at&lt;/mdel&gt;', 'te', 'ed', 
               ',', '.', '&lt;del&gt;and&lt;/del&gt;']">
		<rdg wit="fMS">&lt;del rend="strikethrough" 
          xml:id="c56-0045__main__d5e9572"&gt;
			&lt;sga-add hand="#pbs" place="superlinear" 
          sID="c56-0045__main__d5e9574"/&gt;the
	      frame on whic &lt;sga-add eID="c56-0045__main__d5e9574"/&gt; &lt;/del&gt; my man
		  comple &lt;mod sID="c56-0045__main__d5e9578"/&gt; 
          &lt;mdel&gt;at&lt;/mdel&gt;
		  &lt;sga-add place="intralinear" sID="c56-0045__main__d5e9582"/&gt;te
          &lt;sga-add eID="c56-0045__main__d5e9582"/&gt;
          &lt;sga-add sID="c56-0045__main__d5e9585"/&gt;ed
		  &lt;sga-add eID="c56-0045__main__d5e9585"/&gt;
          &lt;mod eID="c56-0045__main__d5e9578"/&gt;
          &lt;sga-add hand="#pbs" place="intralinear" sID="c56-0045__main__d5e9588"/&gt;, 
          &lt;sga-add eID="c56-0045__main__d5e9588"/&gt;.
		  &lt;del rend="strikethrough"
		  xml:id="c56-0045__main__d5e9591"&gt;And&lt;/del&gt;</rdg>
	</rdgGrp>
	<rdgGrp n="['the', 'accomplishment', 'of', 'my', 'toils.']">
		<rdg wit="f1818">the accomplishment of my toils.</rdg>
		<rdg wit="f1823">the accomplishment of my toils.</rdg>
		<rdg wit="fThomas">the accomplishment of my toils.</rdg>
		<rdg wit="f1831">the accomplishment of my toils.</rdg>
	</rdgGrp>
</app>

From collation data to spine

 

  • “Spine” = data model (dynamic nerve plexus?) holding the variorum together
    • standoff use of TEI critical apparatus
      • coordinates data on variance, including normalized tokens and maximum edit-distance values 
    • points to specific locations in the variorum edition files
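
To give a sense of what a "maximum edit-distance value" could mean for one variant location, the Python sketch below computes Levenshtein distance for every pair of readings and keeps the largest. Whether the project measures distance in exactly this way is an assumption; the function names are illustrative.

```python
from itertools import combinations


def levenshtein(a, b):
    """Count the single-character insertions, deletions, and substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def max_edit_distance(readings):
    """Largest pairwise distance among the normalized readings at one location."""
    return max((levenshtein(a, b) for a, b in combinations(readings, 2)), default=0)


print(max_edit_distance(["toils", "toil", "toils"]))
```

A single number per location is enough to drive a display such as a heatmap, where a higher value marks a passage with heavier variation.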

An interesting variant passage in Frankenstein

A Thomas copy edit of Letter IV at an early moment of intense revision

Another interesting variation in Frankenstein

where the Creature comes to life in MS and Thomas

How did we make this heatmap?

We made it from the "Standoff Spine" collation data:
XSLT transformation of TEI to SVG

See our Method page for details:

https://frankensteinvariorum.org/

 

So far this presentation has praised the TEI as “cosmopolitan”:

  • for being able to mark and study texts in ways that reflect their culture, not just ours
  • for building bridges between different language and encoding structures

Tokyo International Forum

Photo credit: David B. Cox, photographylife.com

Writing systems that challenge the limits of computer technology and the TEI

But I want to conclude by introducing a difficult problem:

TEI is severely limited for right-to-left (RTL) scripts

  • Right-to-left scripts for Arabic and Hebrew are problematic when marked up in left-to-right TEI XML
  • Till Grallert spoke of this in his keynote address at the 2023 TEI Conference in Paderborn, Germany
    • See Open Arabic Periodical Editions Project
    • Grallert told us of problems with Arabic typesetting: publishing machinery that cannot easily encompass the variety of symbols and glyphs needed for movable type.
  • Editing solutions are a compromise.
    • oXygen XML Editor, Kate, and VS Code offer some helpful but imperfect custom editing solutions for inline markup
    • TEI and XML itself interfere with Unicode directionality of the RTL text contents
<p xml:id="p_23.d1e747" xml:lang="ar">وما برحت الآمال معقودة بأن تبلغ الصحافة عما قليل أشدها ورشدها 
   <lb change="#d2e635 #d2e861" ed="print" edRef="#edition_1" xml:id="lb_9.d2e1448"/>
   لتضاهي صحافة الأمم الراقية في موضوعاتها وتأثيراتها إذ أن العقلاء يذهبون 
    <lb change="#d2e808 #d2e861" ed="print" edRef="#edition_1" xml:id="lb_10.d2e1647"/>
    إلى أن صحافتنا مازالت حالها على ما انتهت إليه غير متناسبة مع عمرها الطويل. 
    <lb change="#d2e808 #d2e861" ed="print" edRef="#edition_1" xml:id="lb_11.d2e1649"/>
   والمعمر في الأعم من حالاته يشتد ساعده وزنده وتقوى ملكة عقله وعلمه 
   <lb change="#d2e808 #d2e861" ed="print" edRef="#edition_1" xml:id="lb_12.d2e1651"/>
    بكثرة تجاربه وأسباب رويته. ولا خير في أمة لا يقوم بشؤونها شيوخ 
    <lb change="#d2e808 #d2e861" ed="print" edRef="#edition_1" xml:id="lb_13.d2e1653"/>
    تفاخر بأعمالهم مفاخرتها بعقولهم وطول أعمارهم.</p>
  • Sample encoding from Till Grallert, Digital edition (TEI XML) of the Arabic monthly journal *al-Muqtabas* (مجلة المقتبس), published by Muḥammad Kurd ʿAlī in Cairo and Damascus between 1906 and 1917/18
  • https://github.com/openarabicpe/journal_al-muqtabas
  • Standoff methods can help, but they are not a complete solution for customizing the TEI
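
The directionality conflict can be made visible with Python's standard `unicodedata` module, which reports the Unicode bidirectional class of each character. This is only a small illustration of why mixed markup and Arabic text is hard to display; the sample string is a shortened, hypothetical mix of a tag and the first word of the passage above.

```python
import unicodedata


def bidi_classes(text):
    """Return (character, Unicode bidirectional class) pairs, skipping spaces."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text if not ch.isspace()]


# A tag followed by Arabic letters: left-to-right letters, direction-neutral
# punctuation, and right-to-left Arabic letters all share one line.
sample = "<lb/>وما"
for ch, bidi_class in bidi_classes(sample):
    print(repr(ch), bidi_class)
```

Tag names are strong left-to-right characters, Arabic letters are strong right-to-left characters, and angle brackets are neutral, so an editor must decide at every tag boundary which direction the neutral characters should follow.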

 

 

Can the TEI provide alternatives to its international English tagging?

  • Hugh Cayless (in a working draft) suggests we have a lot of work to do.
    • What if TEI found a way to express its tags and their directionality to match the document characters?
    • What if we had software to process it? (Would this be an alternate form of XML? Or pushing XML beyond left-to-right directionality?)
       
  • The problem is larger than the TEI, but it is TEI encoders who may find the path to a more inclusive method of encoding!

Thank you for listening! Any questions?

ご清聴ありがとうございました!

何か質問はありますか?

Go seichō arigatōgozaimashita! Nanika shitsumon wa arimasu ka?

The Internationalization of the TEI (March 10, 2025)

By Elisa Beshero-Bondar
