Elisa Beshero-Bondar
Professor of Digital Humanities and Chair of the Digital Media, Arts, and Technology Program at Penn State Erie, The Behrend College.
LISTSERV ⇒ TEI
Elisa Beshero-Bondar
Penn State Erie
Syd Bauman
Northeastern University
Thu 10 Oct 2024 15:30
New TEI module for computer mediated communication
TEI has significant sets of data from computer communication platforms
CMC not intended specifically for e-mail conversations
but neither is <correspDesc>
“We choose to do [these] things, not because they are easy, but because they are hard; because that goal will serve to organize and measure the best of our energies and skills, because that challenge is one that we are willing to accept” — John F. Kennedy, 1962-09-12
E-mail lists are challenging:
We chose to start with TEI-L
large, interesting, important historical dataset
LISTSERV provides easy access to the data in raw form
we had recently been working with it (to move from Brown to PSU)
Initial thought:
converting data to XML will be easy
deciding exactly how to encode in TEI will be interesting
Turns out both parts were problematic
Syd — preparing the source
Goal — convert all of TEI-L up to 2024-04-29
Obtained data directly from Brown listmaster
412 separate files, one for each month, i.e., tei-l.log9001 through tei-l.log2404
Renamed files to use 4-digit year so they sorted into right order
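The slides do not say how the renaming was done. The following is only a minimal sketch of one way to expand the 2-digit year so that the names sort chronologically. The helper name `four_digit_name` and the pivot year of 90 are assumptions made for illustration.

```python
import re

# Hypothetical helper: the slides only say the files were renamed, not how.
LOG_NAME = re.compile(r"^(?P<stem>.*\.log)(?P<yy>\d{2})(?P<mm>\d{2})$")

def four_digit_name(name: str, pivot: int = 90) -> str:
    """Expand the 2-digit year in a log name such as 'tei-l.log9001'.

    Years >= pivot are read as 19xx, all others as 20xx. This is an
    assumption that fits TEI-L, whose first log dates from 1990.
    """
    m = LOG_NAME.match(name)
    if m is None:
        raise ValueError(f"not a monthly log name: {name!r}")
    yy = int(m["yy"])
    century = 1900 if yy >= pivot else 2000
    return f"{m['stem']}{century + yy}{m['mm']}"

names = ["tei-l.log2404", "tei-l.log9001", "tei-l.log0512"]
renamed = sorted(four_digit_name(n) for n in names)
print(renamed)
# ['tei-l.log199001', 'tei-l.log200512', 'tei-l.log202404']
```

With the original 2-digit names, a plain alphabetical sort would place the 2000s logs before the 1990s ones.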
Early e-mail systems used ASCII (7-bit) characters only
Of those 128 available characters, only 99 of them are legal XML characters
Of the 29 characters that are not legal in XML, 17 of them occur in the TEI-L archives
64 occurrences in 34 posts before 2000 (not surprising)
38 occurrences in 11 posts after 2008 (surprised me)
the last 2 of which are in a single post by David Maus on 2016-03-13
He was copying what someone else wrote
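The 99/29 split can be checked against the `Char` production of XML 1.0. The sketch below is not the tool used for the project. The function names are invented for illustration. It counts the legal and illegal code points in the ASCII range and locates offenders in a string.

```python
def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.0 'Char' production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

def illegal_positions(text: str) -> list[tuple[int, int]]:
    """Return (offset, code point) for every character XML 1.0 forbids."""
    return [(i, ord(ch)) for i, ch in enumerate(text)
            if not is_xml10_char(ord(ch))]

legal = [cp for cp in range(128) if is_xml10_char(cp)]
illegal = [cp for cp in range(128) if not is_xml10_char(cp)]
print(len(legal), len(illegal))          # 99 29
print(illegal_positions("caf\x08e\n"))   # [(3, 8)]
```

The 29 illegal characters are the C0 controls other than tab, line feed, and carriage return.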
These are valid characters (just not legal in XML), so neither iconv -c nor oXygen’s “Encoding errors handling” feature helps
Options:
Delete them, leaving users to figure out the missing characters
Replace them with other character(s), thus marking the location of each problematic character
For each case, try to figure out what character the author intended, and replace it with the proper UTF-8 character, possibly preserving information about the original encoding
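As an illustration of the second option only, here is a hedged sketch that swaps each XML-illegal ASCII control character for its visible counterpart in the Unicode "Control Pictures" block (U+2400 onward). Choosing that block is an assumption made for this example, not something the slides prescribe. The function handles only the 29 ASCII characters discussed above.

```python
def mark_illegal(text: str) -> str:
    """Replace XML-illegal C0 control characters with visible stand-ins.

    Each control character U+00xx becomes U+24xx (e.g. BACKSPACE U+0008
    becomes SYMBOL FOR BACKSPACE U+2408), so the location and identity
    of the original character stay recoverable. Tab, line feed and
    carriage return are legal in XML and are left alone.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x20 and cp not in (0x9, 0xA, 0xD):
            out.append(chr(0x2400 + cp))
        else:
            out.append(ch)
    return "".join(out)

print(mark_illegal("caf\x08e"))  # caf␈e
```

A marker of this kind keeps the text well-formed while leaving the harder third option (guessing the intended character) for later manual work.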
Take "core samples" (all the messages in a short range of months or years) from the TEI-L to figure out how to model them in TEI
Successfully extracted URLs for messages from the TEI-L web archive, but parsing the messages failed. (Abandoned for the conference due to time constraints, but we'd like to return to it. . .)
Some Complexities
Encoded words (RFC 2047) are commonly used in header field contents
=?charset?encoding?text?=
E.g., =?ISO-8859-2?Q?Piotr_Ba=F1ski?=
=?UTF-8?Q?Piotr_Ba=C5=84ski?=
Leave as-is, or convert to actual characters?
Piotr Bański
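If the choice is to convert to actual characters, the conversion itself is mechanical. The sketch below uses Python's standard `email.header` module as a stand-in. The project's own conversion is written in XSLT, so this is an illustration of the decoding step, not the authors' code.

```python
from email.header import decode_header, make_header

def decode_encoded_words(value: str) -> str:
    """Decode RFC 2047 encoded words in a header value to a Unicode string."""
    return str(make_header(decode_header(value)))

for raw in ("=?ISO-8859-2?Q?Piotr_Ba=F1ski?=",
            "=?UTF-8?Q?Piotr_Ba=C5=84ski?="):
    print(decode_encoded_words(raw))
# Piotr Bański
# Piotr Bański
```

Both spellings decode to the same name, which is one argument for normalizing: otherwise the same person appears under several different strings.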
MIME multipart mail often involves repeating the content of a message in two (or more) parts:
common: text/plain, text/html
rare: text/enriched, text/x-vcard, text/xml, application/*
Each part (regardless of type) is also encoded using a declared character set
common: UTF-8, ISO-8859-*, US-ASCII, Windows-1252
rare: macintosh, big5, Windows-1255, euc-kr, unknown
Because I've read it (I think in the late eigthies) in the=20
"Gentle Introduction". And the phrase is already there
in the World Wide Web as we see by google-ing=20
"making explicit what is conjectural or implicit"
with a lot of derived occurences.
Ask some elder man from the TEI board
(or encoding philosopher Lachance)
if my understanding of TEIs approach to encoding
is so odd as my Englisch ;-)
Best regards,
Herbert
--part1_97.5661ed07.2f108bb9_boundary
Content-Type: text/html; charset="ISO-8859-1"
Content-Transfer-Encoding: quoted-printable
<HTML><FONT FACE=3Darial,helvetica><HTML><FONT SIZE=3D3 PTSIZE=3D12 FAMILY=
=3D"SANSSERIF" FACE=3D"Arial" LANG=3D"2">In einer eMail vom 07.01.05 11:30:2=
0 (MEZ) Mitteleurop=E4ische Zeit schreibt sebastian.rahtz@COMPUTING-SERVICES=
.OXFORD.AC.UK:<BR>
<BR>
</FONT><FONT COLOR=3D"#000000" BACK=3D"#ffffff" style=3D"BACKGROUND-COLOR:=20=
#ffffff" SIZE=3D2 PTSIZE=3D10 FAMILY=3D"SANSSERIF" FACE=3D"Arial" LANG=3D"2"=
><BR>
<BLOCKQUOTE TYPE=3DCITE style=3D"BORDER-LEFT: #0000ff 2px solid; MARGIN-LEFT=
: 5px; MARGIN-RIGHT: 0px; PADDING-LEFT: 5px">Thats an odd assertion. Why do=20=
you think the TEI mandates<BR>
"Make explicit the implicit" ?<BR>
</BLOCKQUOTE><BR>
</FONT><FONT COLOR=3D"#000000" BACK=3D"#ffffff" style=3D"BACKGROUND-COLOR:=20=
#ffffff" SIZE=3D3 PTSIZE=3D12 FAMILY=3D"SANSSERIF" FACE=3D"Arial" LANG=3D"2"=
><BR>
Because I've read it (I think in the late eigthies) in the <BR>
"Gentle Introduction". And the phrase is already there<BR>
in the World Wide Web as we see by google-ing <BR>
"</FONT><FONT COLOR=3D"#000000" BACK=3D"#ffffff" style=3D"BACKGROUND-COLOR:=
#ffffff" SIZE=3D2 PTSIZE=3D10 FAMILY=3D"SANSSERIF" FACE=3D"Arial" LANG=3D"2=
">making explicit what is conjectural or implicit</FONT><FONT COLOR=3D"#000=
000" BACK=3D"#ffffff" style=3D"BACKGROUND-COLOR: #ffffff" SIZE=3D3 PTSIZE=
=3D12 FAMILY=3D"SANSSERIF" FACE=3D"Arial" LANG=3D"2">"<BR>
with a lot of derived occurences.<BR>
<BR>
Ask some elder man from the TEI board<BR>
(or encoding philosopher Lachance)<BR>
if my understanding of TEIs approach to encoding<BR>
is so odd as my Englisch ;-)<BR>
<BR>
Best regards,<BR>
Herbert<BR>
<BR>
<BR>
</FONT></HTML>
--part1_97.5661ed07.2f108bb9_boundary--
Email servers, etc., use at most 8 bits per character (i.e., 256 characters)
Thus every character set that uses any of the other 154,807 characters must be down-translated into only those 256
quoted-printable
base64
No XPath functions to translate back (EXPath for base64)
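Outside XPath, both transfer encodings are easy to reverse. As a minimal sketch using Python's standard library (not the XSLT used for the project), the following decodes a quoted-printable fragment and the first line of a base64 body, each with the character set its part declares.

```python
import base64
import quopri

def decode_qp(data: bytes, charset: str) -> str:
    """Undo quoted-printable, then interpret the bytes in the declared charset."""
    return quopri.decodestring(data).decode(charset)

def decode_b64(data: bytes, charset: str) -> str:
    """Undo base64, then interpret the bytes in the declared charset."""
    return base64.b64decode(data).decode(charset)

print(decode_qp(b"Mitteleurop=E4ische Zeit", "iso-8859-1"))
# Mitteleuropäische Zeit

first_line = (b"Q2lhcsOhbuKAmXMgb2JzZXJ2YXRpb24gZG9lcyBub3Qgc3F1YXJl"
              b"IHdpdGggb3VyIGV4cGVyaWVu")
print(decode_b64(first_line, "utf-8"))
# Ciarán’s observation does not square with our experien
```

The base64 line decodes cleanly on its own because 76 base64 characters map to exactly 57 bytes and no multi-byte UTF-8 sequence happens to straddle the line end. In general the whole body should be joined before decoding.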
Q2lhcsOhbuKAmXMgb2JzZXJ2YXRpb24gZG9lcyBub3Qgc3F1YXJlIHdpdGggb3VyIGV4cGVyaWVu
Y2UgaW4gdGhlIEVhcmx5UHJpbnQgcHJvamVjdC4gIENvbnNpZGVyIOKAmGhhbmRzb21lLCBjbGV2
ZXIsIGFuZCByaWNo4oCZIGZyb20gdGhlIG9wZW5pbmcgc2VudGVuY2Ugb2YgRW1tYS4gVGhlcmUg
bWF5IGJlIG9jY2FzaW9ucyB3aGVyZSB5b3Ugd2FudCB0byBpZGVudGlmeSBwaHJhc2VzIGxpa2Ug
dGhhdCBpbiBzb21lIG90aGVyIGNvcnB1cy4NCg0KV2VsbCwgZ28gdG8gaHR0cDovL2JsYWNrbGFi
LmVhcmx5cHJpbnQub3JnL2NvcnB1c3NlYXJjaC8gYW5kIGVudGVyIHRoZSBzZWFyY2ggdGVybQ0K
DQpbcG9zPSJqIl1bcG9zPSJqIl1bImFuZCJdW3Bvcz0iaiJdDQoNCldpdGhpbiBzZWNvbmRzIGl0
IHdpbGwgcmV0cmlldmUgOCwyOTEgbWF0Y2hlcyBmcm9tIHRleHRzIGJldHdlZW4gMTY0MCBhbmQg
MTY2MCwgbXkgcGVyc29uYWwgZmF2b3VyaXRlIGJlaW5nIOKAnHRoZSBTY290dGlzaCBncm93ZXMg
ZHVsbGUsIEZyb3N0aWUsIGFuZCB3YXl3YXJkLuKAnQ0KDQpJIGFtIHRvbGQgYnkgUGhpbCBCdXJu
cywgd2hvIGtub3dzIGEgbG90IGFib3V0IHRoZXNlIHRoaW5ncywgdGhhdCB0aGUgQmxhY2tsYWIg
c2VhcmNoIGVuZ2luZSBpcyByZWxhdGl2ZWx5IGVhc3kgdG8gaW5zdGFsbC4gSXQgYWxzbyBzdXBw
b3J0cyBpbmNyZW1lbnRhbCBpbmRleGluZywgd2hpY2ggaXMgYSBiaWcgaGVscC4gIFRoZSBjdXJy
ZW50IHVzZXIgaW50ZXJmYWNlIGlzIHZlcnkgU3BhcnRhbiwgYW5kIGEgdXNlciBoYXMgdG8ga25v
dyB0aGUgdGFnIHNldCBvbiB3aGljaCB0aGUgc2VhcmNoZXMgYXJlIGJhc2VkLiBCbGFja2xhYiBp
cyBlbGVtZW50IGF3YXJlIGluIHNpbXBsZSB3YXlzIHRoYXQgd2lsbCBzdXBwb3J0IG1hbnkgb2Yg
dGhlIHVzZXMgdGhhdCBjb21lIHVwIGluIGxpdGVyYXJ5IHNjaG9sYXJzaGlwLiBGb3IgaW5zdGFu
Y2UsIHlvdSBjYW4gbG9vayBmb3IgYWRqZWN0aXZlcyBiZWZvcmUg4oCYbGliZXJ0eeKAmSBpbiBw
b2V0cnkuIEFuZCBzbyBvbi4NCg0KDQoNCkZyb206ICJURUkgKFRleHQgRW5jb2RpbmcgSW5pdGlh
dGl2ZSkgcHVibGljIGRpc2N1c3Npb24gbGlzdCIgPFRFSS1MQExJU1RTRVJWLkJST1dOLkVEVT4g
b24gYmVoYWxmIG9mIFNlcmdlIEhlaWRlbiA8c2xoQEVOUy1MWU9OLkZSPg0KT3JnYW5pemF0aW9u
OiBFTlMgZGUgTHlvbg0KUmVwbHktVG86IFNlcmdlIEhlaWRlbiA8c2xoQEVOUy1MWU9OLkZSPg0K
RGF0ZTogVHVlc2RheSwgQXByaWwgMywgMjAxOCBhdCA4OjMwIEFNDQpUbzogIlRFSSAoVGV4dCBF
bmNvZGluZyBJbml0aWF0aXZlKSBwdWJsaWMgZGlzY3Vzc2lvbiBsaXN0IiA8VEVJLUxATElTVFNF
UlYuQlJPV04uRURVPg0KU3ViamVjdDogUmU6IDxjPiB0YWcNCg0KSGkgQ2lhcsOhbiwNCg0KTGUg
MjkvMDMvMjAxOCDDoCAwMDo1MCwgQ2lhcsOhbiDDkyBEdWliaMOtbiBhIMOpY3JpdCA6DQoNCi4u
Lg0KT3ZlcmFsbCBteSBjb25jbHVzaW9uIGlzIHRoYXQgdGhlcmUgaXMgbGl0dGxlIHBvaW50IGlu
IGNvbnZlcnRpbmcgdG8gVEVJIGEgY29ycHVzIG9mIHRleHRzIGludGVuZGVkIGZvciBpbmRleGlu
Zy9yZXRyaWV2YWwsIGFzIGl0IGRvZXMgbm90IG1lYW4gdGhleSBjYW4gYmUgZWFzaWx5IHVzZWQg
d2l0aCBtb3JlIGFwcGxpY2F0aW9ucyBhbmQgb24gbW9yZSBwbGF0Zm9ybXMuICBJZiBYYWlyYSBo
YWQgY29udGludWVkIHRvIGJlIGRldmVsb3BlZCwgdGhpcyBtaWdodCBoYXZlIGJlZW4gZGlmZmVy
ZW50Lg0KLi4uDQoNClRoYW5rIHlvdSBmb3IgdGhlIHJlcG9ydCBvbiB0aGUgYXBwbGljYXRpb25z
Lg0KDQpXaGF0IHdvdWxkIGhlbHAgYSBsb3Qgd291bGQgYmUgdG8gbGlzdCBleHBsaWNpdGx5IHNv
bWUgc2VydmljZXMgb3IgZmVhdHVyZXMgb2YgWGFpcmEgdXNlZnVsIG9yIG5lY2Vzc2FyeSBmb3Ig
eW91IHRoYXQgYXJlIG5vdCBmb3VuZCBpbiB0aGUgc29mdHdhcmUgZGlzY3Vzc2VkLiBTb21laG93
IHRoZSByZWxldmFudCBmZWF0dXJlcyBvZiBYTUwgZWRpdG9ycyBmb3IgdGVhY2hpbmcgaGF2ZSBi
ZWVuIGRpc2N1c3NlZCBhbmQgc3ludGhlc2l6ZWQgaGVyZTogaHR0cHM6Ly93aWtpLnRlaS1jLm9y
Zy9pbmRleC5waHAvRWRpdG9yX2Zvcl90ZWFjaGluZ19URUlfLV9mZWF0dXJlczxodHRwczovL3Vy
bGRlZmVuc2UucHJvb2Zwb2ludC5jb20vdjIvdXJsP3U9aHR0cHMtM0FfX3dpa2kudGVpLTJEYy5v
cmdfaW5kZXgucGhwX0VkaXRvci01RmZvci01RnRlYWNoaW5nLTVGVEVJLTVGLTJELTVGZmVhdHVy
ZXMmZD1Ed01GYVEmYz15SGxTMDRIaEJyYWVzNUJROXVldTV6S2hFN3J0Tlh0X2QwMTJ6MlBBNndz
JnI9ckc4enhPZHNzcVN6RFJ6NHgxR0xsbUxPVzYweHlWWHlkeHduSlpwa3hiayZtPW94bFkwVFI1
LTRVWnVaY0F1b0FzWnh1bnNwM1JXNEl5RDhqazBhdmlYVGMmcz1FRkJhdUlnQmZObkNqU2ZINHpp
NzZxbC1Lc1NFR1dVc05hQk5uemphdHpFJmU9Pg0KDQpCZXN0LA0KU2VyZ2UNCg0KLS0NCg0KRHIu
IFNlcmdlIEhlaWRlbiwgc2xoIEFUIGVucy1seW9uLmZyLCBodHRwOi8vdGV4dG9tZXRyaWUuZW5z
LWx5b24uZnI8aHR0cHM6Ly91cmxkZWZlbnNlLnByb29mcG9pbnQuY29tL3YyL3VybD91PWh0dHAt
M0FfX3RleHRvbWV0cmllLmVucy0yRGx5b24uZnImZD1Ed01GYVEmYz15SGxTMDRIaEJyYWVzNUJR
OXVldTV6S2hFN3J0Tlh0X2QwMTJ6MlBBNndzJnI9ckc4enhPZHNzcVN6RFJ6NHgxR0xsbUxPVzYw
eHlWWHlkeHduSlpwa3hiayZtPW94bFkwVFI1LTRVWnVaY0F1b0FzWnh1bnNwM1JXNEl5RDhqazBh
dmlYVGMmcz1PRWRCQWsta05GZEJlRkFaUUY3Qk5MVG9oc0x1Q3JEU09PUG1zSkU3WVNVJmU9Pg0K
DQrDiXF1aXBlIGRlIHJlY2hlcmNoZSBDYWN0dXMsIGxhYm9yYXRvaXJlIElIUklNIFVNUjUzMTcs
IEVOUyBkZSBMeW9uDQoNCjE1LCBwYXJ2aXMgUmVuw6kgRGVzY2FydGVzIDY5MzQyIEx5b24gQlA3
MDAwIENlZGV4LCB0w6lsLiArMzMoMCk2MjIwMDM4ODMNCg==
It is common for the same content to be presented in multiple content types
text/plain
text/html (what to do?)
rarely (19 of 25,415) other content types
Regardless of content type, it may be encoded as 7-bit, 8-bit, quoted-printable, or base64 in any of various character encodings
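One possible answer to "what to do?" is to prefer the text/plain alternative and fall back to text/html only when no plain part exists. The sketch below shows that policy with Python's standard `email` package. The tiny message and its boundary `b1` are made up for the example, and the preference order is an assumption, not a decision reported in the slides.

```python
from email import message_from_string
from email.policy import default

# Made-up two-part message; the boundary name "b1" is arbitrary.
RAW = """\
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="b1"

--b1
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: quoted-printable

Mitteleurop=E4ische Zeit
--b1
Content-Type: text/html; charset="ISO-8859-1"
Content-Transfer-Encoding: quoted-printable

<HTML>Mitteleurop=E4ische Zeit<BR></HTML>
--b1--
"""

def preferred_text(raw: str) -> tuple[str, str]:
    """Return (content type, decoded text) of the preferred body part."""
    msg = message_from_string(raw, policy=default)
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        raise ValueError("no text/plain or text/html part found")
    # get_content() undoes the transfer encoding and the declared charset.
    return part.get_content_type(), part.get_content()

ctype, text = preferred_text(RAW)
print(ctype, text.strip())
```

Because `get_content()` handles both the transfer encoding and the character set, the same call covers 7-bit, 8-bit, quoted-printable, and base64 parts.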
count   field                      note
30,995  Date
30,995  From
30,995  Reply-To
30,995  Sender
30,883  Subject
28,687  MIME-Version               MIME = Multipurpose Internet Mail Extensions
28,677  Content-Type
19,896  In-Reply-To
19,568  Content-Transfer-Encoding
14,270  Comments
12,757  Message-ID
 3,285  Organization
   591  Content-Disposition
    22  X-cc                       carbon copy (non-standard)
Date ⇒ post/@when
From ⇒ post/@who
Reply-To ⇒ correspDesc/correspContext/ptr[@type="reply-to"]
Sender ⇒ correspDesc/correspAction[@type="relayed"]/email
Subject ⇒ post/head
In-Reply-To ⇒ correspDesc/correspContext/ref[@type="in-reply-to"]
Comments [1] ⇒ correspDesc/correspAction[@type="orig-(to|cc)"]/email
Message-ID ⇒ idno[@type="message-id"]
Organization ⇒ correspAction[@type="sent"]/orgName
X-cc [1] ⇒ correspDesc/correspAction[@type="orig-cc"]/email
[1] The X-cc: field and the vast majority of the Comments: fields give the original To: or CC: field value.
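The Date ⇒ post/@when mapping requires turning an e-mail date, which in the oldest posts has a 2-digit year and a named time zone, into an ISO 8601 value. As a hedged illustration in Python (the project itself does this in XSLT, and the function name is invented here), the standard library handles both quirks:

```python
from email.utils import parsedate_to_datetime

def header_date_to_when(value: str) -> str:
    """Convert an e-mail Date: header value to an ISO 8601 string for @when."""
    return parsedate_to_datetime(value).isoformat()

print(header_date_to_when("Tue, 6 Feb 90 10:54:04 CST"))
# 1990-02-06T10:54:04-06:00
```

The result matches the `when` value on the `<post>` element in the sample output for message 00005.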
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0"
xmlns:tmp="http://www.wwp.neu.edu/temp/ns"
xmlns:xi="http://www.w3.org/2001/XInclude"
n="00005"
xml:id="TEI-L.txt_msg_00005">
<teiHeader>
<fileDesc>
<titleStmt>
<title>TEI-L posting </title>
</titleStmt>
<publicationStmt>
<ab>Currently an experimental document not intended
for publication. That said, the original source was
published as part of a public mailing list, so this
document similarly is publicly available.</ab>
</publicationStmt>
<sourceDesc>
<ab type="desc">The source text file was (or at least should have been)
a log file from a LISTSERV mailing list.</ab>
<ab type="filepath">/tmp/LISTSERV_to_TEI/TEI-L.txt</ab>
</sourceDesc>
</fileDesc>
<encodingDesc>
<appInfo>
<application ident="listserv_log2cmc.xslt" version="0.1">
<desc>This file generated 2024-09-29T13:53:12.055127649-04:00 by file:/home/syd/Documents/tei-work/MM2024_Buenos_Aires/listserv_log2cmc.xslt,
a program intended to convert LISTSERV logs (i.e., an archive of
postings to a LISTSERV mailing list) to TEI, using /tmp/LISTSERV_to_TEI/TEI-L.txt as input.</desc>
</application>
</appInfo>
</encodingDesc>
<profileDesc>
<correspDesc>
<correspContext>
<ptr type="reply-to" target="mailto:TEI-L@UICVM"/>
</correspContext>
<correspAction type="relayed">
<email>TEI-L@UICVM</email>
<date>Tue, 6 Feb 90 10:54:04 CST</date>
</correspAction>
<note type="Comments">"ACH / ACL / ALLC Text Encoding Initiative"</note>
</correspDesc>
</profileDesc>
<xenoData>
<tmp:Date>Tue, 6 Feb 90 10:54:04 CST</tmp:Date>
<tmp:Reply-To>Text Encoding Initative public discussion list &lt;TEI-L@UICVM&gt;</tmp:Reply-To>
<tmp:Sender>Text Encoding Initative public discussion list &lt;TEI-L@UICVM&gt;</tmp:Sender>
<tmp:Comments>"ACH / ACL / ALLC Text Encoding Initiative"</tmp:Comments>
<tmp:From>Michael Sperberg-McQueen 312 996-2477 -2981 &lt;U35395@UICVM.BITNET&gt;</tmp:From>
<tmp:Subject>compound documents</tmp:Subject>
</xenoData>
</teiHeader>
<text>
<body>
<post who="/tmp/LISTSERV_to_TEI/TEI-L.xml#u35395·@·uicvm.bitnet"
when="1990-02-06T10:54:04-06:00">
<head type="subject">compound documents</head>
About compound documents in SGML and in the TEI. R.P. Weber asked a
<lb n="1"/>week ago "could someone please explain the TEI approach to compound
<lb n="2"/>documents and images? WIll SGML be used here, and if so, how?"
<lb n="3"/>
<lb n="4"/>Apologies for my delay in answering. I was hoping one of our hypertext
<lb n="5"/>sages might weigh in with a reply. (But he appears to have been in the
<lb n="6"/>Caribbean, and may not have received the query.)
<lb n="7"/>
<lb n="8"/>This problem has not, in fact, been discussed on this server, or as far
<lb n="9"/>as I'm aware by the working committees. So the formal answer is that no
<lb n="10"/>decision has been cast in concrete yet. Which allows me to turn the
<lb n="11"/>tables and say "How *should* compound documents be encoded for
<lb n="12"/>interchange? What are the requirements? What are the alternatives?"
<lb n="13"/>
<lb n="14"/>Less formally, I can offer some personal opinions, for what they are
<lb n="15"/>worth (face value: two cents -- 2u, for those of you on IBM mainframes
<lb n="16"/>with real IBM terminals).
<lb n="17"/>
<lb n="18"/>Certainly SGML is where we should start in any search for ways of
<lb n="19"/>handling compound objects, and I don't yet know any reason that SGML
<lb n="20"/>won't provide a solution for the problem. I assume there are two
<lb n="21"/>methods of using SGML in compound documents (correct me if I'm wrong):
<lb n="22"/>(1) use SGML to organize the compound document (i.e. have an SGML
<lb n="23"/>document which includes text, images, sound, etc. as its components), or
<lb n="24"/>(2) use whatever-you-like to organize the compound document as a whole,
<lb n="25"/>and use SGML as the notation for the textual components of the compound
<lb n="26"/>document.
<lb n="27"/>
<lb n="28"/>For the simple case of text-with-illustrations, SGML seems like a viable
<lb n="29"/>encoding mechanism (for the envelope and for the text components) to me.
<lb n="30"/>It allows you to encode the graphics however you like, declaring your
<lb n="31"/>graphics format as a non-SGML notation and declaring the contents of
<lb n="32"/>your graphics elements (say 'PICTURE' or 'BLORT') as being data in that
<lb n="33"/>notation, stored either within the SGML file or externally to it. You
<lb n="34"/>get localization of the graphics within the text stream, integral or
<lb n="35"/>separate storage of the graphics, and complete freedom to choose
<lb n="36"/>whatever graphic notation you wish.
<lb n="37"/>
<lb n="38"/>As document encoding methods go, SGML is fairly hospitable to graphics
<lb n="39"/>and other non-text pieces of compound objects. Nowhere in the standard
<lb n="40"/>does it say that the data have to be words and characters. In fact, as
<lb n="41"/>far as I know there is no *explicit* requirement in the standard that an
<lb n="42"/>SGML document even has to be bytes in a computer. (Sure, it's hard to
<lb n="43"/>understand the standard any other way, but that's not the same as an
<lb n="44"/>explicit requirement.) ISO 8879 par. 6.1 note 1 says in fact "This
<lb n="45"/>International Standard does not constrain the physical organization of
<lb n="46"/>the document within the data stream, message handling protocol, file
<lb n="47"/>system, etc., that contains it." At the SGML '89 conference last
<lb n="48"/>October in Atlanta, there was a very nice paper by Douglas MacLeod (read
<lb n="49"/>by Yuri Rubinsky) thinking about architectural designs as SGML
<lb n="50"/>documents, which led to a general discussion of SGML definitions for all
<lb n="51"/>sorts of objects, including automobiles. Although most people
<lb n="52"/>(obviously) think of the SGML document as an electronic *description* of
<lb n="53"/>the automobile (and the physical automobile as a side effect of
<lb n="54"/>processing), it appears, in the light of the passage cited, hard to say
<lb n="55"/>categorically that an automobile itself could never be parsed as an SGML
<lb n="56"/>document. (If you could figure out how to define the delimiters.)
<lb n="57"/>
<lb n="58"/>The only hitch is that the SGML standard itself (ISO 8879) does not
<lb n="59"/>specify in any detail what the interface between SGML processors and
<lb n="60"/>non-SGML processors must, may, or can look like -- an advantage, if you
<lb n="61"/>will, in that it doesn't constrain anyone to an inappropriate model, but
<lb n="62"/>a bit of a disadvantage in that most people don't have a clue what they
<lb n="63"/>can now or will eventually or might someday be able to do with SGML and
<lb n="64"/>graphics processors.
<lb n="65"/>
<lb n="66"/>Not being deeply involved in graphics work or compound documents myself,
<lb n="67"/>I don't know off-hand what options are offered for this sort of thing by
<lb n="68"/>existing SGML processors. There will certainly be a fierce market
<lb n="69"/>demand for it, not only from humanists but also (to our great advantage)
<lb n="70"/>from the defense industry, which needs SGML support for technical
<lb n="71"/>manuals with diagrams (and of course cross-references and other
<lb n="72"/>hypertext mechanisms) and has the small change to pay for the
<lb n="73"/>development costs. (As long as they don't charge the humanists
<lb n="74"/>defense-contractor prices!)
<lb n="75"/>
<lb n="76"/>If for some reason one does *not* want to use SGML as the envelope for
<lb n="77"/>the entire compound document, then presumably the major requirement for
<lb n="78"/>the text-components of the compound documents is that they be
<lb n="79"/>computationally well-behaved, with a clearly defined structure, hooks
<lb n="80"/>for pointers going out, and hooks for pointers coming in. SGML
<lb n="81"/>certainly has all of this, in its document type declarations and its ID
<lb n="82"/>names and its IDREF pointers.
<lb n="83"/>
<lb n="84"/>Perhaps those subscribers to this list who actually work with compound
<lb n="85"/>documents and SGML will be willing to say how they make things work now,
<lb n="86"/>and how they would like to see things developing in the future.
<lb n="87"/>
<lb n="88"/>All this is, I repeat, just personal opinion and shouldn't be taken as
<lb n="89"/>defining "the" position of the TEI. (Unless, of course, taking as "the"
<lb n="90"/>position will help get a discussion started.)
<lb n="91"/>
<lb n="92"/>-Michael Sperberg-McQueen
<lb n="93"/> University of Illinois at Chicago</post>
</body>
</text>
</TEI>
teiHeader
TEI Listserv Original Server Location, Logfiles => fileDesc/sourceDesc
Method of extraction => encodingDesc/samplingDecl
text/body
Monthly Log collection => text/body/div[@type="log"][@xml:id="LOG____"]
Individual E-mail message => div[@type="log"]/post
post/dateline
Date => date/@when | dateline/date/text()
Reply-To => ref[@type="reply-to"][@target="email:___"]
Sender => ref[@generatedBy="system"][@type="sender"][@target="email:TEI-L@__"]
From => ref[@generatedBy="template"][@type="from"][@target="email:___"]
Subject => title[@level="a"][@generatedBy="human"]
In-Reply-To => ref[@target="#id-of-earlier-posting"]
<post xml:id="Web-1990-01-31-0933">
<dateline>
<date when="1990-01-31">Wed, 31 Jan 90 09:33:26 CST</date>
<ref generatedBy="system" type="reply-to" target="email:TEI-L@UICVM">Text Encoding
Initative public discussion list </ref>
<ref generatedBy="system" type="sender" target="email:TEI-L@UICVM">Text Encoding
Initative public discussion list</ref>
<ref generatedBy="template" type="from" target="email:WEBER@HARVARDA.BITNET">Robert
Philip Weber</ref>
<title level="a" type="subject" generatedBy="human">compound documents and
images</title>
</dateline>
<p>could someone please explain the <name type="ML">TEI</name> approach to compound
documents and images? WIll <name type="ML">SGML</name> be used here, and if so,
how? I've just joined the list. sorry if this has been asked before.</p>
<p>Many Thanks</p>
<signed generatedBy="human">Bob Weber</signed>
<signed generatedBy="template"> Robert Philip Weber, Ph.D. | Phone: (617) 495-3744
<lb/>Senior Consultant | Fax: (617) 495-0750 <lb/>Academic and Planning Services |
<lb/>Division | <lb/>Office For Information Technology| Internet:
weber@popvax.harvard.edu <lb/>Harvard University | Bitnet: Weber@Harvarda <lb/>50
Church Street | <lb/>Cambridge MA 02138 | </signed>
</post>
<teiHeader>
<fileDesc>
<titleStmt>
<title>Text Encoding Initiative public discussion list</title>
</titleStmt>
<editionStmt>
<edition>January and February 1990 in the TEI Listserv
<!-- TEI-L LOG9001, LOG9002 --></edition>
</editionStmt>
<publicationStmt>
<!-- about the born-digital document -->
<publisher>https://github.com/tei-cmc-experiment/tei-cmc-experiment</publisher>
</publicationStmt>
<sourceDesc>
<bibl>
<title level="j">TEI-L Listserv</title>
<title level="s">LOG9001</title>
<title level="s">LOG9002</title>
<publisher>University of Illinois Chicago</publisher>
<distributor>TEI-L@UICVM</distributor>
<date>1990</date>
<relatedItem type="archive">
<bibl>
<publisher>The Pennsylvania State University</publisher>
<distributor>LISTS.PSU.EDU LISTSERV Server (17.0)</distributor>
<date>2024</date>
</bibl>
</relatedItem>
</bibl>
</sourceDesc>
</fileDesc>
<encodingDesc>
<samplingDecl>
<p>Sampled by requesting monthly logs from <name type="API">LISTS.PSU.EDU</name> by
e-mail with GET commands: <code>GET TEI-L LOG 9001</code> (for January 1990). One log
command was issued for each month. See <ptr type="APIdoc"
target="https://www.lsoft.com/manuals/17.0/commands/14File-serverandwebfunctioncomma.html"/>.
Received by e-mail between <date from="2024-09-21" to="2024-09-22">September 21
and 22, 2024</date>.</p>
</samplingDecl>
</encodingDesc>
</teiHeader>
teiHeader/profileDesc/correspDesc
From => correspAction[@type="sent"]/persName
Date => correspAction[@type="sent"]/date
Sender => correspAction[@type="relayed"]/orgName/ref[@target="email:TEI-L@__"]
In-Reply-To => correspContext/ref[@type="in-response-to"][@target="#id-of-earlier-posting"]
Subject => teiHeader/fileDesc/titleStmt/title[@type="subjectLine"]
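To make the From and Date mappings concrete, here is a rough sketch that builds the "sent" `<correspAction>` from two raw header values. It uses Python's ElementTree purely for illustration. The function name `corresp_desc` and the input dictionary are assumptions, and the encoding shown in these slides was not produced by this code.

```python
import xml.etree.ElementTree as ET
from email.utils import parseaddr, parsedate_to_datetime

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)

def corresp_desc(headers: dict) -> ET.Element:
    """Build <correspDesc> with a 'sent' action from From: and Date: values."""
    def el(parent, name, text=None, **attrs):
        child = ET.SubElement(parent, f"{{{TEI_NS}}}{name}", attrs)
        child.text = text
        return child

    root = ET.Element(f"{{{TEI_NS}}}correspDesc")
    sent = el(root, "correspAction", type="sent")
    name, addr = parseaddr(headers["From"])      # split display name / address
    el(sent, "persName", name)
    el(sent, "email", addr)
    when = parsedate_to_datetime(headers["Date"]).date().isoformat()
    el(sent, "date", headers["Date"], when=when)  # keep original text as content
    return root

desc = corresp_desc({
    "From": "Robert Philip Weber <WEBER@HARVARDA.BITNET>",
    "Date": "Wed, 31 Jan 90 09:33:26 CST",
})
print(ET.tostring(desc, encoding="unicode"))
```

The "relayed" action and `<correspContext>` would follow the same pattern from the Sender and In-Reply-To fields.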
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<fileDesc>
<!-- SAME AS PREVIOUS EXAMPLE . . . -->
</fileDesc>
<encodingDesc>
<!-- SAME AS PREVIOUS EXAMPLE . . . -->
</encodingDesc>
</teiHeader>
<TEI xml:id="Web-1990-01-31-0933">
<teiHeader>
<fileDesc>
<titleStmt>
<title type="subjectLine">compound documents and images</title>
<author>Robert Philip Weber</author>
</titleStmt>
<publicationStmt>
<p>Text Encoding Initative public discussion list</p>
</publicationStmt>
<sourceDesc>
<bibl><title level="s">LOG9001</title></bibl>
</sourceDesc>
</fileDesc>
<profileDesc>
<langUsage>
<language ident="en">English</language>
</langUsage>
<correspDesc>
<correspAction type="sent">
<persName>Robert Philip Weber</persName>
<email>WEBER@HARVARDA.BITNET</email>
<date when="1990-01-31">Wed, 31 Jan 90 09:33:26 CST</date>
</correspAction>
<correspAction type="relayed">
<orgName><ref target="mailto:TEI-L@UICVM">Text Encoding
Initative public discussion list</ref></orgName>
</correspAction>
</correspDesc>
</profileDesc>
</teiHeader>
<text>
<body>
<post>
<p>could someone please explain the <name type="ML">TEI</name> approach to compound
documents and images? WIll <name type="ML">SGML</name> be used here, and if so,
how? I've just joined the list. sorry if this has been asked before.</p>
<p>Many Thanks</p>
<signed generatedBy="human">Bob Weber</signed>
<signed generatedBy="template"> Robert Philip Weber, Ph.D. | Phone: (617) 495-3744
<lb/>Senior Consultant | Fax: (617) 495-0750 <lb/>Academic and Planning Services |
<lb/>Division | <lb/>Office For Information Technology| Internet:
weber@popvax.harvard.edu <lb/>Harvard University | Bitnet: Weber@Harvarda <lb/>50
Church Street | <lb/>Cambridge MA 02138 | </signed>
</post>
</body>
</text>
</TEI>
<TEI xml:id="Spe-1990-02-06-1054">
<teiHeader>
<fileDesc>
<titleStmt>
<title type="subjectLine">compound documents</title>
<author>Michael Sperberg-McQueen</author>
</titleStmt>
<publicationStmt>
<p>Text Encoding Initative public discussion list</p>
</publicationStmt>
<sourceDesc>
<bibl><title level="s">LOG9002</title></bibl>
</sourceDesc>
</fileDesc>
<profileDesc>
<langUsage>
<language ident="en">English</language>
</langUsage>
<correspDesc>
<correspAction type="sent">
<persName>Michael Sperberg-McQueen</persName>
<email>U35395@UICVM.BITNET</email>
<date when="1990-02-06">Tue, 6 Feb 90 10:54:04 CST</date>
</correspAction>
<correspAction type="relayed">
<orgName><ref target="mailto:TEI-L@UICVM">Text Encoding
Initative public discussion list</ref></orgName>
</correspAction>
<correspContext>
<ref type="in-response-to" target="#Web-1990-01-31-0933">previous message of
<persName>Robert Philip Weber</persName> to the TEI-L Listserv.
<date when="1990-01-31"/>
</ref>
</correspContext>
</correspDesc>
</profileDesc>
</teiHeader>
<text>
<body>
<post>
<p>About compound documents in <name type="ML">SGML</name> and in the <name type="ML"
>TEI</name>. <ref target="#Web-1990-01-31-0933"><persName>R.P. Weber</persName>
asked a week ago <q>could someone please explain the <name type="ML">TEI</name>
approach to compound documents and images? WIll <name type="ML">SGML</name>
be used here, and if so, how?</q></ref></p>
<p>Apologies for my delay in answering. I was hoping one of our hypertext sages might
weigh in with a reply. (But he appears to have been in the
<placeName>Caribbean</placeName>, and may not have received the query.)</p>
<p>This problem has not, in fact, been discussed on this server, or as far as I'm
aware by the working committees. So the formal answer is that no decision has been
cast in concrete yet. Which allows me to turn the tables and say <q>How <emph
rend="*">should</emph> compound documents be encoded for interchange? What
are the requirements? What are the alternatives?</q></p>
<p>Less formally, I can offer some personal opinions, for what they are worth (face
value: two cents -- 2u, for those of you on <objectName>IBM
mainframes</objectName> with real <objectName>IBM terminals</objectName>).</p>
<p> Certainly <name type="ML">SGML</name> is where we should start in any search for
ways of handling compound objects, and I don't yet know any reason that SGML won't
provide a solution for the problem. I assume there are two methods of using <name
type="ML">SGML</name> in compound documents (correct me if I'm wrong): (1) use
<name type="ML">SGML</name> to organize the compound document (i.e. have an
SGML document which includes text, images, sound, etc. as its components), or (2)
use whatever-you-like to organize the compound document as a whole, and use <name
type="ML">SGML</name> as the notation for the textual components of the
compound document.</p>
<p>For the simple case of text-with-illustrations, <name type="ML">SGML</name> seems
like a viable encoding mechanism (for the envelope and for the text components) to
me. It allows you to encode the graphics however you like, declaring your graphics
format as a non-<name type="ML">SGML</name> notation and declaring the contents of
your graphics elements (say 'PICTURE' or 'BLORT') as being data in that notation,
stored either within the <name type="ML">SGML</name> file or externally to it. You
get localization of the graphics within the text stream, integral or separate
storage of the graphics, and complete freedom to choose whatever graphic notation
you wish.</p>
<p>As document encoding methods go, <name type="ML">SGML</name> is fairly hospitable
to graphics and other non-text pieces of compound objects. Nowhere in the standard
does it say that the data have to be words and characters. In fact, as far as I
know there is no <emph rend="*">explicit</emph> requirement in the standard that
an <name type="ML">SGML</name> document even has to be bytes in a computer. (Sure,
it's hard to understand the standard any other way, but that's not the same as an
explicit requirement.) <bibl>ISO 8879 par. 6.1 note 1</bibl> says in fact <q>This
International Standard does not constrain the physical organization of the
document within the data stream, message handling protocol, file system, etc.,
that contains it.</q> At the <eventName>SGML '89 conference</eventName> last
<date when="1989-10">October</date> in <placeName>Atlanta</placeName>, there
was a very nice paper by <persName>Douglas MacLeod</persName> (read by
<persName>Yuri Rubinsky</persName>) thinking about architectural designs as
<name type="ML">SGML</name> documents, which led to a general discussion of
<name type="ML">SGML</name> definitions for all sorts of objects, including
automobiles. Although most people (obviously) think of the <name type="ML"
>SGML</name> document as an electronic <emph rend="*">description</emph> of the
automobile (and the physical automobile as a side effect of processing), it
appears, in the light of the passage cited, hard to say categorically that an
automobile itself could never be parsed as an <name type="ML">SGML</name>
document. (If you could figure out how to define the delimiters.)</p>
<p>The only hitch is that the <name type="ML">SGML</name> standard itself (<bibl
type="standard">ISO 8879</bibl>) does not specify in any detail what the
interface between <name type="ML">SGML</name> processors and non-<name type="ML"
>SGML</name> processors must, may, or can look like -- an advantage, if you
will, in that it doesn't constrain anyone to an inappropriate model, but a bit of
a disadvantage in that most people don't have a clue what they can now or will
eventually or might someday be able to do with SGML and graphics processors.</p>
<p>Not being deeply involved in graphics work or compound documents myself, I don't
know off-hand what options are offered for this sort of thing by existing <name
type="ML">SGML</name> processors. There will certainly be a fierce market
demand for it, not only from humanists but also (to our great advantage) from the
defense industry, which needs <name type="ML">SGML</name> support for technical
manuals with diagrams (and of course cross-references and other hypertext
mechanisms) and has the small change to pay for the development costs. (As long as
they don't charge the humanists defense-contractor prices!)</p>
<p>If for some reason one does <emph rend="*">not</emph> want to use <name type="ML"
>SGML</name> as the envelope for the entire compound document, then presumably
the major requirement for the text-components of the compound documents is that
they be computationally well-behaved, with a clearly defined structure, hooks for
pointers going out, and hooks for pointers coming in. SGML certainly has all of
this, in its document type declarations and its ID names and its IDREF
pointers.</p>
<p>Perhaps those subscribers to this list who actually work with compound documents
and <name type="ML">SGML</name> will be willing to say how they make things work
now, and how they would like to see things developing in the future.</p>
<p> All this is, I repeat, just personal opinion and shouldn't be taken as defining
<soCalled>the</soCalled> position of the <orgName>TEI</orgName>. (Unless, of
course, taking as <soCalled>the</soCalled> position will help get a discussion
started.)</p>
<signed generatedBy="human">-Michael Sperberg-McQueen<lb/> University of Illinois at
Chicago </signed>
</post>
</body>
</text>
</TEI>
</TEI>
TEI provides no good way to encode an e-mail Subject line in either <correspDesc> or <dateline>
<head> and <label> are not allowed there, and are not ideal anyway
<title> not quite apt
Could we put <head> in <post>?
Yes: it is allowed, and it does make sense
No: it is metadata from the e-mail header section, not part of the e-mail body
TEI no proporciona ninguna buena manera de codificar la línea de asunto de un correo en <correspDesc> ni en <dateline>
<head>, <label> no se permiten y de todos modos no son ideales
<title> no es enteramente apto
¿Podríamos poner <head> en <post>?
Sí, se permite, y tiene sentido
No: es parte de los metadatos de la sección cabecera, no parte del cuerpo del correo
e-mail addresses in <person>:
<email> not allowed directly in <person>
<email> is poorly defined
<post> allows <lb/> but not <line>
direcciones de correo en <person>:
<email> no se permite directamente en <person>
<email> está mal definido
<post> permite <lb/> pero no <line>
What @level values are appropriate for the <title>s of:
TEI-L Listserv as a whole?
LOGYYMM files?
<post> elements?
Do we need new values of @level for born-digital compound artifacts?
title[@level="s"]: (series)
title[@level="j"]: (journal)
title[@level="m"]: (monographic)
title[@level="a"]: (analytic)
¿Qué valores de @level son apropiados para los <title> de:
¿TEI-L Listserv en su conjunto?
¿Archivos LOGYYMM?
¿Elementos <post>?
¿Necesitamos nuevos valores de @level para artefactos de origen digital compuestos?
Source: public TEI-L web archive, selected via URL:
https://lists.psu.edu/cgi-bin/wa?A1=199001-199512&L=TEI-L
good possibilities for directly outputting XML
Can define the XML output structure from message headers and HTML element structure
Works with XPath 2.0 to retrieve HTML element contents
Fuente: archivo web público de TEI-L, seleccionado a través de URL:
https://lists.psu.edu/cgi-bin/wa?A1=199001-199512&L=TEI-L
Buenas posibilidades para generar directamente XML
Puede definir la estructura de salida XML a partir de encabezados de mensajes y estructura de elementos HTML
Funciona con XPath 2.0 para recuperar contenidos de elementos HTML
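A minimal sketch of the idea of pulling message links out of an archive page with path expressions. The XHTML fragment and its `class="subject"` cell are hypothetical (the real archive pages are much denser), and Python's `xml.etree.ElementTree` supports only a small XPath subset, not the XPath 2.0 mentioned above. Only the message URL pattern comes from the archive itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified XHTML standing in for an archive index page.
# The table layout and class name are assumptions for illustration only.
PAGE = """<html><body>
<p><a href="/cgi-bin/wa?A0=TEI-L">List home</a></p>
<table>
<tr><td class="subject"><a href="/cgi-bin/wa?A2=TEI-L;cd018892.9509">Re: DYNATEXT and TEI: level of support?</a></td></tr>
</table>
</body></html>"""


def message_links(xhtml):
    """Return (href, link text) pairs for links inside subject cells."""
    root = ET.fromstring(xhtml)
    links = []
    # ElementTree path syntax: descendant td with a given class, then child a.
    for a in root.findall(".//td[@class='subject']/a"):
        links.append((a.get("href"), "".join(a.itertext()).strip()))
    return links


LINKS = message_links(PAGE)
print(LINKS)
```

This only works on well-formed XHTML; real-world HTML would first need a tolerant parser.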
<?xml version="1.0" encoding="utf-8"?>
<emails>
<email>
<emailId>ef5ba1b</emailId>
<date>Tue, 5 Sep 1995 10:39:24 CDT</date>
<reply-to>Text Encoding Initiative public discussion list
&lt;TEI-L@UICVM.UIC.EDU&gt;</reply-to>
<from>Arjan.Loeffen@LET.RUU.NL</from>
<subject>DYNATEXT and TEI: level of support?</subject>
<body>Dear reader,
Peter Robinson writes in 'The transcription...' [...]</body>
</email>
<email>
<emailId>cd018892</emailId>
<date>Wed, 6 Sep 1995 11:04:31 CDT</date>
<reply-to>Text Encoding Initiative public discussion list
&lt;TEI-L@UICVM.UIC.EDU&gt;</reply-to>
<from>"Steven J. DeRose" &lt;sjd@ebt.com&gt;</from>
<subject>Re: DYNATEXT and TEI: level of support?</subject>
<body>Peter Robinson writes in 'The transcription...' [...]
Caveat for anyone new to the list: I *am* connected to DynaText, [...]
</body>
</email>
</emails>
This could work well:
We could convert the elements into TEI and model with CMC and Correspondence encoding
TEI-L web archive delivers the distinct identifier of the message in its URL: https://lists.psu.edu/cgi-bin/wa?A2=TEI-L;cd018892.9509
It most likely automatically converts base64 content
Esto podría funcionar bien:
Podríamos convertir los elementos a TEI y modelar con los módulos de CMC y Correspondencia
El archivo web TEI-L proporciona el identificador único del mensaje en su URL: https://lists.psu.edu/cgi-bin/wa?A2=TEI-L;cd018892.9509
Lo más probable es que convierta automáticamente el contenido base64
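A rough sketch of what converting one such `<email>` element into a TEI `<post>` might look like. This is not the authors' pipeline: the `msg-` prefix for `xml:id`, the use of `@when` for the normalized date, and splitting paragraphs on blank lines are all assumptions made for illustration. The Subject line is deliberately left out, since where to put it is an open question.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# One message in the intermediate <emails> structure (fields from the sample).
SAMPLE = """<emails>
  <email>
    <emailId>ef5ba1b</emailId>
    <date>Tue, 5 Sep 1995 10:39:24 CDT</date>
    <from>Arjan.Loeffen@LET.RUU.NL</from>
    <subject>DYNATEXT and TEI: level of support?</subject>
    <body>Dear reader,
Peter Robinson writes in 'The transcription...' [...]</body>
  </email>
</emails>"""


def email_to_post(email_el):
    """Build a TEI <post> from one <email> element (hypothetical mapping)."""
    post = ET.Element(f"{{{TEI_NS}}}post")
    # An xml:id cannot start with a digit, so a prefix is added (assumption).
    post.set(f"{{{XML_NS}}}id", "msg-" + email_el.findtext("emailId"))
    # Normalize the RFC 822 style date to ISO 8601.
    when = parsedate_to_datetime(email_el.findtext("date"))
    post.set("when", when.isoformat())
    # Treat blank-line-separated chunks of the body as paragraphs.
    for para in email_el.findtext("body").split("\n\n"):
        p = ET.SubElement(post, f"{{{TEI_NS}}}p")
        p.text = " ".join(para.split())
    return post


ET.register_namespace("", TEI_NS)
POSTS = [email_to_post(e) for e in ET.fromstring(SAMPLE).iter("email")]
print(ET.tostring(POSTS[0], encoding="unicode"))
```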
Problems: Dense scripting on TEI-L archive webpages!
Sender name / email information is set in the same HTML element content, and not reliably separated
Successfully extracted URLs for messages from the TEI-L web archive, but parsing the messages failed. (Abandoned for the conference due to time constraints, but we'd like to return to it...)
Problemas: ¡Secuencias de comandos densas en las páginas web de archivos TEI-L!
El nombre del remitente y la información de correo electrónico se establecen en el mismo contenido del elemento HTML y no siempre están separados
Se extrajeron correctamente las URL de los mensajes del archivo web de TEI-L, pero falló el análisis de los mensajes. (Abandonado para la conferencia por falta de tiempo, pero nos gustaría volver al tema...)
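Once a raw sender string has been extracted, separating the display name from the address is the easier part. As a minimal sketch (not what was attempted for the web archive, whose pages may present senders differently), Python's standard `email.utils.parseaddr` handles both sender forms seen in the sample data:

```python
from email.utils import parseaddr

# The two sender forms that appear in the sample messages.
SENDERS = [
    '"Steven J. DeRose" <sjd@ebt.com>',
    "Arjan.Loeffen@LET.RUU.NL",
]

# parseaddr returns a (display name, address) pair; the name is an empty
# string when the sender gave only a bare address.
PARSED = [parseaddr(s) for s in SENDERS]

for name, address in PARSED:
    print(repr(name), address)
```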
"Scraping" extracts data from a published web page, digging into a labyrinth of code.
Accessing log files is much simpler and more direct.
We sent a series of "GET" commands by e-mail to the LISTSERV server, and it responded with monthly log files.
While Syd worked with all 412 log files, Elisa just pulled some "core samples" from early in the archive to study.
"Scraping" extrae datos de una página web publicada, entrando en un laberinto de código.
Acceder a los archivos de registro es mucho más sencillo y directo.
Se envió una serie de comandos "GET" por correo electrónico al servidor de listas, y este respondió con archivos de registro mensuales.
Mientras Syd trabajaba con los 412 archivos de registro, Elisa simplemente extrajo algunas "muestras de núcleo" del comienzo del archivo para estudiarlas.
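A minimal sketch of why the log files are the more direct route: a monthly log can be split into messages and each message handed to a standard mail parser. It assumes that messages in a log file are separated by a line made up only of "=" characters, and the header layout of the sample log is reconstructed from the fields shown earlier, not copied from a real LOGYYMM file.

```python
import re
from email import message_from_string

# Assumed separator: a line consisting only of "=" characters.
SEPARATOR = re.compile(r"^={20,}[ \t]*$", re.MULTILINE)

# Reconstructed sample; the real log layout may differ in detail.
LOG = """=========================================================================
Date:         Tue, 5 Sep 1995 10:39:24 CDT
Reply-To:     Text Encoding Initiative public discussion list <TEI-L@UICVM.UIC.EDU>
From:         Arjan.Loeffen@LET.RUU.NL
Subject:      DYNATEXT and TEI: level of support?

Dear reader,
Peter Robinson writes in 'The transcription...' [...]
=========================================================================
Date:         Wed, 6 Sep 1995 11:04:31 CDT
From:         "Steven J. DeRose" <sjd@ebt.com>
Subject:      Re: DYNATEXT and TEI: level of support?

Peter Robinson writes in 'The transcription...' [...]
"""


def split_log(text):
    """Split a monthly log into parsed e-mail messages."""
    chunks = [c.lstrip("\n") for c in SEPARATOR.split(text)]
    return [message_from_string(c) for c in chunks if c.strip()]


MESSAGES = split_log(LOG)
for msg in MESSAGES:
    print(msg["Date"], "|", msg["Subject"])
```

A message body that itself contains a line of "=" characters would be split in the wrong place, so a real conversion would need a stricter test, for example checking that a header line follows the separator.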