LISTSERV ⇒ TEI

Elisa Beshero-Bondar

Penn State Erie

Syd Bauman

Northeastern University

TEI 2024 Conference in Buenos Aires

Thu 10 Oct 2024 15:30

Introduction:

Why we’re doing this

Introducción:

Por qué lo hacemos

  • New TEI module for computer mediated communication

  • TEI has significant sets of data from computer communication platforms

  • Nuevo módulo TEI para comunicación mediada por computadora

  • TEI cuenta con importantes conjuntos de datos provenientes de plataformas de comunicación informática

 

  • CMC not intended specifically for e-mail conversations

  • but neither is <correspDesc>

  • “We choose to do [these] things, not because they are easy, but because they are hard; because that goal will serve to organize and measure the best of our energies and skills, because that challenge is one that we are willing to accept” — John F. Kennedy, 1962-09-12

 

  • CMC no está destinada específicamente para conversaciones por correo electrónico

  • pero tampoco lo está <correspDesc>

  • Escogimos hacer estas cosas, no porque sean fáciles, sino porque son difíciles, porque ese objetivo servirá para organizar y medir lo mejor de nuestras energías y habilidades, porque ese desafío es uno que estamos dispuestos a aceptar, John F. Kennedy, 1962-09-12

Las listas de correo electrónico son un desafío

E-mail lists are challenging:

  • We chose to start with TEI-L

    • large, interesting, important historical dataset

    • LISTSERV provides easy access to the data in raw form

    • we had recently been working with it (to move from Brown to PSU)

  • Decidimos empezar con TEI-L

    • conjunto de datos históricos grande, interesante e importante

    • LISTSERV proporciona acceso fácil a los datos en su forma bruta

    • hemos estado trabajando con él recientemente (para transferirlo de Brown a PSU)

  • Initial thought:

    • converting data to XML will be easy

    • deciding exactly how to encode in TEI will be interesting

  • Turns out both parts were problematic

  • Pensamiento inicial:

    • convertir los datos a XML será fácil

    • decidir exactamente cómo codificar en TEI será interesante

  • Resulta que las dos partes han sido problemáticas

Syd — preparing the source

Syd – preparando la fuente

  • Goal — convert all of TEI-L up to 2024-04-29

  • Obtained data directly from Brown listmaster

  • 412 separate files, one for each month, i.e.
    tei-l.log9001 through tei-l.log2404

  • Renamed files to use 4-digit year so they sorted into right order

  • Meta — convertir toda la TEI-L hasta 2024-04-29

  • Obtuvimos los datos directamente de la listmaster de Brown

  • 412 archivos individuales, uno para cada mes, i.e. de
    tei-l.log9001 a tei-l.log2404

  • Renombramos los archivos usando cuatro dígitos para el año, de manera de ordenarlos correctamente
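
The renaming step can be sketched in Python. This is only an illustration, not the script actually used: the function name `four_digit_year`, the pivot of 90 (two-digit years 90 to 99 are read as the 1990s, everything else as the 2000s), and the output pattern `tei-l.logYYYYMM` are all assumptions.

```python
import re

def four_digit_year(name, pivot=90):
    """Expand the two-digit year in a monthly log name (hypothetical helper)."""
    m = re.fullmatch(r"(tei-l\.log)(\d{2})(\d{2})", name)
    if m is None:
        raise ValueError(f"unexpected log file name: {name!r}")
    prefix, yy, mm = m.groups()
    century = "19" if int(yy) >= pivot else "20"
    return f"{prefix}{century}{yy}{mm}"

# The monthly names from January 1990 through April 2024, in chronological order.
names = [
    f"tei-l.log{year % 100:02d}{month:02d}"
    for year in range(1990, 2025)
    for month in range(1, 13)
    if (year, month) <= (2024, 4)
]
renamed = [four_digit_year(n) for n in names]
```

With two-digit years a plain alphabetical sort puts `tei-l.log0001` before `tei-l.log9001`; after renaming, alphabetical order and chronological order agree.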

illegal characters (for XML)

  • Early e-mail systems used ASCII (7-bit) characters only

  • Of those 128 available characters, only 99 of them are legal XML characters

  • Of the 29 characters that are not legal in XML, 17 of them occur in the TEI-L archives

    • 64 occurrences in 34 posts before 2000 (not surprising)

    • 38 occurrences in 11 posts after 2008 (surprised me)

    • the last 2 of which are in a single post by David Maus on 2016-03-13

      • He was copying what someone else wrote

Caracteres ilegales (en XML)

  • Los primeros sistemas de e-mail usaban solamente caracteres ASCII (7 bits)

  • De los 128 caracteres disponibles, sólo 99 son caracteres legales en XML

  • De los 29 caracteres no permitidos en XML, se encuentran 17 en los archivos TEI-L

    • 64 instancias en 34 posts antes de 2000 (no sorprendente)

    • 38 instancias en 11 posts después de 2008 (sí me sorprendió)

    • Los dos últimos ejemplares se encuentran en un solo post de David Maus 2016-03-13

      • Él copió lo que otra persona escribió
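
The counts above follow from the XML 1.0 character rules: of the control characters below U+0020, only TAB, LF, and CR are allowed. A minimal Python sketch of the check follows; the names `ILLEGAL_XML_ASCII` and `find_illegal` are made up for illustration and are not taken from the conversion code.

```python
import re

# XML 1.0 permits only TAB (0x09), LF (0x0A), and CR (0x0D) among the
# characters below 0x20; every other one is illegal in an XML document.
ILLEGAL_XML_ASCII = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")

def find_illegal(text):
    """Return (offset, code point) for each XML-illegal ASCII character."""
    return [(m.start(), ord(m.group())) for m in ILLEGAL_XML_ASCII.finditer(text)]

# Partition the 128 ASCII code points into legal and illegal ones.
legal = [cp for cp in range(128) if not ILLEGAL_XML_ASCII.match(chr(cp))]
illegal = [cp for cp in range(128) if ILLEGAL_XML_ASCII.match(chr(cp))]
```

Running `find_illegal` over each post gives the offsets needed to inspect every occurrence by hand.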

possible solutions

  • These are not encoding errors (each is a valid ASCII character that XML simply forbids), so neither iconv -c nor oXygen’s “Encoding errors handling” feature helps

  1. Delete them, leaving users to figure out missing characters

  2. Replace them with another character(s), thus marking the location of the problematic character

  3. For each case try to figure out what character the author intended, and replace with the proper UTF-8 character

    • possibly preserving information about the original encoding

soluciones posibles

  • No son caracteres inválidos, por lo tanto iconv -c no es útil ni tampoco lo es la función de oXygen “Encoding errors handling”

  1. Borrarlos y permitir a los usuarios averiguar los caracteres que faltan

  2. Reemplazarlos con otro carácter(es), marcando la ubicación del carácter problemático

  3. En cada caso, tratar de averiguar cuál era el carácter deseado por el autor, y reemplazarlo por el carácter UTF-8 indicado

    • Posiblemente preservando información sobre la codificación original
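
Options 1 and 2 are mechanical and can be sketched in a few lines of Python. The choice of the Unicode "control pictures" (U+2400 to U+241F) as the marker is an assumption made here for illustration, and the function names are hypothetical. Option 3 needs human judgment case by case and is not sketched.

```python
import re

# Characters below 0x20 that XML 1.0 forbids (everything except TAB, LF, CR).
ILLEGAL_XML_ASCII = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")

def delete_illegal(text):
    """Option 1: drop the characters, leaving no trace of where they were."""
    return ILLEGAL_XML_ASCII.sub("", text)

def mark_illegal(text):
    """Option 2: replace each character with a visible stand-in.

    U+2400..U+241F are the Unicode control pictures for U+0000..U+001F,
    so the original code point stays recoverable from the marker.
    """
    return ILLEGAL_XML_ASCII.sub(lambda m: chr(0x2400 + ord(m.group())), text)
```

Marking rather than deleting keeps the location and identity of each problem character, which helps if option 3 is attempted later.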

Elisa — extracting from the source

Elisa — extrayendo desde la fuente

Elisa's objective:

  • Take "core samples" (all the messages in a short range of months or years) from the TEI-L to figure out how to model them in TEI
     

  • "Scrape" these with Python's Scrapy / lxml etree from the TEI-L web archive, output the email posts and metadata in XML
     
  • Successfully extracted URLs for messages from the TEI-L web archive, but parsing the messages failed. (Abandoned for the conference due to time constraints, but we'd like to return to it. . .)

La meta de Elisa

  • Tomar "muestras centrales" de la TEI-L para descubrir cómo modelarlas en TEI

  • Extraer ("scrape") estos mensajes del archivo web de TEI-L con Scrapy / lxml etree de Python y generar las publicaciones de correo electrónico y los metadatos en XML

  • Se extrajeron correctamente las URL de los mensajes del archivo web de TEI-L, pero falló el análisis de los mensajes. (Abandonado para la conferencia por falta de tiempo, pero nos gustaría volver al tema...)

 

Some Complexities

Algunas complejidades

ASCII Workaround: Encoded-word syntax

  • Commonly used in header field contents

  • =?charset?encoding?text?=

  • E.g., =?ISO-8859-2?Q?Piotr_Ba=F1ski?=
        =?UTF-8?Q?Piotr_Ba=C5=84ski?=

  • Leave as-is, or convert to actual characters?
    Piotr Bański

ASCII solución alternativa: Sintaxis de la palabra codificada

  • usada con frecuencia en el contenido de los campos de cabecera

  • =?charset?encoding?text?=

  • P. ej., =?ISO-8859-2?Q?Piotr_Ba=F1ski?=
        =?UTF-8?Q?Piotr_Ba=C5=84ski?=

  • ¿dejarlo como está, o convertirlo a caracteres de verdad?
    Piotr Bański
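
If the choice is to convert to actual characters, Python's standard `email.header` module can decode encoded-word syntax. The sketch below is one possible approach, not necessarily what the conversion pipeline does; the wrapper name `decode_encoded_words` is invented for the example.

```python
from email.header import decode_header, make_header

def decode_encoded_words(value):
    """Turn RFC 2047 encoded-words in a header value into ordinary text."""
    # decode_header() splits the value into (bytes, charset) chunks;
    # make_header() reassembles them, and str() yields the decoded string.
    return str(make_header(decode_header(value)))

latin2 = decode_encoded_words("=?ISO-8859-2?Q?Piotr_Ba=F1ski?=")
utf8 = decode_encoded_words("=?UTF-8?Q?Piotr_Ba=C5=84ski?=")
```

Both inputs from the slide decode to the same name, which shows that the declared charset, not the byte sequence, determines the result.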

MIME = Major Incoherent Mess in Email

MIME = Mega Incoherente Maraña de Email

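  • Multipurpose Internet Mail Extensions: a standard that allows

    • More than just 7-bit ASCII characters

    • HTML

    • Attachments

  • Defined by RFC 2045, RFC 2046, RFC 2047, RFC 4288, RFC 4289, and RFC 2049. Yes, it is very complicated.

  • Multipurpose Internet Mail Extensions: estándar para permitir

    • Más que solo los caracteres ASCII de 7 bits

    • HTML

    • Archivos adjuntos

  • Definido por RFC 2045, RFC 2046, RFC 2047, RFC 4288, RFC 4289 y RFC 2049. Sí, es muy complicado.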

MIME types & character set

  • Often involves repeating content of mail in two (or more) parts:

    • common: text/plain, text/html

    • rare: text/enriched, text/x-vcard, text/xml, application/*

  • Each part (regardless of type) is also encoded using a declared character set

    • common: UTF-8, ISO-8859-*, US-ASCII, Windows-1252

    • rare: macintosh, big5, Windows-1255, euc-kr, unknown

tipos de MIME y conjunto de caracteres

  • Involucra con frecuencia la repetición del contenido del correo en dos partes (o más):

    • habitual: text/plain, text/html

    • poco frecuente: text/enriched, text/x-vcard, text/xml, application/*

  • Cada parte (independiente del tipo) también es codificada usando un conjunto de caracteres declarado

    • habitual: UTF-8, ISO-8859-*, US-ASCII, Windows-1252

    • poco frecuente: macintosh, big5, Windows-1255, euc-kr, unknown

MIME boundary example

Ejemplo de límite MIME

Because I've read it (I think in the late eigthies) in the=20
"Gentle Introduction". And the phrase is already there
in the World Wide Web as we see by google-ing=20
"making explicit what is conjectural or implicit"
with a lot of derived occurences.

Ask some elder man from the TEI board
(or encoding philosopher Lachance)
if my understanding of TEIs approach to encoding
is so odd as my Englisch ;-)

Best regards,
Herbert



--part1_97.5661ed07.2f108bb9_boundary
Content-Type: text/html; charset="ISO-8859-1"
Content-Transfer-Encoding: quoted-printable

<HTML><FONT FACE=3Darial,helvetica><HTML><FONT  SIZE=3D3 PTSIZE=3D12 FAMILY=
=3D"SANSSERIF" FACE=3D"Arial" LANG=3D"2">In einer eMail vom 07.01.05 11:30:2=
0 (MEZ) Mitteleurop=E4ische Zeit schreibt sebastian.rahtz@COMPUTING-SERVICES=
.OXFORD.AC.UK:<BR>
<BR>
</FONT><FONT  COLOR=3D"#000000" BACK=3D"#ffffff" style=3D"BACKGROUND-COLOR:=20=
#ffffff" SIZE=3D2 PTSIZE=3D10 FAMILY=3D"SANSSERIF" FACE=3D"Arial" LANG=3D"2"=
><BR>
<BLOCKQUOTE TYPE=3DCITE style=3D"BORDER-LEFT: #0000ff 2px solid; MARGIN-LEFT=
: 5px; MARGIN-RIGHT: 0px; PADDING-LEFT: 5px">Thats an odd assertion. Why do=20=
you think the TEI mandates<BR>
"Make explicit the implicit" ?<BR>
</BLOCKQUOTE><BR>
</FONT><FONT  COLOR=3D"#000000" BACK=3D"#ffffff" style=3D"BACKGROUND-COLOR:=20=
#ffffff" SIZE=3D3 PTSIZE=3D12 FAMILY=3D"SANSSERIF" FACE=3D"Arial" LANG=3D"2"=
><BR>
Because I've read it (I think in the late eigthies) in the <BR>
"Gentle Introduction". And the phrase is already there<BR>
in the World Wide Web as we see by google-ing <BR>
"</FONT><FONT  COLOR=3D"#000000" BACK=3D"#ffffff" style=3D"BACKGROUND-COLOR:=
 #ffffff" SIZE=3D2 PTSIZE=3D10 FAMILY=3D"SANSSERIF" FACE=3D"Arial" LANG=3D"2=
">making explicit what is conjectural or implicit</FONT><FONT  COLOR=3D"#000=
000" BACK=3D"#ffffff" style=3D"BACKGROUND-COLOR: #ffffff" SIZE=3D3 PTSIZE=
=3D12 FAMILY=3D"SANSSERIF" FACE=3D"Arial" LANG=3D"2">"<BR>
with a lot of derived occurences.<BR>
<BR>
Ask some elder man from the TEI board<BR>
(or encoding philosopher Lachance)<BR>
if my understanding of TEIs approach to encoding<BR>
is so odd as my Englisch ;-)<BR>
<BR>
Best regards,<BR>
Herbert<BR>
<BR>
<BR>
</FONT></HTML>
--part1_97.5661ed07.2f108bb9_boundary--

  • Email servers, etc., use at most 8 bits per character (i.e., 256 possible values)

  • Thus every character set that uses any of the other 154,807 characters must be down-translated into only those 256

    • quoted-printable

    • base64

  • XPath has no built-in functions to translate back (EXPath covers base64)

MIME character encoding

  • Los servidores de email, etc., usan como mucho 8 bits por carácter (es decir, 256 caracteres)

  • Así que cada conjunto de caracteres que usa cualquiera de los otros 154,807 tiene que convertirse a esos 256

    • quoted-printable

    • base64

  • No hay funciones XPath para revertir (EXPath para base64)
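
Outside XPath, reversing both transfer encodings is straightforward. The following Python sketch uses only the standard library; `decode_part` is a hypothetical helper, and the two sample inputs are fragments of the messages shown in this deck.

```python
import base64
import quopri

def decode_part(payload, transfer_encoding, charset):
    """Undo the transfer encoding, then interpret the bytes in the charset."""
    encoding = transfer_encoding.lower()
    if encoding == "quoted-printable":
        raw = quopri.decodestring(payload)
    elif encoding == "base64":
        raw = base64.b64decode(payload)
    else:
        # 7bit, 8bit, binary: the bytes are already the content.
        raw = payload
    return raw.decode(charset)

qp_text = decode_part(b"Mitteleurop=E4ische Zeit", "quoted-printable", "ISO-8859-1")
b64_text = decode_part(b"Q2lhcsOhbuKAmXMgb2JzZXJ2YXRpb24g", "base64", "UTF-8")
```

Decoding is a two-step process: first the transfer encoding (bytes back out of ASCII), then the character set (bytes to characters). Getting either declaration wrong yields garbled text.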

codificación MIME

Q2lhcsOhbuKAmXMgb2JzZXJ2YXRpb24gZG9lcyBub3Qgc3F1YXJlIHdpdGggb3VyIGV4cGVyaWVu
Y2UgaW4gdGhlIEVhcmx5UHJpbnQgcHJvamVjdC4gIENvbnNpZGVyIOKAmGhhbmRzb21lLCBjbGV2
ZXIsIGFuZCByaWNo4oCZIGZyb20gdGhlIG9wZW5pbmcgc2VudGVuY2Ugb2YgRW1tYS4gVGhlcmUg
bWF5IGJlIG9jY2FzaW9ucyB3aGVyZSB5b3Ugd2FudCB0byBpZGVudGlmeSBwaHJhc2VzIGxpa2Ug
dGhhdCBpbiBzb21lIG90aGVyIGNvcnB1cy4NCg0KV2VsbCwgZ28gdG8gaHR0cDovL2JsYWNrbGFi
LmVhcmx5cHJpbnQub3JnL2NvcnB1c3NlYXJjaC8gYW5kIGVudGVyIHRoZSBzZWFyY2ggdGVybQ0K
DQpbcG9zPSJqIl1bcG9zPSJqIl1bImFuZCJdW3Bvcz0iaiJdDQoNCldpdGhpbiBzZWNvbmRzIGl0
IHdpbGwgcmV0cmlldmUgOCwyOTEgbWF0Y2hlcyBmcm9tIHRleHRzIGJldHdlZW4gMTY0MCBhbmQg
MTY2MCwgbXkgcGVyc29uYWwgZmF2b3VyaXRlIGJlaW5nIOKAnHRoZSBTY290dGlzaCBncm93ZXMg
ZHVsbGUsIEZyb3N0aWUsIGFuZCB3YXl3YXJkLuKAnQ0KDQpJIGFtIHRvbGQgYnkgUGhpbCBCdXJu
cywgd2hvIGtub3dzIGEgbG90IGFib3V0IHRoZXNlIHRoaW5ncywgdGhhdCB0aGUgQmxhY2tsYWIg
c2VhcmNoIGVuZ2luZSBpcyByZWxhdGl2ZWx5IGVhc3kgdG8gaW5zdGFsbC4gSXQgYWxzbyBzdXBw
b3J0cyBpbmNyZW1lbnRhbCBpbmRleGluZywgd2hpY2ggaXMgYSBiaWcgaGVscC4gIFRoZSBjdXJy
ZW50IHVzZXIgaW50ZXJmYWNlIGlzIHZlcnkgU3BhcnRhbiwgYW5kIGEgdXNlciBoYXMgdG8ga25v
dyB0aGUgdGFnIHNldCBvbiB3aGljaCB0aGUgc2VhcmNoZXMgYXJlIGJhc2VkLiBCbGFja2xhYiBp
cyBlbGVtZW50IGF3YXJlIGluIHNpbXBsZSB3YXlzIHRoYXQgd2lsbCBzdXBwb3J0IG1hbnkgb2Yg
dGhlIHVzZXMgdGhhdCBjb21lIHVwIGluIGxpdGVyYXJ5IHNjaG9sYXJzaGlwLiBGb3IgaW5zdGFu
Y2UsIHlvdSBjYW4gbG9vayBmb3IgYWRqZWN0aXZlcyBiZWZvcmUg4oCYbGliZXJ0eeKAmSBpbiBw
b2V0cnkuIEFuZCBzbyBvbi4NCg0KDQoNCkZyb206ICJURUkgKFRleHQgRW5jb2RpbmcgSW5pdGlh
dGl2ZSkgcHVibGljIGRpc2N1c3Npb24gbGlzdCIgPFRFSS1MQExJU1RTRVJWLkJST1dOLkVEVT4g
b24gYmVoYWxmIG9mIFNlcmdlIEhlaWRlbiA8c2xoQEVOUy1MWU9OLkZSPg0KT3JnYW5pemF0aW9u
OiBFTlMgZGUgTHlvbg0KUmVwbHktVG86IFNlcmdlIEhlaWRlbiA8c2xoQEVOUy1MWU9OLkZSPg0K
RGF0ZTogVHVlc2RheSwgQXByaWwgMywgMjAxOCBhdCA4OjMwIEFNDQpUbzogIlRFSSAoVGV4dCBF
bmNvZGluZyBJbml0aWF0aXZlKSBwdWJsaWMgZGlzY3Vzc2lvbiBsaXN0IiA8VEVJLUxATElTVFNF
UlYuQlJPV04uRURVPg0KU3ViamVjdDogUmU6IDxjPiB0YWcNCg0KSGkgQ2lhcsOhbiwNCg0KTGUg
MjkvMDMvMjAxOCDDoCAwMDo1MCwgQ2lhcsOhbiDDkyBEdWliaMOtbiBhIMOpY3JpdCA6DQoNCi4u
Lg0KT3ZlcmFsbCBteSBjb25jbHVzaW9uIGlzIHRoYXQgdGhlcmUgaXMgbGl0dGxlIHBvaW50IGlu
IGNvbnZlcnRpbmcgdG8gVEVJIGEgY29ycHVzIG9mIHRleHRzIGludGVuZGVkIGZvciBpbmRleGlu
Zy9yZXRyaWV2YWwsIGFzIGl0IGRvZXMgbm90IG1lYW4gdGhleSBjYW4gYmUgZWFzaWx5IHVzZWQg
d2l0aCBtb3JlIGFwcGxpY2F0aW9ucyBhbmQgb24gbW9yZSBwbGF0Zm9ybXMuICBJZiBYYWlyYSBo
YWQgY29udGludWVkIHRvIGJlIGRldmVsb3BlZCwgdGhpcyBtaWdodCBoYXZlIGJlZW4gZGlmZmVy
ZW50Lg0KLi4uDQoNClRoYW5rIHlvdSBmb3IgdGhlIHJlcG9ydCBvbiB0aGUgYXBwbGljYXRpb25z
Lg0KDQpXaGF0IHdvdWxkIGhlbHAgYSBsb3Qgd291bGQgYmUgdG8gbGlzdCBleHBsaWNpdGx5IHNv
bWUgc2VydmljZXMgb3IgZmVhdHVyZXMgb2YgWGFpcmEgdXNlZnVsIG9yIG5lY2Vzc2FyeSBmb3Ig
eW91IHRoYXQgYXJlIG5vdCBmb3VuZCBpbiB0aGUgc29mdHdhcmUgZGlzY3Vzc2VkLiBTb21laG93
IHRoZSByZWxldmFudCBmZWF0dXJlcyBvZiBYTUwgZWRpdG9ycyBmb3IgdGVhY2hpbmcgaGF2ZSBi
ZWVuIGRpc2N1c3NlZCBhbmQgc3ludGhlc2l6ZWQgaGVyZTogaHR0cHM6Ly93aWtpLnRlaS1jLm9y
Zy9pbmRleC5waHAvRWRpdG9yX2Zvcl90ZWFjaGluZ19URUlfLV9mZWF0dXJlczxodHRwczovL3Vy
bGRlZmVuc2UucHJvb2Zwb2ludC5jb20vdjIvdXJsP3U9aHR0cHMtM0FfX3dpa2kudGVpLTJEYy5v
cmdfaW5kZXgucGhwX0VkaXRvci01RmZvci01RnRlYWNoaW5nLTVGVEVJLTVGLTJELTVGZmVhdHVy
ZXMmZD1Ed01GYVEmYz15SGxTMDRIaEJyYWVzNUJROXVldTV6S2hFN3J0Tlh0X2QwMTJ6MlBBNndz
JnI9ckc4enhPZHNzcVN6RFJ6NHgxR0xsbUxPVzYweHlWWHlkeHduSlpwa3hiayZtPW94bFkwVFI1
LTRVWnVaY0F1b0FzWnh1bnNwM1JXNEl5RDhqazBhdmlYVGMmcz1FRkJhdUlnQmZObkNqU2ZINHpp
NzZxbC1Lc1NFR1dVc05hQk5uemphdHpFJmU9Pg0KDQpCZXN0LA0KU2VyZ2UNCg0KLS0NCg0KRHIu
IFNlcmdlIEhlaWRlbiwgc2xoIEFUIGVucy1seW9uLmZyLCBodHRwOi8vdGV4dG9tZXRyaWUuZW5z
LWx5b24uZnI8aHR0cHM6Ly91cmxkZWZlbnNlLnByb29mcG9pbnQuY29tL3YyL3VybD91PWh0dHAt
M0FfX3RleHRvbWV0cmllLmVucy0yRGx5b24uZnImZD1Ed01GYVEmYz15SGxTMDRIaEJyYWVzNUJR
OXVldTV6S2hFN3J0Tlh0X2QwMTJ6MlBBNndzJnI9ckc4enhPZHNzcVN6RFJ6NHgxR0xsbUxPVzYw
eHlWWHlkeHduSlpwa3hiayZtPW94bFkwVFI1LTRVWnVaY0F1b0FzWnh1bnNwM1JXNEl5RDhqazBh
dmlYVGMmcz1PRWRCQWsta05GZEJlRkFaUUY3Qk5MVG9oc0x1Q3JEU09PUG1zSkU3WVNVJmU9Pg0K
DQrDiXF1aXBlIGRlIHJlY2hlcmNoZSBDYWN0dXMsIGxhYm9yYXRvaXJlIElIUklNIFVNUjUzMTcs
IEVOUyBkZSBMeW9uDQoNCjE1LCBwYXJ2aXMgUmVuw6kgRGVzY2FydGVzIDY5MzQyIEx5b24gQlA3
MDAwIENlZGV4LCB0w6lsLiArMzMoMCk2MjIwMDM4ODMNCg==

Base64 Example

Ejemplo de Base64

MIME parts

  • It is common for the same content to be presented in multiple content types

    • text/plain

    • text/html (what to do?)

    • rarely (19 of 25,415) other content types

  • Regardless of content type, it may be encoded as 7-bit, 8-bit, quoted-printable, or base64 in any of various character encodings

partes de MIME

  • Frecuentemente, el mismo contenido es presentado en múltiples tipos de contenido

    • text/plain

    • text/html (¿qué hacer?)

    • rara vez (19 de 25,415) otros tipos de contenido

  • Con independencia del tipo de contenido, se puede codificar como 7-bit, 8-bit, quoted-printable, o base64 en cualquiera de los varios tipos de codificación de caracteres
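
One way to handle repeated content is to prefer the text/plain part and ignore the text/html copy. The Python sketch below shows that policy with the standard `email` package. The message in `RAW` is a shortened, made-up example (the address and boundary string are invented), and preferring plain text is an assumption for illustration, not a decision recorded in these slides.

```python
from email import message_from_string, policy

# A tiny multipart/alternative message carrying the same text twice.
RAW = (
    "From: someone@example.org\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="b1"\n'
    "\n"
    "--b1\n"
    'Content-Type: text/plain; charset="ISO-8859-1"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Mitteleurop=E4ische Zeit\n"
    "--b1\n"
    'Content-Type: text/html; charset="ISO-8859-1"\n'
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "<HTML>Mitteleurop=E4ische Zeit<BR></HTML>\n"
    "--b1--\n"
)

def content_types(raw):
    """List the content type of the container and of every part inside it."""
    msg = message_from_string(raw, policy=policy.default)
    return [part.get_content_type() for part in msg.walk()]

def plain_text_body(raw):
    """Return the decoded text/plain part, or None if there is none."""
    msg = message_from_string(raw, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    return None if part is None else part.get_content()

types = content_types(RAW)
body = plain_text_body(RAW)
```

`get_content()` undoes both the transfer encoding and the declared character set, so the caller receives ordinary text.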

Mappings to TEI:

 Combining CMC and Correspondence encoding

Mapeo a TEI:
Combinando CMC y codificación de correspondencia

cuántos  campo                      descripción
 30,995  Date                       fecha
 30,995  From                       de
 30,995  Reply-To                   responder a
 30,995  Sender                     remitente
 30,883  Subject                    asunto (o tema)
 28,687  MIME-Version               versión MIME*
 28,677  Content-Type               tipo de contenido                 
 19,896  In-Reply-To                en respuesta a                  
 19,568  Content-Transfer-Encoding  codificación de transferencia de contenido
 14,270  Comments                   comentarios
 12,757  Message-ID                 identificador de mensaje
  3,285  Organization               organización
    591  Content-Disposition        disposición del contenido         
     22  X-cc                       copia al carbón (no estándar)

 * extensiones de correo de internet multipropósito

campos de encabezado

metadata mapping

mapeo de metadatos

 Date         ⇒ post/@when
 From         ⇒ post/@who
 Reply-To     ⇒ correspDesc/correspContext/ptr[@type="reply-to"]
 Sender       ⇒ correspDesc/correspAction[@type="relayed"]/email
 Subject      ⇒ post/head
 In-Reply-To  ⇒ correspDesc/correspContext/ref[@type="in-reply-to"]
 Comments [1] ⇒ correspDesc/correspAction[@type="orig-(to|cc)"]/email
 Message-ID   ⇒ idno[@type="message-id"]
 Organization ⇒ correspAction[@type="sent"]/orgName
 X-cc [1]     ⇒ correspDesc/correspAction[@type="orig-cc"]/email

[1, es] El campo X-cc: y la gran mayoría de los campos Comentarios:
        dan el valor del campo To: o CC: original.

[1, en] The X-cc: field and the vast majority of the Comments:
        fields give the original To: or CC: field value.
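
Two of these mappings need a value conversion, not just a change of container: the Date: header must become an ISO 8601 value for `@when`, and the e-mail address must be pulled out of the From: header. A Python sketch with the standard library follows; the function names are hypothetical, and the actual conversion reported here is done in XSLT, so this only illustrates the transformation.

```python
from email.utils import parseaddr, parsedate_to_datetime

def header_to_when(date_header):
    """Convert an RFC 822 style Date: value to an ISO 8601 string for @when."""
    # parsedate_to_datetime() understands two-digit years and the
    # North American zone abbreviations (such as CST) allowed by RFC 822.
    return parsedate_to_datetime(date_header).isoformat()

def header_to_address(from_header):
    """Extract just the e-mail address from a From: value."""
    return parseaddr(from_header)[1]

when = header_to_when("Tue, 6 Feb 90 10:54:04 CST")
address = header_to_address(
    "Michael Sperberg-McQueen 312 996-2477 -2981 <U35395@UICVM.BITNET>"
)
```

The resulting `when` value matches the `@when` attribute in the TEI example that follows.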

Ejemplo de TEI

TEI Example

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:tmp="http://www.wwp.neu.edu/temp/ns"
     xmlns:xi="http://www.w3.org/2001/XInclude"
     n="00005"
     xml:id="TEI-L.txt_msg_00005">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>TEI-L posting </title>
         </titleStmt>
         <publicationStmt>
            <ab>Currently an experimental document not intended
            for publication. That said, the original source was
            published as part of a public mailing list, so this
            document similarly is publicly available.</ab>
         </publicationStmt>
         <sourceDesc>
            <ab type="desc">The source text file was (or at least should have been)
            a log file from a LISTSERV mailing list.</ab>
            <ab type="filepath">/tmp/LISTSERV_to_TEI/TEI-L.txt</ab>
         </sourceDesc>
      </fileDesc>
      <encodingDesc>
         <appInfo>
            <application ident="listserv_log2cmc.xslt" version="0.1">
               <desc>This file generated 2024-09-29T13:53:12.055127649-04:00 by file:/home/syd/Documents/tei-work/MM2024_Buenos_Aires/listserv_log2cmc.xslt,
              a program intended to convert LISTSERV logs (i.e., an archive of
              postings to a LISTSERV mailing list) to TEI, using /tmp/LISTSERV_to_TEI/TEI-L.txt as input.</desc>
            </application>
         </appInfo>
      </encodingDesc>
      <profileDesc>
         <correspDesc>
            <correspContext>
               <ptr type="reply-to" target="mailto:TEI-L@UICVM"/>
            </correspContext>
            <correspAction type="relayed">
               <email>TEI-L@UICVM</email>
               <date>Tue, 6 Feb 90 10:54:04 CST</date>
            </correspAction>
            <note type="Comments">"ACH / ACL / ALLC Text Encoding Initiative"</note>
         </correspDesc>
      </profileDesc>
      <xenoData>
         <tmp:Date>Tue, 6 Feb 90 10:54:04 CST</tmp:Date>
         <tmp:Reply-To>Text Encoding Initative public discussion list &lt;TEI-L@UICVM&gt;</tmp:Reply-To>
         <tmp:Sender>Text Encoding Initative public discussion list &lt;TEI-L@UICVM&gt;</tmp:Sender>
         <tmp:Comments>"ACH / ACL / ALLC Text Encoding Initiative"</tmp:Comments>
         <tmp:From>Michael Sperberg-McQueen 312 996-2477 -2981 &lt;U35395@UICVM.BITNET&gt;</tmp:From>
         <tmp:Subject>compound documents</tmp:Subject>
      </xenoData>
   </teiHeader>
   <text>
      <body>
         <post who="/tmp/LISTSERV_to_TEI/TEI-L.xml#u35395·@·uicvm.bitnet"
               when="1990-02-06T10:54:04-06:00">
            <head type="subject">compound documents</head>
          About compound documents in SGML and in the TEI.  R.P. Weber asked a
          <lb n="1"/>week ago "could someone please explain the TEI approach to compound
          <lb n="2"/>documents and images?  WIll SGML be used here, and if so, how?"
          <lb n="3"/>
            <lb n="4"/>Apologies for my delay in answering.  I was hoping one of our hypertext
          <lb n="5"/>sages might weigh in with a reply.  (But he appears to have been in the
          <lb n="6"/>Caribbean, and may not have received the query.)
          <lb n="7"/>
            <lb n="8"/>This problem has not, in fact, been discussed on this server, or as far
          <lb n="9"/>as I'm aware by the working committees.  So the formal answer is that no
          <lb n="10"/>decision has been cast in concrete yet.  Which allows me to turn the
          <lb n="11"/>tables and say "How *should* compound documents be encoded for
          <lb n="12"/>interchange?  What are the requirements?  What are the alternatives?"
          <lb n="13"/>
            <lb n="14"/>Less formally, I can offer some personal opinions, for what they are
          <lb n="15"/>worth (face value:  two cents -- 2u, for those of you on IBM mainframes
          <lb n="16"/>with real IBM terminals).
          <lb n="17"/>
            <lb n="18"/>Certainly SGML is where we should start in any search for ways of
          <lb n="19"/>handling compound objects, and I don't yet know any reason that SGML
          <lb n="20"/>won't provide a solution for the problem.  I assume there are two
          <lb n="21"/>methods of using SGML in compound documents (correct me if I'm wrong):
          <lb n="22"/>(1) use SGML to organize the compound document (i.e. have an SGML
          <lb n="23"/>document which includes text, images, sound, etc. as its components), or
          <lb n="24"/>(2) use whatever-you-like to organize the compound document as a whole,
          <lb n="25"/>and use SGML as the notation for the textual components of the compound
          <lb n="26"/>document.
          <lb n="27"/>
            <lb n="28"/>For the simple case of text-with-illustrations, SGML seems like a viable
          <lb n="29"/>encoding mechanism (for the envelope and for the text components) to me.
          <lb n="30"/>It allows you to encode the graphics however you like, declaring your
          <lb n="31"/>graphics format as a non-SGML notation and declaring the contents of
          <lb n="32"/>your graphics elements (say 'PICTURE' or 'BLORT') as being data in that
          <lb n="33"/>notation, stored either within the SGML file or externally to it.  You
          <lb n="34"/>get localization of the graphics within the text stream, integral or
          <lb n="35"/>separate storage of the graphics, and complete freedom to choose
          <lb n="36"/>whatever graphic notation you wish.
          <lb n="37"/>
            <lb n="38"/>As document encoding methods go, SGML is fairly hospitable to graphics
          <lb n="39"/>and other non-text pieces of compound objects.  Nowhere in the standard
          <lb n="40"/>does it say that the data have to be words and characters.  In fact, as
          <lb n="41"/>far as I know there is no *explicit* requirement in the standard that an
          <lb n="42"/>SGML document even has to be bytes in a computer.  (Sure, it's hard to
          <lb n="43"/>understand the standard any other way, but that's not the same as an
          <lb n="44"/>explicit requirement.)  ISO 8879 par. 6.1 note 1 says in fact "This
          <lb n="45"/>International Standard does not constrain the physical organization of
          <lb n="46"/>the document within the data stream, message handling protocol, file
          <lb n="47"/>system, etc., that contains it."  At the SGML '89 conference last
          <lb n="48"/>October in Atlanta, there was a very nice paper by Douglas MacLeod (read
          <lb n="49"/>by Yuri Rubinsky) thinking about architectural designs as SGML
          <lb n="50"/>documents, which led to a general discussion of SGML definitions for all
          <lb n="51"/>sorts of objects, including automobiles.  Although most people
          <lb n="52"/>(obviously) think of the SGML document as an electronic *description* of
          <lb n="53"/>the automobile (and the physical automobile as a side effect of
          <lb n="54"/>processing), it appears, in the light of the passage cited, hard to say
          <lb n="55"/>categorically that an automobile itself could never be parsed as an SGML
          <lb n="56"/>document.  (If you could figure out how to define the delimiters.)
          <lb n="57"/>
            <lb n="58"/>The only hitch is that the SGML standard itself (ISO 8879) does not
          <lb n="59"/>specify in any detail what the interface between SGML processors and
          <lb n="60"/>non-SGML processors must, may, or can look like -- an advantage, if you
          <lb n="61"/>will, in that it doesn't constrain anyone to an inappropriate model, but
          <lb n="62"/>a bit of a disadvantage in that most people don't have a clue what they
          <lb n="63"/>can now or will eventually or might someday be able to do with SGML and
          <lb n="64"/>graphics processors.
          <lb n="65"/>
            <lb n="66"/>Not being deeply involved in graphics work or compound documents myself,
          <lb n="67"/>I don't know off-hand what options are offered for this sort of thing by
          <lb n="68"/>existing SGML processors.  There will certainly be a fierce market
          <lb n="69"/>demand for it, not only from humanists but also (to our great advantage)
          <lb n="70"/>from the defense industry, which needs SGML support for technical
          <lb n="71"/>manuals with diagrams (and of course cross-references and other
          <lb n="72"/>hypertext mechanisms) and has the small change to pay for the
          <lb n="73"/>development costs.  (As long as they don't charge the humanists
          <lb n="74"/>defense-contractor prices!)
          <lb n="75"/>
            <lb n="76"/>If for some reason one does *not* want to use SGML as the envelope for
          <lb n="77"/>the entire compound document, then presumably the major requirement for
          <lb n="78"/>the text-components of the compound documents is that they be
          <lb n="79"/>computationally well-behaved, with a clearly defined structure, hooks
          <lb n="80"/>for pointers going out, and hooks for pointers coming in.  SGML
          <lb n="81"/>certainly has all of this, in its document type declarations and its ID
          <lb n="82"/>names and its IDREF pointers.
          <lb n="83"/>
            <lb n="84"/>Perhaps those subscribers to this list who actually work with compound
          <lb n="85"/>documents and SGML will be willing to say how they make things work now,
          <lb n="86"/>and how they would like to see things developing in the future.
          <lb n="87"/>
            <lb n="88"/>All this is, I repeat, just personal opinion and shouldn't be taken as
          <lb n="89"/>defining "the" position of the TEI.  (Unless, of course, taking as "the"
          <lb n="90"/>position will help get a discussion started.)
          <lb n="91"/>
            <lb n="92"/>-Michael Sperberg-McQueen
          <lb n="93"/> University of Illinois at Chicago</post>
      </body>
   </text>
</TEI>

teiHeader

TEI Listserv Original Server Location, Logfiles  =>  fileDesc/sourceDesc

Method of extraction => encodingDesc/samplingDecl

text/body

Monthly Log collection =>  text/body/div[@type="log"][@xml:id="LOG____"]

Individual E-mail message =>  div[@type="log"]/post

 

teiHeader

Ubicación del servidor original de TEI Listserv, archivos de registro =>  fileDesc/sourceDesc

 Método de extracción => encodingDesc/samplingDecl

text/body

Colección mensual de registros =>  text/body/div[@type="log"][@xml:id="LOG____"]

Mensaje de correo electrónico individual =>  div[@type="log"]/post

 

Mapeo de Metadatos: 1 archivo TEI por colección de <post>s

Metadata Mapping: 1 TEI file per collection of <post>s


post/dateline

Date  =>  date/@when  | dateline/date/text()

Reply-To =>  ref[@type="reply-to"][@target="email:___"]

Sender =>   ref[@generatedBy="system"][@type="sender"][@target="email:TEI-L@__"]

From =>  ref[@generatedBy="template"][@type="from"][@target="email:___"]

Subject => title[@level="a"][@generatedBy="human"]

In-Reply-To => ref[@target="#id-of-earlier-posting"]

Ejemplo de TEI <post>: 1 archivo TEI por colección de <post>s

TEI Example <post>: 1 TEI file per collection of <post>s

<post xml:id="Web-1990-01-31-0933">
    <dateline>
        <date when="1990-01-31">Wed, 31 Jan 90 09:33:26 CST</date>
        <ref generatedBy="system" type="reply-to" target="email:TEI-L@UICVM">Text Encoding
            Initative public discussion list </ref>
        <ref generatedBy="system" type="sender" target="email:TEI-L@UICVM">Text Encoding
            Initative public discussion list</ref>
        <ref generatedBy="template" type="from" target="email:WEBER@HARVARDA.BITNET">Robert
            Philip Weber</ref>
        <title level="a" type="subject" generatedBy="human">compound documents and
            images</title>
    </dateline>
    <p>could someone please explain the <name type="ML">TEI</name> approach to compound
        documents and images? WIll <name type="ML">SGML</name> be used here, and if so,
        how? I've just joined the list. sorry if this has been asked before.</p>
    <p>Many Thanks</p>
    <signed generatedBy="human">Bob Weber</signed>
    <signed generatedBy="template"> Robert Philip Weber, Ph.D. | Phone: (617) 495-3744
        <lb/>Senior Consultant | Fax: (617) 495-0750 <lb/>Academic and Planning Services |
        <lb/>Division | <lb/>Office For Information Technology| Internet:
        weber@popvax.harvard.edu <lb/>Harvard University | Bitnet: Weber@Harvarda <lb/>50
        Church Street | <lb/>Cambridge MA 02138 | </signed>
</post>

Ejemplo de TEI <teiHeader>: 1 archivo TEI por colección de <post>s

TEI Example <teiHeader>: 1 TEI file per collection of <post>s

   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>Text Encoding Initiative public discussion list</title>
         </titleStmt>
         <editionStmt>
            <edition>January and February 1990 in the TEI Listserv
               <!-- TEI-L LOG9001, LOG9002 --></edition>
         </editionStmt>
         <publicationStmt>
            <!-- about the born-digital document -->
            <publisher>https://github.com/tei-cmc-experiment/tei-cmc-experiment</publisher>
         </publicationStmt>
         <sourceDesc>
            <bibl>
               <title level="j">TEI-L Listserv</title>
               <title level="s">LOG9001</title>
               <title level="s">LOG9002</title>
               <publisher>University of Illinois Chicago</publisher>
               <distributor>TEI-L@UICVM</distributor>
               <date>1990</date>
               <relatedItem type="archive">
                  <bibl>
                     <publisher>The Pennsylvania State University</publisher>
                     <distributor>LISTS.PSU.EDU LISTSERV Server (17.0)</distributor>
                     <date>2024</date>
                  </bibl>
               </relatedItem>
            </bibl>
         </sourceDesc>
      </fileDesc>
      <encodingDesc>
         <samplingDecl>
            <p>Sampled by requesting monthly logs from <name type="API">LISTS.PSU.EDU</name> by
               e-mail with GET commands: <code>GET TEI-L LOG 9001</code> (for January 1990). One log
               command was issued for each month. See <ptr type="APIdoc"
            target="https://www.lsoft.com/manuals/17.0/commands/14File-serverandwebfunctioncomma.html"/>. 
              Received by e-mail between <date from="2024-09-21" to="2024-09-22">September 21
                  and 22, 2024</date>.</p>
         </samplingDecl>
      </encodingDesc>
   </teiHeader>

Mapeo de Metadatos: 1 elemento TEI por 1 <post>

Metadata Mapping: 1 TEI element per 1 <post>

 

teiHeader/profileDesc/correspDesc

From =>  correspAction[@type="sent"]/persName

Date  =>  correspAction[@type="sent"]/date

Sender =>  correspAction[@type="relayed"]/orgName/ref[@target="mailto:TEI-L@__"]

 

In-Reply-To => correspContext/ref[@type="in-response-to"][@target="#id-of-earlier-posting"]

Subject => teiHeader/fileDesc/titleStmt/title[@type="subjectLine"]
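The mapping above can be sketched in Python with the standard library alone. This is only an illustration under stated assumptions, not the conversion code used for the project: the raw header block `RAW` is a hypothetical reconstruction from the values in the sample post, the function name `corresp_desc` is invented here, and TEI namespaces and the In-Reply-To field are left out for brevity.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime
import xml.etree.ElementTree as ET

# Hypothetical raw header block, reconstructed from the sample post's values.
RAW = (
    "Date: Wed, 31 Jan 90 09:33:26 CST\n"
    "Sender: Text Encoding Initiative public discussion list <TEI-L@UICVM>\n"
    "From: Robert Philip Weber <WEBER@HARVARDA.BITNET>\n"
    "Subject: compound documents and images\n"
    "\n"
    "(message body)\n"
)

def corresp_desc(msg):
    """Build a <correspDesc> and a subject-line <title> from one message."""
    desc = ET.Element("correspDesc")

    # From and Date => correspAction[@type="sent"]
    sent = ET.SubElement(desc, "correspAction", type="sent")
    name, addr = parseaddr(msg["From"])
    ET.SubElement(sent, "persName").text = name
    ET.SubElement(sent, "email").text = addr
    when = parsedate_to_datetime(msg["Date"]).date().isoformat()
    date = ET.SubElement(sent, "date", when=when)
    date.text = msg["Date"]  # keep the original date string as content

    # Sender => correspAction[@type="relayed"]/orgName/ref
    relayed = ET.SubElement(desc, "correspAction", type="relayed")
    list_name, list_addr = parseaddr(msg["Sender"])
    org = ET.SubElement(relayed, "orgName")
    ref = ET.SubElement(org, "ref", target="mailto:" + list_addr)
    ref.text = list_name

    # Subject => titleStmt/title[@type="subjectLine"]
    title = ET.Element("title", type="subjectLine")
    title.text = msg["Subject"]
    return desc, title

desc, title = corresp_desc(message_from_string(RAW))
print(ET.tostring(desc, encoding="unicode"))
print(ET.tostring(title, encoding="unicode"))
```

The date string is kept as element content so that the original time zone abbreviation survives next to the normalized @when value.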

Mapeo de Metadatos: 1 elemento TEI por 1 <post>

Metadata Mapping: 1 TEI element per 1 <post>

<TEI xmlns="http://www.tei-c.org/ns/1.0">
   <teiHeader>
      <fileDesc>
        <!-- SAME AS PREVIOUS EXAMPLE . . . -->
      </fileDesc>
      <encodingDesc>
         <!-- SAME AS PREVIOUS EXAMPLE . . .  -->
      </encodingDesc>
   </teiHeader>
   <TEI xml:id="Web-1990-01-31-0933">
      <teiHeader>
         <fileDesc>
            <titleStmt>
               <title type="subjectLine">compound documents and images</title>
               <author>Robert Philip Weber</author>
            </titleStmt>
            <publicationStmt>
               <p>Text Encoding Initiative public discussion list</p>
            </publicationStmt>
            <sourceDesc>
               <bibl><title level="s">LOG9001</title></bibl>
            </sourceDesc>
         </fileDesc>
         <profileDesc>
            <langUsage>
               <language ident="en">English</language>
            </langUsage>
            <correspDesc>
               <correspAction type="sent">
                  <persName>Robert Philip Weber</persName>
                  <email>WEBER@HARVARDA.BITNET</email>
                  <date when="1990-01-31">Wed, 31 Jan 90 09:33:26 CST</date>
               </correspAction>
               <correspAction type="relayed">
                  <orgName><ref target="mailto:TEI-L@UICVM">Text Encoding
                     Initiative public discussion list</ref></orgName>
               </correspAction>            
            </correspDesc>
         </profileDesc>
      </teiHeader>
      <text>
         <body>
            <post>
               <p>could someone please explain the <name type="ML">TEI</name> approach to compound
                  documents and images? WIll <name type="ML">SGML</name> be used here, and if so,
                  how? I've just joined the list. sorry if this has been asked before.</p>
               <p>Many Thanks</p>
               <signed generatedBy="human">Bob Weber</signed>
               <signed generatedBy="template"> Robert Philip Weber, Ph.D. | Phone: (617) 495-3744
                  <lb/>Senior Consultant | Fax: (617) 495-0750 <lb/>Academic and Planning Services |
                  <lb/>Division | <lb/>Office For Information Technology| Internet:
                  weber@popvax.harvard.edu <lb/>Harvard University | Bitnet: Weber@Harvarda <lb/>50
                  Church Street | <lb/>Cambridge MA 02138 | </signed>               
            </post>
         </body>
      </text>
   </TEI>
   <TEI xml:id="Spe-1990-02-06-1054">
      <teiHeader>
         <fileDesc>
            <titleStmt>
               <title type="subjectLine">compound documents</title>
               <author>Michael Sperberg-McQueen</author>
            </titleStmt>
            <publicationStmt>
               <p>Text Encoding Initiative public discussion list</p>
            </publicationStmt>
            <sourceDesc>
               <bibl><title level="s">LOG9002</title></bibl>
            </sourceDesc>
         </fileDesc>
         <profileDesc>
            <langUsage>
               <language ident="en">English</language>
            </langUsage>
            <correspDesc>
               <correspAction type="sent">
                  <persName>Michael Sperberg-McQueen</persName>
                  <email>U35395@UICVM.BITNET</email>
                  <date when="1990-02-06">Tue, 6 Feb 90 10:54:04 CST</date>
               </correspAction>
               <correspAction type="relayed">
                  <orgName><ref target="mailto:TEI-L@UICVM">Text Encoding
                     Initiative public discussion list</ref></orgName>
               </correspAction>  
               <correspContext>
                  <ref type="in-response-to" target="#Web-1990-01-31-0933">previous message of 
                    <persName>Robert Philip Weber</persName> to the TEI-L Listserv.
                  <date when="1990-01-31"/>
                  </ref>
               </correspContext>
            </correspDesc>
         </profileDesc>
      </teiHeader>
      <text>
         <body>
            <post>
               <p>About compound documents in <name type="ML">SGML</name> and in the <name type="ML"
                  >TEI</name>. <ref target="#Web-1990-01-31-0933"><persName>R.P. Weber</persName>
                     asked a week ago <q>could someone please explain the <name type="ML">TEI</name>
                        approach to compound documents and images? WIll <name type="ML">SGML</name>
                        be used here, and if so, how?</q></ref></p>
               <p>Apologies for my delay in answering. I was hoping one of our hypertext sages might
                  weigh in with a reply. (But he appears to have been in the
                  <placeName>Caribbean</placeName>, and may not have received the query.)</p>
               <p>This problem has not, in fact, been discussed on this server, or as far as I'm
                  aware by the working committees. So the formal answer is that no decision has been
                  cast in concrete yet. Which allows me to turn the tables and say <q>How <emph
                     rend="*">should</emph> compound documents be encoded for interchange? What
                     are the requirements? What are the alternatives?</q></p>
               <p>Less formally, I can offer some personal opinions, for what they are worth (face
                  value: two cents -- 2u, for those of you on <objectName>IBM
                     mainframes</objectName> with real <objectName>IBM terminals</objectName>).</p>
               <p> Certainly <name type="ML">SGML</name> is where we should start in any search for
                  ways of handling compound objects, and I don't yet know any reason that SGML won't
                  provide a solution for the problem. I assume there are two methods of using <name
                     type="ML">SGML</name> in compound documents (correct me if I'm wrong): (1) use
                  <name type="ML">SGML</name> to organize the compound document (i.e. have an
                  SGML document which includes text, images, sound, etc. as its components), or (2)
                  use whatever-you-like to organize the compound document as a whole, and use <name
                     type="ML">SGML</name> as the notation for the textual components of the
                  compound document.</p>
               <p>For the simple case of text-with-illustrations, <name type="ML">SGML</name> seems
                  like a viable encoding mechanism (for the envelope and for the text components) to
                  me. It allows you to encode the graphics however you like, declaring your graphics
                  format as a non-<name type="ML">SGML</name> notation and declaring the contents of
                  your graphics elements (say 'PICTURE' or 'BLORT') as being data in that notation,
                  stored either within the <name type="ML">SGML</name> file or externally to it. You
                  get localization of the graphics within the text stream, integral or separate
                  storage of the graphics, and complete freedom to choose whatever graphic notation
                  you wish.</p>
               <p>As document encoding methods go, <name type="ML">SGML</name> is fairly hospitable
                  to graphics and other non-text pieces of compound objects. Nowhere in the standard
                  does it say that the data have to be words and characters. In fact, as far as I
                  know there is no <emph rend="*">explicit</emph> requirement in the standard that
                  an <name type="ML">SGML</name> document even has to be bytes in a computer. (Sure,
                  it's hard to understand the standard any other way, but that's not the same as an
                  explicit requirement.) <bibl>ISO 8879 par. 6.1 note 1</bibl> says in fact <q>This
                     International Standard does not constrain the physical organization of the
                     document within the data stream, message handling protocol, file system, etc.,
                     that contains it.</q> At the <eventName>SGML '89 conference</eventName> last
                  <date when="1989-10">October</date> in <placeName>Atlanta</placeName>, there
                  was a very nice paper by <persName>Douglas MacLeod</persName> (read by
                  <persName>Yuri Rubinsky</persName>) thinking about architectural designs as
                  <name type="ML">SGML</name> documents, which led to a general discussion of
                  <name type="ML">SGML</name> definitions for all sorts of objects, including
                  automobiles. Although most people (obviously) think of the <name type="ML"
                     >SGML</name> document as an electronic <emph rend="*">description</emph> of the
                  automobile (and the physical automobile as a side effect of processing), it
                  appears, in the light of the passage cited, hard to say categorically that an
                  automobile itself could never be parsed as an <name type="ML">SGML</name>
                  document. (If you could figure out how to define the delimiters.)</p>
               <p>The only hitch is that the <name type="ML">SGML</name> standard itself (<bibl
                  type="standard">ISO 8879</bibl>) does not specify in any detail what the
                  interface between <name type="ML">SGML</name> processors and non-<name type="ML"
                     >SGML</name> processors must, may, or can look like -- an advantage, if you
                  will, in that it doesn't constrain anyone to an inappropriate model, but a bit of
                  a disadvantage in that most people don't have a clue what they can now or will
                  eventually or might someday be able to do with SGML and graphics processors.</p>
               <p>Not being deeply involved in graphics work or compound documents myself, I don't
                  know off-hand what options are offered for this sort of thing by existing <name
                     type="ML">SGML</name> processors. There will certainly be a fierce market
                  demand for it, not only from humanists but also (to our great advantage) from the
                  defense industry, which needs <name type="ML">SGML</name> support for technical
                  manuals with diagrams (and of course cross-references and other hypertext
                  mechanisms) and has the small change to pay for the development costs. (As long as
                  they don't charge the humanists defense-contractor prices!)</p>
               <p>If for some reason one does <emph rend="*">not</emph> want to use <name type="ML"
                  >SGML</name> as the envelope for the entire compound document, then presumably
                  the major requirement for the text-components of the compound documents is that
                  they be computationally well-behaved, with a clearly defined structure, hooks for
                  pointers going out, and hooks for pointers coming in. SGML certainly has all of
                  this, in its document type declarations and its ID names and its IDREF
                  pointers.</p>
               <p>Perhaps those subscribers to this list who actually work with compound documents
                  and <name type="ML">SGML</name> will be willing to say how they make things work
                  now, and how they would like to see things developing in the future.</p>
               <p> All this is, I repeat, just personal opinion and shouldn't be taken as defining
                  <soCalled>the</soCalled> position of the <orgName>TEI</orgName>. (Unless, of
                  course, taking as <soCalled>the</soCalled> position will help get a discussion
                  started.)</p>
               <signed generatedBy="human">-Michael Sperberg-McQueen<lb/> University of Illinois at
                  Chicago </signed>
            </post>
         </body>
      </text>
      
   </TEI>
</TEI>
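The two @xml:id values in this example ("Web-1990-01-31-0933" and "Spe-1990-02-06-1054") look like the first three letters of the sender's surname followed by the sending date and time. A minimal sketch of that pattern follows; the rule is inferred from these two examples only, and the function name `post_id` is hypothetical.

```python
from datetime import datetime

def post_id(author: str, sent: datetime) -> str:
    """Derive an identifier like 'Web-1990-01-31-0933' for one post.

    Assumption: the surname is the last whitespace-separated token of the
    author's name, and the time is given to the minute (HHMM).
    """
    surname = author.split()[-1]
    return f"{surname[:3]}-{sent:%Y-%m-%d-%H%M}"

print(post_id("Robert Philip Weber", datetime(1990, 1, 31, 9, 33, 26)))
print(post_id("Michael Sperberg-McQueen", datetime(1990, 2, 6, 10, 54, 4)))
```

Two posts by senders with similar surnames in the same minute would collide under this scheme, so a real pipeline would need a uniqueness check.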

Improving the TEI for the encoding of born-digital resources?

mejorando la TEI para la codificación de recursos de origen digital

  • TEI provides no good way to encode an e-mail Subject line in either <correspDesc> or <dateline>

    • <head> and <label> are not allowed, and are not ideal anyway

    • <title> not quite apt

    • Could we put <head> in <post>?

      • Yes, it is allowed, and it does make sense

      • No: it is part of the metadata from email heading section, not part of the e-mail body

  • TEI no proporciona ninguna buena manera de codificar la línea de asunto de un correo ni en <correspDesc> ni en <dateline>

    • <head> y <label> no se permiten, y de todos modos no son ideales

    • <title> no es enteramente apto

    • ¿Podríamos poner <head> en <post>?

      •  Sí, se permite, y tiene sentido

      • No: es parte de los metadatos de la sección cabecera, no parte del cuerpo del correo

Campo de cabecera de asunto

Subject header field

overly constrained?

  • e-mail addresses in <person>:

    • <email> not allowed directly in <person>

    • <email> is poorly defined

  • <post> allows <lb/> but not <line>

¿excesivamente restringido?

  • direcciones de correo en <person>:

    • <email> no se permite directamente en <person>

    • <email> está mal definido

  • <post> permite <lb/> pero no <line>

 

  • What @level values are appropriate for the <title>s of:

    • TEI-L Listserv as a whole?

    • LOGYYMM files?

    • <post> elements?

    • Do we need new values of @level for born-digital compound artifacts?

 

  • title[@level="s"]: (series)

  • title[@level="j"]: (journal)

  • title[@level="m"]: (monographic)

  • title[@level="a"]: (analytic)

 

  • ¿Qué valores de @level son apropiados para los <title> de:

    • ¿TEI-L Listserv en su conjunto?

    • ¿Archivos LOGYYMM?

    • ¿Elementos <post>?

    • ¿Necesitamos nuevos valores de @level para artefactos de origen digital compuestos?

Our Experiments with Scraping, Analyzing, and Coding

Nuestros experimentos con scraping, análisis y codificación

any questions?

¿algunas preguntas?

slide dungeon

Attempt 1: Scrapy Spider with Python

  • Source: public TEI-L web archive, selected via URL: 

    https://lists.psu.edu/cgi-bin/wa?A1=199001-199512&L=TEI-L
    
  • good possibilities for directly outputting XML

  • Can define the XML output structure from message headers and HTML element structure

  • Works with XPath 1.0 to retrieve HTML element contents

Intento 1: Scrapy Spider con Python

  • Fuente: archivo web público de TEI-L, seleccionado a través de URL: 

    https://lists.psu.edu/cgi-bin/wa?A1=199001-199512&L=TEI-L
    
  • Buenas posibilidades para generar directamente XML

  • Puede definir la estructura de salida XML a partir de encabezados de mensajes y estructura de elementos HTML

  • Funciona con XPath 1.0 para recuperar contenidos de elementos HTML

Desired output: Scrapy Spider with Python 

<?xml version="1.0" encoding="utf-8"?>
<emails>
  <email>
    <emailId>ef5ba1b</emailId>
    <date>Tue, 5 Sep 1995 10:39:24 CDT</date>
    <reply-to>Text Encoding Initiative public discussion list 
         &lt;TEI-L@UICVM.UIC.EDU&gt;</reply-to>
    <from>Arjan.Loeffen@LET.RUU.NL</from> 
    <subject>DYNATEXT and TEI: level of support?</subject>
    <body>Dear reader, 
    Peter Robinson writes in 'The transcription...' [...]</body>
  </email>
  <email>
    <emailId>cd018892</emailId>
    <date>Wed, 6 Sep 1995 11:04:31 CDT</date>
    <reply-to>Text Encoding Initiative public discussion list 
         &lt;TEI-L@UICVM.UIC.EDU&gt;</reply-to>
    <from>"Steven J. DeRose" &lt;sjd@ebt.com&gt;</from>
    <subject>Re: DYNATEXT and TEI: level of support?</subject>
    <body>Peter Robinson writes in 'The transcription...' [...]
      Caveat for anyone new to the list: I *am* connected to DynaText, [...]
      </body>
  </email>
</emails>

Resultado deseado: Scrapy Spider con Python

Method: Scrapy Spider with Python

Método: Scrapy Spider con Python

  • This could work well:

    • We could convert the elements into TEI and model with CMC and Correspondence encoding

    • TEI-L web archive delivers the distinct identifier of the message in its URL: https://lists.psu.edu/cgi-bin/wa?A2=TEI-L;cd018892.9509

    • It most likely automatically converts base64 content

  • Esto podría funcionar bien:

    • Podríamos convertir los elementos a TEI y modelar con los módulos de CMC y Correspondencia

    • El archivo web TEI-L proporciona el identificador único del mensaje en su URL: https://lists.psu.edu/cgi-bin/wa?A2=TEI-L;cd018892.9509

    • Lo más probable es que convierta automáticamente el contenido base64
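As a rough illustration of how the message identifier could be pulled out of such an archive URL, here is a short standard-library sketch. It assumes the query always has the shape `A2=LIST;ID.YYMM`, and reading the trailing "9509" as year and month is our interpretation (it matches the September 1995 date of that message), not something documented in these slides. The function name `parse_message_url` is hypothetical.

```python
from urllib.parse import urlsplit

def parse_message_url(url: str) -> dict:
    """Split a web-archive message URL into list name, message id, and yymm."""
    query = urlsplit(url).query               # e.g. "A2=TEI-L;cd018892.9509"
    key, _, value = query.partition("=")
    if key != "A2":
        raise ValueError(f"not a single-message URL: {url}")
    # Split by hand: older Python versions treat ';' as a query separator,
    # so urllib.parse.parse_qs is not reliable for this shape.
    list_name, _, rest = value.partition(";")
    message_id, _, yymm = rest.partition(".")
    return {"list": list_name, "id": message_id, "yymm": yymm}

info = parse_message_url("https://lists.psu.edu/cgi-bin/wa?A2=TEI-L;cd018892.9509")
print(info)
```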

Method: Scrapy Spider with Python

  • Problems: Dense scripting on TEI-L archive webpages!

    • Sender name / email information is set in the same HTML element content, and not reliably separated

    • Successfully extracted URLs for messages from the TEI-L web archive, but parsing the messages failed. (Abandoned for the conference due to time constraints, but we'd like to return to it. . .)

Método: Scrapy Spider con Python

  • Problemas: ¡Secuencias de comandos densas en las páginas web de archivos TEI-L!

    • El nombre del remitente y la información de correo electrónico se establecen en el mismo contenido del elemento HTML y no siempre están separados

    • Se extrajeron correctamente las URL de los mensajes del archivo web de TEI-L, pero falló el análisis de los mensajes. (Abandonado para la conferencia por falta de tiempo, pero nos gustaría volver al tema...)
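One possible way to separate a sender's name from the address, once the text of the HTML element has been extracted, is the standard library's `email.utils.parseaddr`. The sketch below assumes the scraped string still has the usual `"Name" <address>` shape with HTML entities escaped; we have not verified this against the actual TEI-L archive pages, and `split_sender` is a hypothetical helper name.

```python
from email.utils import parseaddr
from html import unescape

def split_sender(scraped: str) -> tuple[str, str]:
    """Return (display name, address) from a scraped sender string.

    The display name is an empty string when only an address is present.
    """
    return parseaddr(unescape(scraped).strip())

print(split_sender("&quot;Steven J. DeRose&quot; &lt;sjd@ebt.com&gt;"))
print(split_sender("Arjan.Loeffen@LET.RUU.NL"))
```

A bare address yields an empty display name, which would let a later step decide whether to emit a <persName> at all.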

Simpler method: Request log files from TEI-L

Método más simple: solicitar archivos de registro de TEI-L

  • "Scraping" extracts data from a published web page, digging into a labyrinth of code.

  • Accessing log files is much simpler and more direct.

  • We sent a series of "GET" commands by e-mail to the LISTSERV server, which responded with monthly log files.

  • While Syd worked with all 412 log files, Elisa just pulled some "core samples" from early in the archive to study.

  • "Scraping" extrae datos de una página web publicada, entrando en un laberinto de código.

  • Acceder a los archivos de registro es mucho más sencillo y directo.

  • Se envió una serie de comandos "GET" por correo electrónico al servidor de listas, que respondió con archivos de registro mensuales.

  • Mientras Syd trabajaba con los 412 archivos de registro, Elisa simplemente extrajo algunas "muestras principales" del comienzo del registro para estudiarlas.
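The monthly requests could be generated rather than typed by hand. The sketch below only builds the command strings, one per month, using the LOGyymm log names that appear in the header example (LOG9001, LOG9002); how the commands are then mailed to the server is left out. The function name `get_commands` and its arguments are hypothetical.

```python
def get_commands(list_name, first, last):
    """Yield one GET command per month, from first to last inclusive.

    first and last are (year, month) tuples, e.g. (1990, 1).
    """
    year, month = first
    while (year, month) <= last:
        # Log names use a two-digit year followed by a two-digit month.
        yield f"GET {list_name} LOG{year % 100:02d}{month:02d}"
        month += 1
        if month > 12:
            year, month = year + 1, 1

for command in get_commands("TEI-L", (1990, 1), (1990, 2)):
    print(command)
```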