James Cummings
@jamescummings
http://slides.com/jamescummings/markup-xml-tei
Thanks as ever to many members of the TEI Community
(How we tell the computer about text)
Markup is used in many different fields, for many different purposes: storing data, relating information, encoding understanding, preserving metadata
Procedural Markup:
RED INK ON; print "-£1000"; RED INK OFF
Presentational Markup:
\textcolor{red}{-£1000}
Descriptive Markup:
<measure unit="pounds" value="-1000">One thousand pounds in debt</measure>
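To illustrate why descriptive markup matters, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It reads the measure example above and recovers the amount and unit without any knowledge of ink colour. The variable names are illustrative assumptions and not part of the original slides.

```python
import xml.etree.ElementTree as ET

# The descriptive markup from the example above, as a single string.
snippet = '<measure unit="pounds" value="-1000">One thousand pounds in debt</measure>'

measure = ET.fromstring(snippet)

# Attributes carry the machine-readable meaning; the text stays human-readable.
amount = int(measure.get("value"))
unit = measure.get("unit")
description = measure.text

# A program can now decide for itself how to present the value,
# for example by flagging debts, with no colour instructions in the source.
is_debt = amount < 0
print(unit, amount, is_debt, description)
```

The procedural and presentational versions only say "make this red"; the descriptive version says what the text is, so the decision to print debts in red can be made later, or not at all.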
Think about the uses for an italic font in any form of printed publication. Why might an author/publisher put some text into italics? What are they signalling about that text?
Human readers can usually tell these uses apart from context. If we want a computer to work with these categories, it has to be told explicitly that they are different.
Some common uses include:
... and many more
<hi rend="dropcap">H</hi>
<g ref="#wynn">W</g>ÆT WE GARDE
<lb/>na in gear-dagum þeod-cyninga
<lb/>þrym gefrunon, hu ða æþelingas
<lb/>ellen fremedon. oft scyld scefing sceaþe<add>na</add>
<lb/>þreatum, moneg<expan>um</expan> mægþum meodo-setl<add>a</add>
<lb/>of<damage><desc>blot</desc></damage>teah ...
<lg>
<l>Hwæt! we Gar-dena in gear-dagum</l>
<l>þeod-cyninga þrym gefrunon,</l>
<l>hu ða æþelingas ellen fremedon,</l>
</lg>
<lg>
<l>Oft Scyld Scefing sceaþena þreatum,</l>
<l>monegum mægþum meodo-setla ofteah;</l>
<l>egsode Eorle, syððan ærest wearþ</l>
<l>feasceaft funden...</l>
</lg>
(A language for marking up texts)
XML is structured data represented as strings of text
XML looks like HTML, except that:
<element> Text </element>
<element attribute="value">
Text or child elements here
</element>
<element attribute="value"/>
"Opening Tag"
"Closing Tag"
"Empty Element"
<?xml version="1.0" ?>
<root xmlns="http://namespace/">
<element attribute="value">
content
<childElement type="empty"/>
content
</element>
<!-- comment -->
</root>
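As a rough sketch of how software sees the document above, the following Python (standard library only) parses it and picks out the root element, the attribute, the empty child element, and the text around it. The "ex" prefix is an arbitrary label chosen for this example; only the namespace URI comes from the slide.

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0" ?>
<root xmlns="http://namespace/">
  <element attribute="value">
    content
    <childElement type="empty"/>
    content
  </element>
  <!-- comment -->
</root>"""

root = ET.fromstring(doc)

# ElementTree writes namespaced names as {namespace-uri}local-name.
ns = {"ex": "http://namespace/"}
print(root.tag)

element = root.find("ex:element", ns)
print(element.get("attribute"))

child = element.find("ex:childElement", ns)
# An empty element has no text and no children, only attributes.
print(child.get("type"), child.text, len(child))

# Mixed content: text before the child is element.text,
# text after the child is stored as the child's .tail.
print(element.text.strip(), child.tail.strip())
```

Note that the default ElementTree parser discards the comment, so it does not appear among the children of the root.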
<?xml version="1.0" encoding="utf-8" ?>
<div n="1">
<head>SCENE I. On a ship at sea: a
tempestuous noise of thunder and lightning heard.</head>
<stage>Enter a Master and a Boatswain</stage>
<sp>
<speaker>Master</speaker>
<ab>Boatswain!</ab>
</sp>
<sp>
<speaker>Boatswain</speaker>
<ab>Here, master: what cheer?</ab>
</sp>
<sp>
<speaker>Master</speaker>
<ab>Good, speak to the mariners: fall to't, yarely,</ab>
<ab>or we run ourselves aground: bestir, bestir.</ab>
</sp>
<stage>Exit</stage>
</div>
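Once a scene is marked up like this, simple questions become easy to answer. The sketch below (Python standard library; it assumes the elements carry no namespace, as in the slide, whereas a real TEI file would use the TEI namespace) counts the ab blocks per speaker and lists the stage directions.

```python
import xml.etree.ElementTree as ET
from collections import Counter

scene = """<div n="1">
  <head>SCENE I. On a ship at sea: a
  tempestuous noise of thunder and lightning heard.</head>
  <stage>Enter a Master and a Boatswain</stage>
  <sp>
    <speaker>Master</speaker>
    <ab>Boatswain!</ab>
  </sp>
  <sp>
    <speaker>Boatswain</speaker>
    <ab>Here, master: what cheer?</ab>
  </sp>
  <sp>
    <speaker>Master</speaker>
    <ab>Good, speak to the mariners: fall to't, yarely,</ab>
    <ab>or we run ourselves aground: bestir, bestir.</ab>
  </sp>
  <stage>Exit</stage>
</div>"""

div = ET.fromstring(scene)

# Count the <ab> blocks attributed to each <speaker>.
blocks_per_speaker = Counter()
for sp in div.findall("sp"):
    speaker = sp.findtext("speaker")
    blocks_per_speaker[speaker] += len(sp.findall("ab"))

# Stage directions are separate elements, so they never get mixed up with speech.
stage_directions = [stage.text for stage in div.findall("stage")]

print(dict(blocks_per_speaker))
print(stage_directions)
```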
An XML document can also be 'valid', which means it obeys additional rules about which elements and attributes are allowed and where they can go.
<seg> <w>some</w> <hi>text</hi> </seg>
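The difference between a document that merely parses and one that is 'valid' can be sketched in a few lines of Python. The rule table below is a toy invented for this example and is not the TEI's actual content model; real projects validate against a schema (for example RELAX NG, a DTD, or W3C XML Schema) using a proper validator.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """True if the text parses as XML at all (tags nest and close properly)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical rules: which child elements each element may contain.
ALLOWED_CHILDREN = {
    "seg": {"w", "hi"},
    "w": set(),
    "hi": set(),
}

def is_valid(text):
    """True if the text is well-formed AND obeys the toy rules above."""
    if not is_well_formed(text):
        return False
    root = ET.fromstring(text)
    for parent in root.iter():
        if parent.tag not in ALLOWED_CHILDREN:
            return False
        for child in parent:
            if child.tag not in ALLOWED_CHILDREN[parent.tag]:
                return False
    return True

print(is_valid("<seg> <w>some</w> <hi>text</hi> </seg>"))   # well-formed and valid
print(is_well_formed("<seg><w>some</hi></seg>"))            # mismatched tags
print(is_valid("<w><seg>some</seg></w>"))                   # parses, but breaks the rules
```

The third case is the interesting one: the tags are balanced, so an XML parser accepts it, but under these assumed rules a w element may not contain a seg.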
(A markup vocabulary for digital texts)
(What kinds of texts is the TEI good for?)
The TEI takes a general approach to overall text structure, which means it should be able to cope with texts of any size, language, date, complexity, writing system, or medium.
The source could take any form (books, journals, manuscripts, postcards, letters, rolls of papyrus, clay tablets, web pages, gravestones, etc.) and contain any type of text.
Punch Magazine: a variety of content forms
Holinshed's Chronicles: columns, marginal notes, woodcuts
First Folio: forme-work, catchwords, decorative initials, etc.
Wilfred Owen: manuscripts, corrections, multiple versions
George Herbert: graphic text layout, poetry
William Godwin's Diary: diary structure, abbreviated texts
Wilfred Owen: letters, codewords
Print and Digital Dictionaries: entries, sense, etymologies, quotations, etc.
Epigraphical Texts: partial letters, supplied text, physical description
WW1 Propaganda: font, colour, glyph substitution, image classification and metadata
Various writing systems: Unicode/non-Unicode characters, right-to-left, reversing lines, etc.
English Civil War Petitions: handwritten petitions, formal legal aspects, boilerplate
Thinking about this material, and indeed your own, what do you think are the things you would like to mark up?
Pretend an authoritarian anti-intellectual government has come to power and, through a series of bad decisions, has to slash your project funding by 50%. What do you do?
Repeat the exercise.