Models and methodologies for representing digital text
Introduction to TEI: methods and languages for data management and display
Tiziana Mancinelli
Scholarly Editing and the Media Shift – Procedures and Theory
Verona, 8-9 September 2015
Aims of this workshop:
Introduce the concept of markup and XML encoding
Provide hands-on experience in using TEI-XML markup
Overview of XML-related technologies
Textual Markup
In order to talk about texts, markup and encoding of texts, we need to understand what we mean by these basic concepts.
When we talk about text encoding, what do we mean by a text?
What is in a text, and what assumptions do we make in reading it?
A text is not a document
- A "document" is something that exists in the world, which we can digitize.
- A "text" is an abstraction, created by or for a community of readers, which we can encode.
Encoding of texts...
A text is more than a sequence of encoded glyphs or lexical tokens
It has a structure and a communicative function
It also has multiple possible readings
Encoding, or markup, is a way of making all these things explicit
Why do we have to mark up?
To make explicit (to a machine) what is implicit (to a person)
To add value by supplying multiple annotations
To facilitate re-use of the same material in different formats, in different contexts, by different users
Separation of form and content
Presentational markup cares more about fonts and layout than meaning
Descriptive markup says what things are, and leaves the rendition of them for a separate step
It also allows easy changes of presentation across a large number of documents
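As a rough illustration of keeping form and content apart, the sketch below encodes a phrase descriptively (as a title) and then applies two different renditions to the same encoded text. The element name `title`, the sample sentence and the two style tables are assumptions made up for this example, and the helper only handles one level of nesting.

```python
import xml.etree.ElementTree as ET

# Descriptive markup: it says what the phrase is (a title), not how it looks.
SOURCE = "<p>Dante wrote the <title>Commedia</title> in exile.</p>"

# Two hypothetical renditions, kept separate from the encoded text.
HTML_STYLE = {"title": ("<i>", "</i>")}
PLAIN_STYLE = {"title": ("_", "_")}

def render(xml_text, style):
    """Flatten a paragraph, wrapping each child element as the style dictates."""
    root = ET.fromstring(xml_text)
    parts = [root.text or ""]
    for child in root:
        before, after = style.get(child.tag, ("", ""))
        parts.append(before + (child.text or "") + after)
        parts.append(child.tail or "")  # text that follows the child element
    return "".join(parts)

html_output = render(SOURCE, HTML_STYLE)
plain_output = render(SOURCE, PLAIN_STYLE)
print(html_output)
print(plain_output)
```

Changing the presentation means editing a style table, not the encoded text.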
Markup as a scholarly activity
The application of markup to a document can be an intellectual activity
In deciding what markup to apply, and how this represents the original, one is undertaking the task of an editor
There is (almost) no such thing as neutral markup – all of it involves interpretation
Markup can assist in answering research questions, and deciding what markup is needed to enable such questions to be answered can be a research activity in itself
Markup as a scholarly activity
Imagine you are going to mark up several thousand pages of complex material...
Which features are you going to mark up?
Why are you choosing to mark up this feature?
How reliably and consistently can you do this?
Now, imagine your budget has been halved.
XML
Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML also now plays an indispensable role in the exchange of a wide variety of data on the Web, on our phones, and elsewhere.
Its success means that general tools are ubiquitous and how it works is well understood.
XML
XML is structured data represented as strings of text
XML looks like HTML, except that:
XML is extensible
XML must be well-formed
XML can be validated
XML is application-, platform-, and vendor-independent
XML empowers the content provider and facilitates data integration
It is one of the best plain-text long-term preservation formats for textual data that we have
You use XML almost every day, not only on the web but also in many devices, and even in analogue information sources derived from it
XML
An XML document may contain:
- elements, possibly bearing attributes
- processing instructions
- comments
- entity references
- namespaces
An XML document must be well-formed and may be valid
XML
<?xml version="1.0" ?>
<root>
<element attribute="value">I like encoding!</element>
<!-- this is a comment -->
</root>
XML
- An XML document represents a (kind of) tree
- It has a single root and many nodes
- Each node can be:
  - a subtree
  - a single element (possibly bearing some attributes)
  - a string of character data
- Each element has a name or generic identifier
- XML element and attribute names are case-sensitive
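To make the tree model concrete, here is a minimal sketch that parses the small sample document from the earlier slide with Python's standard `xml.etree.ElementTree` module and walks its nodes. The `outline` helper is a name invented for this example. Note that this parser drops comments by default, so only elements appear in the tree.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0" ?>
<root>
<element attribute="value">I like encoding!</element>
<!-- this is a comment -->
</root>"""

def outline(node, depth=0):
    """Return one indented line per element: its name and its attributes."""
    lines = ["  " * depth + node.tag + " " + str(dict(node.attrib))]
    for child in node:
        lines.extend(outline(child, depth + 1))
    return lines

tree_root = ET.fromstring(SAMPLE)
for line in outline(tree_root):
    print(line)

first_child = tree_root[0]
# Names are case-sensitive: "Element" does not match "element".
wrong_case = tree_root.find("Element")
```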
XML
- An XML document is encoded as a linear string of characters
- It begins with the XML declaration, which looks like a special processing instruction
- Element occurrences are marked by start-tags and end-tags
- The characters < and & are Magic and must always be "escaped" using &lt; or &amp; if you want to use them as themselves
- Comments are delimited by <!-- and -->
- Attribute name/value pairs are supplied on the start-tag and may be given in any order
- Entity references are delimited by & and ;
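The escaping rule can be tried out with a short sketch. It uses `escape` from Python's standard `xml.sax.saxutils` module to replace the magic characters with entity references, then parses the result to confirm that the original characters come back. The sample sentence and the `<p>` wrapper are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "AT&T says 1 < 2"

# escape() turns & into &amp; and < into &lt; (and > into &gt;).
escaped = escape(raw)
document = "<p>" + escaped + "</p>"

# The parser resolves the entity references back to the literal characters.
round_trip = ET.fromstring(document).text

# Without escaping, the same text is not well-formed XML.
try:
    ET.fromstring("<p>" + raw + "</p>")
    unescaped_accepted = True
except ET.ParseError:
    unescaped_accepted = False

print(escaped)
print(round_trip)
```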
XML
- The XML declaration
- Namespace declarations
- The root element of the document itself
- Other elements and content
- Attribute and value
<?xml version="1.0"?>
<greetings xmlns="http://www.example.org/greetings">
<hello type="ironic">hello world!</hello>
</greetings>
The XML declaration
An XML document should begin with an XML declaration (it is optional in XML 1.0, but recommended)
which does three things:
- specifies that this is an XML document
- specifies which version of the XML standard it follows
- optionally specifies which character encoding the document uses
<?xml version="1.0"?>
<?xml version="1.0" encoding="iso-8859-1" ?>
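A small sketch of why the encoding declaration matters: the bytes below are stored as ISO-8859-1, and the parser relies on the declaration to decode the accented letter correctly. The element name `w` and the sample word are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# Build the document as bytes in ISO-8859-1, as it would be stored on disk.
latin1_bytes = (
    '<?xml version="1.0" encoding="iso-8859-1"?><w>città</w>'
).encode("iso-8859-1")

# The parser reads the declaration and decodes the content accordingly.
word = ET.fromstring(latin1_bytes).text
print(word)
```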
The XML namespace
All TEI elements are declared within the TEI namespace:
<TEI xmlns="http://www.tei-c.org/ns/1.0"> ...
</TEI>
XML documents can include elements declared in different namespaces.
- a namespace declaration associates a namespace prefix with an external URI-like identifier
- the default namespace may be declared using xmlns
- other namespaces must all use a specially declared prefix
<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:math="http://www.mathml.org">
<p>...<math:expr>...</math:expr>...</p>...</TEI>
The xml namespace is used by the TEI for the global attributes @xml:id and @xml:lang
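The following sketch shows how namespaces surface when a TEI fragment is processed. Python's standard `xml.etree.ElementTree` expands each name to `{namespace-URI}local-name`, so an unqualified search for `p` finds nothing. The fragment and the attribute values are invented for illustration, and it is not a complete TEI document.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"  # bound to the xml: prefix

DOC = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p xml:id="p1" xml:lang="it">Nel mezzo del cammin di nostra vita</p>
  </body></text>
</TEI>"""

root = ET.fromstring(DOC)

# Elements inherit the default namespace declared on <TEI>.
para = root.find(".//{%s}p" % TEI_NS)
unqualified = root.find(".//p")  # no namespace given, so no match

para_id = para.get("{%s}id" % XML_NS)
para_lang = para.get("{%s}lang" % XML_NS)
print(root.tag, para_id, para_lang)
```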
The XML SCHEMA
An XML schema language is a language for expressing constraints on XML documents. There are several different schema languages in widespread use, but the main ones are Document Type Definitions (DTDs), RELAX NG, Schematron and W3C XML Schema (XSD, XML Schema Definition).
TEI - Text Encoding Initiative
The Text Encoding Initiative Consortium is an international organization whose mission is to develop and maintain guidelines for the digital encoding of literary and linguistic texts. The Consortium publishes the Text Encoding Initiative Guidelines for Electronic Text Encoding and Interchange: an international and interdisciplinary standard that is widely used by libraries, museums, publishers, and individual scholars to represent all kinds of textual material for online research and teaching.
TEI - Text Encoding Initiative
www.tei-c.org
Guidelines
Tools
TEI - Text Encoding Initiative
<TEI>
<teiHeader>
<text>
TEI - Text Encoding Initiative
<body>
<head>
<lg>
<opener>
<l>
<l>
<l>
<p>
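The elements listed above can be put together in a minimal sketch of a TEI document: a header followed by a text whose body holds a heading, a line group with three verse lines, and a paragraph. All of the content is invented for illustration, and the document is only parsed here, not validated against a TEI schema.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

MINIMAL_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>A sample poem</title></titleStmt>
      <publicationStmt><p>Unpublished workshop example</p></publicationStmt>
      <sourceDesc><p>Invented for illustration</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <head>Sample</head>
      <lg>
        <l>First line</l>
        <l>Second line</l>
        <l>Third line</l>
      </lg>
      <p>A closing paragraph.</p>
    </body>
  </text>
</TEI>"""

doc = ET.fromstring(MINIMAL_TEI)

# Strip the "{namespace}" part to show the local names of the top-level parts.
top_level = [child.tag.split("}")[1] for child in doc]
verse_lines = [line.text for line in doc.findall(".//tei:lg/tei:l", NS)]
print(top_level)
print(verse_lines)
```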
XML syntax: the small print
What does it mean to be well-formed?
There is a single root node containing the whole of an XML document
Each subtree is properly nested within the root node
Element/attribute/etc. names are always case-sensitive
Start-tags and end-tags are always mandatory, except that an empty element may use a combined start-and-end tag, such as <lb/>
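These rules can be checked mechanically. The sketch below wraps the standard-library parser in a hypothetical helper, `is_well_formed`, and runs it on a few tiny fragments that follow or break the rules. It tests well-formedness only; it says nothing about validity against a schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

checks = {
    "properly nested": is_well_formed("<lg><l>one</l><l>two</l></lg>"),
    "combined tag": is_well_formed("<l>one<lb/>two</l>"),
    "overlapping tags": is_well_formed("<lg><l>one</lg></l>"),
    "case mismatch": is_well_formed("<head>Title</Head>"),
    "two roots": is_well_formed("<l>one</l><l>two</l>"),
}
for label, result in checks.items():
    print(label, result)
```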
TEI - Text Encoding Initiative
The TEI is not (just) a schema!
It is:
- an international consortium supported by libraries and universities, with a stable, large open source community
- a set of definitions, examples and discussion of several hundred useful and mostly textual distinctions
- a set of regularly maintained and updated recommendations: "The TEI Guidelines"
- a set of customizable tools and stylesheets for transformations to/from many formats (e.g. HTML, Word, PDF, Databases, RDF/Linked Data, Slides, ePub, Schemas, etc.)
- an archivally well-understood, consensus-based way of organizing and structuring textual (and other) resources
- an evolving history of the concerns of the digital humanities community
- whatever you make it... it is a community-developed standard.