Models and methodologies to represent the digital text.

Introduction to TEI: methods and languages for data management and display

Tiziana Mancinelli

Scholarly Editing and the Media Shift – Procedures and Theory

Verona, 8-9 September 2015

Aims of this workshop:

Introduce the concept of markup and XML encoding

Provide hands-on experience in using TEI-XML markup

Overview of XML-related technologies

Textual Markup

In order to talk about texts, markup and encoding of texts, we need to understand what we mean by these basic concepts.

When we talk about text encoding, what do we mean by a text?

What is in a text, and what assumptions do we make in reading it?

A text is not a document

- A "document" is something that exists in the world, which we can digitize.

- A "text" is an abstraction, created by or for a community of readers, which we can encode.

Encoding of texts...

A text is more than a sequence of encoded glyphs or lexical tokens

It has a structure and a communicative function

It also has multiple possible readings

Encoding, or markup, is a way of making all these things explicit

Why do we have to mark up?

To make explicit (to a machine) what is implicit (to a person)

To add value by supplying multiple annotations

To facilitate re-use of the same material in different formats, in different contexts, and by different users

Separation of form and content

Presentational markup cares more about fonts and layout than meaning

Descriptive markup says what things are, and leaves the rendition of them for a separate step

It also allows easy changes of presentation across a large number of documents
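
The contrast can be sketched with a small example. The element names (examples, italic, title) are hypothetical, chosen only for illustration and not taken from any particular schema:

```xml
<examples>
  <!-- Presentational: records only how the words should look -->
  <p>She had just finished reading <italic>Middlemarch</italic>.</p>

  <!-- Descriptive: records what the words are; a stylesheet decides
       later whether titles appear in italics, quotation marks, etc. -->
  <p>She had just finished reading <title>Middlemarch</title>.</p>
</examples>
```

With the descriptive version, changing how every title is displayed across a whole collection means editing one stylesheet rule rather than every document.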

Markup as a scholarly activity

The application of markup to a document can be an intellectual activity

In deciding what markup to apply, and how this represents the original, one is undertaking the task of an editor

There is (almost) no such thing as neutral markup – all of it involves interpretation

Markup can assist in answering research questions, and deciding what markup is needed to enable such questions to be answered can be a research activity in itself

Markup as a scholarly activity

Imagine you are going to mark up several thousand pages of complex material...

Which features are you going to mark up?

Why are you choosing to mark up this feature?

How reliably and consistently can you do this?

Now, imagine your budget has been halved.

XML

Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML also now plays an indispensable role in the exchange of a wide variety of data on the Web, on our phones, and elsewhere.

Its success means that general tools are ubiquitous and how it works is well understood.

XML

XML is structured data represented as strings of text

XML looks like HTML, except that:

XML is extensible

XML must be well-formed

XML can be validated

XML is application-, platform-, and vendor-independent

XML empowers the content provider and facilitates data integration

It is one of the best plain-text formats we have for the long-term preservation of textual data

You use XML almost every day, not only on the web but also in many devices and even in analogue information sources derived from it

XML

An XML document may contain:

- elements, possibly bearing attributes

- processing instructions

- comments

- entity references

- namespaces

An XML document must be well-formed and may be valid

XML

<?xml version="1.0" ?>

<root>

<element attribute="value">I like encoding!</element>

</root>

XML

- An XML document represents a (kind of) tree

- It has a single root and many nodes

- Each node can be:

  - a subtree

  - a single element (possibly bearing some attributes)

  - a string of character data

- Each element has a name or generic identifier

- XML element and attribute names are case sensitive
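
To make the tree concrete, here is a small sketch. The element names (poem, stanza, line) are invented for illustration, and the comments trace the shape of the tree:

```xml
<?xml version="1.0"?>
<!-- poem is the single root node -->
<poem type="couplet">
  <!-- stanza is a subtree: an element containing other elements -->
  <stanza n="1">
    <!-- each line is a single element holding a string of character data -->
    <line>So long as men can breathe or eyes can see,</line>
    <line>So long lives this, and this gives life to thee.</line>
  </stanza>
</poem>
```

Because names are case sensitive, an end-tag written as </Line> would not close a start-tag written as <line>.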

XML

- An XML document is encoded as a linear string of characters

- It usually begins with the XML declaration, which looks like a processing instruction

- Element occurrences are marked by start- and end-tags

- The characters < and & are magic and must always be escaped as &lt; and &amp; if you want to use them as themselves

- Comments are delimited by <!-- and -->

- Attribute name/value pairs are supplied on the start-tag and may be given in any order

- Entity references are delimited by & and ;
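
The following sketch pulls these syntax rules together. The element and attribute names (notes, note, n, type) are assumptions made for the example:

```xml
<?xml version="1.0"?>
<!-- A comment: not part of the document's character data -->
<notes>
  <!-- Attribute order is not significant: n and type may appear either way round -->
  <note n="1" type="gloss">Use &lt;p&gt; for paragraphs.</note>
  <note type="gloss" n="2">Smith &amp; Sons, printers.</note>
</notes>
```

After parsing, the first note reads "Use <p> for paragraphs." and the second "Smith & Sons, printers."; the entity references &lt;, &gt; and &amp; stand in for the literal characters.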

XML

The XML declaration
Namespace declarations
The root element of the document itself
Other elements and content
Attribute and value

<?xml version="1.0"?>

<greetings>

<hello type="ironic">hello world!</hello>

</greetings>

The XML declaration

An XML document should begin with an XML declaration (optional in XML 1.0, but strongly recommended),

which does up to three things:

- specifies that this is an XML document

- specifies which version of the XML standard it follows

- specifies which character encoding the document uses (if omitted, parsers assume UTF-8 or UTF-16)

<?xml version="1.0"?>

<?xml version="1.0" encoding="iso-8859-1" ?>

The XML namespace

All TEI documents are declared within the TEI namespace:

<TEI xmlns="http://www.tei-c.org/ns/1.0"> ...

</TEI>

XML documents can include elements declared in different namespaces.

- A namespace declaration associates a namespace prefix with an external URI-like identifier

- The default namespace may be declared using the xmlns attribute

- Other namespaces must all use a specially declared prefix

The xml namespace is used by the TEI for the global attributes @xml:id and @xml:lang
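
A short sketch of these rules in use. The second namespace URI and its ex prefix are hypothetical, and the fragment is well-formed but deliberately incomplete (it has no TEI header), so it is not a valid TEI document:

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:ex="http://www.example.org/ns/workshop">
  <!-- Unprefixed elements belong to the default (TEI) namespace -->
  <p xml:id="p1" xml:lang="it">Nel mezzo del cammin di nostra vita</p>
  <!-- The ex: prefix places this element in the second namespace -->
  <ex:remark>Checked against the print edition.</ex:remark>
</TEI>
```

The xml: prefix used by xml:id and xml:lang is built into XML and does not need to be declared.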

The XML SCHEMA

An XML schema language is a language for expressing constraints on XML documents. There are several different schema languages in widespread use, but the main ones are Document Type Definitions (DTDs), RELAX NG, Schematron and W3C XSD (XML Schema Definition).
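
As an illustration of what such constraints look like, here is a minimal RELAX NG schema in its XML syntax. It describes a hypothetical greetings vocabulary (one or more hello elements, each with an optional type attribute and text content); it is a sketch written for this purpose, not part of any published schema:

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <!-- The root element must be greetings -->
    <element name="greetings">
      <!-- It must contain at least one hello element -->
      <oneOrMore>
        <element name="hello">
          <!-- The type attribute may be present or absent -->
          <optional>
            <attribute name="type"/>
          </optional>
          <text/>
        </element>
      </oneOrMore>
    </element>
  </start>
</grammar>
```

A document can be well-formed without satisfying this schema; it is valid only if it also obeys these constraints.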

TEI - Text Encoding Initiative

The Text Encoding Initiative Consortium is an international organization whose mission is to develop and maintain guidelines for the digital encoding of literary and linguistic texts. The Consortium publishes the Text Encoding Initiative Guidelines for Electronic Text Encoding and Interchange: an international and interdisciplinary standard that is widely used by libraries, museums, publishers, and individual scholars to represent all kinds of textual material for online research and teaching.

TEI - Text Encoding Initiative

www.tei-c.org

Guidelines

Tools

TEI - Text Encoding Initiative

<TEI>

<text>

TEI - Text Encoding Initiative

<body>

<head>

<lg>

<l>

<p>
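
The elements named above fit together roughly as follows. This is a minimal sketch: the teiHeader and its children are required by the TEI even though they are not listed here, and the titles and text content are placeholders invented for the example:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A sample document</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished workshop example.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Born digital.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <head>A heading</head>
      <!-- lg groups lines of verse; each l is one line -->
      <lg>
        <l>A first line of verse,</l>
        <l>and a second line of verse.</l>
      </lg>
      <!-- p is a prose paragraph -->
      <p>A paragraph of prose.</p>
    </body>
  </text>
</TEI>
```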

XML syntax: the small print

What does it mean to be well-formed?

There is a single root node containing the whole of an XML document

Each subtree is properly nested within the root node

Element, attribute, and other names are always case sensitive

Start-tags and end-tags are always mandatory, except that an empty element may be written as a single combined start-and-end tag (ending in />)
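
A brief sketch of these rules. The element names are borrowed from the TEI (text, p, hi, pb), but the fragment is meant only to show well-formedness and is not a valid TEI document:

```xml
<?xml version="1.0"?>
<text>
  <!-- Properly nested: hi opens and closes inside p -->
  <p>A <hi>well-formed</hi> paragraph.</p>
  <!-- An empty element written as one combined start-and-end tag -->
  <p>First page<pb n="2"/>second page.</p>
  <!-- NOT well-formed (overlapping tags), so shown only in a comment:
       <p>A <hi>badly nested</p> paragraph.</hi> -->
</text>
```

An XML parser would reject the overlapping example outright, before any schema is consulted.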

TEI - Text Encoding Initiative

The TEI is not (just) a schema!

It is:

- an international consortium supported by libraries and universities, with a stable, large open-source community

- a set of definitions, examples and discussion of several hundred useful and mostly textual distinctions

- a set of regularly maintained and updated recommendations: 'The TEI Guidelines'

- a set of customizable tools and stylesheets for transformations to/from many formats (e.g. HTML, Word, PDF, databases, RDF/Linked Data, slides, ePub, schemas, etc.)

- an archivally well-understood, consensus-based way of organizing and structuring textual (and other) resources

- an evolving history of the concerns of the digital humanities community

- whatever you make it... it is a community-developed standard.