TEI Metadata

Dr James Cummings

@jamescummings

http://slides.com/jamescummings/tei-metadata

What is Metadata?

  • often called "data about data"
  • term originally used only with electronic data but its meaning has broadened
  • data about the content, context, and structure of information resources
  • the catalogue record of the data/text/edition

Some examples: 

  • Purpose of the data
  • Means of creation of the data
  • Time and date of creation
  • Creator or author of the data
  • Where the data was created
  • Standards used in creating the data
  • Size of the data in useful units
  • Related or supplemental data
  • Last revision date of the data
  • Stage of production of the data

TEI Metadata

  • TEI requires some of its metadata to be stored inside the XML document, prefixed to the content.
  • This information makes up the TEI header, although some of it can be included inside the <body> or pointed to outside the document. The header:
    • stores bibliographical information about both the electronic version(s) of the text and any physical, or analogue, source(s)
    • holds basic information similar to library cataloguing, which supports interoperability with other metadata standards
    • works much like an electronic version of a title page attached to a printed work

The <teiHeader>

  • The TEI header was designed with two goals in mind:

    • needs of bibliographers and librarians trying to document what were called 'electronic books'

    • needs of text analysts and digital editors trying to document ‘coding practices’ within digital resources

  • The result is that discussion of the header tends to be pulled in two directions...

  • Where can I read about this? In the TEI Guidelines:

    • Chapter 2: The TEI Header

    • Chapter 10: Manuscript Description

Librarian's Header

  • Conforms to standard bibliographic models
  • Easily mapped to METS/EAD/MARC and other library metadata formats
  • Based on TEI for Libraries Special Interest Group
  • Pressure for more specific constraints
  • Prefers structured data over loose prose

Editor's Header

  • Polite nod to bibliographic practices
  • Supports a (potentially) huge range of miscellaneous information
  • Different codes of practice in different communities
  • Often concerned with editorial principles
  • Mixture of tightly controlled data and loose prose

Most headers are somewhere between the two

<teiHeader>

Structure of a <teiHeader>

The TEI header has four main components:

  • <fileDesc> (file description) contains a full bibliographic description of the file 
  • <encodingDesc> (encoding description) documents the relationship between an electronic text and the source or sources from which it was derived
  • <profileDesc> (text-profile description) provides a detailed description of non-bibliographic aspects of a text, specifically the languages and sublanguages used, the situation in which it was produced, and the participants and their setting
  • <revisionDesc> (revision description) summarizes the revision history for a file

Only <fileDesc> is required -- the others are optional!

A Minimal Header

  • Only <fileDesc> is required, and inside it only <titleStmt> (with a <title>), <publicationStmt>, and <sourceDesc>
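
A minimal header might look like the sketch below. The title and the prose statements are invented placeholders, not taken from any real project.

```xml
<teiHeader>
  <fileDesc>
    <titleStmt>
      <!-- hypothetical title: it names the electronic file, not its source -->
      <title>An Invented Diary: a digital edition</title>
    </titleStmt>
    <publicationStmt>
      <!-- prose is allowed in place of publisher, distributor or authority -->
      <p>Unpublished working draft.</p>
    </publicationStmt>
    <sourceDesc>
      <p>Transcribed from a hypothetical manuscript diary in private hands.</p>
    </sourceDesc>
  </fileDesc>
</teiHeader>
```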

Two Levels of Header

<teiHeader>: Required vs Optional Components

/TEI/teiHeader/fileDesc

The <fileDesc> element has some mandatory elements:

  • <titleStmt>: provides a title for the resource and any associated statements of responsibility
  • <sourceDesc>: documents the sources from which the encoded text derives (if any)
  • <publicationStmt>: documents how the encoded text is published or distributed

and some optional ones such as:

  • <editionStmt>: yes, digital texts have editions too
  • <seriesStmt>: and they also fit into "series"
  • <extent>: how many words, gigabytes, volumes, files?
  • <notesStmt>: notes of various types

More About <fileDesc>

  • <titleStmt>: contains a mandatory <title> which identifies the electronic file (not its source!)
    • optionally followed by additional titles, and by ‘statements of responsibility’, as appropriate, using <author>, <editor>, <sponsor>, <funder>, <principal> or the generic <respStmt>
  • <publicationStmt>: may contain
    • <p> to give prose (e.g. to say the text is unpublished) or
    • one or more <publisher>, <distributor>, <authority>, each followed by <pubPlace>, <address>, <availability>, <idno> etc.

/TEI/teiHeader/fileDesc/titleStmt
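
As a rough illustration, a <titleStmt> with a title and several statements of responsibility could be encoded as follows. All names and the funding body are hypothetical.

```xml
<titleStmt>
  <!-- the title identifies the electronic file, not the source text -->
  <title>Letters of an Imaginary Correspondent: an electronic edition</title>
  <author>Imaginary Correspondent</author>
  <editor>A. N. Editor</editor>
  <funder>Hypothetical Research Council</funder>
  <!-- respStmt covers any responsibility without a dedicated element -->
  <respStmt>
    <resp>TEI encoding</resp>
    <name>E. N. Coder</name>
  </respStmt>
</titleStmt>
```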

/TEI/teiHeader/fileDesc/publicationStmt

  • Mandatory element
  • At least one of <publisher>, <distributor>, or <authority> must be present, unless the entire publication statement is given as prose paragraphs using <p>
  • If the creation date differs from the date of publication, give the creation date within <profileDesc>, not in the <publicationStmt>
  • A formal licence can be recorded in a <licence> element inside <availability>

Example <publicationStmt>
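
One possible structured <publicationStmt> is sketched here. The publisher, place, date and identifier are invented; only the Creative Commons URL refers to a real licence.

```xml
<publicationStmt>
  <!-- hypothetical publishing body and details -->
  <publisher>Example University Press</publisher>
  <pubPlace>Exampletown</pubPlace>
  <date when="2015">2015</date>
  <idno type="local">EX-0001</idno>
  <availability status="free">
    <licence target="https://creativecommons.org/licenses/by/4.0/">
      Distributed under a Creative Commons Attribution 4.0 International licence.
    </licence>
  </availability>
</publicationStmt>
```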

/TEI/teiHeader/fileDesc/notesStmt

The optional <notesStmt> can contain notes on almost any aspect of the file or its contents:

  • These notes can be short statements, or many paragraphs long.
  • Where possible, take care to encode such information with more precise elements elsewhere in the TEI header
  • For example, text types, such as 'reportage' or 'detective fiction', should be described under <profileDesc>
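
A short <notesStmt> might look like this sketch. Both notes are invented, and the @type value is an arbitrary label chosen for illustration.

```xml
<notesStmt>
  <!-- hypothetical notes: anything with a more precise home elsewhere
       in the header should be recorded there instead -->
  <note>Transcribed from a microfilm copy; the original was not consulted.</note>
  <note type="funding">Prepared during a six-month pilot project.</note>
</notesStmt>
```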

/TEI/teiHeader/fileDesc/sourceDesc

All electronic works need to document their source, even 'born digital' ones! The <sourceDesc> can have:

  • a prose description, just a <p>
  • <bibl> (bibliographic citation): contains free text and/or any mixture of bibliographic elements such as <author>, <publisher> etc.
  • <biblStruct> (structured bibliographic citation): contains similar elements but constrained in various ways according to bibliographic standards
  • <listBibl>: may be used for lists of such descriptions, e.g. bibliographies
  • Specialised elements for spoken texts (<recordingStmt> etc.) and for manuscripts (<msDesc>)
  • Authority lists: <listPerson>, <listPlace>, <listOrg>, if these are not stored elsewhere

Example <sourceDesc>
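
The sketch below contrasts a loose <bibl> with a constrained <biblStruct> inside one <sourceDesc>. Both sources, and every name and date in them, are hypothetical.

```xml
<sourceDesc>
  <!-- loose citation: free text mixed with bibliographic elements -->
  <bibl>
    <author>Imaginary Author</author>, <title>An Invented Novel</title>.
    <pubPlace>London</pubPlace>: <publisher>Example Books</publisher>,
    <date when="1890">1890</date>.
  </bibl>
  <!-- structured citation: same kind of information, fixed structure -->
  <biblStruct>
    <monogr>
      <author>Another Imaginary Author</author>
      <title>A Second Invented Novel</title>
      <imprint>
        <pubPlace>Edinburgh</pubPlace>
        <publisher>Sample Press</publisher>
        <date when="1902">1902</date>
      </imprint>
    </monogr>
  </biblStruct>
</sourceDesc>
```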

Or your <sourceDesc> could have one or more <msDesc> elements

/TEI/teiHeader/encodingDesc

<encodingDesc> groups notes about the procedures used when the text was encoded, either summarised in prose or within specific elements such as

  • <projectDesc>: goals of the project
  • <samplingDecl>: sampling principles
  • <editorialDecl>: editorial principles,
    • e.g. <correction>, <hyphenation>, <interpretation>, <normalization>, <punctuation>, <quotation>, <segmentation>
  • <classDecl>: classification system/s used
  • <tagsDecl>: specifics about usage of particular elements

Detailed notes in <encodingDesc> could be used to generate a section of an editorial description.

Example <encodingDesc>
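
An <encodingDesc> combining prose with a few of the specific elements might be sketched as follows. The project and its editorial policies are invented for illustration.

```xml
<encodingDesc>
  <projectDesc>
    <p>Texts were encoded for a hypothetical study of Victorian letter writing.</p>
  </projectDesc>
  <samplingDecl>
    <p>Every tenth letter in the collection was transcribed in full.</p>
  </samplingDecl>
  <editorialDecl>
    <!-- method="markup": corrections are visible in the encoding -->
    <correction method="markup">
      <p>Obvious slips of the pen are recorded together with their corrections.</p>
    </correction>
    <!-- eol="some": only some end-of-line hyphenation is retained -->
    <hyphenation eol="some">
      <p>End-of-line hyphens are kept only where the word is normally hyphenated.</p>
    </hyphenation>
    <normalization method="markup">
      <p>Original spellings are kept, with regularised forms supplied alongside.</p>
    </normalization>
  </editorialDecl>
</encodingDesc>
```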

/TEI/teiHeader/encodingDesc/classDecl
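
A <classDecl> usually holds one or more taxonomies. The sketch below defines a small, hypothetical genre taxonomy; the identifiers are invented.

```xml
<classDecl>
  <taxonomy xml:id="genre">
    <category xml:id="genre.reportage">
      <catDesc>Reportage</catDesc>
    </category>
    <category xml:id="genre.detective">
      <catDesc>Detective fiction</catDesc>
    </category>
  </taxonomy>
</classDecl>
```

A text could then point at one of these categories from <textClass> in its <profileDesc>, for example with <catRef target="#genre.detective"/>.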

/TEI/teiHeader/encodingDesc/tagsDecl

  • <tagsDecl> records the element namespace, tag frequency, information about the usage of particular tags not specified elsewhere, and the default rendition of the text in the source.
  • <rendition>: structured information about appearance in the source document
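
A small <tagsDecl> might look like the sketch below. The counts, the identifier and the usage note are invented; the namespace is the standard TEI one.

```xml
<tagsDecl>
  <!-- a rendition described in CSS, which elements can point to -->
  <rendition xml:id="r-italic" scheme="css">font-style: italic;</rendition>
  <namespace name="http://www.tei-c.org/ns/1.0">
    <!-- hypothetical frequency: 42 occurrences of hi in the text -->
    <tagUsage gi="hi" occurs="42">Used only for words italicised in the source.</tagUsage>
  </namespace>
</tagsDecl>
```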

/TEI/teiHeader/profileDesc

The <profileDesc> contains a collection of descriptions, categorised only as ‘non-bibliographic’. Default members of the model.profileDescPart class include:

  • <creation>: information about the origination of the intellectual content of the text, e.g. time and place
  • <langUsage>: information about languages, registers, writing systems etc used in the text
  • <textDesc> and <textClass>: classifications applied to the text by means of a list of specified criteria or by means of a collection of pointers 
  • <particDesc> and <settingDesc>: information about the ‘participants’, either real or depicted, in the text, and about the settings in which they interact
  • <handNotes>: information about the particular style or hand distinguished within a manuscript when not giving full manuscript description

/TEI/teiHeader/profileDesc/creation (& particDesc)
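
As an illustration, <creation> and <particDesc> could be combined as in this sketch. The date, place and person are hypothetical.

```xml
<profileDesc>
  <creation>
    <!-- when and where the text itself originated -->
    <date when="1850-06">June 1850</date>
    <placeName>Exampletown</placeName>
  </creation>
  <particDesc>
    <listPerson>
      <person xml:id="P1">
        <persName>Imaginary Correspondent</persName>
        <occupation>Schoolteacher</occupation>
      </person>
    </listPerson>
  </particDesc>
</profileDesc>
```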

/TEI/teiHeader/profileDesc/langUsage

The <langUsage> element is provided to document usage of languages and writing systems in the text. Languages are identified by their ISO codes:
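
A <langUsage> for a text that is mostly English with some Latin might be sketched like this. The percentages in @usage are invented.

```xml
<langUsage>
  <!-- ident carries the language code; usage is an approximate percentage -->
  <language ident="en" usage="90">English</language>
  <language ident="la" usage="10">Latin</language>
</langUsage>
```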

/TEI/teiHeader/profileDesc/textDesc

<textDesc> provides a description of a text in terms of its 'situational parameters', that is, the situation within which the text was produced or experienced.
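
The sketch below follows the pattern of the example in the TEI Guidelines, adapted to a hypothetical personal letter. The attribute values chosen are assumptions for illustration.

```xml
<textDesc n="Informal personal letter">
  <!-- mode="w": a written channel -->
  <channel mode="w">handwritten letter</channel>
  <constitution type="single"/>
  <derivation type="original"/>
  <domain type="domestic">family news</domain>
  <factuality type="fact"/>
  <interaction type="none"/>
  <preparedness type="spontaneous"/>
  <purpose type="inform" degree="high"/>
</textDesc>
```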

/TEI/teiHeader/revisionDesc

  • Inside <revisionDesc> you find a list of <change> elements, usually each with @when and @who attributes, indicating significant stages in the evolution of a document.
  • Conventionally, the most recent change is given first.
  • The changes can be grouped in a <listChange> element. Used here, it is about the electronic file; used in <creation>, it is about the stages of textual production.
  • Can be maintained manually, or by means of a version control system (like Subversion or Git)
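
A hand-maintained <revisionDesc> might look like this sketch. The dates and changes are invented, and #ENC is a hypothetical pointer to a person defined elsewhere in the file.

```xml
<revisionDesc>
  <!-- most recent change first -->
  <change when="2015-03-02" who="#ENC">Corrected transcription errors in the second letter.</change>
  <change when="2015-01-15" who="#ENC">Created the initial TEI file.</change>
</revisionDesc>
```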

Manuscript Description

About <msDesc>

The TEI <msDesc> element is intended for several different kinds of applications:

  • standalone database of library records (finding aid)
  • discursive text collecting many records (catalogue raisonné)
  • metadata component within a digital surrogate (electronic edition)
  • tool for ‘quantitative codicology’

 

Manuscript description in the TEI caters for two conflicting desires:

  • preserve (or perpetuate) existing descriptive prose
  • reliable search, retrieval, and analysis of data

The <msDesc> tries, wherever possible, to enable both of these approaches.

Inside <msDesc>

  • One or more <p> paragraphs or more structured elements:
    • <msIdentifier>: information identifying this manuscript
    • <msContents>: a list of the intellectual content of the manuscript
    • <physDesc>: groups information concerning all physical aspects of the manuscript
    • <history>: provides information on the history of the manuscript, its origin, provenance and acquisition by the current holding institution
    • <additional>: groups other information about the manuscript (e.g. administrative information relating to its availability, custodial history, surrogates)
    • <msPart>: parts of a composite manuscript 
    • <msFrag>: fragments of a scattered manuscript 

msDesc
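
The overall shape of a structured <msDesc> can be sketched as follows, with each component reduced to a single paragraph. The library, shelfmark and all descriptive details are hypothetical.

```xml
<msDesc>
  <msIdentifier>
    <settlement>Exampletown</settlement>
    <repository>Example Library</repository>
    <idno>MS 1</idno>
  </msIdentifier>
  <msContents>
    <p>A collection of sermons.</p>
  </msContents>
  <physDesc>
    <p>Parchment codex of 48 leaves.</p>
  </physDesc>
  <history>
    <p>Written in the fifteenth century; purchased by the library in 1923.</p>
  </history>
  <additional>
    <surrogates>
      <p>Digital images are available in the reading room.</p>
    </surrogates>
  </additional>
</msDesc>
```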

msDesc/msIdentifier

The <msIdentifier> element follows the traditional three-part specification of a manuscript's location:

  • place: <country>, <region>, <settlement>
  • repository: <institution>, <repository>
  • identifier: <collection>, <idno>, <altIdentifier>

msDesc/msIdentifier
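
The three parts might be encoded as in the sketch below. The places, institution and shelfmarks are all invented.

```xml
<msIdentifier>
  <!-- place -->
  <country>United Kingdom</country>
  <region>Exampleshire</region>
  <settlement>Exampletown</settlement>
  <!-- repository -->
  <institution>University of Exampletown</institution>
  <repository>Example Library</repository>
  <!-- identifier -->
  <collection>Sample Collection</collection>
  <idno>MS 1</idno>
  <altIdentifier type="former">
    <idno>Old Catalogue 27</idno>
  </altIdentifier>
</msIdentifier>
```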

msDesc/msContents

The <msContents> element contains information about the intellectual content of the manuscript. Multiple <msItem> elements provide a detailed table of contents.

Example <msContents>
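
A two-item <msContents> could be sketched like this. The authors, titles and folio ranges are hypothetical.

```xml
<msContents>
  <summary>A hypothetical miscellany containing two devotional texts.</summary>
  <msItem n="1">
    <locus from="1r" to="24v">ff. 1r-24v</locus>
    <author>Imaginary Author</author>
    <title>An Invented Treatise</title>
    <textLang mainLang="la">Latin</textLang>
  </msItem>
  <msItem n="2">
    <locus from="25r" to="48v">ff. 25r-48v</locus>
    <title>An Anonymous Invented Sermon</title>
    <textLang mainLang="en">English</textLang>
  </msItem>
</msContents>
```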

msDesc/physDesc

The <physDesc> element records any information concerning the physicality or materiality of the manuscript.

If using the structured form this might include:

  • The physical carrier: <objectDesc>
  • What it carries: <handDesc>, <scriptDesc>, <typeDesc>
  • Special features: <additions>, <decoDesc>, <musicNotation>
  • External things: <bindingDesc>, <sealDesc>, <accMat>

 

Example <physDesc>
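
In its simplest structured form, each component of <physDesc> can just hold prose, as in this sketch. The manuscript described is invented.

```xml
<physDesc>
  <objectDesc>
    <p>Parchment codex of 48 leaves, in good condition.</p>
  </objectDesc>
  <handDesc>
    <p>Written by a single scribe in a book hand.</p>
  </handDesc>
  <decoDesc>
    <p>Red and blue initials throughout.</p>
  </decoDesc>
  <bindingDesc>
    <p>Nineteenth-century brown leather over boards.</p>
  </bindingDesc>
</physDesc>
```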

msDesc/physDesc/objectDesc

<objectDesc> gives a way to describe the support, foliation, collation, condition, layouts, and more.
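
A more finely structured <objectDesc> might look like the sketch below. The measurements, quire structure and layout are hypothetical.

```xml
<objectDesc form="codex">
  <supportDesc material="parch">
    <support>Parchment.</support>
    <extent>48 leaves
      <dimensions type="leaf" unit="mm">
        <height>250</height>
        <width>170</width>
      </dimensions>
    </extent>
    <foliation>Modern pencil foliation, 1-48.</foliation>
    <collation>Six quires of eight leaves.</collation>
    <condition>Some water damage to the final quire.</condition>
  </supportDesc>
  <layoutDesc>
    <layout columns="1" writtenLines="30">Single column of 30 lines.</layout>
  </layoutDesc>
</objectDesc>
```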

msDesc/physDesc/handDesc (& typeDesc & scriptDesc)
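
A <handDesc> distinguishing two hands could be sketched as follows. The hands, their identifiers and the folio ranges are invented.

```xml
<handDesc hands="2">
  <!-- scope records how much of the manuscript each hand wrote -->
  <handNote xml:id="hand1" scope="major">
    The main text, <locus from="1r" to="40v">ff. 1r-40v</locus>, in a regular book hand.
  </handNote>
  <handNote xml:id="hand2" scope="minor">
    Later additions, <locus from="41r" to="48v">ff. 41r-48v</locus>, in a cursive hand.
  </handNote>
</handDesc>
```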

msDesc/physDesc/musicNotation (& decoDesc & additions)

msDesc/physDesc/bindingDesc (& sealDesc & accMat)

msDesc/history

 <history> groups elements describing the full history of a manuscript or manuscript part.

 

  • <origin>: where it all began
  • <provenance>: everything in between
  • <acquisition>: how you acquired it

 

<origin> is a member of att.datable, so it has all the usual dating attributes. It can also contain the special-purpose elements <origDate> and <origPlace> to record the manuscript's date and place of origin.

 

Example <history>
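
A <history> covering all three stages might be encoded as in this sketch. The places, dates, owner and library are hypothetical.

```xml
<history>
  <origin>
    <p>Written in <origPlace>northern France</origPlace> in the
      <origDate notBefore="1250" notAfter="1300">second half of the
      thirteenth century</origDate>.</p>
  </origin>
  <provenance>
    <p>Owned in the seventeenth century by an unidentified collector,
      whose initials appear on the first leaf.</p>
  </provenance>
  <acquisition>
    <p>Purchased by the Example Library in <date when="1923">1923</date>.</p>
  </acquisition>
</history>
```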

msDesc/additional

  • <additional> groups additional information, combining bibliographic information about a manuscript, or surrogate copies of it, with curatorial or administrative information:
    • <adminInfo>: administrative information
    • <surrogates>: information about other surrogates (e.g. photographs, microfilms, digital images) etc.
    • <listBibl>: bibliography of works concerning the manuscript

Example <additional>
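
One way of filling in <additional> is sketched below. The catalogue, access conditions and bibliography entry are all invented.

```xml
<additional>
  <adminInfo>
    <recordHist>
      <source>Description adapted from a hypothetical printed catalogue.</source>
    </recordHist>
    <availability status="restricted">
      <p>Available to readers by appointment only.</p>
    </availability>
    <custodialHist>
      <p>Rebound and conserved in 1985.</p>
    </custodialHist>
  </adminInfo>
  <surrogates>
    <p>Microfilm made in 1960; digital images made in 2014.</p>
  </surrogates>
  <listBibl>
    <bibl><author>Imaginary Scholar</author>, <title>Notes on an Invented Manuscript</title>,
      <date when="1990">1990</date>.</bibl>
  </listBibl>
</additional>
```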

msDesc/msPart

  • <msPart>: to describe individual parts of a composite manuscript
  • <msFrag>: to describe manuscript fragments as part of a virtual whole
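
A composite manuscript with two parts might be described as in this sketch, assuming a recent version of the TEI schema in which <msPart> carries its own <msIdentifier>. The library and shelfmarks are hypothetical.

```xml
<msDesc>
  <msIdentifier>
    <settlement>Exampletown</settlement>
    <repository>Example Library</repository>
    <idno>MS 10</idno>
  </msIdentifier>
  <!-- each part gets its own identifier and description -->
  <msPart>
    <msIdentifier>
      <idno>MS 10, part A</idno>
    </msIdentifier>
    <p>Fourteenth-century psalter, ff. 1-60.</p>
  </msPart>
  <msPart>
    <msIdentifier>
      <idno>MS 10, part B</idno>
    </msIdentifier>
    <p>Fifteenth-century calendar, ff. 61-72, bound in at a later date.</p>
  </msPart>
</msDesc>
```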

TEI header module elements:

TEI manuscript description module elements:

TEI Metadata with Manuscript Description

By James Cummings

A workshop presentation of TEI Metadata and Manuscript Description
