Introducing the UK National Archives Metadata Vocabulary

Robert Walpole

Devexe Limited

Photo by Robert Walpole

Disclaimer

This presentation is in no way intended to express the views or opinions of the UK National Archives. It is solely the work of Robert Walpole, an employee of Devexe Limited, which was contracted to assist in the development of the Digital Records Infrastructure at The National Archives (TNA) in Kew, London, England.

What's in The UK National Archives?

Over 11 million historical government and public records.

Documents, maps, files and various kinds of images covering over 1000 years of history.

Photo by The National Archives

The 20(30) year rule and the digital tsunami

By 2025 it is expected that the Archives will receive almost exclusively born-digital records.

Digital preservation is not easy...

Photo by Regregex

What are TNA doing about this?

Digital Records Infrastructure (DRI)

Photo by Derrick Coetzee

DRI and OAIS

OAIS - Open Archival Information System

Reference model developed by the Consultative Committee for Space Data Systems (CCSDS), which became ISO 14721 in 2003

  1. Negotiate for and accept information
  2. Obtain sufficient control of the information
  3. Determine the scope of the user community
  4. Ensure information is independently understandable
  5. Authenticate copies and preserve against contingencies
  6. Make the preserved information available to users

Six mandatory principles (summary)

SIPs, AIPs and DIPs

  • Submission Information Package
  • Archival Information Package
  • Dissemination Information Package

But what exactly is a SIP?

(or AIP or DIP for that matter)

SIPs, AIPs and DIPs

The digital objects (files and folders) for preservation..

..together with metadata about these objects.

XML Information Package (XIP)

  • XML is plain text
  • Plain text has proved to be very durable
  • No special architecture, format or encoding
  • Minimal processing required to view
  • XML is very widely used
  • XML is well established and widely recognised

Why XML?

<XIP xmlns="http://www.tessella.com/XIP/v4">
    <Collections>
        <Collection status="same">
            <CollectionRef>d94bcdd5-ea94-473f-9f3b-008fa93caeb8</CollectionRef>
            <CollectionTypeRef>1</CollectionTypeRef>
            <CollectionCode>LEVES</CollectionCode>
            <Title>Inquiry into the Culture, Practices and Ethics of the Press
                (The Leveson Inquiry)</Title>
            <SecurityTag>open</SecurityTag>
        </Collection>
    </Collections>
    <DeliverableUnits>
        <DeliverableUnit status="same">
            <DeliverableUnitRef>3140421b-02c3-9a06-1a197c497ba8</DeliverableUnitRef>
            <CollectionRef>d94bcdd5-ea94-9f3b-008fa93caeb8</CollectionRef>
            <AccessionRef>58310a0f-5fd4-b565-2ac46fff4d59</AccessionRef>
            <AccumulationRef>025519fa-4ca3-bdbd-4538f0f44f5e</AccumulationRef>
            <CatalogueReference>LEV/2/CHN2/Z</CatalogueReference>
            <CoverageFrom>2014-02-11T21:16:09.380Z</CoverageFrom>
            <CoverageTo>2014-02-11T21:16:09.380Z</CoverageTo>
            <Title>LEV 2</Title>
            <SecurityTag>open</SecurityTag>
            <Metadata/>
            

XML Information Package (XIP)

CSV Metadata

  • Title - a meaningful file or folder name
  • Identifier - URI of original filepath
  • Date - usually last modified in ISO 8601
  • Folder or File - is the record a folder or file?
  • Checksum - the original SHA-256 checksum
  • Copyright - usually Crown Copyright
<row>
    <elem name="title">Letter.doc</elem>
    <elem name="identifier">file:/T:/LEV_3/Letter.doc</elem>
    <elem name="date">2013-05-13T14:26:56</elem>
    <elem name="folder">file</elem>
    <elem name="checksum">file</elem>
    <elem name="copyright">Crown copyright</elem>
</row>

CSV metadata as XML
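
The CSV-to-XML step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the DRI implementation: the function names `sha256_hex` and `row_to_xml` are hypothetical, the CSV column names simply mirror the `name` attributes in the sample row, and the checksum is computed over placeholder bytes because the real file content is not available here.

```python
import csv
import hashlib
import io
import xml.etree.ElementTree as ET


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of some bytes as lowercase hex."""
    return hashlib.sha256(data).hexdigest()


def row_to_xml(row: dict) -> ET.Element:
    """Turn one CSV row (column name -> value) into a <row> element."""
    row_el = ET.Element("row")
    for name, value in row.items():
        # Each CSV column becomes <elem name="column">value</elem>.
        elem = ET.SubElement(row_el, "elem", {"name": name})
        elem.text = value
    return row_el


# Hypothetical input: placeholder bytes stand in for the original file.
checksum = sha256_hex(b"example file content")
csv_text = (
    "title,identifier,date,folder,checksum,copyright\n"
    "Letter.doc,file:/T:/LEV_3/Letter.doc,2013-05-13T14:26:56,file,"
    f"{checksum},Crown copyright\n"
)

# csv.DictReader yields one dict per data row, keyed by the header line.
rows = [row_to_xml(r) for r in csv.DictReader(io.StringIO(csv_text))]
xml_out = ET.tostring(rows[0], encoding="unicode")
print(xml_out)
```

Keeping one generic `elem` element per column means new CSV columns need no change to the conversion code, at the cost of element names that carry no meaning of their own.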

Merging CSV/XML Metadata

XIP Metadata Version 1.0

<dcterms:identifier>WO/409/27/1</dcterms:identifier>

<Metadata/> element defined as follows:

"Arbitrary contents, which may conform to an XML Schema. Used to store extra metadata, particularly descriptive metadata... Allows for controlled extension of the schema"

<dcterms:identifier xsi:type="tnacat:itemIdentifier">
    <departmentCode>WO</departmentCode>
    <seriesCode>409</seriesCode>
    <pieceCode>27/1</pieceCode>
    <itemCode>1</itemCode>
</dcterms:identifier>

XIP Metadata Version 1.0

  • Validation not possible 
  • Schemas only act as documentation 

DRI Catalogue

  • W3C Standards - RDF, SPARQL, OWL, Turtle
  • Apache Jena Framework (ARQ, Fuseki, TDB)
  • Linked Data API
  • OWL Ontology (DRI Vocabulary)

Provides a persistent inventory and process control system for DRI based on Semantic Web technologies:

https://goo.gl/NexoRC

<rdf:Description rdf:about="http://example.org/book/1234">
    <ex:title>A Good Book</ex:title>
</rdf:Description>
<rdf:Description rdf:about="http://example.org/book/1234"
    ex:title="A Good Book"/>

"RDF/XML never became popular with XML people because of the potential difficulty and complexity in processing it.." Bob DuCharme (Learning SPARQL)

RDF/XML
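
The processing difficulty is easier to see in code. The two `rdf:Description` snippets above look different as XML but state the same RDF triple. The Python sketch below extracts triples from both forms using only the standard library. It is a minimal illustration, not a full RDF/XML parser: it handles only literal-valued properties on `rdf:Description`, and the namespace bound to `ex:` (`EX_NS`) and the function names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX_NS = "http://example.org/terms/"  # hypothetical namespace for the ex: prefix

WRAPPER = '<rdf:RDF xmlns:rdf="{rdf}" xmlns:ex="{ex}">{body}</rdf:RDF>'

ELEMENT_FORM = (
    '<rdf:Description rdf:about="http://example.org/book/1234">'
    "<ex:title>A Good Book</ex:title>"
    "</rdf:Description>"
)
ATTRIBUTE_FORM = (
    '<rdf:Description rdf:about="http://example.org/book/1234" '
    'ex:title="A Good Book"/>'
)


def iri(clark_name: str) -> str:
    """Convert ElementTree's '{namespace}local' form into a plain IRI."""
    namespace, local = clark_name[1:].split("}", 1)
    return namespace + local


def literal_triples(body: str) -> set:
    """Collect (subject, predicate, literal) triples from rdf:Description."""
    root = ET.fromstring(WRAPPER.format(rdf=RDF_NS, ex=EX_NS, body=body))
    triples = set()
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        # Form 1: properties written as child elements.
        for child in desc:
            triples.add((subject, iri(child.tag), child.text))
        # Form 2: properties written as attributes outside the rdf: namespace.
        for name, value in desc.attrib.items():
            if not name.startswith(f"{{{RDF_NS}}}"):
                triples.add((subject, iri(name), value))
    return triples


same = literal_triples(ELEMENT_FORM) == literal_triples(ATTRIBUTE_FORM)
print(same)
```

An XPath or XSLT expression written against one form will silently miss the other, which is why constraining RDF/XML to a single predictable shape matters for XML tooling.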

What's in a digital archive?

Photo by Jason Coleman

Four types of digital record identified...

...and five categories of metadata

There may be more in the future...

RDF/XML Metadata

<tna:BornDigitalRecord rdf:about="http://example.org/66/LEV/2/D4SL/Z">
    <tna:legalStatus>Public Record</tna:legalStatus>
</tna:BornDigitalRecord>

<tna:BornDigitalRecord rdf:about="http://example.org/66/LEV/2/D4SL/Z">
    <tna:legalStatus rdf:resource="http://example.org/Public_record"/>
</tna:BornDigitalRecord>
<dcterms:identifier xsi:type="tnacat:itemIdentifier">
    <departmentCode>WO</departmentCode>
    <seriesCode>409</seriesCode>
    <pieceCode>27/1</pieceCode>
    <itemCode>1</itemCode>
</dcterms:identifier>

XIP Metadata Version 1.0 vs 2.0

<tna:BornDigitalRecord rdf:about="http://example.org/66/LEV/2/D4SL/Z">
    <tna:cataloguing>
        <tna:Cataloguing>
            <tna:departmentIdentifier>WO</tna:departmentIdentifier>
            <tna:seriesIdentifier>409</tna:seriesIdentifier>
            <tna:pieceIdentifier>27/1</tna:pieceIdentifier>
            <tna:itemIdentifier>1</tna:itemIdentifier>
        </tna:Cataloguing>
    </tna:cataloguing>
</tna:BornDigitalRecord>
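
Because the version 2.0 structure uses a fixed nesting and namespaced element names, ordinary XML tooling can read it without an RDF library. The Python sketch below pulls the cataloguing identifiers out of a record shaped like the one above. It is an illustration under assumptions: the namespace URI bound to `tna:` (`TNA_NS`) is a placeholder rather than the published vocabulary namespace, the `rdf:RDF` wrapper is added so the fragment parses on its own, and `cataloguing_fields` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
TNA_NS = "http://example.org/tna#"  # placeholder, not the real tna: namespace

RECORD = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:tna="{TNA_NS}">
  <tna:BornDigitalRecord rdf:about="http://example.org/66/LEV/2/D4SL/Z">
    <tna:cataloguing>
      <tna:Cataloguing>
        <tna:departmentIdentifier>WO</tna:departmentIdentifier>
        <tna:seriesIdentifier>409</tna:seriesIdentifier>
        <tna:pieceIdentifier>27/1</tna:pieceIdentifier>
        <tna:itemIdentifier>1</tna:itemIdentifier>
      </tna:Cataloguing>
    </tna:cataloguing>
  </tna:BornDigitalRecord>
</rdf:RDF>
"""


def cataloguing_fields(rdf_xml: str) -> dict:
    """Map each record URI to its cataloguing identifiers by local name."""
    ns = {"rdf": RDF_NS, "tna": TNA_NS}
    root = ET.fromstring(rdf_xml.strip())
    result = {}
    for record in root.findall("tna:BornDigitalRecord", ns):
        about = record.get(f"{{{RDF_NS}}}about")
        node = record.find("tna:cataloguing/tna:Cataloguing", ns)
        fields = {}
        if node is not None:
            for child in node:
                # Strip the '{namespace}' prefix ElementTree puts on tags.
                fields[child.tag.split("}", 1)[1]] = child.text
        result[about] = fields
    return result


fields = cataloguing_fields(RECORD)
print(fields)
```

The fixed path `tna:cataloguing/tna:Cataloguing` only works because the RDF/XML is constrained to one shape, which is what the accompanying XML schemas are meant to enforce.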

What does all this mean?

  • TNA now has a controlled (OWL) vocabulary for describing digital records
  • TNA has XML schemas for validating this controlled RDF/XML

So what has been gained?

  • Human readable metadata 
  • Metadata that can be validated with XML Schema 
  • Machine readable metadata 
  • Interchangeable metadata (Linked Data) 
  • Editable metadata (without tape wear) 
  • Query-able metadata (SPARQL) 

https://github.com/digital-preservation/dri-vocabulary

tna.owl

The Vocabulary

A work in progress...

...but mature enough to share.

Thank you.
