@adamretter / email@example.com
Mostly Scala and Java
Open Source Hacker
"eXist" for O'Reilly
W3C (Invited Expert)
CSV on the Web, Provenance
Memory is finite
Stories (facts) become distorted over time
Point of truth
The National Archives
Preserving the nation's history
Special Collections (e.g. Records of LOCOG)
Archive Records of UK from OGDs, NGOs and Special Interest
Excellent at traditional Paper records
One of the largest collections in the world
Over 11 million historical Government and Public Records
However, most records today are not created on paper!
Predicted 2013 - 2020:
>6PB of Digital Records to Archive
50% of which will be Born Digital
2009: Existing Digital Records System will not cope...
2011: Build new Digital Records Infrastructure = ME :-)
Records arrive via:
Hard Disks (USB etc)
DVD / CD / Digital Video Cassette / Tape (mostly LTO 1 to 6)
Test, Secure and Examine Records (Pre-Ingest)
Extract Metadata and Archive (Ingest)
Enable Digital Archivists (Search, Retrieval and Edit)
Export Transcoded Records and Metadata (Publish / Sell)
PRONOM - File format database: http://apps.nationalarchives.gov.uk/pronom
DROID - File Identification Tool: http://digital-preservation.github.io/droid
CSV Schema and CSV Validator: http://digital-preservation.github.io/csv-schema
Shadoop - Scala DSL for Hadoop: https://github.com/adamretter/shadoop
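As a rough illustration of what file identification involves, the sketch below matches the first bytes of a file against a small table of well-known "magic numbers". It is not DROID's implementation and does not use PRONOM data; the SIGNATURES table and the identify function are hypothetical names chosen for this example.

```python
# Hypothetical signature table (an assumption for illustration):
# a handful of well-known magic numbers, checked at the start of the data.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"PK\x03\x04", "ZIP container"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
]


def identify(data):
    """Return the name of the first format whose signature prefixes data."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"
```

A real identification tool would likely also need signatures at other offsets and container-aware checks, which this sketch leaves out.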
"In library and archival science, digital preservation is a formal endeavor to ensure that digital information of continuing value remains accessible and usable. It involves planning, resource allocation, and application of preservation methods and technologies, and it combines policies, strategies and actions to ensure access to reformatted and "born-digital" content, regardless of the challenges of media failure and technological change."
"The goal of digital preservation is the accurate rendering of authenticated content over time."
- Taken from Wikipedia: https://en.wikipedia.org/wiki/Digital_preservation
File Identification and Analysis / Hardware Analysis
Emulation vs. Migration
Multiple copies on diverse media at multiple sites
Media Retention Policy - Frequently renewed and rewritten
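Keeping multiple copies is only useful if a damaged copy can be detected. One common way to do that, assumed here rather than taken from the slides, is to record a checksum at ingest and compare every copy against it later. The names sha256_of and check_copies are hypothetical.

```python
import hashlib


def sha256_of(data):
    """Return the SHA-256 checksum of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def check_copies(expected, copies):
    """Return the names of the copies whose checksum differs from expected.

    copies maps a site or media name to the bytes read back from it.
    """
    return [name for name, data in copies.items() if sha256_of(data) != expected]
```

For example, with a checksum recorded from the original bytes, check_copies(expected, {"site-a": original, "site-b": corrupted}) would report only "site-b" as needing repair from a good copy.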
Duh! The same reasons as archiving records.
The Internet Archive?
UK Government Web Archive?
Web Crawling :-(
Web Pages / File Downloads
Databases and Query End-points
e.g. CSV Data without headings and/or schema
Crawling RDF and SPARQL
Self-describing or described data
Some formats are better than others!
Human readable Schema?
Machine readable Schema?
Even a timestamp in the data is very useful!
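To make the point about machine-readable schemas concrete, here is a minimal sketch that checks CSV text against a tiny schema that includes a timestamp column. The schema is a plain Python dictionary invented for this example; it is not the CSV Schema language, and the column names (id, title, modified) are assumptions.

```python
import csv
import io
from datetime import datetime

# Hypothetical schema: column name -> parser that raises on bad input.
SCHEMA = {
    "id": int,
    "title": str,
    "modified": datetime.fromisoformat,  # ISO 8601 timestamp
}


def validate(text):
    """Return a list of (line_number, message) problems found in CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    # Without the expected header row the data cannot be interpreted at all.
    if reader.fieldnames != list(SCHEMA):
        return [(1, "header does not match schema")]
    problems = []
    for row in reader:
        for column, parse in SCHEMA.items():
            try:
                parse(row[column])
            except (ValueError, TypeError):
                problems.append((reader.line_num, "bad value in column " + column))
    return problems
```

Data that arrives without headings fails at the first check, which is exactly the situation an archive cannot recover from without a description supplied by the publisher.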
Is YOUR open data amenable to crawling?
How to archive without crawling?