Data Preservation

June 15, 2016

Public Health England

London, England

Jason M. Coposky

@jason_coposky

Interim Executive Director

Backup vs. Replication

Backup	Replication
Prevent Data Loss	Swift Recovery in Disaster
Snapshot in time	Up to date instance
Cheap, slow storage	Nearly identical storage
Many Copies over time	1-3 instances, up to date
Recovery from any time	Only up to date instance
Safe from user error	May also be affected by users
DR is tested regularly	Replication is kept up to date

Implementing Backup with iRODS

Goal - provide point-in-time snapshots of collections and data objects to support disaster recovery

Given the requirements we need:

Identifying Collections for Backup

Possible Options

Metadata - tag collections for backup
- Include frequency, priority, tiers of storage
White List - provide a manifest for collections which are backed up
Black List - list the collections that are excluded from backup; these may be easier to identify
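
The three options above can be sketched in a few lines of Python. This is only an illustration, not iRODS code: the `Collection` class, the `backup` metadata attribute, and the `select_for_backup` helper are hypothetical names, and the metadata is modelled as a plain dictionary rather than read from the catalog.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class Collection:
    """Hypothetical stand-in for a collection and its metadata."""
    path: str
    metadata: Dict[str, str] = field(default_factory=dict)


def is_under(path: str, prefix: str) -> bool:
    """True if path is the prefix itself or lies below it."""
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")


def select_for_backup(
    collections: Iterable[Collection],
    whitelist: Optional[List[str]] = None,
    blacklist: Optional[List[str]] = None,
    tag: str = "backup",  # assumed metadata attribute name
) -> List[Collection]:
    """Pick collections that are tagged or white-listed, unless black-listed."""
    selected = []
    for coll in collections:
        # The black list always wins: excluded collections are skipped.
        if blacklist and any(is_under(coll.path, b) for b in blacklist):
            continue
        tagged = tag in coll.metadata
        listed = bool(whitelist) and any(is_under(coll.path, w) for w in whitelist)
        if tagged or listed:
            selected.append(coll)
    return selected
```

In a real deployment the same decision would more likely live in a rule or a catalog query; the sketch only shows how the three sources of truth might combine, with the black list taking precedence.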

Collection Snapshots

Use a delayed execution rule to identify collections for backup - then push each new backup onto the queue
Designate backup storage resources
Collections are not replicated, they are copied - new data_obj_ids, new logical paths and timestamps
- Leverage msiCollRsync
- Possibly use bundle operations to create a bzip-tar archive of collections
Snapshot frequency is a data management policy - based on data value, storage age, user concerns
Collection size should be considered - provide a high water mark for over-sized collections
Lock down permissions - protect from user error
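
The policy decisions above (frequency, high water mark, new logical paths per snapshot) can be sketched as a small planning function. This is a minimal sketch under stated assumptions: the `/zone/backup` root, the timestamped path layout, the 10 TiB default threshold, and all function names are hypothetical. It only decides what should happen; the actual copy would be left to something like msiCollRsync.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def snapshot_path(source: str, backup_root: str, when: datetime) -> str:
    """Build a new logical path for the snapshot, keyed by timestamp."""
    stamp = when.strftime("%Y%m%dT%H%M%SZ")
    return f"{backup_root.rstrip('/')}/{stamp}/{source.strip('/')}"


def is_due(last_snapshot: Optional[datetime], frequency: timedelta, now: datetime) -> bool:
    """A snapshot is due if none exists or the policy interval has elapsed."""
    return last_snapshot is None or now - last_snapshot >= frequency


def plan_snapshot(
    source: str,
    size_bytes: int,
    last_snapshot: Optional[datetime],
    frequency: timedelta,
    now: datetime,
    backup_root: str = "/zone/backup",      # assumed backup collection
    high_water_mark: int = 10 * 1024 ** 4,  # assumed limit: 10 TiB
) -> Optional[dict]:
    """Return a plan for this collection, or None if nothing is due."""
    if not is_due(last_snapshot, frequency, now):
        return None
    if size_bytes > high_water_mark:
        # Over-sized collections are flagged rather than copied.
        return {"action": "flag_oversized", "source": source}
    return {
        "action": "copy",
        "source": source,
        "destination": snapshot_path(source, backup_root, now),
    }
```

A delayed rule could evaluate a plan like this for each selected collection and enqueue only the `copy` actions, leaving flagged collections for an administrator to review.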

Metadata for Recovery

Disaster Recovery

Potentially create a tool (rule file) which will
