Data Management
For Grown Ups


Terrell Russell, Ph.D.
@terrellrussell
Senior Data Scientist, iRODS Consortium
Renaissance Computing Institute (RENCI), UNC-Chapel Hill



iRODS Consortium

The iRODS Consortium was created to ensure the sustainability of iRODS and to further its adoption and continued evolution. To this end, the Consortium works to standardize the definition, development, and release of iRODS-based data middleware technologies, evangelize iRODS among potential users, promote new advances in iRODS, and expand the adoption of iRODS-based data middleware technologies through the development, release, and support of an open-source, mission-critical, production-level distribution of iRODS.
Current Members:
RENCI, DICE, Seagate, DDN, Novartis, IBM, Complete Genomics, Wellcome Trust Sanger Institute, UCL, Cleversafe, EMC, and the NASA Atmospheric Science Data Center
Data Management

Multiple pieces
Multiple meanings
Multiple goals
Data Management

- Access - Authentication, Authorization, Revocation
- Description - Standards for discovery, compliance
- Integrity - Confidence that nothing has changed
- Replication - Multiple copies, multiple locations
- Availability - If things are down, nothing else matters
- Migration - Hardware changes, format changes
- Recovery - Robust plans for when things go wrong
- Provenance - Full record of all related activity
- Retention - Deleting data on a defined schedule
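
The Integrity item above ("confidence that nothing has changed") is commonly backed by recording a checksum when data arrives and re-verifying it later. The following is a minimal sketch of that idea using only the Python standard library. The function names and the in-memory `catalog` dictionary are hypothetical, chosen for illustration; they are not part of iRODS or of the original talk.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksums(paths):
    """Build a hypothetical catalog mapping each path to its checksum."""
    return {path: sha256_of(path) for path in paths}


def find_changed(catalog):
    """Return paths that are missing or whose content no longer matches."""
    changed = []
    for path, expected in sorted(catalog.items()):
        if not os.path.exists(path) or sha256_of(path) != expected:
            changed.append(path)
    return changed


def demo():
    """Record a checksum, alter the file, and detect the change."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "sample.dat")
        with open(path, "wb") as handle:
            handle.write(b"original bytes")
        catalog = record_checksums([path])
        before = find_changed(catalog)  # nothing has changed yet
        with open(path, "wb") as handle:
            handle.write(b"tampered bytes")
        after = find_changed(catalog)  # the altered file is reported
        return before, [os.path.basename(p) for p in after]
```

Calling `demo()` returns the list of changed files before and after the tampering step. In practice the recorded checksums would live alongside the rest of the metadata rather than in a Python dictionary.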

Policy Enforcement - Through the Years

People with Keys + Notes/Reports
Passwords + Folders + Scripts (Maybe)
Credentials + Metadata + Automation
Data Management
Fraught with People
Four Verticals → Four Case Studies

- Health Care & Life Science
- Oil & Gas
- Media & Entertainment
- Archives & Records Management
Health Care & Life Science

Genomics Use Case - Data begins as a series of images from a sequencer and is then converted to bases (ATCG), fragmented, aligned, annotated for variants, filtered, and analyzed.
- Extensive Data Pipelines
- Saved State
- Diverse Data Products
- Share Results
Health Care & Life Science

Priorities:
- reproducibility
- multi-institutional collaboration

Oil & Gas

Ingest Use Case - As existing storage fills up, two complementary strategies apply: 1) migrate data from active storage to a slower, cheaper archive, and 2) add more active storage. Traditional HSM offers limited flexibility (it decides on access date, physical location, etc.), and additional namespaces just add more complexity.
- Diverse Data Sources
- Spread Geographically
- Computationally Intense
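
The first strategy in the ingest use case, moving idle data from active storage to a cheaper archive, can be sketched as a small selection policy. This is an illustrative example only: the `FileRecord` fields, the 180-day default, and the `pinned_projects` exemption are assumptions, not values from the talk. The exemption stands in for the kind of metadata-driven criterion that goes beyond a plain access-date rule.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical catalog entry for one stored file."""
    path: str
    size_bytes: int
    last_access: datetime
    project: str


def select_for_archive(records, now, max_idle_days=180, pinned_projects=()):
    """Pick records idle longer than max_idle_days, skipping pinned projects."""
    cutoff = now - timedelta(days=max_idle_days)
    return [
        record for record in records
        if record.last_access < cutoff and record.project not in pinned_projects
    ]


def bytes_freed(selected):
    """Total active-tier space reclaimed if the selection is migrated."""
    return sum(record.size_bytes for record in selected)
```

A caller would pass the current catalog contents and the current time, then hand the returned records to whatever mechanism actually moves the data. Keeping selection separate from movement lets the policy be tested without touching storage.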
Oil & Gas

Priorities:
- unified namespace
- automated analytics

Media & Entertainment

Born Digital Use Case - New valuable creative content (movie assets, original musical tracks) requires large, robust, long-term, flexible, accessible infrastructure.
- Popular Content
- Unique
- Largely Video and Games
Media & Entertainment

Priorities:
- access control
- backups
- integrity

Archives & Records Management

Provenance Use Case - Libraries, museums, and other cultural institutions have a 100+ year view on their digital assets. Must maintain archival and dissemination copies. Lots of metadata.
- Cultural Heritage
- Original and Derivative Copies
- Quality Search and Browse
Archives & Records Management

Priorities:
- provenance
- integrity
- migration
- metadata
- replication

Four Verticals → Four Case Studies

- Health Care & Life Science
- Oil & Gas
- Media & Entertainment
- Archives & Records Management
The Four Pillars



Open Source Data Management Middleware

- iRODS enables data discovery using a metadata catalog that describes every file, every directory, and every storage resource in the data grid.
- iRODS automates data workflows, with a rule engine that permits any action to be initiated by any trigger on any server or client in the grid.
- iRODS enables secure collaboration, so users only need to log in to their home grid to access data hosted on a remote grid.
- iRODS implements data virtualization, allowing access to distributed storage assets under a unified namespace, and freeing organizations from getting locked in to single-vendor storage solutions.
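
The "any action initiated by any trigger" idea in the list above can be illustrated with a toy dispatcher. This sketch is not the iRODS rule engine or its rule language; the `PolicyEngine` class, the `post_put` trigger name, and the dictionary standing in for a metadata catalog are all hypothetical, written only to show the shape of trigger-driven policy.

```python
from collections import defaultdict


class PolicyEngine:
    """Toy trigger -> action dispatcher; not the iRODS rule engine API."""

    def __init__(self):
        self._rules = defaultdict(list)

    def on(self, trigger):
        """Decorator that registers a function as an action for a trigger."""
        def register(action):
            self._rules[trigger].append(action)
            return action
        return register

    def fire(self, trigger, **event):
        """Run every action registered for the trigger, in registration order."""
        return [action(**event) for action in self._rules[trigger]]


engine = PolicyEngine()
catalog = {}  # hypothetical metadata catalog: path -> attributes


@engine.on("post_put")
def tag_owner(path, user):
    """Record who uploaded the file as descriptive metadata."""
    catalog.setdefault(path, {})["owner"] = user
    return "tagged " + path


@engine.on("post_put")
def queue_replica(path, user):
    """Note that a second copy is wanted; a real system would schedule it."""
    catalog.setdefault(path, {})["replicas_wanted"] = 2
    return "replication queued for " + path
```

Firing `engine.fire("post_put", path="/zone/a.dat", user="alice")` runs both actions and leaves the owner and replication request in `catalog`. The point of the design is that policy lives in one place and runs automatically, instead of depending on each person remembering the procedure.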
Questions?

irods.org
github.com/irods
@irods
Creative Commons Images Used:
https://www.flickr.com/photos/addieplum/116062198/
https://www.flickr.com/photos/ajmexico/3281139507/
https://www.flickr.com/photos/future15/2037742362/
Data Management For Grown Ups
By iRODS Consortium
Case Studies for Proper Data Management