Data Management
For Grown Ups
Terrell Russell, Ph.D.
@terrellrussell
Senior Data Scientist, iRODS Consortium
Renaissance Computing Institute (RENCI), UNC-Chapel Hill
iRODS Consortium
The iRODS Consortium was created to ensure the sustainability of iRODS and to further its adoption and continued evolution. To this end, the Consortium works to standardize the definition, development, and release of iRODS-based data middleware technologies, evangelize iRODS among potential users, promote new advances in iRODS, and expand the adoption of iRODS-based data middleware technologies through the development, release, and support of an open-source, mission-critical, production-level distribution of iRODS.
Current Members:
RENCI, DICE, Seagate, DDN, Novartis, IBM, Complete Genomics, Wellcome Trust Sanger Institute, UCL, Cleversafe, EMC, and the NASA Atmospheric Science Data Center
Data Management
Multiple pieces
Multiple meanings
Multiple goals
Data Management
- Access - Authentication, Authorization, Revocation
- Description - Standards for discovery, compliance
- Integrity - Confidence that nothing has changed
- Replication - Multiple copies, multiple locations
- Availability - If things are down, nothing else matters
- Migration - Hardware changes, format changes
- Recovery - Robust plans for when things go wrong
- Provenance - Full record of all related activity
- Retention - Deleting data on a defined schedule
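As one illustration of the Integrity item in the list above, "confidence that nothing has changed" is commonly obtained by recording a checksum when data is stored and comparing it again later. The following is a minimal sketch using only the Python standard library; the function names `sha256_of` and `is_unchanged` are hypothetical and are not part of iRODS or any tool named in this talk.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files never load fully into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_digest: str) -> bool:
    """Compare the file's current digest with one recorded at ingest time."""
    return sha256_of(path) == recorded_digest
```

In practice the recorded digest would live alongside the file's other metadata, so the same check can be repeated after every replication or migration.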
Policy Enforcement - Through the Years
- People with Keys + Notes/Reports
- Passwords + Folders + Scripts (Maybe)
- Credentials + Metadata + Automation
Data Management
Fraught with People
Four Verticals → Four Case Studies
- Health Care & Life Science
- Oil & Gas
- Media & Entertainment
- Archives & Records Management
Health Care & Life Science
Genomics Use Case - Data begins as a series of images from a sequencer and is then converted to bases (ATCG), fragmented, aligned, annotated for variants, filtered, and analyzed.
- Extensive Data Pipelines
- Saved State
- Diverse Data Products
- Share Results
Health Care & Life Science
Priorities:
- reproducibility
- multi-institutional collaboration
Oil & Gas
Ingest Use Case - As existing storage fills up, two complementary strategies apply: 1) migrate data from active storage to a slower, cheaper archive and 2) add more active storage. Traditional HSM (hierarchical storage management) offers limited flexibility in its migration criteria (access date, physical location, etc.), and additional namespaces just add more complexity.
- Diverse Data Sources
- Spread Geographically
- Computationally Intense
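The ingest use case above contrasts fixed HSM criteria with more flexible ones. A rough sketch of the idea, assuming migration decisions are expressed as arbitrary predicates over descriptive metadata, might look like this. The `DataObject` fields, the `project_status` key, and the 90-day threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class DataObject:
    path: str
    size_bytes: int
    days_since_access: int
    metadata: dict  # free-form descriptive metadata (hypothetical keys)


# A policy is any function that decides whether an object should move to archive.
Policy = Callable[[DataObject], bool]


def archive_candidates(objects: Iterable[DataObject], policy: Policy) -> List[DataObject]:
    """Return the objects that the given policy selects for the archive tier."""
    return [obj for obj in objects if policy(obj)]


def closed_project_policy(obj: DataObject) -> bool:
    """Hypothetical rule: archive data from closed projects untouched for 90+ days."""
    return obj.metadata.get("project_status") == "closed" and obj.days_since_access > 90
```

Because the policy is just a function, it can combine access date with any other attribute (project, region, data type) rather than being limited to the criteria a storage tier happens to expose.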
Oil & Gas
Priorities:
- unified namespace
- automated analytics
Media & Entertainment
Born Digital Use Case - New valuable creative content (movie assets, original musical tracks) requires large, robust, long-term, flexible, accessible infrastructure.
- Popular Content
- Unique
- Largely Video and Games
Media & Entertainment
Priorities:
- access control
- backups
- integrity
Archives & Records Management
Provenance Use Case - Libraries, museums, and other cultural institutions take a 100+ year view of their digital assets. They must maintain both archival and dissemination copies, along with extensive metadata.
- Cultural Heritage
- Original and Derivative Copies
- Quality Search and Browse
Archives & Records Management
Priorities:
- provenance
- integrity
- migration
- metadata
- replication
The Four Pillars
Open Source Data Management Middleware
- iRODS enables data discovery using a metadata catalog that describes every file, every directory, and every storage resource in the data grid.
- iRODS automates data workflows, with a rule engine that permits any action to be initiated by any trigger on any server or client in the grid.
- iRODS enables secure collaboration, so users only need to log in to their home grid to access data hosted on a remote grid.
- iRODS implements data virtualization, allowing access to distributed storage assets under a unified namespace, and freeing organizations from getting locked in to single-vendor storage solutions.
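To make the rule-engine and metadata-catalog pillars above more concrete, here is a toy sketch of the general pattern of binding actions to named triggers. It is not the iRODS rule language or API: the `RuleEngine` class, the `after_put` trigger name, and the event keys are assumptions made purely for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

# An action receives a dictionary describing the event that fired the trigger.
Action = Callable[[Dict[str, str]], None]


class RuleEngine:
    """Toy registry mapping trigger names to the actions they should run."""

    def __init__(self) -> None:
        self._rules: DefaultDict[str, List[Action]] = defaultdict(list)

    def on(self, trigger: str, action: Action) -> None:
        """Register an action to run whenever the named trigger fires."""
        self._rules[trigger].append(action)

    def fire(self, trigger: str, event: Dict[str, str]) -> int:
        """Run every action bound to the trigger; return how many ran."""
        actions = self._rules.get(trigger, [])
        for action in actions:
            action(event)
        return len(actions)


# Hypothetical usage: record descriptive metadata whenever a file is stored.
catalog: Dict[str, Dict[str, str]] = {}
engine = RuleEngine()
engine.on("after_put", lambda e: catalog.setdefault(e["path"], {}).update(source=e["source"]))
```

Firing `engine.fire("after_put", {"path": "/zone/a.dat", "source": "sequencer-1"})` would then add an entry for that path to the in-memory `catalog`, which stands in for a real metadata catalog.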
Questions?
irods.org
github.com/irods
@irods
Creative Commons Images Used:
https://www.flickr.com/photos/addieplum/116062198/
https://www.flickr.com/photos/ajmexico/3281139507/
https://www.flickr.com/photos/future15/2037742362/