Data Management

For Grown Ups

Terrell Russell, Ph.D.

@terrellrussell

Senior Data Scientist, iRODS Consortium

Renaissance Computing Institute (RENCI), UNC-Chapel Hill

iRODS Consortium

The iRODS Consortium was created to ensure the sustainability of iRODS and to further its adoption and continued evolution. To this end, the Consortium works to standardize the definition, development, and release of iRODS-based data middleware technologies, evangelize iRODS among potential users, promote new advances in iRODS, and expand the adoption of iRODS-based data middleware technologies through the development, release, and support of an open-source, mission-critical, production-level distribution of iRODS.

 

Current Members:

Hard Problems, Today

  • Petabytes of data
  • Multiple namespaces
  • Keeping data in sync
  • Confirming integrity over time
  • Sharing between co-workers
  • Sharing between partner organizations
  • Migrating to new equipment
  • Automating backups
  • Tiering, HSM
  • Orchestrating HPC
  • Finding last week's data
  • Auditing of access/use
  • Automatically ingesting new data
  • Automating workflows

Data Management

Multiple pieces

Multiple meanings

Multiple goals

Data Management

  • Access - Authentication, Authorization, Revocation
  • Description - Standards for discovery, compliance
  • Integrity - Confidence that nothing has changed
  • Replication - Multiple copies, multiple locations
  • Availability - If things are down, nothing else matters
  • Migration - Hardware changes, format changes
  • Recovery - Robust plans for when things go wrong
  • Provenance - Full record of all related activity
  • Retention - Deleting data on a defined schedule
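As one illustration of the Integrity item, here is a minimal sketch of confirming that a file has not changed since it was ingested. It assumes a SHA-256 digest was recorded at ingest time; the helper names sha256_of and verify are hypothetical and are not part of iRODS or any other tool named in this talk.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, recorded_digest: str) -> bool:
    """True if the file still matches the digest recorded at ingest."""
    return sha256_of(path) == recorded_digest
```

Running verify on a schedule, and against every copy rather than only the primary one, is what turns a stored digest into confidence over time.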

People with Keys + Notes/Reports

Passwords + Folders + Scripts (Maybe)

Credentials + Metadata + Automation

Policy Enforcement - Through the Years

Data Management

Fraught with People

Four Verticals, Four Case Studies

  • Health Care & Life Science
  • Oil & Gas
  • Media & Entertainment
  • Archives & Records Management

Health Care & Life Science

Genomics Use Case - Data begins as a series of images from a sequencer and is then converted to bases (ATCG), fragmented, aligned, annotated for variants, filtered, and analyzed.

 

  • Extensive Data Pipelines
  • Saved State
  • Diverse Data Products
  • Share Results

Health Care & Life Science

Priorities:

  • reproducibility
  • multi-institutional collaboration

 

Oil & Gas

Ingest Use Case - As existing storage fills up, two complementary strategies apply: 1) migrate data from active storage to a slower, cheaper archive, and 2) add more active storage. Traditional HSM has limited flexibility (access date, physical location, etc.), and additional namespaces just add more complexity.

 

  • Diverse Data Sources
  • Spread Geographically
  • Computationally Intense
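The first strategy in the ingest use case, moving idle data from active storage to a cheaper archive, can be sketched as a selection by last access time. This is roughly the kind of access-date policy the slide attributes to traditional HSM. FileRecord and archive_candidates are hypothetical names used only for illustration, and the 30-day threshold is an arbitrary assumption.

```python
import time
from dataclasses import dataclass


@dataclass
class FileRecord:
    path: str
    size_bytes: int
    last_access: float  # seconds since the epoch


def archive_candidates(records, max_idle_days, now=None):
    """Return records idle longer than max_idle_days, oldest access first."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400  # 86400 seconds per day
    idle = [r for r in records if r.last_access < cutoff]
    return sorted(idle, key=lambda r: r.last_access)


if __name__ == "__main__":
    day = 86400
    today = 100 * day
    records = [
        FileRecord("/active/survey_a.segy", 4_000_000_000, 10 * day),
        FileRecord("/active/survey_b.segy", 2_500_000_000, 95 * day),
        FileRecord("/active/survey_c.segy", 1_000_000_000, 50 * day),
    ]
    for record in archive_candidates(records, max_idle_days=30, now=today):
        print(record.path)
```

A policy keyed only to access date is easy to write, which is part of why it is common. Selecting on richer metadata (project, owner, data type) would need a catalog that records those attributes.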

Oil & Gas

Priorities:

  • unified namespace
  • automated analytics

 

Media & Entertainment

Born Digital Use Case - New valuable creative content (movie assets, original musical tracks) requires large, robust, long-term, flexible, accessible infrastructure.

 

  • Popular Content
  • Unique
  • Largely Video and Games

Media & Entertainment

Priorities:

  • access control
  • backups
  • integrity

 

Archives & Records Management

Provenance Use Case - Libraries, museums, and other cultural institutions take a 100+ year view of their digital assets. They must maintain both archival and dissemination copies, along with extensive metadata.

 

  • Cultural Heritage
  • Original and Derivative Copies
  • Quality Search and Browse

Archives & Records Management

Priorities:

  • provenance
    • integrity
    • migration
  • metadata
  • replication

 

The Four Pillars

Open Source Data Management Middleware

  • iRODS enables data discovery using a metadata catalog that describes every file, every directory, and every storage resource in the data grid.

 

  • iRODS automates data workflows, with a rule engine that permits any action to be initiated by any trigger on any server or client in the grid.

 

  • iRODS enables secure collaboration, so users only need to log in to their home grid to access data hosted on a remote grid.

 

  • iRODS implements data virtualization, allowing access to distributed storage assets under a unified namespace, and freeing organizations from getting locked in to single-vendor storage solutions.
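Three of the ideas above (a metadata catalog, rules fired by triggers, and a logical namespace over physical storage) can be illustrated with a toy in-memory model. This is a sketch for intuition only: ToyGrid and its methods are hypothetical, the paths and server names are made up, and none of it reflects the actual iRODS API or rule language. Federation between grids is not modeled.

```python
from collections import defaultdict


class ToyGrid:
    """In-memory stand-in for a catalog, a rule table, and a namespace."""

    def __init__(self):
        self.replicas = defaultdict(list)  # logical path -> physical locations
        self.metadata = defaultdict(dict)  # logical path -> {attribute: value}
        self.rules = defaultdict(list)     # event name -> callbacks

    def on(self, event, callback):
        """Register a callback to run whenever the named event fires."""
        self.rules[event].append(callback)

    def put(self, logical_path, physical_location):
        """Record a replica under a logical path, then fire the 'put' event."""
        self.replicas[logical_path].append(physical_location)
        for callback in self.rules["put"]:
            callback(self, logical_path)

    def annotate(self, logical_path, attribute, value):
        """Attach or overwrite one metadata attribute on a logical path."""
        self.metadata[logical_path][attribute] = value

    def find(self, attribute, value):
        """Return logical paths whose metadata has attribute == value."""
        return sorted(
            path for path, attrs in self.metadata.items()
            if attrs.get(attribute) == value
        )


if __name__ == "__main__":
    grid = ToyGrid()
    # Rule: every newly stored object is tagged automatically.
    grid.on("put", lambda g, path: g.annotate(path, "status", "new"))
    # Two physical copies live under one logical name.
    grid.put("/zone/home/alice/run1.bam", "serverA:/vault/0001")
    grid.put("/zone/home/alice/run1.bam", "serverB:/archive/0001")
    print(grid.find("status", "new"))
    print(grid.replicas["/zone/home/alice/run1.bam"])
```

The point of the sketch is the separation of concerns: users address the logical path, policy runs on events rather than on user discipline, and discovery goes through metadata instead of directory layout.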

Questions?

SC15 Booth #181

 

irods.org

github.com/irods

@irods

 

 

 

 

 

Creative Commons Images Used:

https://www.flickr.com/photos/addieplum/116062198/

https://www.flickr.com/photos/ajmexico/3281139507/

https://www.flickr.com/photos/future15/2037742362/

SC15 - Data Management For Grown Ups

By iRODS Consortium

Case Studies for Proper Data Management