iRODS in the Cloud: Organizational Data Management

Terrell Russell, Ph.D.

@terrellrussell

Executive Director, iRODS Consortium

November 12-17, 2023

Supercomputing 2023

Denver, CO

Our Membership

Consortium

Member

What is iRODS

Open Source

  • C++ client-server architecture
  • BSD-3 Licensed

 

Distributed

  • Runs on a laptop, a cluster, on premises, or geographically distributed

 

Data Centric & Metadata Driven

  • Insulate both your users and your data from your infrastructure

History

  • 1995 - SRB started (grid storage)
  • 2004 - iRODS started (added rule engine / policy)
  • 2013 - Consortium founded by RENCI, DICE, and DDN
  • 2014 - Consortium accepted the code base
  • 45 releases of iRODS to date

iRODS as the Integration Layer

iRODS Core Competencies

The Data Management Model

Ingest to Institutional Repository

As data matures and reaches a broader community, data management policy must also evolve to meet these additional requirements.

Data Management

"The development, execution and supervision of plans, policies, programs, and practices that control, protect, deliver, and enhance the value of data and information assets."

Most organizations are still managing their assets with a collection of small scripts, tribal knowledge, vigilance, and hope.

Organizations, instead, need a future-proof solution to managing data and its surrounding infrastructure.

Hard Problems - Today

  • Peta/Exabytes of data
  • Multiple namespaces
  • Keeping data in sync
  • Confirming integrity over time
  • Sharing between co-workers
  • Sharing between partner organizations
  • Migrating to new equipment
  • Automating backups
  • Tiering, HSM
  • Orchestrating HPC
  • Finding last week's data
  • Auditing of access/use
  • Automatically ingesting new data
  • Automating workflows

Data Management

Multiple pieces

Multiple meanings

Multiple goals

Data Management

  • Access - Authentication, Authorization, Revocation
  • Description - Standards for discovery, compliance
  • Integrity - Confidence that nothing has changed
  • Replication - Multiple copies, multiple locations
  • Availability - If things are down, nothing else matters
  • Migration - Hardware changes, format changes
  • Recovery - Robust plans for when things go wrong
  • Provenance - Full record of all related activity
  • Retention - Deleting data on a defined schedule
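The "Integrity" item above can be made concrete with a small sketch. This is not the iRODS API; it is a plain-Python illustration that assumes a checksum was recorded when the data was ingested and is compared again later. The names sha256_of, verify, and demo are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, recorded_checksum):
    """True if the file still matches the checksum recorded earlier."""
    return sha256_of(path) == recorded_checksum


def demo():
    """Record a checksum, verify it, then alter the file and verify again."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "sample.dat"
        target.write_bytes(b"original contents")
        recorded = sha256_of(target)          # stored at ingest time
        unchanged = verify(target, recorded)  # later integrity check
        target.write_bytes(b"tampered contents")
        still_matches = verify(target, recorded)
    return unchanged, still_matches
```

Calling demo() returns one result for the untouched file and one for the altered file. In practice the recorded checksum would live in a catalog alongside the data rather than in a local variable.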

Policy Enforcement - Through the Years

People with Keys  +  Notes/Reports

Passwords  +  Folders  +  Scripts (Maybe)

Credentials  +  Metadata  +  Automation

Data Management

Fraught with People

Data Management

These long-term management tasks are too much for a curator or librarian, and certainly too much for the scientists themselves, to handle by hand.

 

There must be organizational policy in place to handle the varied scenarios of data retention, data access, and data use.

 

There must be automation in place to provide consistency and confidence in the process.

 

Confidence in tools comes from open frameworks and common, observable patterns in behavior and interoperability.

Data Management

ONLY with the automation of policy can your system provide the types of guarantees that you are actually interested in:

  • integrity
  • provenance
  • quality metadata enforcement
  • reproducibility

 

Leaving humans in charge of policy enforcement is a mistake.

 

  • People should craft the policy together.
  • Machines should enforce the defined policy.
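One way to picture this split is to write the policy down as data that people agree on, and let a function apply it uniformly. The sketch below is a minimal illustration in plain Python, not an iRODS rule or the iRODS API. The classes DataObject and RetentionRule, the function expired_paths, and the metadata attribute "project" are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class DataObject:
    path: str
    created: date
    metadata: dict = field(default_factory=dict)


@dataclass
class RetentionRule:
    """A policy statement people agree on: match on metadata, keep N days."""
    attribute: str
    value: str
    keep_days: int


def expired_paths(objects, rules, today):
    """Return paths whose first matching rule says they are past retention."""
    result = []
    for obj in objects:
        for rule in rules:
            if obj.metadata.get(rule.attribute) == rule.value:
                if today - obj.created > timedelta(days=rule.keep_days):
                    result.append(obj.path)
                break  # only the first matching rule applies
    return result


# People craft the policy (hypothetical values).
RULES = [
    RetentionRule("project", "scratch", keep_days=30),
    RetentionRule("project", "archive", keep_days=3650),
]

# Machines enforce it against whatever the catalog contains.
OBJECTS = [
    DataObject("/zone/a", date(2023, 1, 1), {"project": "scratch"}),
    DataObject("/zone/b", date(2023, 11, 1), {"project": "scratch"}),
    DataObject("/zone/c", date(2020, 1, 1), {"project": "archive"}),
]

EXPIRED = expired_paths(OBJECTS, RULES, today=date(2023, 11, 12))
```

The decision is driven by metadata on the data itself, not by where the file happens to sit, so the same rule list can be reviewed by people and applied by a scheduler without anyone remembering to run it.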

Data management should be data-centric and metadata driven.

Future-proof automated data management requires open formats and open source.

Questions?

Thank you.

Terrell Russell

@terrellrussell

iRODS Consortium