iRODS in the Cloud:
Organizational Data Management
Terrell Russell, Ph.D.
@terrellrussell
Executive Director, iRODS Consortium
November 12-17, 2023
Supercomputing 2023
Denver, CO
Our Membership
Consortium
Member
What is iRODS
Open Source
- C++ client-server architecture
- BSD-3 Licensed
Distributed
- Runs on a laptop, a cluster, on premises, or geographically distributed
Data Centric & Metadata Driven
- Insulate both your users and your data from your infrastructure
History
- 1995 - SRB started (grid storage)
- 2004 - iRODS started (added rule engine / policy)
- 2013 - Consortium founded by RENCI, DICE, and DDN
- 2014 - Consortium accepted the code base
- 45 releases of iRODS to date
iRODS as the Integration Layer
iRODS Core Competencies
The Data Management Model
Ingest to Institutional Repository
As data matures and reaches a broader community, data management policy must also evolve to meet these additional requirements.
Data Management
"The development, execution and supervision of plans, policies, programs, and practices that control, protect, deliver, and enhance the value of data and information assets."
Most organizations are still managing their assets with a collection of small scripts, tribal knowledge, vigilance, and hope.
Organizations, instead, need a future-proof solution to managing data and its surrounding infrastructure.
Hard Problems - Today
- Peta/Exabytes of data
- Multiple namespaces
- Keeping data in sync
- Confirming integrity over time
- Sharing between co-workers
- Sharing between partner organizations
- Migrating to new equipment
- Automating backups
- Tiering, HSM
- Orchestrating HPC
- Finding last week's data
- Auditing of access/use
- Automatically ingesting new data
- Automating workflows
Data Management
Multiple pieces
Multiple meanings
Multiple goals
Data Management
- Access - Authentication, Authorization, Revocation
- Description - Standards for discovery, compliance
- Integrity - Confidence that nothing has changed
- Replication - Multiple copies, multiple locations
- Availability - If things are down, nothing else matters
- Migration - Hardware changes, format changes
- Recovery - Robust plans for when things go wrong
- Provenance - Full record of all related activity
- Retention - Deleting data on a defined schedule
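The Integrity item above can be illustrated with a minimal sketch. This is plain Python, not iRODS code, and the helper names `sha256_of` and `verify` are hypothetical. It assumes a checksum was recorded when the data was ingested and compares it against a freshly computed one:

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, recorded_checksum):
    """Return True if the file still matches the checksum recorded at ingest."""
    return sha256_of(path) == recorded_checksum
```

Running a check like this on a schedule, rather than on demand, is one way to gain confidence that nothing has changed over time.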
Policy Enforcement - Through the Years
People with Keys + Notes/Reports
Passwords + Folders + Scripts (Maybe)
Credentials + Metadata + Automation
Data Management
Fraught with People
Data Management
These long-term management tasks are too much for a curator or librarian, and certainly too much for the scientists themselves, to handle by hand.
There must be organizational policy in place to handle the varied scenarios of data retention, data access, and data use.
There must be automation in place to provide consistency and confidence in the process.
Confidence in tools comes from open frameworks and common, observable patterns in behavior and interoperability.
Data Management
ONLY with the automation of policy can your system provide the types of guarantees that you are actually interested in:
- integrity
- provenance
- quality metadata enforcement
- reproducibility
Leaving the humans in charge of policy enforcement is a mistake.
- People should craft the policy together.
- Machines should enforce the defined policy.
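The split between people crafting policy and machines enforcing it can be sketched as follows. This is an illustrative example only, not the iRODS rule engine. The `RetentionPolicy` and `DataObject` types, the `expired` function, and the 90-day value are all assumptions made for the sketch:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RetentionPolicy:
    """Policy agreed on by people: how long data may be kept."""
    max_age: timedelta


@dataclass(frozen=True)
class DataObject:
    """Hypothetical record of a stored object and its creation time."""
    path: str
    created: datetime


def expired(objects, policy, now):
    """Machine-side enforcement: list paths older than the policy allows."""
    return [o.path for o in objects if now - o.created > policy.max_age]


# Example policy value (an assumption): keep data for 90 days.
policy = RetentionPolicy(max_age=timedelta(days=90))
```

The policy is a small, reviewable piece of data, while the enforcement function applies it the same way every time it runs.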
Data management should be data-centric and metadata driven.
Future-proof automated data management requires open formats and open source.
Questions?
Thank you.
Terrell Russell
@terrellrussell
iRODS Consortium