iRODS and Reproducible Science
October 26-27, 2016
All Things Open
Raleigh, NC
Terrell Russell, Ph.D.
@terrellrussell
Chief Technologist, iRODS Consortium
Problem 1
Too Much Data
"90% of the world's data created within the last two years"
Probably true. Every year. For the last few decades.
Coming in too fast
Without good source information
Getting stored wherever there is room
Getting lost
Getting corrupted
Getting forgotten
Problem 2
Science is Hard
Built on...
But above all...
Must be Reproducible by Others
Data Management
Scientists, managers, and their network administrators must:
Hard enough for today...
Some funders mandate data accessibility for 10 years from the
"last date on which access to the data was requested by a third party".
Data Management
For 10+ years, data must be:
Automatically
Data Management
These long-term management tasks are too much for a curator or librarian, and certainly too much for the scientists themselves, to handle by hand.
There must be organizational policy in place to handle the varied scenarios of data retention, data access, and data use.
There must be automation in place to provide consistency and confidence in the process.
Confidence in tools comes from open frameworks and common, observable patterns in behavior and interoperability.
Big Science is Accelerating
Most "Big Science" is now multi-institute, multi-author, and moving at great speed towards modeling and other computational techniques for greater coverage and impact. The complexities related to storage, collaborative work, and data sharing will only increase.
Unilever
MSST 2016
1/day to 1000s/day
iRODS is an open source system that was designed for these requirements.
iRODS provides the policy-based data management that is demanded by
modern, large-scale, distributed scientific and business endeavors through:
Data to Compute
Compute to Data
Use Cases
The Wellcome Trust Sanger Institute
National Institute of Environmental Health Sciences
NASA Atmospheric Science Data Center
University College London
The Wellcome Trust Sanger Institute
Sanger - Federation
National Institute of Environmental Health Sciences
NASA Atmospheric Science Data Center
University College London
Policy-Based Data Management
Too Much Data
Science (and Business) is Hard
Organizations need Data Policy
Data Management must be Automated
Thank You
Terrell Russell
@terrellrussell
irods.org
@irods
iRODS is open source software for…
• Working with data distributed across storage technologies
• Annotating and searching data with rich metadata
• Implementing access control, auditing, preservation, organization, and data movement policies
• Providing a single interface to share data between organizations
Data Virtualization
iRODS presents multiple separate storage technologies in a unified namespace.
Data Virtualization
Logical Path:
  /tempZone/home/rods/thefile.txt
Physical Path(s) (replicas):
  /var/lib/irods/iRODS/Vault/home/rods/thefile.txt
  /tmp/u2vault/home/rods/thefile.txt
  /tmp/u1vault/home/rods/thefile.txt
$ ils -L /tempZone/home/rods/thefile.txt
  rods              0 demoResc        29606 2016-10-05.09:05 & thefile.txt
        generic    /var/lib/irods/iRODS/Vault/home/rods/thefile.txt
  rods              1 repl;u2         29606 2016-10-05.09:06 & thefile.txt
        generic    /tmp/u2vault/home/rods/thefile.txt
  rods              2 repl;u1         29606 2016-10-05.09:06 & thefile.txt
        generic    /tmp/u1vault/home/rods/thefile.txt
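The listing above pairs one logical path with three physical replicas. As a rough illustration of that mapping, the sketch below parses the same listing text into one record per replica. It is not part of iRODS: the `Replica` class and `parse_replicas` function are hypothetical names, and the parser assumes each replica contributes exactly nine whitespace-separated fields (no checksum column, no spaces in names), which holds for this sample but may not hold for other listings.

```python
from dataclasses import dataclass
from typing import List

# Replica listing copied from the slide (the "$ ils -L ..." command line itself is omitted).
SAMPLE = """\
rods 0 demoResc 29606 2016-10-05.09:05 & thefile.txt
    generic /var/lib/irods/iRODS/Vault/home/rods/thefile.txt
rods 1 repl;u2 29606 2016-10-05.09:06 & thefile.txt
    generic /tmp/u2vault/home/rods/thefile.txt
rods 2 repl;u1 29606 2016-10-05.09:06 & thefile.txt
    generic /tmp/u1vault/home/rods/thefile.txt
"""

# Assumption: nine fields per replica, as in the sample above.
FIELDS_PER_REPLICA = 9


@dataclass
class Replica:
    owner: str
    number: int
    resource: str
    size: int
    modified: str
    status: str
    name: str
    data_type: str
    physical_path: str


def parse_replicas(listing: str) -> List[Replica]:
    """Split the listing into tokens and consume nine tokens per replica."""
    tokens = listing.split()
    if len(tokens) % FIELDS_PER_REPLICA != 0:
        raise ValueError("listing does not match the assumed nine-field layout")
    replicas = []
    for start in range(0, len(tokens), FIELDS_PER_REPLICA):
        (owner, number, resource, size, modified,
         status, name, data_type, path) = tokens[start:start + FIELDS_PER_REPLICA]
        replicas.append(Replica(owner, int(number), resource, int(size),
                                modified, status, name, data_type, path))
    return replicas


replicas = parse_replicas(SAMPLE)
for replica in replicas:
    # One logical name, several physical locations.
    print(replica.number, replica.resource, replica.physical_path)
```

Because the parser works on tokens rather than lines, it accepts the listing whether or not the line breaks survived copying.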
Data Discovery
iRODS provides a catalog, the iCAT, that links data and metadata.
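To make the idea of a catalog that links data and metadata concrete, here is a toy in-memory model. It is only a sketch and does not use the iCAT or any iRODS API: `ToyCatalog`, `annotate`, and `find` are hypothetical names, the metadata is modeled as attribute-value-unit triples, and the second object path and all attribute values are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Metadata modeled as (attribute, value, unit) triples.
AVU = Tuple[str, str, str]


class ToyCatalog:
    """Hypothetical stand-in for a catalog: logical path -> set of metadata triples."""

    def __init__(self) -> None:
        self._metadata: Dict[str, Set[AVU]] = defaultdict(set)

    def annotate(self, logical_path: str, attribute: str, value: str, unit: str = "") -> None:
        # Attach one metadata triple to a logical path.
        self._metadata[logical_path].add((attribute, value, unit))

    def find(self, attribute: str, value: str) -> List[str]:
        # Return every logical path carrying the given attribute/value pair, sorted.
        return sorted(
            path
            for path, avus in self._metadata.items()
            if any(a == attribute and v == value for a, v, _ in avus)
        )


catalog = ToyCatalog()
catalog.annotate("/tempZone/home/rods/thefile.txt", "project", "demo")
# Hypothetical second object, not from the slides.
catalog.annotate("/tempZone/home/rods/other.txt", "project", "demo")
catalog.annotate("/tempZone/home/rods/other.txt", "length", "12", "cm")

matches = catalog.find("project", "demo")
print(matches)
```

The point of the sketch is that discovery works on descriptions of the data rather than on where the data is stored.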
Workflow Automation
iRODS lets you use any operation within the system to trigger a programmatic action
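The pattern described here, an operation triggering a programmatic action, can be sketched as a small hook registry. This is a toy model and not the iRODS rule engine: `ToyPolicyEngine`, the operation names `"put"` and `"get"`, and both handlers are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class ToyPolicyEngine:
    """Hypothetical stand-in for a rule engine: maps operation names to actions."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def on(self, operation: str, handler: Handler) -> None:
        # Register an action to run whenever the named operation is performed.
        self._handlers.setdefault(operation, []).append(handler)

    def perform(self, operation: str, context: dict) -> None:
        # A real system would carry out the operation here; this sketch only
        # runs the registered actions, in registration order.
        for handler in self._handlers.get(operation, []):
            handler(context)


audit_log: List[str] = []


def record_audit(context: dict) -> None:
    audit_log.append(f"put {context['logical_path']}")


def tag_for_replication(context: dict) -> None:
    context["replicate"] = True


engine = ToyPolicyEngine()
engine.on("put", record_audit)
engine.on("put", tag_for_replication)

event = {"logical_path": "/tempZone/home/rods/thefile.txt"}
engine.perform("put", event)
engine.perform("get", event)  # no actions registered for "get", so nothing runs

print(audit_log)
print(event)
```

Keeping the actions separate from the operations is what lets an organization change its policy without changing how users store or fetch data.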
Secure Collaboration
iRODS lets you share data across administrative units at any time after deployment
The iRODS Plugin Architecture