Introduction to
Data Management
July 20, 2018
NSF CI CyberCarpentry Workshop
UNC-Chapel Hill
Terrell Russell, Ph.D.
@terrellrussell
Chief Technologist, iRODS Consortium
Motivation
What is the system and model to solve these problems in an automated way?
A Definition of Data Management
"The development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets."
Organizations need a future-proof solution for managing data and its surrounding infrastructure.
What is iRODS
iRODS is
A flexible framework for the abstraction of infrastructure
iRODS as the Integration Layer
Creating the Machine
What are the necessary components to build such a system?
iRODS Architecture
iRODS is ultimately a Catalog and an RPC API
Catalog Service Consumers
Servers which provide access to storage resources
Catalog Service Provider
Same capabilities as the Consumer with the addition of a database plugin
Metadata Catalog
What are some limitations of this design?
Limitations of System Design
Addressing System Limitations
What to consider in an iRODS deployment
iRODS will run on a laptop or a rack of servers
What does iRODS actually do?
Core Competencies
Data Virtualization
Combine various distributed storage technologies into a Unified Namespace
iRODS provides a logical view into the complex physical representation of your data, distributed geographically, and at scale.
Logical Path
Physical Path(s)
$ ils -L /tempZone/home/rods/thefile.txt
  rods 0 demoResc 29606 2016-10-05.09:05 & thefile.txt
      generic /var/lib/irods/iRODS/Vault/home/rods/thefile.txt
  rods 1 repl;u2 29606 2016-10-05.09:06 & thefile.txt
      generic /tmp/u2vault/home/rods/thefile.txt
  rods 2 repl;u1 29606 2016-10-05.09:06 & thefile.txt
      generic /tmp/u1vault/home/rods/thefile.txt
Logical Path:
  /tempZone/home/rods/thefile.txt
Physical Paths:
  /var/lib/irods/iRODS/Vault/home/rods/thefile.txt
  /tmp/u2vault/home/rods/thefile.txt
  /tmp/u1vault/home/rods/thefile.txt
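The relationship between one logical path and its physical replicas can be sketched as a small lookup table. This is only a toy model in plain Python, not the iRODS API: the `Replica` class, the `CATALOG` dictionary, and the `physical_paths` function are hypothetical names introduced for illustration, and the values are copied from the `ils -L` listing above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """One physical copy of a data object (hypothetical toy type)."""
    number: int
    resource: str
    physical_path: str


# Toy catalog keyed by logical path; values come from the listing above.
CATALOG = {
    "/tempZone/home/rods/thefile.txt": [
        Replica(0, "demoResc", "/var/lib/irods/iRODS/Vault/home/rods/thefile.txt"),
        Replica(1, "repl;u2", "/tmp/u2vault/home/rods/thefile.txt"),
        Replica(2, "repl;u1", "/tmp/u1vault/home/rods/thefile.txt"),
    ],
}


def physical_paths(logical_path):
    """Return every physical location recorded for a logical path."""
    return [replica.physical_path for replica in CATALOG.get(logical_path, [])]


paths = physical_paths("/tempZone/home/rods/thefile.txt")
for path in paths:
    print(path)
```

The point of the sketch is that users and applications only ever name the logical path; where the bytes live is a catalog lookup that can change without the name changing.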
Data Discovery
Attach metadata to any first-class entity within the iRODS Zone
iRODS provides automated and user-provided metadata which makes your data and infrastructure more discoverable, operational and valuable.
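iRODS metadata takes the form of attribute-value-unit (AVU) triples. A minimal sketch of attaching such triples to different kinds of entities and then searching by them might look like the following. The `MetadataStore` class is a hypothetical in-memory stand-in, not the iRODS catalog or its API, and the attribute names and values are made up for illustration.

```python
from collections import defaultdict


class MetadataStore:
    """Toy in-memory store of attribute-value-unit triples (hypothetical)."""

    def __init__(self):
        self._avus = defaultdict(set)

    def add(self, entity, attribute, value, unit=""):
        """Attach one AVU triple to an entity named by a string."""
        self._avus[entity].add((attribute, value, unit))

    def find(self, attribute, value):
        """Return the sorted names of entities carrying attribute == value."""
        return sorted(
            entity
            for entity, avus in self._avus.items()
            if any(a == attribute and v == value for a, v, _ in avus)
        )


store = MetadataStore()
# A data object, a collection, and a storage resource can all carry metadata.
store.add("/tempZone/home/rods/thefile.txt", "project", "demo")
store.add("/tempZone/home/rods/thefile.txt", "size_checked", "29606", "bytes")
store.add("/tempZone/home/rods", "project", "demo")
store.add("demoResc", "tier", "fast")

matches = store.find("project", "demo")
print(matches)
```

Searching by attribute rather than by path is what makes the data discoverable: the query does not need to know where an object sits in the namespace.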
Metadata Everywhere
Workflow Automation
Integrated policy engine which is triggered by any operation within the framework
The iRODS rule engine provides the ability to capture real-world policy as computer-actionable rules which may allow, deny, or add context to operations within the system.
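The idea of rules that allow, deny, or add context to an operation can be sketched with a small hook registry. This is a toy model under stated assumptions, not the iRODS rule engine or its rule language: the `PolicyEngine` class, the hook names `pre_put` and `post_put`, and both example rules are hypothetical.

```python
class PolicyDenied(Exception):
    """Raised by a rule to reject an operation."""


class PolicyEngine:
    """Toy registry mapping hook names to rule functions (hypothetical)."""

    def __init__(self):
        self._rules = {}

    def on(self, hook):
        """Decorator that registers a rule for the named hook."""
        def register(rule):
            self._rules.setdefault(hook, []).append(rule)
            return rule
        return register

    def fire(self, hook, context):
        """Run every rule registered for a hook; a rule may raise to deny."""
        for rule in self._rules.get(hook, []):
            rule(context)


engine = PolicyEngine()


@engine.on("pre_put")
def deny_tmp_files(ctx):
    # Deny: reject the operation before any data is stored.
    if ctx["logical_path"].endswith(".tmp"):
        raise PolicyDenied("temporary files are not accepted")


@engine.on("post_put")
def tag_uploader(ctx):
    # Add context: record who performed the upload.
    ctx.setdefault("metadata", []).append(("uploaded_by", ctx["user"]))


def put(engine, user, logical_path):
    """A stand-in 'put' operation wrapped by pre and post hooks."""
    ctx = {"user": user, "logical_path": logical_path}
    engine.fire("pre_put", ctx)
    ctx["stored"] = True  # placeholder for the real storage work
    engine.fire("post_put", ctx)
    return ctx


result = put(engine, "rods", "/tempZone/home/rods/thefile.txt")

try:
    put(engine, "rods", "/tempZone/home/rods/scratch.tmp")
    denied = False
except PolicyDenied:
    denied = True

print(result["metadata"], denied)
```

Keeping the rules outside the operation itself is the design choice being illustrated: policy can change without touching the code that moves the data.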
Dynamic Policy Enforcement
The iRODS rule may:
Dynamic Policy Enforcement
A single API call expands to many plugin operations, all of which may invoke policy enforcement
Plugin Interfaces:
Secure Collaboration
iRODS allows for collaboration across administrative boundaries after deployment
iRODS provides the ability to federate namespaces across organizations without pre-coordinated funding or effort.
iRODS Service Interface
Federation - Shared Data and Services
Institutional repositories
As data matures and reaches a broader community, data management policy must also evolve to meet these additional requirements.
Patterns and
Use Cases
iRODS
On Premises to Any Cloud Infrastructure
Data to Compute Use Case
Compute to Data Use Case
The Wellcome Sanger Institute
Sanger - Replication
Sanger - Metadata
attribute: library
attribute: total_reads
attribute: type
attribute: lane
attribute: is_paired_read
attribute: study_accession_number
attribute: library_id
attribute: sample_accession_number
attribute: sample_public_name
attribute: manual_qc
attribute: tag
attribute: sample_common_name
attribute: md5
attribute: tag_index
attribute: study_title
attribute: study_id
attribute: reference
attribute: sample
attribute: target
attribute: sample_id
attribute: id_run
attribute: study
attribute: alignment
Sanger - Federation
University College London
Roadmap
iRODS Software
The Roadmap
The Roadmap - iRODS 4.3
Packaged iRODS Capabilities
Multipart Transfer
Provide reliable transfer with restart
- object parts tracked in the catalog
Later versions will provide fast, first-class access to object storage
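Restartable transfer with tracked parts can be sketched roughly as follows. This is an illustrative toy, not the planned iRODS implementation: a plain Python set stands in for the catalog's record of completed parts, and the `split_parts` and `transfer` functions and the `fail_at` parameter (used only to simulate an interruption) are hypothetical.

```python
def split_parts(total_size, part_size):
    """Return (offset, length) pairs covering total_size bytes."""
    return [
        (offset, min(part_size, total_size - offset))
        for offset in range(0, total_size, part_size)
    ]


def transfer(data, part_size, completed, dest, fail_at=None):
    """Copy data into dest part by part, skipping parts already recorded.

    completed is a set of finished part indexes (the stand-in catalog).
    fail_at simulates a dropped connection at that part index.
    """
    for index, (offset, length) in enumerate(split_parts(len(data), part_size)):
        if index in completed:
            continue  # already transferred in an earlier attempt
        if index == fail_at:
            raise ConnectionError(f"simulated failure at part {index}")
        dest[offset:offset + length] = data[offset:offset + length]
        completed.add(index)  # record the part only after it is written


data = bytes(range(256)) * 4          # 1024 bytes of sample data
dest = bytearray(len(data))           # destination buffer
completed = set()

try:
    transfer(data, 256, completed, dest, fail_at=2)
except ConnectionError:
    pass
after_failure = sorted(completed)

# Restart: only the parts not yet recorded are sent again.
transfer(data, 256, completed, dest)
print(after_failure, sorted(completed), bytes(dest) == data)
```

Because each part is recorded only after it has been written, a restart never has to resend parts that already arrived.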
iRODS 4.2 and Beyond - The Scatter
Next Generation Query Interface
iRODS 4.3 and Beyond - The Gather
Shared Data - Shared Infrastructure
Metadata Templates
Business Model
iRODS Consortium
The iRODS Consortium
Our Mission
Why Open Source
Our Membership
Our Business Model
Consortium Membership
Service & Support Contracts
Membership Committees
Technology Working Group
Planning Committee
Executive Board
Additional working groups are formed as required
Our Consortium Participation