iRODS Technology in Depth

  May 5, 2015

  Presented to

  by Jason Coposky - jasonc@renci.org

  and Dan Bedard - danb@renci.org

 

Agenda

(C) 2015 THE IRODS CONSORTIUM

What is iRODS?

Part 1: iRODS Lets You Control Your Data

iRODS LETS YOU CONTROL YOUR DATA AND PROVE IT

iRODS LETS YOU

  • Control access to data based on any characteristic of the data, connection, user, or resource.
  • Prove integrity and custody of the data.
  • Retain, archive, and destroy data according to policy.

iRODS LETS YOU

  • Control and access data spread across storage in different sites, from different vendors.
  • Move huge data sets between multiple sites, quickly and verifiably.
  • Put the right data, in the right place, close to the right people (and out of reach of the wrong people).

iRODS LETS YOU

  • Avoid buying the same data set twice.
  • Eliminate manual processing steps.
  • Keep track of processing steps applied, from raw data to finished product.

What is iRODS?

Part 2: iRODS is Open Source Data Grid Middleware

iRODS is open source data grid middleware for...

  • Data Virtualization
  • Data Discovery
  • Workflow Automation
  • Secure Collaboration

  • iRODS is the underlying technology for the world’s preeminent genomic research institutes.
  • iRODS is an infinitely configurable data janitor.
  • iRODS is the kind of technology you need to host everyone’s unstructured data.
  • iRODS is a powerful data migration tool.
  • iRODS is the technology that underpins the iPlant Data Store.
  • iRODS is a data preservation technology.
  • iRODS is a fundamental technology for CineGrid.
  • iRODS is a tool for providing fine-grained privacy and security controls.
  • iRODS is extensible: iRODS has command-line clients, APIs for numerous programming languages, and web clients.
  • iRODS supports new plug-ins for storage resources, authentication mechanisms, microservices, and network prot

Middleware

iRODS acts as a bridge between applications and unstructured data (files).

  • Users don't need to know about the structure of the underlying file systems.
  • Uniform access to and control of data.

Data Virtualization

iRODS presents multiple separate file systems in a unified namespace.

  • Standard file systems: Any resource with a UNIX mount point.
  • Archival storage: HPSS, TSM
  • Object stores: Cleversafe, DDN WOS, Ceph/Rados
  • Cloud-based storage: Amazon S3
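The idea of a unified namespace can be sketched in a few lines of Python. This is a toy model, not iRODS code: the `Namespace` and `Replica` classes, the resource names, and all paths are assumptions made up for illustration. It shows only that one logical path can stand in front of several physical locations on different kinds of storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """Where one copy of a data object physically lives (illustrative only)."""
    resource: str       # hypothetical resource name, e.g. "unix_fs" or "s3_archive"
    physical_path: str  # location as understood by that storage system


class Namespace:
    """Toy logical namespace: one logical path maps to one or more replicas."""

    def __init__(self):
        self._objects = {}

    def register(self, logical_path, replica):
        self._objects.setdefault(logical_path, []).append(replica)

    def resolve(self, logical_path):
        """Return every physical replica behind a logical path."""
        return list(self._objects.get(logical_path, []))

    def list_collection(self, collection):
        """List logical paths under a collection, regardless of backing storage."""
        prefix = collection.rstrip("/") + "/"
        return sorted(p for p in self._objects if p.startswith(prefix))


ns = Namespace()
ns.register("/zone/home/alice/run1.bam", Replica("unix_fs", "/mnt/disk1/a/run1.bam"))
ns.register("/zone/home/alice/run1.bam", Replica("s3_archive", "bucket/run1.bam"))
ns.register("/zone/home/alice/notes.txt", Replica("tape", "hpss:/arch/notes.txt"))

# The user sees one listing, whatever storage sits underneath.
print(ns.list_collection("/zone/home/alice"))
print(ns.resolve("/zone/home/alice/run1.bam"))
```

The user-facing listing never mentions a mount point, bucket, or tape path; those details stay behind `resolve`.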

Data Discovery

iRODS provides a catalog, the iCAT, that links data and metadata.

  • Metadata can be system- or user-generated.
  • Users can find data using features such as description, study ID, access date.
  • Metadata can be used to link processed results to raw data (i.e., tracking provenance).
  • Administrators can use metadata to control policy, such as archiving and access control policies.
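A minimal sketch of metadata-driven discovery, assuming a catalog that stores attribute/value pairs per logical path. The `Catalog` class, the `derived_from` attribute, and the sample paths are hypothetical and are not the iCAT schema or query interface.

```python
from collections import defaultdict


class Catalog:
    """Toy metadata catalog: attribute/value pairs attached to logical paths."""

    def __init__(self):
        self._metadata = defaultdict(dict)

    def annotate(self, path, attribute, value):
        self._metadata[path][attribute] = value

    def find(self, **criteria):
        """Return paths whose metadata matches every attribute=value given."""
        return sorted(
            path
            for path, attrs in self._metadata.items()
            if all(attrs.get(k) == v for k, v in criteria.items())
        )

    def get(self, path, attribute):
        return self._metadata[path].get(attribute)


catalog = Catalog()
catalog.annotate("/zone/raw/sample1.fastq", "study_id", "S100")
catalog.annotate("/zone/raw/sample1.fastq", "type", "raw")
catalog.annotate("/zone/results/sample1.vcf", "study_id", "S100")
catalog.annotate("/zone/results/sample1.vcf", "type", "processed")
# Provenance: the processed result points back at its raw input.
catalog.annotate("/zone/results/sample1.vcf", "derived_from", "/zone/raw/sample1.fastq")
catalog.annotate("/zone/raw/sample2.fastq", "study_id", "S200")

print(catalog.find(study_id="S100"))
print(catalog.find(study_id="S100", type="processed"))
print(catalog.get("/zone/results/sample1.vcf", "derived_from"))
```

Users search by what the data is about (study, type) rather than by where the file sits, and the `derived_from` link gives a simple provenance trail.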

Workflow Automation

iRODS lets you use any condition to trigger any action.

  • User, file, and operating system activity is caught by "policy enforcement points" (PEPs).
  • The iRODS "rule engine" links PEPs to microservices.
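The relationship between PEPs, the rule engine, and microservices can be pictured as an event dispatcher. The sketch below is a conceptual model only: `RuleEngine`, the `post_put` name, and both handlers are invented for illustration and do not reflect actual iRODS identifiers or the iRODS rule language.

```python
from collections import defaultdict


class RuleEngine:
    """Toy rule engine: maps named enforcement points to handler functions."""

    def __init__(self):
        self._rules = defaultdict(list)

    def on(self, pep_name):
        """Decorator that registers a handler (a stand-in for a microservice)."""
        def register(handler):
            self._rules[pep_name].append(handler)
            return handler
        return register

    def fire(self, pep_name, **event):
        """Run every handler bound to the enforcement point, in order."""
        return [handler(**event) for handler in self._rules[pep_name]]


engine = RuleEngine()
audit_log = []


@engine.on("post_put")  # hypothetical PEP name, not an actual iRODS identifier
def record_upload(user, path):
    audit_log.append(f"{user} uploaded {path}")
    return "logged"


@engine.on("post_put")
def tag_landing_zone(user, path):
    return "tagged" if path.startswith("/zone/landing/") else "skipped"


# Something happens in the system; the matching enforcement point fires.
results = engine.fire("post_put", user="alice", path="/zone/landing/run1.bam")
print(results)
print(audit_log)
```

The point of the design is the indirection: the code that detects activity knows nothing about what should happen next, so policy can change without touching it.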

iRODS lets you use any condition to trigger any action. For example:

  • Metadata can be extracted once a file is placed in a landing zone.
  • Data can be staged for high-performance computing (HPC) operations.
  • Archiving and retention: data can be removed after an expiration date.
  • Transformation: iRODS can kick off processes, send notification upon completion, and store results as metadata.
  • Auditing: all iRODS user and file activity can be tracked in a log or separate database.
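As one illustration of the retention and auditing examples, the sketch below removes entries whose expiration date has passed and records each removal. The `expires` attribute name, the dictionary-based catalog, and the log format are assumptions; a real deployment would express this as policy inside iRODS rather than as a standalone script.

```python
from datetime import date


def expired_objects(catalog, today):
    """Return paths whose 'expires' date is strictly before today."""
    return sorted(
        path
        for path, attrs in catalog.items()
        if "expires" in attrs and date.fromisoformat(attrs["expires"]) < today
    )


def apply_retention(catalog, today, audit_log):
    """Remove expired entries and record each removal for auditing."""
    for path in expired_objects(catalog, today):
        del catalog[path]
        audit_log.append(f"{today.isoformat()} removed {path}")


catalog = {
    "/zone/tmp/scratch.dat": {"expires": "2015-01-31"},
    "/zone/archive/final.dat": {"expires": "2025-12-31"},
    "/zone/home/alice/keep.txt": {},  # no expiration metadata: never removed
}
audit_log = []
apply_retention(catalog, date(2015, 5, 5), audit_log)

print(sorted(catalog))
print(audit_log)
```

Objects without expiration metadata are left alone, so the policy only acts on data that someone has explicitly marked.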

Secure Collaboration

Independently managed iRODS zones can be federated.

  • Local users can grant access for users from remote zones to read/write data and metadata.
  • Users log in (authenticate) through their home zones. Consistent interface across zones.
  • Administrators exchange one set of keys. No need to compromise on data management policy.
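The federation model can be sketched as two independent zones in which the home zone vouches for a user's identity while the zone that holds the data decides access. Everything here is a simplified assumption: the `Zone` class, the `trusted_zones` set standing in for the administrators' key exchange, and the zone-qualified `user#zone` naming used for grants.

```python
class Zone:
    """Toy zone: owns its users and the access lists on its own data."""

    def __init__(self, name):
        self.name = name
        self.users = set()
        self.trusted_zones = set()
        self.acl = {}  # path -> {zone-qualified user name: "read" or "write"}

    def add_user(self, user):
        self.users.add(user)

    def authenticate(self, user):
        """A zone only vouches for its own users."""
        return user in self.users

    def grant(self, path, qualified_user, level):
        self.acl.setdefault(path, {})[qualified_user] = level

    def can_read(self, path, user, home_zone):
        """Allow access if the home zone is trusted, vouches for the user,
        and a local grant exists for that user on this path."""
        if home_zone is not self and home_zone.name not in self.trusted_zones:
            return False
        if not home_zone.authenticate(user):
            return False
        qualified = f"{user}#{home_zone.name}"
        return self.acl.get(path, {}).get(qualified) in ("read", "write")


zone_a, zone_b = Zone("zoneA"), Zone("zoneB")
zone_b.add_user("bob")
zone_a.trusted_zones.add("zoneB")  # stands in for the administrators' key exchange
zone_a.grant("/zoneA/shared/data.csv", "bob#zoneB", "read")

print(zone_a.can_read("/zoneA/shared/data.csv", "bob", zone_b))
print(zone_a.can_read("/zoneA/shared/data.csv", "mallory", zone_b))
```

Each zone keeps its own user list and its own access rules; federation adds trust between zones without merging either.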

iRODS is open source data grid middleware (it sits between the file system and the application) for...

  • Data Virtualization: all your storage in a single namespace
  • Data Discovery: metadata annotation
  • Workflow Automation: über cron
  • Secure Collaboration: consolidation of access and control across sites

How is iRODS Used?

Use Case: Sanger Institute

The Wellcome Trust Sanger Institute

  • Largest single contributor to original Human Genome Project
    • Sequenced 1/3 of the human genome
  • Data made publicly available via websites, FTP, direct database access, and APIs
  • 1000 Genome Study → UK10k
  • >45 PB raw storage

  • Data preferentially placed on resource servers in the green data center (fallback to red).
  • Data replicated to the other room.
  • Checksums applied.
  • Green and red centers both used for read access.
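The placement behavior in these bullets can be sketched as follows. This is not Sanger's configuration or iRODS code: the `Room` class, the `ingest` and `read_any` helpers, and the choice of MD5 via `hashlib` are assumptions made for illustration.

```python
import hashlib


class Room:
    """Toy storage location standing in for one data center room."""

    def __init__(self, name, online=True):
        self.name = name
        self.online = online
        self.files = {}

    def write(self, path, data):
        if not self.online:
            raise IOError(f"{self.name} is offline")
        self.files[path] = data


def checksum(data):
    return hashlib.md5(data).hexdigest()


def ingest(path, data, preferred, fallback):
    """Write to the preferred room (or the fallback), replicate to the other
    room when possible, and return the checksum plus the rooms holding a copy."""
    primary, secondary = (preferred, fallback) if preferred.online else (fallback, preferred)
    primary.write(path, data)
    holders = [primary.name]
    if secondary.online:
        secondary.write(path, data)
        # Verify the replica against the original before counting it.
        assert checksum(secondary.files[path]) == checksum(data)
        holders.append(secondary.name)
    return checksum(data), holders


def read_any(path, rooms):
    """Serve a read from whichever online room holds the file."""
    for room in rooms:
        if room.online and path in room.files:
            return room.files[path]
    raise FileNotFoundError(path)


green, red = Room("green"), Room("red")
digest, holders = ingest("/seq/run1.cram", b"ACGT", green, red)
print(digest, holders)

green.online = False  # simulate losing the preferred room
print(ingest("/seq/run2.cram", b"TTGA", green, red)[1])
print(read_any("/seq/run1.cram", [green, red]))
```

With the green room offline, new data lands in red and earlier data stays readable from the red replica.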

 

 

  • Users query and access data from local compute clusters.
  • Users access iRODS locally via the command line interface.
  • Example metadata attributes: library, total_reads, type, lane, is_paired_read, study_accession_number, library_id, sample_accession_number, sample_public_name, manual_qc, tag, sample_common_name, md5, tag_index, study_title, study_id, reference, sample, target, sample_id, id_run, study, alignment

Baton Client

Thin layer over parts of the iRODS C API

  • JSON support
  • Connection friendly
  • Comprehensive logging
  • autoconf build on Linux and OSX

Current state

  • Metadata listing
  • Metadata queries
  • Metadata addition

https://github.com/wtsi-npg/baton.git

How is iRODS Used?

Additional Use Cases

Other Use Cases

A Health Science Institute

  • Landing Zone for automatic staging to/from HPC
  • Metadata extraction, hierarchical metadata
  • Automatic permission management

 

NIEHS

  • Automating lab processes
  • Report generation

The iRODS Consortium

Welcome to the iRODS Consortium!

iRODS is free, open source software owned by a foundation called the iRODS Consortium.

  • Members pay an annual membership fee; there are four levels of membership.
  • Members have agreed upon iRODS as an area of cooperation, rather than competition.
  • Two monthly meetings: Technology Working Group (TWG), Planning Committee

Consortium Initiatives

  • Professional services, training, and support.
  • iRODS Partners Program
  • iRODS Hub
  • iRODS User Group Meeting 2015 (ugm2015.irods.org)
    • Chapel Hill, NC
    • Training on June 9th
    • Presentations on June 10th and 11th

Getting Started

The iRODS Adoption Model

Initial Trial

  • Documentation, training
  • Blog posts, social media
  • Cloud images
  • Google Group
  • iRODS Hub

Proof of Concept

  • Occasional 1-on-1 Support
  • Service Contract

Pilot

  • iRODS Partners
  • Service Contract

Production

  • Consortium Membership
  • iRODS Partners
  • Service Contract

How to Assist

  1. Learn about iRODS at iRODS.org
  2. Identify the Need
  3. Demonstrate a Sample Application
    • The Consortium Can Help!
  4. Collaborate
  5. Co-Deployment

Recognizing iRODS Candidates

  • DevOps
  • Long term storage
  • Genomics, life sciences
  • >500 TB of data
  • Mixed storage environment
  • "Collaboration" between multiple (sub-)organizations
  • "My HSM system isn't smart enough"
  • "My scheduler doesn't talk to my storage system"
  • "provenance," "metadata"

iRODS 4.1 and Beyond

iRODS 4.1

  • Expect to release by June 2015
  • Hardening the legacy code base
    • Coverity Clean: Over 1100 stability, reliability fixes
    • JSON-Based configuration replaces scattered config files
    • Zone Introspection: Which servers are in my Zone?
    • Control Plane: Orderly shutdown
  • Customer-Requested Features
    • Atomic data-metadata puts
    • Key-Value passthrough from the command line

iRODS 4.2

  • Messaging framework: interface to external services
    • e.g., Solr for full content indexing
  • Pluggable rule engine
  • Next generation API
    • Simpler client development, beginning with put/get/query
    • Movement toward object semantics, move POSIX to the resource plugins
    • Pluggable transport: iRODS to broker a connection, then get out of the way 
  • User Interface improvements, in coordination with DFC collaborators

 

And Beyond...

  • Eventually, iRODS core is mainly connecting plugins: Need plugin registry and dependency model
  • Using the new API to create new plugins, clients, interfaces

Developing Resource Plugins

Thank You!

 

Jason Coposky

jasonc@renci.org

 

Dan Bedard

danb@renci.org
