Terrell Russell, Ph.D.

@terrellrussell

Executive Director, iRODS Consortium

iRODS and eHealth

June 5-9, 2023

TNC23

Tirana, Albania

Our Membership

Consortium

Member

Today's Talk

  • An overview of the iRODS platform, its capabilities and existing integrations
  • A few use cases of policy-driven health research data management solutions in production around the world
  • Some philosophy about building systems to last a long time

What is iRODS

Open Source

  • C++ client-server architecture
  • BSD-3 licensed; install it today and try before you buy

Distributed

  • Runs on a laptop, on a cluster, on premises, or geographically distributed

Data Centric & Metadata Driven

  • Insulate both your users and your data from your infrastructure

Philosophical Drivers

  • 100-year view

  • Plugin Architecture

    • core is generic: protocol, API, bookkeeping

    • plugins are specific

    • policy composition

  • Modern core libraries

    • standardized interfaces

    • refactored iRODS internals

      • ease of (re)use

      • fewer bugs

  • Configuration, Not Code

Why use iRODS?

People need a solution for:

  • Managing large amounts of data across various storage technologies
  • Controlling access to data
  • Searching their data quickly and efficiently
  • Automating data management tasks

The larger the organization, the more it needs software like iRODS.

iRODS as the Integration Layer

iRODS Core Competencies

  • Packaged and supported solutions
  • Require configuration, not code
  • Derived from the majority of use cases observed in the user community

iRODS Capabilities

The Data Management Model

Ingest to Institutional Repository

As data matures and reaches a broader community, data management policy must also evolve to meet these additional requirements.

The Wellcome Sanger Institute

Sanger - Replication

  • Data preferentially placed on resource servers in the green data center (fallback to red)
  • Data replicated to the other data center
  • Checksums applied
  • Green and red centers both used for read access
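
The placement rules above can be sketched in a few lines of Python. This is only an illustration of the policy logic, not Sanger's configuration or an iRODS API: the resource names `green` and `red`, the function names, and the choice of MD5 as the checksum are all assumptions made for the example.

```python
import hashlib

# Hypothetical resource names; the slide only refers to "green" and "red".
PREFERRED, FALLBACK = "green", "red"


def md5_checksum(data: bytes) -> str:
    """Return the hex MD5 digest recorded for a replica (assumed algorithm)."""
    return hashlib.md5(data).hexdigest()


def place_and_replicate(data: bytes, available: set[str]) -> dict[str, str]:
    """Pick a primary data center (green first, red as fallback), then
    replicate to the other one if it is reachable.

    Returns a mapping of data center name to the replica's checksum.
    """
    if PREFERRED in available:
        primary, secondary = PREFERRED, FALLBACK
    elif FALLBACK in available:
        primary, secondary = FALLBACK, PREFERRED
    else:
        raise RuntimeError("no storage resource available")

    replicas = {primary: md5_checksum(data)}
    if secondary in available:
        replicas[secondary] = md5_checksum(data)
    return replicas


def readable_from(replicas: dict[str, str]) -> list[str]:
    """Both data centers serve reads, so every replica holder is eligible."""
    return sorted(replicas)
```

In a real deployment this logic would live in server-side policy rather than in client code, so users never have to choose a data center themselves.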

Sanger - Metadata

attribute: library
attribute: total_reads
attribute: type
attribute: lane
attribute: is_paired_read
attribute: study_accession_number
attribute: library_id
attribute: sample_accession_number
attribute: sample_public_name
attribute: manual_qc
attribute: tag
attribute: sample_common_name
attribute: md5
attribute: tag_index
attribute: study_title
attribute: study_id
attribute: reference
attribute: sample
attribute: target
attribute: sample_id
attribute: id_run
attribute: study
attribute: alignment

  • Example metadata attributes
  • Users query and access data from local compute clusters
  • Users access iRODS locally via the command line interface
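
To make the idea of metadata-driven discovery concrete, here is a toy in-memory model in Python. It is not the iRODS client API: the `DataObject` class, the `find` helper, and every path and value are invented for illustration, and real iRODS metadata is stored as attribute/value/unit triples in the catalog rather than as a flat dictionary. Only the attribute names (`study_id`, `lane`, `manual_qc`) come from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """A catalog entry: a logical path plus attribute/value metadata."""
    path: str
    metadata: dict[str, str] = field(default_factory=dict)


def find(catalog: list[DataObject], **criteria: str) -> list[str]:
    """Return the paths of objects whose metadata matches every criterion."""
    return [
        obj.path
        for obj in catalog
        if all(obj.metadata.get(attr) == value for attr, value in criteria.items())
    ]


# Illustrative entries only; paths and values are made up.
catalog = [
    DataObject("/zone/run1/lane1.cram", {"study_id": "100", "lane": "1", "manual_qc": "1"}),
    DataObject("/zone/run1/lane2.cram", {"study_id": "100", "lane": "2", "manual_qc": "0"}),
    DataObject("/zone/run2/lane1.cram", {"study_id": "200", "lane": "1", "manual_qc": "1"}),
]

# Find data from study 100 that passed manual QC.
print(find(catalog, study_id="100", manual_qc="1"))  # ['/zone/run1/lane1.cram']
```

The point of the sketch is that users ask for data by what it is (study, lane, QC status) instead of by where it is stored.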

Sanger - Federation

Maastricht DataHub

Berlin Institute of Health (BIH)

GA4GH Integration

GA4GH Data Repository Service (DRS) for iRODS

 

https://github.com/michael-conway/irods-ga4gh-dos

The GA4GH Data Repository Service (DRS) standard is part of a family of standards for distributed, federated data analysis. Using standard workflow languages such as WDL, CWL, and Nextflow, these standards allow workflows to dispatch containerized tasks to run at appropriate locations, including across cloud providers and on-premises compute environments. The DRS standard provides an abstraction over distributed data sources, allowing these workflow tasks to authorize data access and then retrieve the underlying data sets.

A DRS implementation over iRODS allows the iRODS data grid to expose data to this federated analysis ecosystem. The Federated Analysis System Project (FASP) components represent a formalization of the iRODS 'compute to data' pattern for the genomics and health community.
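
As a rough illustration of the abstraction DRS provides, the Python sketch below resolves a hostname-based `drs://` URI to the HTTPS endpoint defined by DRS v1 and builds a minimal object record with the fields the specification requires. The host name, object id, and helper names are hypothetical, and how the iRODS implementation maps data objects to DRS ids is not described here, so that part is left out.

```python
from urllib.parse import quote, urlparse


def drs_uri_to_url(drs_uri: str) -> str:
    """Map a hostname-based DRS URI to the HTTPS URL of its DrsObject."""
    parsed = urlparse(drs_uri)
    object_id = parsed.path.lstrip("/")
    if parsed.scheme != "drs" or not parsed.netloc or not object_id:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{quote(object_id, safe='')}"


def minimal_drs_object(host: str, object_id: str, size: int,
                       md5: str, created_time: str) -> dict:
    """Build a DrsObject-shaped dict with only the required DRS v1 fields."""
    return {
        "id": object_id,
        "self_uri": f"drs://{host}/{object_id}",
        "size": size,
        "created_time": created_time,
        "checksums": [{"type": "md5", "checksum": md5}],
    }


# Hypothetical host and id, for illustration only.
print(drs_uri_to_url("drs://drs.example.org/314159"))
# https://drs.example.org/ga4gh/drs/v1/objects/314159
```

A workflow task only needs the `drs://` identifier; the storage behind it, here an iRODS zone, stays hidden behind the service.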

Engineering Tradeoffs

Building these systems always involves a series of decisions made under multiple constraints.

A flexible solution is necessary.

Protocol Plumbing

  • WebDAV
  • FUSE
  • REST
  • NFS
  • SFTP
  • K8s CSI
  • S3

Imaging Working Group

Goal: to provide a standardized suite of imaging policies and practices for integration with existing tools and pipelines.

  • Open Microscopy Environment (and OMERO)
  • Neuroscience Microscopy Core at UNC School of Medicine
  • New York University
  • Santa Clara University
  • UC San Diego
  • UC Santa Cruz
  • UMass
  • Harvard
  • Maastricht University
  • Wellcome Sanger Institute
  • CyVerse
  • NIEHS
  • Netherlands Cancer Institute (NKI)
  • Francis Crick Institute
  • Fritz Lipmann Institute
  • Osnabrück University
  • RIKEN

Big Picture

iRODS is a flexible platform for building eHealth solutions:

  • Discovery - user-defined metadata catalog
  • Auditing - bookkeeping across disparate systems
  • Policy - full, programmatic environment

Big Picture

Proper data management requires policy enforcement.

These policies will change over time.

Open source is the best practice for a 100-year view.

Join Us Next Week - iRODS UGM2023

TNC23 - iRODS and eHealth

By iRODS Consortium
