iRODS Proof of Concept
Jason Coposky
@jason_coposky
Executive Director, iRODS Consortium
March 15, 2018
What is iRODS
iRODS is
- Distributed
- Open source
- Metadata Driven
- Data Centric
A flexible framework for the abstraction of infrastructure
iRODS as the Integration Layer
Data Virtualization
Combine various distributed storage technologies into a Unified Namespace
- Existing file systems
- Cloud storage
- On premises object storage
- Archival storage systems
iRODS provides a logical view into the complex physical representation of your data, distributed geographically, and at scale.
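As a rough illustration of the logical view described above, the following sketch maps one logical path to several physical replicas. It is a toy in-memory model, not the iRODS catalog or its API; the class names, zone name, resource names, and paths are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    resource: str       # hypothetical storage resource name
    physical_path: str  # where the bytes actually live


class LogicalNamespace:
    """Toy catalog: one logical path, many physical replicas."""

    def __init__(self):
        self._catalog = {}

    def register(self, logical_path, replica):
        # Record an additional physical copy under the same logical name.
        self._catalog.setdefault(logical_path, []).append(replica)

    def replicas(self, logical_path):
        # Return a copy so callers cannot mutate the catalog.
        return list(self._catalog.get(logical_path, []))


ns = LogicalNamespace()
logical = "/exampleZone/home/alice/run1.cram"  # hypothetical logical path
ns.register(logical, Replica("nfs_resc", "/mnt/nfs01/run1.cram"))
ns.register(logical, Replica("s3_resc", "s3://example-bucket/run1.cram"))

for replica in ns.replicas(logical):
    print(logical, "->", replica.resource, replica.physical_path)
```

The user only ever names the logical path; which file system or bucket holds the bytes is a detail of the catalog.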
Data Discovery
Attach metadata to any first class entity within the iRODS Zone
- Data Objects
- Collections
- Users
- Storage Resources
- The Namespace
iRODS provides automated and user-provided metadata which makes your data and infrastructure more discoverable, operational and valuable.
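iRODS metadata takes the form of attribute-value-unit (AVU) triples. The sketch below models that idea in memory to show how attaching metadata to different kinds of entities makes them discoverable. The `MetadataCatalog` class, the entity names, and the attribute names are hypothetical, and this is not how the iRODS catalog is implemented.

```python
from collections import defaultdict


class MetadataCatalog:
    """Toy store of attribute-value-unit (AVU) triples keyed by entity."""

    def __init__(self):
        self._avus = defaultdict(set)

    def add(self, entity, attribute, value, unit=""):
        self._avus[entity].add((attribute, value, unit))

    def find(self, **conditions):
        """Return entities whose AVUs satisfy every attribute=value pair."""
        matches = []
        for entity, avus in self._avus.items():
            pairs = {(a, v) for a, v, _ in avus}
            if all((a, v) in pairs for a, v in conditions.items()):
                matches.append(entity)
        return sorted(matches)


catalog = MetadataCatalog()
# A data object, a collection, and a storage resource can all carry metadata.
catalog.add("/exampleZone/home/alice/scan.tif", "instrument", "microscope_a")
catalog.add("/exampleZone/home/alice/scan.tif", "exposure", "30", "ms")
catalog.add("/exampleZone/home/alice", "project", "demo")
catalog.add("nfs_resc", "tier", "fast")

print(catalog.find(instrument="microscope_a", exposure="30"))
```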
Workflow Automation
Plugin framework supporting many languages, triggered by any operation within the system
- Authentication
- Storage Access
- Database Interaction
- Network Activity
- Extensible RPC API
iRODS rule engines provide the ability to capture real world policy as computer actionable rules which may allow, deny, or add context to operations within the system.
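To make the allow, deny, or add-context behaviour concrete, here is a minimal sketch of rules attached to an operation. It only imitates the idea; the `RuleEngine` class, the operation name `data_object_put`, and the 10 GiB limit are assumptions made for illustration and are not taken from iRODS.

```python
class PolicyDenied(Exception):
    """Raised by a rule to reject the operation."""


class RuleEngine:
    def __init__(self):
        self._rules = {}

    def on(self, operation):
        # Decorator that registers a rule for the named operation.
        def decorator(fn):
            self._rules.setdefault(operation, []).append(fn)
            return fn
        return decorator

    def fire(self, operation, context):
        # Run every rule in registration order; any rule may raise to deny.
        for rule in self._rules.get(operation, []):
            rule(context)
        return context


engine = RuleEngine()


@engine.on("data_object_put")
def deny_oversized(ctx):
    # Deny: hypothetical policy limiting object size.
    if ctx["size"] > 10 * 1024 ** 3:
        raise PolicyDenied("objects over 10 GiB are not accepted")


@engine.on("data_object_put")
def tag_owner(ctx):
    # Add context: record who ingested the object.
    ctx.setdefault("metadata", {})["ingested_by"] = ctx["user"]


result = engine.fire("data_object_put", {"user": "alice", "size": 1024})
print(result["metadata"])
```

A rule that returns normally allows the operation to proceed, so "allow" needs no special handling in this sketch.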
Secure Collaboration
iRODS allows for collaboration across administrative boundaries after deployment
- No need for common infrastructure
- No need for shared funding
- Affords temporary collaborations
iRODS provides the ability to federate namespaces across organizations without pre-coordinated funding or effort.
Institutional repositories
As data matures and reaches a broader community, data management policy must also evolve to meet these additional requirements.
Example Use Case
iRODS
The Wellcome Trust Sanger Institute
Sanger - Replication
- Data preferentially placed on resource servers in the green data center (fallback to red)
- Data replicated to the other room
- Checksums applied
- Green and red centers both used for read access
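The placement policy listed above can be sketched roughly as follows. This is a simplified model written for illustration, not Sanger's actual rules; the function name, the room labels as strings, and the choice of MD5 as the checksum are assumptions.

```python
import hashlib

PREFERENCE = ["green", "red"]  # green first, red as the fallback


def place_and_replicate(data, available):
    """Return (room, checksum) pairs for the replicas that would be written."""
    primary = next((room for room in PREFERENCE if room in available), None)
    if primary is None:
        raise RuntimeError("no data center available")

    # Checksum computed once and recorded with every replica.
    checksum = hashlib.md5(data).hexdigest()
    replicas = [(primary, checksum)]

    # Replicate to the other room when it is reachable.
    other = "red" if primary == "green" else "green"
    if other in available:
        replicas.append((other, checksum))
    return replicas


print(place_and_replicate(b"example bytes", {"green", "red"}))
print(place_and_replicate(b"example bytes", {"red"}))
```

Because both replicas carry the same checksum, either room can serve reads and a later integrity check can compare them.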
Sanger - Metadata
attribute: library
attribute: total_reads
attribute: type
attribute: lane
attribute: is_paired_read
attribute: study_accession_number
attribute: library_id
attribute: sample_accession_number
attribute: sample_public_name
attribute: manual_qc
attribute: tag
attribute: sample_common_name
attribute: md5
attribute: tag_index
attribute: study_title
attribute: study_id
attribute: reference
attribute: sample
attribute: target
attribute: sample_id
attribute: id_run
attribute: study
attribute: alignment
- Example metadata attributes
- Users query and access data from local compute clusters
- Users access iRODS locally via the command line interface
Sanger - Federation
Current Deployment
Current Proof of Concept
On Premises to Any Cloud Infrastructure
Current Infrastructure
Single 4 core VM hosting:
- PostgreSQL Database
- iRODS Catalog Provider
- File system Scanner
iRODS is presenting multiple NFS volumes and S3 buckets
File system Scanning
Cloud Synchronization
- Data in S3 buckets was registered in place
- Data in NFS is scanned and registered in place
- Data in NFS is considered the authoritative replica - S3 replica is marked Stale
- Capture file system metadata
- Capture file type metadata
- If the size or checksum does not match, log the out-of-date replica
- Future work: automatically synchronize to S3
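The mismatch check described above could look something like the sketch below, which treats the NFS copy as authoritative and flags the S3 copy when size or checksum differ. The `ReplicaInfo` type, the status strings, and the logger name are hypothetical; the actual proof of concept implements this as iRODS policy, which is not reproduced here.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("sync_sketch")  # hypothetical logger name


@dataclass
class ReplicaInfo:
    location: str
    size: int
    checksum: str
    status: str = "good"


def check_replica(authoritative, candidate):
    """Mark the candidate stale and log it if it differs from the authoritative copy."""
    same = (authoritative.size == candidate.size
            and authoritative.checksum == candidate.checksum)
    if not same:
        candidate.status = "stale"
        logger.warning("out-of-date replica: %s", candidate.location)
    return candidate.status


nfs = ReplicaInfo("/mnt/nfs01/run1.cram", 2048, "abc123")
s3_current = ReplicaInfo("s3://example-bucket/run1.cram", 2048, "abc123")
s3_old = ReplicaInfo("s3://example-bucket/run0.cram", 1024, "def456")

print(check_replica(nfs, s3_current))
print(check_replica(nfs, s3_old))
```

Automatic synchronization, listed as future work, would replace the log call with a copy from the authoritative replica.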
iRODS rule base - registration policy
Ingest Pipeline
iRODS Clients
Example Interfaces
Cloud Browser - Home Collection
Cloud Browser - Search
Cloud Browser - Results
Cloud Browser - Metadata
Command Line
Unix-like utilities which interact with the server
- iput - ingest data
- iget - extract data
- ils - list logical collections
- iquest - query the catalog with an SQL-like language
- imeta - set and query metadata
- Many more specialized commands...