Justin James

Applications Engineer

iRODS Consortium

June 2, 2022

Great Plains Network Annual Meeting

Kansas City, MO

Managing Petabytes of Data

Using iRODS

Our Membership

Why use iRODS?

People need a solution for:

  • Managing large amounts of data across various storage technologies
  • Controlling access to data
  • Searching their data quickly and efficiently
  • Automation

 

The larger the organization, the more they need software like iRODS.

Why use iRODS?  (Too Much Data)

"90% of the world's data created within the last two years"

  • Coming in too fast

  • Without good source information

  • Getting stored wherever there is room

  • Getting lost

  • Getting corrupted

  • Getting forgotten

Why use iRODS?  (Data Management Requirements)

Sample Project Requirements

 

For 10+ years, data must be:

  • Verified

  • Migrated

  • Kept in Duplicate

  • Made Accessible

  • Made Searchable

  • Monitored

Why use iRODS?

  • These long-term management tasks are too much for a curator or librarian, and certainly too much for the data scientists themselves to handle by hand.
     

  • There must be organizational policy in place to handle the varied scenarios of data retention, data access, and data use.
     

  • There must be automation in place to provide consistency and confidence in the process.

Why use iRODS?  (Data Management Requirements)

iRODS as the Integration Layer

The iRODS Data Management Model

  • Core Competencies
  • Policy
  • Capabilities
  • Patterns

iRODS Core Competencies

The underlying technology categorized into four areas

Data Virtualization

Combine various distributed storage technologies into a Unified Namespace

  • Existing file systems
  • Cloud storage
  • On premises object storage
  • Archival storage systems

iRODS provides a logical view into the complex physical representation of your data, distributed geographically, and at scale.

Data Virtualization - Projection of the Physical into the Logical

A single Logical Path is projected onto one or more Physical Path(s).
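As a rough sketch of what that projection looks like on the command line, ils -L lists each replica's logical name together with the physical path where it is stored. The file name and vault path below are hypothetical, and the output is abridged and approximate:

$ iput report.csv
$ ils -L report.csv
/tempZone/home/rods:
  rods              0 demoResc         1024 2022-05-17.18:20 & report.csv
        generic    /var/lib/irods/Vault/home/rods/report.csv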

Data Discovery

Attach metadata to any first class entity within the iRODS Zone

  • Data Objects
  • Collections
  • Users
  • Storage Resources
  • The Namespace

iRODS provides automated and user-provided metadata which makes your data and infrastructure more discoverable, operational and valuable.
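For example, metadata can be attached and queried from the command line with imeta and iquest; the attribute names, values, and paths below are hypothetical:

$ imeta add -d /tempZone/home/rods/f1 experiment sequencing_run_42
$ imeta add -d /tempZone/home/rods/f1 instrument novaseq
$ imeta ls -d /tempZone/home/rods/f1
$ iquest "SELECT COLL_NAME, DATA_NAME WHERE META_DATA_ATTR_NAME = 'experiment' AND META_DATA_ATTR_VALUE = 'sequencing_run_42'"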

Workflow Automation

Integrated scripting language which is triggered by any operation within the framework

  • Authentication
  • Storage Access
  • Database Interaction
  • Network Activity
  • Extensible RPC API 

The iRODS rule engine provides the ability to capture real world policy as computer actionable rules which may allow, deny, or add context to operations within the system.
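As a minimal sketch of the rule language, run here ad hoc with irule rather than wired to a policy enforcement point, the rule below tags an object with a provenance attribute. The rule name, attribute, and path are hypothetical; in a production deployment the same logic would typically live in the server's rule base and fire automatically.

$ cat tag_object.r
tagObject {
    # attach a provenance attribute to the object named on input
    msiModAVUMetadata("-d", *objPath, "set", "ingest_source", "gpn_demo", "");
    writeLine("stdout", "tagged " ++ *objPath);
}
INPUT *objPath="/tempZone/home/rods/f1"
OUTPUT ruleExecOut
$ irule -F tag_object.r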

Workflow Automation - Dynamic Policy Enforcement Points

Workflow Automation - Resource Hierarchies

  • A built-in special case for data storage policies.  Easy configuration via a resource tree.
     
  • Coordinating resources are the non-leaf nodes
    • They do not represent a physical storage system
    • They control the storage and retrieval policies
    • Examples:
      • Replication - Data replicated to all subtrees
      • Random - Data written to a random subtree
      • Compound - Has a cache and archive leaf
         
  • Storage resources are the leaf nodes.  They can represent POSIX filesystems, object stores (S3), or tape.
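For instance, a compound resource with a cache and an archive leaf can be assembled with a few iadmin commands; the resource names and vault paths below are hypothetical:

$ iadmin mkresc compResc compound
$ iadmin mkresc cacheResc unixfilesystem `hostname`:/var/lib/irods/cacheVault
$ iadmin mkresc archiveResc unixfilesystem `hostname`:/mnt/archive/archiveVault
$ iadmin addchildtoresc compResc cacheResc cache
$ iadmin addchildtoresc compResc archiveResc archive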

Secure Collaboration

iRODS allows for collaboration across administrative boundaries after deployment

  • No need for common infrastructure
  • No need for shared funding
  • Affords temporary collaborations

iRODS provides the ability to federate namespaces across organizations without pre-coordinated funding or effort.
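As a sketch of what that looks like administratively (the zone name, host, user, and collection below are hypothetical, and the remote zone performs the mirror-image steps):

$ iadmin mkzone partnerZone remote partner.example.org:1247
$ iadmin mkuser alice#partnerZone rodsuser
$ ichmod read alice#partnerZone /tempZone/home/rods/shared
$ ils /partnerZone/home/alice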

iRODS Capabilities

  • Packaged and supported solutions
  • Require configuration, not code
  • Derived from the majority of use cases observed in the user community

iRODS Capabilities Example - Storage Tiering

  • Easy to Set Up

    • Install the plugin (apt, yum)

    • Enable the plugin

    • Add resources (if they don't already exist)

    • Add some metadata on the resources (configuration)

      • Assign resource to tier group

      • Set tier time constraints

  • No code required, although tier violation queries can be customized (an example of the resource metadata follows)
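The resource metadata might look like the following, assuming hypothetical fast_resc and slow_resc resources; the attributes place each resource at a position in a tier group and set how long data may rest on a tier (in seconds) before migrating:

$ imeta add -R fast_resc irods::storage_tiering::group example_group 0
$ imeta add -R slow_resc irods::storage_tiering::group example_group 1
$ imeta add -R fast_resc irods::storage_tiering::time 1800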

iRODS Capabilities Example - Automated Ingest

Monitors a filesystem and copies or registers files into iRODS

  • Also follows the configuration not code paradigm
Operation                            | New Files                                              | Updated Files
-------------------------------------|--------------------------------------------------------|--------------------------------------------------------
Operation.REGISTER_SYNC (default)    | registers in catalog                                   | updates size in catalog
Operation.REGISTER_AS_REPLICA_SYNC   | registers first or additional replica                  | updates size in catalog
Operation.PUT                        | copies file to target vault, and registers in catalog  | no action
Operation.PUT_SYNC                   | copies file to target vault, and registers in catalog  | copies entire file again, and updates catalog
Operation.PUT_APPEND                 | copies file to target vault, and registers in catalog  | copies only appended part of file, and updates catalog
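As a hedged sketch, a sync job is started with the pip-installed package; a Redis server and Celery workers must already be running, the source and destination paths below are hypothetical, and the Operation is chosen by an optional event handler module (see the capability's documentation for the exact invocation in your version):

$ pip install irods-capability-automated-ingest
$ python -m irods_capability_automated_ingest.irods_sync start /data/instrument /tempZone/home/rods/instrument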

The Data Management Model

Data Storage Demo

As a demonstration of the concepts I have introduced, we will start with a simple file replication example.

 

First, I will create a resource tree with only unix filesystem resources.

  • Note that in a real system these resources would likely be on different geographically-separated servers.
$ iadmin mkresc resc1 unixfilesystem `hostname`:`pwd`/resc1
$ iadmin mkresc resc2 unixfilesystem `hostname`:`pwd`/resc2
$ iadmin mkresc resc3 unixfilesystem `hostname`:`pwd`/resc3
$ iadmin mkresc replresc replication
$ iadmin addchildtoresc replresc resc1
$ iadmin addchildtoresc replresc resc2
$ iadmin addchildtoresc replresc resc3
$ ilsresc
demoResc:unixfilesystem
replresc:replication
├── resc1:unixfilesystem
├── resc2:unixfilesystem
└── resc3:unixfilesystem

Data Storage Demo

Now let's put a few files into the system. Note three replicas of each.

$ truncate --size 120M f1
$ truncate --size 120M f2
$ truncate --size 120M f3
$ iput -R replresc f1
$ iput -R replresc f2
$ iput -R replresc f3
$ ils -l
/tempZone/home/rods:
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f1
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f1
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f1
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f2
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f2
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f2
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f3
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f3
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f3

Data Storage Demo

But we really want to store data in the cloud.

 

Let's create an S3 resource:

$ iadmin mkresc s3resc s3 `hostname`:/justinkylejames-irods1/s3resc "S3_DEFAULT_HOSTNAME=s3.amazonaws.com;S3_AUTH_FILE=/var/lib/irods/amazon.keypair;S3_REGIONNAME=us-east-1;S3_RETRY_COUNT=3;S3_WAIT_TIME_SEC=3;S3_PROTO=HTTP;HOST_MODE=cacheless_attached;S3_ENABLE_MD5=1;S3_SIGNATURE_VERSION=4;S3_ENABLE_MPU=1;ARCHIVE_NAMING_POLICY=consistent;S3_CACHE_DIR=/var/lib/irods;CIRCULAR_BUFFER_SIZE=2;DEBUG_LOGGING=true;S3_STSDATE=both"

$ ilsresc

demoResc:unixfilesystem
s3resc:s3
replresc:replication
├── resc1:unixfilesystem
├── resc2:unixfilesystem
└── resc3:unixfilesystem

Data Storage Demo

Now let's create a couple of large files and write them to our S3 bucket.

$ truncate --size 1G f4
$ truncate --size 1G f5
$ iput -R s3resc f4
$ iput -R s3resc f5
$ ils -l
/tempZone/home/rods:
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f1
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f1
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f1
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f2
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f2
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f2
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f3
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f3
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f3
  rods              0 s3resc   1073741824 2022-05-17.18:38 & f4
  rods              0 s3resc   1073741824 2022-05-17.18:39 & f5


Data Storage Demo

But I would like f4 and f5 to also exist on the local resources, so let's force a replication...

irods@cf4921416f3a:~$ irepl -R replresc f4
irods@cf4921416f3a:~$ irepl -R replresc f5
irods@cf4921416f3a:~$ ils -l
/tempZone/home/rods:
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f1
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f1
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f1
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f2
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f2
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f2
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f3
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f3
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f3
  rods              0 s3resc   1073741824 2022-05-17.18:38 & f4
  rods              1 replresc;resc2   1073741824 2022-05-17.18:41 & f4
  rods              2 replresc;resc3   1073741824 2022-05-17.18:41 & f4
  rods              3 replresc;resc1   1073741824 2022-05-17.18:41 & f4
  rods              0 s3resc   1073741824 2022-05-17.18:39 & f5
  rods              1 replresc;resc2   1073741824 2022-05-17.18:41 & f5
  rods              2 replresc;resc3   1073741824 2022-05-17.18:41 & f5
  rods              3 replresc;resc1   1073741824 2022-05-17.18:41 & f5

Data Storage Demo

Let's replicate f1 to S3.

$ irepl -R s3resc f1                  
$ ils -l                              
/tempZone/home/rods:                                     
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f1
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f1
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f1
  rods              3 s3resc    125829120 2022-05-17.18:43 & f1       
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f2
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f2
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f2
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f3
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f3
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f3
  rods              0 s3resc   1073741824 2022-05-17.18:38 & f4
  rods              1 replresc;resc2   1073741824 2022-05-17.18:41 & f4
  rods              2 replresc;resc3   1073741824 2022-05-17.18:41 & f4
  rods              3 replresc;resc1   1073741824 2022-05-17.18:41 & f4
  rods              0 s3resc   1073741824 2022-05-17.18:39 & f5
  rods              1 replresc;resc2   1073741824 2022-05-17.18:41 & f5
  rods              2 replresc;resc3   1073741824 2022-05-17.18:41 & f5
  rods              3 replresc;resc1   1073741824 2022-05-17.18:41 & f5

Data Storage Demo - Automatic Replication

But maybe we just want the S3 resource to automatically replicate to the others and vice versa

$ iadmin addchildtoresc replresc s3resc
$ ilsresc
demoResc:unixfilesystem
replresc:replication
├── resc1:unixfilesystem
├── resc2:unixfilesystem
├── resc3:unixfilesystem
└── s3resc:s3

Data Storage Demo

We will rebalance replresc so that everything in that tree exists everywhere, since some objects were created before we modified the tree.

irods@cf4921416f3a:~$ iadmin modresc replresc rebalance
irods@cf4921416f3a:~$ ils -l
/tempZone/home/rods:
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f1
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f1
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f1
  rods              3 replresc;s3resc    125829120 2022-05-17.18:43 & f1
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f2
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f2
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f2
  rods              3 replresc;s3resc    125829120 2022-05-17.18:50 & f2
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f3
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f3
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f3
  rods              3 replresc;s3resc    125829120 2022-05-17.18:50 & f3
  rods              0 replresc;s3resc   1073741824 2022-05-17.18:38 & f4
  rods              1 replresc;resc2   1073741824 2022-05-17.18:41 & f4
  rods              2 replresc;resc3   1073741824 2022-05-17.18:41 & f4
  rods              3 replresc;resc1   1073741824 2022-05-17.18:41 & f4
  rods              0 replresc;s3resc   1073741824 2022-05-17.18:39 & f5
  rods              1 replresc;resc2   1073741824 2022-05-17.18:41 & f5
  rods              2 replresc;resc3   1073741824 2022-05-17.18:41 & f5
  rods              3 replresc;resc1   1073741824 2022-05-17.18:41 & f5

S3 Performance - Upload

  • 10 transfer threads (each)

 

  • Uploads at every 100 MB increment between 100 MB and 3200 MB

 

  • Median time shown of 5 uploads for each size

S3 Performance - Download

  • 10 transfer threads (each)

 

  • Downloads at every 100 MB increment between 100 MB and 3200 MB

 

  • Median time shown of 5 downloads for each size
