Justin James

Applications Engineer

iRODS Consortium

June 2, 2022

Great Plains Network Annual Meeting

Kansas City, MO

Managing Petabytes of Data

Using iRODS

Our Membership

Consortium

Member

Why use iRODS?

People need a solution for:

Managing large amounts of data across various storage technologies
Controlling access to data
Searching their data quickly and efficiently
Automation

The larger the organization, the more they need software like iRODS.

Why use iRODS? (Too Much Data)

"90% of the world's data created within the last two years"

Coming in too fast
Without good source information
Getting stored wherever there is room
Getting lost
Getting corrupted
Getting forgotten

Why use iRODS? (Data Management Requirements)

Sample Project Requirements

For 10+ years, data must be:

Verified
Migrated
Kept in Duplicate
Made Accessible
Made Searchable
Monitored

Why use iRODS?

These long-term management tasks are too much for a curator or librarian, and certainly too much for the data scientists themselves to handle by hand.
There must be organizational policy in place to handle the varied scenarios of data retention, data access, and data use.
There must be automation in place to provide consistency and confidence in the process.

Why use iRODS? (Data Management Requirements)

iRODS as the Integration Layer

The iRODS Data Management Model

Core Competencies

Policy

Capabilities

Patterns

iRODS Core Competencies

The underlying technology categorized into four areas

Data Virtualization

Combine various distributed storage technologies into a Unified Namespace

Existing file systems
Cloud storage
On premises object storage
Archival storage systems

iRODS provides a logical view into the complex physical representation of your data, distributed geographically, and at scale.

Data Virtualization - Projection of the Physical into the Logical

Logical Path

Physical Path(s)

Data Discovery

Attach metadata to any first-class entity within the iRODS Zone

Data Objects
Collections
Users
Storage Resources
The Namespace

iRODS provides automated and user-provided metadata that makes your data and infrastructure more discoverable, operational, and valuable.
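
As a rough illustration of metadata-driven discovery, the sketch below attaches attribute/value/unit triples to different kinds of entities and searches across them. It is a toy in-memory model, not the iRODS catalog or its API: the `MetadataCatalog` class and the attribute names `project` and `media` are hypothetical.

```python
# Toy in-memory metadata catalog. Not iRODS code: all names are hypothetical.
from collections import defaultdict


class MetadataCatalog:
    def __init__(self):
        # entity name -> set of (attribute, value, unit) triples
        self._avus = defaultdict(set)

    def add(self, entity, attribute, value, unit=""):
        self._avus[entity].add((attribute, value, unit))

    def find(self, attribute, value):
        """Return every entity carrying the given attribute/value pair."""
        return sorted(
            entity
            for entity, avus in self._avus.items()
            if any(a == attribute and v == value for a, v, _ in avus)
        )


catalog = MetadataCatalog()
catalog.add("/tempZone/home/rods/f1", "project", "gpn2022")  # data object
catalog.add("/tempZone/home/rods", "project", "gpn2022")     # collection
catalog.add("resc1", "media", "disk")                        # storage resource
```

Because data objects, collections, and resources share one metadata model, a single query can return a mix of entity types.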

Workflow Automation

An integrated scripting language that can be triggered by any operation within the framework

Authentication
Storage Access
Database Interaction
Network Activity
Extensible RPC API

The iRODS rule engine provides the ability to capture real-world policy as computer-actionable rules, which may allow, deny, or add context to operations within the system.
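
The idea of rules that allow, deny, or add context to an operation can be sketched in a few lines. This is a toy model, not the iRODS rule engine or its API: the `PolicyEngine` class, the operation name `data_obj_put`, and the quarantine rule are hypothetical names chosen for illustration.

```python
# Toy model of policy enforcement points. Not the iRODS rule engine:
# every name below is hypothetical.
from collections import defaultdict


class PolicyDenied(Exception):
    """Raised by a pre-operation rule to deny the operation."""


class PolicyEngine:
    def __init__(self):
        self._pre = defaultdict(list)   # rules run before an operation
        self._post = defaultdict(list)  # rules run after an operation

    def pre(self, operation):
        def register(rule):
            self._pre[operation].append(rule)
            return rule
        return register

    def post(self, operation):
        def register(rule):
            self._post[operation].append(rule)
            return rule
        return register

    def run(self, operation, action, context):
        for rule in self._pre[operation]:
            rule(context)        # may raise PolicyDenied
        result = action(context)
        for rule in self._post[operation]:
            rule(context)        # may add context, such as metadata
        return result


engine = PolicyEngine()


@engine.pre("data_obj_put")
def deny_quarantined_users(context):
    if context["user"] in context.get("quarantined", set()):
        raise PolicyDenied(f"{context['user']} may not write data")


@engine.post("data_obj_put")
def record_origin(context):
    context.setdefault("metadata", {})["uploaded_by"] = context["user"]


def put(context):
    """Stand-in for the real operation being guarded."""
    return f"stored {context['path']}"
```

Calling `engine.run("data_obj_put", put, {...})` runs the pre rules, then the operation, then the post rules, so policy wraps the operation without the operation knowing about it.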

Workflow Automation - Dynamic Policy Enforcement Points

Workflow Automation - Resource Hierarchies

A built-in special case for data storage policies. Easy configuration via a resource tree.
Coordinating resources are the non-leaf nodes
- They do not represent a physical storage system
- They control the storage and retrieval policies
- Examples:
  - Replication - Data replicated to all subtrees
  - Random - Data written to a random subtree
  - Compound - Has a cache and archive leaf
Storage resources are the leaf nodes. They can represent POSIX filesystems, object stores (S3), or tape.
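
The split between coordinating and storage resources can be sketched as a small tree in which only the leaves hold data. This is a toy model under simplifying assumptions, not iRODS code: the class names are hypothetical, and real coordinating resources make richer decisions than shown here.

```python
# Toy model of a resource tree. Not iRODS code: class names are hypothetical.
import random


class StorageResource:
    """Leaf node: stands in for a physical storage system."""

    def __init__(self, name):
        self.name = name
        self.objects = set()

    def write(self, data_object):
        self.objects.add(data_object)
        return [self.name]


class Replication:
    """Coordinating node: writes to every child subtree."""

    def __init__(self, name, children):
        self.name = name
        self.children = children

    def write(self, data_object):
        written = []
        for child in self.children:
            written.extend(child.write(data_object))
        return written


class Random:
    """Coordinating node: writes to one randomly chosen child subtree."""

    def __init__(self, name, children, rng=None):
        self.name = name
        self.children = children
        self.rng = rng or random.Random()

    def write(self, data_object):
        return self.rng.choice(self.children).write(data_object)


leaves = [StorageResource(f"resc{i}") for i in (1, 2, 3)]
replresc = Replication("replresc", leaves)
```

Writing to `replresc` returns the names of all three leaves, while a `Random` node over the same leaves would return exactly one.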

Secure Collaboration

iRODS allows for collaboration across administrative boundaries after deployment

No need for common infrastructure
No need for shared funding
Affords temporary collaborations

iRODS provides the ability to federate namespaces across organizations without pre-coordinated funding or effort.

iRODS Capabilities

Packaged and supported solutions
Require configuration not code
Derived from the majority of use cases observed in the user community

iRODS Capabilities Example - Storage Tiering

Easy to Set Up
- Install the plugin (apt, yum)
- Enable the plugin
- Add resources (if they don't already exist)
- Add some metadata on the resources (configuration)
  - Assign resource to tier group
  - Set tier time constraints
No code required, although tier violation queries can be customized
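
To make the idea of tier groups and time constraints concrete, here is a minimal sketch of how a violation check might decide where an object belongs. It is not the storage tiering plugin: the `Tier` fields, the resource names, and the rule that an object violates its tier once its time since last access exceeds the tier's limit are all assumptions made for illustration.

```python
# Toy model of tier time constraints. Not the storage tiering plugin:
# the Tier fields and the age-based violation rule are assumptions.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    resource: str                   # resource assigned to the tier group
    position: int                   # order within the group, 0 = first tier
    max_age_seconds: Optional[int]  # None = last tier, data stays here


def next_tier_for(tiers, current_resource, seconds_since_access):
    """Return the resource the object should move to, or None if it stays."""
    ordered = sorted(tiers, key=lambda t: t.position)
    for i, tier in enumerate(ordered):
        if tier.resource != current_resource:
            continue
        violates = (
            tier.max_age_seconds is not None
            and seconds_since_access > tier.max_age_seconds
        )
        if violates and i + 1 < len(ordered):
            return ordered[i + 1].resource
        return None
    raise ValueError(f"{current_resource} is not in this tier group")


# Hypothetical three-tier group.
group = [
    Tier("fast_resc", 0, 30 * 60),         # 30 minutes
    Tier("medium_resc", 1, 24 * 60 * 60),  # 1 day
    Tier("slow_resc", 2, None),
]
```

In this model the whole policy lives in the `group` list, which mirrors the point that tiering is driven by configuration on the resources rather than by custom code.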

iRODS Capabilities Example - Automated Ingest

Monitors a filesystem and copies or registers files into iRODS

Also follows the configuration not code paradigm

Operation	New Files	Updated Files
Operation.REGISTER_SYNC (default)	registers in catalog	updates size in catalog
Operation.REGISTER_AS_REPLICA_SYNC	registers first or additional replica	updates size in catalog
Operation.PUT	copies file to target vault, and registers in catalog	no action
Operation.PUT_SYNC	copies file to target vault, and registers in catalog	copies entire file again, and updates catalog
Operation.PUT_APPEND	copies file to target vault, and registers in catalog	copies only appended part of file, and updates catalog
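
One practical consequence of the table is how much data each operation moves. The sketch below encodes the table as a function that returns the bytes copied to the target vault. The `Operation` enum here is a local stand-in that mirrors the names in the table, not the ingest tool's own class, and the model assumes an updated file has only grown.

```python
# Toy model of the operations table above. The Operation enum is a local
# stand-in, and updated files are assumed to have only grown.
from enum import Enum


class Operation(Enum):
    REGISTER_SYNC = "REGISTER_SYNC"
    REGISTER_AS_REPLICA_SYNC = "REGISTER_AS_REPLICA_SYNC"
    PUT = "PUT"
    PUT_SYNC = "PUT_SYNC"
    PUT_APPEND = "PUT_APPEND"


def bytes_copied(operation, new_size, old_size=None):
    """Bytes copied to the target vault for one file.

    old_size is None for a new file, or the previously ingested size
    for an updated file.
    """
    if operation in (Operation.REGISTER_SYNC, Operation.REGISTER_AS_REPLICA_SYNC):
        return 0                # file stays in place; only the catalog changes
    if old_size is None:
        return new_size         # new file: every PUT variant copies it whole
    if operation is Operation.PUT:
        return 0                # updates are ignored
    if operation is Operation.PUT_SYNC:
        return new_size         # entire file copied again
    return new_size - old_size  # PUT_APPEND: only the appended part
```

For a log file that grows from 100 bytes to 130 bytes, this model gives 130 bytes copied under `PUT_SYNC` and 30 bytes under `PUT_APPEND`.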

The Data Management Model

Data Storage Demo

As a demonstration of the concepts I have introduced, we will start with a simple file replication example.

First, I will create a resource tree with only unix filesystem resources.

Note that in a real system these resources would likely be on different geographically-separated servers.

$ iadmin mkresc resc1 unixfilesystem `hostname`:`pwd`/resc1

$ iadmin mkresc resc2 unixfilesystem `hostname`:`pwd`/resc2

$ iadmin mkresc resc3 unixfilesystem `hostname`:`pwd`/resc3

$ iadmin mkresc replresc replication

$ iadmin addchildtoresc replresc resc1

$ iadmin addchildtoresc replresc resc2

$ iadmin addchildtoresc replresc resc3
$ ilsresc
demoResc:unixfilesystem
replresc:replication
├── resc1:unixfilesystem
├── resc2:unixfilesystem
└── resc3:unixfilesystem

Data Storage Demo

Now let's put a few files into the system. Note that each file ends up with three replicas, one on each child of replresc.

$ truncate --size 120M f1
$ truncate --size 120M f2
$ truncate --size 120M f3
$ iput -R replresc f1
$ iput -R replresc f2
$ iput -R replresc f3
$ ils -l
/tempZone/home/rods:
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f1
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f1
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f1
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f2
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f2
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f2
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f3
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f3
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f3
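
The size column in the listing is in bytes, and `truncate` treats the `M` and `G` suffixes as binary multiples. A quick check of the arithmetic, using a small hypothetical helper:

```python
# Convert truncate-style sizes to bytes. size_in_bytes is a hypothetical
# helper that handles only the K, M, and G suffixes.
SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3}


def size_in_bytes(spec):
    """Convert a size such as '120M' or '1G' to bytes (binary multiples)."""
    suffix = spec[-1].upper()
    if suffix not in SUFFIXES:
        return int(spec)  # plain byte count, no suffix
    return int(spec[:-1]) * SUFFIXES[suffix]


print(size_in_bytes("120M"))      # 125829120
print(3 * size_in_bytes("120M"))  # 377487360 bytes stored for three replicas
```

Each 120M file therefore occupies about 360 MiB across the three child resources.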

Data Storage Demo

But we really want to store data in the cloud.

Let's create an S3 resource:

$ iadmin mkresc s3resc s3 `hostname`:/justinkylejames-irods1/s3resc "S3_DEFAULT_HOSTNAME=s3.amazonaws.com;S3_AUTH_FILE=/var/lib/irods/amazon.keypair;S3_REGIONNAME=us-east-1;S3_RETRY_COUNT=3;S3_WAIT_TIME_SEC=3;S3_PROTO=HTTP;HOST_MODE=cacheless_attached;S3_ENABLE_MD5=1;S3_SIGNATURE_VERSION=4;S3_ENABLE_MPU=1;ARCHIVE_NAMING_POLICY=consistent;S3_CACHE_DIR=/var/lib/irods;CIRCULAR_BUFFER_SIZE=2;DEBUG_LOGGING=true;S3_STSDATE=both"

$ ilsresc
demoResc:unixfilesystem
s3resc:s3
replresc:replication
├── resc1:unixfilesystem
├── resc2:unixfilesystem
└── resc3:unixfilesystem

Data Storage Demo

Now let's create a couple of large files and write them to our S3 bucket.

$ truncate --size 1G f4
$ truncate --size 1G f5

$ iput -R s3resc f4
$ iput -R s3resc f5
$ ils -l
/tempZone/home/rods:
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f1
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f1
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f1
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f2
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f2
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f2
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f3
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f3
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f3
  rods              0 s3resc   1073741824 2022-05-17.18:38 & f4
  rods              0 s3resc   1073741824 2022-05-17.18:39 & f5

Data Storage Demo

But I would also like f4 and f5 to exist on the local resources, so let's replicate them explicitly.

$ irepl -R replresc f4
$ irepl -R replresc f5
$ ils -l
/tempZone/home/rods:
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f1
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f1
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f1
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f2
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f2
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f2
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f3
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f3
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f3
  rods              0 s3resc   1073741824 2022-05-17.18:38 & f4
  rods              1 replresc;resc2   1073741824 2022-05-17.18:41 & f4
  rods              2 replresc;resc3   1073741824 2022-05-17.18:41 & f4
  rods              3 replresc;resc1   1073741824 2022-05-17.18:41 & f4
  rods              0 s3resc   1073741824 2022-05-17.18:39 & f5
  rods              1 replresc;resc2   1073741824 2022-05-17.18:41 & f5
  rods              2 replresc;resc3   1073741824 2022-05-17.18:41 & f5
  rods              3 replresc;resc1   1073741824 2022-05-17.18:41 & f5

Data Storage Demo

Let's replicate f1 to S3.

$ irepl -R s3resc f1
$ ils -l
/tempZone/home/rods:
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f1
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f1
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f1
  rods              3 s3resc    125829120 2022-05-17.18:43 & f1
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f2
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f2
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f2
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f3
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f3
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f3
  rods              0 s3resc   1073741824 2022-05-17.18:38 & f4
  rods              1 replresc;resc2   1073741824 2022-05-17.18:41 & f4
  rods              2 replresc;resc3   1073741824 2022-05-17.18:41 & f4
  rods              3 replresc;resc1   1073741824 2022-05-17.18:41 & f4
  rods              0 s3resc   1073741824 2022-05-17.18:39 & f5
  rods              1 replresc;resc2   1073741824 2022-05-17.18:41 & f5
  rods              2 replresc;resc3   1073741824 2022-05-17.18:41 & f5
  rods              3 replresc;resc1   1073741824 2022-05-17.18:41 & f5

Data Storage Demo - Automatic Replication

But maybe we just want the S3 resource to replicate automatically to the others, and vice versa.

$ iadmin addchildtoresc replresc s3resc
$ ilsresc
demoResc:unixfilesystem
replresc:replication
├── resc1:unixfilesystem
├── resc2:unixfilesystem
├── resc3:unixfilesystem
└── s3resc:s3

Data Storage Demo

Because some objects were created before we modified the tree, we will rebalance replresc so that every object in the tree has a replica on every child resource.

$ iadmin modresc replresc rebalance
$ ils -l
/tempZone/home/rods:
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f1
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f1
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f1
  rods              3 replresc;s3resc    125829120 2022-05-17.18:43 & f1
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f2
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f2
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f2
  rods              3 replresc;s3resc    125829120 2022-05-17.18:50 & f2
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f3
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f3
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f3
  rods              3 replresc;s3resc    125829120 2022-05-17.18:50 & f3
  rods              0 replresc;s3resc   1073741824 2022-05-17.18:38 & f4
  rods              1 replresc;resc2   1073741824 2022-05-17.18:41 & f4
  rods              2 replresc;resc3   1073741824 2022-05-17.18:41 & f4
  rods              3 replresc;resc1   1073741824 2022-05-17.18:41 & f4
  rods              0 replresc;s3resc   1073741824 2022-05-17.18:39 & f5
  rods              1 replresc;resc2   1073741824 2022-05-17.18:41 & f5
  rods              2 replresc;resc3   1073741824 2022-05-17.18:41 & f5
  rods              3 replresc;resc1   1073741824 2022-05-17.18:41 & f5
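
With more objects than fit on a slide, the result of a rebalance is easier to check programmatically. The sketch below parses text in the `ils -l` layout shown above and reports any object that is missing a replica on an expected leaf resource. It is a minimal sketch: both function names are hypothetical, and it assumes seven whitespace-separated fields per replica line, a `&` status, and object names without spaces.

```python
# Check replica coverage from `ils -l` style text. Assumes the column layout
# shown in the listings above; function names are hypothetical.
from collections import defaultdict


def replicas_by_object(listing):
    """Map each data object name to the set of leaf resources holding it."""
    leaves = defaultdict(set)
    for line in listing.splitlines():
        fields = line.split()
        # Expect: owner, replica number, hierarchy, size, timestamp, '&', name
        if len(fields) != 7 or fields[5] != "&":
            continue  # skip the collection header and any other layout
        hierarchy, name = fields[2], fields[6]
        leaves[name].add(hierarchy.split(";")[-1])  # last element is the leaf
    return dict(leaves)


def missing_replicas(listing, expected_leaves):
    """Return {object name: sorted list of leaves with no replica}."""
    expected = set(expected_leaves)
    return {
        name: sorted(expected - found)
        for name, found in replicas_by_object(listing).items()
        if expected - found
    }


sample = """/tempZone/home/rods:
  rods              0 replresc;resc2    125829120 2022-05-17.18:23 & f1
  rods              1 replresc;resc3    125829120 2022-05-17.18:23 & f1
  rods              2 replresc;resc1    125829120 2022-05-17.18:23 & f1
  rods              3 replresc;s3resc    125829120 2022-05-17.18:43 & f1
"""
print(missing_replicas(sample, ["resc1", "resc2", "resc3", "s3resc"]))  # {}
```

An empty result means every listed object has a replica on each expected leaf, which is the state the rebalance is meant to reach.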

S3 Performance - Upload

10 transfer threads (each)

Upload sizes from 100 MB to 3200 MB, in 100 MB steps

Median time of 5 uploads shown for each size

S3 Performance - Download

10 transfer threads (each)

Download sizes from 100 MB to 3200 MB, in 100 MB steps

Median time of 5 downloads shown for each size

GPN 2022 - Managing Petabytes of Data Using iRODS

By iRODS Consortium
