Introduction to

Data Management

July 20, 2018

NSF CI CyberCarpentry Workshop

UNC-Chapel Hill

Terrell Russell, Ph.D.

@terrellrussell

Chief Technologist, iRODS Consortium

Motivation

  • Many petabytes of data are being generated
    • constantly
    • by every type of organization
    • globally
  • Infrastructure is constantly changing
  • Science is increasingly driven by data and software
  • Reproducible scientific results are critical to progress
  • Collaboration within and across organizational boundaries accelerates discovery
  • Organizations must demonstrate compliance with security standards

 

What is the system and model to solve these problems in an automated way?

Motivation

A Definition of Data Management

 

"The development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets."

 

 


Organizations need a future-proof solution for managing data and its surrounding infrastructure.

What is iRODS?

iRODS is

  • Distributed
  • Open source
  • Metadata Driven
  • Data Centric

 

A flexible framework for the abstraction of infrastructure

iRODS as the Integration Layer

Creating the Machine

What are the necessary components to build such a system?

 

  • Where do we maintain state?
  • How do we handle command and control?
  • How do we manage data movement?
  • What guarantees can be made regarding semantics?

iRODS Architecture

  • Metadata Catalog - where we write everything down
  • Catalog Service Provider - provides access to the Catalog
  • Catalog Service Consumer - distributed nodes to provide access to storage and other resources

 

iRODS is ultimately a Catalog and an RPC API

Catalog Service Consumers

Servers which provide access to storage resources

  • Connect to the Catalog Service Provider for
    • resource configuration
    • authentication
    • system metadata
    • user-assigned metadata
  • Provide scalable access to iRODS services
  • May be geographically distributed
  • May have an arbitrary number of resources attached

Catalog Service Provider

Same capabilities as the Consumer with the addition of a database plugin

  • May serve storage capabilities
  • Provides access to the metadata catalog
  • May be placed in a High Availability configuration for failover and load balancing

Metadata Catalog

  • Relational Database
    • PostgreSQL, MySQL/MariaDB, Oracle, CockroachDB
  • Single source of truth for the Zone
  • Holds users, groups, resources, system metadata, user metadata
  • Co-resident with the iRODS server or hosted on a separate database cluster
  • Referenced by a database plugin implemented with ODBC
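
The catalog's role as a relational "single source of truth" can be illustrated with a small sketch. This is not the real iRODS schema: the table names, column names, and host names below are hypothetical, and SQLite stands in for the production databases listed above. It only shows the idea that one query against the catalog tells a client where every replica of a logical path lives.

```python
import sqlite3

# In-memory stand-in for the catalog database (hypothetical, simplified schema).
catalog = sqlite3.connect(":memory:")
catalog.executescript("""
CREATE TABLE resources (
    name TEXT PRIMARY KEY,
    host TEXT NOT NULL
);
CREATE TABLE replicas (
    logical_path   TEXT NOT NULL,
    replica_number INTEGER NOT NULL,
    resource       TEXT NOT NULL REFERENCES resources(name),
    physical_path  TEXT NOT NULL,
    PRIMARY KEY (logical_path, replica_number)
);
""")

# Host names are invented for the example.
catalog.executemany(
    "INSERT INTO resources VALUES (?, ?)",
    [("demoResc", "provider.example.org"), ("u1", "consumer1.example.org")],
)
catalog.executemany(
    "INSERT INTO replicas VALUES (?, ?, ?, ?)",
    [
        ("/tempZone/home/rods/thefile.txt", 0, "demoResc",
         "/var/lib/irods/iRODS/Vault/home/rods/thefile.txt"),
        ("/tempZone/home/rods/thefile.txt", 1, "u1",
         "/tmp/u1vault/home/rods/thefile.txt"),
    ],
)


def locate(logical_path):
    """Return (replica number, host, physical path) for every replica."""
    return catalog.execute(
        """
        SELECT r.replica_number, s.host, r.physical_path
        FROM replicas AS r
        JOIN resources AS s ON s.name = r.resource
        WHERE r.logical_path = ?
        ORDER BY r.replica_number
        """,
        (logical_path,),
    ).fetchall()
```

Because every server consults the same tables, losing or corrupting them affects the whole Zone.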

 

What are some limitations of this design?

Limitations of System Design

  • Catalog Service Provider represents a single point of failure
  • The Catalog may be corrupted or fail entirely
  • Data may be made unavailable by a server failure
  • Storage may be corrupted or fail entirely

Addressing System Limitations

  • Catalog Service Provider represents a single point of failure
    • Cluster behind a Proxy
  • The Catalog may be corrupted or fail entirely
    • Cluster with replication or multi-master
  • Data may be made unavailable by a server failure
    • Provide replication of data for durability
  • Storage may be corrupted or fail entirely
    • Replication and Backup strategies

What to consider in an iRODS deployment

  • Number of users and expected simultaneous connections
  • Expected ingest rate
  • Sizes of files
    • many small
  • Partial read / write vs whole file usage
  • Replication for durability
  • Replication for locality of reference
  • Load balancing vs High Availability

 

iRODS will run on a laptop or a rack of servers

 

What does iRODS actually do?

Core Competencies

Data Virtualization

Combine various distributed storage technologies into a Unified Namespace

  • Existing file systems
  • Cloud storage
  • On premises object storage
  • Archival storage systems

iRODS provides a logical view into the complex physical representation of your data, distributed geographically, and at scale.

Data Virtualization

$ ils -L /tempZone/home/rods/thefile.txt
  rods              0 demoResc        29606 2016-10-05.09:05 & thefile.txt
        generic    /var/lib/irods/iRODS/Vault/home/rods/thefile.txt
  rods              1 repl;u2        29606 2016-10-05.09:06 & thefile.txt
        generic    /tmp/u2vault/home/rods/thefile.txt
  rods              2 repl;u1        29606 2016-10-05.09:06 & thefile.txt
        generic    /tmp/u1vault/home/rods/thefile.txt

 

Logical Path:    /tempZone/home/rods/thefile.txt
Physical Paths:  /var/lib/irods/iRODS/Vault/home/rods/thefile.txt
                 /tmp/u2vault/home/rods/thefile.txt
                 /tmp/u1vault/home/rods/thefile.txt
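
The mapping from one logical path to several physical paths can be recovered from the `ils -L` listing itself. The sketch below parses the exact listing shown above; it assumes the two-lines-per-replica layout seen there (a summary line followed by a type and physical path) and paths without spaces. Other iRODS versions may print a different layout, and the `Replica` class and `parse_replicas` function are names made up for this example.

```python
from dataclasses import dataclass

# The listing from the slide, without the "$ ils -L ..." command line.
LISTING = """\
  rods              0 demoResc        29606 2016-10-05.09:05 & thefile.txt
        generic    /var/lib/irods/iRODS/Vault/home/rods/thefile.txt
  rods              1 repl;u2        29606 2016-10-05.09:06 & thefile.txt
        generic    /tmp/u2vault/home/rods/thefile.txt
  rods              2 repl;u1        29606 2016-10-05.09:06 & thefile.txt
        generic    /tmp/u1vault/home/rods/thefile.txt
"""


@dataclass
class Replica:
    number: int
    resource: str
    size: int
    physical_path: str


def parse_replicas(listing: str) -> list:
    """Pair each summary line with the location line that follows it."""
    rows = [line.split() for line in listing.splitlines() if line.strip()]
    replicas = []
    for summary, location in zip(rows[0::2], rows[1::2]):
        _owner, number, resource, size = summary[:4]
        # location is ["generic", "<physical path>"]
        replicas.append(Replica(int(number), resource, int(size), location[1]))
    return replicas


replicas = parse_replicas(LISTING)
physical_paths = [replica.physical_path for replica in replicas]
```

Here `physical_paths` holds the three vault paths that all sit behind the single logical path `/tempZone/home/rods/thefile.txt`.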

Data Discovery

Attach metadata to any first class entity within the iRODS Zone

  • Data Objects
  • Collections
  • Users
  • Storage Resources
  • The Namespace

iRODS provides automated and user-provided metadata, which make your data and infrastructure more discoverable, operational, and valuable.
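
iRODS stores metadata as attribute-value-unit (AVU) triples. A minimal in-memory sketch of attaching AVUs to logical paths and finding objects by them might look like the following. The `MetadataIndex` class is hypothetical and is not an iRODS API, and the attribute values are invented for illustration.

```python
from collections import defaultdict


class MetadataIndex:
    """Toy model: logical path -> set of (attribute, value, unit) triples."""

    def __init__(self):
        self._avus = defaultdict(set)

    def add(self, path, attribute, value, unit=""):
        self._avus[path].add((attribute, value, unit))

    def find(self, attribute, value):
        """Return every path carrying the given attribute/value pair."""
        return sorted(
            path
            for path, avus in self._avus.items()
            if any(a == attribute and v == value for a, v, _unit in avus)
        )


index = MetadataIndex()
index.add("/tempZone/home/rods/run1.cram", "study_id", "1234")
index.add("/tempZone/home/rods/run1.cram", "is_paired_read", "1")
index.add("/tempZone/home/rods/run2.cram", "study_id", "1234")
index.add("/tempZone/home/rods/run3.cram", "study_id", "9999")

matches = index.find("study_id", "1234")
```

The point of the model is that discovery happens by attribute, not by knowing where a file was stored.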

Metadata Everywhere

Workflow Automation

Integrated policy engine which is triggered by any operation within the framework

  • Authentication
  • Storage Access
  • Database Interaction
  • Network Activity
  • Extensible RPC API 

The iRODS rule engine captures real-world policy as computer-actionable rules, which may allow, deny, or add context to operations within the system.

Dynamic Policy Enforcement

An iRODS rule may:

  • restrict access
  • log for audit and reporting
  • provide additional context
  • send a notification
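
The rule behaviours listed above (restrict, log, add context) can be sketched as plain functions that each inspect an operation before it proceeds. This is a simplified model, not the iRODS rule language or any rule engine plugin API: `Operation`, `PolicyDenied`, `enforce`, the rule names, and the `/tempZone/archive/` path are all assumptions made for the example.

```python
from dataclasses import dataclass, field


class PolicyDenied(Exception):
    """Raised by a rule to deny the operation."""


@dataclass
class Operation:
    name: str
    user: str
    path: str
    context: dict = field(default_factory=dict)


audit_log = []


def log_for_audit(op):
    # Log every attempt, including ones a later rule denies.
    audit_log.append(f"{op.user} {op.name} {op.path}")


def restrict_archive_writes(op):
    # Restrict access: nobody may write below the (hypothetical) archive.
    if op.name == "put" and op.path.startswith("/tempZone/archive/"):
        raise PolicyDenied(f"{op.user} may not write to {op.path}")


def tag_ingest_source(op):
    # Add context that later steps could store as metadata.
    op.context["ingested_by"] = op.user


RULES = [log_for_audit, restrict_archive_writes, tag_ingest_source]


def enforce(op):
    """Run every rule in order; any rule may raise to deny the operation."""
    for rule in RULES:
        rule(op)
    return op
```

Calling `enforce(Operation("put", "alice", "/tempZone/home/alice/a.txt"))` returns the operation with added context, while the same call against a path under `/tempZone/archive/` raises `PolicyDenied` after the attempt has been logged.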

Dynamic Policy Enforcement

A single API call expands to many plugin operations, all of which may invoke policy enforcement

Plugin Interfaces:

  • Authentication
  • Database
  • Storage
  • Network
  • Rule Engine
  • Microservice
  • RPC API
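
To make the expansion concrete, the sketch below traces which policy enforcement points (PEPs) a single "put" could touch. The names follow the `pep_<operation>_pre` / `pep_<operation>_post` naming pattern used for dynamic PEPs in iRODS 4.2, but the particular sequence of plugin operations chosen here is an illustrative assumption, not a trace from a real server, and `with_peps` and `put` are hypothetical helpers.

```python
def with_peps(operation, action, trace):
    """Wrap one plugin operation with its pre and post enforcement points."""
    trace.append(f"pep_{operation}_pre")
    result = action()
    trace.append(f"pep_{operation}_post")
    return result


def put(trace):
    """One client-facing API call fanning out into several plugin operations."""

    def body():
        # Assumed sequence: create, write, and close on the storage resource,
        # then register the new data object in the catalog.
        for operation in (
            "resource_create",
            "resource_write",
            "resource_close",
            "database_reg_data_obj",
        ):
            with_peps(operation, lambda: None, trace)

    with_peps("api_data_obj_put", body, trace)


trace = []
put(trace)
```

After the call, `trace` lists every point at which a rule could have run, with the API-level pre and post hooks bracketing the resource and database hooks.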

Secure Collaboration

iRODS allows for collaboration across administrative boundaries after deployment

  • No need for common infrastructure
  • No need for shared funding
  • Affords temporary collaborations

iRODS provides the ability to federate namespaces across organizations without pre-coordinated funding or effort.

iRODS Service Interface

Federation - Shared Data and Services

Institutional repositories

As data matures and reaches a broader community, data management policy must also evolve to meet these additional requirements.

Patterns and

Use Cases

iRODS

On Premises to Any Cloud Infrastructure

Data to Compute Use Case

Compute to Data Use Case

The Wellcome Sanger Institute

Sanger - Replication

  • Data preferentially placed on resource servers in the green data center (fallback to red)
  • Data replicated to the other room.
  • Checksums applied
  • Green and red centers both used for read access.
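
The placement policy described above can be sketched in a few lines: prefer green, fall back to red, replicate to the other room when it is reachable, and record a checksum. This is a toy model rather than Sanger's actual rules. The `place_and_replicate` function is hypothetical, and SHA-256 is an assumption since the slide does not name the checksum algorithm used at this step.

```python
import hashlib


def place_and_replicate(data: bytes, available: dict):
    """Return (primary room, {room: stored bytes}, checksum of the data)."""
    # Prefer the green data center; fall back to red if green is unavailable.
    primary = "green" if available.get("green") else "red"
    secondary = "red" if primary == "green" else "green"
    if not available.get(primary):
        raise RuntimeError("no data center available")

    checksum = hashlib.sha256(data).hexdigest()
    stored = {primary: data}

    # Replicate to the other room when it can be reached.
    if available.get(secondary):
        stored[secondary] = data

    # Verify each stored copy against the recorded checksum.
    for room, copy in stored.items():
        if hashlib.sha256(copy).hexdigest() != checksum:
            raise ValueError(f"checksum mismatch in {room}")
    return primary, stored, checksum
```

With both rooms up, the data lands in green and is copied to red, so either room can serve reads. With green down, red becomes the primary and no second copy exists until green returns.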

Sanger - Metadata

  • Example metadata attributes:
      library, total_reads, type, lane, is_paired_read,
      study_accession_number, library_id, sample_accession_number,
      sample_public_name, manual_qc, tag, sample_common_name, md5,
      tag_index, study_title, study_id, reference, sample, target,
      sample_id, id_run, study, alignment
  • Users query and access data from local compute clusters
  • Users access iRODS locally via the command line interface

Sanger - Federation

University College London

  • UK sponsored research retention requirement:
    • keep data until 10 years after the last access request
  • iRODS tiers data across storage technologies
  • Enables federated access from other centers
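
The retention requirement is simple date arithmetic: each new access request pushes the earliest permitted deletion date out to ten years later. A minimal sketch follows, with hypothetical function names and an assumed handling of 29 February (rolled back to 28 February when the target year is not a leap year).

```python
from datetime import date

RETENTION_YEARS = 10


def retention_deadline(last_access_request: date) -> date:
    """Date until which the data must be kept."""
    target_year = last_access_request.year + RETENTION_YEARS
    try:
        return last_access_request.replace(year=target_year)
    except ValueError:
        # 29 February has no counterpart in a non-leap target year.
        return last_access_request.replace(year=target_year, day=28)


def may_delete(last_access_request: date, today: date) -> bool:
    """True only once the retention period has fully elapsed."""
    return today > retention_deadline(last_access_request)
```

A policy like this could run periodically against the recorded access time of each data object and only then tier or remove the data.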

Roadmap

iRODS Software

The Roadmap

  • iRODS 4.3
  • Packaged iRODS Capabilities
  • Multipart Transfer
    • Cacheless Object Storage
  • Query Arrow
  • Metadata Templates
  • Filesystem Integration

The Roadmap - iRODS 4.3

  • Hardening Release
  • Logging
  • iRODS Monitor
  • Delegate Checksum to Storage Plugins

Packaged iRODS Capabilities

Multipart Transfer

Provide reliable transfer with restart

  • object parts tracked in the catalog

Later versions will provide fast, first-class access to object storage
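
The idea of a restartable transfer with parts tracked in the catalog can be sketched as follows. This is not the planned iRODS implementation: the catalog is modelled as a plain set of completed part indices, the destination as a dictionary, and `split_parts`, `transfer`, and the `fail_at` parameter (used to simulate an interruption) are all hypothetical.

```python
def split_parts(data: bytes, part_size: int) -> list:
    """Cut the payload into fixed-size parts (the last one may be shorter)."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]


def transfer(parts, completed, destination, fail_at=None):
    """Send every part not yet recorded as completed.

    completed   -- set of part indices, standing in for the catalog
    destination -- dict of index -> bytes, standing in for remote storage
    fail_at     -- part index at which to simulate a dropped connection
    """
    for index, part in enumerate(parts):
        if index in completed:
            continue  # already transferred before the restart
        if index == fail_at:
            raise ConnectionError(f"transfer interrupted at part {index}")
        destination[index] = part
        completed.add(index)  # record progress only after the part landed


def reassemble(destination) -> bytes:
    return b"".join(destination[i] for i in sorted(destination))
```

A first call that fails partway leaves the completed parts recorded, so a second call with the same `completed` set resumes from the first missing part instead of starting over.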

iRODS 4.2 and Beyond - The Scatter

Next Generation Query Interface

iRODS 4.3 and Beyond - The Gather

Shared Data - Shared Infrastructure

Metadata Templates

Business Model

iRODS Consortium

The iRODS Consortium

Our Mission

  • Write Good Software
  • Grow the Community
  • Show Value to our Membership

Why Open Source

  • Transparency
  • Quality
  • Persistence
  • Vendor Neutrality
  • Customization
  • Community
  • Try before you buy

Our Membership

Our Business Model

Consortium Membership

  • Participate in roadmap development
  • Participate in consortium governance
  • Direct support from the team
  • Tier 3 support agreements
  • Discount for support agreements

Our Business Model

Service & Support Contracts

  • Billed hourly
  • Implement Proofs of Concept
  • Custom rule and plugin development
  • Expand to new use cases
  • Discounted rate for consortium members

Membership Committees

Technology Working Group

  • Monthly web conferences
  • Build iRODS Roadmap
  • Propose new technology direction
  • Propose inclusion of new software
  • Propose new working groups

Membership Committees

Planning Committee

  • Monthly web conferences
  • Discuss consortium policy and business practices
  • Propose conferences and workshops
  • Vote on inclusion of new software
  • Vote on roadmap

Membership Committees

Executive Board

  • Meets twice yearly
  • Votes on consortium budget and bylaw changes
  • Determines the thematic priorities of the consortium

 

Additional working groups are formed as required

Our Consortium Participation

CyberCarpentry - Introduction to Data Management

By iRODS Consortium

An overview of how iRODS provides data management, its architecture, use cases, and the roadmap