January 30, 2019

Cloud Synchronization and Sharing Services

Rome, Italy

Jason Coposky

@jason_coposky

Executive Director, iRODS Consortium

Managing Data

from the Edge to HPC


Data Management

"The development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets."

 

 

Most organizations are still managing their assets with a collection of small scripts, tribal knowledge, vigilance, and hope.

 


Organizations, instead, need a future-proof solution to managing data and its surrounding infrastructure.

Why Data Management Matters

As data matures and reaches a broader community, data management policy must also evolve to meet these additional requirements.

Typical Data Flow

Devices / Sensors                                          On Premise / Cloud

Incoming source data from satellites, sequencers, microscopes, ... sheep

The Problem

Data is coming in with greater...

  • Volume
  • Velocity
  • Variety

 

Human-throttled ingestion and cleaning are no longer sufficient.

  • Should be handled with policy and procedure
  • Should be handled with code 
  • Should be handled closer to point of creation

 

 

Where is the Edge?

Devices / Sensors                                           On Premise / Cloud

Where does the data come under management?

Where can it be vouched for?

Where can it be trusted?

A Modest Proposal

iRODS is open source data management software

 

 

 

Provides insurance against your changing infrastructure:

  • edge devices
  • storage
  • compute
  • networking
  • authentication

iRODS Core Competencies

The underlying technology, categorized into four areas

iRODS Policy Examples

  • Data Routing
  • Data Movement
  • Data Verification
  • Data Synchronization
  • Data Transformation
  • Metadata Capture
  • Metadata Application
  • Metadata Verification
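Two of the policies listed above, Metadata Capture and Data Verification, can be illustrated with a minimal sketch. It uses only the Python standard library and is not iRODS code or the iRODS API; the function names (`capture_metadata`, `verify`) and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def capture_metadata(path):
    """Metadata capture (hypothetical policy): record facts at ingest time."""
    info = os.stat(path)
    return {
        "size_bytes": info.st_size,
        "mtime": info.st_mtime,
        "sha256": sha256_of(path),
    }


def verify(path, recorded):
    """Data verification (hypothetical policy): does the file still match?"""
    return (
        os.path.getsize(path) == recorded["size_bytes"]
        and sha256_of(path) == recorded["sha256"]
    )


def demo():
    """Ingest a file, verify it, modify it, and verify it again."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "reading.dat")
        with open(path, "wb") as handle:
            handle.write(b"sensor reading 1\n")
        recorded = capture_metadata(path)
        ok_before = verify(path, recorded)
        with open(path, "ab") as handle:
            handle.write(b"unexpected bytes\n")
        ok_after = verify(path, recorded)
        return ok_before, ok_after
```

The point of the sketch is that the check runs as code at ingest time rather than relying on someone remembering to run it.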

iRODS Capabilities

Deployment Patterns

Data to Compute

Compute to Data

Filesystem Synchronization

The Data Management Model

Where is the Edge?

Devices / Sensors                                           On Premise / Cloud

Create a logical namespace

Where is the Edge?

Devices / Sensors          Edge                        On Premise / Cloud

Move the point of ingestion closer to the source.  Ingest on site.  Ingest at the point of data creation.

UNIFIED NAMESPACE
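The idea of a unified logical namespace can be sketched as a catalog that maps one stable logical path to wherever the bytes currently live. This is a toy model under stated assumptions, not how iRODS implements its catalog; the class names, resource names, and paths below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    resource: str       # hypothetical storage name, e.g. "edge-disk"
    physical_path: str  # where the bytes live on that resource


class LogicalNamespace:
    """Toy catalog: logical path -> list of physical replicas."""

    def __init__(self):
        self._catalog = {}

    def register(self, logical_path, replica):
        """Record that a replica of logical_path exists."""
        self._catalog.setdefault(logical_path, []).append(replica)

    def resolve(self, logical_path):
        """Return the known replicas for a logical path (possibly empty)."""
        return list(self._catalog.get(logical_path, []))

    def move(self, logical_path, from_resource, to_replica):
        """Replace the replica on from_resource; the logical path is unchanged."""
        kept = [r for r in self._catalog[logical_path]
                if r.resource != from_resource]
        self._catalog[logical_path] = kept + [to_replica]


def demo():
    """Ingest at the edge, then move to the cloud under the same name."""
    ns = LogicalNamespace()
    path = "/exampleZone/home/lab/run42/scan.tif"
    ns.register(path, Replica("edge-disk", "/mnt/instrument/run42/scan.tif"))
    before = [r.resource for r in ns.resolve(path)]
    ns.move(path, "edge-disk", Replica("cloud-bucket", "bucket/run42/scan.tif"))
    after = [r.resource for r in ns.resolve(path)]
    return before, after
```

Users and applications keep addressing the same logical path while the storage underneath changes, which is the "insurance against changing infrastructure" argument in miniature.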

The Data Lifecycle begins at Data Generation

By bringing data management to the point of data generation

(and extending the programmatic surface out to the instruments),

a system with this architecture can address other hard problems:

  • Data Harmonization
  • Data Movement
  • Data Integrity
  • Geographic Distribution
  • Network Capacity
  • Network Reliability
  • Variety of Data Sources
  • Variety of Data Formats

Automated Ingest - Landing Zone

Automated Ingest - Filesystem Scanning
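Filesystem scanning can be sketched as: take a snapshot of a directory tree, compare it with the previous snapshot, and hand the new or changed files to ingest. The following uses only the Python standard library and is not the iRODS automated ingest tooling; `snapshot` and `diff` are hypothetical helper names, and using size plus modification time to detect change is an assumption of this sketch.

```python
import os
import tempfile


def snapshot(root):
    """Map each file's path (relative to root) to (size, mtime in ns)."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            info = os.stat(full)
            state[os.path.relpath(full, root)] = (info.st_size, info.st_mtime_ns)
    return state


def diff(previous, current):
    """Return sorted lists of new, changed, and removed relative paths."""
    new = sorted(p for p in current if p not in previous)
    changed = sorted(p for p in current
                     if p in previous and current[p] != previous[p])
    removed = sorted(p for p in previous if p not in current)
    return new, changed, removed


def demo():
    """Scan twice around an append and a new file; report what changed."""
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "a.txt"), "w") as handle:
            handle.write("first\n")
        first = snapshot(root)
        with open(os.path.join(root, "a.txt"), "a") as handle:
            handle.write("second\n")   # size changes, so the file is "changed"
        with open(os.path.join(root, "b.txt"), "w") as handle:
            handle.write("new file\n")
        second = snapshot(root)
        return diff(first, second)
```

A production scanner would also need to handle files that are still being written, very large trees, and restarts; none of that is modeled here.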

Data to Compute

Compute to Data

Resources

iRODS Overview and Diagrams

        https://irods.org/documentation

Official Documentation

        https://docs.irods.org

iRODS Training Materials and Presentations

        https://slides.com/irods

iRODS User Group

        https://irods.org/ugm2019

Questions?

CS3 - Managing Data from the Edge to HPC

By Jason Coposky

A discussion of the ingest of data from "The Edge", the capture and application of metadata, and the policy around taking the data to compute.
