Automated Ingest

Jason M. Coposky

@jason_coposky

Executive Director, iRODS Consortium

January 14-16 2020

CINES

Montpellier, France

iRODS Capabilities

  • Packaged and supported solutions
  • Require configuration not code
  • Derived from the majority of use cases observed in the user community

Automated Ingest

Provide a flexible and highly scalable mechanism for data ingest

  • Directly ingest files
  • Capture streaming data
  • Register data in place
  • Synchronize a file system with the catalog
  • Extract and apply metadata

Architecture Overview

  • Implemented with Python iRODS Client
  • Based on Redis and Celery
  • Any number of workers distributed across servers
  • Policy defined through event callbacks to user-provided python functions
  • File system metadata cached in Redis to detect changes
  • iRODS API is invoked only to update the catalog
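
The change-detection idea in the list above can be illustrated with a minimal sketch. It assumes that a file's modification time and size are enough to spot a change, and it uses a plain dictionary as a stand-in for the Redis cache. The names `changed_files` and `demo_change_detection` and the cache layout are hypothetical, not the tool's actual implementation.

```python
import os
import tempfile


def changed_files(root, cache):
    """Return paths under root whose (mtime, size) differ from the cache.

    `cache` is a plain dict standing in for Redis (an assumption of this
    sketch). It is updated in place so the next scan sees the new state.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            signature = (st.st_mtime_ns, st.st_size)
            if cache.get(path) != signature:
                changed.append(path)
                cache[path] = signature
    return changed


def demo_change_detection():
    """Scan a temporary directory three times and count the changes seen."""
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "img0.jpg")
        with open(path, "wb") as f:
            f.write(b"abc")
        cache = {}
        first = changed_files(root, cache)   # the new file is reported
        second = changed_files(root, cache)  # nothing changed since last scan
        with open(path, "ab") as f:
            f.write(b"def")                  # the size grows
        third = changed_files(root, cache)   # the grown file is reported
        return len(first), len(second), len(third)
```

Only the paths returned by such a scan would need a catalog update, which is the point of the last bullet: unchanged files never reach the iRODS API.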

Getting Started

As the ubuntu user - start Redis server

sudo apt-get install -y redis-server python-pip
sudo service redis-server start

As the irods user - install automated ingest via pip

pip install virtualenv --user
python -m virtualenv -p python3 rodssync
source rodssync/bin/activate
pip install irods_capability_automated_ingest

Getting Started

As the irods user - start Celery workers

export CELERY_BROKER_URL=redis://127.0.0.1:6379/0
export PYTHONPATH=`pwd`
celery -A irods_capability_automated_ingest.sync_task worker -l error -Q restart,path,file

Open a new terminal, activate rodssync as the irods user, and set environment variables for the scanner

source rodssync/bin/activate
export CELERY_BROKER_URL=redis://127.0.0.1:6379/0
export PYTHONPATH=`pwd`

Getting Started

Fetch test data

git clone https://github.com/irods/irods_training/

Generate some source data

mkdir /tmp/test_dir
cp irods_training/stickers.jpg /tmp/test_dir/img0.jpg
cp irods_training/stickers.jpg /tmp/test_dir/img1.jpg
cp irods_training/stickers.jpg /tmp/test_dir/img2.jpg
cp irods_training/stickers.jpg /tmp/test_dir/img3.jpg
cp irods_training/stickers.jpg /tmp/test_dir/img4.jpg

Stage source data

mkdir ./src_dir
cp -r /tmp/test_dir ./src_dir/dir0

Default Ingest Behavior

By default, the framework registers the data in place, against the default resource, into the given collection

imkdir reg_coll

python -m irods_capability_automated_ingest.irods_sync start ./src_dir /tempZone/home/rods/reg_coll

Check the redis monitor terminal for the restart, file and path queues

Default Ingest Behavior

Check that our results are registered in place

$ ils -L reg_coll/dir0
/tempZone/home/rods/reg_coll/dir0:
  rods              0 demoResc      2157087 2019-05-21.05:54 & img0.jpg
        generic    /var/lib/irods/src_dir/dir0/img0.jpg
  rods              0 demoResc      2157087 2019-05-21.05:54 & img1.jpg
        generic    /var/lib/irods/src_dir/dir0/img1.jpg
  rods              0 demoResc      2157087 2019-05-21.05:54 & img2.jpg
        generic    /var/lib/irods/src_dir/dir0/img2.jpg
  rods              0 demoResc      2157087 2019-05-21.05:54 & img3.jpg
        generic    /var/lib/irods/src_dir/dir0/img3.jpg
  rods              0 demoResc      2157087 2019-05-21.05:54 & img4.jpg
        generic    /var/lib/irods/src_dir/dir0/img4.jpg
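
The listing above suggests that each file under the source directory ends up at the same relative location under the target collection. A minimal sketch of that mapping follows, assuming a POSIX-style local file system; `logical_path` is a hypothetical helper for illustration, not a function of the ingest tool.

```python
import os
import posixpath


def logical_path(source_root, target_collection, local_path):
    """Map a local file under source_root to a path under target_collection."""
    rel = os.path.relpath(local_path, source_root)
    if rel == os.pardir or rel.startswith(os.pardir + os.sep):
        raise ValueError("%s is not under %s" % (local_path, source_root))
    # iRODS logical paths always use forward slashes
    return posixpath.join(target_collection, *rel.split(os.sep))
```

For example, `logical_path("./src_dir", "/tempZone/home/rods/reg_coll", "./src_dir/dir0/img0.jpg")` yields `/tempZone/home/rods/reg_coll/dir0/img0.jpg`, matching the collection shown by `ils`.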

Default Ingest Behavior

Stage a different set of data and re-run the ingest

cp -r /tmp/test_dir ./src_dir/dir1

python -m irods_capability_automated_ingest.irods_sync start ./src_dir /tempZone/home/rods/reg_coll

Check our results

$ ils -L reg_coll/dir1
/tempZone/home/rods/reg_coll/dir1:
  rods              0 demoResc      2157087 2019-05-21.06:11 & img0.jpg
        generic    /var/lib/irods/src_dir/dir1/img0.jpg
  rods              0 demoResc      2157087 2019-05-21.06:11 & img1.jpg
        generic    /var/lib/irods/src_dir/dir1/img1.jpg
  rods              0 demoResc      2157087 2019-05-21.06:11 & img2.jpg
        generic    /var/lib/irods/src_dir/dir1/img2.jpg
  rods              0 demoResc      2157087 2019-05-21.06:11 & img3.jpg
        generic    /var/lib/irods/src_dir/dir1/img3.jpg
  rods              0 demoResc      2157087 2019-05-21.06:11 & img4.jpg
        generic    /var/lib/irods/src_dir/dir1/img4.jpg

Customizing the Ingest Behavior

The ingest tool is a callback-based system: it invokes methods within the custom event handler module provided by the administrator.

These event handlers may then take action, such as setting ACLs, providing additional context such as the selection of storage resources, or extracting and applying metadata.
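
How such a callback system can work is sketched below. This is an illustration only: `Core` here is an empty stand-in for the tool's base class, `call_if_defined` is a hypothetical dispatcher, and the `events` list in `meta` exists purely so the sketch has something observable. The real tool's dispatch logic may differ.

```python
class Core:
    """Empty stand-in for the tool's base class (an assumption of this sketch)."""


class event_handler(Core):
    @staticmethod
    def to_resource(session, meta, **options):
        return "demoResc"

    @staticmethod
    def post_data_obj_create(hdlr_mod, logger, session, meta, **options):
        # a real handler might set ACLs or apply metadata here
        meta["events"].append("post_data_obj_create")


def call_if_defined(handler, name, *args, default=None, **options):
    """Invoke handler.<name> if the module author defined it, else return default."""
    method = getattr(handler, name, None)
    if method is None:
        return default
    return method(*args, **options)
```

With this shape, an administrator only writes the methods they care about. A call such as `call_if_defined(event_handler, "operation", None, meta, default="REGISTER_SYNC")` falls back to the default because the handler above does not define `operation`.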

Customizing the Ingest Behavior

Available --event_handler methods

Method | Effect | Default
pre_data_obj_create | user-defined python | none
post_data_obj_create | user-defined python | none
pre_data_obj_modify | user-defined python | none
post_data_obj_modify | user-defined python | none
pre_coll_create | user-defined python | none
post_coll_create | user-defined python | none
as_user | takes action as this iRODS user | authenticated user
target_path | set mount path on the iRODS server, which can be different from the client mount path | client mount path
to_resource | defines target resource request of operation | as provided by client environment
operation | defines the mode of operation | Operation.REGISTER_SYNC

Customizing the Ingest Behavior

The operation mode is returned by the 'operation' method, which informs the ingest tool as to which behavior is desired for a given ingest job.

To override the default behavior, one of these operations must be selected and returned.  These operations are hard coded into the tool, and cover the typical use cases of data ingest.

Customizing the Ingest Behavior

Available Operations

Operation | New Files | Updated Files
Operation.REGISTER_SYNC (default) | registers in catalog | updates size in catalog
Operation.REGISTER_AS_REPLICA_SYNC | registers first or additional replica | updates size in catalog
Operation.PUT | copies file to target vault, and registers in catalog | no action
Operation.PUT_SYNC | copies file to target vault, and registers in catalog | copies entire file again, and updates catalog
Operation.PUT_APPEND | copies file to target vault, and registers in catalog | copies only appended part of file, and updates catalog
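
The operations table can also be read as a lookup from operation and file state to an action. The sketch below transcribes it into Python. The `Operation` enum here is a stand-in that only mirrors the names in the table (it is not the tool's own enum), and `action_for` is a hypothetical helper.

```python
from enum import Enum


class Operation(Enum):
    """Stand-in mirroring the operation names in the table above."""
    REGISTER_SYNC = 0
    REGISTER_AS_REPLICA_SYNC = 1
    PUT = 2
    PUT_SYNC = 3
    PUT_APPEND = 4


# (action for a new file, action for an updated file), transcribed from the table
ACTIONS = {
    Operation.REGISTER_SYNC: (
        "register in catalog",
        "update size in catalog"),
    Operation.REGISTER_AS_REPLICA_SYNC: (
        "register first or additional replica",
        "update size in catalog"),
    Operation.PUT: (
        "copy file to target vault, and register in catalog",
        "no action"),
    Operation.PUT_SYNC: (
        "copy file to target vault, and register in catalog",
        "copy entire file again, and update catalog"),
    Operation.PUT_APPEND: (
        "copy file to target vault, and register in catalog",
        "copy only appended part of file, and update catalog"),
}


def action_for(operation, already_ingested):
    """Return the table entry for a new or a previously ingested file."""
    new_action, update_action = ACTIONS[operation]
    return update_action if already_ingested else new_action
```

Seen this way, the three PUT variants differ only in what happens on a later pass over a file that has already been ingested.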

Example Ingest Modules

./rodssync/lib/python3.5/site-packages/irods_capability_automated_ingest/examples/put_with_resc_name.py

from irods_capability_automated_ingest.core import Core
from irods_capability_automated_ingest.utils import Operation

class event_handler(Core):                             # <-- expected interface
    @staticmethod
    def to_resource(session, target, path, **options): # <-- method
        return "regiResc2a"                            # <-- expected side effect

    @staticmethod
    def operation(session, target, path, **options):   # <-- method
        return Operation.PUT                           # <-- operation

For this example we would want to change 'regiResc2a' to our target resource

Example Ingest Modules

./rodssync/lib/python3.5/site-packages/irods_capability_automated_ingest/examples/sync_root_with_resc_name.py

from irods_capability_automated_ingest.core import Core
from irods_capability_automated_ingest.utils import Operation

class event_handler(Core):
    @staticmethod
    def to_resource(session, target, path, **options):
        return "regiResc2Root"

    @staticmethod
    def operation(session, target, path, **options):
        return Operation.PUT_SYNC

Note that this uses a different operation, which not only ingests new data but also synchronizes previously ingested data

Custom Ingest with Metadata Extraction

Install the exifread python library

pip install exifread

Create a new ingest collection

imkdir put_coll

With the editor of your choice, create and edit:

./rodssync/lib/python3.5/site-packages/irods_capability_automated_ingest/examples/put_with_resc_name_image_metadata.py

Custom Ingest with Metadata Extraction

Implement the usual event handler and operation

We will implement two additional event handlers:

  • post_data_obj_create
  • post_data_obj_modify

Using the python iRODS client, we will apply metadata extracted by exifread

Should the metadata key already exist, we overwrite it with the new value

Custom Ingest with Metadata Extraction

import exifread
from irods_capability_automated_ingest.core import Core
from irods_capability_automated_ingest.utils import Operation
from irods.meta import iRODSMeta

def add_exif_metadata(session, target, path):
    with open(path, 'rb') as f:
        obj = session.data_objects.get(target)
        tags = exifread.process_file(f, details=False)
        for (k, v) in tags.items():
            if k not in ('JPEGThumbnail','TIFFThumbnail','Filename','EXIF MakerNote'):
                if k in obj.metadata.keys():
                    # key already exists: overwrite it with the new value
                    obj.metadata[k] = iRODSMeta(str(k), str(v))
                else:
                    obj.metadata.add(str(k), str(v), '')

class event_handler(Core):  
    @staticmethod
    def to_resource(session, meta, **options):
        return "demoResc"
    @staticmethod
    def operation(session, meta, **options):
        return Operation.PUT
    @staticmethod
    def post_data_obj_create(hdlr_mod, logger, session, meta, **options):
        add_exif_metadata(session, meta['target'], meta['path'])
    @staticmethod
    def post_data_obj_modify(hdlr_mod, logger, session, meta, **options):
        add_exif_metadata(session, meta['target'], meta['path'])

Custom Ingest with Metadata Extraction

Should we want this module to register in place, we can simply change the operation returned to REGISTER_SYNC or REGISTER_AS_REPLICA_SYNC

Stage new test data

cp -r /tmp/test_dir ./src_dir/dir2

Launch the ingest job (all one line)

python -m irods_capability_automated_ingest.irods_sync start ./src_dir /tempZone/home/rods/put_coll --event_handler irods_capability_automated_ingest.examples.put_with_resc_name_image_metadata

Custom Ingest with Metadata Extraction

Check our results

$ ils -l put_coll/dir2
/tempZone/home/rods/put_coll/dir2:
  rods              0 demoResc      2157087 2019-05-22.11:16 & img0.jpg
  rods              0 demoResc      2157087 2019-05-22.11:16 & img1.jpg
  rods              0 demoResc      2157087 2019-05-22.11:16 & img2.jpg
  rods              0 demoResc      2157087 2019-05-22.11:16 & img3.jpg
  rods              0 demoResc      2157087 2019-05-22.11:16 & img4.jpg

Custom Ingest with Metadata Extraction

Inspect our newly harvested metadata

$ imeta ls -d put_coll/dir2/img0.jpg
AVUs defined for dataObj put_coll/dir2/img0.jpg:
attribute: EXIF ApertureValue
value: 7983/3509
units:
----
attribute: EXIF BrightnessValue
value: 2632/897
units:
----
<snip>
----
attribute: Thumbnail YResolution
value: 72
units:

Any Python library can now be leveraged to extract and apply metadata.  The MIME type can be detected and mapped to the appropriate metadata extraction.
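
One way to map a detected type to an extractor is sketched below, using only the standard library. It assumes the type is guessed from the file name extension (not the file contents), and the extractor functions and the `EXTRACTORS` table are hypothetical placeholders. A real image extractor might call exifread as in the module above.

```python
import mimetypes


def extract_image(path):
    """Placeholder: a real extractor might read EXIF tags here."""
    return {"kind": "image"}


def extract_text(path):
    """Placeholder: a real extractor might count lines or detect encoding."""
    return {"kind": "text"}


# major MIME type -> extractor function (hypothetical mapping)
EXTRACTORS = {"image": extract_image, "text": extract_text}


def extract_metadata(path):
    """Pick an extractor from the guessed MIME type; return {} if none fits."""
    mime_type, _encoding = mimetypes.guess_type(path)
    if mime_type is None:
        return {}
    major = mime_type.split("/", 1)[0]
    extractor = EXTRACTORS.get(major)
    return extractor(path) if extractor else {}
```

The returned dictionary could then be applied as AVUs from a `post_data_obj_create` handler, in the same way as the EXIF tags.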

The Landing Zone

In this use case, data is written to disk by an instrument or another source, and we run an ingest job on that directory.  Once data is ingested, it is moved out of the way in order to improve ingest performance.  These ingested files can be removed later as a matter of local administrative policy.

The critical difference between a pure file system scan and a landing zone is that the landing zone (LZ) is not considered the single point of truth; it is a staging area for ingest, and files are moved out of it once they have been ingested.

In a file system scan, the file system is the canonical replica, and the catalog and other replicas are kept in sync with it.
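
The "move out of the way" step can be sketched with the standard library alone. The helper below is hypothetical and assumes the landing zone and the ingested directory live on the same file system, since `os.rename` does not move files across file systems.

```python
import os
import tempfile


def move_out_of_landing_zone(path, landing_zone, ingested):
    """Move path to the same relative location under the ingested directory."""
    rel = os.path.relpath(path, landing_zone)
    if rel == os.pardir or rel.startswith(os.pardir + os.sep):
        raise ValueError("%s is not inside %s" % (path, landing_zone))
    new_path = os.path.join(ingested, rel)
    os.makedirs(os.path.dirname(new_path), exist_ok=True)
    os.rename(path, new_path)  # same file system assumed
    return new_path


def demo_move():
    """Stage one file in a temporary landing zone and move it."""
    with tempfile.TemporaryDirectory() as base:
        lz = os.path.join(base, "landing_zone")
        done = os.path.join(base, "ingested")
        os.makedirs(os.path.join(lz, "dir0"))
        src = os.path.join(lz, "dir0", "img0.jpg")
        with open(src, "wb") as f:
            f.write(b"data")
        dst = move_out_of_landing_zone(src, lz, done)
        return os.path.relpath(dst, base), os.path.exists(src), os.path.exists(dst)
```

Computing the relative path first, instead of replacing a substring of the path, avoids surprises if the landing zone name happens to appear elsewhere in a file's path.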

The Landing Zone

Create a new ingest collection

imkdir lz_coll

Create the landing zone and ingested directories

mkdir /tmp/landing_zone
mkdir /tmp/ingested

Stage data for ingest

cp -r /tmp/test_dir /tmp/landing_zone/dir0

Preparing a Landing Zone

With the editor of your choice, create and edit:

./rodssync/lib/python3.5/site-packages/irods_capability_automated_ingest/examples/lz_put_with_resc_name_image_metadata.py

The Landing Zone

import os

import exifread
from irods.meta import iRODSMeta
from irods_capability_automated_ingest.core import Core
from irods_capability_automated_ingest.utils import Operation


def add_exif_metadata(session, target, path):
    with open(path, 'rb') as f:
        obj = session.data_objects.get(target)
        try:
            tags = exifread.process_file(f, details=False)
            for (k, v) in tags.items():
                if k not in ('JPEGThumbnail','TIFFThumbnail','Filename','EXIF MakerNote'):
                    if k in obj.metadata.keys():
                        # key already exists: overwrite it with the new value
                        obj.metadata[k] = iRODSMeta(str(k), str(v))
                    else:
                        obj.metadata.add(str(k), str(v), '')
        except Exception:
            # metadata extraction is best effort; a failure does not stop the ingest
            pass


class event_handler(Core):
    @staticmethod
    def to_resource(session, meta, **options):
        return "demoResc"

    @staticmethod
    def operation(session, meta, **options):
        return Operation.PUT

    @staticmethod
    def post_data_obj_create(hdlr_mod, logger, session, meta, **options):
        path = meta['path']
        add_exif_metadata(session, meta['target'], path)
        # move the ingested file out of the landing zone
        new_path = path.replace('/tmp/landing_zone', '/tmp/ingested')
        try:
            dir_name = os.path.dirname(new_path)
            os.makedirs(dir_name, exist_ok=True)
            os.rename(path, new_path)
        except Exception:
            logger.info('FAILED to move ['+path+'] to ['+new_path+']')

The Landing Zone

Launch the ingest job

python -m irods_capability_automated_ingest.irods_sync start /tmp/landing_zone /tempZone/home/rods/lz_coll --event_handler irods_capability_automated_ingest.examples.lz_put_with_resc_name_image_metadata

Check the ingested directory for the files

$ ls -l /tmp/ingested/dir0/
total 10540
-rw-rw-r-- 1 irods irods 2157087 May 23 05:24 img0.jpg
-rw-rw-r-- 1 irods irods 2157087 May 23 05:24 img1.jpg
-rw-rw-r-- 1 irods irods 2157087 May 23 05:24 img2.jpg
-rw-rw-r-- 1 irods irods 2157087 May 23 05:24 img3.jpg
-rw-rw-r-- 1 irods irods 2157087 May 23 05:24 img4.jpg

The Landing Zone

Check the ingested collection lz_coll

$ ils -l lz_coll/dir0
/tempZone/home/rods/lz_coll/dir0:
  rods              0 demoResc      2157087 2019-05-22.11:16 & img0.jpg
  rods              0 demoResc      2157087 2019-05-22.11:16 & img1.jpg
  rods              0 demoResc      2157087 2019-05-22.11:16 & img2.jpg
  rods              0 demoResc      2157087 2019-05-22.11:16 & img3.jpg
  rods              0 demoResc      2157087 2019-05-22.11:16 & img4.jpg

The Landing Zone

Check the ingested metadata

$ imeta ls -d lz_coll/dir0/img0.jpg
AVUs defined for dataObj lz_coll/dir0/img0.jpg:
attribute: EXIF ApertureValue
value: 7983/3509
units:
----
attribute: EXIF BrightnessValue
value: 2632/897
units:
----
<snip>
----
attribute: Thumbnail YResolution
value: 72
units:

Streaming Ingest

For this use case some instruments periodically append output to a file or set of files.

 

We will configure an ingest job for a file to which a process will periodically append some data.

Edit append_data.sh (make sure it's executable)

#!/bin/bash
# append one line to the given file every two seconds
file="$1"
touch "$file"
for i in {1..1000}; do
    echo "do more science!!!!" >> "$file"
    sleep 2
done
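
The idea behind an append-style ingest, copying only the part of a file that was added since the last pass, can be sketched as follows. This is an illustration against two local files, assuming the source only ever grows. `copy_appended` and `demo_append` are hypothetical names and this is not the tool's implementation of PUT_APPEND.

```python
import os
import tempfile


def copy_appended(source, dest):
    """Copy only the bytes of source beyond dest's current size.

    Returns the number of bytes copied. Assumes source is append-only.
    """
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    copied = 0
    with open(source, "rb") as src, open(dest, "ab") as dst:
        src.seek(offset)
        while True:
            chunk = src.read(64 * 1024)
            if not chunk:
                break
            dst.write(chunk)
            copied += len(chunk)
    return copied


def demo_append():
    """Append to a source file between two passes and compare the copies."""
    line = b"do more science!!!!\n"
    with tempfile.TemporaryDirectory() as base:
        source = os.path.join(base, "science.txt")
        dest = os.path.join(base, "replica.txt")
        with open(source, "ab") as f:
            f.write(line * 2)
        first = copy_appended(source, dest)   # copies both lines
        with open(source, "ab") as f:
            f.write(line)
        second = copy_appended(source, dest)  # copies only the new line
        with open(source, "rb") as a, open(dest, "rb") as b:
            same = a.read() == b.read()
        return first, second, same
```

The second pass transfers one line instead of the whole file, which is why this mode suits instruments that keep appending to large output files.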

Streaming Ingest

Modify:

./rodssync/lib/python3.5/site-packages/irods_capability_automated_ingest/examples/append_with_resc_name.py

from irods_capability_automated_ingest.core import Core
from irods_capability_automated_ingest.utils import Operation
class event_handler(Core):    
    @staticmethod
    def to_resource(session, target, path, **options):
        return "regiResc2a"

    @staticmethod
    def operation(session, target, path, **options):
        return Operation.PUT_APPEND

"regiResc2a" should become "demoResc"

Streaming Ingest

Prepare the source directory

rm -r ./src_dir/*

In another terminal start the data creation process

./append_data.sh ./src_dir/science.txt

Create a new ingest collection

imkdir stream_coll

Streaming Ingest

Start the ingest process with a restart interval of 1s

python -m irods_capability_automated_ingest.irods_sync start \
./src_dir /tempZone/home/rods/stream_coll \
--event_handler \
irods_capability_automated_ingest.examples.append_with_resc_name -i 1

Check the results: science.txt is growing in size

$ ils -l stream_coll/science.txt
  rods              0 demoResc          374 2019-05-23.10:13 & science.txt
$ ils -l stream_coll/science.txt
  rods              0 demoResc          418 2019-05-23.10:13 & science.txt
$ ils -l stream_coll/science.txt
  rods              0 demoResc          440 2019-05-23.10:13 & science.txt
<SNIP>

Streaming Ingest

Listing a periodic ingest job

$ python -m irods_capability_automated_ingest.irods_sync list
{"singlepass": [], "periodic": ["3defdd48-943c-11e9-aba6-123bf4b544e2"]}

Halting a running periodic ingest job

$ python -m irods_capability_automated_ingest.irods_sync \
    stop 3defdd48-943c-11e9-aba6-123bf4b544e2
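
The `list` output shown above looks like a single JSON object with `singlepass` and `periodic` keys. Assuming that format holds, a small sketch for picking out the identifiers to pass to `stop` could look like this; `periodic_job_ids` is a hypothetical helper, not part of the tool.

```python
import json


def periodic_job_ids(list_output):
    """Return the periodic job identifiers from the text printed by `list`."""
    jobs = json.loads(list_output)
    return list(jobs.get("periodic", []))
```

Given the output above, this returns a list containing `3defdd48-943c-11e9-aba6-123bf4b544e2`.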

File System Scanning

This implementation periodically scans a source directory and either registers new data in place or updates the system metadata for changed files.

 

In this use case, the file system is considered the single point of truth for ingest.  Changes are detected during the scan, and the system metadata is updated within the catalog.
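
A periodic scan can be pictured as a loop that runs one pass, waits for the interval, and repeats until told to stop. The sketch below shows only that control flow. `run_periodically` is a hypothetical helper and `task` stands for one scan pass. The tool itself distributes its work through the Celery queues started earlier, so this is an illustration and not its scheduler.

```python
import threading


def run_periodically(task, interval, stop_event, max_runs=None):
    """Call task() every `interval` seconds until stop_event is set.

    max_runs optionally bounds the number of passes. Returns the pass count.
    """
    runs = 0
    while not stop_event.is_set():
        task()
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        # wait() returns early if another thread sets the event
        stop_event.wait(interval)
    return runs
```

Setting the event from another thread plays the role of the `stop` command shown for periodic jobs: the loop finishes its current pass and exits.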

File System Scanning

Clean up the source directory and stage data

rm -r ./src_dir/*
cp -r /tmp/test_dir ./src_dir/dir0

Create a new target collection

imkdir scan_coll

Launch the scanner with a period of 1 second

python -m irods_capability_automated_ingest.irods_sync start ./src_dir /tempZone/home/rods/scan_coll --event_handler irods_capability_automated_ingest.examples.register -i 1

File System Scanning

Investigate the results

$ ils -l scan_coll/dir0
/tempZone/home/rods/scan_coll/dir0:
  rods              0 demoResc      2157087 2019-05-24.07:28 & img0.jpg
  rods              0 demoResc      2157087 2019-05-24.07:28 & img1.jpg
  rods              0 demoResc      2157087 2019-05-24.07:28 & img2.jpg
  rods              0 demoResc      2157087 2019-05-24.07:28 & img3.jpg
  rods              0 demoResc      2157087 2019-05-24.07:28 & img4.jpg

Stage new data and investigate data registration

$ cp -r /tmp/test_dir ./src_dir/dir1
$ ils -l scan_coll/
/tempZone/home/rods/scan_coll:
  C- /tempZone/home/rods/scan_coll/dir0  
  C- /tempZone/home/rods/scan_coll/dir1  

Wait for it...

File System Scanning

Investigate the results

$ ils -l scan_coll/dir1
/tempZone/home/rods/scan_coll/dir1:
  rods              0 demoResc      2157087 2019-05-24.07:28 & img0.jpg
  rods              0 demoResc      2157087 2019-05-24.07:28 & img1.jpg
  rods              0 demoResc      2157087 2019-05-24.07:28 & img2.jpg
  rods              0 demoResc      2157087 2019-05-24.07:28 & img3.jpg
  rods              0 demoResc      2157087 2019-05-24.07:28 & img4.jpg

Launch the append script to test streaming writes (if it is not already running, start it in another terminal)

$ ./append_data.sh ./src_dir/science.txt

File System Scanning

Investigate the results

$ ils -l scan_coll
/tempZone/home/rods/scan_coll:
  rods              0 demoResc          154 2019-05-24.07:45 & science.txt
  C- /tempZone/home/rods/scan_coll/dir0  
  C- /tempZone/home/rods/scan_coll/dir1

$ ils -l scan_coll/science.txt
  rods              0 demoResc          264 2019-05-24.07:45 & science.txt
$ ils -l scan_coll/science.txt
  rods              0 demoResc          308 2019-05-24.07:45 & science.txt
$ ils -l scan_coll/science.txt
  rods              0 demoResc          308 2019-05-24.07:45 & science.txt

File System Scanning

Managing a periodic ingest job

$ python -m irods_capability_automated_ingest.irods_sync list
{"periodic": ["c07e84bc-943c-11e9-aba6-123bf4b544e2"], "singlepass": []}

Halting a periodic ingest job

$ python -m irods_capability_automated_ingest.irods_sync \
     stop c07e84bc-943c-11e9-aba6-123bf4b544e2

Questions?