Automated Ingest

May 30, 2018

Great Plains Network 2018

Kansas City, MO

Jason Coposky

@jason_coposky

Executive Director, iRODS Consortium

Automated Ingest

iRODS Capabilities

  • Packaged and supported solutions
  • Require configuration, not code
  • Derived from the majority of use cases observed in the user community

Automated Ingest

Provide a flexible and highly scalable mechanism for data ingest

  • Directly ingest files
  • Capture streaming data
  • Register data in place
  • Synchronize a file system with the catalog
  • Extract and apply metadata

Architecture Overview

  • Implemented with Python iRODS Client
  • Based on Redis and Redis-Queue
  • Any number of workers distributed across servers
  • Policy defined through event callbacks to a user-provided Python module
  • File system metadata cached in Redis to detect changes
  • iRODS API is invoked only to update the catalog
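The change-detection idea in the list above can be sketched in a few lines. This is a minimal illustration, not the tool's implementation: a plain dict stands in for Redis, and the function name scan_for_changes and the (mtime, size) fingerprint are assumptions made for the example.

```python
import os

def scan_for_changes(root, cache):
    """Return the paths under root whose fingerprint differs from the cache.

    cache is a plain dict standing in for Redis (an assumption for this
    sketch); it is updated in place so a second scan reports nothing new.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            # Hypothetical fingerprint: modification time plus size.
            fingerprint = (st.st_mtime_ns, st.st_size)
            if cache.get(path) != fingerprint:
                cache[path] = fingerprint
                changed.append(path)
    return changed
```

Only the paths returned by such a scan would need a catalog update, which is consistent with the last bullet: the iRODS API is touched only when something has changed.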

Getting Started

sudo apt-get install -y redis-server
sudo apt-get install -y python-pip

sudo pip2 install virtualenv

As the ubuntu user

virtualenv -p python3 rodssync
source rodssync/bin/activate
pip3 install rq python-redis-lock rq-scheduler python-irodsclient structlog
git clone https://github.com/irods/irods_capability_automated_ingest
cd irods_capability_automated_ingest

As the irods user

Getting Started

Start the Redis Queue Scheduler

source rodssync/bin/activate
cd irods_capability_automated_ingest
rqscheduler -i 1

As the irods user - in separate terminals

Start a Redis Queue Monitor

source rodssync/bin/activate
cd irods_capability_automated_ingest
while true; do clear ; rq info ; sleep 2 ; done

Start a single Redis Queue Worker

source rodssync/bin/activate
cd irods_capability_automated_ingest
rqworker restart path file

Getting Started

mkdir /tmp/test_dir
cp /tmp/stickers.jpg /tmp/test_dir/img0.jpg
cp /tmp/stickers.jpg /tmp/test_dir/img1.jpg
cp /tmp/stickers.jpg /tmp/test_dir/img2.jpg
cp /tmp/stickers.jpg /tmp/test_dir/img3.jpg
cp /tmp/stickers.jpg /tmp/test_dir/img4.jpg

Generate some source data

mkdir ./src_dir
cp -r /tmp/test_dir ./src_dir/dir0

Stage source data

Default Ingest Behavior

By default, the framework registers the data in place against the default resource, into the given collection.

imkdir reg_coll

python3 -m irods_capability_automated_ingest.irods_sync start ./src_dir /tempZone/home/rods/reg_coll

Check the Redis monitor terminal for the restart, file, and path queues

Default Ingest Behavior

Check that our results are registered in place

$ ils -L reg_coll/dir0
/tempZone/home/rods/reg_coll/dir0:
  rods              0 demoResc      2157087 2018-05-21.05:54 & img0.jpg
        generic    /var/lib/irods/irods_capability_automated_ingest/src_dir/dir0/img0.jpg
  rods              0 demoResc      2157087 2018-05-21.05:54 & img1.jpg
        generic    /var/lib/irods/irods_capability_automated_ingest/src_dir/dir0/img1.jpg
  rods              0 demoResc      2157087 2018-05-21.05:54 & img2.jpg
        generic    /var/lib/irods/irods_capability_automated_ingest/src_dir/dir0/img2.jpg
  rods              0 demoResc      2157087 2018-05-21.05:54 & img3.jpg
        generic    /var/lib/irods/irods_capability_automated_ingest/src_dir/dir0/img3.jpg
  rods              0 demoResc      2157087 2018-05-21.05:54 & img4.jpg
        generic    /var/lib/irods/irods_capability_automated_ingest/src_dir/dir0/img4.jpg
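The listing above shows each file under the source directory appearing at the same relative location under the target collection. A minimal sketch of that mapping follows; the helper name logical_path is hypothetical and this is not the tool's own code.

```python
import os
import posixpath

def logical_path(source_root, target_collection, file_path):
    """Map a file under source_root to a logical path in target_collection."""
    # Path of the file relative to the directory being ingested.
    rel = os.path.relpath(file_path, source_root)
    # iRODS logical paths always use forward slashes.
    return posixpath.join(target_collection, *rel.split(os.sep))

print(logical_path("./src_dir", "/tempZone/home/rods/reg_coll",
                   "./src_dir/dir0/img0.jpg"))
```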

Default Ingest Behavior

Stage some more data and re-run the ingest

cp -r /tmp/test_dir ./src_dir/dir1

python3 -m irods_capability_automated_ingest.irods_sync start ./src_dir /tempZone/home/rods/reg_coll
$ ils -L reg_coll/dir1
/tempZone/home/rods/reg_coll/dir1:
  rods              0 demoResc      2157087 2018-05-21.06:11 & img0.jpg
        generic    /var/lib/irods/irods_capability_automated_ingest/src_dir/dir1/img0.jpg
  rods              0 demoResc      2157087 2018-05-21.06:11 & img1.jpg
        generic    /var/lib/irods/irods_capability_automated_ingest/src_dir/dir1/img1.jpg
  rods              0 demoResc      2157087 2018-05-21.06:11 & img2.jpg
        generic    /var/lib/irods/irods_capability_automated_ingest/src_dir/dir1/img2.jpg
  rods              0 demoResc      2157087 2018-05-21.06:11 & img3.jpg
        generic    /var/lib/irods/irods_capability_automated_ingest/src_dir/dir1/img3.jpg
  rods              0 demoResc      2157087 2018-05-21.06:11 & img4.jpg
        generic    /var/lib/irods/irods_capability_automated_ingest/src_dir/dir1/img4.jpg

Check our results

Customizing the Ingest Behavior

The ingest tool is a callback-based system: it invokes methods within the custom event handler module provided by the administrator.

These callbacks may then take action, such as setting ACLs, providing additional context (for example, selecting storage resources), or extracting and applying metadata.

Customizing the Ingest Behavior

Method                Effect                                       Default Result
pre_data_obj_create   user-defined python                          none
post_data_obj_create  user-defined python                          none
pre_data_obj_modify   user-defined python                          none
post_data_obj_modify  user-defined python                          none
pre_coll_create       user-defined python                          none
post_coll_create      user-defined python                          none
as_user               takes action as this iRODS user              authenticated user
target_path           sets mount path on the iRODS server, which   client mount path
                      can differ from the client mount path
to_resource                                                        as provided by client environment
operation                                                          Operation.REGISTER_SYNC

Available event callbacks
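One way to picture how optional callbacks with defaults could work is the sketch below. It is only an illustration under stated assumptions: the Core class here is a local stand-in rather than the package's class, and the dispatcher call_if_defined is a hypothetical helper, not part of the tool.

```python
class Core:
    """Local stand-in for the package's base class (assumption)."""

class event_handler(Core):
    @staticmethod
    def to_resource(session, target, path, **options):
        return "demoResc"

def call_if_defined(handler, name, default, *args, **options):
    """Invoke handler.<name> if the handler defines it, else return default."""
    method = getattr(handler, name, None)
    if method is None:
        return default
    return method(*args, **options)

# to_resource is defined by the handler, so its value is used.
print(call_if_defined(event_handler, "to_resource", None,
                      None, "/tempZone/home/rods/f", "/tmp/f"))
# operation is not defined, so the default is used instead.
print(call_if_defined(event_handler, "operation", "REGISTER_SYNC",
                      None, "/tempZone/home/rods/f", "/tmp/f"))
```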

Customizing the Ingest Behavior

The operation mode is returned by the 'operation' method, which tells the ingest tool which behavior is desired for a given ingest job.

Should the default behavior be overridden, one of these operations must be selected and returned. These operations are hard-coded into the tool and cover the typical use cases of data ingest.

Customizing the Ingest Behavior

Operation                           New files                                              Updated files
Operation.REGISTER_SYNC (default)   registers in catalog                                   updates size in catalog
Operation.REGISTER_AS_REPLICA_SYNC  registers first or additional replica                  updates size in catalog
Operation.PUT                       copies file to target vault, and registers in catalog  no action
Operation.PUT_SYNC                  copies file to target vault, and registers in catalog  copies entire file again, and updates catalog
Operation.PUT_APPEND                copies file to target vault, and registers in catalog  copies only appended part of file, and updates catalog

Available Operations
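To make the difference between the operations concrete for a file that was already ingested and has since grown, the sketch below models how many bytes each operation would copy. It is a simplified model of the table, not the tool's code: the Operation enum is a local stand-in, the function name is hypothetical, and it assumes the file only ever grows.

```python
from enum import Enum

class Operation(Enum):
    """Local stand-in using the operation names from the table."""
    REGISTER_SYNC = "register_sync"
    REGISTER_AS_REPLICA_SYNC = "register_as_replica_sync"
    PUT = "put"
    PUT_SYNC = "put_sync"
    PUT_APPEND = "put_append"

def bytes_copied_on_update(operation, old_size, new_size):
    """Bytes copied to the vault when an ingested file grows."""
    if operation is Operation.PUT_SYNC:
        return new_size              # entire file is copied again
    if operation is Operation.PUT_APPEND:
        return new_size - old_size   # only the appended part is copied
    return 0                         # register modes and PUT copy nothing

for op in Operation:
    print(op.name, bytes_copied_on_update(op, 418, 440))
```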

Example ingest modules

examples/put_with_resc_name.py

from irods_capability_automated_ingest.core import Core
from irods_capability_automated_ingest.utils import Operation
class event_handler(Core):    # <-- expected interface
    @staticmethod
    def to_resource(session, target, path, **options):  # <-- method
        return "regiResc2a"                             # <-- expected side effect

    @staticmethod
    def operation(session, target, path, **options):    # <-- method
        return Operation.PUT                            # <-- operation

For this example, we would want to change 'regiResc2a' to our target resource

Example Ingest Modules

examples/sync_root_with_resc_name.py

from irods_capability_automated_ingest.core import Core
from irods_capability_automated_ingest.utils import Operation

class event_handler(Core):
    @staticmethod
    def to_resource(session, target, path, **options):
        return "regiResc2Root"

    @staticmethod
    def operation(session, target, path, **options):
        return Operation.PUT_SYNC

Note that this uses a different operation, which not only ingests new data but also synchronizes previously ingested data

Custom Ingest with Metadata Extraction

Install the exifread python library

pip3 install exifread

With the editor of your choice, create and edit:

irods_capability_automated_ingest/examples/put_with_resc_name_image_metadata.py

Create a new ingest collection

imkdir put_coll

Custom Ingest with Metadata Extraction

Implement the usual event handler and operation

We will add additional event handlers:

  • post_data_obj_create 
  • post_data_obj_modify 

Using the Python iRODS client, we will apply metadata extracted by exifread

Should the metadata key already exist, we overwrite it with the new value

Custom Ingest with Metadata Extraction

import exifread
from irods.meta import iRODSMeta  # needed for the overwrite branch below
from irods_capability_automated_ingest.core import Core
from irods_capability_automated_ingest.utils import Operation

def add_exif_metadata(session, target, path):
    with open(path, 'rb') as f:
        obj = session.data_objects.get(target)
        tags = exifread.process_file(f, details=False)
        for (k, v) in tags.items():
            # skip thumbnails and other bulky tags
            if k not in ('JPEGThumbnail','TIFFThumbnail','Filename','EXIF MakerNote'):
                if k in obj.metadata.keys():
                    # key already present: overwrite it with the new value
                    obj.metadata[k] = iRODSMeta(str(k), str(v))
                else:
                    obj.metadata.add(str(k), str(v), '')

class event_handler(Core):  
    @staticmethod
    def to_resource(session, target, path, **options):
        return "demoResc"
    @staticmethod
    def operation(session, target, path, **options):
        return Operation.PUT
    @staticmethod
    def post_data_obj_create(hdlr_mod, logger, session, target, path, **options):
        add_exif_metadata(session, target, path)
    @staticmethod
    def post_data_obj_modify(hdlr_mod, logger, session, target, path, **options):
        add_exif_metadata(session, target, path)

Custom Ingest with Metadata Extraction

Should we want this module to register in place, we can simply change the operation returned to REGISTER_SYNC or REGISTER_AS_REPLICA_SYNC

cp -r /tmp/test_dir ./src_dir/dir2

Stage new test data

python3 -m irods_capability_automated_ingest.irods_sync start ./src_dir /tempZone/home/rods/put_coll --event_handler irods_capability_automated_ingest.examples.put_with_resc_name_image_metadata

Launch the ingest job

Custom Ingest with Metadata Extraction

$ ils -l put_coll/dir2
/tempZone/home/rods/put_coll/dir2:
  rods              0 demoResc      2157087 2018-05-22.11:16 & img0.jpg
  rods              0 demoResc      2157087 2018-05-22.11:16 & img1.jpg
  rods              0 demoResc      2157087 2018-05-22.11:16 & img2.jpg
  rods              0 demoResc      2157087 2018-05-22.11:16 & img3.jpg
  rods              0 demoResc      2157087 2018-05-22.11:16 & img4.jpg

Check our results

Custom Ingest with Metadata Extraction

$ imeta ls -d put_coll/dir2/img0.jpg
AVUs defined for dataObj put_coll/dir2/img0.jpg:
attribute: EXIF ApertureValue
value: 7983/3509
units:
----
attribute: EXIF BrightnessValue
value: 2632/897
units:
----
<snip>
----
attribute: Thumbnail YResolution
value: 72
units:

Inspect our newly harvested metadata

Any Python library can now be leveraged to extract and apply metadata. The MIME type can be detected and mapped to the appropriate metadata extractor.
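A minimal sketch of MIME-based dispatch, using the standard library's mimetypes module. The extractor functions and the EXTRACTORS table are hypothetical placeholders; a real handler might call exifread for images. Note that mimetypes.guess_type goes by the file extension, not the file's contents.

```python
import mimetypes

def extract_image_metadata(path):
    # Placeholder: a real extractor might call exifread here.
    return {"kind": "image"}

def extract_text_metadata(path):
    # Placeholder for a text-oriented extractor.
    return {"kind": "text"}

# Hypothetical mapping from major MIME type to extractor.
EXTRACTORS = {
    "image": extract_image_metadata,
    "text": extract_text_metadata,
}

def extract_metadata(path):
    """Pick an extractor from the guessed MIME type; empty dict if none fits."""
    mime_type, _encoding = mimetypes.guess_type(path)
    if mime_type is None:
        return {}
    major = mime_type.split("/", 1)[0]
    extractor = EXTRACTORS.get(major)
    return extractor(path) if extractor else {}
```

The returned dictionary could then be applied as AVUs from a post_data_obj_create callback, in the same way as the EXIF tags above.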

The Landing Zone

In this use case, data is written to disk by an instrument or another source, and we run an ingest job on that directory.

Once data is ingested, it is moved out of the landing zone in order to improve ingest performance.

These ingested files can be removed later as a matter of local administrative policy.

The critical difference between a pure file system scan and a landing zone is that the landing zone (LZ) is not considered the single point of truth; it is a staging area for ingest, so files are moved out of the way once ingested.

In a file system scan, the file system is the canonical replica, and the catalog and other replicas are kept in sync with it.

The Landing Zone

Create a new ingest collection

imkdir lz_coll

Create the landing zone and ingested directories

mkdir /tmp/landing_zone
mkdir /tmp/ingested

Stage data for ingest

cp -r /tmp/test_dir /tmp/landing_zone/dir0

Preparing a Landing Zone

With the editor of your choice, create and edit:

irods_capability_automated_ingest/examples/lz_put_with_resc_name_image_metadata.py

The Landing Zone

import exifread
import os
from irods.meta import iRODSMeta  # needed for the overwrite branch below
from irods_capability_automated_ingest.core import Core
from irods_capability_automated_ingest.utils import Operation
def add_exif_metadata(session, target, path):
    with open(path, 'rb') as f:
        obj = session.data_objects.get(target)
        try:
            tags = exifread.process_file(f, details=False)
            for (k, v) in tags.items():
                # skip thumbnails and other bulky tags
                if k not in ('JPEGThumbnail','TIFFThumbnail','Filename','EXIF MakerNote'):
                    if k in obj.metadata.keys():
                        # key already present: overwrite it with the new value
                        obj.metadata[k] = iRODSMeta(str(k), str(v))
                    else:
                        obj.metadata.add(str(k), str(v), '')
        except Exception:
            # metadata extraction is best effort; the ingest itself continues
            pass

class event_handler(Core):
    @staticmethod
    def to_resource(session, target, path, **options):
        return "demoResc"
    @staticmethod
    def operation(session, target, path, **options):
        return Operation.PUT
    @staticmethod
    def post_data_obj_create(hdlr_mod, logger, session, target, path, **options):
        add_exif_metadata(session, target, path)
        new_path = path.replace('/tmp/landing_zone', '/tmp/ingested')
        try:
            dir_name = os.path.dirname(new_path)
            os.makedirs(dir_name, exist_ok=True)
            os.rename(path, new_path)
        except OSError:
            logger.info('FAILED to move ['+path+'] to ['+new_path+']')
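The handler above derives the destination with str.replace, which substitutes every occurrence of the landing zone string in the path. A slightly more defensive variant, sketched below, computes the path relative to the landing zone instead. The helper name ingested_path is hypothetical and not part of the tool.

```python
import os

def ingested_path(path, landing_zone, ingested_root):
    """Return where a landing-zone file should be moved after ingest."""
    rel = os.path.relpath(path, landing_zone)
    # Refuse files that are not actually inside the landing zone.
    if rel == os.pardir or rel.startswith(os.pardir + os.sep):
        raise ValueError(path + " is not under " + landing_zone)
    return os.path.join(ingested_root, rel)

print(ingested_path("/tmp/landing_zone/dir0/img0.jpg",
                    "/tmp/landing_zone", "/tmp/ingested"))
```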

The Landing Zone

$ python3 -m irods_capability_automated_ingest.irods_sync start /tmp/landing_zone /tempZone/home/rods/lz_coll --event_handler irods_capability_automated_ingest.examples.lz_put_with_resc_name_image_metadata

Launch the ingest job

$ ls -l /tmp/ingested/dir0/
total 10540
-rw-rw-r-- 1 irods irods 2157087 May 23 05:24 img0.jpg
-rw-rw-r-- 1 irods irods 2157087 May 23 05:24 img1.jpg
-rw-rw-r-- 1 irods irods 2157087 May 23 05:24 img2.jpg
-rw-rw-r-- 1 irods irods 2157087 May 23 05:24 img3.jpg
-rw-rw-r-- 1 irods irods 2157087 May 23 05:24 img4.jpg

Check the ingested directory for the files

The Landing Zone

$ ils -l lz_coll/dir0
/tempZone/home/rods/lz_coll/dir0:
  rods              0 demoResc      2157087 2018-05-22.11:16 & img0.jpg
  rods              0 demoResc      2157087 2018-05-22.11:16 & img1.jpg
  rods              0 demoResc      2157087 2018-05-22.11:16 & img2.jpg
  rods              0 demoResc      2157087 2018-05-22.11:16 & img3.jpg
  rods              0 demoResc      2157087 2018-05-22.11:16 & img4.jpg

Check the ingest collection lz_coll

The Landing Zone

Check the ingested metadata

$ imeta ls -d lz_coll/dir0/img0.jpg
AVUs defined for dataObj lz_coll/dir0/img0.jpg:
attribute: EXIF ApertureValue
value: 7983/3509
units:
----
attribute: EXIF BrightnessValue
value: 2632/897
units:
----
<snip>
----
attribute: Thumbnail YResolution
value: 72
units:

Streaming Ingest

For this use case some instruments periodically append output to a file or set of files.

We will configure an ingest job for a file to which a process will periodically append some data.

#!/bin/bash
# Append one line to the given file every 2 seconds
file=$1
touch "$file"
for i in {1..1000}; do
    echo "do more science!!!!\n" >> "$file"
    sleep 2
done

Edit append_data.sh (make sure it is executable)
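For a file that only grows, as this script produces, an append-style ingest needs to transfer only the bytes past what the destination already holds. The sketch below shows that idea on local files; it is an illustration rather than the tool's PUT_APPEND implementation, and the function name copy_appended is an assumption.

```python
import os

def copy_appended(source, dest):
    """Copy only the bytes of source beyond dest's current length.

    Returns the number of bytes copied. Assumes source only ever grows.
    """
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    with open(source, "rb") as src, open(dest, "ab") as dst:
        src.seek(offset)
        data = src.read()   # for very large files, read in chunks instead
        dst.write(data)
    return len(data)
```

Calling it repeatedly while the script runs would copy just the newly appended lines each time.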

Streaming Ingest

Modify irods_capability_automated_ingest/examples/append_with_resc_name.py

from irods_capability_automated_ingest.core import Core
from irods_capability_automated_ingest.utils import Operation
class event_handler(Core):    
    @staticmethod
    def to_resource(session, target, path, **options):
        return "regiResc2a"

    @staticmethod
    def operation(session, target, path, **options):
        return Operation.PUT_APPEND

"regiResc2a" should become "demoResc"

Streaming Ingest

Prepare the source directory

$ rm -r ./src_dir/*

In another terminal start the data creation process

$ ./append_data.sh ./src_dir/science.txt

Create a new ingest collection

$ imkdir stream_coll

Streaming Ingest

Start the ingest process with a restart interval of 1s

$ python3 -m irods_capability_automated_ingest.irods_sync start ./src_dir /tempZone/home/rods/stream_coll --event_handler irods_capability_automated_ingest.examples.append_with_resc_name -i 1

Check our results

$ ils -l stream_coll/science.txt
  rods              0 demoResc          374 2018-05-23.10:13 & science.txt
$ ils -l stream_coll/science.txt
  rods              0 demoResc          418 2018-05-23.10:13 & science.txt
$ ils -l stream_coll/science.txt
  rods              0 demoResc          440 2018-05-23.10:13 & science.txt
<SNIP>

Streaming Ingest

Managing a periodic ingest job

$ python3 -m irods_capability_automated_ingest.irods_sync list
63099db0-5e93-11e8-b142-080027e8658d

Halting a periodic ingest job

$ python3 -m irods_capability_automated_ingest.irods_sync stop 63099db0-5e93-11e8-b142-080027e8658d

File System Scanning

This implementation periodically scans a source directory and either registers new data in place or updates system metadata for changed files.

In this use case, the file system is considered the single point of truth for ingest. Changes are detected during the scan, and the system metadata is updated within the catalog.

File system Scanning

Clean up the source directory and stage data

$ rm -r ./src_dir/*
$ cp -r /tmp/test_dir ./src_dir/dir0

Create a new target collection

$ imkdir scan_coll

Launch the scanner with a period of 1 second

$ python3 -m irods_capability_automated_ingest.irods_sync start ./src_dir /tempZone/home/rods/scan_coll --event_handler irods_capability_automated_ingest.examples.register -i 1

File system Scanning

Investigate our results

$ ils -l scan_coll/dir0
/tempZone/home/rods/scan_coll/dir0:
  rods              0 demoResc      2157087 2018-05-24.07:28 & img0.jpg
  rods              0 demoResc      2157087 2018-05-24.07:28 & img1.jpg
  rods              0 demoResc      2157087 2018-05-24.07:28 & img2.jpg
  rods              0 demoResc      2157087 2018-05-24.07:28 & img3.jpg
  rods              0 demoResc      2157087 2018-05-24.07:28 & img4.jpg

Stage new data and investigate data registration

$ cp -r /tmp/test_dir ./src_dir/dir1
$ ils -l scan_coll/
/tempZone/home/rods/scan_coll:
  C- /tempZone/home/rods/scan_coll/dir0  
  C- /tempZone/home/rods/scan_coll/dir1  

Wait for it...

File system Scanning

Investigate our results

$ ils -l scan_coll/dir1
/tempZone/home/rods/scan_coll/dir1:
  rods              0 demoResc      2157087 2018-05-24.07:28 & img0.jpg
  rods              0 demoResc      2157087 2018-05-24.07:28 & img1.jpg
  rods              0 demoResc      2157087 2018-05-24.07:28 & img2.jpg
  rods              0 demoResc      2157087 2018-05-24.07:28 & img3.jpg
  rods              0 demoResc      2157087 2018-05-24.07:28 & img4.jpg

If it is not already running, launch the append script in another terminal to test streaming writes

$ ./append_data.sh ./src_dir/science.txt

File system Scanning

Investigate our results

$ ils -l scan_coll
/tempZone/home/rods/scan_coll:
  rods              0 demoResc          154 2018-05-24.07:45 & science.txt
  C- /tempZone/home/rods/scan_coll/dir0  
  C- /tempZone/home/rods/scan_coll/dir1

$ ils -l scan_coll/science.txt
  rods              0 demoResc          264 2018-05-24.07:45 & science.txt
$ ils -l scan_coll/science.txt
  rods              0 demoResc          308 2018-05-24.07:45 & science.txt
$ ils -l scan_coll/science.txt
  rods              0 demoResc          308 2018-05-24.07:45 & science.txt

File system Scanning

Managing a periodic ingest job

$ python3 -m irods_capability_automated_ingest.irods_sync list
b0f2d6fe-5f47-11e8-b142-080027e8658d

Halting a periodic ingest job

$ python3 -m irods_capability_automated_ingest.irods_sync stop b0f2d6fe-5f47-11e8-b142-080027e8658d

Questions?
