Automated Ingest
May 30, 2018
Great Plains Network 2018
Kansas City, MO
Jason Coposky
@jason_coposky
Executive Director, iRODS Consortium
Automated Ingest
iRODS Capabilities
- Packaged and supported solutions
- Require configuration not code
- Derived from the majority of use cases observed in the user community
Automated Ingest
Provide a flexible and highly scalable mechanism for data ingest
- Directly ingest files
- Capture streaming data
- Register data in place
- Synchronize a file system with the catalog
- Extract and apply metadata
Architecture Overview
- Implemented with Python iRODS Client
- Based on Redis and Redis-Queue
- Any number of workers distributed across servers
- Policy defined through event callbacks to user provided python module
- File system metadata cached in Redis to detect changes
- iRODS API is invoked only to update the catalog
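The worker model above can be pictured with a small in-process sketch. This is an illustration only: standard-library `deque` objects stand in for the Redis-backed queues, and the names `path_queue`, `file_queue`, `enqueue_directory`, and `run_worker` are hypothetical and not part of the tool. It shows a directory job fanning out into one job per file, which is the kind of work any number of workers could share.

```python
import os
import tempfile
from collections import deque

# Hypothetical stand-ins for the Redis-backed queues; not the tool's real API.
path_queue = deque()
file_queue = deque()


def enqueue_directory(root):
    """Queue a directory to be enumerated by a worker."""
    path_queue.append(root)


def run_worker(handle_file):
    """Drain both queues: directory jobs fan out into per-file jobs."""
    processed = []
    while path_queue or file_queue:
        if path_queue:
            directory = path_queue.popleft()
            for entry in sorted(os.listdir(directory)):
                full = os.path.join(directory, entry)
                if os.path.isdir(full):
                    path_queue.append(full)   # enumerate this directory later
                else:
                    file_queue.append(full)   # one job per file
        else:
            path = file_queue.popleft()
            handle_file(path)
            processed.append(path)
    return processed


def demo():
    """Build a tiny source tree and return the files handled, relative to it."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "dir0"))
        for name in ("img0.jpg", "img1.jpg"):
            with open(os.path.join(root, "dir0", name), "wb") as f:
                f.write(b"data")
        enqueue_directory(root)
        done = run_worker(lambda path: None)
        return [os.path.relpath(p, root) for p in done]
```

In the real deployment the queues live in Redis, so the workers can run on different servers.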
Getting Started
As the ubuntu user
sudo apt-get install -y redis-server
sudo apt-get install -y python-pip
sudo pip2 install virtualenv
As the irods user
virtualenv -p python3 rodssync
source rodssync/bin/activate
pip3 install rq python-redis-lock rq-scheduler python-irodsclient structlog
git clone https://github.com/irods/irods_capability_automated_ingest
cd irods_capability_automated_ingest
Getting Started
As the irods user - in separate terminals
Start the Redis Queue Scheduler
source rodssync/bin/activate
cd irods_capability_automated_ingest
rqscheduler -i 1
Start a Redis Queue Monitor
source rodssync/bin/activate
cd irods_capability_automated_ingest
while true; do clear ; rq info ; sleep 2 ; done
Start a single Redis Queue Worker
source rodssync/bin/activate
cd irods_capability_automated_ingest
rqworker restart path file
Getting Started
Generate some source data
mkdir /tmp/test_dir
cp /tmp/stickers.jpg /tmp/test_dir/img0.jpg
cp /tmp/stickers.jpg /tmp/test_dir/img1.jpg
cp /tmp/stickers.jpg /tmp/test_dir/img2.jpg
cp /tmp/stickers.jpg /tmp/test_dir/img3.jpg
cp /tmp/stickers.jpg /tmp/test_dir/img4.jpg
Stage source data
mkdir ./src_dir
cp -r /tmp/test_dir ./src_dir/dir0
Default Ingest Behavior
By default, the framework registers the data in place, against the default resource, into the given collection.
imkdir reg_coll
python3 -m irods_capability_automated_ingest.irods_sync start ./src_dir /tempZone/home/rods/reg_coll
Check the redis monitor terminal for the restart, file and path queues
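The relationship between a source file and its logical path can be sketched as a small mapping function. This is an assumption inferred from the listings in this walkthrough (files under `./src_dir/dir0` appear under `reg_coll/dir0`), not the tool's actual code, and `target_for` is a hypothetical name.

```python
import posixpath


def target_for(source_root, target_collection, source_path):
    """Map a file under source_root to a logical path in target_collection."""
    rel = posixpath.relpath(source_path, source_root)
    if rel == ".." or rel.startswith("../"):
        raise ValueError("source_path is outside source_root")
    return posixpath.join(target_collection, rel)


logical_path = target_for(
    "./src_dir",
    "/tempZone/home/rods/reg_coll",
    "./src_dir/dir0/img0.jpg",
)
print(logical_path)
```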
Default Ingest Behavior
Check that our results are registered in place
$ ils -L reg_coll/dir0
/tempZone/home/rods/reg_coll/dir0:
  rods 0 demoResc 2157087 2018-05-21.05:54 & img0.jpg
    generic /var/lib/irods/irods_capability_automated_ingest/src_dir/dir0/img0.jpg
  rods 0 demoResc 2157087 2018-05-21.05:54 & img1.jpg
    generic /var/lib/irods/irods_capability_automated_ingest/src_dir/dir0/img1.jpg
  rods 0 demoResc 2157087 2018-05-21.05:54 & img2.jpg
    generic /var/lib/irods/irods_capability_automated_ingest/src_dir/dir0/img2.jpg
  rods 0 demoResc 2157087 2018-05-21.05:54 & img3.jpg
    generic /var/lib/irods/irods_capability_automated_ingest/src_dir/dir0/img3.jpg
  rods 0 demoResc 2157087 2018-05-21.05:54 & img4.jpg
    generic /var/lib/irods/irods_capability_automated_ingest/src_dir/dir0/img4.jpg
Default Ingest Behavior
Stage some more data and re-run the ingest
cp -r /tmp/test_dir ./src_dir/dir1
python3 -m irods_capability_automated_ingest.irods_sync start ./src_dir /tempZone/home/rods/reg_coll
Check our results
$ ils -L reg_coll/dir1
/tempZone/home/rods/reg_coll/dir1:
  rods 0 demoResc 2157087 2018-05-21.06:11 & img0.jpg
    generic /var/lib/irods/irods_capability_automated_ingest/src_dir/dir1/img0.jpg
  rods 0 demoResc 2157087 2018-05-21.06:11 & img1.jpg
    generic /var/lib/irods/irods_capability_automated_ingest/src_dir/dir1/img1.jpg
  rods 0 demoResc 2157087 2018-05-21.06:11 & img2.jpg
    generic /var/lib/irods/irods_capability_automated_ingest/src_dir/dir1/img2.jpg
  rods 0 demoResc 2157087 2018-05-21.06:11 & img3.jpg
    generic /var/lib/irods/irods_capability_automated_ingest/src_dir/dir1/img3.jpg
  rods 0 demoResc 2157087 2018-05-21.06:11 & img4.jpg
    generic /var/lib/irods/irods_capability_automated_ingest/src_dir/dir1/img4.jpg
Customizing the Ingest Behavior
The ingest tool is a callback-based system: it invokes methods within the custom event handler module provided by the administrator.
These callbacks may take action, such as setting ACLs; provide additional context, such as the selection of storage resources; or extract and apply metadata.
Customizing the Ingest Behavior
Method | Effect | Default Result |
---|---|---|
pre_data_obj_create | user-defined python | none |
post_data_obj_create | user-defined python | none |
pre_data_obj_modify | user-defined python | none |
post_data_obj_modify | user-defined python | none |
pre_coll_create | user-defined python | none |
post_coll_create | user-defined python | none |
as_user | takes action as this iRODS user | authenticated user |
target_path | sets the mount path on the iRODS server, which can differ from the client mount path | client mount path |
to_resource | selects the target storage resource | as provided by client environment |
operation | selects the ingest behavior for the job | Operation.REGISTER_SYNC |
Available event callbacks
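The callback pattern in the table can be sketched without the real package. Everything here is a stand-in: `CoreSketch` and `create_data_object` are hypothetical names, and a plain dict plays the role of the catalog. Only the hook names and the argument order `(hdlr_mod, logger, session, target, path, **options)` are taken from the handler examples in these slides.

```python
class CoreSketch:
    """Hypothetical stand-in for the tool's base class: every hook is a no-op."""

    @staticmethod
    def pre_data_obj_create(hdlr_mod, logger, session, target, path, **options):
        pass

    @staticmethod
    def post_data_obj_create(hdlr_mod, logger, session, target, path, **options):
        pass


class event_handler(CoreSketch):
    """Overrides only the hook it cares about; the rest keep the defaults."""

    calls = []

    @staticmethod
    def post_data_obj_create(hdlr_mod, logger, session, target, path, **options):
        event_handler.calls.append(("post_create", target))


def create_data_object(handler, session, target, path, catalog):
    """Hypothetical driver: run the pre hook, do the work, run the post hook."""
    handler.pre_data_obj_create(None, None, session, target, path)
    catalog[target] = path  # stands in for the catalog update
    handler.post_data_obj_create(None, None, session, target, path)


catalog = {}
create_data_object(
    event_handler,
    None,
    "/tempZone/home/rods/reg_coll/dir0/img0.jpg",
    "/tmp/test_dir/img0.jpg",
    catalog,
)
```

Because unimplemented hooks fall back to the no-op defaults, a handler module only needs to define the events it wants to customize.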
Customizing the Ingest Behavior
The operation mode is returned by the 'operation' method, which informs the ingest tool as to which behavior is desired for a given ingest job.
Should the default behavior be overridden, one of these operations must be selected and returned. These operations are hard-coded into the tool and cover the typical use cases of data ingest.
Customizing the Ingest Behavior
Operation | New Files | Updated Files |
---|---|---|
Operation.REGISTER_SYNC (default) | registers in catalog | updates size in catalog |
Operation.REGISTER_AS_REPLICA_SYNC | registers first or additional replica | updates size in catalog |
Operation.PUT | copies file to target vault, and registers in catalog | no action |
Operation.PUT_SYNC | copies file to target vault, and registers in catalog | copies entire file again, and updates catalog |
Operation.PUT_APPEND | copies file to target vault, and registers in catalog | copies only appended part of file, and updates catalog |
Available Operations
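The table can be read as a lookup from operation to the action taken for a new file versus a file that was ingested before and has since changed. The sketch below restates it in code; `Op`, `ACTIONS`, and `action_for` are hypothetical names that mirror the table and are not the tool's own classes.

```python
from enum import Enum


class Op(Enum):
    """Hypothetical mirror of the operations listed in the table."""
    REGISTER_SYNC = "REGISTER_SYNC"
    REGISTER_AS_REPLICA_SYNC = "REGISTER_AS_REPLICA_SYNC"
    PUT = "PUT"
    PUT_SYNC = "PUT_SYNC"
    PUT_APPEND = "PUT_APPEND"


# (action for a new file, action for a previously ingested file that changed)
ACTIONS = {
    Op.REGISTER_SYNC: ("register in catalog", "update size in catalog"),
    Op.REGISTER_AS_REPLICA_SYNC: ("register replica", "update size in catalog"),
    Op.PUT: ("copy and register", "no action"),
    Op.PUT_SYNC: ("copy and register", "copy entire file again"),
    Op.PUT_APPEND: ("copy and register", "copy appended part only"),
}


def action_for(op, already_ingested):
    """Return the action the table lists for this operation and file state."""
    new_action, updated_action = ACTIONS[op]
    return updated_action if already_ingested else new_action
```

For example, `action_for(Op.PUT, already_ingested=True)` is "no action", which is why PUT suits one-time ingest while PUT_SYNC and PUT_APPEND suit sources that keep changing.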
Example ingest modules
examples/put_with_resc_name.py
from irods_capability_automated_ingest.core import Core
from irods_capability_automated_ingest.utils import Operation

class event_handler(Core):  # expected interface
    @staticmethod
    def to_resource(session, target, path, **options):  # method
        return "regiResc2a"  # expected side effect

    @staticmethod
    def operation(session, target, path, **options):  # method
        return Operation.PUT  # operation
For this example, we would change 'regiResc2a' to our target resource.
Example Ingest Modules
examples/sync_root_with_resc_name.py
from irods_capability_automated_ingest.core import Core
from irods_capability_automated_ingest.utils import Operation

class event_handler(Core):
    @staticmethod
    def to_resource(session, target, path, **options):
        return "regiResc2Root"

    @staticmethod
    def operation(session, target, path, **options):
        return Operation.PUT_SYNC
Note that this uses a different operation, which not only ingests new data but also synchronizes previously ingested data.
Custom Ingest with Metadata Extraction
Install the exifread python library
pip3 install exifread
With the editor of your choice, create and edit:
irods_capability_automated_ingest/examples/put_with_resc_name_image_metadata.py
Create a new ingest collection
imkdir put_coll
Custom Ingest with Metadata Extraction
Implement the usual event handler and operation
We will add additional event handlers:
- post_data_obj_create
- post_data_obj_modify
Using the python iRODS client we will apply metadata extracted by exifread
Should the metadata key already exist, we overwrite it with the new value.
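The overwrite-or-add rule can be shown on plain data before involving the iRODS client. In this sketch the AVUs are ordinary `(attribute, value, units)` tuples and `apply_metadata` is a hypothetical helper; it is not the client's API, only the rule it is used to implement.

```python
def apply_metadata(avus, extracted):
    """Return a new AVU list where extracted keys replace any existing ones.

    avus:      list of (attribute, value, units) tuples already on the object
    extracted: dict of attribute -> value harvested from the file
    """
    replaced = set(str(k) for k in extracted)
    kept = [avu for avu in avus if avu[0] not in replaced]
    return kept + [(str(k), str(v), "") for k, v in extracted.items()]


existing = [("EXIF Flash", "old value", ""), ("owner", "rods", "")]
updated = apply_metadata(existing, {"EXIF Flash": "new value", "EXIF ISO": 100})
```

Keys that are not in the extracted set, such as `owner` here, are left untouched.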
Custom Ingest with Metadata Extraction
import exifread
from irods.meta import iRODSMeta  # needed for the overwrite branch below
from irods_capability_automated_ingest.core import Core
from irods_capability_automated_ingest.utils import Operation

def add_exif_metadata(session, target, path):
    with open(path, 'rb') as f:
        obj = session.data_objects.get(target)
        tags = exifread.process_file(f, details=False)
        for (k, v) in tags.items():
            # skip binary and bulky tags
            if k not in ('JPEGThumbnail', 'TIFFThumbnail', 'Filename', 'EXIF MakerNote'):
                if k in obj.metadata.keys():
                    # key already exists: overwrite it with the new value
                    obj.metadata[k] = iRODSMeta(str(k), str(v))
                else:
                    obj.metadata.add(str(k), str(v), '')

class event_handler(Core):
    @staticmethod
    def to_resource(session, target, path, **options):
        return "demoResc"

    @staticmethod
    def operation(session, target, path, **options):
        return Operation.PUT

    @staticmethod
    def post_data_obj_create(hdlr_mod, logger, session, target, path, **options):
        add_exif_metadata(session, target, path)

    @staticmethod
    def post_data_obj_modify(hdlr_mod, logger, session, target, path, **options):
        add_exif_metadata(session, target, path)
Custom Ingest with Metadata Extraction
Should we want this module to register in place, we can simply change the operation returned to REGISTER_SYNC or REGISTER_AS_REPLICA_SYNC
Stage new test data
cp -r /tmp/test_dir ./src_dir/dir2
Launch the ingest job
python3 -m irods_capability_automated_ingest.irods_sync start ./src_dir /tempZone/home/rods/put_coll --event_handler irods_capability_automated_ingest.examples.put_with_resc_name_image_metadata
Custom Ingest with Metadata Extraction
Check our results
$ ils -l put_coll/dir2
/tempZone/home/rods/put_coll/dir2:
  rods 0 demoResc 2157087 2018-05-22.11:16 & img0.jpg
  rods 0 demoResc 2157087 2018-05-22.11:16 & img1.jpg
  rods 0 demoResc 2157087 2018-05-22.11:16 & img2.jpg
  rods 0 demoResc 2157087 2018-05-22.11:16 & img3.jpg
  rods 0 demoResc 2157087 2018-05-22.11:16 & img4.jpg
Custom Ingest with Metadata Extraction
Inspect our newly harvested metadata
$ imeta ls -d put_coll/dir2/img0.jpg
AVUs defined for dataObj put_coll/dir2/img0.jpg:
attribute: EXIF ApertureValue
value: 7983/3509
units:
----
attribute: EXIF BrightnessValue
value: 2632/897
units:
----
<snip>
----
attribute: Thumbnail YResolution
value: 72
units:
Any Python library can now be leveraged to extract and apply metadata. The MIME type can be detected and mapped to the appropriate metadata extractor.
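One way to map MIME types to extractors is a dispatch table. The sketch below uses only the standard-library `mimetypes` module, which guesses the type from the file name rather than the file contents. The extractor functions and the `EXTRACTORS` table are hypothetical placeholders; a real handler might call exifread for images and something else for other types.

```python
import mimetypes


def extract_image_metadata(path):
    """Placeholder: a real handler might call exifread here."""
    return {"kind": "image"}


def extract_text_metadata(path):
    """Placeholder: a real handler might count lines or detect encoding."""
    return {"kind": "text"}


# Hypothetical dispatch table from MIME type to extractor.
EXTRACTORS = {
    "image/jpeg": extract_image_metadata,
    "text/plain": extract_text_metadata,
}


def extract_metadata(path):
    """Pick an extractor from the guessed MIME type; unknown types yield {}."""
    mime_type, _encoding = mimetypes.guess_type(path)
    extractor = EXTRACTORS.get(mime_type)
    return extractor(path) if extractor else {}
```

The returned dict could then be applied to the data object from `post_data_obj_create`, as in the EXIF example.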
The Landing Zone
The Landing Zone
In this use case, data is written to disk by an instrument or another source, and we run an ingest job on that directory.
Once data is ingested, it is moved out of the way in order to improve ingest performance.
These ingested files can be removed later as a matter of local administrative policy.
The critical difference between a pure file system scan and a landing zone is that the landing zone (LZ) is not considered the single point of truth; it is a staging area for ingest, which is why files are moved out of the way.
In a file system scan, the file system is the canonical replica, and the catalog and other replicas are kept in sync with it.
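The "move out of the way" step can be sketched as a helper that keeps each file's location relative to the landing zone. `move_to_ingested` is a hypothetical name, and using `os.path.relpath` to compute the destination is a choice made for this sketch rather than something taken from the tool.

```python
import os
import tempfile


def move_to_ingested(path, landing_zone, ingested):
    """Move path from landing_zone into ingested, keeping its relative location."""
    rel = os.path.relpath(path, landing_zone)
    if rel == os.pardir or rel.startswith(os.pardir + os.sep):
        raise ValueError("path is not inside the landing zone")
    new_path = os.path.join(ingested, rel)
    os.makedirs(os.path.dirname(new_path), exist_ok=True)
    # os.rename is only a cheap move when both directories share a file system.
    os.rename(path, new_path)
    return new_path


def demo():
    """Stage one file in a temporary landing zone and move it."""
    with tempfile.TemporaryDirectory() as base:
        lz = os.path.join(base, "landing_zone")
        done = os.path.join(base, "ingested")
        os.makedirs(os.path.join(lz, "dir0"))
        src = os.path.join(lz, "dir0", "img0.jpg")
        with open(src, "wb") as f:
            f.write(b"jpeg bytes")
        moved = move_to_ingested(src, lz, done)
        return os.path.relpath(moved, done), os.path.exists(src), os.path.exists(moved)
```

Because the file leaves the landing zone, later scans of that directory have less to enumerate.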
The Landing Zone
Create a new ingest collection
imkdir lz_coll
Create the landing zone and ingested directories
mkdir /tmp/landing_zone
mkdir /tmp/ingested
Stage data for ingest
cp -r /tmp/test_dir /tmp/landing_zone/dir0
Preparing a Landing Zone
With the editor of your choice, create and edit:
irods_capability_automated_ingest/examples/lz_put_with_resc_name_image_metadata.py
The Landing Zone
import exifread
import os
from irods.meta import iRODSMeta  # needed for the overwrite branch below
from irods_capability_automated_ingest.core import Core
from irods_capability_automated_ingest.utils import Operation

def add_exif_metadata(session, target, path):
    with open(path, 'rb') as f:
        obj = session.data_objects.get(target)
        try:
            tags = exifread.process_file(f, details=False)
            for (k, v) in tags.items():
                if k not in ('JPEGThumbnail', 'TIFFThumbnail', 'Filename', 'EXIF MakerNote'):
                    if k in obj.metadata.keys():
                        obj.metadata[k] = iRODSMeta(str(k), str(v))
                    else:
                        obj.metadata.add(str(k), str(v), '')
        except Exception:
            # files without readable EXIF data are ingested without metadata
            pass

class event_handler(Core):
    @staticmethod
    def to_resource(session, target, path, **options):
        return "demoResc"

    @staticmethod
    def operation(session, target, path, **options):
        return Operation.PUT

    @staticmethod
    def post_data_obj_create(hdlr_mod, logger, session, target, path, **options):
        add_exif_metadata(session, target, path)
        # move the ingested file out of the landing zone
        new_path = path.replace('/tmp/landing_zone', '/tmp/ingested')
        try:
            dir_name = os.path.dirname(new_path)
            os.makedirs(dir_name, exist_ok=True)
            os.rename(path, new_path)
        except OSError:
            logger.info('FAILED to move [' + path + '] to [' + new_path + ']')
The Landing Zone
Launch the ingest job
$ python3 -m irods_capability_automated_ingest.irods_sync start /tmp/landing_zone /tempZone/home/rods/lz_coll --event_handler irods_capability_automated_ingest.examples.lz_put_with_resc_name_image_metadata
Check the ingested directory for the files
$ ls -l /tmp/ingested/dir0/
total 10540
-rw-rw-r-- 1 irods irods 2157087 May 23 05:24 img0.jpg
-rw-rw-r-- 1 irods irods 2157087 May 23 05:24 img1.jpg
-rw-rw-r-- 1 irods irods 2157087 May 23 05:24 img2.jpg
-rw-rw-r-- 1 irods irods 2157087 May 23 05:24 img3.jpg
-rw-rw-r-- 1 irods irods 2157087 May 23 05:24 img4.jpg
The Landing Zone
Check the ingest collection lz_coll
$ ils -l lz_coll/dir0
/tempZone/home/rods/lz_coll/dir0:
  rods 0 demoResc 2157087 2018-05-22.11:16 & img0.jpg
  rods 0 demoResc 2157087 2018-05-22.11:16 & img1.jpg
  rods 0 demoResc 2157087 2018-05-22.11:16 & img2.jpg
  rods 0 demoResc 2157087 2018-05-22.11:16 & img3.jpg
  rods 0 demoResc 2157087 2018-05-22.11:16 & img4.jpg
The Landing Zone
Check the ingested metadata
$ imeta ls -d lz_coll/dir0/img0.jpg
AVUs defined for dataObj lz_coll/dir0/img0.jpg:
attribute: EXIF ApertureValue
value: 7983/3509
units:
----
attribute: EXIF BrightnessValue
value: 2632/897
units:
----
<snip>
----
attribute: Thumbnail YResolution
value: 72
units:
Streaming Ingest
For this use case some instruments periodically append output to a file or set of files.
We will configure an ingest job for a file to which a process will periodically append some data.
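The idea behind an append-only transfer can be sketched with two local files. This is a simplified illustration under the assumption that the source only ever grows: `copy_appended` is a hypothetical helper that remembers how many bytes were already copied and transfers only what lies beyond that offset. It does not use the iRODS API.

```python
import os
import tempfile


def copy_appended(source, target, offset):
    """Copy the bytes of source past offset onto the end of target.

    Returns the new offset, i.e. how much of source has now been copied.
    """
    with open(source, "rb") as src:
        src.seek(offset)
        chunk = src.read()
    with open(target, "ab") as dst:
        dst.write(chunk)
    return offset + len(chunk)


def demo():
    """Append to a source file three times, syncing after each append."""
    with tempfile.TemporaryDirectory() as base:
        source = os.path.join(base, "science.txt")
        target = os.path.join(base, "replica.txt")
        offset = 0
        offsets = []
        for _ in range(3):
            with open(source, "ab") as f:
                f.write(b"do more science!!!!\n")
            offset = copy_appended(source, target, offset)
            offsets.append(offset)
        with open(source, "rb") as s, open(target, "rb") as t:
            return offsets, s.read() == t.read()
```

Each pass moves only the newly written bytes, so the cost of a sync depends on how much was appended rather than on the total file size.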
Create append_data.sh (make sure it is executable)
#!/bin/bash
file=$1
touch "$file"
for i in {1..1000}; do
    echo "do more science!!!!\n" >> "$file"
    sleep 2
done
Streaming Ingest
Modify irods_capability_automated_ingest/examples/append_with_resc_name.py
from irods_capability_automated_ingest.core import Core
from irods_capability_automated_ingest.utils import Operation

class event_handler(Core):
    @staticmethod
    def to_resource(session, target, path, **options):
        return "regiResc2a"

    @staticmethod
    def operation(session, target, path, **options):
        return Operation.PUT_APPEND
"regiResc2a" should become "demoResc"
Streaming Ingest
Prepare the source directory
$ rm -r ./src_dir/*
In another terminal start the data creation process
$ ./append_data.sh ./src_dir/science.txt
Create a new ingest collection
$ imkdir stream_coll
Streaming Ingest
Start the ingest process with a restart interval of 1s
$ python3 -m irods_capability_automated_ingest.irods_sync start ./src_dir /tempZone/home/rods/stream_coll --event_handler irods_capability_automated_ingest.examples.append_with_resc_name -i 1
Check our results
$ ils -l stream_coll/science.txt
  rods 0 demoResc 374 2018-05-23.10:13 & science.txt
$ ils -l stream_coll/science.txt
  rods 0 demoResc 418 2018-05-23.10:13 & science.txt
$ ils -l stream_coll/science.txt
  rods 0 demoResc 440 2018-05-23.10:13 & science.txt
<SNIP>
Streaming Ingest
Managing a periodic ingest job
$ python3 -m irods_capability_automated_ingest.irods_sync list
63099db0-5e93-11e8-b142-080027e8658d
Halting a periodic ingest job
$ python3 -m irods_capability_automated_ingest.irods_sync stop 63099db0-5e93-11e8-b142-080027e8658d
File System Scanning
File System Scanning
This implementation periodically scans a source directory and either registers new data in place or updates the system metadata for changed files.
In this use case, the file system is considered the single point of truth for ingest. Changes are detected during the scan, and the system metadata is updated within the catalog.
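Change detection during a scan can be sketched with a cache of file signatures. Here a plain dict stands in for the Redis cache mentioned in the architecture overview, and the signature is `(mtime, size)`; both the `scan` function and the choice of signature are assumptions made for illustration, not the tool's actual implementation.

```python
import os
import tempfile


def scan(root, cache):
    """Walk root and classify files as new or changed against the cache.

    cache maps path -> (mtime_ns, size) and is refreshed in place.
    """
    new, changed = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            signature = (st.st_mtime_ns, st.st_size)
            if path not in cache:
                new.append(path)        # would be registered in the catalog
            elif cache[path] != signature:
                changed.append(path)    # would have its catalog entry updated
            cache[path] = signature
    return new, changed


def demo():
    """Scan three times: fresh file, no change, then after an append."""
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "science.txt")
        with open(path, "w") as f:
            f.write("first\n")
        cache = {}
        first = scan(root, cache)
        second = scan(root, cache)
        with open(path, "a") as f:
            f.write("second\n")
        third = scan(root, cache)
        return [len(x) for x in first], [len(x) for x in second], [len(x) for x in third]
```

Unchanged files produce no work at all, which matches the point that the iRODS API is invoked only when the catalog needs updating.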
File System Scanning
Clean up the source directory and stage data
$ rm -r ./src_dir/*
$ cp -r /tmp/test_dir ./src_dir/dir0
Create a new target collection
$ imkdir scan_coll
Launch the scanner with a period of 1 second
$ python3 -m irods_capability_automated_ingest.irods_sync start ./src_dir /tempZone/home/rods/scan_coll --event_handler irods_capability_automated_ingest.examples.register -i 1
File System Scanning
Investigate our results
$ ils -l scan_coll/dir0
/tempZone/home/rods/scan_coll/dir0:
  rods 0 demoResc 2157087 2018-05-24.07:28 & img0.jpg
  rods 0 demoResc 2157087 2018-05-24.07:28 & img1.jpg
  rods 0 demoResc 2157087 2018-05-24.07:28 & img2.jpg
  rods 0 demoResc 2157087 2018-05-24.07:28 & img3.jpg
  rods 0 demoResc 2157087 2018-05-24.07:28 & img4.jpg
Stage new data and investigate data registration
$ cp -r /tmp/test_dir ./src_dir/dir1
$ ils -l scan_coll/
/tempZone/home/rods/scan_coll:
C- /tempZone/home/rods/scan_coll/dir0
C- /tempZone/home/rods/scan_coll/dir1
Wait for it...
File System Scanning
Investigate our results
$ ils -l scan_coll/dir1
/tempZone/home/rods/scan_coll/dir1:
  rods 0 demoResc 2157087 2018-05-24.07:28 & img0.jpg
  rods 0 demoResc 2157087 2018-05-24.07:28 & img1.jpg
  rods 0 demoResc 2157087 2018-05-24.07:28 & img2.jpg
  rods 0 demoResc 2157087 2018-05-24.07:28 & img3.jpg
  rods 0 demoResc 2157087 2018-05-24.07:28 & img4.jpg
Launch the append script to test streaming writes (if it is not already running, in another terminal)
$ ./append_data.sh ./src_dir/science.txt
File System Scanning
Investigate our results
$ ils -l scan_coll
/tempZone/home/rods/scan_coll:
  rods 0 demoResc 154 2018-05-24.07:45 & science.txt
  C- /tempZone/home/rods/scan_coll/dir0
  C- /tempZone/home/rods/scan_coll/dir1
$ ils -l scan_coll/science.txt
  rods 0 demoResc 264 2018-05-24.07:45 & science.txt
$ ils -l scan_coll/science.txt
  rods 0 demoResc 308 2018-05-24.07:45 & science.txt
$ ils -l scan_coll/science.txt
  rods 0 demoResc 308 2018-05-24.07:45 & science.txt
File System Scanning
Managing a periodic ingest job
$ python3 -m irods_capability_automated_ingest.irods_sync list
b0f2d6fe-5f47-11e8-b142-080027e8658d
Halting a periodic ingest job
$ python3 -m irods_capability_automated_ingest.irods_sync stop b0f2d6fe-5f47-11e8-b142-080027e8658d
Questions?