Taking Compute to Data
June 5-7, 2018
iRODS User Group Meeting 2018
Durham, NC
Daniel Moore
Applications Engineer, iRODS Consortium
The Compute to Data Use Case
Data is assumed to already be routed to an appropriate storage resource
(this presentation under construction)
Goals
- Develop generic interface concept for compute
- Develop a metadata driven interface for labeling resources which provide computational capabilities
- ultimately relies upon convention
- Separate configuration from implementation
- isolate deployment specific concepts
- Consider a rule base as an extension of iRODS
- rules are not just data management policy
"Compute To Data" Pattern - Salient Features
- Implemented as an iRODS rulebase following the Template Method pattern
- Assume data already at rest in the appropriate resource for compute
- Launch compute container (thumbnail_image) to process input data - container is legacy docker launched via Singularity & SLURM
- Post compute container (metadata_addtags) launched from SLURM epilog:
- Registers products in the resultant directory into iRODS
- Applies any metadata contained in resultant mdmanifest.json
Components of the System
System Component | Implementation
Container Technology | Singularity
User Provided Compute | docker://repo_name/thumbnail_image
Post Compute Container | shub://repo_name/metadata_addtags
Job Endpoint | iRODS Rule Base (user extension of the iRODS API)
Getting Started
If necessary:
Install irods_training repository:
git clone https://github.com/irods/irods_training
sudo apt-get -y install irods-externals-* irods-dev
export PATH=/opt/irods-externals/cmake3.5.2-0/bin/:$PATH
As prerequisite, install the Data to Compute ...
mkdir ~/build_data_to_compute
cd ~/build_data_to_compute
cmake ../irods_training/advanced/hpc_data_to_compute/
make package
sudo dpkg -i ./irods-hpc-data-to-compute-example_4.2.3~xenial_amd64.deb
and Install Munge and SLURM:
~/irods_training/advanced/hpc_data_to_compute/ubuntu_16/install_munge_and_slurm.sh
Getting Started
Download and build Singularity:
~/irods_training/advanced/hpc_compute_to_data/ubuntu_14_16/install_singularity.sh
Install package for the compute-to-data example
mkdir build_compute_to_data
cd build_compute_to_data
cmake ../irods_training/advanced/hpc_compute_to_data
make package
sudo dpkg -i ./irods-hpc-compute-to-data-example_4.2.3~xenial_amd64.deb
Getting Started - Python extensions to iRODS
Make sure the Python rule engine plugin is installed.
sudo apt-get -y install irods-rule-engine-plugin-python
Also, install python's pip package:
sudo apt-get -y install python-pip
... and the python irods-client module:
sudo pip install python-irodsclient
Configure the rule engine plugin
As service account user irods edit the server config:
sudo su - irods
nano /etc/irods/server_config.json
"rule_engines": [
    ...
    "re_rulebase_set": [ "route_data", "compute_to_data", "data_to_compute", "core" ],
    ...
]
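The same edit can be scripted instead of made by hand in an editor. The following is a minimal sketch, not part of the training package: `add_rulebases` is a hypothetical helper name, and it assumes that `re_rulebase_set` is a list of strings stored under `plugin_specific_configuration` in the native rule engine stanza. The `sample` dictionary stands in for the parsed contents of /etc/irods/server_config.json.

```python
import json

NATIVE_PLUGIN = "irods_rule_engine_plugin-irods_rule_language"


def add_rulebases(config, names):
    """Prepend rule base names to the native rule engine's re_rulebase_set.

    Names that are already listed stay where they are, so "core" remains last.
    Assumption: the list lives under plugin_specific_configuration.
    """
    for engine in config.get("rule_engines", []):
        if engine.get("plugin_name") != NATIVE_PLUGIN:
            continue
        psc = engine.setdefault("plugin_specific_configuration", {})
        current = psc.get("re_rulebase_set", [])
        missing = [n for n in names if n not in current]
        psc["re_rulebase_set"] = missing + current
    return config


# Stand-in for json.load(open("/etc/irods/server_config.json"))
sample = {
    "rule_engines": [
        {
            "instance_name": "irods_rule_engine_plugin-irods_rule_language-instance",
            "plugin_name": NATIVE_PLUGIN,
            "plugin_specific_configuration": {"re_rulebase_set": ["core"]},
        }
    ]
}

updated = add_rulebases(sample, ["route_data", "compute_to_data", "data_to_compute"])
print(json.dumps(updated["rule_engines"][0]["plugin_specific_configuration"], indent=4))
```

Running the helper a second time leaves the list unchanged, which makes it safe to call from a provisioning script.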
Further Setup and Configuration
Place Python Rule Engine stanza after native RE stanza:
sudo nano /etc/irods/server_config.json
"rule_engines": [
    {
        "instance_name": "irods_rule_engine_plugin-irods_rule_language-instance",
        "plugin_name": "irods_rule_engine_plugin-irods_rule_language",
        ...
        "shared_memory_instance": "irods_rule_language_rule_engine"
    },
    {
        "instance_name": "irods_rule_engine_plugin-python-instance",
        "plugin_name": "irods_rule_engine_plugin-python",
        "plugin_specific_configuration": {}
    },
    . . .
Move python rule-set into place:
cp /etc/irods/core.py.compute_to_data /etc/irods/core.py
Continued... Data-to-Compute Set-up / Configuration
This demonstration will be run as rodsuser 'alice'
iadmin mkuser alice rodsuser
iadmin moduser alice password alicepass
Configure the Tagged Resources - if necessary
As the irods service account
Make two unix file system resources
iadmin mkresc lts_resc unixfilesystem `hostname`:/tmp/irods/lts_resc
iadmin mkresc img_resc unixfilesystem `hostname`:/tmp/irods/img_resc
Annotate them with appropriate metadata given their roles
- defined in the configuration as part of the contract
imeta add -R lts_resc COMPUTE_RESOURCE_ROLE LONG_TERM_STORAGE
imeta add -R img_resc COMPUTE_RESOURCE_ROLE IMAGE_PROCESSING
As the ubuntu training account
Stage data for thumbnail creation
mkdir /tmp/irods
sudo chown irods:irods /tmp/irods
cp ~/irods_training/stickers.jpg /tmp
Finally ...
Remember to log in as 'alice' in the ubuntu training account:
ubuntu$ iinit
ERROR: environment_properties::capture: missing environment file. should be at [/home/ubuntu/.irods/irods_environment.json]
One or more fields in your iRODS environment file (irods_environment.json) are missing; please enter them.
Enter the host name (DNS) of the server to connect to: localhost
Enter the port number: 1247
Enter your irods user name: alice
Enter your irods zone: tempZone
Those values will be added to your environment file (for use by other iCommands) if the login succeeds.
Enter your current iRODS password:
The configuration interface
Define interfaces for any necessary conventions
- Metadata attributes and values
- Metadata values for implemented roles
Single Point of Truth - Template Method Pattern
- execute defined preconditions
- run user's requested container
- execute "metadata_addtags" service
- product capture and metadata application
Users may utilize metadata conventions within a rule to provide inputs to the generalized container service.
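The Template Method steps listed above can be sketched outside of iRODS. This is an illustrative Python model only, not the rule base itself: the class names, the `job` dictionary keys, and the paths are hypothetical, and the containers are replaced by log entries so the fixed ordering of the steps is visible.

```python
class ComputeTemplate:
    """Fixed skeleton: preconditions, user container, post-compute service."""

    def run(self, job):
        log = []
        self.check_preconditions(job, log)
        self.run_user_container(job, log)
        self.run_post_compute(job, log)
        return log

    def check_preconditions(self, job, log):
        # The pattern assumes the data is already at rest on the compute resource
        if not job.get("src_phy_path"):
            raise ValueError("input data must already be at rest on the compute resource")
        log.append("preconditions ok")

    def run_user_container(self, job, log):
        # The only step a concrete service has to supply
        raise NotImplementedError

    def run_post_compute(self, job, log):
        # Stand-in for metadata_addtags: product capture and metadata application
        log.append("metadata_addtags on " + job["dst_dir"])


class ThumbnailCompute(ComputeTemplate):
    def run_user_container(self, job, log):
        # Stand-in for launching the thumbnail_image container
        log.append("thumbnail_image " + job["size"] + " " + job["src_phy_path"])


steps = ThumbnailCompute().run(
    {
        "src_phy_path": "/tmp/irods/img_resc/home/alice/stickers.jpg",  # hypothetical
        "dst_dir": "/tmp/stickers-out",                                 # hypothetical
        "size": "128x128",
    }
)
print(steps)
```

A new compute service would override only `run_user_container`; the precondition check and the post-compute step stay in one place.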
The Image Compute container
FROM python:3
# Define environment variable
ENV SOURCE_IMAGE default_source_image
ENV DESTINATION_IMAGE default_destination_image
ENV SIZE default_size
ENV DESTINATION_COLLECTION default_size
WORKDIR /
ADD make_thumbnail.py /
RUN apt-get install -y libjpeg-dev && pip install pillow
# Run make_thumbnail.py when the container launches
CMD ["sh", "-c", "python ./make_thumbnail.py ${SIZE}x${SIZE} /src/${SOURCE_IMAGE} /dst/${DESTINATION_IMAGE} ${DESTINATION_COLLECTION}"]
The Python thumbnail utility
import os, sys
import json
import datetime
from PIL import Image

# capture incoming parameters
size_str = sys.argv[1]
src_name = sys.argv[2]
dst_name = sys.argv[3]
dst_coll = sys.argv[4]

# build size array
size_vals = size_str.split('x')
size = int(size_vals[0]), int(size_vals[1])

# generate thumbnail
try:
    im = Image.open(src_name)
    im.thumbnail(size)
    im.save(dst_name, im.format)
except IOError:
    print("cannot create thumbnail for: " + src_name)
    sys.exit()
1-2 : Build the Thumbnail
The Python thumbnail utility
# create metadata and build json representation for moarlock
avu_list = []

avu = {}
avu['attribute'] = 'source_image'
avu['value'] = os.path.basename(src_name)
avu['unit'] = ''
avu['irodsPath'] = os.path.basename(dst_name)
avu['action'] = 'ADD'
avu_list.append(avu)

avu = {}
avu['attribute'] = 'time_stamp'
avu['value'] = '{:%Y-%m-%d %H:%M:%S}'.format(datetime.datetime.now())
avu['unit'] = ''
avu['irodsPath'] = os.path.basename(dst_name)
avu['action'] = 'ADD'
avu_list.append(avu)

md_manifest = {}
md_manifest['operation'] = avu_list
md_manifest['failureMode'] = 'FAIL_FAST'
md_manifest['parentIrodsTargetPath'] = dst_coll

# write out manifest for moarlock
with open('/dst/mdmanifest.json', 'w') as outfile:
    json.dump(md_manifest, outfile, sort_keys=True, indent=4, separators=(',', ': '))
2-2 : Create the mdmanifest.json file for Moarlock
Example mdmanifest.json
{
    "operation":[
        {
            "attribute":"atr1",
            "value":"val1",
            "unit":"",
            "irodsPath":"stickers.jpg",
            "action":"ADD"
        }
    ],
    "failureMode":"FAIL_FAST",
    "parentIrodsTargetPath":"/tempZone/home/rods/"
}
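To make the manifest layout concrete, here is a small Python sketch that reads a manifest like the example and lists the AVU operations it describes. It is not the code of the post-compute container: `planned_avu_operations` is a hypothetical name, and the sketch assumes that each `irodsPath` is interpreted relative to `parentIrodsTargetPath`.

```python
import json
import posixpath


def planned_avu_operations(manifest_text):
    """Return (action, logical_path, attribute, value, unit) for each manifest entry.

    Assumption: irodsPath is relative to parentIrodsTargetPath.
    """
    manifest = json.loads(manifest_text)
    parent = manifest["parentIrodsTargetPath"]
    operations = []
    for entry in manifest["operation"]:
        target = posixpath.join(parent, entry["irodsPath"])
        operations.append(
            (entry["action"], target, entry["attribute"], entry["value"], entry["unit"])
        )
    return operations


example = """
{
    "operation": [
        {"attribute": "atr1", "value": "val1", "unit": "",
         "irodsPath": "stickers.jpg", "action": "ADD"}
    ],
    "failureMode": "FAIL_FAST",
    "parentIrodsTargetPath": "/tempZone/home/rods/"
}
"""

for operation in planned_avu_operations(example):
    print(operation)
```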
The configuration interface
For the thumbnail service we will need to:
- Get the metadata attribute string that holds the role
- Get the tag for an Image Compute resource
- Get the logical collection name for thumbnails
- Get the name of a thumbnail
- Get a list of desired thumbnail sizes
Other Compute->Data components
metadata_addtags
The configuration interface
Utilize the interface for our chosen metadata convention
get_compute_resource_role_attribute(*t) {
    *t = "COMPUTE_RESOURCE_ROLE"
}
get_image_compute_type(*t) {
    *t = "IMAGE_PROCESSING"
}
The configuration interface
get_thumbnail_collection_name(*col_name, *obj_name, *thumb_coll_name) {
    *fn = trimr(*obj_name, ".")
    *thumb_coll_name = *col_name ++ "/" ++ *fn ++ "_thumbnails"
}
get_thumbnail_name(*file_name, *size, *thumb_name) {
    # trim the extension
    *fn = trimr(*file_name, ".")
    *ext = substr(*file_name, strlen(*fn)+1, strlen(*file_name))
    *thumb_name = *fn ++ "_thumbnail_" ++ *size ++ "." ++ *ext
}
get_thumbnail_sizes(*size_list) {
    *size_list = list( "128x128", "256x256", "512x512", "1024x1024" )
}
Utilize the interface for naming conventions
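The naming rules above can be traced with a Python equivalent. This is a sketch for checking the convention by hand, not part of the rule base. The local `trimr` is an approximation written on the assumption that the rule-language `trimr` drops the last occurrence of the delimiter and everything after it.

```python
def trimr(text, delim):
    """Approximate the rule-language trimr: cut at the last delim, if present."""
    head, sep, _ = text.rpartition(delim)
    return head if sep else text


def get_thumbnail_collection_name(col_name, obj_name):
    return col_name + "/" + trimr(obj_name, ".") + "_thumbnails"


def get_thumbnail_name(file_name, size):
    fn = trimr(file_name, ".")        # name without the extension
    ext = file_name[len(fn) + 1:]     # text after the last "."
    return fn + "_thumbnail_" + size + "." + ext


def get_thumbnail_sizes():
    return ["128x128", "256x256", "512x512", "1024x1024"]


print(get_thumbnail_collection_name("/tempZone/home/alice", "stickers.jpg"))
for size in get_thumbnail_sizes():
    print(get_thumbnail_name("stickers.jpg", size))
```

Under that assumption, `stickers.jpg` at size `128x128` yields `stickers_thumbnail_128x128.jpg` inside the collection `/tempZone/home/alice/stickers_thumbnails`.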
Thumbnail Service - helper functions
Leverage helper functions from Data to Compute
split_path(*p, *tok, *col, *obj)
get_resource_host_by_id(*resc_id, *resc_host)
get_resc_id_for_data_object_resident_on_image_node(*obj_name, *col_name, *compute_resc_role_attr, *image_compute_type, *src_resc_id)
get_phy_path_for_object_on_resc_id(*obj_name, *resc_id, *phy_path)
Additional helper functions for Compute to Data
object_is_image_type(*_f, *_flag)
determine_destination_resource(*_obj_path)
Routing the Data - helper function implementation
object_is_image_type(*_f, *_flag) {
    *_flag = false;
    if (*_f like "*.jpg" || *_f like "*.jpeg" || *_f like "*.bmp" || *_f like "*.tif" || *_f like "*.tiff" || *_f like "*.rif" || *_f like "*.gif" || *_f like "*.png" || *_f like "*.svg" || *_f like "*.xpm") {
        *_flag = true;
    }
}
Use file extension to determine image type
Other options:
- use a Tika service for mime-type
- the Cyverse infotyper service
- custom microservice plugin
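For comparison, the extension test can be written in a few lines of Python. This is a sketch only, and it deliberately differs from the rule in one respect, which is an assumption about what is desirable rather than something the rule does: the comparison is case-insensitive, so `PHOTO.PNG` is also treated as an image.

```python
import os

# Same extensions as the rule-language helper
IMAGE_EXTENSIONS = {
    ".jpg", ".jpeg", ".bmp", ".tif", ".tiff",
    ".rif", ".gif", ".png", ".svg", ".xpm",
}


def object_is_image_type(path):
    """Return True when the path ends in one of the known image extensions."""
    _, ext = os.path.splitext(path)
    return ext.lower() in IMAGE_EXTENSIONS


for candidate in ["/tempZone/home/alice/stickers.jpg", "PHOTO.PNG", "notes.txt"]:
    print(candidate, object_is_image_type(candidate))
```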
Routing the Data - match object type to resource
determine_destination_resource(*_obj_path) {
    *comp_attr = "NULL"
    get_compute_resource_role_attribute(*comp_attr);
    *image_flag = false;
    object_is_image_type(*_obj_path, *image_flag)
    *resc_name = "lts_resc" # discover LTS resc
    if(true == *image_flag) {
        *image_type = "NULL"
        get_image_compute_type(*image_type)
        get_resource_name_by_role(*resc_name, *comp_attr, *image_type)
    }
    msiSetDefaultResc(*resc_name,"forced");
}
Find the image processing resource and route the ingested data accordingly
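The routing decision can be modeled in plain Python to see how the metadata convention drives it. This is a sketch under stated assumptions: the `RESOURCE_METADATA` dictionary is a hypothetical stand-in for the AVUs attached to the resources with `imeta`, and the real rule queries the catalog and calls `msiSetDefaultResc` instead of returning a name.

```python
IMAGE_SUFFIXES = (
    ".jpg", ".jpeg", ".bmp", ".tif", ".tiff",
    ".rif", ".gif", ".png", ".svg", ".xpm",
)

# Hypothetical in-memory copy of the resource metadata
RESOURCE_METADATA = {
    "lts_resc": {"COMPUTE_RESOURCE_ROLE": "LONG_TERM_STORAGE"},
    "img_resc": {"COMPUTE_RESOURCE_ROLE": "IMAGE_PROCESSING"},
}


def get_resource_name_by_role(attribute, value, resources=RESOURCE_METADATA):
    """Return the first resource tagged with attribute=value, or None."""
    for name, avus in resources.items():
        if avus.get(attribute) == value:
            return name
    return None


def determine_destination_resource(obj_path, resources=RESOURCE_METADATA):
    """Images go to the image processing resource, everything else to lts_resc."""
    resc_name = "lts_resc"
    if obj_path.endswith(IMAGE_SUFFIXES):
        found = get_resource_name_by_role(
            "COMPUTE_RESOURCE_ROLE", "IMAGE_PROCESSING", resources
        )
        if found is not None:
            resc_name = found
    return resc_name


print(determine_destination_resource("/tempZone/home/alice/stickers.jpg"))
print(determine_destination_resource("/tempZone/home/alice/report.csv"))
```

Because the role lookup goes through metadata, renaming or adding an image processing resource changes only the tags, not the routing code.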
Compute to Data Interface
Ingested data must route to appropriate resource given some criteria
acSetRescSchemeForCreate()
Provide a template for containerized compute and an endpoint for creating thumbnails
launch_compute_container(*host_name, *port_str, *guid_str, *src_phy_path, *dst_log_path, *container_name, *user_docker_options)
launch_thumbnail_compute(*src_obj_path)
Routing the Data
acSetRescSchemeForCreate {
determine_destination_resource($objPath)
}
Override the static policy enforcement point for data object creation
Call our helper function, passing the session variable which holds the logical path of the data object
Compute to Data Template Method
launch_compute_container(
    *host_name,
    *port_str,
    *guid_str,
    *src_phy_path,
    *dst_log_path,
    *container_name,
    *user_docker_options) {
    remote(*host_name, "null") {
        # possible future pre-processing here
        # build the full docker option string
        *cmd_opt = *user_docker_options ++ " " ++ *container_name ++ "\""
        # call the users provided container
        msiExecCmd("docker_run.sh", *cmd_opt, "null", "null", "null", *std_out_err)
1-2 : Launch the user defined container
Note - add a delay() directive for asynchronous behavior
Compute to Data Thumbnail Rule
launch_thumbnail_compute( *src_obj_path ) {
    # TODO - ensure image is on image compute resource
    split_path(*src_obj_path, "/", *col_name, *obj_name)
    *thumb_coll_name = "NULL"
    get_thumbnail_collection_name(*col_name, *obj_name, *thumb_coll_name);
    if("NULL" == *thumb_coll_name) {
        failmsg(-1,"get_thumbnail_collection_name failed")
    }
    *guid_str = "NULL"
    msiget_uuid(*guid_str)
    *dst_dir_name = "/tmp/" ++ *obj_name ++ "-" ++ *guid_str
1-4 : capture target collection and UUID
Compute to Data Template Method
        # build option string for Moarlock
        split_path(*dst_log_path, "/", *dst_col_name, *dst_obj_name)
        *moar_opts = "\" -v " ++ *src_phy_path ++ ":/var/input -e host=" ++ *host_name ++ " -e zone=tempZone -e port=" ++ *port_str ++ " -e user=rods -e passwd=rods -e irodsout=" ++ *dst_col_name ++ " -e guid=" ++ *guid_str ++ " diceunc/moarlock:1.0\""
        # post-processing with Moarlock
        msiExecCmd("docker_run.sh", *moar_opts, "null", "null", "null", *std_out_err)
    } # remote
} # launch_compute_container
2-2 : Build and launch the Moarlock container
Note - iRODS parameters are currently hardcoded - utilize tickets in future iterations
Compute to Data Thumbnail Rule
    # capture configuration parameters
    *image_compute_type = "NULL"
    get_image_compute_type(*image_compute_type)
    if("NULL" == *image_compute_type) {
        failmsg(-1,"get_image_compute_type failed")
    }
    *compute_resc_role_attr = "NULL"
    get_compute_resource_role_attribute(*compute_resc_role_attr)
    if("NULL" == *compute_resc_role_attr) {
        failmsg(-1,"get_compute_resource_role_attribute failed")
    }
    *src_resc_id = "NULL"
    get_resc_id_for_data_object_reside_on_image_node(
        *obj_name,
        *col_name,
        *compute_resc_role_attr,
        *image_compute_type,
        *src_resc_id)
2-4 : capture metadata parameters and resource id
Compute to Data Thumbnail Rule
    *src_phy_path = "NULL"
    get_phy_path_for_object_on_resc_id(*obj_name, *src_resc_id, *src_phy_path)
    if("NULL" == *src_phy_path) {
        failmsg(-1,"get_phy_path_for_object_on_resc_id failed for <snip>")
    }
    split_path(*src_phy_path, "/", *src_dir_name, *src_file_name)
    *server_host = "NULL"
    get_resource_host_by_id(*src_resc_id, *server_host);
    if("NULL" == *server_host) {
        failmsg(-1,"get_resource_host_by_id failed for [*src_resc_id]")
    }
    get_thumbnail_sizes(*thumb_sizes)
3-4 : get source physical path, host name and sizes
Compute to Data Thumbnail Rule
4-4 : build docker option string and call container template
    foreach( *sz in *thumb_sizes ) {
        get_thumbnail_name(*obj_name, *sz, *thumbnail_name);
        *dst_obj_path = *thumb_coll_name ++ "/" ++ *thumbnail_name
        *sz_str = str(*sz)
        *docker_options = "\" -v " ++ *src_dir_name ++ ":/src -v " ++ *dst_dir_name
        *docker_options = *docker_options ++ ":/dst -e SIZE=" ++ *sz_str
        *docker_options = *docker_options ++ " -e SOURCE_IMAGE=" ++ *src_file_name
        *docker_options = *docker_options ++ " -e DESTINATION_IMAGE=" ++ *thumbnail_name
        *docker_options = *docker_options ++ " -e DESTINATION_COLLECTION=" ++ *thumb_coll_name ++ " "
        launch_compute_container(
            *server_host,
            "1247",
            *guid_str,
            *dst_dir_name,
            *dst_obj_path,
            "thumbnail",
            *docker_options)
    } # for
} # launch_thumbnail_compute
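The option string above is assembled by hand, quoting included. As a point of comparison, and not as the approach the rule base takes, the same options can be built in Python as an argument list and quoted in one step with `shlex.join` (Python 3.8 or later). The function name and the example paths are hypothetical.

```python
import shlex


def thumbnail_docker_args(src_dir, dst_dir, size, src_file, thumb_name, thumb_coll,
                          image="thumbnail"):
    """Return the container options as a list, one element per token."""
    return [
        "-v", src_dir + ":/src",
        "-v", dst_dir + ":/dst",
        "-e", "SIZE=" + size,
        "-e", "SOURCE_IMAGE=" + src_file,
        "-e", "DESTINATION_IMAGE=" + thumb_name,
        "-e", "DESTINATION_COLLECTION=" + thumb_coll,
        image,
    ]


args = thumbnail_docker_args(
    "/tmp/irods/img_resc/home/alice",   # hypothetical source directory
    "/tmp/stickers.jpg-1234",           # hypothetical per-job output directory
    "128x128",
    "stickers.jpg",
    "stickers_thumbnail_128x128.jpg",
    "/tempZone/home/alice/stickers_thumbnails",
)
command_line = shlex.join(args)  # quotes each token only where needed
print(command_line)
```

Keeping the options as a list until the last moment means a file name containing a space is still passed as a single token.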
Compute to Data - Thumbnail Invocation Rule
launch_compute {
    *logical_path="/tempZone/home/rods/stickers.jpg"
    launch_thumbnail_compute(*logical_path)
}
INPUT null
OUTPUT ruleExecOut
launch_compute.r
Compute to Data - Thumbnail Testing
Delete comment-out characters (initial "#") in compute-to-data section of the epilog script:
sudo nano /var/lib/irods/compute/slurm_epilog
## BEGIN_SECTION -- compute to data --
post_processing_options=$(get_irods_slurm_var "post_processing")
if [ $? -eq 0 ] ; then
    CMD="/var/lib/irods/msiExecCmd_bin/wrap_singularity exec metadata_addtags"
    CMD+=" $post_processing_options "
    STATUS="" ; $CMD ; STATUS="$?"
    { echo "$(date) - status=($STATUS) after post-processing with:"
      echo " '$CMD'" ; } >>/tmp/epilog
fi
Now we're ready to test....
Compute to Data - Thumbnail Testing
ubuntu $ iput /tmp/stickers.jpg
danielm@daniel-Ub16:~ $ ils -l
/tempZone/home/alice:
  alice 0 img_resc 2157087 2018-06-21.07:36 & stickers.jpg
ubuntu $ irule -F ~irods/spawn_remote_containers.r
ubuntu $ ils -l
/tempZone/home/alice:
  alice 0 img_resc 2157087 2018-07-01.10:08 & stickers.jpg
  C- /tempZone/home/alice/stickers_thumbnails
ubuntu $ ils -l stickers_thumbnails
/tempZone/home/alice/stickers_thumbnails:
  alice 1 lts_resc 77352 2018-07-01.10:17 & stickers_1024x1024.jpg
  alice 1 lts_resc 3273 2018-07-01.10:17 & stickers_128x128.jpg
  alice 1 lts_resc 8413 2018-07-01.10:17 & stickers_256x256.jpg
  alice 1 lts_resc 23808 2018-07-01.10:17 & stickers_512x512.jpg
ubuntu $ imeta ls -d stickers_thumbnails/stickers_512x512.jpg
AVUs defined for dataObj stickers_thumbnails/stickers_512x512.jpg:
attribute: source_image
value: stickers.jpg
units:
----
attribute: time_stamp
value: 2018-07-01 14:17:14
units:
Future Work
- Integrate tickets into the process
  - grant and pass tickets in the rule engine
- Develop a security model via dynamic policy enforcement points
  - filter user-executed rules for calls to
    - docker
    - msiExecCmd()
    - another binary
- Provide a mechanism for users to provide their own apps
  - registry and API
- Extend policy enforcement around the concept of computation
  - enforcing the sandbox
UGM 2018 - Taking Compute to Data
By iRODS Consortium
iRODS User Group Meeting 2018 - Advanced Training Module https://irods.org/images/compute_to_data.jpg