Taking Compute to Data

June 5-7, 2018

iRODS User Group Meeting 2018

Durham, NC

Daniel Moore

dmoore@renci.org

Applications Engineer, iRODS Consortium

Renaissance Computing Institute

UNC-Chapel Hill

The Compute to Data Use Case

Data is assumed to be routed to the appropriate storage resource

Goals - Develop generic interface concept for compute

  • Develop a metadata-driven interface for labeling resources that provide computational capabilities - ultimately relies upon convention
  • Where possible, separate configuration from implementation - isolate deployment specific concepts
  • Consider a rule base as an extension of iRODS - rules are not just data management policy

Goals - Develop a thumbnailing service for iRODS

  1. Assumes data is routed to the appropriate resource for compute
  2. Launch a docker container to generate thumbnails
  3. Post compute container:
    1. Register products in the resultant directory into iRODS
    2. Apply any metadata contained in resultant mdmanifest.json 

Implemented as an iRODS rulebase, following the Template Method pattern

Components of the System

System Component          Implementation
Container Technology      Singularity
User Provided Compute     docker://repo_name/thumbnail_image
Post Compute Container    shub://repo_name/metadata_addtags
Job Endpoint              iRODS Rule Base (user extension of the iRODS API)

Getting Started

As the ubuntu training account

Install Singularity:

~/irods_training/advanced/hpc_compute_to_data/ubuntu_14_16/install_singularity.sh

Getting Started

Build and Install the Compute to Data package

git clone https://github.com/irods/irods_training
sudo apt-get -y install irods-externals-* irods-dev
export PATH=/opt/irods-externals/cmake3.5.2-0/bin/:$PATH

mkdir build_compute_to_data
cd build_compute_to_data
cmake ../irods_training/advanced/hpc_compute_to_data
make package
sudo dpkg -i ./irods-hpc-compute-to-data-example_4.2.3~xenial_amd64.deb

Getting Started - if necessary

Build and Install the Data to Compute package

git clone https://github.com/irods/irods_training
sudo apt-get -y install irods-externals-* irods-dev
export PATH=/opt/irods-externals/cmake3.5.2-0/bin/:$PATH

mkdir build_data_to_compute
cd build_data_to_compute
cmake ../irods_training/advanced/hpc_data_to_compute/
make package
sudo dpkg -i ./irods-hpc-data-to-compute-example_4.2.3~xenial_amd64.deb

Getting Started

Build and Install the UUID microservice plugin

cd
mkdir build_uuid
cd build_uuid
cmake ../irods_training/advanced/irods_microservice_plugin_uuid/
make package
sudo dpkg -i ./irods-get-uuid-4.2.1-ubuntu14-x86_64.deb

Configure the rule engine plugin

sudo su - irods
vim /etc/irods/server_config.json

 "rule_engines": [
     ...
         "re_rulebase_set": [
             "compute_to_data",
             "data_to_compute",
             "core"
         ],
     ...
 ]

Disable SSL in the iRODS Server for Jargon

Edit /etc/irods/core.re

acPreConnect(*OUT) { *OUT="CS_NEG_DONT_CARE"; }

becomes

acPreConnect(*OUT) { *OUT="CS_NEG_REFUSE"; }

Configure the Tagged Resources - if necessary

Make two unix file system resources

 iadmin mkresc lts_resc unixfilesystem `hostname`:/tmp/irods/lts_resc
 iadmin mkresc img_resc unixfilesystem `hostname`:/tmp/irods/img_resc

Annotate them with appropriate metadata given their roles

  - defined in the configuration as part of the contract

 imeta add -R lts_resc COMPUTE_RESOURCE_ROLE LONG_TERM_STORAGE
 imeta add -R img_resc COMPUTE_RESOURCE_ROLE IMAGE_PROCESSING

Stage data for thumbnail creation

As the ubuntu training account

cp ~/irods_training/stickers.jpg /tmp

As the irods service account

The configuration interface

Define interfaces for any necessary conventions

  • Metadata attributes and values
  • Metadata values for implemented roles

Single Point of Truth - Template Method Pattern

  • execute defined preconditions
  • run user's requested container
  • execute the Moarlock service - product capture and metadata application

Users may utilize metadata conventions within a rule to provide inputs to the generalized container service
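
The three steps above can be sketched outside of iRODS as a small Template Method. This is only an illustration in Python, not the rulebase itself: the function and parameter names are hypothetical, and the point is that the order of the steps is fixed while each step is supplied by the caller.

```python
from typing import Callable, List


def launch_compute_container(
    run_user_container: Callable[[], None],
    preconditions: Callable[[], None] = lambda: None,
    capture_products: Callable[[], None] = lambda: None,
) -> None:
    """Fixed skeleton: the steps may vary, their order may not."""
    preconditions()        # 1. execute defined preconditions
    run_user_container()   # 2. run the user's requested container
    capture_products()     # 3. product capture and metadata application


# Hypothetical usage: record the order in which the steps fire.
calls: List[str] = []
launch_compute_container(
    run_user_container=lambda: calls.append("thumbnail"),
    preconditions=lambda: calls.append("pre"),
    capture_products=lambda: calls.append("moarlock"),
)
print(calls)  # ['pre', 'thumbnail', 'moarlock']
```

A caller only decides what runs in each slot; it cannot skip or reorder the post-compute capture step, which is what makes the template a single point of truth.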

The Image Compute container

FROM python:3

# Define environment variable
ENV SOURCE_IMAGE default_source_image
ENV DESTINATION_IMAGE default_destination_image
ENV SIZE default_size
ENV DESTINATION_COLLECTION default_destination_collection

WORKDIR /

ADD make_thumbnail.py /

RUN apt-get update && apt-get install -y libjpeg-dev && pip install pillow

# Run make_thumbnail.py when the container launches
CMD ["sh", "-c", "python ./make_thumbnail.py ${SIZE}x${SIZE} /src/${SOURCE_IMAGE} /dst/${DESTINATION_IMAGE} ${DESTINATION_COLLECTION}"]

The Python thumbnail utility

import os, sys
import json
import datetime
from PIL import Image


# capture incoming parameters
size_str = sys.argv[1]
src_name = sys.argv[2]
dst_name = sys.argv[3]
dst_coll = sys.argv[4]

# build size array
size_vals = size_str.split('x')
size = int(size_vals[0]), int(size_vals[1])

# generate thumbnail
try:
    im = Image.open(src_name)
    im.thumbnail(size)
    im.save(dst_name, im.format)
except IOError:
    print("cannot create thumbnail for: " + src_name)
    sys.exit(1)  # non-zero exit status so the caller can detect the failure

1-2 : Build the Thumbnail

The Python thumbnail utility

# create metadata and build json representation for moarlock
avu_list = []

avu = {}
avu['attribute'] = 'source_image'
avu['value'] = os.path.basename(src_name)
avu['unit'] = ''
avu['irodsPath'] = os.path.basename(dst_name)
avu['action'] = 'ADD'
avu_list.append(avu)

avu = {}
avu['attribute'] = 'time_stamp'
avu['value'] = '{:%Y-%m-%d %H:%M:%S}'.format(datetime.datetime.now())
avu['unit'] = ''
avu['irodsPath'] = os.path.basename(dst_name)
avu['action'] = 'ADD'
avu_list.append(avu)

md_manifest = {}
md_manifest['operation'] = avu_list
md_manifest['failureMode'] = 'FAIL_FAST'
md_manifest['parentIrodsTargetPath'] = dst_coll

# write out manifest for moarlock
with open('/dst/mdmanifest.json', 'w') as outfile:
    json.dump(md_manifest, outfile, sort_keys=True, indent=4, separators=(',', ': '))

2-2 : Create the mdmanifest.json file for Moarlock

Example mdmanifest.json

{
    "operation":[
        {
            "attribute":"atr1",
            "value":"val1",
            "unit":"",
            "irodsPath":"stickers.jpg",
            "action":"ADD"
        }
    ],
    "failureMode":"FAIL_FAST",
    "parentIrodsTargetPath":"/tempZone/home/rods/"
}
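
To make the manifest layout concrete, here is a minimal Python sketch that reads the example above and lists the metadata operations it describes. It is not Moarlock's own code: the helper name planned_avus is hypothetical, and joining irodsPath onto parentIrodsTargetPath is an assumption based on the field names.

```python
import json

# The example manifest from the slide above, verbatim.
EXAMPLE = """
{
    "operation":[
        {
            "attribute":"atr1",
            "value":"val1",
            "unit":"",
            "irodsPath":"stickers.jpg",
            "action":"ADD"
        }
    ],
    "failureMode":"FAIL_FAST",
    "parentIrodsTargetPath":"/tempZone/home/rods/"
}
"""


def planned_avus(manifest_text):
    """Return (action, logical path, attribute, value, unit) per operation.

    Assumption: each irodsPath is relative to parentIrodsTargetPath.
    """
    manifest = json.loads(manifest_text)
    parent = manifest["parentIrodsTargetPath"].rstrip("/")
    plan = []
    for op in manifest["operation"]:
        path = parent + "/" + op["irodsPath"]
        plan.append((op["action"], path, op["attribute"], op["value"], op["unit"]))
    return plan


for entry in planned_avus(EXAMPLE):
    print(entry)
```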

The configuration interface

For the thumbnail service we will need to

  • Get the metadata attribute string that holds the role
  • Get the tag for an Image Compute resource
  • Get the logical collection name for thumbnails
  • Get the name of a thumbnail
  • Get a list of desired thumbnail sizes

The configuration interface

Utilize the interface for our chosen metadata convention

get_compute_resource_role_attribute(*t) {
    *t = "COMPUTE_RESOURCE_ROLE"
}
get_image_compute_type(*t) {
    *t = "IMAGE_PROCESSING"
}

The configuration interface

get_thumbnail_collection_name(*col_name, *obj_name,  *thumb_coll_name) {
    *fn = trimr(*obj_name, ".")
    *thumb_coll_name = *col_name ++ "/" ++ *fn ++ "_thumbnails"
}
get_thumbnail_name(*file_name, *size, *thumb_name) {
    # trim the extension
    *fn = trimr(*file_name, ".")
    *ext = substr(*file_name, strlen(*fn)+1, strlen(*file_name))
    *thumb_name = *fn ++ "_thumbnail_" ++ *size ++ "." ++ *ext
}
get_thumbnail_sizes(*size_list) {
    *size_list = list( "128x128", "256x256", "512x512", "1024x1024" )
}

Utilize the interface for naming conventions
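
The naming rules above can be mirrored in Python to check what they produce. This is a sketch under one assumption: that trimr(*name, ".") drops everything from the last "." onward. The Python function names are illustrative and are not part of the rulebase.

```python
def thumbnail_collection_name(col_name, obj_name):
    """Mirror get_thumbnail_collection_name: <collection>/<stem>_thumbnails."""
    stem = obj_name.rsplit(".", 1)[0]
    return col_name + "/" + stem + "_thumbnails"


def thumbnail_name(file_name, size):
    """Mirror get_thumbnail_name: <stem>_thumbnail_<size>.<ext>."""
    parts = file_name.rsplit(".", 1)
    stem = parts[0]
    ext = parts[1] if len(parts) == 2 else ""
    return stem + "_thumbnail_" + size + "." + ext


print(thumbnail_collection_name("/tempZone/home/rods", "stickers.jpg"))
# /tempZone/home/rods/stickers_thumbnails
print(thumbnail_name("stickers.jpg", "128x128"))
# stickers_thumbnail_128x128.jpg
```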

Thumbnail Service - helper functions

split_path(*p, *tok, *col, *obj)
get_resource_host_by_id(*resc_id, *resc_host)
get_resc_id_for_data_object_resident_on_image_node(*obj_name, *col_name,
                  *compute_resc_role_attr, *image_compute_type, *src_resc_id)
get_phy_path_for_object_on_resc_id(*obj_name, *resc_id, *phy_path)

Leverage helper functions from Data to Compute

object_is_image_type(*_f, *_flag)
determine_destination_resource(*_obj_path)

Additional helper functions for Compute to Data

Routing the Data - helper function implementation

object_is_image_type(*_f, *_flag) {
    *_flag = false;
    if (*_f like "*.jpg" || *_f like "*.jpeg" || *_f like "*.bmp" ||
        *_f like "*.tif" || *_f like "*.tiff" || *_f like "*.rif" ||
        *_f like "*.gif" || *_f like "*.png"  || *_f like "*.svg" ||
        *_f like "*.xpm") {
        *_flag = true;
    }
}

Use file extension to determine image type

Other options: use a Tika service for mime-type, the CyVerse infotyper service, or a custom microservice plugin
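
For comparison, the same extension test can be sketched in Python. The extension list is copied from the rule above; the ignore_case switch is an addition for illustration only, on the assumption that the rule's like operator matches case-sensitively.

```python
IMAGE_EXTENSIONS = (
    ".jpg", ".jpeg", ".bmp", ".tif", ".tiff",
    ".rif", ".gif", ".png", ".svg", ".xpm",
)


def object_is_image_type(path, ignore_case=False):
    """Return True when the path ends with one of the known image extensions."""
    candidate = path.lower() if ignore_case else path
    return candidate.endswith(IMAGE_EXTENSIONS)


print(object_is_image_type("/tempZone/home/rods/stickers.jpg"))  # True
print(object_is_image_type("/tempZone/home/rods/notes.txt"))     # False
print(object_is_image_type("PHOTO.JPG"))                         # False
print(object_is_image_type("PHOTO.JPG", ignore_case=True))       # True
```

An extension check is cheap but trusts the file name; the mime-type options listed above inspect content instead.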

Routing the Data - match object type to resource

determine_destination_resource(*_obj_path) {
    *comp_attr = "NULL"
    get_compute_resource_role_attribute(*comp_attr);
    *image_flag = false;
    object_is_image_type(*_obj_path, *image_flag)
    *resc_name = "lts_resc" # discover LTS resc
    if(true == *image_flag) {
        *image_type = "NULL"
        get_image_compute_type(*image_type)
        get_resource_name_by_role(*resc_name, *comp_attr, *image_type)
    }
    msiSetDefaultResc(*resc_name,"forced");
}

Find the image processing resource and route the ingested data accordingly
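
The routing decision can be sketched in Python with the resource metadata held in a plain dictionary. The dictionary stands in for the catalog query that the rulebase performs, the extension list is abbreviated, and all Python names here are hypothetical.

```python
# Stand-in for the AVUs attached with `imeta add -R` earlier in the deck.
RESOURCE_TAGS = {
    "lts_resc": {"COMPUTE_RESOURCE_ROLE": "LONG_TERM_STORAGE"},
    "img_resc": {"COMPUTE_RESOURCE_ROLE": "IMAGE_PROCESSING"},
}


def resource_name_by_role(tags, attribute, value):
    """Return the first resource whose attribute carries the given value."""
    for name, avus in tags.items():
        if avus.get(attribute) == value:
            return name
    return None


def determine_destination_resource(obj_path, tags=RESOURCE_TAGS):
    """Images go to the image processing resource, everything else to LTS."""
    default = "lts_resc"
    if obj_path.endswith((".jpg", ".jpeg", ".png")):  # abbreviated list
        match = resource_name_by_role(tags, "COMPUTE_RESOURCE_ROLE", "IMAGE_PROCESSING")
        return match or default
    return default


print(determine_destination_resource("/tempZone/home/rods/stickers.jpg"))  # img_resc
print(determine_destination_resource("/tempZone/home/rods/report.csv"))    # lts_resc
```

Because the lookup goes through the role tag rather than a resource name, retagging a different resource changes the routing without touching the logic.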

Compute to Data Interface

Ingested data must route to appropriate resource given some criteria

acSetRescSchemeForCreate()
launch_compute_container(
    *host_name,
    *port_str,
    *guid_str,
    *src_phy_path,
    *dst_log_path,
    *container_name,
    *user_docker_options)
launch_thumbnail_compute(*src_obj_path)

Provide a template for containerized compute and an endpoint for creating thumbnails

Routing the Data

acSetRescSchemeForCreate {

    determine_destination_resource($objPath)

}

Override the static policy enforcement point for data object creation

Call our helper function, passing the session variable which holds the logical path of the data object

Compute to Data Template Method

launch_compute_container(
    *host_name,
    *port_str,
    *guid_str,
    *src_phy_path,
    *dst_log_path,
    *container_name,
    *user_docker_options) {

    remote(*host_name, "null") {
        # possible future pre-processing here

        # build the full docker option string
        *cmd_opt = *user_docker_options ++ " " ++ *container_name  ++ "\""

        # call the users provided container
        msiExecCmd("docker_run.sh", *cmd_opt, "null", "null", "null", *std_out_err)

1-2 : Launch the user defined container

Note - add a delay() directive for asynchronous behavior

Compute to Data Template Method

        # build option string for Moarlock
        split_path(*dst_log_path, "/", *dst_col_name, *dst_obj_name)
        *moar_opts = "\" -v " ++ *src_phy_path ++ ":/var/input -e host=" ++ *host_name ++ " -e zone=tempZone -e port=" ++ *port_str ++ " -e user=rods -e passwd=rods -e irodsout=" ++ *dst_col_name ++ " -e guid=" ++ *guid_str ++ " diceunc/moarlock:1.0\""

        # post-processing with  Moarlock
        msiExecCmd("docker_run.sh", *moar_opts, "null", "null", "null", *std_out_err)

    } # remote

} # launch_compute_container

2-2 : Build and launch the Moarlock container

Note: iRODS parameters are currently hardcoded - utilize tickets in future iterations

Compute to Data Thumbnail Rule

launch_thumbnail_compute(
    *src_obj_path ) {
    # TODO - ensure image is on image compute resource


    split_path(*src_obj_path, "/", *col_name, *obj_name)


    *thumb_coll_name = "NULL"
    get_thumbnail_collection_name(*col_name, *obj_name, *thumb_coll_name);

    if("NULL" == *thumb_coll_name) {
            failmsg(-1,"get_thumbnail_collection_name failed")
    }


    *guid_str = "NULL"
    msiget_uuid(*guid_str)


    *dst_dir_name = "/tmp/" ++ *obj_name ++ "-" ++ *guid_str

 

1-4 : capture target collection and UUID

Compute to Data Thumbnail Rule

 

    # capture configuration parameters
    *image_compute_type = "NULL"
    get_image_compute_type(*image_compute_type)
    if("NULL" == *image_compute_type) {
            failmsg(-1,"get_image_compute_type failed")
    }

    *compute_resc_role_attr = "NULL"
    get_compute_resource_role_attribute(*compute_resc_role_attr)
    if("NULL" == *compute_resc_role_attr) {
            failmsg(-1,"get_compute_resource_role_attribute failed")
    }

    *src_resc_id = "NULL"
    get_resc_id_for_data_object_resident_on_image_node(
            *obj_name,
            *col_name,
            *compute_resc_role_attr,
            *image_compute_type,
            *src_resc_id)

2-4 : capture metadata parameters and resource id

Compute to Data Thumbnail Rule

    *src_phy_path = "NULL"
    get_phy_path_for_object_on_resc_id(*obj_name, *src_resc_id, *src_phy_path)
    if("NULL" == *src_phy_path) {
        failmsg(-1,"get_phy_path_for_object_on_resc_id failed for <snip>")
    }
    split_path(*src_phy_path, "/", *src_dir_name, *src_file_name)

    *server_host = "NULL"
    get_resource_host_by_id(*src_resc_id, *server_host);
    if("NULL" == *server_host) {
        failmsg(-1,"get_resource_host_by_id failed for [*src_resc_id]")
    }

    get_thumbnail_sizes(*thumb_sizes)

 

3-4 : get source physical path, host name and sizes

Compute to Data Thumbnail Rule

4-4 : build docker option string and call container template

    foreach( *sz in *thumb_sizes ) {
        get_thumbnail_name(*obj_name, *sz, *thumbnail_name);
        *dst_obj_path = *thumb_coll_name ++ "/" ++ *thumbnail_name

        *sz_str = str(*sz)
        *docker_options = "\" -v " ++ *src_dir_name ++ ":/src -v " ++ *dst_dir_name
        *docker_options = *docker_options ++ ":/dst -e SIZE=" ++ *sz_str
        *docker_options = *docker_options ++ " -e SOURCE_IMAGE=" ++ *src_file_name
        *docker_options = *docker_options ++ " -e DESTINATION_IMAGE=" ++ *thumbnail_name
        *docker_options = *docker_options ++ " -e DESTINATION_COLLECTION=" ++ *thumb_coll_name ++ " "

        launch_compute_container(
            *server_host,
            "1247",
            *guid_str,
            *dst_dir_name,
            *dst_obj_path,
            "thumbnail",
            *docker_options)
    } # for
} # launch_thumbnail_compute
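
To see what actually reaches docker_run.sh, the string concatenation in the loop above and in the template can be replayed in Python. This is a sketch only: the paths and the GUID placeholder in the usage lines are made-up sample values, and the Python function names are hypothetical.

```python
def thumbnail_docker_options(src_dir, dst_dir, size, src_file, thumb_name, thumb_coll):
    """Replay the *docker_options concatenation from launch_thumbnail_compute."""
    return (
        '" -v ' + src_dir + ":/src -v " + dst_dir
        + ":/dst -e SIZE=" + size
        + " -e SOURCE_IMAGE=" + src_file
        + " -e DESTINATION_IMAGE=" + thumb_name
        + " -e DESTINATION_COLLECTION=" + thumb_coll + " "
    )


def full_command_options(user_options, container_name):
    """Replay the *cmd_opt concatenation from launch_compute_container."""
    return user_options + " " + container_name + '"'


# Sample values only; the real ones come from the catalog and msiget_uuid.
options = thumbnail_docker_options(
    src_dir="/tmp/irods/img_resc/home/rods",
    dst_dir="/tmp/stickers.jpg-GUID",
    size="128x128",
    src_file="stickers.jpg",
    thumb_name="stickers_thumbnail_128x128.jpg",
    thumb_coll="/tempZone/home/rods/stickers_thumbnails",
)
print(full_command_options(options, "thumbnail"))
```

The opening quote is added by the thumbnail rule and the closing quote by the template, so the whole option list arrives at the script as one quoted argument.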

Compute to Data - Thumbnail Invocation Rule

launch_compute {
    *logical_path="/tempZone/home/rods/stickers.jpg"
    launch_thumbnail_compute(*logical_path)
}
INPUT null
OUTPUT ruleExecOut

launch_compute.r

Compute to Data - Thumbnail Testing

irods@icat:~$ iput -fR img_resc /tmp/stickers.jpg
irods@icat:~$ irule -F launch_compute.r
irods@icat:~$ ils -l
/tempZone/home/rods:
  rods              0 img_resc      2157087 2017-05-29.22:46 & stickers.jpg
  C- /tempZone/home/rods/stickers_thumbnails

irods@icat:~$ ils -l stickers_thumbnails
/tempZone/home/rods/stickers_thumbnails:
  rods              0 img_resc       229954 2017-05-29.22:46 & stickers_thumbnail_1024x1024.jpg
  rods              0 img_resc         6456 2017-05-29.22:46 & stickers_thumbnail_128x128.jpg
  rods              0 img_resc        19355 2017-05-29.22:46 & stickers_thumbnail_256x256.jpg
  rods              0 img_resc        63036 2017-05-29.22:46 & stickers_thumbnail_512x512.jpg

irods@icat:~$ irule -F find_thumbnails.r
thumbnail [/tempZone/home/rods/stickers_thumbnails/stickers_thumbnail_1024x1024.jpg]
thumbnail [/tempZone/home/rods/stickers_thumbnails/stickers_thumbnail_512x512.jpg]
thumbnail [/tempZone/home/rods/stickers_thumbnails/stickers_thumbnail_256x256.jpg]
thumbnail [/tempZone/home/rods/stickers_thumbnails/stickers_thumbnail_128x128.jpg]

Note: -R img_resc must be given explicitly here because rodsadmin users are not affected by acSetRescSchemeForCreate

Future Work

  • Integrate tickets into the process
    • grant and pass tickets in the rule engine
  • Develop a security model via dynamic policy enforcement points
    • filter user executed rules for calls to docker, msiExecCmd, or another binary
  • Provide a mechanism for users to provide their own apps
  • Extend policy enforcement around the concept of computation - enforcing the sandbox

UGM 2018 - Taking Compute to Data

By Daniel Moore

Training to accompany the one page data management design pattern: https://irods.org/images/compute_to_data.jpg