Moving Code to Data
with workflow languages and containers
My thesis project set out to test the following:
- Feasibility of OpenStack for research.
- Using the Common Workflow Language (CWL) on top of OpenStack.
- Using Docker containers to provide tools so that researchers can run pipelines easily.
- Utilizing object storage on top of OpenStack to provide data for remote pipeline execution.
Sharing data is crucial in the scientific community.
Collaborative and reproducible work is increasing.
Data volumes are growing at substantial rates.
Storage devices are becoming considerably cheaper.
Predominantly in the business field but reaching more and more into research.
More research-oriented solutions are appearing.
Still relatively expensive for high throughput.
More and more scientific data is being held in cloud environments.
Traditional Way (vice versa: moving data to code)
Data sizes grow faster than internet speeds (especially in developing countries).
Data governance issues with public clouds.
Researchers often need to know low-level technical details, or work with someone who does.
Structured workflow definitions.
Reproducible code and workflows.
Many are adding support for containerization.
Easier to support from a systems perspective.
Data exists at the institute.
Institute provides a cloud environment.
Open source and free (if not using a commercial offering).
Gaining traction as an alternative to using commercial products.
South Africa has projects that currently use, or are considering, OpenStack:
ARC - Now called SADIRC.
Light-weight alternative to virtualization.
Reproducible software environments.
Easily shareable code.
Write once, run anywhere.
Some container systems are built specifically for science.
Quick Overview of Virtual Machines vs. Containers
Submit workflow definitions.
Specify run-time arguments for workflow execution.
Select data from a list of datasets they are authorized to use.
Select where the resulting data is sent.
Use the system without knowledge of the back-end cloud infrastructure or how to operate it.
Autonomously deploy containers from the workflow definition and execute the workflow.
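The requirements above can be sketched as a small request object plus a validation step. This is only an illustration in Python, not the thesis implementation: the names `JobRequest`, `validate_request`, and all field names are hypothetical, and the authorized dataset list is assumed to come from elsewhere in the system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRequest:
    """Hypothetical shape of what a researcher submits to the system."""
    workflow_definition: str   # text of the workflow definition (e.g. a CWL document)
    runtime_args: dict         # run-time arguments for the workflow
    input_data_ids: list       # identifiers of the datasets selected as input
    output_destination: str    # where the resulting data should be sent


def validate_request(request, authorized_data_ids):
    """Return a list of problems; an empty list means the request is acceptable."""
    problems = []
    if not request.workflow_definition.strip():
        problems.append("workflow definition is empty")
    if not request.input_data_ids:
        problems.append("no input data selected")
    # Only data the researcher is authorized to use may be selected.
    unauthorized = [d for d in request.input_data_ids
                    if d not in authorized_data_ids]
    if unauthorized:
        problems.append("not authorized for: " + ", ".join(unauthorized))
    if not request.output_destination.strip():
        problems.append("no output destination selected")
    return problems
```

Note that nothing in the request refers to the cloud back end; deploying containers and executing the workflow would happen on the system side after validation.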
Ran a CWL file with a Dockerized FastQC tool.
A short-read data file from NCBI (NA12878) was used as the example.
Input data was 2.77 GB, preloaded on the cloud environment.
Resultant data was 650 KB, generated on the cloud environment.
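For illustration, a tool description of the kind used in this test might look like the sketch below. It is not the actual CWL file from the project. It builds a CWL `CommandLineTool` as a Python dictionary and prints it as JSON, which CWL runners generally accept because JSON is a subset of YAML. The CWL version (`v1.0`), the image name `example/fastqc:latest`, the input and output names, and the output glob are all assumptions.

```python
import json


def fastqc_tool(docker_image):
    """Return a CWL CommandLineTool description for FastQC as a plain dict."""
    return {
        "cwlVersion": "v1.0",  # assumed version
        "class": "CommandLineTool",
        "baseCommand": "fastqc",
        # Ask the runner to execute the tool inside a Docker container.
        "hints": [{"class": "DockerRequirement", "dockerPull": docker_image}],
        # Write reports to the working directory so the runner can collect them.
        "arguments": ["--outdir", "."],
        "inputs": {
            "reads": {"type": "File", "inputBinding": {"position": 1}},
        },
        "outputs": {
            # Assumes FastQC names its reports <sample>_fastqc.html / .zip.
            "reports": {"type": "File[]",
                        "outputBinding": {"glob": "*_fastqc.*"}},
        },
    }


def fastqc_job(reads_path):
    """Return the matching job (input) object pointing at the read file."""
    return {"reads": {"class": "File", "path": reads_path}}


if __name__ == "__main__":
    # "example/fastqc:latest" is a placeholder, not a real published image.
    print(json.dumps(fastqc_tool("example/fastqc:latest"), indent=2))
```

Only the small report files need to leave the cloud environment, which is the point of moving the code to the data.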
This project was a simple proof of concept.
This concept has a bright future.
Prof. A Christoffels, Peter van Heusden
UWC Astrophysics and ICS departments
Reach Me At:
By Eugene de Beste