Moving Code to Data
with workflow languages and containers
Context
Goals
My thesis project set out to test the following:
- Feasibility of OpenStack for research.
- Using the Common Workflow Language (CWL) on top of OpenStack.
- Using Docker containers to provide tools so that researchers can execute pipelines easily.
- Using object storage on top of OpenStack to provide data for remote pipeline execution.
Data
- Sharing of data is crucial in the scientific community.
- Collaborative and reproducible work is increasing.
- Data volumes are growing at substantial rates.
- Storage devices are becoming considerably cheaper.
Cloud
Predominantly used in business, but reaching more and more into research.
More research-oriented solutions are appearing.
Still relatively expensive for high-throughput work.
More and more scientific data is being held in cloud environments.
Justification
Traditional Way (vice versa: moving data to code)
Problems?
- Data sizes grow faster than internet speeds (especially in developing countries).
- Data governance issues with public clouds.
- Researchers often need to know low-level technical details, or work with someone who does.
Workflow Languages
Structured workflow definitions.
Reproducible code and workflows.
Easier sharing.
Many are gaining support for containerization.
Easier to support from a systems perspective.
Master's Thesis
Assumptions
- Data exists at the institute.
- Institute provides a cloud environment.
OpenStack
- Self-hosted cloud.
- Open source and free (if not using commercial offering).
- Gaining traction as an alternative to using commercial products.
- South Africa has projects currently using or looking at utilizing OpenStack:
  - ARC - Now called SADIRC.
  - IDIA.
Software Containers
- Lightweight alternative to virtualization.
- Reproducible software environments.
- Easily shareable code.
- Write once, run anywhere.
- Some container systems are built specifically for science.
Quick Overview of Virtual Machines vs. Containers
(docker.io)
Proposed Solution
- Submit workflow definitions.
- Specify run-time arguments for workflow execution.
- Select data from a list of data they are authorized to use.
- Select where the resultant data is sent.
- Use the system without knowledge of the back-end cloud infrastructure or how to use that cloud environment.
- Autonomously deploy containers from the workflow definition and execute the workflow.
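What a user supplies in the list above could be captured in a single submission record. The following is a minimal sketch in Python, not the thesis implementation: the class `WorkflowSubmission`, its field names, the `validate` helper, the `.cwl` extension check, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowSubmission:
    """Hypothetical record of what a researcher submits (illustration only)."""
    workflow_path: str                                  # workflow definition
    runtime_args: dict = field(default_factory=dict)    # run-time arguments
    input_data: list = field(default_factory=list)      # selected dataset names
    output_destination: str = ""                        # where results are sent

    def validate(self, authorized_data):
        """Return a list of problems; an empty list means the job can be queued."""
        problems = []
        # Assumption: workflow definitions are stored as .cwl files.
        if not self.workflow_path.endswith(".cwl"):
            problems.append("workflow definition should be a .cwl file")
        for name in self.input_data:
            if name not in authorized_data:
                problems.append(f"not authorized for dataset: {name}")
        if not self.output_destination:
            problems.append("no output destination selected")
        return problems


job = WorkflowSubmission(
    workflow_path="fastqc.cwl",
    runtime_args={"threads": 1},
    input_data=["NA12878"],
    output_destination="results-container",
)
print(job.validate(authorized_data={"NA12878"}))
```

Keeping the authorization check next to the submission means the user never has to interact with the cloud back end directly; they only see whether the job was accepted.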
Implementation
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
requirements:
  - class: DockerRequirement
    dockerPull: quay.io/ncigdc/fastqc:1
  - class: InlineJavascriptRequirement
class: CommandLineTool
inputs:
  - id: adapters
    type: ["null", File]
    inputBinding:
      prefix: --adapters
  - id: casava
    type: ["null", boolean]
    default: false
    inputBinding:
      prefix: --casava
  - id: contaminants
    type: ["null", File]
    inputBinding:
      prefix: --contaminants
  - id: dir
    type: string
    default: .
    inputBinding:
      prefix: --dir
  - id: extract
    type: boolean
    default: false
    inputBinding:
      prefix: --extract
  - id: format
    type: string
    default: fastq
    inputBinding:
      prefix: --format
  - id: INPUT
    type: File
    format: "edam:format_2182"
    inputBinding:
      position: 99
  - id: kmers
    type: ["null", File]
    inputBinding:
      prefix: --kmers
  - id: limits
    type: ["null", File]
    inputBinding:
      prefix: --limits
  - id: nano
    type: boolean
    default: false
    inputBinding:
      prefix: --nano
  - id: noextract
    type: boolean
    default: true
    inputBinding:
      prefix: --noextract
  - id: nofilter
    type: boolean
    default: false
    inputBinding:
      prefix: --nofilter
  - id: nogroup
    type: boolean
    default: false
    inputBinding:
      prefix: --nogroup
  - id: outdir
    type: string
    default: .
    inputBinding:
      prefix: --outdir
  - id: quiet
    type: boolean
    default: false
    inputBinding:
      prefix: --quiet
  - id: threads
    type: int
    default: 1
    inputBinding:
      prefix: --threads
outputs:
  - id: OUTPUT
    type: File
    outputBinding:
      glob: |
        ${
          function endsWith(str, suffix) {
            return str.indexOf(suffix, str.length - suffix.length) !== -1;
          }
          var filename = inputs.INPUT.nameroot;
          if ( endsWith(filename, '.fq') ) {
            var nameroot = filename.slice(0,-3);
          }
          else if ( endsWith(filename, '.fastq') ) {
            var nameroot = filename.slice(0,-6);
          }
          else {
            var nameroot = filename;
          }
          var output = nameroot +"_fastqc.zip";
          return output
        }
baseCommand: [/usr/local/FastQC/fastqc]
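The `glob` expression in the tool definition above derives the expected output file name from the input file. The same logic can be sketched in Python to check which name the tool will look for. The function `fastqc_output_name` and the file names in the usage lines are hypothetical, and `os.path.splitext` is assumed here to approximate CWL's `nameroot` (the basename with its final extension removed).

```python
import os


def fastqc_output_name(input_basename):
    """Mirror the CWL glob expression: strip .fq/.fastq and append _fastqc.zip."""
    # Approximation of CWL's nameroot: basename without its last extension.
    nameroot = os.path.splitext(input_basename)[0]
    # Same order of checks as the JavaScript expression.
    for suffix in (".fq", ".fastq"):
        if nameroot.endswith(suffix):
            nameroot = nameroot[: -len(suffix)]
            break
    return nameroot + "_fastqc.zip"


for name in ("reads.fq.gz", "reads.fastq.gz", "sample.bam"):
    print(name, "->", fastqc_output_name(name))
```

Note that the suffix check runs on the name after the last extension has already been removed, so it only has an effect for doubly-suffixed inputs such as compressed FASTQ files.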
- Ran a CWL file with a Dockerized FastQC tool.
- A short-read data file from NCBI (NA12878) was used as an example.
- Input data was 2.77 GB, preloaded on the cloud environment.
- Resultant data was 650 KB, generated on the cloud environment.
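The two sizes above show why moving the code is cheaper than moving the data. As a rough worked example in Python, assuming decimal units (1 GB = 10^9 bytes) and an illustrative 10 Mbit/s link that is not a measurement from the project:

```python
INPUT_BYTES = 2.77e9    # 2.77 GB input, decimal units assumed
OUTPUT_BYTES = 650e3    # 650 KB result, decimal units assumed
LINK_MBIT_PER_S = 10    # assumed link speed, purely illustrative


def transfer_seconds(size_bytes, megabits_per_second):
    """Idealized transfer time: no protocol overhead, constant bandwidth."""
    return size_bytes * 8 / (megabits_per_second * 1e6)


reduction = INPUT_BYTES / OUTPUT_BYTES
input_time = transfer_seconds(INPUT_BYTES, LINK_MBIT_PER_S)
output_time = transfer_seconds(OUTPUT_BYTES, LINK_MBIT_PER_S)

print(f"result is about {reduction:.0f}x smaller than the input")
print(f"moving the input:  about {input_time / 60:.0f} minutes")
print(f"moving the result: about {output_time:.2f} seconds")
```

Under these assumptions the result is roughly four thousand times smaller than the input, so only the small output ever needs to cross the network.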
Result
test data
result data
Findings
This project was a simple proof of concept
This concept has a bright future
Thanks
Prof. A Christoffels, Peter van Heusden
UWC Astrophysics and ICS departments
Reach Me At:
eugene@sanbi.ac.za
https://themeanti.me
GA2018
By Eugene de Beste