Stian Soiland-Reyes
eScience lab, The University of Manchester
https://orcid.org/0000-0001-9842-9718
https://slides.com/soilandreyes/2018-12-11-cwl/
BioExcel/MolSSI Workshop on
Workflows in Biomolecular Simulations
2018-12-11, Barcelona
This work is licensed under a
Creative Commons Attribution 4.0 International License.
This work has been done as part of the BioExcel CoE (www.bioexcel.eu), a project funded by the European Union contract H2020-EINFRA-2015-1-675728.
Findable
Accessible
Interoperable
Reusable
To be Findable:
F1. (meta)data are assigned a globally unique and persistent identifier
F2. data are described with rich metadata (defined by R1 below)
F3. metadata clearly and explicitly include the identifier of the data it describes
F4. (meta)data are registered or indexed in a searchable resource
To be Accessible:
A1. (meta)data are retrievable by their identifier using a standardized communications protocol
A1.1 the protocol is open, free, and universally implementable
A1.2 the protocol allows for an authentication and authorization procedure, where necessary
A2. metadata are accessible, even when the data are no longer available
To be Interoperable:
I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
I2. (meta)data use vocabularies that follow FAIR principles
I3. (meta)data include qualified references to other (meta)data
To be Reusable:
R1. meta(data) are richly described with a plurality of accurate and relevant attributes
R1.1. (meta)data are released with a clear and accessible data usage license
R1.2. (meta)data are associated with detailed provenance
R1.3. (meta)data meet domain-relevant community standards
cwlVersion: v1.0
class: Workflow
inputs:
inp: File
ex: string
outputs:
classout:
type: File
outputSource: compile/classfile
steps:
untar:
run: tar-param.cwl
in:
tarfile: inp
extractfile: ex
out: [example_out]
compile:
run: arguments.cwl
in:
src: untar/example_out
out: [classfile]
cwlVersion: v1.0
class: Workflow
inputs:
toConvert: File
outputs:
converted:
type: File
outputSource: convertMethylation/converted
combined:
type: File
outputSource: mergeSymmetric/combined
steps:
convertMethylation:
run: interconverter.cwl
in:
toConvert: toConvert
out: [converted]
mergeSymmetric:
run: symmetriccpgs.cwl
in:
toCombine: convertMethylation/converted
out: [combined]
cwlVersion: v1.0
class: CommandLineTool
baseCommand: interconverter.sh
hints:
- class: DockerRequirement
dockerPull: "quay.io/neksa/screw-tool"
arguments: ["-d", $(runtime.outdir)]
inputs:
toConvert:
type: File
inputBinding:
prefix: -i
outputs:
converted:
type: File
outputBinding:
glob: "*.meth"
cwlVersion: v1.0
class: CommandLineTool
baseCommand: symmetriccpgs.sh
arguments: ["-d", $(runtime.outdir)]
hints:
- class: DockerRequirement
dockerPull: "quay.io/neksa/screw-tool"
inputs:
toCombine:
type: File
inputBinding:
prefix: -i
outputs:
combined:
type: File
outputBinding:
glob: "*.sym"
cwlVersion: v1.0
class: Workflow
label: EMG QC workflow, (paired end version). Benchmarking with MG-RAST expt.
requirements:
- class: SubworkflowFeatureRequirement
- class: SchemaDefRequirement
types:
- $import: ../tools/FragGeneScan-model.yaml
- $import: ../tools/trimmomatic-sliding_window.yaml
- $import: ../tools/trimmomatic-end_mode.yaml
- $import: ../tools/trimmomatic-phred.yaml
inputs:
reads:
type: File
format: edam:format_1930 # FASTQ
outputs:
processed_sequences:
type: File
outputSource: clean_fasta_headers/sequences_with_cleaned_headers
steps:
trim_quality_control:
doc: |
Low quality trimming (low quality ends and sequences with < quality scores
less than 15 over a 4 nucleotide wide window are removed)
run: ../tools/trimmomatic.cwl
in:
reads1: reads
phred: { default: '33' }
leading: { default: 3 }
trailing: { default: 3 }
end_mode: { default: SE }
minlen: { default: 100 }
slidingwindow:
default:
windowSize: 4
requiredQuality: 15
out: [reads1_trimmed]
convert_trimmed-reads_to_fasta:
run: ../tools/fastq_to_fasta.cwl
in:
fastq: trim_quality_control/reads1_trimmed
out: [ fasta ]
clean_fasta_headers:
run: ../tools/clean_fasta_headers.cwl
in:
sequences: convert_trimmed-reads_to_fasta/fasta
out: [ sequences_with_cleaned_headers ]
$namespaces:
edam: http://edamontology.org/
s: http://schema.org/
$schemas:
- http://edamontology.org/EDAM_1.16.owl
- https://schema.org/docs/schema_org_rdfa.html
s:license: "https://www.apache.org/licenses/LICENSE-2.0"
s:copyrightHolder: "EMBL - European Bioinformatics Institute"
.. and portability
cwltool: Local (Linux, OS X, Windows)
Arvados: AWS, GCP, Azure, Slurm
Toil: AWS, Azure, GCP, Grid Engine, LSF, Mesos, OpenStack, Slurm, PBS/Torque
Rabix Bunny: Local(Linux, OS X), GA4GH TES
cwl-tes: Local, GCP, AWS, HTCondor, Grid Engine, PBS/Torque, Slurm
CWL-Airflow: Linux, OS X
REANA: Kubernetes, CERN OpenStack
cromwell: local, HPC, Google, HtCondor
CWLEXEC: IBM Spectrum LSF
XENON: any Xenon backend: local, ssh, SLURM, Torque, Grid Engine
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: CommandLineTool
baseCommand: ["gmx", "pdb2gmx"]
arguments: ["-o", "processed.gro"]
inputs:
pdb:
type: string
inputBinding:
prefix: -f
water:
type: File
inputBinding:
prefix: -water
default: spce
outputs:
processed_gro:
type: File
outputBinding:
glob: processed.gro
cwlVersion: v1.0
class: CommandLineTool
baseCommand: node
hints:
DockerRequirement:
dockerPull: gromacs/gromacs:2018.4
hints:
SoftwareRequirement:
packages:
- package: gromacs
version:
- '2018.4'
specs:
- https://packages.debian.org/gromacs
- https://anaconda.org/bioconda/gromacs
- https://bio.tools/gromacs
- https://identifiers.org/rrid/RRID:SCR_014565
- https://hpc.example.edu/modules/gromacs/2018.4
DockerRequirement:
dockerPull: gromacs/gromacs:2018.4
class: CommandLineTool
hints:
SoftwareRequirement:
packages:
gromacs:
version: [ "2018.4" ]
baseCommand: ["gmx", "pdb2gmx"]
#..
module load gromacs/2018.4
apt-get install gromacs=2018.4*
conda install gromacs=2018.4
<dependency_resolvers>
<modules modulecmd="/opt/bin/modulecmd" />
<tool_shed_packages />
<galaxy_packages />
<conda />
<modules modulecmd="/opt/bin/modulecmd" versionless="true" />
<galaxy_packages versionless="true" />
<conda versionless="true" />
</dependency_resolvers>
tool_dependency_dir/
gromacs/
2018.4/
bin/
env.sh
Dependency resolution by CWLTool and Toil
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo
requirements:
InlineJavascriptRequirement: {}
inputs: []
outputs:
example_out:
type: stdout
stdout: output.txt
arguments:
- prefix: -A
valueFrom: $(1+1)
- prefix: -B
valueFrom: $("/foo/bar/baz".split('/').slice(-1)[0])
- prefix: -C
valueFrom: |
${
var r = [];
for (var i = 10; i >= 1; i--) {
r.push(i);
}
return r;
}
arguments: ['--cpus', $(runtime.cores)]
requirements:
- class: ResourceRequirement
ramMin: 42000
coresMin: 1
tmpdirMin: 40000
- class: InlineJavascriptRequirement
Secondary files
Inline files
Directories
Arrays of files
Enums
JSON records (string, int, array, dict)
Workflow Engine is free on how to handle data management, e.g. data store or file copying
Optional inputs (tool reuse)
Sub-workflows (reuse workflow as a tool)
Task Parallelization (step execute when data is ready)
Scattering (multiple tasks from single array)
Merge (multiple inputs to single array)
towards CWL 2.x
Conditional branching (switch statement)
Looping
Who ran it?
When did it run?
Where did it run?
What workflow ran?
Which tool versions?
What data was created?
Khan et al,
Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv
Submitted to GigaScience
https://doi.org/10.5281/zenodo.1966881
$ cwlprov --help
usage: cwlprov [-h] [--version] [--directory DIRECTORY] [--relative]
[--absolute] [--output OUTPUT] [--verbose] [--quiet] [--hints]
[--no-hints]
{validate,info,who,prov,inputs,outputs,run,runs,rerun,derived,runtimes}
...
cwlprov explores Research Objects containing provenance of Common Workflow
Language executions. <https://w3id.org/cwl/prov/>
commands:
{validate,info,who,prov,inputs,outputs,run,runs,rerun,derived,runtimes}
validate Validate the CWLProv Research Object
info show research object Metadata
who show Who ran the workflow
prov export workflow execution Provenance in PROV format
inputs list workflow/step Input files/values
outputs list workflow/step Output files/values
run show workflow Execution log
runs List all workflow executions in RO
rerun Rerun a workflow or step
derived list what was Derived from a data item, based on
activity usage/generation
runtimes calculate average step execution Runtimes
(venv3) stain@biggie:~/src/cwlprov-py/test/nested-cwlprov-0.3.0$ cwlprov run
2018-08-08 22:44:06.573330 Flow 39408a40-c1c8-4852-9747-87249425be1e [ Run of workflow/packed.cwl#main
2018-08-08 22:44:06.691722 Step 4f082fb6-3e4d-4a21-82e3-c685ce3deb58 Run of workflow/packed.cwl#main/create-tar (0:00:00.010133)
2018-08-08 22:44:06.702976 Step 0cceeaf6-4109-4f08-940b-f06ac959944a * Run of workflow/packed.cwl#main/compile (unknown duration)
2018-08-08 22:44:12.680097 Flow 39408a40-c1c8-4852-9747-87249425be1e ] Run of workflow/packed.cwl#main (0:00:06.106767)
Legend:
[ Workflow start
* Nested provenance, use UUID to explore: cwlprov run 0cceeaf6-4109-4f08-940b-f06ac959944a
] Workflow end
(venv3) stain@biggie:~/src/cwlprov-py/test/nested-cwlprov-0.3.0$ cwlprov run 0cceeaf6-4109-4f08-940b-f06ac959944a
2018-08-08 22:44:06.607210 Flow 0cceeaf6-4109-4f08-940b-f06ac959944a [ Run of workflow/packed.cwl#main
2018-08-08 22:44:06.707070 Step 83752ab4-8227-4d4a-8baa-78376df34aed Run of workflow/packed.cwl#main/untar (0:00:00.008149)
2018-08-08 22:44:06.718554 Step f56d8478-a190-4251-84d9-7f69fe0f6f8b Run of workflow/packed.cwl#main/argument (0:00:00.532052)
2018-08-08 22:44:07.251588 Flow 0cceeaf6-4109-4f08-940b-f06ac959944a ] Run of workflow/packed.cwl#main (0:00:00.644378)
Legend:
[ Workflow start
] Workflow end
stain@biggie:~/src/cwlprov-py/test/nested-cwlprov-0.3.0$ cwlprov outputs 4f082fb6-3e4d-4a21-82e3-c685ce3deb58 --format=files
Output tar:
data/c0/c0fd5812fe6d8d91fef7f4f1ba3a462500fce0c5
stain@biggie:~/src/cwlprov-py/test/nested-cwlprov-0.3.0$ tar tfv `cwlprov -q outputs 4f082fb6-3e4d-4a21-82e3-c685ce3deb58 --format=files`
-rw-r--r-- stain/stain 115 2018-08-08 23:44 Hello.java
Inspecting step runs