PhD Completion Seminar
Presenter: Farah Zaib Khan
Supervisors: AProf. Andrew Lonie, Prof. Richard O. Sinnott
Chair: Prof. Adrian Pearce
This work is licensed under a
Creative Commons Attribution 4.0 International License.
https://slides.com/farahzkhan/phd-seminar
November 28th, 2018
"The description of a process for accomplishing a scientific objective, usually expressed in terms of tasks and their dependencies"
(Ludäscher et al. 2009)
Workflow Life Cycle
4 Stages
Composition, Representation and Data Model
Mapping to Resources
Workflow Execution
Metadata and Provenance
Exponential advances in the technologies and instruments
Declining sequencing cost (from 2.7 billion USD to 1,000 USD)
Genomic data produced exponentially (tenfold per year since 2002)
Now called a four-headed beast:
Acquisition, Storage, Distribution, Analysis
Retrospective Provenance
The detailed record of the workflow execution including details of every process together with comprehensive information about the execution environment used to derive a specific data product.
All the details associated with a given workflow run
“Who enacted the workflow?”, “What was used to create a given data artefact?”, “When were the workflow and its processes enacted?”, “Where was the workflow enacted?”
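The who/what/when/where questions map naturally onto the fields of a run record. The following is a minimal Python sketch, not taken from any WMS: the class `RunRecord`, its field names and the sample values are all hypothetical and only illustrate what a retrospective provenance record needs to answer.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class RunRecord:
    """Hypothetical retrospective provenance record for one workflow run."""
    agent: str            # who enacted the workflow
    host: str             # where the workflow was enacted
    started: datetime     # when the run started
    ended: datetime       # when the run finished
    used: List[str] = field(default_factory=list)  # what was used as input

    def answers(self) -> dict:
        """Answer the who/what/when/where questions for this run."""
        return {
            "who": self.agent,
            "what": sorted(self.used),
            "when": (self.started.isoformat(), self.ended.isoformat()),
            "where": self.host,
        }


# Illustrative values only; none of these come from a real run.
record = RunRecord(
    agent="alice",
    host="hpc-node-01",
    started=datetime(2018, 11, 28, 9, 0, tzinfo=timezone.utc),
    ended=datetime(2018, 11, 28, 9, 30, tzinfo=timezone.utc),
    used=["reference.fasta", "reads.fastq"],
)
print(record.answers())
```

A real record would also need the execution environment details mentioned above (software versions, parameters), which this sketch leaves out.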
Prospective Provenance
The ‘recipes’ used to execute a computational task, e.g. the workflow specification.
Workflow Evolution
Tracking and capturing changes in workflow specifications and parameter settings, changes in the underlying software for a workflow step, or the addition or removal of a step
Attribution
Quality Assurance
Verification of Results
Debugging in case of failure/error
Reproducibility
Understandability
Reuse
Trust
Automated capture of provenance information (data) to document data dependencies and the derivation process.
Examples: Taverna, Galaxy, Kepler, WINGS, GenePattern, Pegasus, VisTrails, KNIME...
Limited understanding of Provenance factors and the associated artefacts - Leading to incomplete provenance
Heterogeneity in Workflow definition & Management - Heterogeneous Provenance capture, granularity & representation
Lack of standard Representation of Workflow-centric Analysis - Results in lack of understanding & interoperability between WMS and computing platforms
Different Provenance Representations resulting in heterogeneity
Customised research-based pipelines supporting individual scenarios
Facilitate bioinformatics workflow-centric research understanding and improve its reproducibility and re-use by analysing existing workflow definition approaches and identifying the fundamental elements of workflow provenance.
Building on this,
(*) The workflow life cycle is adapted from Deelman et al.
(**) The aggregation representation is adapted from these slides
Identification of Key Artefacts of Bioinformatics Workflow Provenance - To improve the understanding and ultimately capture of the provenance
Conceptual Provenance Framework applicable to all - To address the heterogeneity
Standardised Representation of Workflow Enactments - To ensure comprehensive analysis sharing including the resources associated with provenance and achieve interoperability
"different aspects of provenance"
"existing WMS"
Both classifications help in the evaluation of the existing WMSs with respect to their provenance capabilities.
Provenance Aspects
Provenance Supporting Resources
Variant Calling Workflow
Exemplar Workflow
Test Data (NA12878)
One from each Workflow Approach
This work is published and implementation details are available: https://doi.org/10.1186/s12859-017-1747-0
Note: Different workflow approaches have different assumptions; missing information varies
Factors out of control of an author
Conclusions
Adhering to the following proposed recommendations along with an explicit declaration of workflow specification will result in fine-grained and complete provenance capture.
Common recommendations, best practices and standard approaches for workflow design and workflow-centric analyses sharing, focused on improving the applications of provenance
Grouped these recommendations to identify fundamental artefacts and practices crucial for the capture of comprehensive provenance and to support the transparent sharing of workflow-centric studies.
Literature
Method
Outcome
The crucial provenance artefacts that are required in case of bioinformatics workflows
Impact of these artefacts on different applications of provenance
Implications of incomplete documentation of provenance
Level 0 - Trust, Prospective Provenance & Reuse
Level 1 - Retrospective Provenance & Reproducibility
Level 2 - Towards White-box Enactment
Level 3 - Understandability & Specificity
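The four levels can be represented as an ordered enumeration. This is a minimal Python sketch under one assumption that the slides do not state explicitly: that the hierarchy is cumulative, so reaching a level implies the lower ones. The names `ProvenanceLevel`, `LEVEL_FOCUS` and `levels_up_to` are hypothetical.

```python
from enum import IntEnum


class ProvenanceLevel(IntEnum):
    """Levels of the provenance framework, ordered from 0 to 3."""
    LEVEL_0 = 0
    LEVEL_1 = 1
    LEVEL_2 = 2
    LEVEL_3 = 3


# Focus of each level, paraphrasing the level titles.
LEVEL_FOCUS = {
    ProvenanceLevel.LEVEL_0: "Trust, prospective provenance and reuse",
    ProvenanceLevel.LEVEL_1: "Retrospective provenance and reproducibility",
    ProvenanceLevel.LEVEL_2: "Towards white-box enactment",
    ProvenanceLevel.LEVEL_3: "Understandability and specificity",
}


def levels_up_to(achieved: ProvenanceLevel) -> list:
    """Return every level implied by `achieved`.

    Assumption: the hierarchy is cumulative (each level builds on the
    ones below it).
    """
    return [level for level in ProvenanceLevel if level <= achieved]


for level in levels_up_to(ProvenanceLevel.LEVEL_2):
    print(level.name, "-", LEVEL_FOCUS[level])
```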
Standardised Format for the representation of a CWL workflow enactment, associated artefacts and the retrospective provenance
Provenance using PROV-Model
expanded with wfprov and wfdesc
Workflow specifications
Adapted from:
https://doi.org/10.5281/zenodo.1484286#page=8
Mechanism for serialization and transport consistency
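One common way to obtain transport consistency is a checksum manifest that travels with the files. The sketch below illustrates that idea in plain Python; the manifest layout (a dict of relative path to SHA-256 digest) and the function names are assumptions made for illustration, not the format used by CWLProv.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file path (relative to root, POSIX style) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict) -> list:
    """List manifest entries that are missing or whose content differs."""
    bad = []
    for rel, digest in manifest.items():
        p = root / rel
        if not p.is_file() or file_digest(p) != digest:
            bad.append(rel)
    return bad
```

A sender would call `build_manifest` before packaging, and a receiver would call `changed_files` after unpacking; an empty list means every listed file arrived unchanged.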
Step one:
Choose a feature-complete reference implementation of CWL
Ideal choice:
Why?
CWLProv is implemented as an optional module to cwltool and can be invoked if required
cwltool --provenance ROname workflow.cwl job.json
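The same invocation can be scripted. This is a minimal Python sketch, assuming `cwltool` is installed and on the `PATH`; the helper name `provenance_command` is hypothetical, and the file names are the placeholders from the command above.

```python
import shlex


def provenance_command(ro_dir: str, workflow: str, job: str) -> list:
    """Build the cwltool argument list that enables provenance capture.

    `ro_dir` is the folder in which the research object is written.
    """
    return ["cwltool", "--provenance", ro_dir, workflow, job]


cmd = provenance_command("ROname", "workflow.cwl", "job.json")
print(shlex.join(cmd))

# To actually run the workflow (requires cwltool and the input files):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the argument list separately from running it keeps the command easy to log alongside the resulting research object.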
Across computing platforms
Across executors
Despite being specified in a standardised workflow definition, this workflow lacked "explicit declaration"
Non-deterministic algorithm: number of threads -t and the seed length -K
$ cwlprov --help
usage: cwlprov [-h] [--version] [--directory DIRECTORY] [--relative]
               [--absolute] [--output OUTPUT] [--verbose] [--quiet] [--hints]
               [--no-hints]
               {validate,info,who,prov,inputs,outputs,run,runs,rerun,derived,runtimes}
               ...

cwlprov explores Research Objects containing provenance of Common Workflow
Language executions. <https://w3id.org/cwl/prov/>

commands:
  {validate,info,who,prov,inputs,outputs,run,runs,rerun,derived,runtimes}
    validate   Validate the CWLProv Research Object
    info       show research object Metadata
    who        show Who ran the workflow
    prov       export workflow execution Provenance in PROV format
    inputs     list workflow/step Input files/values
    outputs    list workflow/step Output files/values
    run        show workflow Execution log
    runs       List all workflow executions in RO
    rerun      Rerun a workflow or step
    derived    list what was Derived from a data item, based on
               activity usage/generation
    runtimes   calculate average step execution Runtimes
https://slides.com/soilandreyes/2018-10-29-cwlprov#/13/2
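A few of these subcommands can be chained to get a quick overview of a research object. The sketch below is a hedged Python wrapper: it relies only on the `--directory` option and the subcommand names shown in the help text above, while the helper names `cwlprov_command` and `inspect_ro` are hypothetical. Some subcommands may accept further arguments that are not covered here.

```python
import shutil
import subprocess

# Subcommand names copied from the help text above.
SUBCOMMANDS = (
    "validate", "info", "who", "prov", "inputs", "outputs",
    "run", "runs", "rerun", "derived", "runtimes",
)


def cwlprov_command(ro_dir: str, subcommand: str) -> list:
    """Build a cwlprov argument list for a research object folder."""
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unknown cwlprov subcommand: {subcommand}")
    return ["cwlprov", "--directory", ro_dir, subcommand]


def inspect_ro(ro_dir: str) -> dict:
    """Run validate, info and who; return their stdout keyed by subcommand.

    Returns an empty dict when cwlprov is not installed.
    """
    if shutil.which("cwlprov") is None:
        return {}
    results = {}
    for sub in ("validate", "info", "who"):
        proc = subprocess.run(
            cwlprov_command(ro_dir, sub), capture_output=True, text=True
        )
        results[sub] = proc.stdout
    return results
```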
Provenance capture and subsequent use to support published research transparency and integrity should not be treated as an after-thought but rather as a standard practice of utmost priority.
The assumption of black-box provenance often associated with the workflows
and used to justify the coarse-grained provenance of workflow steps should not be encouraged
Conceptual Provenance Framework
CWLProv - Standard Format
Practical Implementation to generate CWLProv ROs
Interoperability Demonstration across different platforms
Supporting Tool Development - Provenance analytics
The empirical case study and its findings are already contributing to shaping recent research - https://doi.org/10.7717/peerj.5551
The conceptual hierarchical provenance framework and the CWLProv format are now used as a guide by the Nextflow team to implement research object support for Nextflow pipelines - https://doi.org/10.5281/zenodo.1323830
A recent effort to extend toil-runner has commenced; it will provide support for CWLProv RO generation after workflow enactment.