CWLProv
Retrospective provenance capture and its challenges
Farah Zaib Khan, Stian Soiland-Reyes, Michael R. Crusoe, Richard O. Sinnott, Andrew Lonie
This work is licensed under
Let's begin with some key concepts
- Workflows; how are they designed and run?
- Why use workflows?
- Provenance
- Why should we care about Provenance?
https://slides.com/farahzkhan/cwlprov
Workflows
(Esp. scientific workflows)
"The description of a process for accomplishing a scientific objective, usually expressed in terms of tasks and their dependencies"
(Ludäscher et al. 2009)
How are workflows designed and run?
Short answer: In many, many different ways!
Long answer:
The full "incomplete" list contains 215 entries.
And then we have many such lists.
Three broad categories
Cpipe
Provenance
Information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness.
Provenance for Workflows?
Retrospective provenance
Formally: The detailed record of the implementation of a computational task including details of every executed process together with comprehensive information about the execution environment used to derive a specific data product.
Not so boring: All the details associated with a given workflow run
(hang on, next slide has a list of "all the details")
Prospective provenance
The ‘recipes’ used to execute a computational task, e.g. the workflow specification.
Best practices for workflow publishing and sharing
Levels of Provenance
CWLProv
Format for the representation of a CWL workflow run and its retrospective provenance
Keeping in view the best practices and defined levels
Provenance using the PROV model and the wfprov and wfdesc ontologies.
Workflow specifications
Why these standards?
Interoperable
Open source
Domain neutral
Community driven
After finalising the choice of standards and ontologies...
- Common Workflow Language
- Research Object
- PROV
- BagIt
- wfdesc, wfprov
File Structure of a CWLProv Research Object
- data/: input and output data checksums
- snapshot/: copies of the original workflow and tool specification files as-is (warning: might contain absolute paths or be host-specific).
- workflow/: (1) The CWL input object with data/ paths. (2) The workflow in executable format with relativised paths to re-run inside an RO.
- metadata/: provenance about the workflow run, its data products and manifest for the Research Object.
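As a rough illustration of this layout, the sketch below walks a research object directory and counts the files found under each of the four folders listed above. Only the folder names come from the slide; the function name `summarize_ro` and the idea of reporting file counts are assumptions made for this example, not part of CWLProv or cwltool.

```python
from pathlib import Path

# Top-level folders named on the slide; everything else here is illustrative.
RO_FOLDERS = ("data", "snapshot", "workflow", "metadata")


def summarize_ro(ro_root):
    """Return {folder: number of files beneath it} for a CWLProv-style RO.

    Hypothetical helper: a folder that is missing is reported with a
    count of 0 instead of raising an error.
    """
    root = Path(ro_root)
    summary = {}
    for name in RO_FOLDERS:
        folder = root / name
        if folder.is_dir():
            # Count regular files at any depth below the folder.
            summary[name] = sum(1 for p in folder.rglob("*") if p.is_file())
        else:
            summary[name] = 0
    return summary
```

Calling `summarize_ro("ROname")` on a research object directory would give a quick check that all four parts are present before sharing it.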
Design of sample provenance profile
And now the implementation!
Step one: Choose a feature-complete reference implementation of CWL
Obvious choice: cwltool
Process diagram for recording provenance
cwltool --provenance ROname workflow.cwl job.json
Which PROV format?
PROV-N
wasGeneratedBy(data:77eecc82607c19910a0f19f55f2c7d5bf7291680, id:da61ed64-6594-4997-981e-9c292366766c, 2018-06-05T15:33:58.343902, [prov:role='ex:main/create-tar/tar'])
PROV-JSON
"wasGeneratedBy": {
    "_:id6": {
        "prov:entity": "data:77eecc82607c19910a0f19f55f2c7d5bf7291680",
        "prov:activity": "id:da61ed64-6594-4997-981e-9c292366766c",
        "prov:time": "2018-06-05T15:33:58.343902",
        "prov:role": {
            "$": "ex:main/create-tar/tar",
            "type": "prov:QUALIFIED_NAME"
        }
    },
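The two serializations on this slide describe the same generation event. The sketch below is one way to check that using only the Python standard library: it parses the PROV-JSON excerpt (wrapped in outer braces so it is a valid JSON document) and prints each record in PROV-N style. The helper names `generation_records` and `to_prov_n` are hypothetical, and the code handles only the fields shown on the slide, so it should not be read as a general PROV converter.

```python
import json

# The excerpt from the slide, wrapped in braces so it parses as a document.
PROV_JSON = """
{
  "wasGeneratedBy": {
    "_:id6": {
      "prov:entity": "data:77eecc82607c19910a0f19f55f2c7d5bf7291680",
      "prov:activity": "id:da61ed64-6594-4997-981e-9c292366766c",
      "prov:time": "2018-06-05T15:33:58.343902",
      "prov:role": {
        "$": "ex:main/create-tar/tar",
        "type": "prov:QUALIFIED_NAME"
      }
    }
  }
}
"""


def generation_records(doc):
    """Yield (entity, activity, time, role) for each wasGeneratedBy entry."""
    for record in doc.get("wasGeneratedBy", {}).values():
        role = record.get("prov:role", {})
        yield (
            record["prov:entity"],
            record["prov:activity"],
            record.get("prov:time"),
            # The slide encodes the role as a typed value under the "$" key.
            role.get("$") if isinstance(role, dict) else role,
        )


def to_prov_n(entity, activity, time, role):
    """Format one record in the PROV-N style shown on the slide."""
    return f"wasGeneratedBy({entity}, {activity}, {time}, [prov:role='{role}'])"


records = list(generation_records(json.loads(PROV_JSON)))
for rec in records:
    print(to_prov_n(*rec))
```

The printed line matches the PROV-N statement above, which is the sense in which the choice of PROV format affects readability and tooling rather than content.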
How can CWLProv help you?
Automated capture of methods and data
Effective "sharing" of your analysis within or outside your lab
"Publishing" standardized methods along with a publication or manuscript submission
Don't care about sharing (you should)? Think about analyzing or re-using your own workflow a few months later.
Who is happy?
Collaborators
Peers
Community
Future you
Reviewer
Editor
Reader/end-user
What can you achieve?
Who did this?
When did this happen?
Using what?
Provenance
Attribution
Accreditation
Quality Assurance
Verification
Debugging
Reproducibility
Is everything perfect? All problems solved? Unicorns exist?
Sadly, no. Like every project, we are in an iterative process of continuous improvement and updates...
Levels of Provenance
Is everything perfect? All problems solved? Unicorns exist?
But we have good news...
Big shout out to Nextflow:
We are here for feedback
Your feedback matters because you are the stakeholder!
Questions?
We will be here for the 4 days of the Co-fest; feel free to come and have a chat.