CWLProv

Retrospective provenance capture and its challenges 

Farah Zaib Khan, Stian Soiland-Reyes, Michael R. Crusoe, Richard O. Sinnott, Andrew Lonie

@farahzk03

https://slides.com/farahzkhan/cwlprov

Let's begin with some key concepts 

  • Workflows; how are they designed and run?

  • Why use workflows?

  • Provenance

  • Why should we care about Provenance?  

Workflows

(Esp. scientific workflows)

"The description of a process for accomplishing a scientific objective, usually expressed in terms of tasks and their dependencies"

(Ludäscher et al. 2009)

How are workflows designed and run?

Short answer: In many, many different ways!

Long answer:

The full "incomplete" list contains 215 entries.

And there are many such lists.

 Three broad categories

Cpipe

Provenance

Information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness.

Provenance for Workflows?

Retrospective provenance

Formally: The detailed record of the implementation of a computational task including details of every executed process together with comprehensive information about the execution environment used to derive a specific data product.

Not so boring: All the details associated with a given workflow run

(hang on, next slide has a list of "all the details")

Prospective provenance

The ‘recipes’ used to execute a computational task, e.g. the workflow specification.

Best practices for workflow publishing and sharing

Levels of Provenance

CWLProv

A format for representing a CWL workflow run and its retrospective provenance, designed with the best practices and the defined levels of provenance in view.

Provenance using the PROV model and the wfprov and wfdesc ontologies.

Workflow specifications

Why these standards?

Interoperable

Open source

Domain neutral

Community driven

After finalising the choice of standards and ontologies...

  • Common Workflow Language
  • Research Object
  • PROV
  • BagIt
  • wfdesc, wfprov

File Structure of a CWLProv Research Object

 

  • data/: input and output data, identified by their checksums.
  • snapshot/: copies of the original workflow and tool specification files, as-is (warning: they might contain absolute paths or be host-specific).
  • workflow/: (1) the CWL input object with data/ paths; (2) the workflow in executable format, with relativised paths so it can be re-run inside an RO.
  • metadata/: provenance about the workflow run and its data products, plus the manifest for the Research Object.
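
The layout above lends itself to a quick sanity check before sharing a Research Object. The sketch below is only an illustration: the function name `missing_folders` and the constant `EXPECTED_FOLDERS` are hypothetical, and it assumes nothing beyond the four folder names listed on this slide (a real Research Object may contain additional files).

```python
import tempfile
from pathlib import Path
from typing import List

# Top-level folders named on the slide; a real Research Object may hold more.
EXPECTED_FOLDERS = ("data", "snapshot", "workflow", "metadata")


def missing_folders(ro_root: Path) -> List[str]:
    """Return the expected folders that are absent from ro_root."""
    return [name for name in EXPECTED_FOLDERS if not (ro_root / name).is_dir()]


if __name__ == "__main__":
    # Build a throwaway, deliberately incomplete layout to show the check.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("data", "metadata"):
            (root / name).mkdir()
        print("Missing:", missing_folders(root))
```

An empty result means all four folders are present; it says nothing about whether their contents are complete or valid.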

 

Design of sample provenance profile

And now the implementation!

Step one:

Choose a feature-complete reference implementation of CWL

Obvious choice:

cwltool

Process diagram for recording provenance

cwltool --provenance ROname workflow.cwl job.json
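
For scripted runs, the same command can be launched from Python. This is a minimal sketch that only wraps the command line shown above; the helper names `provenance_command` and `run_with_provenance` are hypothetical, and it assumes `cwltool` is installed and on the PATH.

```python
import shlex
import subprocess
from typing import List


def provenance_command(ro_dir: str, workflow: str, job: str) -> List[str]:
    """Build the argument list for the cwltool call shown on the slide."""
    return ["cwltool", "--provenance", ro_dir, workflow, job]


def run_with_provenance(ro_dir: str, workflow: str, job: str) -> int:
    """Run the workflow and return cwltool's exit code."""
    cmd = provenance_command(ro_dir, workflow, job)
    print("Running:", shlex.join(cmd))
    # check=False: let the caller decide how to react to a failed run.
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # File names are the placeholders used on the slide.
    raise SystemExit(run_with_provenance("ROname", "workflow.cwl", "job.json"))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths contain spaces.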

Which PROV format?

PROV-N

wasGeneratedBy(data:77eecc82607c19910a0f19f55f2c7d5bf7291680, id:da61ed64-6594-4997-981e-9c292366766c, 
        2018-06-05T15:33:58.343902, [prov:role='ex:main/create-tar/tar'])

PROV-JSON

  "wasGeneratedBy": {
    "_:id6": {
      "prov:entity": "data:77eecc82607c19910a0f19f55f2c7d5bf7291680",
      "prov:activity": "id:da61ed64-6594-4997-981e-9c292366766c",
      "prov:time": "2018-06-05T15:33:58.343902",
      "prov:role": {
        "$": "ex:main/create-tar/tar",
        "type": "prov:QUALIFIED_NAME"
      }
    },
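
Both serializations carry the same statement, which a few lines of Python can illustrate. The sketch below is not how cwltool produces its output: it takes the PROV-JSON record from this slide, wraps it in the closing braces the excerpt omits so that it parses as a standalone document (an assumption), and renders it as a single-line PROV-N statement. The function names are hypothetical.

```python
import json
from typing import Dict, List

# The wasGeneratedBy record from the slide, with closing braces added
# (an assumption) so the excerpt parses as a complete JSON document.
PROV_JSON = """
{
  "wasGeneratedBy": {
    "_:id6": {
      "prov:entity": "data:77eecc82607c19910a0f19f55f2c7d5bf7291680",
      "prov:activity": "id:da61ed64-6594-4997-981e-9c292366766c",
      "prov:time": "2018-06-05T15:33:58.343902",
      "prov:role": {
        "$": "ex:main/create-tar/tar",
        "type": "prov:QUALIFIED_NAME"
      }
    }
  }
}
"""


def generation_to_provn(record: Dict) -> str:
    """Render one wasGeneratedBy record as a single-line PROV-N statement."""
    return "wasGeneratedBy({}, {}, {}, [prov:role='{}'])".format(
        record["prov:entity"],
        record["prov:activity"],
        record["prov:time"],
        record["prov:role"]["$"],
    )


def render_generations(doc_text: str) -> List[str]:
    """Parse a PROV-JSON document and render each generation record."""
    doc = json.loads(doc_text)
    return [generation_to_provn(r) for r in doc.get("wasGeneratedBy", {}).values()]


if __name__ == "__main__":
    for line in render_generations(PROV_JSON):
        print(line)
```

The sketch handles only this one record shape; a general converter would need to cover the other PROV relations and optional attributes.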

How can CWLProv help you?

Automated capture of methods and data

Effective "sharing" of your analysis within and outside your lab

"Publishing" standardized methods along with a publication or manuscript submission

Don't care about sharing (you should)? Think about analyzing or re-using your own workflow a few months later.

Who is happy?

Collaborators

Peers

Community

Future you

Reviewer

Editor

Reader/end-user

What can you achieve?

Who did this? 

When did this happen?

Using what? 

Provenance

Attribution

Accreditation

Quality Assurance

Verification

Debugging

Reproducibility

Is everything perfect? All problems solved? Unicorns exist? 

Sadly, no. Like every project, we are in an iterative process of continuous improvement and updates.

Levels of Provenance

But we have good news.

Big shout out to Nextflow:

https://github.com/edgano/researchObject-Nextflow

We are here for feedback 

Your feedback matters because you are the stakeholder!

Questions? 

We will be here for the 4 days of Co-fest; feel free to come and have a chat.
