Challenges in interoperable provenance capture

Stian Soiland-Reyes

eScience lab, The University of Manchester

RDA-Europe, Data provenance approaches
Barcelona, 2018-01-15 17:30

Scientific Workflows

stain@biggie-mint ~/src/taverna-prov/example $ executeworkflow -embedded \
 -provbundle helloanyone.bundle.zip \
 -inputvalue name fred helloanyone.t2flow 

Provenance bundle zip will be saved to: /home/stain/src/taverna-prov/example/helloanyone.bundle.zip

stain@biggie-mint ~/src/taverna-prov/example $ mkdir helloanyone.bundle ; cd helloanyone.bundle
stain@biggie-mint ~/src/taverna-prov/example/helloanyone.bundle $ unzip ../helloanyone.bundle.zip 
Archive:  ../helloanyone.bundle.zip
 extracting: mimetype                
   creating: inputs/
  inflating: inputs/name.txt         
   creating: outputs/
  inflating: outputs/greeting.txt    
   creating: intermediates/
   creating: intermediates/3a/
  inflating: intermediates/3a/3a82e39d-a537-40cf-91a0-2c89d4a2e62b.txt  
  inflating: workflowrun.prov.ttl    
  inflating: workflow.wfbundle       
   creating: .ro/
   creating: .ro/annotations/
  inflating: .ro/annotations/workflow.wfdesc.ttl  
  inflating: .ro/annotations/a2f03983-8836-4c36-bfb2-d713d9a1928f.ttl  
  inflating: .ro/manifest.json
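
The listing above shows the bundle is an ordinary ZIP archive, so it can be inspected without unpacking it. The following is a minimal sketch using only the Python standard library; `summarize_bundle` and `demo_bundle` are hypothetical helper names, and the demo archive uses placeholder contents rather than the real output of `executeworkflow`.

```python
import io
import json
import zipfile


def summarize_bundle(bundle):
    """Return (mimetype, manifest, names) for a bundle path or file-like object."""
    with zipfile.ZipFile(bundle) as zf:
        names = zf.namelist()
        mimetype = None
        if "mimetype" in names:
            mimetype = zf.read("mimetype").decode("ascii").strip()
        manifest = None
        if ".ro/manifest.json" in names:
            manifest = json.loads(zf.read(".ro/manifest.json"))
    return mimetype, manifest, names


def demo_bundle():
    """Build a tiny in-memory stand-in for a bundle (placeholder contents)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/x-example")
        zf.writestr(".ro/manifest.json", json.dumps({"aggregates": []}))
        zf.writestr("outputs/greeting.txt", "Hello, fred")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    mimetype, manifest, names = summarize_bundle(demo_bundle())
    print(mimetype)
    print(sorted(names))
```

For a real run, pass the path `helloanyone.bundle.zip` instead of the in-memory demo archive.
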
cwlVersion: v1.0
class: Workflow
inputs:
  inp: File
  ex: string

outputs:
  classout:
    type: File
    outputSource: compile/classfile

steps:
  untar:
    run: tar-param.cwl
    in:
      tarfile: inp
      extractfile: ex
    out: [example_out]

  compile:
    run: arguments.cwl
    in:
      src: untar/example_out
    out: [classfile]

https://doi.org/10.7490/f1000research.1114781.1

Farah Z Khan
BOSC hackathon 2017

Prototype PROV+RO export
CWL reference implementation

 

Copyright © 2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark and document use rules apply.

PROV Model Primer

W3C Working Group Note 30 April 2013

Which PROV format?

PROV-XML

<prov:wasGeneratedBy>
  <prov:entity prov:ref="ex:ent1"/>
  <prov:activity prov:ref="ex:act1"/>
  <prov:time>2017-10-26T21:32:52Z</prov:time>
  <ex:port>p1</ex:port>
</prov:wasGeneratedBy>

PROV-N

wasGeneratedBy(ent1, act1, 
  2017-10-26T21:32:52Z, [ex:port="p1"])

PROV-O Turtle

:ent1
  a prov:Entity;
  prov:wasGeneratedBy :act1;
  prov:generatedAtTime "2017-10-26T21:32:52Z"^^xsd:dateTime ;
  ex:port "p1" .

PROV-JSON

    "wasGeneratedBy": {
        "ex:gen1": {
            "prov:entity": "ent1",
            "prov:activity": "act1",
            "prov:time": "2017-10-26T21:32:52Z",
            "ex:port": "p1"
        }
    }

PROV-O JSON-LD

{ "@context": { .. }, 
  "@id": "ent1",
  "@type": "prov:Entity",
  "ex:port": "p1",
  "prov:generatedAtTime":  "2017-10-26T21:32:52Z",
  "prov:wasGeneratedBy": {
    "@id": "act1",
    "@type": "prov:Activity"
  } 
}
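
To make the format question concrete, here is a small sketch that renders one generation record in two of the serializations above, PROV-N and PROV-JSON, from the same Python values. The function names are hypothetical and this is plain string and dict construction, not a PROV library; it assumes PROV-N's convention of writing optional attributes in square brackets.

```python
import json


def generation_as_provn(entity, activity, time, attrs):
    """Render one generation record as a PROV-N statement."""
    parts = [entity, activity, time]
    if attrs:
        attr_text = ", ".join(f'{key}="{value}"' for key, value in attrs.items())
        parts.append(f"[{attr_text}]")
    return "wasGeneratedBy(" + ", ".join(parts) + ")"


def generation_as_provjson(record_id, entity, activity, time, attrs):
    """Render the same record as a PROV-JSON style fragment (a plain dict)."""
    body = {"prov:entity": entity, "prov:activity": activity, "prov:time": time}
    body.update(attrs)
    return {"wasGeneratedBy": {record_id: body}}


if __name__ == "__main__":
    attrs = {"ex:port": "p1"}
    print(generation_as_provn("ent1", "act1", "2017-10-26T21:32:52Z", attrs))
    fragment = generation_as_provjson(
        "ex:gen1", "ent1", "act1", "2017-10-26T21:32:52Z", attrs
    )
    print(json.dumps(fragment, indent=4))
```

The same four values end up in quite different shapes, which is why a consumer generally needs tooling rather than ad hoc parsing.
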

Tooling to the rescue

How to identify the workflow?

Permalink URI scheme

https://w3id.org/cwl/view/{scheme}/{commit}/{path}#{fragment}
  • https://w3id.org/cwl/view/ fixed prefix at permalink service https://w3id.org/
  • {scheme} - source code management protocol, currently only git supported:
    • {commit} - full git commit sha1 id (no branches or short commits allowed)
    • {path} - relative path to .cwl file within a checkout of that git commit
    • #{fragment} - optional part within CWL file, e.g. #main

Any git permalinks are resolved using https://view.commonwl.org/git which - if it knows about that particular git commit - will content-negotiate to provide various representations.

Anyone can mint these permalinks for .cwl files for a given commit, in any public or private git repository, given no uncommitted files or git submodules.
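
The minting rules above can be sketched as a small function. This is an illustration only: `cwl_permalink` is a hypothetical helper, not part of any CWL tool, and it checks only the constraints stated on this slide (git scheme, full SHA-1, relative path).

```python
import re

PREFIX = "https://w3id.org/cwl/view/"
FULL_SHA1 = re.compile(r"[0-9a-f]{40}")


def cwl_permalink(commit, path, fragment=None, scheme="git"):
    """Build a permalink for a .cwl file at a specific git commit."""
    if scheme != "git":
        raise ValueError("only the git scheme is currently supported")
    if not FULL_SHA1.fullmatch(commit):
        raise ValueError("commit must be a full 40-character SHA-1 id")
    if path.startswith("/"):
        raise ValueError("path must be relative to the git checkout")
    url = f"{PREFIX}{scheme}/{commit}/{path}"
    if fragment:
        # Accept the fragment with or without its leading '#'.
        url += "#" + fragment.lstrip("#")
    return url


if __name__ == "__main__":
    print(cwl_permalink(
        "933bf2a1a1cce32d88f88f136275535da9df0954",
        "workflows/hello/hello.cwl",
        "main",
    ))
```
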

wasAssociatedWith(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 
      engine:b2210211-8acb-4d58-bd28-2a36b18d3b4f, 
      https://w3id.org/cwl/view/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/hello/hello.cwl#main)

agent(engine:b2210211-8acb-4d58-bd28-2a36b18d3b4f, 
      [prov:type='prov:SoftwareAgent', prov:type='wfprov:WorkflowEngine', 
      prov:label="cwltool v1.2.5"])

A workflow definition (prospective provenance)
can be executed multiple times (retrospective provenance)

(and on different machines)

workflow definition - a recipe (prov:Plan)

workflow instance - a recipe, fully configured to run (prov:Plan)

workflow run - an execution of a workflow instance (prov:Activity)

Run a Command line tool


cwlVersion: v1.0
class: CommandLineTool

baseCommand: [ esl-sfetch, --index ]


inputs:
  sequences:
    type: File
    inputBinding:
      position: 1
      valueFrom: $(self.basename)

Where does esl-sfetch come from?
Which version? How is it configured?

Containers to the rescue

Step in workflow ~= tool execution?

Scatter/gather

To use scatter/gather, ScatterFeatureRequirement must be specified in the workflow or workflow step requirements.

A "scatter" operation specifies that the associated workflow step or subworkflow should execute separately over a list of input elements. Each job making up a scatter operation is independent and may be executed concurrently.

  • dotproduct specifies that each of the input arrays are aligned and one element taken from each array to construct each job. It is an error if all input arrays are not the same length.

  • nested_crossproduct specifies the Cartesian product of the inputs, producing a job for every combination of the scattered inputs. The output must be nested arrays for each level of scattering, in the order that the input arrays are listed in the scatter field.

  • flat_crossproduct specifies the Cartesian product of the inputs, producing a job for every combination of the scattered inputs. The output arrays must be flattened to a single level, but otherwise listed in the order that the input arrays are listed in the scatter field.
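
The three scatter methods differ only in how jobs are combined and how results are shaped, which matters for provenance because each job becomes its own activity. The sketch below enumerates the job inputs for each method; it is an illustration of the combination rules, not code from any CWL engine, and the function names simply mirror the method names.

```python
from itertools import product


def dotproduct(inputs):
    """One job per position; all scattered arrays must have equal length."""
    names = list(inputs)
    if len({len(inputs[name]) for name in names}) > 1:
        raise ValueError("dotproduct requires arrays of the same length")
    return [dict(zip(names, values))
            for values in zip(*(inputs[name] for name in names))]


def flat_crossproduct(inputs):
    """One job per combination, returned as a single flat list."""
    names = list(inputs)
    return [dict(zip(names, values))
            for values in product(*(inputs[name] for name in names))]


def nested_crossproduct(inputs):
    """One job per combination, nested one level per scattered input."""
    names = list(inputs)

    def nest(level, chosen):
        if level == len(names):
            return dict(chosen)
        name = names[level]
        return [nest(level + 1, chosen + [(name, value)])
                for value in inputs[name]]

    return nest(0, [])


if __name__ == "__main__":
    scattered = {"a": [1, 2], "b": ["x", "y"]}
    print(len(dotproduct(scattered)))         # 2 jobs
    print(len(flat_crossproduct(scattered)))  # 4 jobs
    print(nested_crossproduct(scattered))
```

With two arrays of length 2, dotproduct yields 2 jobs while both crossproducts yield 4, so a single workflow step can correspond to a varying number of tool executions.
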

3.5 Expressions

An expression is a fragment of JavaScript/ECMAScript 5.1 code evaluated by the workflow platform to affect the inputs, outputs, or behavior of a process.

 

Expressions are denoted by the syntax $(...) or ${...}.
A code fragment wrapped in the $(...) syntax must be evaluated as an ECMAScript expression.
A code fragment wrapped in the ${...} syntax must be evaluated as an ECMAScript function body for an anonymous, zero-argument function.
Expressions must return a valid JSON data type: one of null, string, number, boolean, array, object. 

Conditional branching

  step2:
    in: [threshold]
    out: [out]
    switch:
      "$(inputs.threshold > 2)": high.cwl
      "$(inputs.threshold == 1)": low.cwl
      default:
        result:
          out: 0

Nested workflows

A single activity unrolled to multiple steps

 

 

activity(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, -, -, 
   [prov:type='wfprov:WorkflowRun', prov:label="Run of workflow/packed.cwl#main"])    
    // main workflow run started outside somehow (we don't know how)
    wasStartedBy(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, -, -, 
                 -, 2017-10-27T15:00:00Z)
    // ...
    // step is a nested workflow, so also a WorkflowRun
    activity(run:4305467e-6dfb-11e7-885d-0242ac110002, -, -, 
      [prov:type='wfprov:WorkflowRun', prov:label="Run of workflow/packed.cwl#main/nested1"])
        // started by the mother activity
        wasStartedBy(run:4305467e-6dfb-11e7-885d-0242ac110002, -, -, 
                     run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T15:00:30Z)
    
        // inner step of nested workflow, ProcessRun as this is a command line execution
        activity(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, -, -, 
          [prov:type='wfprov:ProcessRun', prov:label="Run of workflow/packed.cwl#nested/innerStep1"])
            
        wasStartedBy(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, -, -, 
                     run:4305467e-6dfb-11e7-885d-0242ac110002, 2017-10-27T15:01:00Z)
        // ...
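
A consumer that wants the execution hierarchy back has to rebuild it from the wasStartedBy starters. The sketch below does that for the three runs in the trace above, treating the main run as having no known starter; `build_run_tree` and `render` are hypothetical helpers and the input is a hand-written list of pairs, not parsed PROV-N.

```python
from collections import defaultdict


def build_run_tree(started_by):
    """Group runs by their starter; started_by holds (run, starter or None)."""
    children = defaultdict(list)
    roots = []
    for run, starter in started_by:
        if starter is None:
            roots.append(run)
        else:
            children[starter].append(run)
    return roots, dict(children)


def render(run, children, depth=0):
    """Return indented lines showing a run and everything it started."""
    lines = ["  " * depth + run]
    for child in children.get(run, []):
        lines.extend(render(child, children, depth + 1))
    return lines


if __name__ == "__main__":
    main = "run:2e1287e0-6dfb-11e7-8acf-0242ac110002"
    nested = "run:4305467e-6dfb-11e7-885d-0242ac110002"
    inner = "run:c42dc36e-6dfd-11e7-bc24-0242ac110002"
    roots, children = build_run_tree([(main, None), (nested, main), (inner, nested)])
    for root in roots:
        print("\n".join(render(root, children)))
```
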

Identifying intermediate data

Output 1B file is also Input 2C and Input 3D downstream

Simple filenames -> duplications

./data/step1/outputB.txt 
./data/step2/inputC.txt
./data/step3/inputD.txt

 

Content-addressable

SHA-256 hash of bytes as filename:

./data/51/51fb8af0c4ae0422fbe88340d91880ecb9d7537cf57339c1cf1256b7ca58f32d

RFC6920 URI as global identifier:

nih:sha-256;51fb8af0c4ae0422fbe88340d91880ecb9d7537cf57339c1cf1256b7ca58f32d
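
A minimal sketch of the content-addressable scheme, assuming the layout shown above (first two hex characters as the directory name, full hash as the filename). `content_address` is a hypothetical helper name.

```python
import hashlib


def content_address(data: bytes):
    """Return (relative path, nih URI) for a byte string."""
    digest = hashlib.sha256(data).hexdigest()
    path = f"data/{digest[:2]}/{digest}"
    uri = f"nih:sha-256;{digest}"
    return path, uri


if __name__ == "__main__":
    path, uri = content_address(b"")
    print(path)
    print(uri)
```

Identical bytes always map to the same path and URI, so a file that is both a step output and a downstream input is stored once.
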

All-in-one prov trace -> messy

 

identifiers mismatch (e.g. "step1" both in #main and #nested)

Multiple wasGeneratedBy for same entity

Does everyone need to understand the execution hierarchy?

Multiple workflow PROV profiles combined?

prov:alternateOf

Relating global identifier to local paths

used(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 
     data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03, 
     2017-10-27T14:29:00+01:00, [prov:role='wf:main/input1'])

entity(data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03)

   // which we have stored a copy of within the research object
   specializationOf(./data/58/5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03,
                    data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03)

... but what about multiple workflows creating the same bytes?

e.g. when was the empty string generated?

used(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 
     urn:uuid:f940c301-46fd-4a6b-808d-d6beed700f3a, 
     2017-10-27T14:29:00+01:00, [prov:role='wf:main/input1'])

used(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 
     urn:uuid:63a0ff1b-45c6-41cb-97bf-2da7aa93ec0f, 
     2017-10-27T14:29:05+01:00, [prov:role='wf:main/input2'])

entity(data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03)
   // Different UUID for each occurrence
   specializationOf(urn:uuid:f940c301-46fd-4a6b-808d-d6beed700f3a,
                    data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03)
   specializationOf(urn:uuid:63a0ff1b-45c6-41cb-97bf-2da7aa93ec0f,
                    data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03)
   // Also available as bytes in the Research Object
   specializationOf(./data/58/5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03,
                    data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03)

What about multiple workflows creating the same bytes?

When was the empty string generated?

 

Workaround: "virtual" entity for every role / activity occurrence
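
The workaround can be sketched as follows: each time a value is used, mint a fresh UUID for that occurrence and relate it to the content-addressed entity with specializationOf. `virtual_entity_statements` is a hypothetical helper that only formats PROV-N style strings; it is not taken from any CWL implementation.

```python
import uuid


def virtual_entity_statements(run, data_id, role, time):
    """Mint a per-occurrence entity and return (occurrence id, statements)."""
    occurrence = f"urn:uuid:{uuid.uuid4()}"
    statements = [
        f"used({run}, {occurrence}, {time}, [prov:role='{role}'])",
        f"specializationOf({occurrence}, {data_id})",
    ]
    return occurrence, statements


if __name__ == "__main__":
    run = "run:2e1287e0-6dfb-11e7-8acf-0242ac110002"
    data = "data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03"
    for role in ("wf:main/input1", "wf:main/input2"):
        _, lines = virtual_entity_statements(run, data, role, "2017-10-27T14:29:00+01:00")
        print("\n".join(lines))
```

Two uses of the same bytes then get distinct identifiers, so statements such as generation time attach to the occurrence rather than to every copy of those bytes.
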

Simplify:
multiple PROV files

Different "world views" of what happened

metadata/provenance/2e1287e0-6dfb-11e7-8acf-0242ac110002.prov.jsonld

metadata/provenance/4305467e-6dfb-11e7-885d-0242ac110002.prov.jsonld

metadata/provenance/c42dc36e-6dfd-11e7-bc24-0242ac110002.prov.jsonld
     Bonus: Obvious slot for tool-specific provenance

 

Prospective provenance? UUIDv5 hash of permalink

metadata/prospective/39ab126a-e0c9-5cac-a67e-2b7fdb8ad25f/cwl.ttl

metadata/prospective/39ab126a-e0c9-5cac-a67e-2b7fdb8ad25f/wfdesc.ttl

metadata/prospective/39ab126a-e0c9-5cac-a67e-2b7fdb8ad25f/pplan.ttl
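
A sketch of deriving the directory name from the permalink with a name-based (version 5) UUID. The slide does not say which UUID namespace is used; `uuid.NAMESPACE_URL` is an assumption here, so the result is not expected to reproduce the identifier shown above. `prospective_dir` is a hypothetical helper name.

```python
import uuid


def prospective_dir(permalink):
    """Map a workflow permalink to a stable metadata directory."""
    # Assumption: the URL namespace; the actual namespace is not stated.
    workflow_id = uuid.uuid5(uuid.NAMESPACE_URL, permalink)
    return f"metadata/prospective/{workflow_id}/"


if __name__ == "__main__":
    link = ("https://w3id.org/cwl/view/git/"
            "933bf2a1a1cce32d88f88f136275535da9df0954/workflows/hello/hello.cwl")
    print(prospective_dir(link) + "cwl.ttl")
```

Because UUIDv5 is a hash of the name, the same permalink always yields the same directory, with no registry needed.
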

Workflow provenance profiles

How to tie it all together?

id:        doi:10.15490/seek.1.investigation.56
createdOn: 2015-07-10T16:46:00Z
createdBy: http://orcid.org/0000-0001-9842-9718

aggregates:
 - id:         data/sequence/specimen5.bam
   conformsTo: http://gemrb.org/iesdp/file_formats/ie_formats/bam_v1.htm

 - id:         http://example.com/blog/about-specimen5
   authoredBy: http://orcid.org/0000-0001-7066-3350

 - id:         http://www.myexperiment.org/workflows/3355
   history:    provenance/workflow-evolution.ttl

annotations:
 - about:       data/sequence/specimen5.bam
   content:     annotations/specimen5-properties.jsonld
   createdBy:   http://orcid.org/0000-0001-7066-3350

 - about:       data/sequence/specimen5.bam
   content:     http://example.com/blog/about-specimen5
   motivatedBy: oa:questioning

Research Object manifest

(simplified)
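
One thing a manifest like this makes checkable is whether every annotation targets something the Research Object actually aggregates. The sketch below runs that check over a Python dict mirroring the simplified manifest above; the key names follow the slide's simplified form rather than the full manifest vocabulary, and `unaggregated_annotation_targets` is a hypothetical helper.

```python
def unaggregated_annotation_targets(manifest):
    """Return annotation targets that are neither aggregated nor the RO itself."""
    known = {entry["id"] for entry in manifest.get("aggregates", [])}
    if "id" in manifest:
        known.add(manifest["id"])
    targets = {annotation["about"] for annotation in manifest.get("annotations", [])}
    return sorted(targets - known)


MANIFEST = {
    "id": "doi:10.15490/seek.1.investigation.56",
    "aggregates": [
        {"id": "data/sequence/specimen5.bam"},
        {"id": "http://example.com/blog/about-specimen5"},
        {"id": "http://www.myexperiment.org/workflows/3355"},
    ],
    "annotations": [
        {"about": "data/sequence/specimen5.bam",
         "content": "annotations/specimen5-properties.jsonld"},
        {"about": "data/sequence/specimen5.bam",
         "content": "http://example.com/blog/about-specimen5"},
    ],
}

if __name__ == "__main__":
    print(unaggregated_annotation_targets(MANIFEST))
```
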

Reuse standards:
OAI-ORE, BagIt, W3C JSON-LD, PROV, Web Annotation Model

metadata/manifest.json
data/sequence/specimen5.bam
provenance/workflow-evolution.ttl
http://example.com/blog/about-specimen5
http://www.myexperiment.org/workflows/335

http://orcid.org/0000-0001-7066-3350
http://gemrb.org/iesdb/
   file_formats_ie_formats_bam_v1.html

Who is using Research Objects?

Structure of CWL run Research Object:

 

  • data: content-addressable by sha256 hash
  • workflow: input object (JSON file) with relativised paths, plus packed.cwl, an executable workflow containing the workflow specification and tool specifications with relativised paths, so it can be re-run inside an RO.
  • snapshot: This directory contains copies of the original workflow and tool specifications files as-is (warning: might contain absolute paths or be host-specific).
  • metadata: provenance about the workflow run, its data products and manifest for this Research Object.

 

document
prefix wfprov <http://purl.org/wf4ever/wfprov#>
prefix prov <http://www.w3.org/ns/prov#>
prefix wfdesc <http://purl.org/wf4ever/wfdesc#>
prefix wf <https://w3id.org/cwl/view/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/hello/hello.cwl#>
prefix input <app://579c1b74-b328-4da6-80a8-a2ffef2ac9b5/workflow/input.json#>
prefix run <urn:uuid:>
prefix engine <urn:uuid:>
prefix data <nih:sha-256;>

default <app://579c1b74-b328-4da6-80a8-a2ffef2ac9b5/>

// Level 1 provenance of workflow run

activity(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, -, -, [prov:type='wfprov:WorkflowRun', prov:label="Run of workflow/packed.cwl#main"])
    wasStartedBy(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, -, -, -, 2017-10-27T14:24:00+01:00)  

    // The engine is the SoftwareAgent that is executing our Workflow plan
    wasAssociatedWith(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, engine:b2210211-8acb-4d58-bd28-2a36b18d3b4f, wf:main)
        agent(engine:b2210211-8acb-4d58-bd28-2a36b18d3b4f, [prov:type='prov:SoftwareAgent', prov:type='wfprov:WorkflowEngine', prov:label="cwltool v1.2.5"])
        // prov has no term to relate sub-plans - we'll use wfdesc:hasSubProcess
        entity(wf:main,[prov:type='wfdesc:Workflow', prov:type='prov:Plan', wfdesc:hasSubProcess='wf:main/step1', wfdesc:hasSubProcess='wf:main/step2'])
            alternateOf(wf:main, workflow/packed.cwl)
            entity(wf:main/step1,[prov:type='wfdesc:Process', prov:type='prov:Plan'])
            entity(wf:main/step2,[prov:type='wfdesc:Process', prov:type='prov:Plan'])            

    // First the workflow uses some data; here with a sha256 identifier
    used(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03, 2017-10-27T14:29:00+01:00, [prov:role='wf:main/input1'])
        entity(data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03, [prov:type='wfprov:Artifact'])
            // which we have stored a copy of within the research object
            specializationOf(data/58/5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03, data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03)

    // Then there was another activity - wfprov:ProcessRun indicating a command line tool
    activity(run:4305467e-6dfb-11e7-885d-0242ac110002, -, -, [prov:type='wfprov:ProcessRun', prov:label="Run of workflow/packed.cwl#main/step1"])
        // started by the mother activity
        wasStartedBy(run:4305467e-6dfb-11e7-885d-0242ac110002, -, -, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T15:00:00+01:00)
        // same engine using step1 as plan. In a distributed scenario there might be a different engine
        wasAssociatedWith(run:4305467e-6dfb-11e7-885d-0242ac110002, engine:b2210211-8acb-4d58-bd28-2a36b18d3b4f, wf:main/step1)
        // This activity also uses the same data, but in a different role (e.g. input parameter)
        used(run:4305467e-6dfb-11e7-885d-0242ac110002, data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03, 2017-10-27T14:00:00+01:00, [prov:role='wf:main/step1/in1'])

        // And we generate some new data
        wasGeneratedBy(data:00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c, run:4305467e-6dfb-11e7-885d-0242ac110002, 2017-10-27T16:00:00+01:00, [prov:role='wf:main/step1/out1'])
            entity(data:00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c, [prov:type='wfprov:Artifact'])
                // again stored in the RO
                specializationOf(data/00/00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c, data:00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c)

        // step1 finished
        wasEndedBy(run:4305467e-6dfb-11e7-885d-0242ac110002, -, -, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T15:30:00+01:00)

    // the master workflow then "generate" that same value, but now at a different time and role (the resultA master workflow output)
    wasGeneratedBy(data:00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T15:00:00+01:00, [prov:role='wf:main/resultA'])

    // next step activity
    activity(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, -, -, [prov:type='wfprov:ProcessRun', prov:label="Run of workflow/packed.cwl#main/step2"])
        wasStartedBy(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, -, -, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T16:00:00+01:00)
        // associated with step2
        wasAssociatedWith(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, engine:b2210211-8acb-4d58-bd28-2a36b18d3b4f, wf:main/step2)
        
        // Uses two data artifacts; one which came from previous step, other as workflow input
        used(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03, 2017-10-27T15:00:00+01:00, [prov:role='wf:main/step2/valueA'])
        used(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, data:00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c, 2017-10-27T15:00:00+01:00, [prov:role='wf:main/step2/valueB'])
        
        // and generate two new data artifacts
        wasGeneratedBy(data:952f537d1f3116db56703787ace248fe00ae46fa77ea3803aa3d8dc01d221a9d, run:c42dc36e-6dfd-11e7-bc24-0242ac110002, 2017-10-27T16:34:20+01:00, [prov:role='wf:main/step2/out1'])
            entity(data:952f537d1f3116db56703787ace248fe00ae46fa77ea3803aa3d8dc01d221a9d, [prov:type='wfprov:Artifact'])
                specializationOf(data/95/952f537d1f3116db56703787ace248fe00ae46fa77ea3803aa3d8dc01d221a9d, data:952f537d1f3116db56703787ace248fe00ae46fa77ea3803aa3d8dc01d221a9d)

        wasGeneratedBy(data:3deb00bd0decd1f21d015a178c4f23a5eb537588c08eeee9d55059ec29637be0, run:c42dc36e-6dfd-11e7-bc24-0242ac110002, 2017-10-27T16:34:20+01:00, [prov:role='wf:main/step2/out2'])
            entity(data:3deb00bd0decd1f21d015a178c4f23a5eb537588c08eeee9d55059ec29637be0, [prov:type='wfprov:Artifact'])
                specializationOf(data/3d/3deb00bd0decd1f21d015a178c4f23a5eb537588c08eeee9d55059ec29637be0, data:3deb00bd0decd1f21d015a178c4f23a5eb537588c08eeee9d55059ec29637be0)
        // step2 ends
        wasEndedBy(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, -, -, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T16:30:00+01:00)

    // only step output out1 captured by mother workflow, sent to resultB workflow output
    wasGeneratedBy(data:952f537d1f3116db56703787ace248fe00ae46fa77ea3803aa3d8dc01d221a9d, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T15:00:00+01:00, [prov:role='wf:main/resultB'])

    // mother workflow ends
    wasEndedBy(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, -, -, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T16:34:40+01:00)

endDocument

2018-01-15 Challenges in interoperable provenance capture with Common Workflow Language and Research Objects

By Stian Soiland-Reyes


Presented at RDA-Europe meeting on Data provenance approaches
