Challenges in interoperable provenance capture
Stian Soiland-Reyes
eScience lab, The University of Manchester
RDA-Europe, Data provenance approaches
Barcelona, 2018-01-15 17:30
This work is licensed under a
Creative Commons Attribution 4.0 International License.
Scientific Workflows
stain@biggie-mint ~/src/taverna-prov/example $ executeworkflow -embedded \
-provbundle helloanyone.bundle.zip \
-inputvalue name fred helloanyone.t2flow
Provenance bundle zip will be saved to: /home/stain/src/taverna-prov/example/helloanyone.bundle.zip
stain@biggie-mint ~/src/taverna-prov/example $ mkdir helloanyone.bundle ; cd helloanyone.bundle
stain@biggie-mint ~/src/taverna-prov/example/helloanyone.bundle $ unzip ../helloanyone.bundle.zip
Archive: ../helloanyone.bundle.zip
extracting: mimetype
creating: inputs/
inflating: inputs/name.txt
creating: outputs/
inflating: outputs/greeting.txt
creating: intermediates/
creating: intermediates/3a/
inflating: intermediates/3a/3a82e39d-a537-40cf-91a0-2c89d4a2e62b.txt
inflating: workflowrun.prov.ttl
inflating: workflow.wfbundle
creating: .ro/
creating: .ro/annotations/
inflating: .ro/annotations/workflow.wfdesc.ttl
inflating: .ro/annotations/a2f03983-8836-4c36-bfb2-d713d9a1928f.ttl
inflating: .ro/manifest.json
cwlVersion: v1.0
class: Workflow
inputs:
  inp: File
  ex: string
outputs:
  classout:
    type: File
    outputSource: compile/classfile
steps:
  untar:
    run: tar-param.cwl
    in:
      tarfile: inp
      extractfile: ex
    out: [example_out]
  compile:
    run: arguments.cwl
    in:
      src: untar/example_out
    out: [classfile]
https://doi.org/10.7490/f1000research.1114781.1
Farah Z Khan
BOSC hackathon 2017
Prototype PROV+RO export
CWL reference implementation
Copyright © 2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark and document use rules apply.
PROV Model Primer
W3C Working Group Note 30 April 2013
Which PROV format?
<prov:wasGeneratedBy>
  <prov:entity prov:ref="ex:ent1"/>
  <prov:activity prov:ref="ex:act1"/>
  <prov:time>2017-10-26T21:32:52Z</prov:time>
  <ex:port>p1</ex:port>
</prov:wasGeneratedBy>
wasGeneratedBy(ent1, act1,
2017-10-26T21:32:52Z, [ex:port="p1"])
:ent1
  a prov:Entity;
  prov:wasGeneratedBy :act1;
  prov:generatedAtTime "2017-10-26T21:32:52Z"^^xsd:dateTime ;
  ex:port "p1" .
"wasGeneratedBy": {
  "ex:gen1": {
    "prov:entity": "ent1",
    "prov:activity": "act1",
    "prov:time": "2017-10-26T21:32:52Z",
    "ex:port": "p1"
  },
},
{ "@context": { .. },
  "@id": "ent1",
  "@type": "prov:Entity",
  "ex:port": "p1",
  "prov:generatedAtTime": "2017-10-26T21:32:52Z",
  "prov:wasGeneratedBy": {
    "@id": "act1",
    "@type": "prov:Activity"
  }
}
PROV-N
PROV-XML
PROV-JSON
PROV-O Turtle
PROV-O JSON-LD
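The same generation record can be rendered in more than one of these serialisations from a single in-memory description. The sketch below is a minimal illustration using only the Python standard library; the helper names generation_prov_n and generation_prov_json are hypothetical and are not part of any PROV toolkit.

```python
import json

def generation_prov_n(entity, activity, time, attrs):
    """Render a wasGeneratedBy statement as PROV-N text."""
    pairs = ", ".join(f'{key}="{value}"' for key, value in attrs.items())
    return f"wasGeneratedBy({entity}, {activity}, {time}, [{pairs}])"

def generation_prov_json(gen_id, entity, activity, time, attrs):
    """Render the same statement as a PROV-JSON style fragment."""
    record = {"prov:entity": entity, "prov:activity": activity, "prov:time": time}
    record.update(attrs)  # extra attributes sit next to the prov: keys
    return {"wasGeneratedBy": {gen_id: record}}

when = "2017-10-26T21:32:52Z"
extra = {"ex:port": "p1"}
prov_n = generation_prov_n("ent1", "act1", when, extra)
prov_json = generation_prov_json("ex:gen1", "ent1", "act1", when, extra)

print(prov_n)
print(json.dumps(prov_json, indent=2))
```

Both outputs carry the same four facts (entity, activity, time, port); only the surface syntax differs, which is the interoperability cost the slide points at.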
Tooling to the rescue
How to identify the workflow?
Permalink URI scheme
https://w3id.org/cwl/view/{scheme}/{commit}/{path}#{fragment}
- https://w3id.org/cwl/view/ - fixed prefix at the permalink service https://w3id.org/
- {scheme} - source code management protocol; currently only git is supported
- {commit} - full git commit SHA-1 id (no branches or short commits allowed)
- {path} - relative path to the .cwl file within a checkout of that git commit
- #{fragment} - optional part within the CWL file, e.g. #main
Any git permalinks are resolved using https://view.commonwl.org/git which - if it knows about that particular git commit - will content-negotiate to provide various representations.
Anyone can mint these permalinks for .cwl files for a given commit, in any public or private git repository, given no uncommitted files or git submodules.
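Minting such a permalink is plain string assembly plus a few checks. The following is a minimal sketch of the scheme described above; the function name cwl_permalink and its validation rules are illustrative assumptions, not an official client of the permalink service.

```python
import re

PREFIX = "https://w3id.org/cwl/view"

def cwl_permalink(commit, path, fragment=None, scheme="git"):
    """Build a permalink for a .cwl file at a given git commit."""
    if scheme != "git":
        raise ValueError("only git is currently supported")
    if not re.fullmatch(r"[0-9a-f]{40}", commit):
        raise ValueError("commit must be a full 40-character SHA-1 id")
    if path.startswith("/"):
        raise ValueError("path must be relative to the checkout")
    uri = f"{PREFIX}/{scheme}/{commit}/{path}"
    if fragment:
        uri += "#" + fragment.lstrip("#")  # accept "main" or "#main"
    return uri

link = cwl_permalink(
    "933bf2a1a1cce32d88f88f136275535da9df0954",
    "workflows/hello/hello.cwl",
    "main",
)
print(link)
```

Rejecting branch names and short commits up front keeps the identifier stable: the same URI always denotes the same bytes of the workflow definition.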
wasAssociatedWith(run:2e1287e0-6dfb-11e7-8acf-0242ac110002,
engine:b2210211-8acb-4d58-bd28-2a36b18d3b4f,
https://w3id.org/cwl/view/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/hello/hello.cwl#main)
agent(engine:b2210211-8acb-4d58-bd28-2a36b18d3b4f,
[prov:type='prov:SoftwareAgent', prov:type='wfprov:WorkflowEngine',
prov:label="cwltool v1.2.5"])
A workflow definition (prospective provenance)
can be executed multiple times (retrospective provenance)
(and on different machines)
workflow definition - a recipe (prov:Plan)
workflow instance - a recipe, fully configured to run (prov:Plan)
workflow run - an execution of a workflow instance (prov:Activity)
Run a Command line tool
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [ esl-sfetch, --index ]
inputs:
  sequences:
    type: File
    inputBinding:
      position: 1
      valueFrom: $(self.basename)
Where does esl-sfetch come from?
Which version? How is it configured?
Containers to the rescue
Step in workflow ~= tool execution?
Scatter/gather
To use scatter/gather, ScatterFeatureRequirement must be specified in the workflow or workflow step requirements.
A "scatter" operation specifies that the associated workflow step or subworkflow should execute separately over a list of input elements. Each job making up a scatter operation is independent and may be executed concurrently.
- dotproduct specifies that each of the input arrays are aligned and one element taken from each array to construct each job. It is an error if all input arrays are not the same length.
- nested_crossproduct specifies the Cartesian product of the inputs, producing a job for every combination of the scattered inputs. The output must be nested arrays for each level of scattering, in the order that the input arrays are listed in the scatter field.
- flat_crossproduct specifies the Cartesian product of the inputs, producing a job for every combination of the scattered inputs. The output arrays must be flattened to a single level, but otherwise listed in the order that the input arrays are listed in the scatter field.
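The three scatter methods differ only in how jobs are enumerated and how results are shaped, which matters for provenance because each job becomes its own activity. The sketch below is an illustration in Python, not code from any CWL engine; it builds the per-job inputs, and the outputs would follow the same shape.

```python
from itertools import product

def dotproduct(inputs):
    """One job per index; all arrays must have the same length."""
    names = list(inputs)
    if len({len(inputs[name]) for name in names}) > 1:
        raise ValueError("dotproduct requires input arrays of the same length")
    return [dict(zip(names, values)) for values in zip(*(inputs[n] for n in names))]

def flat_crossproduct(inputs):
    """One job per combination, returned as a single flat list."""
    names = list(inputs)
    return [dict(zip(names, values)) for values in product(*(inputs[n] for n in names))]

def nested_crossproduct(inputs):
    """One job per combination, nested one level per scattered input."""
    names = list(inputs)

    def nest(level, chosen):
        if level == len(names):
            return dict(chosen)
        name = names[level]
        return [nest(level + 1, chosen + [(name, value)]) for value in inputs[name]]

    return nest(0, [])

scattered = {"a": [1, 2], "b": ["x", "y"]}
print(dotproduct(scattered))           # 2 jobs
print(flat_crossproduct(scattered))    # 4 jobs, flat
print(nested_crossproduct(scattered))  # 4 jobs, as a 2 x 2 nesting
```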
3.5 Expressions
An expression is a fragment of JavaScript/ECMAScript 5.1 code evaluated by the workflow platform to affect the inputs, outputs, or behavior of a process.
Expressions are denoted by the syntax $(...) or ${...}.
A code fragment wrapped in the $(...) syntax must be evaluated as an ECMAScript expression.
A code fragment wrapped in the ${...} syntax must be evaluated as an ECMAScript function body for an anonymous, zero-argument function.
Expressions must return a valid JSON data type: one of null, string, number, boolean, array, object.
Conditional branching
step2:
  in: [threshold]
  out: [out]
  switch:
    "$(inputs.threshold > 2)": high.cwl
    "$(inputs.threshold == 1)": low.cwl
    default:
      result:
        out: 0
Nested workflows
A single activity unrolled to multiple steps
activity(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, -, -,
[prov:type='wfprov:WorkflowRun', prov:label="Run of workflow/packed.cwl#main"])
// main workflow run started outside somehow (we don't know how)
wasStartedBy(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, -,
-, 2017-10-27T15:00:00Z)
// ...
// step is a nested workflow, so also a WorkflowRun
activity(run:4305467e-6dfb-11e7-885d-0242ac110002, -, -,
[prov:type='wfprov:WorkflowRun', prov:label="Run of workflow/packed.cwl#main/nested1"])
// started by the mother activity
wasStartedBy(run:4305467e-6dfb-11e7-885d-0242ac110002, -,
run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T15:00:30Z)
// inner step of nested workflow, ProcessRun as this is a command line execution
activity(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, -, -,
[prov:type='wfprov:ProcessRun', prov:label="Run of workflow/packed.cwl#nested/innerStep1"])
wasStartedBy(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, -,
run:4305467e-6dfb-11e7-885d-0242ac110002, 2017-10-27T15:01:00Z)
// ...
Identifying intermediate data
Output 1B file is also Input 2C and Input 3D downstream
Simple filenames -> duplications
./data/step1/outputB.txt
./data/step2/inputC.txt
./data/step3/inputD.txt
Content-addressable
SHA-256 hash of bytes as filename:
./data/51/51fb8af0c4ae0422fbe88340d91880ecb9d7537cf57339c1cf1256b7ca58f32d
RFC6920 URI as global identifier:
nih:sha-256;51fb8af0c4ae0422fbe88340d91880ecb9d7537cf57339c1cf1256b7ca58f32d
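Both identifiers can be derived from the bytes alone. A minimal sketch, assuming the two-character directory prefix and the nih: form shown above; the function name content_address is illustrative.

```python
import hashlib

def content_address(data: bytes):
    """Return (local path, global identifier) for a byte string."""
    digest = hashlib.sha256(data).hexdigest()
    path = f"data/{digest[:2]}/{digest}"  # first two hex digits as folder
    uri = f"nih:sha-256;{digest}"         # RFC 6920 human-speakable form
    return path, uri

path, uri = content_address(b"")
print(path)
print(uri)
```

Identical bytes always map to the same path and URI, whichever step or filename produced them. That removes the duplication, but it also means the identifier says nothing about which occurrence of the data is meant.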
All-in-one prov trace -> messy
identifiers mismatch (e.g. "step1" both in #main and #nested)
Multiple wasGeneratedBy for same entity
Does everyone need to understand the execution hierarchy?
Multiple workflow PROV profiles combined?
prov:alternateOf
Relating global identifier to local paths
used(run:2e1287e0-6dfb-11e7-8acf-0242ac110002,
data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03,
2017-10-27T14:29:00+01:00, [prov:role='wf:main/input1'])
entity(data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03)
// which we have stored a copy of within the research object
specializationOf(./data/58/5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03,
data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03)
... but what about multiple workflows creating the same bytes?
e.g. when was the empty string generated?
used(run:2e1287e0-6dfb-11e7-8acf-0242ac110002,
urn:uuid:f940c301-46fd-4a6b-808d-d6beed700f3a,
2017-10-27T14:29:00+01:00, [prov:role='wf:main/input1'])
used(run:2e1287e0-6dfb-11e7-8acf-0242ac110002,
urn:uuid:63a0ff1b-45c6-41cb-97bf-2da7aa93ec0f,
2017-10-27T14:29:00+01:05, [prov:role='wf:main/input2'])
entity(data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03)
// Different UUID for each occurrence
specializationOf(urn:uuid:f940c301-46fd-4a6b-808d-d6beed700f3a,
data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03)
specializationOf(urn:uuid:63a0ff1b-45c6-41cb-97bf-2da7aa93ec0f,
data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03)
// Also available as bytes in the Research Object
specializationOf(./data/58/5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03,
data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03)
What about multiple workflows creating the same bytes?
When was the empty string generated?
Workaround: "virtual" entity for every role / activity occurrence
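The workaround can be sketched as follows: each use of a value gets a fresh UUID entity, which is then declared a specialization of the content-addressed entity. This is an illustration only; the helper record_use is hypothetical and simply emits PROV-N text.

```python
import hashlib
import uuid

def record_use(run_id, data, role, time):
    """Mint a per-occurrence entity and tie it to the content hash."""
    digest = hashlib.sha256(data).hexdigest()
    occurrence = f"urn:uuid:{uuid.uuid4()}"  # new identifier on every call
    statements = [
        f"used({run_id}, {occurrence}, {time}, [prov:role='{role}'])",
        f"specializationOf({occurrence}, data:{digest})",
    ]
    return occurrence, statements

run = "run:2e1287e0-6dfb-11e7-8acf-0242ac110002"
first, first_prov = record_use(run, b"", "wf:main/input1", "2017-10-27T14:29:00+01:00")
second, second_prov = record_use(run, b"", "wf:main/input2", "2017-10-27T14:29:05+01:00")

for line in first_prov + second_prov:
    print(line)
```

The two empty strings now have distinct identifiers with their own roles and times, while the shared data: identifier still records that the bytes are the same.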
Simplify:
multiple PROV files
Different "world views" of what happened
metadata/provenance/2e1287e0-6dfb-11e7-8acf-0242ac110002.prov.jsonld
metadata/provenance/4305467e-6dfb-11e7-885d-0242ac110002.prov.jsonld
metadata/provenance/c42dc36e-6dfd-11e7-bc24-0242ac110002.prov.jsonld
Bonus: Obvious slot for tool-specific provenance
Prospective provenance? UUIDv5 hash of permalink
metadata/prospective/39ab126a-e0c9-5cac-a67e-2b7fdb8ad25f/cwl.ttl
metadata/prospective/39ab126a-e0c9-5cac-a67e-2b7fdb8ad25f/wfdesc.ttl
metadata/prospective/39ab126a-e0c9-5cac-a67e-2b7fdb8ad25f/pplan.ttl
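A name-based UUID gives every permalink a stable, filesystem-safe folder name. A minimal sketch with Python's standard uuid module follows; the choice of uuid.NAMESPACE_URL is an assumption, as the slides do not say which namespace is used, so the result need not match the identifier shown above.

```python
import uuid

def prospective_dir(permalink):
    """Derive a deterministic folder for a workflow definition."""
    # Assumption: the URL namespace; another namespace gives other UUIDs.
    plan_id = uuid.uuid5(uuid.NAMESPACE_URL, permalink)
    return f"metadata/prospective/{plan_id}"

permalink = (
    "https://w3id.org/cwl/view/git/"
    "933bf2a1a1cce32d88f88f136275535da9df0954/workflows/hello/hello.cwl#main"
)
folder = prospective_dir(permalink)
print(folder + "/cwl.ttl")
print(folder + "/wfdesc.ttl")
```

Because UUIDv5 is a hash rather than a random value, two runs of the same workflow definition resolve to the same prospective provenance folder without any shared registry.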
Workflow provenance profiles
How to tie it all together?
id: doi:10.15490/seek.1.investigation.56
createdOn: 2015-07-10T16:46:00Z
createdBy: http://orcid.org/0000-0001-9842-9718
aggregates:
- id: data/sequence/specimen5.bam
conformsTo: http://gemrb.org/iesdp/file_formats/ie_formats/bam_v1.htm
- id: http://example.com/blog/about-specimen5
authoredBy: http://orcid.org/0000-0001-7066-3350
- id: http://www.myexperiment.org/workflows/3355
history: provenance/workflow-evolution.ttl
annotations:
- about: data/sequence/specimen5.bam
content: annotations/specimen5-properties.jsonld
createdBy: http://orcid.org/0000-0001-7066-3350
- about: data/sequence/specimen5.bam
content: http://example.com/blog/about-specimen5
motivatedBy: oa:questioning
Research Object manifest
(simplified)
Reuse standards:
OAI-ORE, BagIt, W3C JSON-LD, PROV, Web Annotation Model
metadata/manifest.json
data/sequence/specimen5.bam
provenance/workflow-evolution.ttl
http://example.com/blog/about-specimen5
http://www.myexperiment.org/workflows/335
http://orcid.org/0000-0001-7066-3350
http://gemrb.org/iesdb/
file_formats_ie_formats_bam_v1.html
Who is using Research Objects?
Structure of CWL run Research Object:
- data: content-addressable by SHA-256 hash
- workflow: input object (JSON file) with relativised paths, and packed.cwl, an executable workflow containing the workflow and tool specifications with relativised paths so it can be re-run inside an RO.
- snapshot: copies of the original workflow and tool specification files as-is (warning: might contain absolute paths or be host-specific).
- metadata: provenance about the workflow run, its data products and manifest for this Research Object.
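The layout above can be created up front, before the engine writes any data. This is a small sketch using only the Python standard library; the function name create_ro_skeleton is hypothetical, and the manifest and provenance files are not written here.

```python
import tempfile
from pathlib import Path

# Top-level folders of the CWL run Research Object, as listed above.
RO_FOLDERS = ("data", "workflow", "snapshot", "metadata")

def create_ro_skeleton(root):
    """Create the empty top-level folders of a run Research Object."""
    root = Path(root)
    for name in RO_FOLDERS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root

with tempfile.TemporaryDirectory() as tmp:
    ro = create_ro_skeleton(Path(tmp) / "run-ro")
    created = sorted(entry.name for entry in ro.iterdir())

print(created)
```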
document
prefix wfprov <http://purl.org/wf4ever/wfprov#>
prefix prov <http://www.w3.org/ns/prov#>
prefix wfdesc <http://purl.org/wf4ever/wfdesc#>
prefix wf <https://w3id.org/cwl/view/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/hello/hello.cwl#>
prefix input <app://579c1b74-b328-4da6-80a8-a2ffef2ac9b5/workflow/input.json#>
prefix run <urn:uuid:>
prefix engine <urn:uuid:>
prefix data <nih:sha-256;>
default <app://579c1b74-b328-4da6-80a8-a2ffef2ac9b5/>
// Level 1 provenance of workflow run
activity(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, -, -, [prov:type='wfprov:WorkflowRun', prov:label="Run of workflow/packed.cwl#main"])
wasStartedBy(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, -, -, 2017-10-27T14:24:00+01:00)
// The engine is the SoftwareAgent that is executing our Workflow plan
wasAssociatedWith(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, engine:b2210211-8acb-4d58-bd28-2a36b18d3b4f, wf:main)
agent(engine:b2210211-8acb-4d58-bd28-2a36b18d3b4f, [prov:type='prov:SoftwareAgent', prov:type='wfprov:WorkflowEngine', prov:label="cwltool v1.2.5"])
// prov has no term to relate sub-plans - we'll use wfdesc:hasSubProcess
entity(wf:main,[prov:type='wfdesc:Workflow', prov:type='prov:Plan', wfdesc:hasSubProcess='wf:main/step1', wfdesc:hasSubProcess='wf:main/step2'])
alternateOf(wf:main, workflow/packed.cwl)
entity(wf:main/step1,[prov:type='wfdesc:Process', prov:type='prov:Plan'])
entity(wf:main/step2,[prov:type='wfdesc:Process', prov:type='prov:Plan'])
// First the workflow uses some data; here with a sha256 identifier
used(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03, 2017-10-27T14:29:00+01:00, [prov:role='wf:main/input1'])
entity(data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03, [prov:type='wfprov:Artifact'])
// which we have stored a copy of within the research object
specializationOf(data/58/5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03, data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03)
// Then there was another activity - wfprov:ProcessRun indicating a command line tool
activity(run:4305467e-6dfb-11e7-885d-0242ac110002, -, -, [prov:type='wfprov:ProcessRun', prov:label="Run of workflow/packed.cwl#main/step1"])
// started by the mother activity
wasStartedBy(run:4305467e-6dfb-11e7-885d-0242ac110002, -, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T15:00:00+01:00)
// same engine using step1 as plan. In a distributed scenario there might be a different engine
wasAssociatedWith(run:4305467e-6dfb-11e7-885d-0242ac110002, engine:b2210211-8acb-4d58-bd28-2a36b18d3b4f, wf:main/step1)
// This activity also uses the same data, but in a different role (e.g. input parameter)
used(run:4305467e-6dfb-11e7-885d-0242ac110002, data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03, 2017-10-27T14:00:00+01:00, [prov:role='wf:main/step1/in1'])
// And we generate some new data
wasGeneratedBy(data:00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c, run:4305467e-6dfb-11e7-885d-0242ac110002, 2017-10-27T16:00:00+01:00, [prov:role='wf:main/step1/out1'])
entity(data:00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c, [prov:type='wfprov:Artifact'])
// again stored in the RO
specializationOf(data/00/00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c, data:00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c)
// step1 finished
wasEndedBy(run:4305467e-6dfb-11e7-885d-0242ac110002, -, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T15:30:00+01:00)
// the master workflow then "generates" that same value, but now at a different time and in a different role (the resultA master workflow output)
wasGeneratedBy(data:00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T15:00:00+01:00, [prov:role='wf:main/resultA'])
// next step activity
activity(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, -, -, [prov:type='wfprov:ProcessRun', prov:label="Run of workflow/packed.cwl#main/step2"])
wasStartedBy(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, -, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T16:00:00+01:00)
// associated with step2
wasAssociatedWith(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, engine:b2210211-8acb-4d58-bd28-2a36b18d3b4f, wf:main/step2)
// Uses two data artifacts: one from the previous step, the other a workflow input
used(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03, 2017-10-27T15:00:00+01:00, [prov:role='wf:main/step2/valueA'])
used(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, data:00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c, 2017-10-27T15:00:00+01:00, [prov:role='wf:main/step2/valueB'])
// and generate two new data artifacts
wasGeneratedBy(data:952f537d1f3116db56703787ace248fe00ae46fa77ea3803aa3d8dc01d221a9d, run:c42dc36e-6dfd-11e7-bc24-0242ac110002, 2017-10-27T16:34:20+01:00, [prov:role='wf:main/step2/out1'])
entity(data:952f537d1f3116db56703787ace248fe00ae46fa77ea3803aa3d8dc01d221a9d, [prov:type='wfprov:Artifact'])
specializationOf(data/95/952f537d1f3116db56703787ace248fe00ae46fa77ea3803aa3d8dc01d221a9d, data:952f537d1f3116db56703787ace248fe00ae46fa77ea3803aa3d8dc01d221a9d)
wasGeneratedBy(data:3deb00bd0decd1f21d015a178c4f23a5eb537588c08eeee9d55059ec29637be0, run:c42dc36e-6dfd-11e7-bc24-0242ac110002, 2017-10-27T16:34:20+01:00, [prov:role='wf:main/step2/out2'])
entity(data:3deb00bd0decd1f21d015a178c4f23a5eb537588c08eeee9d55059ec29637be0, [prov:type='wfprov:Artifact'])
specializationOf(data/3d/3deb00bd0decd1f21d015a178c4f23a5eb537588c08eeee9d55059ec29637be0, data:3deb00bd0decd1f21d015a178c4f23a5eb537588c08eeee9d55059ec29637be0)
// step2 ends
wasEndedBy(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, -, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T16:30:00+01:00)
// only step output out1 captured by mother workflow, sent to resultB workflow output
wasGeneratedBy(data:952f537d1f3116db56703787ace248fe00ae46fa77ea3803aa3d8dc01d221a9d, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T15:00:00+01:00, [prov:role='wf:main/resultB'])
// mother workflow ends
wasEndedBy(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, -, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T16:34:40+01:00)
endDocument
2018-01-15 Challenges in interoperable provenance capture with Common Workflow Language and Research Objects
By Farah Z Khan
Presented at RDA-Europe meeting on Data provenance approaches