Stian Soiland-Reyes
eScience lab, The University of Manchester
INDElab, University of Amsterdam
EOSC-Life retreat 2021
Provenance of tools and workflows; FAIRification of workflows
2021-05-19
This work is licensed under a
Creative Commons Attribution 4.0 International License.
They ride with what I refer to as the four horsemen of the reproducibility apocalypse:
Reproducibility?
Semantic Web world vs Real World
Peter Sefton at Open Repositories 2019
https://eresearch.uts.edu.au/2019/07/01/DataCrate-OR2019.htm
16k RO-Crates underneath the hood:
Capturing workflow provenance in a research object
CWLProv explained by example:
Transfer: BagIt
Manifest: ORE/RO JSON-LD
Workflow description: wfdesc (Turtle)
Workflow run: PROV + wfprov
Workflow definition: CWL
Tool interoperability: Docker
Data: Content-addressable files
document
prefix wfprov <http://purl.org/wf4ever/wfprov#>
prefix prov <http://www.w3.org/ns/prov#>
prefix wfdesc <http://purl.org/wf4ever/wfdesc#>
prefix wf <https://w3id.org/cwl/view/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/hello/hello.cwl#>
prefix input <app://579c1b74-b328-4da6-80a8-a2ffef2ac9b5/workflow/input.json#>
prefix run <urn:uuid:>
prefix engine <urn:uuid:>
prefix data <urn:hash:sha256:>
default <app://579c1b74-b328-4da6-80a8-a2ffef2ac9b5/>
// Level 1 provenance of workflow run
activity(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, -, -, [prov:type='wfprov:WorkflowRun', prov:label="Run of workflow/packed.cwl#main"])
wasStartedBy(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, -, -, -, 2017-10-27T14:24:00+01:00)
// The engine is the SoftwareAgent that is executing our Workflow plan
wasAssociatedWith(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, engine:b2210211-8acb-4d58-bd28-2a36b18d3b4f, wf:main)
agent(engine:b2210211-8acb-4d58-bd28-2a36b18d3b4f, [prov:type='prov:SoftwareAgent', prov:type='wfprov:WorkflowEngine', prov:label="cwltool v1.2.5"])
// prov has no term to relate sub-plans - we'll use wfdesc:hasSubProcess
entity(wf:main,[prov:type='wfdesc:Workflow', prov:type='prov:Plan', wfdesc:hasSubProcess='wf:main/step1', wfdesc:hasSubProcess='wf:main/step2'])
alternateOf(wf:main, workflow/packed.cwl)
entity(wf:main/step1,[prov:type='wfdesc:Process', prov:type='prov:Plan'])
entity(wf:main/step2,[prov:type='wfdesc:Process', prov:type='prov:Plan'])
// First the workflow uses some data; here with a urn:hash:sha256 identifier
used(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03, 2017-10-27T14:29:00+01:00, [prov:role='wf:main/input1'])
entity(data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03, [prov:type='wfprov:Artifact'])
// which we have stored a copy of within the research object
specializationOf(data/58/5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03, data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03)
// Then there was another activity - wfprov:ProcessRun indicating a command line tool
activity(run:4305467e-6dfb-11e7-885d-0242ac110002, -, -, [prov:type='wfprov:ProcessRun', prov:label="Run of workflow/packed.cwl#main/step1"])
// started by the mother activity
wasStartedBy(run:4305467e-6dfb-11e7-885d-0242ac110002, -, -, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T15:00:00+01:00)
// same engine using step1 as plan. In a distributed scenario there might be a different engine
wasAssociatedWith(run:4305467e-6dfb-11e7-885d-0242ac110002, engine:b2210211-8acb-4d58-bd28-2a36b18d3b4f, wf:main/step1)
// This activity also uses the same data, but in a different role (e.g. input parameter)
used(run:4305467e-6dfb-11e7-885d-0242ac110002, data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03, 2017-10-27T14:00:00+01:00, [prov:role='wf:main/step1/in1'])
// And we generate some new data
wasGeneratedBy(data:00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c, run:4305467e-6dfb-11e7-885d-0242ac110002, 2017-10-27T16:00:00+01:00, [prov:role='wf:main/step1/out1'])
entity(data:00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c, [prov:type='wfprov:Artifact'])
// again stored in the RO
specializationOf(data/00/00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c, data:00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c)
// step1 finished
wasEndedBy(run:4305467e-6dfb-11e7-885d-0242ac110002, -, -, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T15:30:00+01:00)
// the master workflow then "generates" that same value, but now at a different time and in a different role (the resultA master workflow output)
wasGeneratedBy(data:00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T15:00:00+01:00, [prov:role='wf:main/resultA'])
// next step activity
activity(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, -, -, [prov:type='wfprov:ProcessRun', prov:label="Run of workflow/packed.cwl#main/step2"])
wasStartedBy(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, -, -, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T16:00:00+01:00)
// associated with step2
wasAssociatedWith(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, engine:b2210211-8acb-4d58-bd28-2a36b18d3b4f, wf:main/step2)
// Uses two data artifacts; one which came from previous step, other as workflow input
used(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03, 2017-10-27T15:00:00+01:00, [prov:role='wf:main/step2/valueA'])
used(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, data:00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c, 2017-10-27T15:00:00+01:00, [prov:role='wf:main/step2/valueB'])
// and generate two new data artifacts
wasGeneratedBy(data:952f537d1f3116db56703787ace248fe00ae46fa77ea3803aa3d8dc01d221a9d, run:c42dc36e-6dfd-11e7-bc24-0242ac110002, 2017-10-27T16:34:20+01:00, [prov:role='wf:main/step2/out1'])
entity(data:952f537d1f3116db56703787ace248fe00ae46fa77ea3803aa3d8dc01d221a9d, [prov:type='wfprov:Artifact'])
specializationOf(data/95/952f537d1f3116db56703787ace248fe00ae46fa77ea3803aa3d8dc01d221a9d, data:952f537d1f3116db56703787ace248fe00ae46fa77ea3803aa3d8dc01d221a9d)
wasGeneratedBy(data:3deb00bd0decd1f21d015a178c4f23a5eb537588c08eeee9d55059ec29637be0, run:c42dc36e-6dfd-11e7-bc24-0242ac110002, 2017-10-27T16:34:20+01:00, [prov:role='wf:main/step2/out2'])
entity(data:3deb00bd0decd1f21d015a178c4f23a5eb537588c08eeee9d55059ec29637be0, [prov:type='wfprov:Artifact'])
specializationOf(data/3d/3deb00bd0decd1f21d015a178c4f23a5eb537588c08eeee9d55059ec29637be0, data:3deb00bd0decd1f21d015a178c4f23a5eb537588c08eeee9d55059ec29637be0)
// step2 ends
wasEndedBy(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, -, -, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T16:30:00+01:00)
// only step output out1 captured by mother workflow, sent to resultB workflow output
wasGeneratedBy(data:952f537d1f3116db56703787ace248fe00ae46fa77ea3803aa3d8dc01d221a9d, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T15:00:00+01:00, [prov:role='wf:main/resultB'])
// mother workflow ends
wasEndedBy(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, -, -, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T16:34:40+01:00)
endDocument
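The used and wasGeneratedBy statements above form a graph that can be walked backwards from any output to the data it depends on. The following is a minimal sketch of that idea, not part of CWLProv itself. It assumes abbreviated identifiers (`run:step1`, `data:5891`, and so on) in place of the full UUIDs and hashes, follows the comments' reading that step2 consumes both the workflow input and step1's output, and leaves out the statements about the enclosing workflow run. The function name `upstream` is hypothetical.

```python
from collections import defaultdict

# Abbreviated stand-ins for the full UUIDs and SHA-256 hashes in the PROV-N above.
# (activity, entity) pairs taken from the step-level used(...) statements.
USED = [
    ("run:step1", "data:5891"),
    ("run:step2", "data:5891"),
    ("run:step2", "data:0068"),
]
# (entity, activity) pairs taken from the step-level wasGeneratedBy(...) statements.
GENERATED = [
    ("data:0068", "run:step1"),
    ("data:952f", "run:step2"),
    ("data:3deb", "run:step2"),
]


def upstream(entity, used=USED, generated=GENERATED):
    """Return every entity that `entity` transitively depends on."""
    inputs_of = defaultdict(set)
    for activity, ent in used:
        inputs_of[activity].add(ent)
    producers = defaultdict(set)
    for ent, activity in generated:
        producers[ent].add(activity)

    seen = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        # Step back from the entity to the activity that produced it,
        # then to everything that activity used.
        for activity in producers[current]:
            for ent in inputs_of[activity]:
                if ent not in seen:
                    seen.add(ent)
                    stack.append(ent)
    return seen


print(sorted(upstream("data:952f")))
```

Here the step2 output `data:952f` traces back to both the step1 output and the original workflow input, while the workflow input itself has no upstream data.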
PROV-XML:
<prov:wasGeneratedBy>
<prov:entity prov:ref="ex:ent1"/>
<prov:activity prov:ref="ex:act1"/>
<prov:time>2017-10-26T21:32:52Z</prov:time>
<ex:port>p1</ex:port>
</prov:wasGeneratedBy>
PROV-N:
wasGeneratedBy(ent1, act1,
2017-10-26T21:32:52Z, [ex:port="p1"])
PROV-O Turtle:
:ent1
a prov:Entity;
prov:wasGeneratedBy :act1;
prov:generatedAtTime "2017-10-26T21:32:52Z"^^xsd:dateTime ;
ex:port "p1" .
PROV-JSON:
"wasGeneratedBy": {
"ex:gen1": {
"prov:entity": "ent1",
"prov:activity": "act1",
"prov:time": "2017-10-26T21:32:52Z",
"ex:port": "p1"
}
},
PROV-O JSON-LD:
{ "@context": { .. },
"@id": "ent1",
"@type": "prov:Entity",
"ex:port": "p1",
"prov:generatedAtTime": "2017-10-26T21:32:52Z",
"prov:wasGeneratedBy": {
"@id": "act1",
"@type": "prov:Activity"
}
}
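The snippets above all carry the same generation statement in different syntaxes. As a rough illustration of how close the PROV-JSON and PROV-N forms are, the sketch below renders one PROV-JSON-style generation record as a PROV-N line. The function name `generation_to_provn` is hypothetical, and the sketch handles only this one statement type with plain string attributes; it is not a general PROV converter.

```python
def generation_to_provn(record):
    """Render one PROV-JSON-style generation record as a PROV-N statement."""
    core = ("prov:entity", "prov:activity", "prov:time")
    entity = record["prov:entity"]
    activity = record["prov:activity"]
    # "-" is the PROV-N marker for an omitted optional value.
    time = record.get("prov:time", "-")
    # Anything outside the core keys becomes an attribute-value pair.
    extras = [f'{key}="{value}"' for key, value in record.items() if key not in core]
    attrs = f", [{', '.join(extras)}]" if extras else ""
    return f"wasGeneratedBy({entity}, {activity}, {time}{attrs})"


record = {
    "prov:entity": "ent1",
    "prov:activity": "act1",
    "prov:time": "2017-10-26T21:32:52Z",
    "ex:port": "p1",
}
print(generation_to_provn(record))
```

The extra `ex:port` key ends up in the bracketed attribute list, which is where PROV-N places non-core attributes.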
Output 1B file is also Input 2C and Input 3D downstream
Simple filenames -> duplications
./data/step1/outputB.txt
./data/step2/inputC.txt
./data/step3/inputD.txt
Content-addressable
SHA-256 hash of bytes as filename:
./data/51/51fb8af0c4ae0422fbe88340d91880ecb9d7537cf57339c1cf1256b7ca58f32d
RFC6920 URI as global identifier:
nih:sha-256;51fb8af0c4ae0422fbe88340d91880ecb9d7537cf57339c1cf1256b7ca58f32d
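A minimal sketch of the content-addressing scheme described above, using only Python's standard `hashlib`. The helper name `content_address` and the sample bytes are illustrative assumptions; the path layout and the `nih:sha-256;` form follow the examples on this slide.

```python
import hashlib


def content_address(payload: bytes):
    """Return (relative path, nih URI) derived from the SHA-256 of the bytes."""
    digest = hashlib.sha256(payload).hexdigest()
    # The first two hex characters become a subdirectory, as in ./data/51/51fb...
    path = f"data/{digest[:2]}/{digest}"
    uri = f"nih:sha-256;{digest}"
    return path, uri


# Hypothetical file contents: an output of one step reused as input of another.
output_b = b"same bytes\n"
input_c = b"same bytes\n"

print(content_address(output_b))
# Identical content maps to the identical path, so it is stored only once.
print(content_address(output_b) == content_address(input_c))
```

Because the name depends only on the bytes, a file that appears as Output 1B, Input 2C and Input 3D is written once instead of three times.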
IEEE 2791-2020
RO-Crate as an index
ro-crate-metadata.json
{
"@id": "#DataCapture_wcc02",
"@type": "CreateAction",
"agent": {
"@id": "https://orcid.org/0000-0002-1672-552X"
},
"instrument": {
"@id": "https://confluence.csiro.au/display/ASL/Hovermap"
},
"object": {
"@id": "#victoria_arch"
},
"result": [
{
"@id": "wcc02_arch.laz"
},
{
"@id": "wcc02_arch_traj.txt"
}
]
},
{
"@id": "#victoria_arch",
"@type": "Place",
"address": "Wombeyan Caves, NSW 2580",
"name": "Victoria Arch"
}
{"@context": "https://w3id.org/ro/crate/1.1/context",
"@graph" : [
{
"@id": "#Photo_Capture_1",
"@type": "CreateAction",
"agent": {
"@id": "https://orcid.org/0000-0002-3545-944X"
},
"description": "Photo snapped on a photo walk on a misty day",
"endTime": "2017-06-11T12:56:14+10:00",
"instrument": [
{
"@id": "#EPL1"
},
{
"@id": "#Panny20mm"
}
],
"result": {
"@id": "pics/2017-06-11%2012.56.14.jpg"
}
},
{
"@id": "#SepiaConversion_1",
"@type": "CreateAction",
"name": "Convert dog image to sepia",
"description": "convert -sepia-tone 80% test_data/sample/pics/2017-06-11\\ 12.56.14.jpg test_data/sample/pics/sepia_fence.jpg",
"endTime": "2018-09-19T17:01:07+10:00",
"instrument": {
"@id": "https://www.imagemagick.org/"
},
"object": {
"@id": "pics/2017-06-11%2012.56.14.jpg"
},
"result": {
"@id": "pics/sepia_fence.jpg"
}
},
{
"@id": "https://www.imagemagick.org/",
"@type": "SoftwareApplication",
"url": "https://www.imagemagick.org/",
"name": "ImageMagick",
"version": "ImageMagick 6.9.7-4 Q16 x86_64 20170114 http://www.imagemagick.org"
}
]
}
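Since every entity in the `@graph` carries an `@id`, the provenance recorded by the CreateAction entries can be read with plain JSON tooling. The sketch below is one possible way to do that, assuming a cut-down copy of the crate above embedded as a string; the helper names `as_list` and `describe_actions` are hypothetical and this is not an official RO-Crate API.

```python
import json

# A cut-down copy of the crate above, embedded so the sketch is self-contained.
crate = json.loads("""
{"@context": "https://w3id.org/ro/crate/1.1/context",
 "@graph": [
  {"@id": "#SepiaConversion_1", "@type": "CreateAction",
   "instrument": {"@id": "https://www.imagemagick.org/"},
   "object": {"@id": "pics/2017-06-11%2012.56.14.jpg"},
   "result": {"@id": "pics/sepia_fence.jpg"}},
  {"@id": "https://www.imagemagick.org/", "@type": "SoftwareApplication",
   "name": "ImageMagick"}
 ]}
""")


def as_list(value):
    """JSON-LD values may be a single item or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def describe_actions(crate):
    """Summarise each CreateAction as 'results <- tools(inputs)'."""
    by_id = {entity["@id"]: entity for entity in crate["@graph"]}
    lines = []
    for entity in crate["@graph"]:
        if "CreateAction" not in as_list(entity.get("@type")):
            continue
        # Use the referenced entity's name if it is described in the crate,
        # otherwise fall back to the bare identifier.
        tools = [by_id.get(ref["@id"], {}).get("name", ref["@id"])
                 for ref in as_list(entity.get("instrument"))]
        inputs = [ref["@id"] for ref in as_list(entity.get("object"))]
        outputs = [ref["@id"] for ref in as_list(entity.get("result"))]
        lines.append(f"{', '.join(outputs)} <- {', '.join(tools)}({', '.join(inputs)})")
    return lines


for line in describe_actions(crate):
    print(line)
```

The `as_list` helper is there because, as in the examples above, `instrument` and `result` appear sometimes as a single reference and sometimes as an array.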
{
"@id": "#test1",
"@type": "TestSuite",
"mainEntity": {"@id": "sort-and-change-case.ga"},
"instance": [
{"@id": "#test1_1"}
],
"definition": {"@id": "test/test1/sort-and-change-case-test.yml"}
},
{
"@id": "#test1_1",
"@type": "TestInstance",
"runsOn": {"@id": "https://w3id.org/ro/terms/test#JenkinsService"},
"url": "http://example.org/jenkins",
"resource": "job/tests/"
},
{
"@id": "https://w3id.org/ro/terms/test#JenkinsService",
"@type": "TestService",
"name": "Jenkins",
"url": {"@id": "https://www.jenkins.io"}
},
{
"@id": "test/test1/sort-and-change-case-test.yml",
"@type": [
"File",
"TestDefinition"
],
"conformsTo": {"@id": "https://w3id.org/ro/terms/test#PlanemoEngine"},
"engineVersion": ">=0.70"
},
{
"@id": "https://w3id.org/ro/terms/test#PlanemoEngine",
"@type": "SoftwareApplication",
"name": "Planemo",
"url": {"@id": "https://github.com/galaxyproject/planemo"}
}
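The test metadata above links a suite to its instances and each instance to the service it runs on, all by `@id` reference. The following sketch shows how a consumer might follow those links. The function name `suite_endpoints` is hypothetical, and joining `url` and `resource` into a single address is an assumption made for illustration rather than something stated on the slide.

```python
# A subset of the entities above, as Python dicts.
graph = [
    {"@id": "#test1", "@type": "TestSuite",
     "mainEntity": {"@id": "sort-and-change-case.ga"},
     "instance": [{"@id": "#test1_1"}]},
    {"@id": "#test1_1", "@type": "TestInstance",
     "runsOn": {"@id": "https://w3id.org/ro/terms/test#JenkinsService"},
     "url": "http://example.org/jenkins", "resource": "job/tests/"},
    {"@id": "https://w3id.org/ro/terms/test#JenkinsService",
     "@type": "TestService", "name": "Jenkins"},
]


def suite_endpoints(graph):
    """List (workflow, service name, address) for every test instance."""
    by_id = {entity["@id"]: entity for entity in graph}
    found = []
    for suite in graph:
        if suite.get("@type") != "TestSuite":
            continue
        for ref in suite.get("instance", []):
            instance = by_id[ref["@id"]]
            service = by_id[instance["runsOn"]["@id"]]
            # Assumption: the resource is relative to the service URL.
            address = instance["url"].rstrip("/") + "/" + instance["resource"]
            found.append((suite["mainEntity"]["@id"], service["name"], address))
    return found


print(suite_endpoints(graph))
```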
Workflow language & version
Workflow engine & version (e.g. Toil)
Workflow definition
Input data (or pointers to such)
Parameters? What can be implicit and explicit? (see BCO?)
Tool Dependencies to install (mostly implied by CWL/Nextflow/Galaxy, but might need versions/repos)
Container platform requirement [e.g. Docker, Conda]
Operating system requirement
Hardware requirements (memory, CPU, GPU)
Equivalent of AWS cloud instance type sufficient?
Where to run/submit (e.g. usegalaxy.eu)
Explicit/resolved container IDs
Archive containers from Docker Hub (protect against image expiration)
...
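One way to make a checklist like this actionable is to hold it as a structured record and ask which items are still unfilled. The sketch below is purely illustrative: the class `RunRequirements`, its field names and the sample values are assumptions covering only part of the list above, not a proposed schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class RunRequirements:
    """Hypothetical record mirroring some of the checklist items above."""
    workflow_language: Optional[str] = None
    engine: Optional[str] = None
    workflow_definition: Optional[str] = None
    input_data: Optional[str] = None
    container_platform: Optional[str] = None
    operating_system: Optional[str] = None
    hardware: Optional[str] = None

    def missing(self):
        """Names of the items that have not been recorded yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Sample values chosen for illustration only.
reqs = RunRequirements(workflow_language="CWL v1.2", engine="Toil",
                       container_platform="Docker")
print(reqs.missing())
```

Keeping every field optional reflects the open question on the slide about which items must be explicit and which can stay implicit.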
Join discussion in the
Workflow Hub Club community!
https://about.workflowhub.eu/
--> Separation of concerns
Next call: Thu 27 May 2021 20:00 UTC