Data provenance
with RO-Crate

Stian Soiland-Reyes

eScience lab, The University of Manchester

INDElab, University of Amsterdam

EOSC-Life retreat 2021
Provenance of tools and workflows; FAIRification of workflows
2021-05-19

https://doi.org/10.1038/d41586-019-01307-2

They ride with what I refer to as the four horsemen of the reproducibility apocalypse:
  1. Publication bias
  2. Low statistical power
  3. P-value hacking
  4. HARKing (hypothesizing after results are known)

 

 

Reproducibility?

Excessive FAIR considered
dangerous for your health

Semantic Web world vs Real World

 

Peter Sefton at Open Repositories 2019

 https://eresearch.uts.edu.au/2019/07/01/DataCrate-OR2019.htm

FAIR is not just machine-readable!

16k RO-Crates underneath the hood:

http://www.eopas.org/

CWLProv

Capturing workflow provenance in a research object

CWLProv explained by example:

https://w3id.org/cwl/prov

Separation of concerns

 

Transfer: BagIt

Manifest: ORE/RO JSON-LD

Workflow description: wfdesc (Turtle)
Workflow run: PROV + wfprov
Workflow definition: CWL
Tool interoperability: Docker
Data: Content-addressable files
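
As a rough illustration of the transfer layer, the sketch below builds the text of the two files a BagIt bag (RFC 8493) needs around a payload: the `bagit.txt` declaration and a `manifest-sha256.txt`. The helper name `bag_files` and the example payload are hypothetical, and a real CWLProv bag carries more tag files than this.

```python
import hashlib


def bag_files(payload: dict[str, bytes]) -> dict[str, str]:
    """Return the text of bagit.txt and manifest-sha256.txt for a payload.

    `payload` maps paths relative to the bag's data/ directory to file bytes.
    """
    manifest_lines = []
    for rel_path in sorted(payload):
        digest = hashlib.sha256(payload[rel_path]).hexdigest()
        # One line per payload file: checksum, whitespace, path inside the bag
        manifest_lines.append(f"{digest}  data/{rel_path}")
    return {
        "bagit.txt": "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        "manifest-sha256.txt": "\n".join(manifest_lines) + "\n",
    }


# Hypothetical one-file payload
files = bag_files({"hello.txt": b"Hello\n"})
print(files["manifest-sha256.txt"], end="")
```

Because the manifest is derived only from the payload bytes, a receiver can recompute it to check that nothing was lost or altered in transfer.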

document
prefix wfprov <http://purl.org/wf4ever/wfprov#>
prefix prov <http://www.w3.org/ns/prov#>
prefix wfdesc <http://purl.org/wf4ever/wfdesc#>
prefix wf <https://w3id.org/cwl/view/git/933bf2a1a1cce32d88f88f136275535da9df0954/workflows/hello/hello.cwl#>
prefix input <app://579c1b74-b328-4da6-80a8-a2ffef2ac9b5/workflow/input.json#>
prefix run <urn:uuid:>
prefix engine <urn:uuid:>
prefix data <urn:hash:sha256:>

default <app://579c1b74-b328-4da6-80a8-a2ffef2ac9b5/>

// Level 1 provenance of workflow run

activity(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, -, -, [prov:type='wfprov:WorkflowRun', prov:label="Run of workflow/packed.cwl#main"])
    wasStartedBy(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, -, -, -, 2017-10-27T14:24:00+01:00)  

    // The engine is the SoftwareAgent that is executing our Workflow plan
    wasAssociatedWith(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, engine:b2210211-8acb-4d58-bd28-2a36b18d3b4f, wf:main)
        agent(engine:b2210211-8acb-4d58-bd28-2a36b18d3b4f, [prov:type='prov:SoftwareAgent', prov:type='wfprov:WorkflowEngine', prov:label="cwltool v1.2.5"])
        // prov has no term to relate sub-plans - we'll use wfdesc:hasSubProcess
        entity(wf:main,[prov:type='wfdesc:Workflow', prov:type='prov:Plan', wfdesc:hasSubProcess='wf:main/step1', wfdesc:hasSubProcess='wf:main/step2'])
            alternateOf(wf:main, workflow/packed.cwl)
            entity(wf:main/step1,[prov:type='wfdesc:Process', prov:type='prov:Plan'])
            entity(wf:main/step2,[prov:type='wfdesc:Process', prov:type='prov:Plan'])            

    // First the workflow uses some data; here with a urn:hash:sha256 identifier
    used(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03, 2017-10-27T14:29:00+01:00, [prov:role='wf:main/input1'])
        entity(data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03, [prov:type='wfprov:Artifact'])
            // which we have stored a copy of within the research object
            specializationOf(data/58/5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03, data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03)

    // Then there was another activity - wfprov:ProcessRun indicating a command line tool
    activity(run:4305467e-6dfb-11e7-885d-0242ac110002, -, -, [prov:type='wfprov:ProcessRun', prov:label="Run of workflow/packed.cwl#main/step1"])
        // started by the mother activity
        wasStartedBy(run:4305467e-6dfb-11e7-885d-0242ac110002, -, -, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T15:00:00+01:00)
        // same engine using step1 as plan. In a distributed scenario there might be a different engine
        wasAssociatedWith(run:4305467e-6dfb-11e7-885d-0242ac110002, engine:b2210211-8acb-4d58-bd28-2a36b18d3b4f, wf:main/step1)
        // This activity also uses the same data, but in a different role (e.g. input parameter)
        used(run:4305467e-6dfb-11e7-885d-0242ac110002, data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03, 2017-10-27T14:00:00+01:00, [prov:role='wf:main/step1/in1'])

        // And we generate some new data
        wasGeneratedBy(data:00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c, run:4305467e-6dfb-11e7-885d-0242ac110002, 2017-10-27T16:00:00+01:00, [prov:role='wf:main/step1/out1'])
            entity(data:00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c, [prov:type='wfprov:Artifact'])
                // again stored in the RO
                specializationOf(data/00/00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c, data:00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c)

        // step1 finished
        wasEndedBy(run:4305467e-6dfb-11e7-885d-0242ac110002, -, -, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T15:30:00+01:00)

    // the master workflow then "generate" that same value, but now at a different time and role (the resultA master workflow output)
    wasGeneratedBy(data:00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T15:00:00+01:00, [prov:role='wf:main/resultA'])

    // next step activity
    activity(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, -, -, [prov:type='wfprov:ProcessRun', prov:label="Run of workflow/packed.cwl#main/step2"])
        wasStartedBy(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, -, -, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T16:00:00+01:00)
        // associated with step2
        wasAssociatedWith(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, engine:b2210211-8acb-4d58-bd28-2a36b18d3b4f, wf:main/step2)
        
        // Uses two data artifacts: one as workflow input, the other from the previous step
        used(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, data:5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03, 2017-10-27T15:00:00+01:00, [prov:role='wf:main/step2/valueA'])
        used(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, data:00688350913f2f292943a274b57019d58889eda272370af261c84e78e204743c, 2017-10-27T15:00:00+01:00, [prov:role='wf:main/step2/valueB'])
        
        // and generate two new data artifacts
        wasGeneratedBy(data:952f537d1f3116db56703787ace248fe00ae46fa77ea3803aa3d8dc01d221a9d, run:c42dc36e-6dfd-11e7-bc24-0242ac110002, 2017-10-27T16:34:20+01:00, [prov:role='wf:main/step2/out1'])
            entity(data:952f537d1f3116db56703787ace248fe00ae46fa77ea3803aa3d8dc01d221a9d, [prov:type='wfprov:Artifact'])
                specializationOf(data/95/952f537d1f3116db56703787ace248fe00ae46fa77ea3803aa3d8dc01d221a9d, data:952f537d1f3116db56703787ace248fe00ae46fa77ea3803aa3d8dc01d221a9d)

        wasGeneratedBy(data:3deb00bd0decd1f21d015a178c4f23a5eb537588c08eeee9d55059ec29637be0, run:c42dc36e-6dfd-11e7-bc24-0242ac110002, 2017-10-27T16:34:20+01:00, [prov:role='wf:main/step2/out2'])
            entity(data:3deb00bd0decd1f21d015a178c4f23a5eb537588c08eeee9d55059ec29637be0, [prov:type='wfprov:Artifact'])
                specializationOf(data/3d/3deb00bd0decd1f21d015a178c4f23a5eb537588c08eeee9d55059ec29637be0, data:3deb00bd0decd1f21d015a178c4f23a5eb537588c08eeee9d55059ec29637be0)
        // step2 ends
        wasEndedBy(run:c42dc36e-6dfd-11e7-bc24-0242ac110002, -, -, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T16:30:00+01:00)

    // only step output out1 captured by mother workflow, sent to resultB workflow output
    wasGeneratedBy(data:952f537d1f3116db56703787ace248fe00ae46fa77ea3803aa3d8dc01d221a9d, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T15:00:00+01:00, [prov:role='wf:main/resultB'])

    // mother workflow ends
    wasEndedBy(run:2e1287e0-6dfb-11e7-8acf-0242ac110002, -, -, run:2e1287e0-6dfb-11e7-8acf-0242ac110002, 2017-10-27T16:34:40+01:00)

endDocument
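
One thing such a trace makes possible is walking backwards from an output to everything it depended on. The sketch below is a minimal, stdlib-only illustration: it copies the step-level `used` and `wasGeneratedBy` statements into plain tuples and follows them transitively. Hashes and UUIDs are shortened for readability, and the last two `used` statements are attributed to step2, as the comment above them describes; both are assumptions of this example. The function name `upstream` is hypothetical.

```python
# (activity, entity) pairs taken from the used(...) statements
USED = [
    ("run:2e1287e0", "data:5891b5b5"),  # workflow run reads its input
    ("run:4305467e", "data:5891b5b5"),  # step1 reads the same input
    ("run:c42dc36e", "data:5891b5b5"),  # step2 reads the workflow input
    ("run:c42dc36e", "data:00688350"),  # step2 reads step1's output
]
# (entity, activity) pairs taken from the step-level wasGeneratedBy(...) statements
GENERATED_BY = [
    ("data:00688350", "run:4305467e"),  # step1/out1
    ("data:952f537d", "run:c42dc36e"),  # step2/out1
    ("data:3deb00bd", "run:c42dc36e"),  # step2/out2
]


def upstream(entity: str) -> set[str]:
    """Return every activity and entity that `entity` transitively depends on."""
    seen: set[str] = set()
    todo = [entity]
    while todo:
        node = todo.pop()
        if node.startswith("data:"):
            # An entity depends on the activities that generated it
            parents = [a for e, a in GENERATED_BY if e == node]
        else:
            # An activity depends on the entities it used
            parents = [e for a, e in USED if a == node]
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                todo.append(parent)
    return seen


print(sorted(upstream("data:952f537d")))
```

The result for step2's first output includes both step runs and both earlier data artifacts, which is the lineage a reader of the trace would reconstruct by hand.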

CWLProv

Which PROV format?

PROV-XML

<prov:wasGeneratedBy>
  <prov:entity prov:ref="ex:ent1"/>
  <prov:activity prov:ref="ex:act1"/>
  <prov:time>2017-10-26T21:32:52Z</prov:time>
  <ex:port>p1</ex:port>
</prov:wasGeneratedBy>

PROV-N

wasGeneratedBy(ent1, act1,
  2017-10-26T21:32:52Z, [ex:port="p1"])

PROV-O Turtle

:ent1
  a prov:Entity;
  prov:wasGeneratedBy :act1;
  prov:generatedAtTime "2017-10-26T21:32:52Z"^^xsd:dateTime ;
  ex:port "p1" .

PROV-JSON

"wasGeneratedBy": {
    "ex:gen1": {
        "prov:entity": "ent1",
        "prov:activity": "act1",
        "prov:time": "2017-10-26T21:32:52Z",
        "ex:port": "p1"
    }
},

PROV-O JSON-LD

{ "@context": { .. },
  "@id": "ent1",
  "@type": "prov:Entity",
  "ex:port": "p1",
  "prov:generatedAtTime":  "2017-10-26T21:32:52Z",
  "prov:wasGeneratedBy": {
    "@id": "act1",
    "@type": "prov:Activity"
  }
}
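
The same generation record can be held once in memory and written out in more than one syntax. The following is a minimal sketch of that idea for two of the formats, using only the standard library. The `Generation` class and its method names are hypothetical, and a dedicated PROV library would normally handle the conversion.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Generation:
    """One wasGeneratedBy record, independent of any serialisation."""
    entity: str
    activity: str
    time: str
    attributes: dict[str, str] = field(default_factory=dict)

    def to_provn(self) -> str:
        # PROV-N puts optional attributes in square brackets at the end
        attrs = ", ".join(f'{k}="{v}"' for k, v in self.attributes.items())
        suffix = f", [{attrs}]" if attrs else ""
        return f"wasGeneratedBy({self.entity}, {self.activity}, {self.time}{suffix})"

    def to_prov_json(self, record_id: str) -> dict:
        # PROV-JSON groups records by relation type, keyed by record identifier
        body = {
            "prov:entity": self.entity,
            "prov:activity": self.activity,
            "prov:time": self.time,
            **self.attributes,
        }
        return {"wasGeneratedBy": {record_id: body}}


gen = Generation("ent1", "act1", "2017-10-26T21:32:52Z", {"ex:port": "p1"})
print(gen.to_provn())
print(json.dumps(gen.to_prov_json("ex:gen1"), indent=2))
```

Note that PROV-JSON needs an identifier for the relation itself (`ex:gen1`), which the PROV-N form can leave implicit.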

Identifying intermediate data

Output 1B file is also Input 2C and Input 3D downstream

Simple filenames -> duplications

./data/step1/outputB.txt
./data/step2/inputC.txt
./data/step3/inputD.txt

 

Content-addressable

SHA-256 hash of bytes as filename:

./data/51/51fb8af0c4ae0422fbe88340d91880ecb9d7537cf57339c1cf1256b7ca58f32d

RFC6920 URI as global identifier:

nih:sha-256;51fb8af0c4ae0422fbe88340d91880ecb9d7537cf57339c1cf1256b7ca58f32d
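
A minimal sketch of the idea, assuming files small enough to hash in memory: the storage path and the global identifiers are all derived from the bytes alone, so the same content used as an output and as two downstream inputs collapses to one file. The function name `content_address` and the sample bytes are hypothetical.

```python
import hashlib


def content_address(content: bytes) -> dict[str, str]:
    """Derive a storage path and global identifiers from the bytes alone."""
    digest = hashlib.sha256(content).hexdigest()
    return {
        # The first two hex characters become a subdirectory
        "path": f"data/{digest[:2]}/{digest}",
        # Prefix used for data in the CWLProv PROV-N example
        "urn": f"urn:hash:sha256:{digest}",
        # RFC 6920 human-speakable form
        "nih": f"nih:sha-256;{digest}",
    }


# The same bytes produced by step 1 and consumed by steps 2 and 3
output_b = b"intermediate result\n"
input_c = b"intermediate result\n"
assert content_address(output_b) == content_address(input_c)
print(content_address(output_b)["path"])
```

The step-specific names (outputB, inputC, inputD) then live only in the provenance as roles, not in the file system.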

Workflow provenance profiles

IEEE 2791-2020

RO-Crate as an index

ro-crate-metadata.json

Workflow Provenance
in RO-Crate

{
      "@id": "#DataCapture_wcc02",
      "@type": "CreateAction",
      "agent": {
        "@id": "https://orcid.org/0000-0002-1672-552X"
      },
      "instrument": {
        "@id": "https://confluence.csiro.au/display/ASL/Hovermap"
      },
      "object": {
        "@id": "#victoria_arch"
      },
      "result": [
        {
          "@id": "wcc02_arch.laz"
        },
        {
          "@id": "wcc02_arch_traj.txt"
        }
      ]
    },
  {
      "@id": "#victoria_arch",
      "@type": "Place",
      "address": "Wombeyan Caves, NSW 2580",
      "name": "Victoria Arch"
  }

Software as instrument

{"@context": "https://w3id.org/ro/crate/1.1/context",
 "@graph" : [
   {
      "@id": "#Photo_Capture_1",
      "@type": "CreateAction",
      "agent": {
        "@id": "https://orcid.org/0000-0002-3545-944X"
      },
      "description": "Photo snapped on a photo walk on a misty day",
      "endTime": "2017-06-11T12:56:14+10:00",
      "instrument": [
        {
          "@id": "#EPL1"
        },
        {
          "@id": "#Panny20mm"
        }
      ],
      "result": {
        "@id": "pics/2017-06-11%2012.56.14.jpg"
      }
    },
    {
      "@id": "#SepiaConversion_1",
      "@type": "CreateAction",
      "name": "Convert dog image to sepia",
      "description": "convert -sepia-tone 80% test_data/sample/pics/2017-06-11\\ 12.56.14.jpg test_data/sample/pics/sepia_fence.jpg",
      "endTime": "2018-09-19T17:01:07+10:00",
      "instrument": {
        "@id": "https://www.imagemagick.org/"
      },
      "object": {
        "@id": "pics/2017-06-11%2012.56.14.jpg"
      },
      "result": {
        "@id": "pics/sepia_fence.jpg"
      }
    },
    {
      "@id": "https://www.imagemagick.org/",
      "@type": "SoftwareApplication",
      "url": "https://www.imagemagick.org/",
      "name": "ImageMagick",
      "version": "ImageMagick 6.9.7-4 Q16 x86_64 20170114 http://www.imagemagick.org"
    }

]
}
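
A crate like the one above can be assembled with nothing more than a JSON library. The sketch below is a hedged illustration of that: the function name `action_crate`, the `#action-1` identifier and the file names are hypothetical, and for brevity it leaves out the root dataset's descriptive properties (name, description, datePublished, license) that RO-Crate 1.1 expects.

```python
import json


def action_crate(input_file: str, output_file: str, tool_id: str, tool_name: str) -> dict:
    """Build a small RO-Crate graph where software is the instrument of an action."""
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {   # Metadata file descriptor pointing at the root dataset
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {   # Root dataset lists the files (descriptive properties omitted)
                "@id": "./",
                "@type": "Dataset",
                "hasPart": [{"@id": input_file}, {"@id": output_file}],
            },
            {"@id": input_file, "@type": "File"},
            {"@id": output_file, "@type": "File"},
            {"@id": tool_id, "@type": "SoftwareApplication", "name": tool_name},
            {   # The retrospective provenance: tool turned object into result
                "@id": "#action-1",
                "@type": "CreateAction",
                "instrument": {"@id": tool_id},
                "object": {"@id": input_file},
                "result": {"@id": output_file},
            },
        ],
    }


crate = action_crate(
    "pics/original.jpg", "pics/sepia.jpg",
    "https://www.imagemagick.org/", "ImageMagick",
)
print(json.dumps(crate, indent=2))
```

In practice the printed JSON would be saved as `ro-crate-metadata.json` next to the two image files.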

Job specification as
Prospective Provenance

{
    "@id": "#test1",
    "@type": "TestSuite",
    "mainEntity": {"@id": "sort-and-change-case.ga"},
    "instance": [
        {"@id": "#test1_1"}
    ],
    "definition": {"@id": "test/test1/sort-and-change-case-test.yml"}
},
{
    "@id": "#test1_1",
    "@type": "TestInstance",
    "runsOn": {"@id": "https://w3id.org/ro/terms/test#JenkinsService"},
    "url": "http://example.org/jenkins",
    "resource": "job/tests/"
},
{
    "@id": "https://w3id.org/ro/terms/test#JenkinsService",
    "@type": "TestService",
    "name": "Jenkins",
    "url": {"@id": "https://www.jenkins.io"}
},
{
    "@id": "test/test1/sort-and-change-case-test.yml",
    "@type": [
        "File",
        "TestDefinition"
    ],
    "conformsTo": {"@id": "https://w3id.org/ro/terms/test#PlanemoEngine"},
    "engineVersion": ">=0.70"
},
{
    "@id": "https://w3id.org/ro/terms/test#PlanemoEngine",
    "@type": "SoftwareApplication",
    "name": "Planemo",
    "url": {"@id": "https://github.com/galaxyproject/planemo"}
}
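
Entities in such a graph point at each other through `{"@id": ...}` objects, so a misspelt identifier silently breaks the link between, say, a test suite and its definition file. The sketch below is one possible check for that, assuming the graph is already loaded as a list of dicts. The function name `dangling_references` and the identifiers in the example are hypothetical.

```python
def dangling_references(graph: list[dict]) -> set[str]:
    """Return local @id references that no entity in the graph defines."""
    defined = {entity["@id"] for entity in graph}
    referenced: set[str] = set()

    def collect(value) -> None:
        if isinstance(value, dict):
            # A pure reference is an object carrying only "@id"
            if set(value) == {"@id"}:
                referenced.add(value["@id"])
            for child in value.values():
                collect(child)
        elif isinstance(value, list):
            for item in value:
                collect(item)

    for entity in graph:
        for key, value in entity.items():
            if key != "@id":
                collect(value)
    # Absolute URIs may legitimately point outside the crate
    return {ref for ref in referenced - defined if "://" not in ref}


graph = [
    {"@id": "#suite", "@type": "TestSuite",
     "instance": [{"@id": "#instance"}],
     "definition": {"@id": "test/definition.yml"}},
    {"@id": "#instance", "@type": "TestInstance",
     "runsOn": {"@id": "https://example.org/service"}},
]
# test/definition.yml is referenced but never described
print(dangling_references(graph))
```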

What is needed for a Workflow Run RO-Crate?

  • Workflow language & version

  • Workflow engine & version (e.g. Toil)

  • Workflow definition

  • Input data (or pointers to such)

  • Parameters? What can be implicit and explicit? (see BCO?)

  • Tool Dependencies to install (mostly implied by CWL/Nextflow/Galaxy, but might need versions/repos)

  • Container platform requirement [e.g. Docker, Conda]

  • Operating system requirement

  • Hardware requirements (memory, CPU, GPU)

    • Equivalent of AWS cloud instance type sufficient?

  • Where to run/submit (e.g. usegalaxy.eu)

  • Explicit/resolved container IDs

  • Archive containers from Docker Hub (protect against image expiration)

  • ...
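
To make the open questions above concrete, the list can be treated as a checklist that a run description is compared against. The sketch below is only that: the class `RunDescription`, its field names and the example values are hypothetical and not a proposed schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class RunDescription:
    """Hypothetical checklist mirroring the bullets above; not a proposed schema."""
    workflow_language: Optional[str] = None    # language & version
    workflow_engine: Optional[str] = None      # engine & version
    workflow_definition: Optional[str] = None  # path or URI
    input_data: Optional[list] = None          # files or pointers to them
    parameters: Optional[dict] = None
    tool_dependencies: Optional[list] = None
    container_platform: Optional[str] = None
    operating_system: Optional[str] = None
    hardware: Optional[dict] = None            # memory, CPU, GPU
    execution_site: Optional[str] = None       # where to run/submit
    container_ids: Optional[list] = None       # explicit/resolved image IDs

    def missing(self) -> list[str]:
        """Names of items that have not been recorded yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Illustrative values only; a real record would also pin the engine version
run = RunDescription(
    workflow_language="CWL v1.2",
    workflow_engine="Toil",
    workflow_definition="workflow/packed.cwl",
    container_platform="Docker",
)
print(run.missing())
```

Listing what is missing, rather than rejecting the record, leaves room for the question of which items may stay implicit.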

Join the discussion in the
Workflow Hub Club community!
https://about.workflowhub.eu/

--> Separation of concerns

Join the RO-Crate Community!

GitHub issue #1

github.com/researchobject/ro-crate

Next call: Thu 27 May 2021 20:00 UTC

https://www.researchobject.org/ro-crate/community

2021-05-19 Recording provenance with RO-Crate

By Stian Soiland-Reyes

2021-05-19 Notes for EOSC-Life retreat 2021, breakout on Provenance of tools and workflows; FAIRification of workflows
