Data provenance
with RO-Crate

Stian Soiland-Reyes

eScience lab, The University of Manchester

INDElab, University of Amsterdam

International FAIR Convergence Symposium
(FAIR Data Provenance)

2020-12-02

https://doi.org/10.1038/d41586-019-01307-2

They ride with what I refer to as the four horsemen of the reproducibility apocalypse:
  1. Publication bias
  2. Low statistical power
  3. P-value hacking
  4. HARKing (hypothesizing after results are known)

 

 

State of art in reproducibility

Reproducibility?

CWLProv

Capturing workflow provenance in a research object

CWLProv explained by example:

https://w3id.org/cwl/prov

Separation of concerns

 

Transfer: BagIt

Manifest: ORE/RO JSON-LD

Workflow description: wfdesc (Turtle)
Workflow run: PROV + wfprov
Workflow definition: CWL
Tool interoperability: Docker
Data: Content-addressable files
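
The "content-addressable files" layer can be illustrated with a minimal sketch. It assumes SHA-1 digests and a two-character prefix directory under data/; the function name content_path and the exact layout are illustrative assumptions, not taken from the slides.

```python
import hashlib
from pathlib import PurePosixPath


def content_path(content: bytes, root: str = "data") -> PurePosixPath:
    """Return a storage path derived only from the bytes of the file."""
    digest = hashlib.sha1(content).hexdigest()
    # Assumed layout: <root>/<first two hex characters>/<full digest>
    return PurePosixPath(root) / digest[:2] / digest


# Identical content always maps to the same path, whatever the file was called.
a = content_path(b"hello\n")
b = content_path(b"hello\n")
c = content_path(b"hello!\n")
print(a == b, a == c)
```

Because the path depends only on the content, two workflow steps that produce the same bytes share one stored file, and any change to the data shows up as a different identifier.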

Excessive FAIR considered
dangerous for your health

Semantic Web world vs Real World

 

Peter Sefton at Open Repositories 2019

 https://eresearch.uts.edu.au/2019/07/01/DataCrate-OR2019.htm

FAIR is not just machine-readable!

16k RO-Crates underneath the hood:

http://www.eopas.org/

IEEE 2791-2020

RO-Crate as an index

ro-crate-metadata.json
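
As a rough illustration of the crate acting as an index, the sketch below builds a bare-bones ro-crate-metadata.json in Python. The helper name minimal_crate and the crate name "Example crate" are hypothetical. The file names are borrowed from the data-capture example in these slides. Recommended root dataset properties such as description, license and datePublished are left out for brevity.

```python
from __future__ import annotations

import json


def minimal_crate(name: str, files: list[str]) -> dict:
    """Build a bare-bones RO-Crate 1.1 metadata document as a dict."""
    graph = [
        {
            # Metadata descriptor: describes the metadata file itself.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # Root dataset: the index that lists every file in the crate.
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "hasPart": [{"@id": f} for f in files],
        },
    ]
    graph += [{"@id": f, "@type": "File"} for f in files]
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


crate = minimal_crate("Example crate", ["wcc02_arch.laz", "wcc02_arch_traj.txt"])
metadata_json = json.dumps(crate, indent=2)
print(metadata_json)
```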

What about PROV?

    {
      "@id": "#DataCapture_wcc02",
      "@type": "CreateAction",
      "agent": {
        "@id": "https://orcid.org/0000-0002-1672-552X"
      },
      "instrument": {
        "@id": "https://confluence.csiro.au/display/ASL/Hovermap"
      },
      "object": {
        "@id": "#victoria_arch"
      },
      "result": [
        {
          "@id": "wcc02_arch.laz"
        },
        {
          "@id": "wcc02_arch_traj.txt"
        }
      ]
    },
    {
      "@id": "#victoria_arch",
      "@type": "Place",
      "address": "Wombeyan Caves, NSW 2580",
      "name": "Victoria Arch"
    }
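
One way to read the CreateAction above in PROV terms is sketched below. The mapping is an informal assumption made for illustration, not an official alignment: agent becomes prov:wasAssociatedWith, object and instrument both become prov:used, and result becomes prov:wasGeneratedBy. The function name action_to_prov is hypothetical.

```python
def action_to_prov(action: dict) -> list[tuple[str, str, str]]:
    """Translate one schema.org CreateAction into PROV-style triples."""

    def ids(value):
        # A property may hold a single {"@id": ...} reference or a list of them.
        if value is None:
            return []
        items = value if isinstance(value, list) else [value]
        return [item["@id"] for item in items]

    activity = action["@id"]
    triples = [(activity, "rdf:type", "prov:Activity")]
    triples += [(activity, "prov:wasAssociatedWith", a) for a in ids(action.get("agent"))]
    triples += [(activity, "prov:used", o) for o in ids(action.get("object"))]
    triples += [(activity, "prov:used", i) for i in ids(action.get("instrument"))]
    # Direction is reversed here: each result points back at the activity.
    triples += [(r, "prov:wasGeneratedBy", activity) for r in ids(action.get("result"))]
    return triples


capture = {
    "@id": "#DataCapture_wcc02",
    "@type": "CreateAction",
    "agent": {"@id": "https://orcid.org/0000-0002-1672-552X"},
    "instrument": {"@id": "https://confluence.csiro.au/display/ASL/Hovermap"},
    "object": {"@id": "#victoria_arch"},
    "result": [{"@id": "wcc02_arch.laz"}, {"@id": "wcc02_arch_traj.txt"}],
}

triples = action_to_prov(capture)
for triple in triples:
    print(triple)
```

Treating the instrument as prov:used is a simplification; a richer mapping might distinguish devices, software and plans.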

Software as instrument

{"@context": "https://w3id.org/ro/crate/1.1/context",
"@graph" : [
   {
      "@id": "#Photo_Capture_1",
      "@type": "CreateAction",
      "agent": {
        "@id": "https://orcid.org/0000-0002-3545-944X"
      },
      "description": "Photo snapped on a photo walk on a misty day",
      "endTime": "2017-06-11T12:56:14+10:00",
      "instrument": [
        {
          "@id": "#EPL1"
        },
        {
          "@id": "#Panny20mm"
        }
      ],
      "result": {
        "@id": "pics/2017-06-11%2012.56.14.jpg"
      }
    },
    {
      "@id": "#SepiaConversion_1",
      "@type": "CreateAction",
      "name": "Convert dog image to sepia",
      "description": "convert -sepia-tone 80% test_data/sample/pics/2017-06-11\\ 12.56.14.jpg test_data/sample/pics/sepia_fence.jpg",
      "endTime": "2018-09-19T17:01:07+10:00",
      "instrument": {
        "@id": "https://www.imagemagick.org/"
      },
      "object": {
        "@id": "pics/2017-06-11%2012.56.14.jpg"
      },
      "result": {
        "@id": "pics/sepia_fence.jpg"
      }
    },
    {
      "@id": "https://www.imagemagick.org/",
      "@type": "SoftwareApplication",
      "url": "https://www.imagemagick.org/",
      "name": "ImageMagick",
      "version": "ImageMagick 6.9.7-4 Q16 x86_64 20170114 http://www.imagemagick.org"
    }

]
}
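
Chained actions like these can be walked backwards to answer "how was this file made?". The sketch below is a minimal, assumption-laden traversal over an abbreviated copy of the graph above. The helper names refs and lineage are hypothetical, @type is assumed to be a single string, and the graph is assumed to have no cycles.

```python
def refs(value):
    """Return the @id values of a property holding one reference or a list."""
    if value is None:
        return []
    items = value if isinstance(value, list) else [value]
    return [item["@id"] for item in items]


def lineage(graph: list[dict], target: str, depth: int = 0) -> list[str]:
    """Return indented lines describing the actions that led to `target`."""
    by_id = {entity["@id"]: entity for entity in graph}
    lines = []
    for entity in graph:
        if entity.get("@type") != "CreateAction":
            continue
        if target not in refs(entity.get("result")):
            continue
        # Use the instrument's name when it is described, else its identifier.
        tools = [by_id.get(i, {}).get("name", i) for i in refs(entity.get("instrument"))]
        lines.append("  " * depth + f"{target} <- {entity['@id']} using {', '.join(tools)}")
        for source in refs(entity.get("object")):
            lines += lineage(graph, source, depth + 1)
    return lines


graph = [
    {
        "@id": "#Photo_Capture_1",
        "@type": "CreateAction",
        "instrument": [{"@id": "#EPL1"}, {"@id": "#Panny20mm"}],
        "result": {"@id": "pics/2017-06-11%2012.56.14.jpg"},
    },
    {
        "@id": "#SepiaConversion_1",
        "@type": "CreateAction",
        "instrument": {"@id": "https://www.imagemagick.org/"},
        "object": {"@id": "pics/2017-06-11%2012.56.14.jpg"},
        "result": {"@id": "pics/sepia_fence.jpg"},
    },
    {
        "@id": "https://www.imagemagick.org/",
        "@type": "SoftwareApplication",
        "name": "ImageMagick",
    },
]

lines = lineage(graph, "pics/sepia_fence.jpg")
print("\n".join(lines))
```

The sepia image traces back through the ImageMagick conversion to the original photo capture, with software and hardware instruments handled the same way.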

Job specification as
Prospective Provenance

{
  "tmpformat": "ro/workflow/test-metadata/0.1",
  "@id": "test-metadata.json",
  "test": [
    {
      "name": "dtests",
      "instance": [
        {
          "name": "dtests",
          "service": {
            "type": "jenkins",
            "url": "http://172.30.10.90:8080/",
            "resource": "job/dtests/"
          }
        }
      ],
      "definition": {
        "test_engine": {
          "type": "planemo",
          "version": ">=0.70"
        },
        "path": "path relative to the directory containing this file"
      }
    }
  ]
}

What is needed for a Workflow Run RO-Crate?

  • Workflow language & version

  • Workflow engine & version (e.g. Toil)

  • Workflow definition

  • Input data (or pointers to such)

  • Parameters? What can be implicit and explicit? (see BCO?)

  • Tool dependencies to install (mostly implied by CWL/Nextflow/Galaxy, but might need versions/repos)

  • Container platform requirement [e.g. Docker, Conda]

  • Operating system requirement

  • Hardware requirements (memory, CPU, GPU)

    • Is the equivalent of an AWS cloud instance type sufficient?

  • Where to run/submit (e.g. usegalaxy.eu)

  • Explicit/resolved container IDs

  • Archive containers from Docker Hub (protect against image expiration)

  • ...
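
The list above could be turned into a simple completeness check, sketched here under heavy assumptions. The dictionary keys, the function name missing_items and the example values are all hypothetical; none of them is a proposed vocabulary, and the version strings are placeholders.

```python
# Hypothetical keys paired with the checklist items from the list above.
CHECKLIST = {
    "workflow_language": "Workflow language & version",
    "workflow_engine": "Workflow engine & version",
    "workflow_definition": "Workflow definition",
    "input_data": "Input data (or pointers to such)",
    "parameters": "Parameters",
    "tool_dependencies": "Tool dependencies to install",
    "container_platform": "Container platform requirement",
    "operating_system": "Operating system requirement",
    "hardware": "Hardware requirements (memory, CPU, GPU)",
    "execution_target": "Where to run/submit",
    "container_ids": "Explicit/resolved container IDs",
    "container_archive": "Archived containers",
}


def missing_items(run: dict) -> list[str]:
    """Return the checklist labels that the run description leaves empty."""
    return [label for key, label in CHECKLIST.items() if not run.get(key)]


# Example run description; names and values are placeholders for illustration.
run = {
    "workflow_language": {"name": "CWL", "version": "placeholder"},
    "workflow_engine": {"name": "Toil", "version": "placeholder"},
    "workflow_definition": "workflow.cwl",
    "container_platform": "Docker",
}

missing = missing_items(run)
for label in missing:
    print("missing:", label)
```

The check deliberately does not say which items are mandatory, since the list itself leaves open what may stay implicit.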

Join discussion in the
Workflow Hub Club community!
https://about.workflowhub.eu/

--> Separation of concerns

Join the RO-Crate Community!

GitHub issue #1

github.com/researchobject/ro-crate

Next call: Thu 7 Jan 2021 20:00 UTC

https://www.researchobject.org/ro-crate/community

2020-12-02 Recording provenance with RO-Crate

By Stian Soiland-Reyes

Presented 2020-12-02 at International FAIR Convergence Symposium session on FAIR Data Provenance.
