Making reproducible data packages

Stian Soiland-Reyes

eScience lab, The University of Manchester

Open Science @ University College Cork

2018-11-01

Findable

Accessible

Interoperable

Reusable

To be Findable:

F1. (meta)data are assigned a globally unique and persistent identifier

F2. data are described with rich metadata (defined by R1 below)

F3. metadata clearly and explicitly include the identifier of the data it describes

F4. (meta)data are registered or indexed in a searchable resource

To be Accessible:

A1. (meta)data are retrievable by their identifier using a standardized communications protocol

  A1.1 the protocol is open, free, and universally implementable

  A1.2 the protocol allows for an authentication and authorization procedure, where necessary

A2. metadata are accessible, even when the data are no longer available

To be Interoperable:

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation

I2. (meta)data use vocabularies that follow FAIR principles

I3. (meta)data include qualified references to other (meta)data

To be Reusable:

R1. (meta)data are richly described with a plurality of accurate and relevant attributes

  R1.1. (meta)data are released with a clear and accessible data usage license

  R1.2. (meta)data are associated with detailed provenance

  R1.3. (meta)data meet domain-relevant community standards

Data repositories

Selecting a repository

Simplicity over completeness

http,doi.org,web,www,mining,words,lines
138,7,14,23,1,7627,2944
56,33,37,31,2,10686,1680
52,25,9,22,4,9579,1366
44,29,19,20,6,10222,3181
42,11,20,16,0,8268,1394
40,12,26,6,1,10354,4338
39,0,184,23,5,10901,3022
37,13,3,18,6,7567,2801
37,10,23,18,18,9387,2903
34,15,14,17,0,4346,826
33,7,20,13,0,10853,3187
33,16,32,21,2,9506,1233
33,10,12,20,1,8672,1579
31,1,28,3,2,8606,1335
30,7,6,3,1,10101,3313
29,16,5,14,0,7337,1270
29,11,12,17,0,5263,863
28,5,17,15,2,5438,800
28,2,18,29,0,5578,2268
28,19,4,2,8,11235,3392
28,0,6,4,1,6390,2116
27,4,53,6,0,11034,4235
27,23,21,13,0,10496,1488
27,21,108,3,10,7769,1220
26,4,33,6,5,8071,1184
26,24,14,36,4,10676,3571
26,17,71,4,2,8773,1317
25,15,8,15,0,9630,1782
25,13,27,19,4,8116,1198
24,17,7,15,1,8718,1137
24,10,16,20,5,8814,1337
23,14,78,22,0,10250,3919
22,8,47,6,0,7994,1243
22,16,2,16,2,8985,1245
22,14,12,4,3,7761,1337
22,10,30,1,0,7348,1089
21,15,9,3,17,6825,1020
20,7,17,16,0,9286,1526
20,5,13,2,0,10520,2886
20,12,17,16,1,8193,2673
19,2,77,10,0,9383,4010
19,0,35,3,0,11106,1386
17,4,3,17,4,8301,1007

Minimal effort towards interoperable:

CSV
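
As a minimal sketch of why plain CSV already helps interoperability: the table above can be read with Python's standard library alone, without a custom parser. Only the header and the first three rows are reproduced here, and the names `CSV_TEXT` and `read_rows` are illustrative assumptions.

```python
import csv
import io

# Header and first three rows of the table shown above.
CSV_TEXT = """http,doi.org,web,www,mining,words,lines
138,7,14,23,1,7627,2944
56,33,37,31,2,10686,1680
52,25,9,22,4,9579,1366
"""


def read_rows(text):
    """Parse CSV text into a list of dicts with integer values."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: int(value) for key, value in row.items()} for row in reader]


rows = read_rows(CSV_TEXT)
# Total of the "doi.org" column across the parsed rows.
doi_total = sum(row["doi.org"] for row in rows)
print(len(rows), doi_total)
```

Because the first line names every column, a consumer can address values by name rather than by position, which is most of what "minimal effort" buys here.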

Assembling a dataset

A Research Object bundles and relates digital resources of a scientific experiment or investigation:

 

Data used and results produced in an experimental study

Methods employed to produce and analyse that data

Provenance and settings for the experiments

People involved in the investigation

Annotations about these resources, to improve understanding and interpretation

Research Object

id:        doi:10.15490/seek.1.investigation.56
createdOn: 2015-07-10T16:46:00Z
createdBy: http://orcid.org/0000-0001-9842-9718

aggregates:
 - id:         data/sequence/specimen5.bam
   conformsTo: http://gemrb.org/iesdp/file_formats/ie_formats/bam_v1.htm

 - id:         http://example.com/blog/about-specimen5
   authoredBy: http://orcid.org/0000-0001-7066-3350

 - id:         http://www.myexperiment.org/workflows/3355
   history:    provenance/workflow-evolution.ttl

annotations:
 - about:       data/sequence/specimen5.bam
   content:     annotations/specimen5-properties.jsonld
   createdBy:   http://orcid.org/0000-0001-7066-3350

 - about:       data/sequence/specimen5.bam
   content:     http://example.com/blog/about-specimen5
   motivatedBy: oa:questioning

Research Object manifest

(simplified)

Reuse standards:
OAI-ORE, BagIt, W3C JSON-LD, PROV, Web Annotation Model
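
The simplified manifest above could be sketched as a plain Python dictionary and serialised to JSON. This mirrors the keys shown on the slide (`id`, `aggregates`, `annotations`, `about`, `content`); it is an illustration rather than a schema-valid Research Object manifest, and the helper `dangling_annotations` is a hypothetical name.

```python
import json

# Subset of the simplified manifest shown above.
manifest = {
    "id": "doi:10.15490/seek.1.investigation.56",
    "createdOn": "2015-07-10T16:46:00Z",
    "createdBy": "http://orcid.org/0000-0001-9842-9718",
    "aggregates": [
        {"id": "data/sequence/specimen5.bam"},
        {"id": "http://example.com/blog/about-specimen5"},
    ],
    "annotations": [
        {
            "about": "data/sequence/specimen5.bam",
            "content": "annotations/specimen5-properties.jsonld",
        },
    ],
}


def dangling_annotations(manifest):
    """Return annotations whose 'about' target is not an aggregated resource."""
    aggregated = {resource["id"] for resource in manifest["aggregates"]}
    return [a for a in manifest["annotations"] if a["about"] not in aggregated]


print(json.dumps(manifest, indent=2))
print(dangling_annotations(manifest))
```

A check like this is one reason to keep the manifest machine-readable: annotations and aggregated resources refer to each other by identifier, so broken references can be detected mechanically.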

metadata/manifest.json
data/sequence/specimen5.bam
provenance/workflow-evolution.ttl
http://example.com/blog/about-specimen5
http://www.myexperiment.org/workflows/335

http://orcid.org/0000-0001-7066-3350
http://gemrb.org/iesdp/file_formats/ie_formats/bam_v1.htm
pip install bdbag
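
bdbag creates BagIt bags. To show what such a bag looks like on disk, here is a minimal sketch using only the Python standard library. It is not bdbag's API: `make_minimal_bag` is a hypothetical helper, and it writes only the payload directory, the `bagit.txt` declaration and a SHA-256 payload manifest.

```python
import hashlib
import tempfile
from pathlib import Path


def make_minimal_bag(bag_dir, payload):
    """Write payload files under data/ plus the two required BagIt tag files.

    `payload` maps relative file names to bytes. Hypothetical helper for
    illustration only; it does not replace a real BagIt tool.
    """
    bag_dir = Path(bag_dir)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        target = bag_dir / "data" / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Manifest paths are relative to the bag root and use "/" separators.
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder bytes stand in for a real sequence file.
    bag = make_minimal_bag(tmp, {"sequence/specimen5.bam": b"not a real BAM file"})
    listing = sorted(
        p.relative_to(bag).as_posix() for p in bag.rglob("*") if p.is_file()
    )
    manifest_text = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    print(listing)
```

The checksum manifest is what lets a recipient verify that the payload arrived intact, independently of how the bag was transferred.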

 

Copyright © 2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark and document use rules apply.

PROV Model Primer

W3C Working Group Note 30 April 2013

How can we capture the methods?

cwlVersion: v1.0
class: Workflow
inputs:
  inp: File
  ex: string

outputs:
  classout:
    type: File
    outputSource: compile/classfile

steps:
  untar:
    run: tar-param.cwl
    in:
      tarfile: inp
      extractfile: ex
    out: [example_out]

  compile:
    run: arguments.cwl
    in:
      src: untar/example_out
    out: [classfile]
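
In the workflow above, `compile` takes its `src` from `untar/example_out`, so the execution order follows from the data links rather than from the order in which steps are written. A minimal sketch of that idea, assuming Python 3.9+ for `graphlib`; this is an illustration, not how any particular CWL engine is implemented, and `step_order` is a hypothetical name.

```python
from graphlib import TopologicalSorter

# The two steps of the workflow above, each input mapped to its source.
steps = {
    "untar": {"tarfile": "inp", "extractfile": "ex"},
    "compile": {"src": "untar/example_out"},
}


def step_order(steps):
    """Order steps so that each runs after the steps it takes inputs from."""
    graph = {}
    for name, inputs in steps.items():
        # A source written as "step/output" means a dependency on "step";
        # sources without "/" are workflow-level inputs.
        graph[name] = {src.split("/")[0] for src in inputs.values() if "/" in src}
    return list(TopologicalSorter(graph).static_order())


print(step_order(steps))  # ['untar', 'compile']
```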
{
  "@context" : [ "https://w3id.org/bundle/context" ],
  "id" : "/",
  "manifest" : [ "manifest.json" ],
  "createdOn" : "2017-08-24T10:57:46.325Z",
  "createdBy" : {
    "uri" : "https://view.commonwl.org",
    "name" : "Common Workflow Language Viewer"
  },
  "authoredBy" : [ {
    "uri" : "mailto:peter.amstutz@curoverse.com",
    "name" : "Peter Amstutz"
  }, {
    "uri" : "mailto:luka.stojanovic@sbgenomics.com",
    "name" : "Luka Stojanovic"
  }, {
    "uri" : "mailto:crusoe@ucdavis.edu",
    "name" : "Michael R. Crusoe"
  }, {
    "uri" : "mailto:porter@porter.st",
    "name" : "Andrey Kartashov"
  }, {
    "uri" : "mailto:janko.simonovic@sbgenomics.com",
    "name" : "Janko Simonovic"
  } ],
  "retrievedFrom" : "https://github.com/common-workflow-language/workflows/blob/lobstr-v1/workflows/lobSTR/",
  "retrievedOn" : "2017-08-24T10:57:46.325Z",
  "retrievedBy" : {
    "uri" : "https://view.commonwl.org",
    "name" : "Common Workflow Language Viewer"
  },
  "history" : [ "http:/git2prov.org/git2prov?giturl=https:/github.com/common-workflow-language/workflows.git&serialization=PROV-JSON" ],
  "aggregates" : [ {
    "uri" : "/workflow/tmp_2.fq",
    "mediatype" : "application/octet-stream",
    "createdOn" : "2017-08-24T10:57:46.923Z",
    "authoredBy" : [ {
      "uri" : "mailto:peter.amstutz@curoverse.com",
      "name" : "Peter Amstutz"
    } ],
    "retrievedFrom" : "https://raw.githubusercontent.com/common-workflow-language/workflows/lobstr-v1/workflows/lobSTR/tmp_2.fq",
    "retrievedBy" : {
      "uri" : "https://view.commonwl.org",
      "name" : "Common Workflow Language Viewer"
    },
    "bundledAs" : {
      "uri" : "urn:uuid:61579f3e-63e6-49c2-b780-f67b2df461b7",
      "folder" : "/workflow/"
    }
  }, {
    "uri" : "/workflow/lobSTR-demo.json",
    "mediatype" : "application/json",
    "createdOn" : "2017-08-24T10:57:47.216Z",
    "authoredBy" : [ {
      "uri" : "mailto:peter.amstutz@curoverse.com",
      "name" : "Peter Amstutz"
    } ],
    "retrievedFrom" : "https://raw.githubusercontent.com/common-workflow-language/workflows/lobstr-v1/workflows/lobSTR/lobSTR-demo.json",
    "retrievedBy" : {
      "uri" : "https://view.commonwl.org",
      "name" : "Common Workflow Language Viewer"
    },
    "bundledAs" : {
      "uri" : "urn:uuid:973caa0e-f3bd-45e8-8d29-70123bc8715a",
      "folder" : "/workflow/"
    }
  }, {
    "uri" : "/workflow/models/illumina_v3.pcrfree.stuttermodel",
    "mediatype" : "application/octet-stream",
    "createdOn" : "2017-08-24T10:57:47.239Z",
    "authoredBy" : [ {
      "uri" : "mailto:peter.amstutz@curoverse.com",
      "name" : "Peter Amstutz"
    } ],
    "retrievedFrom" : "https://raw.githubusercontent.com/common-workflow-language/workflows/lobstr-v1/workflows/lobSTR/models/illumina_v3.pcrfree.stuttermodel",
    "retrievedBy" : {
      "uri" : "https://view.commonwl.org",
      "name" : "Common Workflow Language Viewer"
    },
    "bundledAs" : {
      "uri" : "urn:uuid:62bbcbea-f34f-463f-990d-6148f8ed5e5c",
      "folder" : "/workflow/models/"
    }
  }, {
    "uri" : "/workflow/models/illumina_v3.pcrfree.stepmodel",
    "mediatype" : "application/octet-stream",
    "createdOn" : "2017-08-24T10:57:47.266Z",
    "authoredBy" : [ {
      "uri" : "mailto:peter.amstutz@curoverse.com",
      "name" : "Peter Amstutz"
    } ],
    "retrievedFrom" : "https://raw.githubusercontent.com/common-workflow-language/workflows/lobstr-v1/workflows/lobSTR/models/illumina_v3.pcrfree.stepmodel",
    "retrievedBy" : {
      "uri" : "https://view.commonwl.org",
      "name" : "Common Workflow Language Viewer"
    },
    "bundledAs" : {
      "uri" : "urn:uuid:03439ae7-cd94-42a3-b5fe-40bfff6882d8",
      "folder" : "/workflow/models/"
    }
  }, {
    "uri" : "/workflow/samtools-sort.cwl",
    "mediatype" : "text/x-yaml",
    "createdOn" : "2017-08-24T10:57:47.269Z",
    "authoredBy" : [ {
      "uri" : "mailto:luka.stojanovic@sbgenomics.com",
      "name" : "Luka Stojanovic"
    }, {
      "uri" : "mailto:crusoe@ucdavis.edu",
      "name" : "Michael R. Crusoe"
    }, {
      "uri" : "mailto:porter@porter.st",
      "name" : "Andrey Kartashov"
    }, {
      "uri" : "mailto:peter.amstutz@curoverse.com",
      "name" : "Peter Amstutz"
    } ],
    "retrievedFrom" : "https://raw.githubusercontent.com/common-workflow-language/workflows/lobstr-v1/workflows/lobSTR/samtools-sort.cwl",
    "retrievedBy" : {
      "uri" : "https://view.commonwl.org",
      "name" : "Common Workflow Language Viewer"
    },
    "conformsTo" : "https://w3id.org/cwl/v1.0",
    "bundledAs" : {
      "uri" : "urn:uuid:2dc07859-efc2-4945-a95f-ba7815b68d07",
      "folder" : "/workflow/"
    }
  }, {
    "uri" : "/workflow/lobSTR-workflow.cwl",
    "mediatype" : "text/x-yaml",
    "createdOn" : "2017-08-24T10:57:47.42Z",
    "authoredBy" : [ {
      "uri" : "mailto:luka.stojanovic@sbgenomics.com",
      "name" : "Luka Stojanovic"
    }, {
      "uri" : "mailto:crusoe@ucdavis.edu",
      "name" : "Michael R. Crusoe"
    }, {
      "uri" : "mailto:peter.amstutz@curoverse.com",
      "name" : "Peter Amstutz"
    } ],
    "retrievedFrom" : "https://raw.githubusercontent.com/common-workflow-language/workflows/lobstr-v1/workflows/lobSTR/lobSTR-workflow.cwl",
    "retrievedBy" : {
      "uri" : "https://view.commonwl.org",
      "name" : "Common Workflow Language Viewer"
    },
    "conformsTo" : "https://w3id.org/cwl/v1.0",
    "bundledAs" : {
      "uri" : "urn:uuid:58bc1895-3460-46d6-91d7-fa1718d09631",
      "folder" : "/workflow/"
    }
  }, {
    "uri" : "/workflow/lobSTR-arvados-demo.json",
    "mediatype" : "application/json",
    "createdOn" : "2017-08-24T10:57:47.453Z",
    "authoredBy" : [ {
      "uri" : "mailto:peter.amstutz@curoverse.com",
      "name" : "Peter Amstutz"
    } ],
    "retrievedFrom" : "https://raw.githubusercontent.com/common-workflow-language/workflows/lobstr-v1/workflows/lobSTR/lobSTR-arvados-demo.json",
    "retrievedBy" : {
      "uri" : "https://view.commonwl.org",
      "name" : "Common Workflow Language Viewer"
    },
    "bundledAs" : {
      "uri" : "urn:uuid:30c683bc-69fb-4d93-8dad-65b663783af5",
      "folder" : "/workflow/"
    }
  }, {
    "uri" : "/workflow/samtools-index.cwl",
    "mediatype" : "text/x-yaml",
    "createdOn" : "2017-08-24T10:57:47.458Z",
    "authoredBy" : [ {
      "uri" : "mailto:luka.stojanovic@sbgenomics.com",
      "name" : "Luka Stojanovic"
    }, {
      "uri" : "mailto:crusoe@ucdavis.edu",
      "name" : "Michael R. Crusoe"
    }, {
      "uri" : "mailto:porter@porter.st",
      "name" : "Andrey Kartashov"
    }, {
      "uri" : "mailto:peter.amstutz@curoverse.com",
      "name" : "Peter Amstutz"
    } ],
    "retrievedFrom" : "https://raw.githubusercontent.com/common-workflow-language/workflows/lobstr-v1/workflows/lobSTR/samtools-index.cwl",
    "retrievedBy" : {
      "uri" : "https://view.commonwl.org",
      "name" : "Common Workflow Language Viewer"
    },
    "conformsTo" : "https://w3id.org/cwl/v1.0",
    "bundledAs" : {
      "uri" : "urn:uuid:8235d3f8-6927-4f73-b160-8521838a1cbb",
      "folder" : "/workflow/"
    }
  }, {
    "uri" : "/workflow/lobSTR-tool.cwl",
    "mediatype" : "text/x-yaml",
    "createdOn" : "2017-08-24T10:57:47.476Z",
    "authoredBy" : [ {
      "uri" : "mailto:luka.stojanovic@sbgenomics.com",
      "name" : "Luka Stojanovic"
    }, {
      "uri" : "mailto:crusoe@ucdavis.edu",
      "name" : "Michael R. Crusoe"
    }, {
      "uri" : "mailto:peter.amstutz@curoverse.com",
      "name" : "Peter Amstutz"
    } ],
    "retrievedFrom" : "https://raw.githubusercontent.com/common-workflow-language/workflows/lobstr-v1/workflows/lobSTR/lobSTR-tool.cwl",
    "retrievedBy" : {
      "uri" : "https://view.commonwl.org",
      "name" : "Common Workflow Language Viewer"
    },
    "conformsTo" : "https://w3id.org/cwl/v1.0",
    "bundledAs" : {
      "uri" : "urn:uuid:7fa6fbe4-1fc5-4cb5-9c1a-56b96c5f7aaf",
      "folder" : "/workflow/"
    }
  }, {
    "uri" : "/workflow/allelotype.cwl",
    "mediatype" : "text/x-yaml",
    "createdOn" : "2017-08-24T10:57:47.537Z",
    "authoredBy" : [ {
      "uri" : "mailto:luka.stojanovic@sbgenomics.com",
      "name" : "Luka Stojanovic"
    }, {
      "uri" : "mailto:janko.simonovic@sbgenomics.com",
      "name" : "Janko Simonovic"
    }, {
      "uri" : "mailto:crusoe@ucdavis.edu",
      "name" : "Michael R. Crusoe"
    }, {
      "uri" : "mailto:peter.amstutz@curoverse.com",
      "name" : "Peter Amstutz"
    } ],
    "retrievedFrom" : "https://raw.githubusercontent.com/common-workflow-language/workflows/lobstr-v1/workflows/lobSTR/allelotype.cwl",
    "retrievedBy" : {
      "uri" : "https://view.commonwl.org",
      "name" : "Common Workflow Language Viewer"
    },
    "conformsTo" : "https://w3id.org/cwl/v1.0",
    "bundledAs" : {
      "uri" : "urn:uuid:3706bd2f-e53f-431d-b32a-deb661d9b292",
      "folder" : "/workflow/"
    }
  }, {
    "uri" : "/workflow/README",
    "mediatype" : "application/octet-stream",
    "createdOn" : "2017-08-24T10:57:47.555Z",
    "authoredBy" : [ {
      "uri" : "mailto:crusoe@ucdavis.edu",
      "name" : "Michael R. Crusoe"
    }, {
      "uri" : "mailto:peter.amstutz@curoverse.com",
      "name" : "Peter Amstutz"
    } ],
    "retrievedFrom" : "https://raw.githubusercontent.com/common-workflow-language/workflows/lobstr-v1/workflows/lobSTR/README",
    "retrievedBy" : {
      "uri" : "https://view.commonwl.org",
      "name" : "Common Workflow Language Viewer"
    },
    "bundledAs" : {
      "uri" : "urn:uuid:ed54c4d6-c585-4dc9-b7bc-0cf299e20b91",
      "folder" : "/workflow/"
    }
  }, {
    "uri" : "/workflow/tmp_1.fq",
    "mediatype" : "application/octet-stream",
    "createdOn" : "2017-08-24T10:57:47.738Z",
    "authoredBy" : [ {
      "uri" : "mailto:peter.amstutz@curoverse.com",
      "name" : "Peter Amstutz"
    } ],
    "retrievedFrom" : "https://raw.githubusercontent.com/common-workflow-language/workflows/lobstr-v1/workflows/lobSTR/tmp_1.fq",
    "retrievedBy" : {
      "uri" : "https://view.commonwl.org",
      "name" : "Common Workflow Language Viewer"
    },
    "bundledAs" : {
      "uri" : "urn:uuid:5d431f81-ad0b-4acf-903a-9d5aa03b04df",
      "folder" : "/workflow/"
    }
  }, {
    "uri" : "/visualisation.png",
    "mediatype" : "image/png",
    "createdOn" : "2017-08-24T10:57:47.801Z",
    "retrievedFrom" : "https://view.commonwl.org/graph/png/github.com/common-workflow-language/workflows/blob/lobstr-v1/workflows/lobSTR/lobSTR-workflow.cwl",
    "bundledAs" : {
      "uri" : "urn:uuid:ff9ace37-e76c-49f8-8d36-60f11ff6d257",
      "folder" : "/"
    }
  }, {
    "uri" : "/visualisation.svg",
    "mediatype" : "image/svg+xml",
    "createdOn" : "2017-08-24T10:57:47.821Z",
    "retrievedFrom" : "https://view.commonwl.org/graph/svg/github.com/common-workflow-language/workflows/blob/lobstr-v1/workflows/lobSTR/lobSTR-workflow.cwl",
    "bundledAs" : {
      "uri" : "urn:uuid:a6cfb437-8818-4ab2-9081-efc74c5109e8",
      "folder" : "/"
    }
  } ],
  "annotations" : [ {
    "uri" : "urn:uuid:9f602fff-b280-41c5-9590-ab95a49c85ad",
    "about" : "/",
    "content" : "annotations/merged.cwl"
  }, {
    "uri" : "urn:uuid:0ce4b727-ff61-4534-9afb-e3d676d2782d",
    "about" : "/",
    "content" : "annotations/workflow.ttl"
  } ]
}
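
Because the manifest above is plain JSON, it can be queried with a few lines of code. The sketch below uses a shortened excerpt of the `aggregates` list (three entries, with most keys omitted) and lists the resources that declare conformance to CWL v1.0; `conforming_to` is a hypothetical helper name.

```python
import json

# Shortened excerpt of the manifest above; most keys are omitted.
MANIFEST_EXCERPT = """
{
  "aggregates": [
    {"uri": "/workflow/tmp_2.fq", "mediatype": "application/octet-stream"},
    {"uri": "/workflow/samtools-sort.cwl", "mediatype": "text/x-yaml",
     "conformsTo": "https://w3id.org/cwl/v1.0"},
    {"uri": "/workflow/lobSTR-workflow.cwl", "mediatype": "text/x-yaml",
     "conformsTo": "https://w3id.org/cwl/v1.0"}
  ]
}
"""


def conforming_to(manifest, standard):
    """Return the URIs of aggregated resources that conform to `standard`."""
    return [
        resource["uri"]
        for resource in manifest.get("aggregates", [])
        if resource.get("conformsTo") == standard
    ]


manifest = json.loads(MANIFEST_EXCERPT)
cwl_files = conforming_to(manifest, "https://w3id.org/cwl/v1.0")
print(cwl_files)
```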
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: Workflow

label: "Hello World"
doc: "Outputs a message using echo"

inputs: []

outputs:
  response:
    outputSource: step0/response
    type: File

steps:
  step0:
    run:
      class: CommandLineTool
      inputs:
        message:
          type: string
          doc: "The message to print"
          default: "Hello World"
          inputBinding:
            position: 1
      baseCommand: echo
      stdout: response.txt
      outputs:
        response:
          type: stdout
    in: []
    out: [response]
pip install cwl-runner

CWLProv

Workflow provenance as Research Object

$ cwlprov --help
usage: cwlprov [-h] [--version] [--directory DIRECTORY] [--relative]
            [--absolute] [--output OUTPUT] [--verbose] [--quiet] [--hints]
            [--no-hints]
            {validate,info,who,prov,inputs,outputs,run,runs,rerun,derived,runtimes}
            ...

cwlprov explores Research Objects containing provenance of Common Workflow
Language executions. <https://w3id.org/cwl/prov/>

commands:
{validate,info,who,prov,inputs,outputs,run,runs,rerun,derived,runtimes}
    validate            Validate the CWLProv Research Object
    info                show research object Metadata
    who                 show Who ran the workflow
    prov                export workflow execution Provenance in PROV format
    inputs              list workflow/step Input files/values
    outputs             list workflow/step Output files/values
    run                 show workflow Execution log
    runs                List all workflow executions in RO
    rerun               Rerun a workflow or step
    derived             list what was Derived from a data item, based on
                        activity usage/generation
    runtimes            calculate average step execution Runtimes

(venv3) stain@biggie:~/src/cwlprov-py/test/nested-cwlprov-0.3.0$ cwlprov run
2018-08-08 22:44:06.573330 Flow 39408a40-c1c8-4852-9747-87249425be1e [ Run of workflow/packed.cwl#main 
2018-08-08 22:44:06.691722 Step 4f082fb6-3e4d-4a21-82e3-c685ce3deb58   Run of workflow/packed.cwl#main/create-tar  (0:00:00.010133)
2018-08-08 22:44:06.702976 Step 0cceeaf6-4109-4f08-940b-f06ac959944a * Run of workflow/packed.cwl#main/compile  (unknown duration)
2018-08-08 22:44:12.680097 Flow 39408a40-c1c8-4852-9747-87249425be1e ] Run of workflow/packed.cwl#main  (0:00:06.106767)
Legend:
[ Workflow start
* Nested provenance, use UUID to explore: cwlprov run 0cceeaf6-4109-4f08-940b-f06ac959944a
] Workflow end

(venv3) stain@biggie:~/src/cwlprov-py/test/nested-cwlprov-0.3.0$ cwlprov run 0cceeaf6-4109-4f08-940b-f06ac959944a
2018-08-08 22:44:06.607210 Flow 0cceeaf6-4109-4f08-940b-f06ac959944a [ Run of workflow/packed.cwl#main 
2018-08-08 22:44:06.707070 Step 83752ab4-8227-4d4a-8baa-78376df34aed   Run of workflow/packed.cwl#main/untar  (0:00:00.008149)
2018-08-08 22:44:06.718554 Step f56d8478-a190-4251-84d9-7f69fe0f6f8b   Run of workflow/packed.cwl#main/argument  (0:00:00.532052)
2018-08-08 22:44:07.251588 Flow 0cceeaf6-4109-4f08-940b-f06ac959944a ] Run of workflow/packed.cwl#main  (0:00:00.644378)
Legend:
[ Workflow start
] Workflow end
stain@biggie:~/src/cwlprov-py/test/nested-cwlprov-0.3.0$ cwlprov outputs 4f082fb6-3e4d-4a21-82e3-c685ce3deb58 --format=files
Output tar:
data/c0/c0fd5812fe6d8d91fef7f4f1ba3a462500fce0c5

stain@biggie:~/src/cwlprov-py/test/nested-cwlprov-0.3.0$ tar tfv `cwlprov -q outputs 4f082fb6-3e4d-4a21-82e3-c685ce3deb58 --format=files`
-rw-r--r-- stain/stain     115 2018-08-08 23:44 Hello.java

Inspecting step runs
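
The `cwlprov run` listing above is regular enough to post-process. As a rough sketch, assuming the line layout stays exactly as shown in this session (which is an assumption, not a documented format), step lines can be parsed into structured records. The regular expression and the `parse_steps` helper are illustrative only.

```python
import re
from datetime import timedelta

# Two step lines copied from the session above.
LOG = """\
2018-08-08 22:44:06.691722 Step 4f082fb6-3e4d-4a21-82e3-c685ce3deb58   Run of workflow/packed.cwl#main/create-tar  (0:00:00.010133)
2018-08-08 22:44:06.702976 Step 0cceeaf6-4109-4f08-940b-f06ac959944a * Run of workflow/packed.cwl#main/compile  (unknown duration)
"""

# date, time, "Step", UUID, marker ("*" for nested provenance), name, duration.
STEP = re.compile(
    r"^\S+ \S+ Step (?P<uuid>[0-9a-f-]{36}) (?P<nested>[* ]) "
    r"Run of (?P<name>\S+)\s+\((?P<duration>[^)]*)\)$"
)


def parse_steps(log):
    """Extract step UUID, name, nested flag and duration from the listing."""
    steps = []
    for line in log.splitlines():
        match = STEP.match(line)
        if not match:
            continue  # Skip workflow start/end lines and the legend.
        duration = None
        if match["duration"] != "unknown duration":
            hours, minutes, seconds = match["duration"].split(":")
            duration = timedelta(
                hours=int(hours), minutes=int(minutes), seconds=float(seconds)
            )
        steps.append(
            {
                "uuid": match["uuid"],
                "name": match["name"],
                "nested": match["nested"] == "*",
                "duration": duration,
            }
        )
    return steps


steps = parse_steps(LOG)
for step in steps:
    print(step["name"], step["nested"], step["duration"])
```

For anything beyond a quick look, the PROV export (`cwlprov prov`) is likely a sturdier source than scraping human-readable output.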

Who is using Research Objects?

2018-11-01 Making reproducible data packages

By Stian Soiland-Reyes

Presented at Open Science @ University College Cork on 2018-11-01
