Mark Robinson
Supervisor: Carole Goble
The original data can be analysed to obtain the same results as the original study
Reproducibility is important because the data are the only thing about a study that can be guaranteed
Reproducibility is critical - one of the main principles of the scientific method
Source: https://xkcd.com/242/
"Computation-based science publication is currently a doubtful enterprise because there is not enough support for identifying and rooting out sources of error in computational work"
Donoho (Biostatistics, 2010)
Portability, Preservation: Packaging, Containers
Access: Standards, Common APIs, Licensing, IDs
Description: Standards, Common Metadata
Robustness, Versioning: Change, Variation Sensitivity, Discrepancy Handling
Provenance: Steps, Dependencies
A series of computational/data management steps
Very useful in bioinformatics due to its large-scale, repetitive processes
An example of a simple workflow that retrieves a weather forecast for the specified city
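A minimal sketch of such a workflow, written in the CWL style used later in this talk (the tool files and names here are hypothetical):

cwlVersion: v1.0
class: Workflow
inputs:
  city: string                  # city to look up
outputs:
  forecast:
    type: File
    outputSource: format/report
steps:
  fetch:                        # hypothetical tool: queries a forecast service
    run: fetch-forecast.cwl
    in:
      location: city
    out: [raw_forecast]
  format:                       # hypothetical tool: renders a readable report
    run: format-report.cwl
    in:
      data: fetch/raw_forecast
    out: [report]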
A useful paradigm to describe, manage, and share complex scientific analyses
Apache Taverna "Why Use Workflows"
Important that it 'just runs'
Workflows used for Variant calling, RNA sequencing and small RNA analysis
https://bcbio-nextgen.readthedocs.io/
Manage command line tools and web services
Infrastructure to set up, execute, and monitor scientific workflows
Allows analysis of intermediate steps and complete provenance information
Basically does the input/output 'plumbing' and 'rewiring' if tools are swapped out or workflow changes
Common to have a series of tools written by different people used together
The field moves too quickly for one end-to-end tool
There are even directories of these tools, e.g. https://bio.tools/
Often a need to swap tools in and out to try different techniques without changing the whole workflow
But there are hundreds of these tools, with no standards for their inputs and outputs
This highlights the need for a workflow management system to manage these conversions and changes
Many such systems exist, e.g. YesWorkflow, and many, many more
CWL is a community-led standard way of expressing and running workflows
Competing standards make collaboration difficult
Workflows written in YAML or JSON
CWL: Participating Organisations
cwlVersion: v1.0
class: Workflow
inputs:
  inp: File
  ex: string

outputs:
  classout:
    type: File
    outputSource: compile/classfile

steps:
  untar:
    run: tar-param.cwl
    in:
      tarfile: inp
      extractfile: ex
    out: [example_out]

  compile:
    run: arguments.cwl
    in:
      src: untar/example_out
    out: [classfile]
http://www.commonwl.org/v1.0/UserGuide.html#First_workflow
Extracts a Java source file from a tar file and then compiles it
External Tools
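The untar step's tool description, adapted from the same CWL user guide example (tar-param.cwl; details may differ slightly from the original):

cwlVersion: v1.0
class: CommandLineTool
baseCommand: [tar, xf]
inputs:
  tarfile:
    type: File
    inputBinding:
      position: 1
  extractfile:
    type: string
    inputBinding:
      position: 2
outputs:
  example_out:
    type: File
    outputBinding:
      glob: $(inputs.extractfile)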
inp:
  class: File
  path: hello.tar
ex: Hello.java
The input object above is workflow-job.yml, used to run workflow.cwl
$ echo "public class Hello {}" > Hello.java && tar -cvf hello.tar Hello.java
$ cwl-runner workflow.cwl workflow-job.yml
[job untar] /tmp/tmp94qFiM$ tar xf /home/example/hello.tar Hello.java
[step untar] completion status is success
[job compile] /tmp/tmpu1iaKL$ docker run -i --volume=/tmp/tmp94qFiM/Hello.java:/var/lib/cwl/job301600808_tmp94qFiM/Hello.java:ro --volume=/tmp/tmpu1iaKL:/var/spool/cwl:rw --volume=/tmp/tmpfZnNdR:/tmp:rw --workdir=/var/spool/cwl --read-only=true --net=none --user=1001 --rm --env=TMPDIR=/tmp java:7 javac -d /var/spool/cwl /var/lib/cwl/job301600808_tmp94qFiM/Hello.java
[step compile] completion status is success
[workflow workflow.cwl] outdir is /home/example
Final process status is success
{
  "classout": {
    "location": "/home/example/Hello.class",
    "checksum": "sha1$e68df795c0686e9aa1a1195536bd900f5f417b18",
    "class": "File",
    "size": 416
  }
}
Hello.class produced
Provenance information for the outputs is also given in the JSON above (checksum, size)
But how do we interpret these outputs, especially with increasing numbers of them?
Linked data can help with this problem: it can be used to describe these outputs, and more, within CWL
Linked Data is about using the Web to connect related data that wasn't previously linked
Source: http://linkeddata.org/
Enables data from different sources to be connected and queried in a way which can be read automatically by computers
Can be expressed as JSON-LD (below) or in other RDF serialisations such as RDF/XML or Turtle
{
  "@context": "http://json-ld.org/contexts/person.jsonld",
  "@id": "http://dbpedia.org/resource/John_Lennon",
  "name": "John Lennon",
  "born": "1940-10-09",
  "spouse": "http://dbpedia.org/resource/Cynthia_Lennon"
}
{
  "@context": {
    "Person": "http://xmlns.com/foaf/0.1/Person",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "name": "http://xmlns.com/foaf/0.1/name",
    "nickname": "http://xmlns.com/foaf/0.1/nick",
    "affiliation": "http://schema.org/affiliation",
    "depiction": { "@id": "http://xmlns.com/foaf/0.1/depiction", "@type": "@id" },
    "image": { "@id": "http://xmlns.com/foaf/0.1/img", "@type": "@id" },
    "born": { "@id": "http://schema.org/birthDate", "@type": "xsd:dateTime" },
    "child": { "@id": "http://schema.org/children", "@type": "@id" },
    "colleague": { "@id": "http://schema.org/colleagues", "@type": "@id" },
    "knows": { "@id": "http://xmlns.com/foaf/0.1/knows", "@type": "@id" },
    "died": { "@id": "http://schema.org/deathDate", "@type": "xsd:dateTime" },
    "email": { "@id": "http://xmlns.com/foaf/0.1/mbox", "@type": "@id" },
    "familyName": "http://xmlns.com/foaf/0.1/familyName",
    "givenName": "http://xmlns.com/foaf/0.1/givenName",
    "gender": "http://schema.org/gender",
    "homepage": { "@id": "http://xmlns.com/foaf/0.1/homepage", "@type": "@id" },
    "honorificPrefix": "http://schema.org/honorificPrefix",
    "honorificSuffix": "http://schema.org/honorificSuffix",
    "jobTitle": "http://xmlns.com/foaf/0.1/title",
    "nationality": "http://schema.org/nationality",
    "parent": { "@id": "http://schema.org/parent", "@type": "@id" },
    "sibling": { "@id": "http://schema.org/sibling", "@type": "@id" },
    "spouse": { "@id": "http://schema.org/spouse", "@type": "@id" },
    "telephone": "http://schema.org/telephone",
    "Address": "http://www.w3.org/2006/vcard/ns#Address",
    "address": "http://www.w3.org/2006/vcard/ns#address",
    "street": "http://www.w3.org/2006/vcard/ns#street-address",
    "locality": "http://www.w3.org/2006/vcard/ns#locality",
    "region": "http://www.w3.org/2006/vcard/ns#region",
    "country": "http://www.w3.org/2006/vcard/ns#country",
    "postalCode": "http://www.w3.org/2006/vcard/ns#postal-code"
  }
}
@context is used to map terms to URIs, and here provides the standard that the fields must meet
@id uniquely identifies the resource with a URI
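Applying the context, JSON-LD expansion rewrites the example above roughly as follows, with every term replaced by its full URI (simplified from the exact expanded form):

{
  "@id": "http://dbpedia.org/resource/John_Lennon",
  "http://xmlns.com/foaf/0.1/name": "John Lennon",
  "http://schema.org/birthDate": {
    "@value": "1940-10-09",
    "@type": "http://www.w3.org/2001/XMLSchema#dateTime"
  },
  "http://schema.org/spouse": {
    "@id": "http://dbpedia.org/resource/Cynthia_Lennon"
  }
}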
Portability, Preservation: Packaging, Containers
Common Workflow Language supports the use of Docker containers as an environment
Packages an application into a standardised unit containing everything needed to run
Workflows are designed to be shared
Ensures the software runs the same way regardless of environment
Contains everything needed to run: code, runtime, system tools, system libraries – anything that can be installed on a server
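This is how the compile step earlier got its java:7 container: the tool description declares a DockerRequirement. A sketch adapted from the user guide's arguments.cwl (simplified):

cwlVersion: v1.0
class: CommandLineTool
baseCommand: javac
hints:
  DockerRequirement:
    dockerPull: java:7          # the step runs inside this image
arguments: ["-d", $(runtime.outdir)]
inputs:
  src:
    type: File
    inputBinding:
      position: 1
outputs:
  classfile:
    type: File
    outputBinding:
      glob: "*.class"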
Description: Standards, Common Metadata
Robustness, Versioning: Change, Variation Sensitivity, Discrepancy Handling
No Standards for Description
No Versioning or File Integrity Verification
However, there is an entire project, developed by Wf4Ever, focused on writing manifests for collections of research data: Research Objects
Executable and complete description of how computationally derived research results were made
This is the missing piece of the puzzle for reproducible workflows
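For illustration, a rough sketch of the kind of manifest (.ro/manifest.json) an RO Bundle carries; the vocabulary comes from the bundle specification, while the values here are hypothetical:

{
  "@context": ["https://w3id.org/bundle/context"],
  "id": "/",
  "createdOn": "2016-11-01T10:00:00Z",
  "createdBy": { "name": "Mark Robinson" },
  "aggregates": [
    { "uri": "/workflow/workflow.cwl", "mediatype": "text/x-yaml" },
    { "uri": "http://example.com/external/tool.cwl" }
  ],
  "annotations": [
    { "about": "/workflow/workflow.cwl", "content": "/metadata/workflow.ttl" }
  ]
}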
A 'Profile' defines an expectation of the purpose of the Research Object
What files and metadata should be expected
Assumptions which can be safely made about contents
But how do we define a profile?
Description: Standards, Common Metadata
In a way the idea is too flexible
Needs to account for various fields/software
E.g. found during my work so far:
The concept of a 'main workflow' vs nested workflows, sketched below
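A minimal sketch of a nested workflow (file names hypothetical): a step's run target can itself be a complete workflow, which in CWL requires the SubworkflowFeatureRequirement:

cwlVersion: v1.0
class: Workflow
requirements:
  - class: SubworkflowFeatureRequirement
inputs:
  data: File
outputs:
  result:
    type: File
    outputSource: analysis/final_out
steps:
  analysis:
    run: inner-workflow.cwl     # a nested workflow, not a single tool
    in:
      raw: data
    out: [final_out]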
User enters the GitHub URL of a directory containing the workflow
Each CWL file is parsed and the main workflow is found
Inputs, outputs and step details are collected
Research Object bundle is constructed
Page is created for the workflow with visualisation, details and download link
Pipeline: GitHub API → Parse CWL → Visualisation → Construct RO
Existing libraries are available for each step
Examples for inputs, shorthand vs expanded form:

inputs:
  input1: string
  input2: int

inputs:
  - id: input1
    type: string
  - id: input2
    type: int

inputs:
  input1: string[]

inputs:
  - id: input1
    type:
      type: array
      items: string

inputs:
  input1: float?

inputs:
  - id: input1
    type: ["null", "float"]
http://www.commonwl.org/v1.0/SchemaSalad.html
E.g. field name resolution. Given this schema:

{
  "$namespaces": {
    "acid": "http://example.com/acid#"
  },
  "$graph": [{
    "name": "ExampleType",
    "type": "record",
    "fields": [{
      "name": "base",
      "type": "string",
      "jsonldPredicate": "http://example.com/base"
    }]
  }]
}

this document:

{
  "base": "one",
  "form": {
    "http://example.com/base": "two",
    "http://example.com/three": "three"
  },
  "acid:four": "four"
}

resolves to:

{
  "base": "one",
  "form": {
    "base": "two",
    "http://example.com/three": "three"
  },
  "http://example.com/acid#four": "four"
}
https://github.com/MarkRobbo/CWLViewer
This will be a visual tool for sharing workflows, so a human-readable visualisation is essential
User Story
As a user I want to be able to view a graphic visualisation of the workflow along with its details so that I can easily see at a glance what the workflow does and the steps which make it up
An RO bundle is a zip container for Research Objects which makes a workflow easy to download and provides the helpful manifest previously discussed
User Story
As a user I want to be able to download the workflow I am viewing in the form of a Research Object Bundle so that I can easily run it and understand the process/results by viewing useful metadata such as identification, attribution etc in the manifest
Supporting linked data is an important aspect of supporting all possible kinds of workflow which could be imported into the site
User Stories
As a user I want to be able to import workflows which utilise linked data in Common Workflow Language so that the website can be used to share them
As a user I want to be able to view information about linked data contained within workflows so that I can easily understand the context under which they are run and the results are resolved
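One concrete way CWL workflows use linked data: input files can declare their format as an ontology IRI, resolved through $namespaces (the tool here is hypothetical; the format field and the EDAM ontology are real):

cwlVersion: v1.0
class: CommandLineTool
$namespaces:
  edam: http://edamontology.org/
baseCommand: [wc, -l]
inputs:
  sequences:
    type: File
    format: edam:format_1929    # FASTA, a linked-data identifier
    inputBinding:
      position: 1
outputs: []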
Because externally linked data can change over time and it is important to have a stable reference to a workflow, these resources can be saved and linked locally
User Story
As a user I want to be able to download a Research Object Bundle of a workflow where the externally linked resources are frozen at a particular time, so that the contents are stable from when they were originally referenced and later changes will not prevent the workflow from running or disrupt its results
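One plausible mechanism for this, extending the manifest sketch shown earlier with the bundle vocabulary's bundledAs to pair a remote URI with a local snapshot (paths hypothetical):

"aggregates": [{
  "uri": "http://example.com/vocab/context.jsonld",
  "bundledAs": {
    "folder": "/snapshots/",
    "filename": "context.jsonld"
  }
}]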
I have also engaged with the CWL community and the contributors have agreed to host the finished product at:
http://view.commonwl.org/
This means I can also get ongoing feedback on the state of the application as development progresses