Reproducible Research using Research Objects
Mark Robinson
Supervisor: Carole Goble
Reproducibility in Research
The original data can be analyzed to obtain the same results as the original study
Reproducibility is important because the data is the only thing that can be guaranteed about a study
Reproducibility is critical - one of the main principles of the scientific method
Source: https://xkcd.com/242/
Reproducibility Dimensions
- Portability, Preservation: Packaging, Containers
- Access: Standards, Common APIs, Licensing, IDs
- Description: Standards, Common Metadata
- Robustness, Versioning: Change, Variation Sensitivity, Discrepancy Handling
- Provenance: Steps, Dependencies
Scientific Workflows
Very useful in Bioinformatics due to large scale and repetitive processes
Series of Computational/Data Management Steps
An example of a simple workflow that retrieves a weather forecast for the specified city
A useful paradigm to describe, manage, and share complex scientific analyses
Apache Taverna "Why Use Workflows"
Important that it 'just runs'
Example: bcbio
Workflows used for Variant calling, RNA sequencing and small RNA analysis
https://bcbio-nextgen.readthedocs.io/
My Project
Develop a web application to visualise, package and share workflows
Description for workflows and their tools
Metadata description - manifests for containers of files
Bioinformatics Use
Common to have a series of tools written by different people used together
Moves too quickly for one end to end tool
Hundreds of tools available, with directories that catalogue them, e.g. https://bio.tools/
Workflow Management Systems
Manage command line tools and web services
Infrastructure to setup, execute, and monitor scientific workflows
Allows analysis of intermediate steps and complete provenance information
Handles the input/output 'plumbing', and the 'rewiring' when tools are swapped out or the workflow changes
Workflow Management Systems
+ Many, many more
YesWorkflow
The Common Workflow Language (CWL) is a community-led standard for expressing and running workflows and command-line tools
Competing standards make collaboration difficult
Written in YAML or JSON
Description: Standards, Common Metadata
cwlVersion: v1.0
class: Workflow
inputs:
  inp: File
  ex: string
outputs:
  classout:
    type: File
    outputSource: compile/classfile
steps:
  untar:
    run: tar-param.cwl
    in:
      tarfile: inp
      extractfile: ex
    out: [example_out]
  compile:
    run: arguments.cwl
    in:
      src: untar/example_out
    out: [classfile]
CWL: Expressing Workflows
http://www.commonwl.org/v1.0/UserGuide.html#First_workflow
Extracts a java source file from a tar file and then compiles it
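The wiring in the workflow above also fixes the order in which steps can run: a source written as step/output means that step must finish first. A minimal sketch of deriving that order, assuming the steps have already been loaded into a Python dict (the dict here is written by hand, and the function names are hypothetical, not part of any CWL tool):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hand-written mirror of the "steps" section of the workflow above;
# only the "in" sources matter for ordering.
steps = {
    "untar": {"in": {"tarfile": "inp", "extractfile": "ex"}},
    "compile": {"in": {"src": "untar/example_out"}},
}


def step_dependencies(steps):
    """Map each step to the set of steps whose outputs it consumes."""
    deps = {}
    for name, step in steps.items():
        deps[name] = {
            source.split("/", 1)[0]
            for source in step["in"].values()
            if "/" in source  # "step/output" refers to another step
        }
    return deps


def execution_order(steps):
    """Return one valid order in which the steps can run."""
    return list(TopologicalSorter(step_dependencies(steps)).static_order())


print(execution_order(steps))
```

Sources without a slash (inp, ex) are workflow-level inputs, so they add no dependency between steps.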
External Tools
workflow-job.yml (inputs for workflow.cwl):
inp:
  class: File
  path: hello.tar
ex: Hello.java
CWL: Running a Workflow
http://www.commonwl.org/v1.0/UserGuide.html#First_workflow
$ echo "public class Hello {}" > Hello.java && tar -cvf hello.tar Hello.java
$ cwl-runner workflow.cwl workflow-job.yml
[job untar] /tmp/tmp94qFiM$ tar xf /home/example/hello.tar Hello.java
[step untar] completion status is success
[job compile] /tmp/tmpu1iaKL$ docker run -i --volume=/tmp/tmp94qFiM/Hello.java:/var/lib/cwl/job301600808_tmp94qFiM/Hello.java:ro --volume=/tmp/tmpu1iaKL:/var/spool/cwl:rw --volume=/tmp/tmpfZnNdR:/tmp:rw --workdir=/var/spool/cwl --read-only=true --net=none --user=1001 --rm --env=TMPDIR=/tmp java:7 javac -d /var/spool/cwl /var/lib/cwl/job301600808_tmp94qFiM/Hello.java
[step compile] completion status is success
[workflow workflow.cwl] outdir is /home/example
Final process status is success
{
  "classout": {
    "location": "/home/example/Hello.class",
    "checksum": "sha1$e68df795c0686e9aa1a1195536bd900f5f417b18",
    "class": "File",
    "size": 416
  }
}
Hello.class produced
Provenance information for outputs also given
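The checksum in the output record can be used to confirm that a file is the one the run produced. A minimal sketch, assuming the checksum always has the form algorithm$hexdigest as in the output above; verify_checksum is a hypothetical helper, not part of cwl-runner:

```python
import hashlib
import tempfile
from pathlib import Path


def verify_checksum(path, checksum):
    """Check a file against an 'algorithm$hexdigest' checksum string."""
    algorithm, expected = checksum.split("$", 1)
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so large outputs do not need to fit in memory
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected


# Demonstration on a throwaway file rather than the real Hello.class
with tempfile.TemporaryDirectory() as workdir:
    sample = Path(workdir) / "sample.bin"
    sample.write_bytes(b"example contents")
    recorded = "sha1$" + hashlib.sha1(b"example contents").hexdigest()
    print(verify_checksum(sample, recorded))
```

For the run above, the call would take the "location" and "checksum" values from the classout record.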
CWL: Portability
Portability, Preservation: Packaging, Containers
CWL Workflows can be packaged into various containers for sharing
Workflows are designed to be shared
Zip
BagIt
Problem with CWL
- No manifest for workflows
- Missing useful metadata about the workflow
Description (Standards, Common Metadata): No Standards for Metadata
Robustness, Versioning (Change, Variation Sensitivity, Discrepancy Handling): No Versioning or File Integrity Verification
However, there is an entire project focused on writing manifests for collections of research data: Research Objects, developed by Wf4Ever
Executable and complete description of how computationally derived research results were made
My Project
This is the missing piece of the puzzle for reproducible workflows
Research Object Profiles
A 'Profile' defines an expectation of the purpose of the Research Object
What files and metadata should be expected
Assumptions which can be safely made about contents
But how do we define a profile?
Description: Standards, Common Metadata
In a way the idea is too flexible
Needs to account for various fields/software
Solution
Top-Down Approach
- Find the format used in an existing workflow tool, e.g. Apache Taverna
- Use this to create a specification for the manifest in a general case
Bottom-Up Approach
- Take a workflow and parse it to attempt to find details which are already available
- What would be useful extra information to have in a general case?
My Project
Help define a Workflow Research Object Profile
Develop a web application to visualise, package and share CWL workflows
CWL Viewer - Architecture
- Model-View-Controller Java Web Application
- Spring Framework
- MongoDB - NoSQL Database
  - Flexible schema
  - CWL is already parsed to JSON format
CWL Viewer - Basic Flow
- User enters a GitHub URL of a directory to get the workflow from
- Each CWL file is parsed and the main workflow is found
- Inputs, outputs and step details are collected
- Research Object bundle is constructed
- Page is created for the workflow with visualisation, details and download link
Diagram stages: GitHub API, Parse CWL, Visualisation, Construct RO
Diagram annotations: Existing Libraries are Available; Async; Add to Page
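The first stage of the flow, turning the URL the user enters into something the GitHub API can be asked about, can be sketched as follows. This is an illustration only, not the application's Java code: it assumes directory URLs of the form https://github.com/<owner>/<repo>/tree/<branch>/<path>, and the function name and example URL are hypothetical.

```python
from urllib.parse import urlparse


def parse_github_directory_url(url):
    """Split a GitHub directory URL into owner, repo, branch and path."""
    parsed = urlparse(url)
    if parsed.netloc not in ("github.com", "www.github.com"):
        raise ValueError("not a github.com URL: " + url)
    parts = [part for part in parsed.path.split("/") if part]
    if len(parts) < 2:
        raise ValueError("expected at least /<owner>/<repo>: " + url)
    owner, repo = parts[0], parts[1]
    branch, path = None, ""  # None means "use the repository default"
    if len(parts) >= 4 and parts[2] == "tree":
        branch = parts[3]
        path = "/".join(parts[4:])
    return {"owner": owner, "repo": repo, "branch": branch, "path": path}


# Hypothetical URL, used only to show the shape of the result
print(parse_github_directory_url(
    "https://github.com/example-owner/example-repo/tree/master/workflows/demo"
))
```

A branch name that itself contains a slash cannot be told apart from the start of the path by splitting alone, so a real implementation would need to check candidate branch names against the repository.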
RO Profile
- Currently not formally defined
- Must first define what a profile 'is' and the scope of what it defines
- Difficulty in deciding what is useful and should be contained in a manifest outside of the needs of this application
Parsing CWL
- Ambiguity
- Easier to write by hand, hard to parse
Examples for Inputs (each shorthand form is followed by its expanded equivalent):
inputs:
  input1: string
  input2: int
inputs:
  - id: input1
    type: string
  - id: input2
    type: int
inputs:
  input1: string[]
inputs:
  - id: input1
    type:
      type: array
      items: string
inputs:
  input1: float?
inputs:
  - id: input1
    type: ["null", float]
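One way to cope with this ambiguity is to normalise every form into a single shape before using it. A minimal sketch, assuming the CWL file has already been loaded into Python dicts and lists, and covering only the '?' and '[]' suffixes and the map/list forms of inputs shown above; the function names are hypothetical:

```python
def expand_type(type_spec):
    """Expand the '?' and '[]' shorthand suffixes of a type string."""
    if not isinstance(type_spec, str):
        return type_spec  # already in expanded form
    if type_spec.endswith("?"):
        return ["null", expand_type(type_spec[:-1])]
    if type_spec.endswith("[]"):
        return {"type": "array", "items": expand_type(type_spec[:-2])}
    return type_spec


def normalise_inputs(inputs):
    """Return inputs as a list of {'id': ..., 'type': ...} records."""
    if isinstance(inputs, dict):  # map form: id -> type, or id -> record
        records = []
        for input_id, value in inputs.items():
            record = dict(value) if isinstance(value, dict) else {"type": value}
            record["id"] = input_id
            records.append(record)
    else:  # list form: each record already carries its own id
        records = [dict(record) for record in inputs]
    for record in records:
        record["type"] = expand_type(record.get("type"))
    return records


print(normalise_inputs({"input1": "string[]", "input2": "float?"}))
```

After this step the rest of a parser only has to handle the expanded list form.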
Parsing CWL
- Linked Data
- Resolving for external resources/vocabulary
- Adds a large amount of complexity
http://www.commonwl.org/v1.0/SchemaSalad.html
E.g. field name resolution
Schema:
{
  "$namespaces": {
    "acid": "http://example.com/acid#"
  },
  "$graph": [{
    "name": "ExampleType",
    "type": "record",
    "fields": [{
      "name": "base",
      "type": "string",
      "jsonldPredicate": "http://example.com/base"
    }]
  }]
}
Document before resolution:
{
  "base": "one",
  "form": {
    "http://example.com/base": "two",
    "http://example.com/three": "three"
  },
  "acid:four": "four"
}
Document after resolution:
{
  "base": "one",
  "form": {
    "base": "two",
    "http://example.com/three": "three"
  },
  "http://example.com/acid#four": "four"
}
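The renaming in the field name resolution example can be sketched in a few lines. This is a simplification under stated assumptions: the predicate and namespace tables are written by hand rather than read from the schema, and the renaming is applied to every object in the document, whereas Schema Salad scopes it to the fields of the relevant record type. resolve_field_names is a hypothetical name.

```python
def resolve_field_names(value, predicates, namespaces):
    """Recursively rewrite object keys.

    Keys equal to a known predicate URI become the short field name;
    keys with a known namespace prefix are expanded to a full URI.
    """
    if isinstance(value, list):
        return [resolve_field_names(v, predicates, namespaces) for v in value]
    if not isinstance(value, dict):
        return value
    resolved = {}
    for key, item in value.items():
        prefix, separator, rest = key.partition(":")
        if key in predicates:
            key = predicates[key]
        elif separator and prefix in namespaces:
            key = namespaces[prefix] + rest
        resolved[key] = resolve_field_names(item, predicates, namespaces)
    return resolved


# Tables taken from the example schema
predicates = {"http://example.com/base": "base"}
namespaces = {"acid": "http://example.com/acid#"}

document = {
    "base": "one",
    "form": {
        "http://example.com/base": "two",
        "http://example.com/three": "three",
    },
    "acid:four": "four",
}

print(resolve_field_names(document, predicates, namespaces))
```

Keys such as http://example.com/three are left alone because "http" is not a declared namespace prefix.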
Visualisation
- Visualisation of directed acyclic graphs is a complex problem to solve
- Workflow has a deep structure
- Issue of information density vs readability
- Drilldown into more details?
Visualisation
- Hard to evaluate success
- Human factor - must meet the needs of the audience
- Make it look like familiar existing tools or try to find a better format?
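One low-effort starting point for the graph is to emit Graphviz DOT text and let an existing layout engine draw it. A minimal sketch, assuming the workflow has already been parsed into dicts; the data below mirrors the untar/compile example by hand, and workflow_to_dot is a hypothetical helper, not the approach the application is committed to:

```python
def workflow_to_dot(inputs, steps, outputs):
    """Build Graphviz DOT text for a workflow's inputs, steps and outputs."""
    lines = ["digraph workflow {", "  rankdir=TB;"]
    for name in inputs:
        lines.append(f'  "{name}" [shape=box];')
    for name in steps:
        lines.append(f'  "{name}" [shape=ellipse];')
    for name in outputs:
        lines.append(f'  "{name}" [shape=box];')
    # An edge runs from whatever supplies a value to the step that uses it
    for name, step in steps.items():
        for source in step["in"].values():
            origin = source.split("/", 1)[0]
            lines.append(f'  "{origin}" -> "{name}";')
    for name, output in outputs.items():
        origin = output["outputSource"].split("/", 1)[0]
        lines.append(f'  "{origin}" -> "{name}";')
    lines.append("}")
    return "\n".join(lines)


inputs = ["inp", "ex"]
steps = {
    "untar": {"in": {"tarfile": "inp", "extractfile": "ex"}},
    "compile": {"in": {"src": "untar/example_out"}},
}
outputs = {"classout": {"outputSource": "compile/classfile"}}

print(workflow_to_dot(inputs, steps, outputs))
```

This draws one node per step, which keeps the graph readable; showing individual ports would raise the information density question above.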
So Far...
- Java MVC Web Application
- GitHub URL Parsing
- GitHub API Functionality (Fetching Files)
- Basic CWL Parsing
https://github.com/MarkRobbo/CWLViewer
Planned Milestone 1
- Downloadable Research Object Bundle
An RO bundle is a zip container for Research Objects which makes a workflow easy to download and provides the helpful manifest previously discussed
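To make the idea concrete, here is a minimal sketch of writing such a bundle. The layout follows my reading of the RO Bundle specification and should be checked against it before being relied on: a mimetype entry stored first and uncompressed, and a manifest at .ro/manifest.json that lists the aggregated files. The function name and the file contents are hypothetical.

```python
import json
import zipfile

# Assumed media type for RO bundles (check against the specification)
MIMETYPE = "application/vnd.wf4ever.robundle+zip"


def write_ro_bundle(bundle_path, files):
    """Write a zip bundle; files maps a path inside the bundle to bytes."""
    manifest = {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "aggregates": [{"uri": "/" + name} for name in sorted(files)],
    }
    with zipfile.ZipFile(bundle_path, "w") as bundle:
        # The mimetype entry goes first and is stored without compression
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        for name in sorted(files):
            bundle.writestr(name, files[name],
                            compress_type=zipfile.ZIP_DEFLATED)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2),
                        compress_type=zipfile.ZIP_DEFLATED)
    return manifest
```

A call such as write_ro_bundle("workflow.bundle.zip", {"workflow.cwl": cwl_bytes}) would package one workflow file; richer metadata (authors, checksums, provenance) is what the profile would need to specify.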
Planned Milestone 2
- Visualisation of Workflows in Graphs
This will be a visual tool for sharing workflows, so a human readable visualisation is essential
Planned Milestone 3
- CWL Linked Data Support
Supporting linked data is important for handling every kind of workflow that could be imported into the site
Planned Milestone 4
- Archiving External Linked Data
Because externally linked data can change over time and it is important to have a stable reference to a workflow, these resources can be saved and linked locally
CWL Community
I have also engaged with the CWL community and the contributors have agreed to host the finished product at:
http://view.commonwl.org/
This means I can also get feedback on the state of the application as an ongoing process
Questions?
Third Year Project Seminar
By Mark Robinson
Seminar describing my third year project on "Reproducible Research with Research Objects"