Reproducible Research using Research Objects

Mark Robinson

Supervisor: Carole Goble

Reproducibility in Research

The original data can be analyzed to obtain the same results as the original study

Reproducibility is important because the data is the only thing that can be guaranteed about a study

Reproducibility is critical - one of the main principles of the scientific method

Source: https://xkcd.com/242/

Computation-based science publication is currently a doubtful enterprise because there is not enough support for identifying and rooting out sources of error in computational work

Donoho (Biostatistics 2010)

Reproducibility Dimensions

Portability, Preservation

Packaging, Containers

Access

Standards

Common APIs

Licensing, IDs

Description

Standards, Common Metadata

Robustness, Versioning

Change

Variation Sensitivity

Discrepancy Handling

Provenance

Steps

Dependencies

Scientific Workflows

Very useful in Bioinformatics because its processes are large-scale and repetitive

Series of Computational/Data Management Steps

An example of a simple workflow that retrieves a weather forecast for the specified city

A useful paradigm to describe, manage, and share complex scientific analyses

Apache Taverna "Why Use Workflows"

Important that it 'just runs'

Example: bcbio

Workflows used for Variant calling, RNA sequencing and small RNA analysis  

https://bcbio-nextgen.readthedocs.io/

Workflow Management Systems

Manage command line tools and web services

Infrastructure to set up, execute, and monitor scientific workflows

Allows analysis of intermediate steps and complete provenance information

Handles the input/output 'plumbing', and the 'rewiring' when tools are swapped out or the workflow changes
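To make the 'plumbing' idea concrete, here is a minimal sketch in Python. It is not how any particular workflow management system is implemented. The `Step` record, the `run_workflow` function and the toy string tools are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical step record: (name, tool, input names, output name).
Step = Tuple[str, Callable[..., object], List[str], str]


def run_workflow(steps: List[Step], inputs: Dict[str, object]):
    """Run steps in order, wiring inputs and outputs through a shared store."""
    values = dict(inputs)
    provenance = []
    for name, tool, needs, produces in steps:
        args = [values[key] for key in needs]  # the 'plumbing'
        values[produces] = tool(*args)
        # Record which values each step used and generated.
        provenance.append(
            {"step": name, "used": list(needs), "generated": produces}
        )
    return values, provenance


steps = [
    ("shout", str.upper, ["text"], "shouted"),
    ("measure", len, ["shouted"], "length"),
]
values, provenance = run_workflow(steps, {"text": "hello"})

# Swapping one tool only changes its own entry; the wiring stays the same.
steps[0] = ("title", str.title, ["text"], "shouted")
swapped_values, _ = run_workflow(steps, {"text": "hello"})
```

The point of the design is that each tool only declares what it needs and what it produces, so replacing a tool does not require touching the steps around it.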

Bioinformatics Use

Common to have a series of tools written by different people used together

The field moves too quickly for a single end-to-end tool

There are even directories of these tools, e.g. https://bio.tools/

Bioinformatics Use

There is often a need to swap tools to try different techniques without changing the whole workflow

But there are hundreds of these tools and they have no standards for inputs and outputs

Highlights the need for a workflow management system to manage these conversions and changes

Workflow Management Systems

+ Many, many more

YesWorkflow

CWL is a community-led standard way of expressing and running workflows

Competing standards make collaboration difficult

Workflows written in YAML or JSON

CWL: Participating Organisations

cwlVersion: v1.0
class: Workflow
inputs:
  inp: File
  ex: string

outputs:
  classout:
    type: File
    outputSource: compile/classfile

steps:
  untar:
    run: tar-param.cwl
    in:
      tarfile: inp
      extractfile: ex
    out: [example_out]

  compile:
    run: arguments.cwl
    in:
      src: untar/example_out
    out: [classfile]

CWL: Expressing Workflows

http://www.commonwl.org/v1.0/UserGuide.html#First_workflow

Extracts a Java source file from a tar file and then compiles it
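The step order is implied by the `in` sources rather than written down explicitly. As a rough sketch (not the CWL reference implementation), the dependencies can be recovered in Python from the parsed workflow. The dictionary below is the workflow above as a YAML parser would return it. `step_dependencies` is a hypothetical helper, and it assumes every `in` value is a plain string source.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# The workflow above, written as the dictionary a YAML parser would return.
workflow = {
    "cwlVersion": "v1.0",
    "class": "Workflow",
    "inputs": {"inp": "File", "ex": "string"},
    "outputs": {
        "classout": {"type": "File", "outputSource": "compile/classfile"}
    },
    "steps": {
        "untar": {
            "run": "tar-param.cwl",
            "in": {"tarfile": "inp", "extractfile": "ex"},
            "out": ["example_out"],
        },
        "compile": {
            "run": "arguments.cwl",
            "in": {"src": "untar/example_out"},
            "out": ["classfile"],
        },
    },
}


def step_dependencies(wf: dict) -> dict:
    """Map each step to the set of steps whose outputs it consumes."""
    deps = {name: set() for name in wf["steps"]}
    for name, step in wf["steps"].items():
        for source in step["in"].values():
            # A source of the form "step/output" refers to another step;
            # a bare name refers to a workflow input.
            upstream = source.split("/")[0]
            if "/" in source and upstream in deps:
                deps[name].add(upstream)
    return deps


dependencies = step_dependencies(workflow)
execution_order = list(TopologicalSorter(dependencies).static_order())
```

Here `compile` depends on `untar` because its `src` input is `untar/example_out`, so `untar` has to run first.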

External Tools

inp:
  class: File
  path: hello.tar
ex: Hello.java

workflow.cwl

workflow-job.yml

CWL: Running a Workflow

http://www.commonwl.org/v1.0/UserGuide.html#First_workflow

$ echo "public class Hello {}" > Hello.java && tar -cvf hello.tar Hello.java
$ cwl-runner workflow.cwl workflow-job.yml
[job untar] /tmp/tmp94qFiM$ tar xf /home/example/hello.tar Hello.java
[step untar] completion status is success
[job compile] /tmp/tmpu1iaKL$ docker run -i --volume=/tmp/tmp94qFiM/Hello.java:/var/lib/cwl/job301600808_tmp94qFiM/Hello.java:ro --volume=/tmp/tmpu1iaKL:/var/spool/cwl:rw --volume=/tmp/tmpfZnNdR:/tmp:rw --workdir=/var/spool/cwl --read-only=true --net=none --user=1001 --rm --env=TMPDIR=/tmp java:7 javac -d /var/spool/cwl /var/lib/cwl/job301600808_tmp94qFiM/Hello.java
[step compile] completion status is success
[workflow workflow.cwl] outdir is /home/example
Final process status is success
{
  "classout": {
    "location": "/home/example/Hello.class",
    "checksum": "sha1$e68df795c0686e9aa1a1195536bd900f5f417b18",
    "class": "File",
    "size": 416
  }
}

Hello.class produced

Provenance information for outputs also given 
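The `checksum` and `size` fields make it possible to check later that a file is still the one the workflow produced. A minimal sketch in Python, assuming the `sha1$` prefix simply precedes the hexadecimal SHA-1 digest of the file contents, as it appears in the output above. `cwl_checksum` and `matches_output` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def cwl_checksum(path) -> str:
    """Return the file's SHA-1 digest in the 'sha1$<hex>' form shown above."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read in chunks so large outputs do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return "sha1$" + digest.hexdigest()


def matches_output(record: dict, path) -> bool:
    """Check a file against the size and checksum of an output record."""
    same_size = Path(path).stat().st_size == record["size"]
    return same_size and cwl_checksum(path) == record["checksum"]
```

For example, `matches_output(result["classout"], "/home/example/Hello.class")` would compare the file on disk against the record printed by the runner, where `result` is that printed JSON loaded into a dictionary.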

CWL: Linked Data

{
  "classout": {
    "location": "/home/example/Hello.class",
    "checksum": "sha1$e68df795c0686e9aa1a1195536bd900f5f417b18",
    "class": "File",
    "size": 416
  }
}

But how do we interpret these outputs, especially with increasing numbers of them?

Linked data can help with this problem and be used to describe these outputs and more within CWL

Linked Data

Linked Data is about using the Web to connect related data that wasn't previously linked

Source: http://linkeddata.org/

Enables data from different sources to be connected and queried in a way which can be read automatically by computers

Linked Data Example

Can be serialised as JSON-LD (below) or in other RDF syntaxes such as RDF/XML

{
  "@context": "http://json-ld.org/contexts/person.jsonld",
  "@id": "http://dbpedia.org/resource/John_Lennon",
  "name": "John Lennon",
  "born": "1940-10-09",
  "spouse": "http://dbpedia.org/resource/Cynthia_Lennon"
}
{
   "@context":
   {
      "Person": "http://xmlns.com/foaf/0.1/Person",
      "xsd": "http://www.w3.org/2001/XMLSchema#",
      "name": "http://xmlns.com/foaf/0.1/name",
      "nickname": "http://xmlns.com/foaf/0.1/nick",
      "affiliation": "http://schema.org/affiliation",
      "depiction":
      {
         "@id": "http://xmlns.com/foaf/0.1/depiction",
         "@type": "@id"
      },
      "image":
      {
         "@id": "http://xmlns.com/foaf/0.1/img",
         "@type": "@id"
      },
      "born":
      {
         "@id": "http://schema.org/birthDate",
         "@type": "xsd:dateTime"
      },
      "child":
      {
         "@id": "http://schema.org/children",
         "@type": "@id"
      },
      "colleague":
      {
         "@id": "http://schema.org/colleagues",
         "@type": "@id"
      },
      "knows":
      {
         "@id": "http://xmlns.com/foaf/0.1/knows",
         "@type": "@id"
      },
      "died":
      {
         "@id": "http://schema.org/deathDate",
         "@type": "xsd:dateTime"
      },
      "email":
      {
         "@id": "http://xmlns.com/foaf/0.1/mbox",
         "@type": "@id"
      },
      "familyName": "http://xmlns.com/foaf/0.1/familyName",
      "givenName": "http://xmlns.com/foaf/0.1/givenName",
      "gender": "http://schema.org/gender",
      "homepage":
      {
         "@id": "http://xmlns.com/foaf/0.1/homepage",
         "@type": "@id"
      },
      "honorificPrefix": "http://schema.org/honorificPrefix",
      "honorificSuffix": "http://schema.org/honorificSuffix",
      "jobTitle": "http://xmlns.com/foaf/0.1/title",
      "nationality": "http://schema.org/nationality",
      "parent":
      {
         "@id": "http://schema.org/parent",
         "@type": "@id"
      },
      "sibling":
      {
         "@id": "http://schema.org/sibling",
         "@type": "@id"
      },
      "spouse":
      {
         "@id": "http://schema.org/spouse",
         "@type": "@id"
      },
      "telephone": "http://schema.org/telephone",
      "Address": "http://www.w3.org/2006/vcard/ns#Address",
      "address": "http://www.w3.org/2006/vcard/ns#address",
      "street": "http://www.w3.org/2006/vcard/ns#street-address",
      "locality": "http://www.w3.org/2006/vcard/ns#locality",
      "region": "http://www.w3.org/2006/vcard/ns#region",
      "country": "http://www.w3.org/2006/vcard/ns#country",
      "postalCode": "http://www.w3.org/2006/vcard/ns#postal-code"
   }
}

@context is used to map terms to URIs and here provides the standard that the fields must meet

@id uniquely identifies the resource with a URI
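The term-to-URI mapping can be illustrated with a heavily simplified sketch in Python. This is not a JSON-LD processor: it ignores `@type` coercion, nested contexts and remote context loading, and `expand_terms` is a hypothetical name. The context used here is a three-term excerpt of the one above.

```python
# Excerpt of the context above: each term maps to a URI, either directly
# or through an "@id" entry.
context = {
    "name": "http://xmlns.com/foaf/0.1/name",
    "born": {"@id": "http://schema.org/birthDate", "@type": "xsd:dateTime"},
    "spouse": {"@id": "http://schema.org/spouse", "@type": "@id"},
}

document = {
    "@id": "http://dbpedia.org/resource/John_Lennon",
    "name": "John Lennon",
    "born": "1940-10-09",
    "spouse": "http://dbpedia.org/resource/Cynthia_Lennon",
}


def expand_terms(doc: dict, ctx: dict) -> dict:
    """Replace each short term with the URI the context assigns to it."""
    expanded = {}
    for key, value in doc.items():
        if key == "@context":
            continue  # the context itself is not data
        if key.startswith("@"):
            expanded[key] = value  # keywords such as @id are kept as-is
            continue
        definition = ctx.get(key)
        if definition is None:
            continue  # terms without a mapping carry no linked-data meaning
        uri = definition["@id"] if isinstance(definition, dict) else definition
        expanded[uri] = value
    return expanded


expanded = expand_terms(document, context)
```

After expansion every key is a globally unique URI, which is what lets data from different sources be merged and queried together.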

CWL: Portability

Portability, Preservation

Packaging, Containers

Common Workflow Language supports the use of Docker containers as an environment

Packages an application into a standardised unit containing everything needed to run

Workflows are designed to be shared

Aims to make the software run the same way regardless of environment

Contains everything needed to run: code, runtime, system tools, system libraries – anything that can be installed on a server

Problem with CWL

  • No manifest for workflows
    • Missing useful metadata about the workflow

Description

Standards, Common Metadata

Robustness, Versioning

Change

Variation Sensitivity

Discrepancy Handling

No Standards for Description

No Versioning or File Integrity Verification

However, there is an entire project, developed by wf4ever, focused on writing manifests for collections of research data

Executable and complete description of how computationally derived research results were made

My Project

This is the missing piece of the puzzle for reproducible workflows

Research Object Profiles

A 'Profile' defines an expectation of the purpose of the Research Object

What files and metadata should be expected

Assumptions which can be safely made about contents

But how do we define a profile?

Description

Standards, Common Metadata

In a way the idea is too flexible

Needs to account for various fields/software 

Existing Manifest Examples

Solution: Top-Down Approach

  • Find the format used in an existing workflow tool, e.g. Apache Taverna
  • Use this to create a specification

Apache Taverna Data Bundle

CWL Parsing

Solution: Bottom-Up Approach

  • Take a workflow and parse it to attempt to find details
  • Whatever is missing is what would be useful to put in a manifest

For example, found during my work so far:

Concept of 'Main Workflow' vs Nested

Nested Workflows
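One possible heuristic for telling the main workflow apart from nested ones is to look for a workflow that no other workflow runs as a step. The Python sketch below illustrates that assumption only. It is not necessarily how the finished application decides, and the file names and the `find_main_workflows` helper are hypothetical. It also assumes `steps` is written in map form with `run` given as a file name.

```python
def find_main_workflows(documents: dict) -> list:
    """Return workflows that are not run as a step of another workflow.

    `documents` maps a file name to its parsed CWL content.
    """
    referenced = set()
    for doc in documents.values():
        if doc.get("class") != "Workflow":
            continue
        for step in doc.get("steps", {}).values():
            run = step.get("run")
            if isinstance(run, str):  # embedded tools have no file name
                referenced.add(run)
    return sorted(
        name
        for name, doc in documents.items()
        if doc.get("class") == "Workflow" and name not in referenced
    )


# Hypothetical directory: one top-level workflow, one nested, one tool.
documents = {
    "main.cwl": {
        "class": "Workflow",
        "steps": {"prepare": {"run": "sub.cwl"}, "finish": {"run": "tool.cwl"}},
    },
    "sub.cwl": {"class": "Workflow", "steps": {"only": {"run": "tool.cwl"}}},
    "tool.cwl": {"class": "CommandLineTool"},
}
main_workflows = find_main_workflows(documents)
```

If the result has more than one entry the directory is ambiguous, which is exactly the kind of detail a manifest could state explicitly.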

My Project

Help define a Workflow Research Object Profile

Develop a web application to visualise, package and share CWL workflows

CWL Viewer - Architecture

  • Model-View-Controller Java Web Application
    • Spring Framework
  • MongoDB - NoSQL Database
    • Flexible schema
    • CWL is already parsed to JSON format

User enters the GitHub URL of a directory containing the workflow

Each CWL file is parsed and the main workflow is found

Inputs, outputs and step details are collected

Research Object bundle is constructed

Page is created for the workflow with visualisation, details and download link
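The first step of this flow, splitting the directory URL into its parts, could look roughly like the Python sketch below. It assumes URLs of the form `https://github.com/<owner>/<repo>/tree/<branch>/<path>` and a branch name without slashes. `GitHubDirectory`, `parse_github_directory` and the example URL are hypothetical, not taken from the application's source.

```python
import re
from typing import NamedTuple, Optional


class GitHubDirectory(NamedTuple):
    owner: str
    repo: str
    branch: str
    path: str


# Assumed URL shape: https://github.com/<owner>/<repo>/tree/<branch>/<path>
_PATTERN = re.compile(
    r"^https?://github\.com/([^/]+)/([^/]+)/tree/([^/]+)(?:/(.*?))?/?$"
)


def parse_github_directory(url: str) -> Optional[GitHubDirectory]:
    """Split a GitHub directory URL into owner, repo, branch and path."""
    match = _PATTERN.match(url.strip())
    if match is None:
        return None  # not a directory URL this sketch understands
    owner, repo, branch, path = match.groups()
    return GitHubDirectory(owner, repo, branch, path or "")


details = parse_github_directory(
    "https://github.com/example-owner/example-repo/tree/master/workflows/demo"
)
```

The parsed parts are what a later call to the GitHub API would need in order to list and fetch the CWL files in that directory.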

GitHub API

Parse CWL

Visualisation

Construct RO

CWL Viewer - Basic Flow

Existing Libraries are Available

  • Currently not formally defined
    • Must first define what a profile 'is' and the scope of what it defines
    • Difficulty in deciding what is useful and should be contained in a manifest outside of the needs of this application

RO Profile

  • Ambiguity
    • Easier to write by hand, hard to parse
inputs:
  input1: string
  input2: int
inputs:
  - id: input1
    type: string
  - id: input2
    type: int
inputs:
  input1: string[]
inputs:
  - id: input1
    type:
      type: array
      items: string
inputs:
  input1: float?
inputs:
  - id: input1
    type: ["null", float]

Examples for Inputs:
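One way a parser can cope with this ambiguity is to normalise every shorthand into a single expanded form before doing anything else. A minimal Python sketch, covering only the `?` and `[]` suffixes and the map-versus-list forms shown above. `expand_type` and `normalise_inputs` are hypothetical names, and the input is assumed to be already parsed from YAML.

```python
def expand_type(type_spec):
    """Expand the 'type?' and 'type[]' shorthands into their long forms."""
    if isinstance(type_spec, str):
        if type_spec.endswith("?"):
            return ["null", expand_type(type_spec[:-1])]
        if type_spec.endswith("[]"):
            return {"type": "array", "items": expand_type(type_spec[:-2])}
    return type_spec  # already expanded, or a plain type name


def normalise_inputs(inputs) -> list:
    """Turn either the map form or the list form into a list of records."""
    if isinstance(inputs, dict):
        entries = []
        for key, value in inputs.items():
            # A bare value such as "string" is shorthand for {"type": "string"}.
            entry = dict(value) if isinstance(value, dict) else {"type": value}
            entry["id"] = key
            entries.append(entry)
    else:
        entries = [dict(entry) for entry in inputs]
    for entry in entries:
        entry["type"] = expand_type(entry.get("type"))
    return entries


short_form = normalise_inputs({"input1": "string[]", "input2": "float?"})
long_form = normalise_inputs(
    [
        {"id": "input1", "type": {"type": "array", "items": "string"}},
        {"id": "input2", "type": ["null", "float"]},
    ]
)
```

Both calls give the same result, so the rest of the parser only ever has to deal with one representation.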

Parsing CWL

Parsing CWL

  • Linked Data
    • Resolving for external resources/vocabulary
    • Adds a large amount of complexity

http://www.commonwl.org/v1.0/SchemaSalad.html

E.g. field name resolution:

{
  "$namespaces": {
    "acid": "http://example.com/acid#"
  },
  "$graph": [{
    "name": "ExampleType",
    "type": "record",
    "fields": [{
      "name": "base",
      "type": "string",
      "jsonldPredicate": "http://example.com/base"
    }]
  }]
}
{
  "base": "one",
  "form": {
    "http://example.com/base": "two",
    "http://example.com/three": "three"
  },
  "acid:four": "four"
}
{
  "base": "one",
  "form": {
    "base": "two",
    "http://example.com/three": "three"
  },
  "http://example.com/acid#four": "four"
}
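The transformation between the last two documents can be approximated with a short Python sketch. It is an illustration of the idea only, not the Schema Salad implementation: `resolve_field_names` is a hypothetical name, and the `predicates` table is assumed to have been built beforehand from the `jsonldPredicate` entries in the schema.

```python
def resolve_field_names(document, namespaces: dict, predicates: dict):
    """Expand namespace prefixes and map predicate URIs to field names."""
    if isinstance(document, list):
        return [resolve_field_names(i, namespaces, predicates) for i in document]
    if not isinstance(document, dict):
        return document
    resolved = {}
    for key, value in document.items():
        prefix, separator, local = key.partition(":")
        if separator and prefix in namespaces:
            key = namespaces[prefix] + local  # "acid:four" -> full URI
        key = predicates.get(key, key)  # known predicate URI -> field name
        resolved[key] = resolve_field_names(value, namespaces, predicates)
    return resolved


namespaces = {"acid": "http://example.com/acid#"}
# Built from the schema: the field "base" has this jsonldPredicate.
predicates = {"http://example.com/base": "base"}

source = {
    "base": "one",
    "form": {
        "http://example.com/base": "two",
        "http://example.com/three": "three",
    },
    "acid:four": "four",
}
resolved = resolve_field_names(source, namespaces, predicates)
```

Even this small case needs the schema, the namespaces and a recursive walk, which gives a sense of the complexity that full linked data support adds.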
  • Visualisation of directed acyclic graphs is a complex problem to solve
  • Workflow has a deep structure
    • Issue of information density vs readability
    • Drilldown into more details?
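As a starting point for the graph problem, the workflow structure can be turned into Graphviz DOT text and handed to an existing layout tool. A minimal Python sketch, assuming the map form of `inputs`, `outputs` and `steps` with plain string sources. `workflow_to_dot` is a hypothetical helper, and the dictionary is a trimmed copy of the untar and compile example used earlier.

```python
def workflow_to_dot(wf: dict) -> str:
    """Render workflow inputs, steps and outputs as Graphviz DOT text."""
    lines = ["digraph workflow {"]
    # Draw inputs and outputs as boxes so they stand out from the steps.
    for name in list(wf["inputs"]) + list(wf["outputs"]):
        lines.append(f'  "{name}" [shape=box];')
    for name, step in wf["steps"].items():
        for source in step["in"].values():
            origin = source.split("/")[0]  # "untar/example_out" -> "untar"
            lines.append(f'  "{origin}" -> "{name}";')
    for name, output in wf["outputs"].items():
        origin = output["outputSource"].split("/")[0]
        lines.append(f'  "{origin}" -> "{name}";')
    lines.append("}")
    return "\n".join(lines)


workflow = {
    "inputs": {"inp": "File", "ex": "string"},
    "outputs": {
        "classout": {"type": "File", "outputSource": "compile/classfile"}
    },
    "steps": {
        "untar": {"in": {"tarfile": "inp", "extractfile": "ex"}},
        "compile": {"in": {"src": "untar/example_out"}},
    },
}
dot_text = workflow_to_dot(workflow)
```

This flattens everything to one level, so it sidesteps rather than solves the density and drilldown questions above. Nested workflows would need clusters or separate graphs.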

Visualisation

  • Hard to evaluate success
    • Human factor - must meet the needs of the audience
    • Make it look like familiar existing tools or try to find a better format?

Visualisation

So Far...

  • Java MVC Web Application
  • GitHub URL Parsing
  • GitHub API Functionality (Fetching Files)
  • Basic CWL Parsing

https://github.com/MarkRobbo/CWLViewer

Planned Milestone 1

  • Visualisation of Workflows in Graphs

This will be a visual tool for sharing workflows, so a human-readable visualisation is essential

User Story

As a user I want to be able to view a graphic visualisation of the workflow along with its details so that I can easily see at a glance what the workflow does and the steps which make it up

Planned Milestone 2

  • Downloadable Research Object Bundle

An RO bundle is a zip container for Research Objects which makes a workflow easy to download and provides the helpful manifest previously discussed
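A rough Python sketch of what constructing such a bundle involves, using only the standard library. The layout assumed here (a `mimetype` entry stored first, a manifest at `.ro/manifest.json`, and an `aggregates` list of the contained files) follows my reading of the RO Bundle specification and should be checked against it. `build_ro_bundle` and the example file are hypothetical.

```python
import io
import json
import zipfile

# Assumed media type and manifest location from the RO Bundle specification.
MIMETYPE = "application/vnd.wf4ever.robundle+zip"
MANIFEST_PATH = ".ro/manifest.json"


def build_ro_bundle(files: dict, created_by: str) -> bytes:
    """Package files plus a manifest describing them into a zip archive."""
    manifest = {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "createdBy": {"name": created_by},
        "aggregates": [{"uri": "/" + name} for name in sorted(files)],
    }
    buffer = io.BytesIO()
    # The default ZIP_STORED mode keeps every entry uncompressed.
    with zipfile.ZipFile(buffer, "w") as bundle:
        bundle.writestr("mimetype", MIMETYPE)  # written first
        bundle.writestr(MANIFEST_PATH, json.dumps(manifest, indent=2))
        for name, content in files.items():
            bundle.writestr(name, content)
    return buffer.getvalue()


bundle_bytes = build_ro_bundle(
    {"workflow.cwl": "cwlVersion: v1.0\nclass: Workflow\n"},
    created_by="Example Author",
)
```

The manifest is where the identification and attribution metadata missing from plain CWL would live.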

User Story

As a user I want to be able to download the workflow I am viewing in the form of a Research Object Bundle so that I can easily run it and understand the process/results by viewing useful metadata, such as identification and attribution, in the manifest

Planned Milestone 3

  • CWL Linked Data Support

Supporting linked data is an important part of supporting every kind of workflow that could be imported into the site

User Stories

As a user I want to be able to import workflows which utilise linked data in Common Workflow Language so that the website can be used to share them

As a user I want to be able to view information about linked data contained within workflows so that I can easily understand the context under which they are run and the results are resolved

Planned Milestone 4

  • Archiving External Linked Data

Because externally linked data can change over time and it is important to have a stable reference to a workflow, these resources can be saved and linked locally

User Story

As a user I want to be able to download a Research Object Bundle of a workflow where the externally linked resources are frozen at a particular time so that the contents are stable from when it was originally referenced and changes will not prevent the running of the workflow or disrupt the results

CWL Community

I have also engaged with the CWL community and the contributors have agreed to host the finished product at:

http://view.commonwl.org/

This means I can also get feedback on the state of the application as an ongoing process

Questions?
