Cloud of Reproducible Records

Faical Yannick P. Congo

faical.congo@nist.gov

MRaDS

Wednesday, September 27, 2017

Institute for Bioscience and Biotechnology Research

https://slides.com/faicalyannickcongo/corr/live

Buckle up for this Journey

BUT BEFORE...

LET'S START WITH SOME FACTS

version control?

execution management?

research pedigree?

code#execution versioning!

execution tools!

coRR!

The scientist's nightmares

IN...

3 Scenarios

Early RESEARCH BACK and Forth SWING

I have to

run my study again and

again on...

different datasets

different code

different parameters

How do I keep track of all that in a way that is meaningful to me and to others?

Research pipeline rationale

[Timeline diagram: rationales 1–6 arising across days, weeks, months, a semester, and a year of the study]

How do I keep track of all the rationales I had during my study timeline?

scale

during a study timeline, a scientist will take actions for different reasons!

using existing publications materials

publication

extract these

reconstruct

execute

corroborate

research involves others using your study in theirs. it needs a lot more!

How do I capture everything needed for a peer to run my study easily?

Hmm...

What I need is

sort of

  • A provenance tracking Tool?
  • A work-flow management tool?
  • A pedigree capture tool?

what about version control?

BUT

Which one?

Source Code

EXECUTION

YES: They are two very different things!

Source Code

EXECUTION

CODE MODIFICATIONS

RUN CHANGES

Cleaner with Branches

LIBRARIES TRACKING

easy backups

SYSTEM Changes

why not use source code version control to do the whole job?

Let's SEE...

The things that can affect your executions but not your code (a minimal capture sketch follows this list):

  • A change in hardware may affect your run; your code, less likely.
  • An OS update may affect your run; your code, less likely.
  • Using different data may affect your run; your code, less likely.
  • Changing parameters may affect your run; your code, less likely.
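As an illustration only (not any particular tool's record format), here is a minimal Python sketch of what execution version control has to capture on top of the source: hardware, OS, runtime, the data used, and the parameters used. The field names and the sample image path are assumptions.

import hashlib
import json
import platform
import sys

def snapshot_run(parameters, data_path):
    # Capture the run conditions that can change without the code changing.
    with open(data_path, "rb") as data_file:
        data_hash = hashlib.sha256(data_file.read()).hexdigest()
    return {
        "machine": platform.machine(),    # hardware
        "os": platform.platform(),        # operating system and version
        "python": sys.version,            # runtime
        "data_sha256": data_hash,         # which data this run used
        "parameters": parameters,         # which parameters this run used
    }

if __name__ == "__main__":
    print(json.dumps(snapshot_run({"threshold": 0.5}, "sem-images/sample.png"),
                     indent=2))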

seriously! Do not use source code version control for execution version control

  • Runs as different commits in the same branch?
  • Runs as different branches in the same repository?
  • Runs as different repositories?

none of these is appropriate as is

RUNS as different commits in the same branch?

  • Every run has to produce a file that is treated as source code
  • A commit becomes an indeterminate state
  • Does not scale well for a huge number of runs

BASICALLY treat a run as a code change?

RUNS as different branches in the same repository?

  • A branch becomes an indeterminate state

BASICALLY treat a run as a new branch?

  • The notion of merging two branches has to be redefined.
  • Scaling is still a big problem.

RUNS as different repositories?

  • A repository becomes an indeterminate state
  • The notion of a repository as a project loses its meaning

BASICALLY treat a run as a new repo?

  • Scaling is still a big problem.

so It is settled then!

CODE VERSION CONTROL IS NOT APPROPRIATE FOR EXECUTION

YET!

What is worse than not using version control for your code today?

Let's ASSUME we have the solution

Cloud Storage

without it, all your work is sitting on a single point of failure: loss, corruption, deletion, ...

cloud storage and all its subsequent features are a must have!

WHY? The two best success Stories:

Github

bitbucket

BACKUP, COLLABORATION, exposure, dissemination, open-source, custom features

scientists need source code version control

scientists need a cloud storage option: Github|Bitbucket

BUT for executions

scientists need execution version control tools

scientists need a cloud storage option to go with them

luckily: There are quite a few!

executions management systems

NOTEBOOKS

WORKFLOWS

PROVENANCE

SYSTEM

HERE ARE 40 of them

Sumatra

reprozip

cde

panda

noworkflow

maestrowf

hubzero

galaxy

taverna

aiida

kepler

fireworks

ergatis

anduril

askalon

airavata

biobike

autosubmit

bioclipse

hyperflow

cknowledge

cuneiform

nextflow

knime

nipype

openmole

orange

pegasus

scicumulus

vistrails

yabi

tavaxy

jupyter

zeppelin

beaker

nteract

kajero

ucalc

eve

hyperdeck

Great then: We just have to use these!

we are here because it is not that simple

YES, but wait...

with source code version control tools

we have the following features

git

Mercurial

subversion

bazaar

cvs

fossil

migration

cloud services

sadly out of these, only half are


connected and none has a

migration capability

this gave birth to

Cloud of Reproducible Records

migration capability between the tools

common cloud services features

corr is a nist-mgi funded project

usnistgov/corr

OPEN-SOURCE

DEMO live soon

collaborations

docker-hub

corr [microservices] architecture: REST services, a JS | CSS | HTML5 frontend, MongoDB, and file system | S3 storage
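As a purely illustrative sketch (not CoRR's actual code or API), a microservice in this architecture is a small REST layer backed by MongoDB, with record payloads kept on a file system or in S3. The route names, database name, and record fields below are all assumptions.

# Hypothetical REST-over-MongoDB service in the spirit of the architecture above.
# Routes, database name, and fields are assumptions, not CoRR's real API.
from flask import Flask, jsonify, request
from pymongo import MongoClient

app = Flask(__name__)
records = MongoClient("mongodb://localhost:27017")["corr_demo"]["records"]

@app.route("/records", methods=["POST"])
def create_record():
    # One "record" = one execution: tool, inputs, outputs, environment.
    doc = request.get_json()
    result = records.insert_one(doc)
    return jsonify(id=str(result.inserted_id)), 201

@app.route("/records", methods=["GET"])
def list_records():
    # Return every stored execution record.
    docs = [{**d, "_id": str(d["_id"])} for d in records.find()]
    return jsonify(docs)

if __name__ == "__main__":
    app.run(port=5100)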

HOME TOP

HOME menu

HOME bottom

HOME docs

HOME search

corr ACCOUNT

corr DASHBOARD

corr projects

corr records

corr tools

research project demo

Credit: Sheng Yen Li, Daniel Wheeler

Pipeline: SEM images → threshold → min-size → clean → reveal → pearlite / ferrite / cemmentite fractions → save → JSON fractions

study

demo


[Diagram: the study run through Sumatra, ReproZip, and MaestroWF, without CoRR and with CoRR]

import json

# Note: the image helpers used below (threshold_image, extract_metadata,
# f_min_size, remove_small_holes, reveal_pearlite, frac0, frac1) are assumed
# to be defined or imported elsewhere in study.py.

def threshold(filename):
    # Threshold one SEM image and attach its metadata.
    result = dict(filename=filename,
                  threshold_image=threshold_image(filename),
                  **extract_metadata(filename))
    return result

def min_size(data):
    # Compute the minimum feature size in pixels from the image scale.
    data['min_size'] = f_min_size(data['scale_microns'],
                                  data['scale_pixels'])
    return data

def clean(data):
    # Remove small artifacts from the thresholded image.
    data['clean_image'] = ~remove_small_holes(~data['threshold_image'],
                                              data['min_size'])
    return data

def reveal(data):
    # Isolate the pearlite regions in the cleaned image.
    data['pearlite_image'] = reveal_pearlite(data['clean_image'])
    return data

def cemmentite(data):
    # Cementite phase fraction.
    data['cemmentite_fraction'] = frac1(data['clean_image'])
    return data

def ferrite(data):
    # Ferrite phase fraction.
    data['ferrite_fraction'] = frac0(data['clean_image'])
    return data

def pearlite(data):
    # Pearlite phase fraction.
    data['pearlite_fraction'] = frac1(data['pearlite_image'])
    return data

def save(data):
    # Write the phase fractions for this image to a JSON file.
    clean_name = data['filename'].split("/")[-1].split(".")[0]
    file_path = "{0}.json".format(clean_name)
    filtered_data = {}
    filtered_data['filename'] = clean_name
    filtered_data['pearlite_fraction'] = data['pearlite_fraction']
    filtered_data['ferrite_fraction'] = data['ferrite_fraction']
    filtered_data['cemmentite_fraction'] = data['cemmentite_fraction']
    with open(file_path, "w") as save_file:
        save_file.write(json.dumps(filtered_data, sort_keys=True,
                                   indent=4, separators=(',', ': ')))
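
A minimal driver is sketched below to show how these steps chain together for each SEM image; the image folder and glob pattern are assumptions for illustration, not necessarily the exact study.py from the demo repository.

import glob

# Hypothetical driver: run every SEM image through the pipeline and save its fractions.
for filename in glob.glob("sem-images/*.png"):   # folder and extension are assumed
    data = threshold(filename)
    data = min_size(data)
    data = clean(data)
    data = reveal(data)
    data = pearlite(data)
    data = ferrite(data)
    data = cemmentite(data)
    save(data)   # writes <image-name>.json with the three phase fractions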

demo

sumatra

## Setup Version Control
$ git init
$ git add --all
$ git commit -m "Setting up the repo."
## Produces:
## - A folder: .git

## Setup Sumatra without CoRR
$ smt init SEM-Images-Smt .
## Setup Sumatra with CoRR
$ smt init -s=config.json SEM-Images-Smt .
## Produces:
## - A folder: .smt
## - A file: .smt/project
## - A file: .smt/records

## Run Study without Sumatra
$ python study.py
## Produces:
## - A list of json files.

## Run Study with Sumatra
$ smt run --executable=python --main=study.py
## Produces:
## - A row in .smt/records
## - A list of json files.

demo

reprozip

## Trace the study run
$ reprozip trace SEM-Image-Rpz python study.py

## Trace with CoRR
$ reprozip trace -s=config_file SEM-Image-Rpz python study.py
## Produces:
## - A folder: .reprozip
## - A file: bundle.rpz

demo

maestrowf

## Run the study
$ maestro -s -d 1 -y -c -t 2 sem-study.yaml

## Archive the run with CoRR
$ archive -f config.json record_path/sem-study.yaml -d 1
## Produces:
## - A folder: sample_output
## - A folder: */sem-images-maestrowf
## - A folder: */*/*_recordID

projects in CoRR

contracts module

thank you!

Cloud of Reproducible Records

Faical Yannick P. Congo

faical.congo@nist.gov

Wednesday, September 27, 2017

Institute for Bioscience and Biotechnology Research

corr stickers coming up!

federation capability!

demo users registrations

questions?

more time? Hands on!

Cloud of Reproducible Records

Faical Yannick P. Congo

faical.congo@nist.gov

Wednesday, September 27, 2017

Institute for Bioscience and Biotechnology Research

how many laptops?

who has done this before?

instance available

follow me :-)

http://10.5.100.207:5000/

https://github.com/usnistgov/MRaDS-2017-Demo-Study
