Transparency in Bioinformatics

@ines_cim

cimendes

Inês Mendes

Instituto de Medicina Molecular João Lobo Antunes

Through the looking glass of Data and Software

Can person X, with the same data and the same methodology, obtain the same conclusions as person Y?

Black Box

Commercial/Freeware
You get what it gives you
Ready to use
Stealth change
Standalone

Glass Box

Freeware
You can "tailor"
"Major" headache
Visible change
Dependencies

Reproducibility

| The needs

Do I know what is happening?
Is it reproducible?
Is it shareable?

Reproducibility

| The needs

The needs:

Analyze a large amount of sequence data routinely

Some computationally intensive steps

Constantly updating/adding software

Reproducibility

| The challenges

No standard way of describing experiments, environments, (derived) data, and workflows.

No transparency in creating environments and steps/methods to recreate analysis.

The experimental nature of research code and its ecosystem often makes it hard to build.

Unresolved or undocumented dependencies.

Infrastructure for storage and distribution.

What is our role, as computational biologists, in addressing these challenges?

FAIR Data Principles

Findable, Accessible, Interoperable, Reusable

FAIR Principles

| Not just for data

https://doi.org/10.5281/zenodo.3332807

FAIR Principles

| Not just for data

https://doi.org/10.1038/sdata.2016.18

FAIR Principles

| Not just for data

Findable: The first step in (re)using data is to find them! Descriptive metadata (information about the data such as keywords) are essential.

Accessible: Once users find the data and software, they need to know how to access them. Data may be openly available, but authentication and authorisation procedures may also be necessary.

Interoperable: Data needs to be integrated with other data and interoperate with applications or workflows.

Reusable: Data should be well-described so that they can be used, combined, and extended in different settings.
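As a rough illustration of the four principles, a dataset can be described by a small metadata record. This is only a sketch: the `DatasetRecord` class, its field names, and the example identifier and URL used with it are hypothetical and do not follow any specific metadata standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    identifier: str      # Findable: a persistent identifier such as a DOI
    title: str
    keywords: List[str]  # Findable: descriptive metadata used for searching
    access_url: str      # Accessible: where and how to retrieve the data
    file_format: str     # Interoperable: a community format other tools read
    license: str         # Reusable: the terms under which reuse is allowed
    sha256: str          # Reusable: lets others verify they got the same bytes


def checksum(content: bytes) -> str:
    """Return the SHA-256 hex digest of the raw data."""
    return hashlib.sha256(content).hexdigest()


def find(records: List[DatasetRecord], keyword: str) -> List[DatasetRecord]:
    """Case-insensitive keyword search over the records' metadata."""
    wanted = keyword.lower()
    return [r for r in records if wanted in (k.lower() for k in r.keywords)]


def to_json(record: DatasetRecord) -> str:
    """Serialise the record so that other applications can read it."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)
```

With a record like this, searching by keyword stands in for "findable", the URL and licence for "accessible" and "reusable", and the plain JSON output for "interoperable".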

But software is not data.

- Everyone, everywhere

FAIR Principles

| Not just for data

The quality of the form of the software can be covered by the FAIR data principles:

Code quality
Maintainability

The quality of the functionality of the software goes beyond the FAIR principles:

Correctness
Security
Efficiency

Form versus function of software

Version Control

Versioning, Collaboration and Accountability

Version Control

| What is it?

Version Control

| What does it allow for?

Collaboration: Version control systems (VCS), such as Git, were designed to solve the problem of multiple people working on the same code.

Storing versions properly: A state of the project can be saved as a commit, alongside a message. You keep only one working copy of the project on your disk; everything else is neatly packed up inside the VCS.

Auditability: A VCS offers an easy way to track "who made what change and when", allowing you to (1) debug more effectively, (2) track the reason that certain changes were made and (3) find the person who made a change.
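The three points above can be illustrated with a toy model. The sketch below is not how Git stores data; `Commit`, `ToyRepository` and `who_changed` are hypothetical names used only to show that saving each state with an author, a message and a link to its parent is enough to answer "who made what change and when".

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    """One saved state of the project (toy model, not Git's format)."""
    author: str
    message: str
    timestamp: str            # ISO 8601 string supplied by the caller
    snapshot: Dict[str, str]  # file name -> file content
    parent: Optional[str]     # identifier of the previous commit, if any

    @property
    def commit_id(self) -> str:
        # Hash every field so that any change yields a different identifier.
        payload = json.dumps(
            [self.author, self.message, self.timestamp, self.snapshot, self.parent],
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


class ToyRepository:
    def __init__(self) -> None:
        self.commits: Dict[str, Commit] = {}
        self.head: Optional[str] = None

    def commit(self, author: str, message: str, timestamp: str,
               snapshot: Dict[str, str]) -> str:
        """Save a new state on top of the current head and return its id."""
        new = Commit(author, message, timestamp, dict(snapshot), self.head)
        self.commits[new.commit_id] = new
        self.head = new.commit_id
        return new.commit_id

    def log(self) -> List[Commit]:
        """Walk parent links from the head: newest commit first."""
        history = []
        current = self.head
        while current is not None:
            entry = self.commits[current]
            history.append(entry)
            current = entry.parent
        return history

    def who_changed(self, filename: str) -> List[Commit]:
        """Commits (newest first) in which `filename` differs from its parent."""
        changed = []
        for entry in self.log():
            before = (self.commits[entry.parent].snapshot.get(filename)
                      if entry.parent else None)
            if entry.snapshot.get(filename) != before:
                changed.append(entry)
        return changed
```

Only the latest snapshot needs to live in the working directory; earlier states remain reachable through the parent links.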

Container Software

Reproducibility, Deployability

Software Containers

| What is it?

In the Paleolithic era:

Virtual Machines

“Bare Metal” Installation

Software Containers

| What is it?

In modern days:

Software Containers

Software Containers

| What is it?

A container image is a lightweight, stand-alone, executable package of software that includes everything needed to run it:

code
runtime
system tools
system libraries

It runs the same regardless of the environment.

Software Containers

| What is it?

[Figure: Virtual Machines versus Containers. Labels: Host Hardware, Host OS, Hypervisor, Guest OS, App (VM1, VM2), Container Engine.]

Docker

| The ecosystem

[Figure: the Docker ecosystem. Labels: build, push, Docker Hub, pull & run, host.]

Workflow Managers

Reproducibility, Scalability, Shareability

The game-changing combination of workflow managers and containers:

Portability
Reproducibility
Scalability
Multi-scale containerization
Native cloud support

Reproducibility

| The motivation

Workflows in the Modern era:

Workflow Managers

| What is it?

A workflow manager enables scalable and reproducible scientific workflows. It simplifies the deployment of complex parallel and reactive workflows.

Reactive workflow framework

Create pipelines with asynchronous (and implicitly parallelized) data streams

Programming DSL

Has its own language for building a pipeline

Containerized

Out-of-the-box integration with container engines (Docker, Singularity, Shifter)

Creating workflow pipelines is aimed at bioinformaticians familiar with programming.

Executing them is for everyone.

Workflow Managers

| What is it?

Automatic management of temporary input/output directory/files

No need for custom handling of concurrency (parallelization)

A single pipeline can support any scripting language (Bash, Python, Perl, R...)

Every process (task) can be run in a container

Its portability allows the same pipeline to run on a laptop, server, cluster, etc.

Checkpoints and resume functionality

Host pipeline on GitHub and run remotely
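Two of the features listed above, implicit parallelization and checkpoint/resume, can be sketched in a few lines. This is a toy under stated assumptions, not the API of any real workflow manager: `ToyWorkflow`, `run_process` and `count_bases` are hypothetical names, and results are assumed to be JSON-serialisable.

```python
import hashlib
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List


class ToyWorkflow:
    """Hypothetical miniature of a workflow manager."""

    def __init__(self, work_dir: Path) -> None:
        self.work_dir = Path(work_dir)
        self.work_dir.mkdir(parents=True, exist_ok=True)
        self.executed: List[str] = []  # tasks that actually ran (not cached)

    def _checkpoint(self, name: str, sample: str) -> Path:
        # One checkpoint file per (process, input) pair, managed for the user.
        key = hashlib.sha256(f"{name}:{sample}".encode("utf-8")).hexdigest()[:16]
        return self.work_dir / f"{key}.json"

    def run_task(self, name: str, func: Callable[[str], int], sample: str) -> int:
        checkpoint = self._checkpoint(name, sample)
        if checkpoint.exists():
            # Resume: reuse the stored result instead of recomputing it.
            return json.loads(checkpoint.read_text())
        result = func(sample)
        checkpoint.write_text(json.dumps(result))
        self.executed.append(f"{name}:{sample}")
        return result

    def run_process(self, name: str, func: Callable[[str], int],
                    samples: List[str]) -> Dict[str, int]:
        # The caller never handles concurrency; every sample is one task.
        with ThreadPoolExecutor() as pool:
            results = pool.map(lambda s: self.run_task(name, func, s), samples)
            return dict(zip(samples, results))


def count_bases(sequence: str) -> int:
    """Stand-in for a real analysis step."""
    return len(sequence)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workflow = ToyWorkflow(Path(tmp))
        print(workflow.run_process("count_bases", count_bases, ["ACGT", "GGC"]))
```

Running the same process again against the same work directory only executes tasks whose checkpoint file is missing, which is the essence of a resume.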

Monolithic pipelines

Need to change often

Siloed tool containers

Don't do much by themselves

Software Testing

The final frontier

Code Development

| Approaches

[Figure: code development approaches. Panels: Traditional Technique Approach; Test Driven Development Approach. Labels: Design, Code, Test, Code, Refactor.]
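Test-driven development is commonly described as writing a failing test first, then the code that makes it pass, then refactoring. A minimal sketch of that order follows; the `gc_content` function and its expected behaviour are assumptions chosen for illustration and do not come from the talk.

```python
import unittest


# Step 1 (test): the tests are written first and state the expected behaviour.
class TestGcContent(unittest.TestCase):
    def test_half_gc(self):
        self.assertAlmostEqual(gc_content("ACGT"), 0.5)

    def test_case_insensitive(self):
        self.assertAlmostEqual(gc_content("ggcc"), 1.0)

    def test_empty_sequence_is_rejected(self):
        with self.assertRaises(ValueError):
            gc_content("")


# Step 2 (code): the simplest implementation that makes the tests pass.
# Step 3 (refactor): clean it up while the tests keep passing.
def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    upper = sequence.upper()
    gc_count = sum(1 for base in upper if base in "GC")
    return gc_count / len(upper)
```

Saved as a module, the tests can be run with `python -m unittest`; before step 2 exists they fail, which is the expected starting point.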

In the field of microbial bioinformatics, good software engineering practices are not yet widely adopted. (...) This paper serves as a resource that could help microbial bioinformaticians get started with software testing if they have not had formal training.

Recommendation #1

Establish software needs and testing goals

The CSIS repository aims to promote the uptake of testing practices and engage the community in its adoption for public health.

This repository is an open-source project that gathers guidance, guidelines and examples for software testing for microbial bioinformatics researchers.

A proof of concept that the adoption of new standards can be crowdsourced.