Inês Mendes
Bioinformatics PhD student.
@ines_cim
cimendes
Instituto de Medicina Molecular João Lobo Antunes
Through the looking glass
of Data and Software
Can person X, with the same data and the same methodology, obtain the same conclusions as person Y?
The needs:
Findable, Accessible, Interoperable, Reusable
https://doi.org/10.5281/zenodo.3332807
https://doi.org/10.1038/sdata.2016.18
Findable: The first step in (re)using data is to find them! Descriptive metadata (information about the data such as keywords) are essential.
Accessible: Once users find the data and software, they need to know how to access them. Data could be openly available, but it is also possible that authentication and authorisation procedures are necessary.
Interoperable: Data need to be integrated with other data and to interoperate with applications or workflows.
Reusable: Data should be well-described so that they can be used, combined, and extended in different settings.
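As a rough illustration of what "descriptive metadata" can look like in practice, the sketch below pairs each FAIR principle with one or two fields of a metadata record. The class, the field names and the example values are all hypothetical and chosen for illustration; they do not follow any specific metadata standard.

```python
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class DatasetMetadata:
    """Hypothetical minimal metadata record; field names are illustrative only."""
    identifier: str        # Findable: a persistent identifier, e.g. a DOI
    title: str             # Findable: human-readable description
    keywords: List[str]    # Findable: terms a search can match
    access_url: str        # Accessible: where the data can be retrieved
    access_rights: str     # Accessible: e.g. "open" or "restricted"
    file_format: str       # Interoperable: a standard, documented format
    license: str           # Reusable: terms under which reuse is allowed
    provenance: str = ""   # Reusable: how the data were produced


def missing_fields(record: DatasetMetadata) -> List[str]:
    """Return the names of the fields that were left empty."""
    return [name for name, value in asdict(record).items() if not value]


# Placeholder values, not a real dataset.
record = DatasetMetadata(
    identifier="doi:10.0000/example",
    title="Example sequencing run",
    keywords=["genomics", "example"],
    access_url="https://example.org/data/run1",
    access_rights="open",
    file_format="FASTQ",
    license="CC-BY-4.0",
)
print(missing_fields(record))
```

A check like `missing_fields` is one simple way to notice, before sharing, that a record still lacks the information someone else would need to reuse the data.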
But software is not data.
- Everyone, everywhere
The quality of the form of the software can be covered by FAIR data principles
The quality of the functionality of the software goes beyond the FAIR principles:
Form versus function of software
Versioning, Collaboration and Accountability
Collaboration: Version control systems (VCS), such as Git, were designed to solve the problem of multiple people working on the same code.
Storing versions properly: A stage of the project can be saved, through a commit, alongside a message. In the end, you only have one version of the project on your disk. Everything else is neatly packed up inside the VCS.
Auditability: VCS offers you an easy way to track "who made what change and when", allowing you to (1) debug more effectively, (2) track the reason that certain changes were made and (3) find the person who made a change.
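The commit-and-audit cycle described above can be sketched from Python by calling the Git command line. This is only an illustration under some assumptions: Git is installed and on the PATH, the repository lives in a throwaway temporary directory, and the helper names (`git`, `demo_history`), the file name and the author identity are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run a git command inside `repo` and return its standard output."""
    # The identity is passed per command, so the sketch neither depends on
    # nor modifies the user's global Git configuration.
    cmd = [
        "git",
        "-c", "user.name=Person X",
        "-c", "user.email=person.x@example.org",
        "-c", "commit.gpgsign=false",
        *args,
    ]
    result = subprocess.run(cmd, cwd=repo, check=True,
                            capture_output=True, text=True)
    return result.stdout


def demo_history() -> list:
    """Make two commits and return [author, date, message] for each one."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git(repo, "init")
        script = repo / "analysis.py"
        # First saved stage of the project, stored with a message.
        script.write_text("print('v1')\n")
        git(repo, "add", "analysis.py")
        git(repo, "commit", "-m", "Add first version of the analysis")
        # Second stage: only one copy of the file exists on disk,
        # the earlier version is kept inside the repository.
        script.write_text("print('v2')\n")
        git(repo, "commit", "-am", "Fix output of the analysis")
        # "Who made what change and when", newest commit first.
        log = git(repo, "log", "--format=%an|%ad|%s", "--date=short")
        return [line.split("|") for line in log.splitlines()]
```

Calling `demo_history()` returns one entry per commit, which is the information used to debug, to recover the reason for a change, or to find its author.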
Reproducibility, Deployability
In the Paleolithic era:
Virtual Machines
“Bare Metal” Installation
In modern days:
Software Containers
Runs the same regardless of the environment.
A container image is a lightweight, stand-alone, executable package of software that includes everything needed to run it.
[Figure: Virtual Machines versus Containers. Virtual Machines: Host Hardware, Host OS, Hypervisor, and a Guest OS for each VM (VM1, VM2) running its Apps. Containers: Host Hardware, Host OS, Container Engine, and Apps.]
[Figure: container workflow with the steps build, push, host, and pull & run.]
Reproducibility, Scalability, Shareability
The game-changing combination of workflow managers and containers:
Workflows in the Modern era:
Enables scalable and reproducible scientific workflows. It simplifies the deployment of complex parallel and reactive workflows.
Reactive workflow framework
Create pipelines with asynchronous (and implicitly parallelized) data streams
Programming DSL
Has its own language for building a pipeline
Containerized
Out-of-the-box integration with container engines (Docker, Singularity, Shifter)
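The idea of a reactive pipeline built on asynchronous data streams can be sketched in plain Python with `asyncio` queues. This is not the DSL of any workflow manager, only an analogy: the function names (`stage`, `run_pipeline`) and the two toy processing steps are hypothetical, and `asyncio` gives concurrency within one process rather than the parallel execution a real workflow manager provides.

```python
import asyncio


async def stage(func, inbox, outbox):
    """Apply `func` to every item arriving on `inbox` and forward the result."""
    while True:
        item = await inbox.get()
        if item is None:            # sentinel: upstream has finished
            await outbox.put(None)  # propagate the end of the stream
            break
        await outbox.put(func(item))


async def run_pipeline(samples):
    """Connect two stages with queues and collect what comes out."""
    raw, trimmed, done = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    # Both stages start immediately; each one reacts as soon as data
    # arrives on its input stream, without waiting for the full batch.
    workers = [
        asyncio.create_task(stage(str.strip, raw, trimmed)),
        asyncio.create_task(stage(str.upper, trimmed, done)),
    ]
    for sample in samples:
        await raw.put(sample)
    await raw.put(None)

    results = []
    while (item := await done.get()) is not None:
        results.append(item)
    await asyncio.gather(*workers)
    return results


print(asyncio.run(run_pipeline([" acgt ", "ttga\n"])))
```

No stage is told when to run: the pipeline is defined by how the streams are connected, and the order of execution follows from the data.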
The creation of workflow pipelines was designed for bioinformaticians familiar with programming.
Its execution is for everyone.
The final frontier
Traditional Technique Approach: Design, Code, Test
Test Driven Development Approach: Test, Code, Refactor
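A minimal sketch of the test-first cycle, assuming a hypothetical function `reverse_complement` as the feature being developed: the test is written before the code, the code is then written to make it pass, and refactoring happens while the test keeps passing.

```python
# Step 1 (Test): written first. Running it at this point would fail,
# because reverse_complement does not exist yet.
def test_reverse_complement():
    assert reverse_complement("ATGC") == "GCAT"
    assert reverse_complement("") == ""


# Step 2 (Code): an implementation that makes the test pass.
# Step 3 (Refactor): a first draft could loop over each base; here it has
# been tidied into a translation table, with the test guarding the change.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence (hypothetical example)."""
    return sequence.upper().translate(_COMPLEMENT)[::-1]
```

With pytest installed, saving this as a file whose name starts with `test_` and running `pytest` would collect and run `test_reverse_complement` automatically.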
In the field of microbial bioinformatics, good software engineering practices are not yet widely adopted. (...) This paper serves as a resource that could help microbial bioinformaticians get started with software testing if they have not had formal training.
The CSIS repository aims to promote the uptake of testing practices and engage the community in its adoption for public health.
This repository is an open-source project that gathers guidance, guidelines and examples for software testing for microbial bioinformatics researchers.
A proof of concept that the adoption of new standards can be crowdsourced.
Include test files with known expected outcomes for a successful run.
Include files or other inputs on which the tool is expected to fail.
unittest (https://docs.python.org/3/library/unittest.html)
pytest (https://docs.pytest.org/en/stable/)
testthat (https://testthat.r-lib.org/)
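The two recommendations above, an input with a known expected outcome and an input on which the tool is expected to fail, can be sketched with `unittest`. The function `gc_content` is a hypothetical stand-in for the tool under test, and the inputs are inline strings rather than real test files to keep the example short.

```python
import unittest


def gc_content(sequence: str) -> float:
    """Hypothetical tool under test: fraction of G and C in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence or set(sequence) - set("ACGT"):
        raise ValueError("expected a non-empty sequence of A, C, G and T")
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


class TestGcContent(unittest.TestCase):
    def test_known_outcome(self):
        # Input with a known expected result for a successful run.
        self.assertAlmostEqual(gc_content("ATGC"), 0.5)

    def test_invalid_input_fails(self):
        # Inputs on which the tool is expected to fail.
        with self.assertRaises(ValueError):
            gc_content("ATGX")
        with self.assertRaises(ValueError):
            gc_content("")
```

Saved in a file such as `test_gc_content.py` (the name is an assumption), the tests can be run with `python -m unittest`.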
Try it out!
Special thanks to Pedro Vila-Cerqueira, Rafael Mamede and Mário Ramirez.
FCT PhD Grants SFRH/BD/129483/2017
COVID/BD/152618/2022
MRamirez Lab, iMM
2019
By Inês Mendes