Open Science

Toni Hermoso Pulido (@toniher)

Bioinformatics Core Facility

Centre for Genomic Regulation (BCN)

https://biocore.crg.eu


A processes-focused introduction

PHINDaccess - February 2022

Open Science

not only a matter of outcomes, but also of processes

work in open

Open Science

... more than open access

Document

Write it down or ...

it didn't happen!

Document: Why?

  • Organise ideas
  • Help you and others understand code and steps in the future
  • Fixing errors
  • Help in future publication

Document: Where?

  • File System (e.g. README or TODO files)
  • Version Control System
    • Git, SVN, etc.
  • Content Management System
    • Wiki CMS, Drupal, etc.
  • Electronic Lab Notebook (ELN)

Document: How?

  • Plain text
  • Format

Document: How?

  • Format
    • Structured
      • Config files
        • XML, JSON, INI, YAML
      • Templates (e.g. in wikis)
      • Database Management Systems (relational or NoSQL)
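
A minimal sketch of the structured-format idea, using only the Python standard library: the same record is written as JSON and as INI and then read back. The `metadata` dictionary and its keys are hypothetical values chosen for illustration, not part of any real project.

```python
import configparser
import io
import json

# Hypothetical experiment metadata, used only for illustration.
metadata = {"sample": "S1", "replicates": "3", "aligner": "bwa"}

# JSON: supports nesting and is readable by most languages.
json_text = json.dumps({"experiment": metadata}, indent=2, sort_keys=True)

# INI: flat sections of "key = value" pairs.
ini = configparser.ConfigParser()
ini["experiment"] = metadata
buffer = io.StringIO()
ini.write(buffer)
ini_text = buffer.getvalue()

# Both serialisations can be parsed back into the same dictionary.
from_json = json.loads(json_text)["experiment"]
reader = configparser.ConfigParser()
reader.read_string(ini_text)
from_ini = dict(reader["experiment"])

print(json_text)
print(ini_text)
```

Because both files are plain text, they can be tracked in a version control system like any other document.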

Document: How?

  • When text is not possible
    • Ensure open formats
      • Interoperability
      • Avoid vendor lock-in
      • e.g., for images: PNG, TIFF
    • Whenever possible, favor lossless formats (e.g., prefer TIFF over JPEG)

Document: visibility

  • Open / private
    • The more accessible, the easier for third parties to collaborate
    • Important to define the moment of disclosure
      • Publishing strategy
      • Engagement
      • Handling 3rd-party issues

Document: license

  • Check with your institution!
  • Copyleft or not? (e.g., GPL vs MIT)
  • Some licenses may be more suitable for some contents

Tag and track

I never said so!

Tag and track: Why?

 

  • Convenient backup
  • Error tracking and reversion
  • Checking history
  • Allowing collaboration at different time points
  • Publication of specific snapshots

Tag and track: Where?

 

Tag and track: Concepts

 

  • Revision, Version, Commit
  • Branch
  • Tag, Release
  • Fork, Pull request

Tag and track: Publish

 

Reproduce

Run it again, Sam!

Reproduce: Why?

  • Research outputs now include code and data, not only textual statements
  • Peers and collaborators should be able to reproduce results by themselves
    • Check errors
    • Improve code, data
    • Test in different conditions

 

Standing on the shoulders of giants

Reproduce: How?

  • Code requirements, recipes
    • Scripts
    • Test frameworks
    • Package managers (e.g. Conda)
    • Jupyter
  • Virtualisation

Reproduce: Note on python

  • pyenv & pyenv-virtualenv
    • pyenv install x.y.z
    • pyenv virtualenv x.y.z myvenv
  • pip
    • pip freeze > requirements.txt
    • pip install -r requirements.txt
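
As a rough illustration of what `pip freeze` records, the sketch below lists the installed distributions as pinned `name==version` lines using the standard library module `importlib.metadata`. The function name `frozen_requirements` is hypothetical, and this is only an approximation: it does not reproduce how pip reports editable or VCS installs.

```python
from importlib import metadata


def frozen_requirements():
    """Return sorted 'name==version' lines for the installed distributions."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installations that lack a name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


if __name__ == "__main__":
    print("\n".join(frozen_requirements()))
```

Pinning exact versions in this way is what lets a collaborator rebuild the same environment later.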

Reproduce: Other languages

 

Reproduce: Conda

 

  • Popular package manager
    • Also takes care of binaries and libraries
  • Bioconda: specific Bioinformatics recipes

Reproduce: Jupyter

  • Former IPython Notebook
  • Combines in a single notebook documentation (Markdown), comments and executable code with its output
  • Underlying notebook format is a JSON text file
    • Can be exported into PDF, HTML, etc.
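
Since a notebook is a JSON text file, it can be inspected with ordinary tools. The sketch below builds a minimal notebook as a Python dictionary, assuming the nbformat 4 layout, and lists its cell types with the standard `json` module. The cell contents and the helper `cell_types` are hypothetical examples.

```python
import json

# A minimal notebook, assuming the nbformat 4 layout.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Analysis notes"],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print(1 + 1)"],
        },
    ],
}

# This text is what would be stored in a .ipynb file.
text = json.dumps(notebook, indent=1)


def cell_types(notebook_text):
    """Return the type of each cell found in a notebook JSON string."""
    return [cell["cell_type"] for cell in json.loads(notebook_text)["cells"]]


print(cell_types(text))
```

Being plain text also means notebooks can be versioned, although stored outputs tend to make the diffs noisy.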

Reproduce: Jupyter

  • Besides Python (2 or 3), kernels are available for other languages:
    • R, Perl5, Perl6, Javascript, more...
  • Additional widgets (e.g. for charts)
  • Convenient for sharing code and training
  • Jupyter gallery on GitHub

Reproduce: Docker

  • Allows shareable Linux systems that can be run on any machine where Docker is installed
  • Build images with a script file (Dockerfile), very similar to a Linux command-line script
  • Repository of Docker images
    • You can reuse, adapt, extend
    • Don't reinvent the wheel

Reproduce: Docker

  • Microservices principle
    • 1 Image -> n Containers -> n Services
    • n Services -> 1 full application
  • Example: BLAST Web application
    • Web server container
    • Database container
    • BLAST application running container
  • Making it work together:

Reproduce: Singularity

  • Like Docker but more suitable for HPC environments
  • No Docker daemon needs to be running, which is less problematic for security
  • Docker images convertible into Singularity ones
  • Container images are files by default so they can be archived and moved more easily

 

Pipelines & Workflows

Guilty by association

Pipelines & Workflows: Why?

 

  • Write programs that do one thing and do it well.
  • Write programs to work together.
  • Write programs to handle text streams, because that is a universal interface.

Unix Philosophy

D. McIlroy, P. H. Salus

Pipelines & Workflows: How?

 

Pipelines and Workflows: Nextflow

 

  • Concepts
    • Processes
      • Any pipeline or program (in any language)
      • In local disk or in containers (Singularity, Docker)
    • Channels
      • FIFO queue
      • Normally files in a filesystem
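
The process and channel concepts can be pictured with a small Python analogy. This is not Nextflow code (Nextflow has its own DSL); it is only a sketch in which two hypothetical "processes", `producer` and `uppercase`, run as threads and exchange items through FIFO queues that play the role of channels.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a channel


def producer(out_channel, items):
    """Emit each input item into the channel, then close it."""
    for item in items:
        out_channel.put(item)
    out_channel.put(DONE)


def uppercase(in_channel, out_channel):
    """Consume items in arrival order and emit the transformed result."""
    while True:
        item = in_channel.get()
        if item is DONE:
            out_channel.put(DONE)
            break
        out_channel.put(item.upper())


def run(items):
    """Wire the two processes together and collect the final output."""
    first, second = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=producer, args=(first, items)),
        threading.Thread(target=uppercase, args=(first, second)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    results = []
    while True:
        item = second.get()
        if item is DONE:
            break
        results.append(item)
    return results


if __name__ == "__main__":
    print(run(["acgt", "ttga"]))
```

Each process only knows its input and output channels, which is what allows steps to be swapped or run on different executors without touching the others.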

Pipelines and Workflows: Nextflow

 

  • Concepts
    • Config files
      • Config files can include one another, so a pipeline can be adapted to different scenarios
    • Executors
      • Local machine
      • HPC cluster: SGE, Univa, SLURM, etc.
      • Cloud systems: AWS, Azure, Google Cloud

Diversity

There's more than one way to do it

Criteria

  • Kind of tasks
  • Team profiles
  • Infrastructure and privacy
  • Previous knowledge and time

Criteria: Tasks

  • Data Analysis
  • Interface / Web programming
  • Teaching/Training

 

 

  • Environment (where it can be achieved)
    • Interface/Web
    • HPC
    • etc.

Criteria: Profiles

  • Wet lab scientists
  • Statisticians, programmers
  • Citizens

 

 

  • Personal and working situations
    • Interns, PhD students, PostDocs
    • Technicians (full-time, temporary)
    • Project funding length

Criteria: Infrastructure, privacy

  • Data transfer
    • Cluster vs Cloud
  • Sysadmin or devops support
  • Human or clinical data involved
  • Funding vs time

 

 

Criteria: Knowledge

Questions?

Comments?