Open Science.
Good practices in Bioinformatics
Open Science
Document
Write it down or ...
it didn't happen!
Document: Why?
- Organise ideas
- Understanding code and steps in the future for you and others
- Fixing errors
- Help in future publication
Document: Where?
- File System (e.g. README or TODO files)
- Version Control System
  - Git, SVN, etc.
- Content Management System
  - Wiki CMS, Drupal, etc.
Document: How?
- Plain text
- Format
  - Unstructured
    - Free
    - Markdown
    - Wikitext
Document: How?
- Format
  - Structured
    - Config files
      - XML, JSON, INI, YAML
    - Templates (e.g. in wikis)
    - Database Management Systems (relational or NoSQL)
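As a minimal sketch of why structured formats help, the same run parameters can be kept as JSON or INI and read back with Python's standard library. The keys (`sample`, `threads`, `aligner`) and their values are hypothetical, chosen only for illustration.

```python
import configparser
import json

# Hypothetical analysis parameters (names and values are illustrative only).
JSON_TEXT = '{"sample": "S1", "threads": 4, "aligner": "bowtie2"}'

INI_TEXT = """
[run]
sample = S1
threads = 4
aligner = bowtie2
"""


def load_json_params(text):
    """Parse JSON; numbers keep their type (threads is an int)."""
    return json.loads(text)


def load_ini_params(text):
    """Parse INI; values are strings unless converted explicitly."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    run = parser["run"]
    return {
        "sample": run["sample"],
        "threads": run.getint("threads"),  # explicit conversion to int
        "aligner": run["aligner"],
    }


json_params = load_json_params(JSON_TEXT)
ini_params = load_ini_params(INI_TEXT)
print(json_params == ini_params)
```

Either file can be versioned next to the code, so the parameters of a given run are documented rather than remembered.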
Tag and track
I never said so!
Tag and track: Why?
- Convenient backup
- Error tracking and reversion
- Checking history
- Allowing collaboration at different points in time
- Publication of specific snapshots
Tag and track: Where?
- Code, documentation:
  - Version Control System (Git, SVN, etc.)
  - Wiki CMS (e.g. [Semantic] MediaWiki)
- Data, files
  - Plain Git (small files) or Git with large-file support (e.g. Git LFS)
  - Document Management Systems
Tag and track: Concepts
- Revision, Version, Commit
- Branch
- Tag, Release
- Fork, Pull request
Tag and track: Publish
- Working and executable code
  - Docker & Singularity hubs
- Identify content & code (DOI)
  - Figshare
  - Zenodo (with GitHub)
- Bio-specific repositories
  - Sequence Read Archive (SRA)
  - Gene Expression Omnibus (GEO)
  - ENA, EGA and others
Reproduce
Run it again, Sam!
Reproduce: Why?
- Results now include code and data, not only textual statements
- Peers and collaborators should be able to reproduce them by themselves
  - Check errors
  - Improve code, data
  - Test under different conditions
Reproduce: How?
- Code requirements, recipes
- Virtualisation
  - Hypervisors: VirtualBox, VMware, etc.
  - Containers: Docker, Singularity
Reproduce: Note on python
- pyenv & pyenv-virtualenv
  - pyenv install x.y.z
  - pyenv virtualenv x.y.z myvenv
- pip
  - pip freeze > requirements.txt
  - pip install -r requirements.txt
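A pinned `requirements.txt` is only useful if the running environment actually matches it. The sketch below, which assumes simple `name==version` lines, parses such a file and reports packages whose installed version differs. The package names and versions are placeholders, and `parse_pins`, `installed_version` and `mismatches` are hypothetical helpers, not part of pip.

```python
from importlib import metadata


def parse_pins(text):
    """Return {package: version} for every 'name==version' line."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def mismatches(pins):
    """List (name, wanted, found) where the environment differs from the pins."""
    return [
        (name, wanted, installed_version(name))
        for name, wanted in sorted(pins.items())
        if installed_version(name) != wanted
    ]


# Hypothetical file content; package names and versions are placeholders.
EXAMPLE = """
# analysis dependencies
examplepkg-one==1.2.3
examplepkg-two==0.9.0  # pinned for reproducibility
"""

pins = parse_pins(EXAMPLE)
print(pins)
print(mismatches(pins))
```

An empty list from `mismatches` means every pinned package is installed at the pinned version.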
Reproduce: Other languages
Reproduce: Conda
- Popular package manager
  - Also takes care of binaries and libraries
  - Bioconda: bioinformatics-specific recipes
Reproduce: Jupyter
- Formerly IPython Notebook
- Combines documentation (Markdown), comments and executable code with its output in a single notebook
- Underlying notebook format is a JSON text file
  - Can be exported to PDF, HTML, etc.
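Because a notebook is plain JSON, it can be inspected with nothing more than the standard library. The sketch below builds a minimal notebook by hand in the nbformat 4 layout and counts its cells by type. The cell contents are made up for illustration, and `summarise` is a hypothetical helper.

```python
import json
from collections import Counter

# A minimal notebook in the nbformat 4 layout, written by hand for illustration.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# GC content\n", "Count G and C in a short sequence."],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["seq = 'ATGCGC'\n", "sum(seq.count(b) for b in 'GC')"],
        },
    ],
}

# Serialising gives the text that would be stored in an .ipynb file.
text = json.dumps(notebook, indent=1)


def summarise(nb_text):
    """Count cells by type in notebook JSON text."""
    nb = json.loads(nb_text)
    return Counter(cell["cell_type"] for cell in nb["cells"])


print(summarise(text))
```

The same property is what makes notebooks easy to put under version control, although stored outputs can make the diffs noisy.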
Reproduce: Jupyter
- Apart from Python (2 or 3), other languages are now supported through kernels:
  - R, Perl 5, Perl 6, JavaScript, more...
- Additional widgets (e.g. for charts)
- Convenient for sharing code and training
- Jupyter gallery on GitHub
Reproduce: Docker
- Allows shareable Linux systems that can be run on any machine where Docker is installed
- Build images with a script file (Dockerfile), very similar to a Linux command-line script
- Repository of Docker images
  - You can reuse, adapt, extend
  - Don't reinvent the wheel
Reproduce: Docker
- Microservices principle
  - 1 Image -> n Containers -> n Services
  - n Services -> 1 full application
- Example: BLAST Web application
  - Web server container
  - Database container
  - Container running the BLAST application
- Making it work together:
  - System scripts
  - Docker Compose
  - etc.
Reproduce: Singularity
- Like Docker but more suitable for HPC environments
- No running Docker daemon needed, so fewer security concerns
- Docker images convertible into Singularity ones
- Singularity Repository
Pipelines & Workflows
Guilty by association
Pipelines & Workflows: Why?
- Write programs that do one thing and do it well.
- Write programs to work together.
- Write programs to handle text streams, because that is a universal interface.
Unix Philosophy
D. McIlroy, P. H. Salus
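The three principles above can be sketched in Python with small generator functions, each doing one thing and passing a stream of text lines to the next. The FASTA content and the function names are hypothetical, used only for this example.

```python
def read_lines(text):
    """Yield lines one at a time, like reading a text stream."""
    for line in text.splitlines():
        yield line


def only_headers(lines):
    """Keep FASTA header lines (those starting with '>')."""
    return (line for line in lines if line.startswith(">"))


def sequence_ids(headers):
    """Extract the identifier: the first word after '>'."""
    return (header[1:].split()[0] for header in headers)


# Hypothetical FASTA content used only for this example.
FASTA = """>seq1 Homo sapiens
ATGC
>seq2 Mus musculus
GGCA
"""

# Steps are chained much like commands joined with a shell pipe.
ids = list(sequence_ids(only_headers(read_lines(FASTA))))
print(ids)
```

Each step can be tested or replaced on its own, which is the practical benefit of keeping them small.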
Pipelines & Workflows: How?
- Traditionally from Shell script files
- Frameworks or applications
  - Web-based
  - GUI and command-line
  - Command-line
- Common Workflow Language
Pipelines and Workflows: Nextflow
- Concepts
  - Processes
    - Any pipeline or program (in any language)
    - On local disk or in containers (Singularity, Docker)
  - Channels
    - FIFO queue
    - Normally files in a filesystem
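The idea of processes connected by a FIFO channel can be illustrated with plain Python. This is only an analogy built on `queue.Queue` and a thread, not Nextflow code or its API, and the file names are hypothetical.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def producer(names, channel):
    """First step: emit items into the channel in order."""
    for name in names:
        channel.put(name)
    channel.put(DONE)


def consumer(channel, results):
    """Second step: take items in arrival order and process each one."""
    while True:
        item = channel.get()
        if item is DONE:
            break
        results.append(item.upper())


channel = queue.Queue()  # first in, first out
results = []
samples = ["a.fastq", "b.fastq", "c.fastq"]  # hypothetical file names

worker = threading.Thread(target=consumer, args=(channel, results))
worker.start()
producer(samples, channel)
worker.join()
print(results)
```

The two steps know nothing about each other beyond the channel, so either one can be swapped without touching the other.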
Pipelines and Workflows: Nextflow
- Concepts
  - Config files
    - Config files can include one another, so a pipeline can be adapted to different scenarios
  - Executors
    - Local machine
    - HPC cluster: SGE, Univa, SLURM, etc.
    - Cloud systems: Amazon Cloud, Apache Ignite
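Layering config files so that one scenario overrides another can be sketched as a recursive dictionary merge. The keys below are illustrative and are not Nextflow's own configuration schema; `merge` is a hypothetical helper.

```python
def merge(base, override):
    """Return base updated with override, merging nested dicts recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical settings; the keys are illustrative, not a real schema.
BASE = {"process": {"executor": "local", "cpus": 1}, "container": "docker"}
CLUSTER = {"process": {"executor": "slurm"}, "container": "singularity"}

local_run = merge(BASE, {})
cluster_run = merge(BASE, CLUSTER)
print(local_run)
print(cluster_run)
```

Only the values that differ on the cluster are written down, and everything else is inherited from the base settings.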
Questions?
Comments?
Diversity
There's more than one way to do it
Criteria
- Kind of tasks
- Team profiles
- Infrastructure and privacy
- Previous knowledge and time
Criteria: Tasks
- Data Analysis
- Interface / Web programming
- Teaching/Training
- Environment (where it can be achieved)
- Interface/Web
- HPC
- etc.
Criteria: Profiles
- Wet lab scientists
- Statisticians, programmers
- Citizens
- Personal and working situations
- Interns, PhD students, PostDocs
- Technicians (full-time, temporary)
- Project funding length
Criteria: Infrastructure, privacy
- Data transfer
- Cluster vs Cloud
- Sysadmin or devops support
- Human or clinical data involved
- Funding vs time
Criteria: Knowledge
- Programming language(s)
- Python, R, JavaScript, Java, Perl
- Availability of libraries / reusing
- Frameworks, platforms
Open Science. Good practices in Bioinformatics
By Similis.cc
A short tutorial about good practices and tools in Bioinformatics focused on Open Science