Neha Moopen
Research Data Manager
2021-02-22 / GSLS OS Module
Best Practices in Writing Reproducible Code
@UtrechtUniversity & @NEONScience
Computational reproducibility means that detailed information is provided about code, software, hardware and implementation details (Victoria Stodden, 2014).
This image was created by Scriberia for The Turing Way community and is used under a CC-BY licence. The image was obtained from https://zenodo.org/record/3332808.
Peng (2011) provides a useful distinction:
A study may be more or less reproducible than another depending on what data and code are made available (Peng, 2011).
The stakeholders involved:
If you need more convincing...
Five selfish reasons to work reproducibly (Florian Markowetz, 2015)
DOCUMENTATION
ORGANIZATION
AUTOMATION
DISSEMINATION
You want to understand how code you wrote some time ago works
You want others to understand how to (re-)use your code
Explain your code with comments.
Explain what to install and how to get started in your README.
Explain in-depth use of your code in a notebook.
Comments are annotations you write directly in the code source.
They:
are written for users who deal with your source code
explain parts that are not intuitive from the code itself
do not replace readable or structured code
(in a specific structure) can be used to directly generate documentation for users.
Comic source: Geek & Poke
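As an illustration of the points on comments above, here is a minimal Python sketch. The function name and the docstring layout are assumptions chosen for this example, not something prescribed by the slides. The structured comment (docstring) is the kind of text a documentation generator can pick up, while the inline comment explains a detail that the code alone does not make obvious.

```python
def celsius_to_kelvin(temp_c):
    """Convert a temperature from degrees Celsius to kelvin.

    Parameters
    ----------
    temp_c : float
        Temperature in degrees Celsius.

    Returns
    -------
    float
        Temperature in kelvin.
    """
    # 273.15 is the offset between the two scales; naming it avoids
    # an unexplained "magic number" in the calculation.
    kelvin_offset = 273.15
    return temp_c + kelvin_offset


# The docstring is stored on the function itself, so help() and
# documentation tools can read it without opening the source file.
print(celsius_to_kelvin.__doc__.splitlines()[0])
```

Note that the comments do not describe what each line does; the code is readable enough for that. They explain the why.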
The README page is the first thing your user will see!
The contents typically include one or more of the following:
Reference: Wikipedia's README page
An example README:
HOW CAN YOU DO IT?
Address folder structure + file & folder naming:
Contain your project in a single recognizable folder.
Distinguish folder types, name them accordingly:
Source: Wilson et al. (2017)
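A possible way to set up such a project folder is sketched below. The folder names follow the kind of layout suggested in Wilson et al. (2017), but the exact list, the function name create_project and the project name my_analysis are assumptions for illustration; adapt them to your own project.

```python
import tempfile
from pathlib import Path

# Assumed folder types, one per kind of content; change to fit your project.
FOLDERS = ["data", "doc", "results", "src"]


def create_project(root, name):
    """Create one recognizable project folder with a subfolder per folder type."""
    project = Path(root) / name
    for folder in FOLDERS:
        # exist_ok=True makes the script safe to run more than once.
        (project / folder).mkdir(parents=True, exist_ok=True)
    # An empty README marks the entry point; its contents are written by hand.
    (project / "README.md").touch()
    return project


# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    project = create_project(tmp, "my_analysis")
    print(sorted(p.name for p in project.iterdir()))
```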
File & Folder names should be:
Human-readable file names
File names that support sorting
source: OSF's File naming Guide
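The two naming goals above (human-readable and sortable) can be combined in a small helper. This is only a sketch: the function make_filename, the underscore/hyphen convention and the example descriptions are assumptions made for illustration, not rules taken from the OSF guide.

```python
import re
from datetime import date


def make_filename(when, description, extension, sequence=None):
    """Build a name such as '2021-02-22_02_field-notes.csv'."""
    # ISO 8601 dates (YYYY-MM-DD) sort chronologically as plain text.
    parts = [when.isoformat()]
    if sequence is not None:
        # Zero-padding keeps 02 before 10 in an alphabetical listing.
        parts.append(f"{sequence:02d}")
    # Lowercase the description and replace anything that is not a
    # letter or digit with a hyphen, so the name has no spaces.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    parts.append(slug)
    return "_".join(parts) + "." + extension.lstrip(".")


names = [
    make_filename(date(2021, 2, 22), "Field notes", "csv", sequence=10),
    make_filename(date(2021, 2, 22), "Field notes", "csv", sequence=2),
    make_filename(date(2020, 11, 5), "Pilot data (raw)", "csv"),
]
# A plain alphabetical sort now also orders the files by date and sequence.
print(sorted(names))
```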
File organization should:
source: Intro to Reproducible Science @NEONScience
VERSION CONTROL!
GitHub (or other social coding platform):
synergistic with version control software git
makes history public and accessible (eek!)
allows publication of different releases
provides a platform for interaction and collaboration
FOR ARCHIVING RELEASES: ZENODO
WHY DO YOU NEED VERSION CONTROL?
It will help you manage your code and most of your other files (it is like track changes on steroids: it applies to all files in a folder).
It allows you to trace back your steps: if something breaks, you can figure out what happened.
NO MORE thesis_final_final_SERIOUSLYFINAL.Rmd
A good version control system allows you to collaborate and share!
A good version control system facilitates experimentation!
source: phdcomics.com
WHAT IS GIT?
Allows you to log updates, branch your work (so you can experiment without losing the original!), keep all backups, while efficiently using your storage
Gives the user a lot of control over what to track, and adds a narrative to changes ('commit messages')
DO: Commits should be atomic: comprehensive 'units' of changes.
DON'T: edit for a full day and put this in a single commit (or worse: forget to...)
Commits should have informative messages so you (and others) can trace your steps
Track most files; list the files you don't want tracked in .gitignore.
Explore new ideas with branches, keep a stable version on master
HOW TO GIT?
source: https://xkcd.com/1296/
HOW DO YOU DO IT?
SCRIPTING VS. POINT & CLICK
Script = more time spent up front, but will save time in the long run.
DRY & FUNCTIONALIZE EVERYTHING!
Don't Repeat Yourself: if your analysis is composed of scripts, with repeated code throughout, it will be more time consuming to maintain and update.
Instead use modularity: use functions to write code in reusable chunks
source: Best Practices in Writing Reproducible Code @UtrechtUniversity
Functions are smaller code units responsible for one task.
Functions are meant to be reused
Functions accept arguments (though the argument list may also be empty!)
Which arguments a function accepts is defined by its parameters
Functions do not necessarily make code shorter (at first)! Compare:
source: Best Practices in Writing Reproducible Code @UtrechtUniversity
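A hypothetical comparison of the kind referred to above, sketched in Python. The data and the function name center are made up for this example. The version with a function takes slightly more lines for two variables, but the formula is written (and fixed, if it is wrong) in one place only.

```python
heights_cm = [150.0, 160.0, 170.0]
weights_kg = [50.0, 60.0, 70.0]

# Version 1: no function. Short, but the same formula is typed twice.
mean_height = sum(heights_cm) / len(heights_cm)
centered_heights = [h - mean_height for h in heights_cm]

mean_weight = sum(weights_kg) / len(weights_kg)
centered_weights = [w - mean_weight for w in weights_kg]


# Version 2: with a function. A few more lines up front,
# but every further variable costs only one extra line.
def center(values):
    """Return the values shifted so that their mean is zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


centered_heights_fn = center(heights_cm)
centered_weights_fn = center(weights_kg)

# Both versions give the same result.
print(centered_heights == centered_heights_fn)
print(centered_weights == centered_weights_fn)
```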
It's better to think in building blocks:
source: Best Practices in Writing Reproducible Code @UtrechtUniversity
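One way to picture the building-block idea is a workflow assembled from small single-task functions. The sketch below is an assumption-laden illustration: the function names, the outlier limits and the input values are all invented for this example.

```python
def parse_measurements(lines):
    """Turn lines of text into floats, skipping blank lines."""
    return [float(line) for line in lines if line.strip()]


def remove_outliers(values, lower, upper):
    """Keep only the values inside the closed interval [lower, upper]."""
    return [v for v in values if lower <= v <= upper]


def summarize(values):
    """Return the count, minimum, maximum and mean of the values."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


def run_analysis(lines):
    """Combine the blocks into one workflow; each block can be reused or swapped."""
    values = parse_measurements(lines)
    values = remove_outliers(values, lower=0.0, upper=100.0)
    return summarize(values)


raw = ["12.5", "", "48.0", "999.0", "30.5"]
print(run_analysis(raw))
```

Because each block does one thing, it can be tested on its own and reused in another analysis without copying code.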
WHY DO YOU NEED IT?
HOW DO YOU DO IT?
Document workflow: R Markdown / Jupyter Notebook
Collaborate with Colleagues / Version Control: GitHub
Publish Data Snapshot: FigShare, Dryad, Zenodo, etc.
Share workflow: Notebook Viewer, Binder
Archive your project on Zenodo, and get a DOI!
GitHub & Zenodo have a great integration that makes it easy to archive a whole repository.
First, select your repository:
source: Best Practices in Writing Reproducible Code @UtrechtUniversity
Second, release your project and follow the workflow:
source: Best Practices in Writing Reproducible Code @UtrechtUniversity
Last,
source: Best Practices in Writing Reproducible Code @UtrechtUniversity
As a final touch, take your DOI and place it as a badge in your GitHub README!
source: Best Practices in Writing Reproducible Code @UtrechtUniversity
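For the badge step, a small helper can produce the Markdown line to paste into the README. This is a sketch under assumptions: the badge URL pattern below reflects how Zenodo DOI badges commonly look, and the DOI is a placeholder. In practice, copy the exact snippet that Zenodo shows on your record page.

```python
def doi_badge_markdown(doi):
    """Return Markdown for a DOI badge image that links to the DOI."""
    # Assumed URL pattern for the badge image; verify against your Zenodo record.
    badge_url = f"https://zenodo.org/badge/DOI/{doi}.svg"
    # doi.org resolves any DOI to the archived record.
    target_url = f"https://doi.org/{doi}"
    return f"[![DOI]({badge_url})]({target_url})"


# Placeholder DOI for illustration only; use the one Zenodo assigns to you.
print(doi_badge_markdown("10.5281/zenodo.0000000"))
```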
You get more efficient, less redundant science: others can build upon your work!