Electronic Lab Notebooks:
Recording Computational Work

Daniel Himmelstein (@dhimmel)

Biomedical Graduate Studies Orientation

University of Pennsylvania

BRB Auditorium

August 22, 2019 at 2:00 PM

slides.com/dhimmel/bgs

slides released under CC BY 4.0

Greene Lab

http://www.greenelab.com/

Outline:

  • files (especially text files) and directories

  • computational notebooks (e.g. Jupyter or R Markdown)

  • version control

  • code review on GitLab / GitHub

  • open source

  • Manubot

files & directories

ISO 8601

2019-08-22

2019-08-22_bgs-presentation.pdf
2019-08-23_all-of-my-secrets.txt
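
A minimal Python sketch of the date-prefix convention shown above. The helper name `dated_filename` is hypothetical and not part of the original slides; it only illustrates that ISO 8601 prefixes make alphabetical order match chronological order.

```python
from datetime import date


def dated_filename(day: date, slug: str, extension: str) -> str:
    """Prefix a file name with an ISO 8601 date (YYYY-MM-DD)."""
    return f"{day.isoformat()}_{slug}.{extension}"


# The two example files from the slide, deliberately listed out of order.
names = [
    dated_filename(date(2019, 8, 23), "all-of-my-secrets", "txt"),
    dated_filename(date(2019, 8, 22), "bgs-presentation", "pdf"),
]

# Plain alphabetical sorting also orders the files by date.
for name in sorted(names):
    print(name)
```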

how to name files?

01.download-data.ipynb
02.process-data.ipynb
03.visualize-data.ipynb
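
One way to generate numbered names like these, sketched in Python. The helper `numbered_name` and the `10.report.ipynb` file are hypothetical, added only to show why the zero padding matters once a pipeline grows past nine steps.

```python
def numbered_name(step: int, slug: str, width: int = 2) -> str:
    """Return a notebook name such as '01.download-data.ipynb'."""
    return f"{step:0{width}d}.{slug}.ipynb"


slugs = ["download-data", "process-data", "visualize-data"]
names = [numbered_name(i, slug) for i, slug in enumerate(slugs, start=1)]
print(names)

# Without zero padding, a hypothetical step 10 sorts before step 2.
unpadded = sorted(["2.process-data.ipynb", "10.report.ipynb"])
# With zero padding, alphabetical order matches execution order.
padded = sorted(["02.process-data.ipynb", "10.report.ipynb"])
print(unpadded[0], padded[0])
```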

Why is computational research unique?

100% recordable

notebooks

  • restart & run all

  • single script to run entire pipeline
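
A sketch of what such a single script could look like, assuming the numbered notebooks sit in one directory and that `jupyter nbconvert` is installed. The function names are hypothetical. Each notebook is executed from a fresh kernel, which is the command-line equivalent of "restart & run all".

```python
import subprocess
from pathlib import Path


def notebook_commands(directory: Path) -> list:
    """Build one `jupyter nbconvert` command per notebook, in name order."""
    notebooks = sorted(directory.glob("*.ipynb"))
    return [
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", str(nb)]
        for nb in notebooks
    ]


def run_pipeline(directory: Path) -> None:
    """Execute every notebook in order, stopping at the first failure."""
    for command in notebook_commands(directory):
        # check=True raises CalledProcessError if a notebook fails to run.
        subprocess.run(command, check=True)
```

Calling `run_pipeline(Path("."))` from the project directory would rerun the whole analysis. Because the notebooks are sorted by name, the numeric prefixes determine the execution order.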

caution

version control

git log \
  --pretty=short \
  --abbrev-commit

code review

versioned environments

dhimmel/elevcan: repository for "Lung cancer incidence decreases with elevation: evidence for oxygen as an inhaled carcinogen"

Error due to glmnet 2.0-2 versus 1.9-5

conquer your environment

conda

Control packages

Control OS + packages
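
Even without conda, it can help to record the versions an analysis actually ran with. The following is a minimal sketch using only the Python standard library. It is an illustration rather than a replacement for a versioned environment, and `environment_snapshot` is a hypothetical helper name.

```python
import platform
import sys
from importlib import metadata


def environment_snapshot() -> dict:
    """Collect the interpreter, OS, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with missing metadata
            packages[name] = dist.version
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


snapshot = environment_snapshot()
print(snapshot["python"], snapshot["platform"])
print(len(snapshot["packages"]), "packages recorded")
```

Saving this snapshot next to the results makes a version mismatch, such as the glmnet one above, easier to diagnose later.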

open source

convert rms-fsf-slide-propreitary.png -channel RGB -negate -transparent black rms-fsf-slide-propreitary-negated.png

FreeSoftware TEDx slides. (2014) Reused under CC BY 3.0

proprietary software:
the software controls the science

FreeSoftware TEDx slides. (2014) Reused under CC BY 3.0

convert rms-fsf-slide.png -channel RGB -negate -transparent black rms-fsf-slide-negated.png

open source software:
the scientist controls the software

by default, scientific outputs are subject to copyright

sometimes universities place additional legal barriers to reuse

Recommendations:

  1. release data under an open license
  2. University researchers: commit to open in your resource sharing plan

manubot

Beyond the PDF First Day Notes

By De Jongens van de Tekeningen

Licensed under CC BY 3.0

Modified to invert colors

citation by persistent identifier

This is a sentence with 5 citations [
  @doi:10.1038/nbt.3780;
  @pmid:29424689;
  @pmcid:PMC5938574;
  @arxiv:1407.3561;
  @url:https://greenelab.github.io/meta-review/
].

References

  1. Reproducibility of computational workflows is automated using continuous analysis
    Brett K Beaulieu-Jones, Casey S Greene
    Nature Biotechnology (2017-03-13) https://doi.org/f9ttx6
    DOI: 10.1038/nbt.3780 · PMID: 28288103 · PMCID: PMC6103790
     
  2. Sci-Hub provides access to nearly all scholarly literature.
    Daniel S Himmelstein, Ariel Rodriguez Romero, Jacob G Levernier, Thomas Anthony Munro, Stephen Reid McLaughlin, Bastian Greshake Tzovaras, Casey S Greene
    eLife (2018-03-01) https://www.ncbi.nlm.nih.gov/pubmed/29424689
    DOI: 10.7554/elife.32822 · PMID: 29424689 · PMCID: PMC5832410
     
  3. Opportunities and obstacles for deep learning in biology and medicine
    Travers Ching, Daniel S. Himmelstein, Brett K. Beaulieu-Jones, Alexandr A. Kalinin, Brian T. Do, Gregory P. Way, Enrico Ferrero, Paul-Michael Agapow, Michael Zietz, Michael M. Hoffman, … Casey S. Greene
    Journal of the Royal Society Interface (2018-04) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5938574/
    DOI: 10.1098/rsif.2017.0387 · PMID: 29618526 · PMCID: PMC5938574
     
  4. IPFS - Content Addressed, Versioned, P2P File System
    Juan Benet
    arXiv (2014-07-14) https://arxiv.org/abs/1407.3561v1
     
  5. Open collaborative writing with Manubot
    Daniel S. Himmelstein, David R. Slochower, Venkat S. Malladi, Casey S. Greene, Anthony Gitter
    (2018-08-03) https://greenelab.github.io/meta-review/

This is a sentence with 5 citations [1,2,3,4,5].
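
To make the citation syntax concrete, here is a small Python sketch that splits each key in the source sentence into its prefix and identifier. The regular expression is an assumption made for illustration and is not Manubot's own parser.

```python
import re

# Illustrative pattern for keys of the form @prefix:identifier.
# An identifier ends at whitespace, a semicolon, or a closing bracket.
CITATION_PATTERN = re.compile(r"@(?P<prefix>[a-z]+):(?P<identifier>[^\s;\]]+)")

text = """This is a sentence with 5 citations [
  @doi:10.1038/nbt.3780;
  @pmid:29424689;
  @pmcid:PMC5938574;
  @arxiv:1407.3561;
  @url:https://greenelab.github.io/meta-review/
]."""

citations = [(m["prefix"], m["identifier"]) for m in CITATION_PATTERN.finditer(text)]
for prefix, identifier in citations:
    print(prefix, identifier)
```

The prefix says which kind of persistent identifier follows (DOI, PubMed ID, PubMed Central ID, arXiv ID, or URL), so the bibliographic details do not have to be entered by hand.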

https://manubot.org/catalog/

AMA!

@dhimmel

0000-0002-3012-7446

Slides
https://slides.com/dhimmel/bgs
