The Data Biologist Cookbook

Daniel Himmelstein

September 17, 2015

visualization evolved using electric sheep

License: all original content CC0, reused content attributed via hyperlink

the allegory of academia

rising journal subscription costs in the UK

  • major variability
  • opaque market
  • access is not guaranteed
  • In 2012, UCSF paid $7,733 for 2 physical copies of Nature

Does copyright support continued availability and distribution?

Disappearing decades: Amazon titles by decade

  • Works published prior to 1923 are public domain

open access citation advantage

SPARC identified 70 studies: 46 yes versus 17 no




  • copyright: rights automatically granted to creators of original work
  • copyright transfer — cornerstone of subscription publishing
  • no license — all rights reserved, copyright transfer

copyright and licensing






Definition: a draft of an article that has not yet been peer reviewed for formal publication


  • establish precedence
  • citeable & versioned
  • receive feedback
  • compatible with most publishers
  • attention & openness


Stanford's Biomedical Computation Review mentions our research and preprint 6 months before publication

data repositories

Domain specific repositories: GEO, dbGaP

open scientists

Image: The National Gallery of Art, Washington D.C.

Incumbent view:

I deserve compensation for any profit derived from my work.

Revolutionary view:

I will judge the success of my work by

  • the benefit it creates for others
  • the effect it has on the future

Bee sting pain index by body location

Michael Smith

Alexandra Elbakyan

"the goal is to collect all research papers ever published, and make them free" [source]

Satoshi Nakamoto

computational tools

The less effort, the faster and more powerful you will be.

— Bruce Lee

Calligraphy: sho-shin beginner's mind by Hidy Ochiai

  • DOI: digital object identifier
  • Format: 10.registrant/identifier
    Example: 10.7717/peerj.705
  • url:
  • Issued by 9 agencies
    crossref, datacite
  • metadata
  • citation
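
The DOI format above can be sketched in a few lines of Python. This is a minimal illustration, not an official client: the function names `split_doi` and `doi_to_url` are made up for this example, and it assumes the public doi.org proxy is used for resolution.

```python
from urllib.parse import quote


def split_doi(doi):
    """Split a DOI into its prefix (10.registrant) and its identifier."""
    prefix, identifier = doi.split("/", 1)
    if not prefix.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix, identifier


def doi_to_url(doi):
    """Build a resolvable URL by appending the DOI to the doi.org proxy."""
    # Percent-encode unusual characters but keep the slash separator.
    return "https://doi.org/" + quote(doi, safe="/")


prefix, identifier = split_doi("10.7717/peerj.705")
url = doi_to_url("10.7717/peerj.705")
print(prefix, identifier, url)
```

Splitting on the first slash only matters because the identifier part may itself contain slashes.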

plain text editor

# markdown introduction

markdown was proposed by John Gruber in 2004.
It is:

+ formatting language
+ no boilerplate
+ readable in plain text
+ supports *italics*, **bold**, and `code`
+ `.md` extension

Three websites that use markdown are:

1. GitHub
2. reddit
3. Trello
>>> import phd
  • general
  • powerful
  • elegant
  • 3 > 2
  • conda + pip
  • testing (pytest)
  • practice zen
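
As a rough sketch of what testing with pytest can look like: the helper `gc_content` below is a hypothetical example function, not something from the talk. pytest collects functions whose names start with `test_` from files named like `test_*.py` and reports any failing `assert`.

```python
def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence (hypothetical helper)."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in sequence) / len(sequence)


def test_gc_content():
    # Plain asserts are all pytest needs; no test class is required.
    assert gc_content("GGCC") == 1.0
    assert gc_content("atat") == 0.0
    assert gc_content("ATGC") == 0.5
```

Saved as, say, `test_gc.py`, this would be run by calling `pytest` in the same directory.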

See R best practices for more

  • statistics

  • bioinformatics

  • hadleyverse
    • dplyr, tidyr — data manipulation
    • readr, readxl — io
    • ggplot2 — plotting
    • devtools, roxygen2 — development
  • you will pay in blood for using R
    see dhimmel/snplentiful#3


Speed of C, abstraction of Python

For a doctoral example, see CauseMap by Cyrus Maher

  • version control
  • distributed
  • cryptographic history
  • non-linear development
  • accelerate your science
  • social
  • collaborative
  • commit specific links
  • reproducibility
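
To make "cryptographic history" concrete, here is a small sketch of how git names a file's content. It assumes the default SHA-1 object format (newer repositories can opt into SHA-256), and `git_blob_hash` is an illustrative name, not part of any git library.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Compute the object ID git assigns to a file (blob) with this content."""
    # git hashes a short header, "blob <size>\0", followed by the raw bytes.
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()


original = git_blob_hash(b"p = 0.05\n")
modified = git_blob_hash(b"p = 0.04\n")
print(original)
print(modified)
print(original != modified)
```

Because commits hash their tree and their parent commits in turn, changing any file anywhere in the history changes every later commit ID, which is what makes commit-specific links a stable reference.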


  • clarify issues up front
  • don't be afraid to ask
  • do not expect understanding without communication

build your own brand

how to BYOB: identity

  • onymity, pseudonymity, anonymity
  • handle
  • claim your accounts
    • orcid
    • twitter
    • github
  • avatar
  • style
    • vocabulary
    • visual theme
    • dialect
    • humor

personal website

Clint Cario

Daniel Himmelstein

Andrew Sczesnak

Kieran Mace

Kyle Barlow

Svetlana Lyalina

how to BYOB


  • create a public work
  • software package
  • tutorial


  • meetups
  • conferences
  • online
  • twitter


  • publons
  • pubmed commons

Painting on 9' × 5' plywood by Sam Tarling

art of presentation

  • simplicity
  • text
  • multimedia
  • html
  • fragments


the beginning

Data Biologist Cookbook

By Daniel Himmelstein

Presentation to the incoming iPQB students at UCSF on September 17, 2015.
