Open data science for early career scientists: the dhimmel chronology

By Daniel Himmelstein

Lecture for EPID 600

Data Science for Biomedical Informatics

1:00 – 2:30 pm, BRB 251

October 20, 2016

Slides at

Open Access

  • Science is publicly funded
  • Publication is meant to disseminate work
  • Subscription journals have a business model built on depriving access to scholarship
  • Journals don't pay authors, reviewers, and academic editors

Does copyright support continued availability and distribution?

Disappearing decades: Amazon titles by decade

  • Works published prior to 1923 are public domain

Open access citation advantage

McKiernan et al. (2016) eLife

What if your work is submitted to a subscription journal?

  • Irreversible forfeiture of your intellectual property in dereliction of the public interest occurs when you sign the:
    • Copyright Transfer Agreement
    • License to Publish
  • Workarounds:
    • Preempt by publishing a preprint under an open license
    • Attempt to modify the agreement using the following techniques.

SPARC Addendum

Modify the copyright transfer agreement


From Meet the Robin Hood of Science by Simon Oxenham:

On the evening of November 9th, 1989, the Cold War came to a dramatic end with the fall of the Berlin Wall. Four years ago another wall began to crumble, a wall that arguably has as much impact on the world as the wall that divided East and West Germany. The wall in question is the network of paywalls that cuts off tens of thousands of students and researchers around the world, at institutions that can’t afford expensive journal subscriptions, from accessing scientific research.

In an age of philanthropic pirates (e.g. Sci-Hub), does open access matter?

Piracy is unviable for commercial or large-scale efforts 

Text & Data Mining

  • The total amount of literature is growing exponentially
  • Increasingly machines are reading the literature
  • Machines are in most cases restricted to the open access subset of the literature.
  • Machines lead to inlinks and citations






Definition of a preprint: a draft of an article that has not yet been peer reviewed for formal publication


  • establish precedence
  • citeable & versioned
  • receive feedback
  • compatible with most publishers
  • attention & openness


Stanford's Biomedical Computation Review mentions our research and preprint 6 months before publication

Publishing Delays





March 26, 2015: my paper on Heterogeneous Network Edge Prediction is accepted to PLOS Computational Biology.





⌛⌛⌛⌛⌛⌛⌛             ⌛⌛⌛⌛

68 days

The history of publishing delays

Are publication delays getting shorter or longer? Kendall Powell, writing a feature for Nature News, contacted me. Her investigation had uncovered a widespread belief that delays were worsening with time. But she wanted data, and the existing data was field specific or anecdotal.

Time from submission to acceptance for 3,330,333 articles since 1965
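A minimal sketch of how a submission-to-acceptance delay could be computed and summarized per year. The records below are invented for illustration and are not taken from the 3,330,333-article dataset; the function names are hypothetical as well.

```python
from datetime import date
from statistics import median

# Hypothetical (received, accepted) date pairs, invented for illustration only.
records = [
    (date(2014, 1, 10), date(2014, 4, 20)),
    (date(2014, 6, 1), date(2014, 9, 9)),
    (date(2015, 5, 1), date(2015, 6, 21)),
]


def acceptance_delay_days(received, accepted):
    """Number of days between submission (received) and acceptance."""
    return (accepted - received).days


def median_delay_by_year(records):
    """Group delays by year of acceptance and take the median of each group."""
    by_year = {}
    for received, accepted in records:
        delay = acceptance_delay_days(received, accepted)
        by_year.setdefault(accepted.year, []).append(delay)
    return {year: median(delays) for year, delays in sorted(by_year.items())}


print(median_delay_by_year(records))
```

The median is used here rather than the mean because a handful of very slow articles would otherwise dominate the summary.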

Realtime Open Notebook Science

  • Hetnet of biology designed for drug repurposing
  • ~50 thousand nodes
    • 11 types (labels)
  • ~2.25 million relationships
    • 24 types
  • integrates 29 public resources
    • knowledge from millions of studies
  • Use at
  • Predicted probability of treatment for ~200,000 compound-disease pairs
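To make "nodes and relationships of several types" concrete, here is a toy sketch of a hetnet as a data structure. The `Node` and `Hetnet` classes are hypothetical and are not the format Hetionet is distributed in. The two bupropion "treats" relationships reflect the approvals mentioned in these slides, while the "binds" relationship and `GENE_X` are placeholders.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    kind: str  # node type (label), such as "Compound" or "Disease"
    name: str


@dataclass
class Hetnet:
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)  # (source, relationship type, target)

    def add_edge(self, source, kind, target):
        """Add a typed relationship, registering both endpoints as nodes."""
        self.nodes.update((source, target))
        self.edges.append((source, kind, target))

    def node_type_counts(self):
        """Count nodes per node type."""
        return Counter(node.kind for node in self.nodes)

    def edge_type_counts(self):
        """Count relationships per relationship type."""
        return Counter(kind for _, kind, _ in self.edges)


bupropion = Node("Compound", "bupropion")
hetnet = Hetnet()
hetnet.add_edge(bupropion, "treats", Node("Disease", "depression"))
hetnet.add_edge(bupropion, "treats", Node("Disease", "nicotine dependence"))
hetnet.add_edge(bupropion, "binds", Node("Gene", "GENE_X"))  # placeholder, not real data

print(hetnet.node_type_counts())
print(hetnet.edge_type_counts())
```

Counting by type is the same kind of summary quoted above (11 node types, 24 relationship types), applied here to four nodes and three relationships.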

Hetionet v1.0

Visualizing Hetionet v1.0

Does bupropion treat nicotine dependence?

  • Bupropion was first approved for depression in 1985
  • In 1997, bupropion was approved for smoking cessation
  • Can we predict this repurposing from Hetionet? The prediction was:

Network support that bupropion treats nicotine dependence.

Visualization by Antoine Lizee

I am a lawyer, but not your lawyer (or UC’s lawyer), and this isn’t legal advice.

― Katie Fortney

Others, like myself, try to remember to rate everything that I've read.

Lars Jensen

It took me a while to figure out.

Antoine Lizee

I would like to enter into the discussion the cases where there was a tough decision to be made

Pouya Khankhanian

117 Ratings

1504 Word Post

2776 Word Post




Nice of you to share this big network with everyone; however, I think you need to take care not to get yourself into legal trouble here. … 

I am not trying to cause trouble here — just the contrary. When making a meta-resource, licenses and copyright law are not something you can afford to ignore. I regularly leave out certain data sources from my resources for legal reasons.

One network to rule them all

We have completed an initial version of our network. …

Network existence (SHA256 checksum for graph.json.gz) is proven in Bitcoin block 369,898.

Discussion DOIs: bfmk, bfmm, bfmn, bfmp
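The checksum mentioned above can be sketched with Python's standard `hashlib`. This covers only the hashing step, not the Bitcoin timestamping. The file written below is a small stand-in with made-up contents, not the real graph.json.gz, and `sha256_of_file` is a hypothetical helper name.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Stand-in file with made-up contents, used only to demonstrate the call.
stand_in_bytes = b'{"nodes": [], "edges": []}'
with tempfile.TemporaryDirectory() as directory:
    stand_in_path = os.path.join(directory, "graph.json.gz")
    with open(stand_in_path, "wb") as handle:
        handle.write(stand_in_bytes)
    checksum = sha256_of_file(stand_in_path)

print(checksum)
```

Anyone holding the same file can recompute the digest and compare it with the published one. A single changed byte yields a different digest.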

  • Hetionet integrates data from 29 resources
  • 12 had an open license
  • 9 had no license
  • Incompatibilities - Share Alike vs Non-Commercial
  • Requested permission for 11 resources
  • Median time to first response was 16 days
  • 2 affirmative responses
  • Removed MSigDB
  • "LICENSEE agrees not to put … the DATABASE on a … server … that may be accessed by any individual other than the LICENSEE."
  • "LICENSEE agrees to provide … a written evaluation of the PROGRAM and the DATABASE, including a description of its functionality or problems and areas for further improvement."

Legal barriers to data reuse


release data under an open license

Awareness of the looming data licensing crisis is growing.

  • Universities have conflicts of interest
  • Lack of education
  • Unresolved legal questions

Creative Commons Licenses

Computational Reproducibility

Error due to glmnet 2.0-2 versus 1.9-5

Environment where our analysis worked

While diagnosing the error, I found a bug that affected values in the manuscript

Control the Environment


Control packages

Control OS + packages
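One way to catch a silent package upgrade like the glmnet one is to compare installed versions against pinned ones before running an analysis. The sketch below is a Python analogue, not the approach used in the original analysis: glmnet is an R package, so its name and version strings appear here only to mirror the example, the version lookup is injected, and treating 1.9-5 as the pinned version is an assumption made for illustration.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a Python package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_version_drift(pinned, lookup=installed_version):
    """Return {package: (expected, found)} for every package off its pin."""
    drift = {}
    for package, expected in pinned.items():
        found = lookup(package)
        if found != expected:
            drift[package] = (expected, found)
    return drift


# Assumed pin and a fake lookup standing in for the upgraded environment.
pinned = {"glmnet": "1.9-5"}
upgraded_environment = {"glmnet": "2.0-2"}
drift = find_version_drift(pinned, lookup=upgraded_environment.get)

print(drift)
```

Pinning packages this way only detects drift. Controlling the operating system as well requires something like a container image.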

Docker for computational reproducibility

Beaulieu-Jones & Greene (2016) bioRxiv

Continuous analysis for automatic reproducibility

Beaulieu-Jones & Greene (2016) bioRxiv

Cognoma Contribution Counts

  • 11 repositories across 4 teams
  • 50+ individuals involved on GitHub
  • full stack data science

Pull requests for realtime review

Your Turn

Licensing Workshop

  1. Is the data subject to copyright? If no, end.
  2. Does the resource have a license?
  3. If no, contact the creators and ask whether there is a license that allows reuse.
  4. If yes, does the license allow:
    • access to anyone
    • redistribution
    • modification
    • commercial reuse (does the license discriminate against any persons or groups)
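The checklist above can be written as a small decision function. This is a sketch for working through the exercise, not legal advice; the class names, field names, and returned messages are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class License:
    open_access: bool       # access to anyone
    redistribution: bool
    modification: bool
    commercial_reuse: bool


@dataclass
class Resource:
    name: str
    copyrightable: bool
    license: Optional[License] = None


def assess(resource):
    """Walk the workshop checklist in order and report the first stopping point."""
    if not resource.copyrightable:
        return "not subject to copyright: done"
    if resource.license is None:
        return "no license: contact the creators"
    # Collect every permission the license withholds, in field order.
    restricted = [name for name, allowed in vars(resource.license).items() if not allowed]
    if restricted:
        return "license restricts: " + ", ".join(restricted)
    return "license allows all four uses"


# Hypothetical resources for illustration.
unlicensed = Resource("example database", copyrightable=True)
noncommercial = Resource(
    "example dataset",
    copyrightable=True,
    license=License(True, True, True, commercial_reuse=False),
)

print(assess(unlicensed))
print(assess(noncommercial))
```

Reporting every restricted permission, rather than stopping at the first, makes it easier to see which resources are incompatible with each other.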

At the start of this class, every pupil was asked to list 3 databases / datasets / data resources that they have used in their research.

Report progress to

Open Data Science

Guest lecture for the course Data Science for Biomedical Informatics (EPID 600) at the University of Pennsylvania. This course was instructed by Assistant Professor Blanca Himes. This presentation is released under a CC BY 4.0 License.