Head of Data Integration at Related Sciences. Digital craftsman of the biodata revolution.
Open data science for early career scientists: the dhimmel chronology
By Daniel Himmelstein
Lecture for EPID 600
Data Science for Biomedical Informatics
1:00 – 2:30 pm, BRB 251
October 20, 2016
- Science is publicly funded
- Publication is meant to disseminate work
- Subscription journals have a business model built on depriving access to scholarship
- Journals don't pay authors, reviewers, and academic editors
Does copyright support continued availability and distribution?
Disappearing decades: Amazon titles by decade
- Works published prior to 1923 are public domain
Open access citation advantage
McKiernan et al. (2016) eLife
What if your work is submitted to a subscription journal?
- Irreversible forfeiture of your intellectual property in dereliction of the public interest occurs when you sign the:
- Copyright Transfer Agreement
- License to Publish
- Preempt by publishing a preprint under an open license
- Attempt to modify the agreement using the following techniques.
Modify the copyright transfer agreement
From Meet the Robin Hood of Science by Simon Oxenham:
On the evening of November 9th, 1989, the Cold War came to a dramatic end with the fall of the Berlin Wall. Four years ago another wall began to crumble, a wall that arguably has as much impact on the world as the wall that divided East and West Germany. The wall in question is the network of paywalls that cuts off tens of thousands of students and researchers around the world, at institutions that can’t afford expensive journal subscriptions, from accessing scientific research.
In an age of philanthropic pirates (e.g. Sci-Hub), does open access matter?
Piracy is unviable for commercial or large-scale efforts
Text & Data Mining
- The total amount of literature is growing exponentially
- Increasingly machines are reading the literature
- Machines are in most cases restricted to the open access subset of the literature.
- Machines lead to inlinks and citations
Definition: a draft of an article that has not yet been peer reviewed for formal publication
- establish precedence
- citeable & versioned
- receive feedback
- compatible with most publishers
- attention & openness
Stanford's Biomedical Computation Review mentions our research and preprint 6 months before publication
March 26, 2015: my paper on Heterogeneous Network Edge Prediction is accepted to PLOS Computational Biology.
The history of publishing delays
Are publication delays getting shorter or longer? Kendall Powell, writing a feature for Nature News, contacted me. Her investigation had uncovered a widespread belief that delays were worsening with time. But she wanted data, and the existing data was field specific or anecdotal.
Time from submission to acceptance for 3,330,333 articles since 1965
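A minimal sketch of how such delays could be computed, assuming each article record carries a submission ("received") date and an acceptance date. The records and function names below are hypothetical illustrations, not the dataset or code behind the figure.

```python
from collections import defaultdict
from datetime import date
from statistics import median

# Hypothetical records: (received date, accepted date). Illustrative only.
articles = [
    (date(2014, 1, 10), date(2014, 4, 20)),
    (date(2014, 6, 1), date(2014, 9, 9)),
    (date(2015, 2, 3), date(2015, 7, 3)),
]


def acceptance_delay_days(received, accepted):
    """Return the number of days between submission and acceptance."""
    return (accepted - received).days


def median_delay_by_year(records):
    """Group records by acceptance year and return the median delay per year."""
    delays = defaultdict(list)
    for received, accepted in records:
        delays[accepted.year].append(acceptance_delay_days(received, accepted))
    return {year: median(values) for year, values in sorted(delays.items())}


print(median_delay_by_year(articles))
```

Grouping by acceptance year is one choice among several; grouping by submission year or by journal would answer slightly different questions.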
Realtime Open Notebook Science
Visualizing Hetionet v1.0
Does bupropion treat nicotine dependence?
- Bupropion was first approved for depression in 1985
- In 1997, bupropion was approved for smoking cessation
- Can we predict this repurposing from Hetionet? The prediction was:
- 99.5th percentile for nicotine dependence
- probability 2.50-fold greater than null
Network support that bupropion treats nicotine dependence.
Visualization by Antoine Lizee
I am a lawyer, but not your lawyer (or UC’s lawyer), and this isn’t legal advice.
Others, like myself, try to remember to rate everything that I've read.
It took me a while to figure out.
I would like to enter into the discussion the cases where there was a tough decision to be made
1504 Word Post
2776 Word Post
Nice of you to share this big network with everyone; however, I think you need to take care not to get yourself into legal trouble here. …
I am not trying to cause trouble here — just the contrary. When making a meta-resource, licenses and copyright law are not something you can afford to ignore. I regularly leave out certain data sources from my resources for legal reasons.
One network to rule them all
We have completed an initial version of our network. …
Network existence (SHA256 checksum for graph.json.gz) is proven in Bitcoin block 369,898.
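A minimal sketch of computing a SHA-256 checksum for a file such as graph.json.gz, using Python's standard `hashlib`. The function name `sha256_of_file` and the chunk size are assumptions for illustration; how the digest was recorded in the Bitcoin blockchain is not covered here.

```python
import hashlib
import sys


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Example invocation (hypothetical script name):
    #   python checksum.py graph.json.gz
    for path in sys.argv[1:]:
        print(sha256_of_file(path), path)
```

Anyone holding a copy of the file can recompute the digest and compare it with the published one; a match shows the file is byte-for-byte identical to the one that was checksummed.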
- Hetionet integrates data from 29 resources
- 12 had an open license
- 9 had no license
Incompatibilities - Share Alike vs Non-Commercial
- Requested permission for 11 resources
- Median time to first response was 16 days
- 2 affirmative responses
- Removed MSigDB
- "LICENSEE agrees not to put … the DATABASE on a … server … that may be accessed by any individual other than the LICENSEE."
- "LICENSEE agrees to provide … a written evaluation of the PROGRAM and the DATABASE, including a description of its functionality or problems and areas for further improvement"
Legal barriers to data reuse
Release data under an open license
Awareness of the looming data licensing crisis is growing.
- Universities have conflicts of interest
- Lack of education
- Unresolved legal questions
Creative Commons Licenses
See also opendefinition.org
dhimmel/elevcan: repository for Lung cancer incidence decreases with elevation: evidence for oxygen as an inhaled carcinogen
Control the Environment
Control OS + packages
Docker for computational reproducibility
Beaulieu-Jones & Greene (2016) bioRxiv
Continuous analysis for automatic reproducibility
Beaulieu-Jones & Greene (2016) bioRxiv
Cognoma Contribution Counts
- 11 repositories across 4 teams
- 50+ individuals involved on GitHub
- full stack data science
Pull requests for realtime review
- Is the data subject to copyright? If no, end.
- Does the resource have a license?
- If no, contact the creators and ask whether there is a license that allows reuse.
- If yes, does the license allow:
- access to anyone
- commercial reuse (does the license discriminate against any persons or groups?)
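The questions above can be sketched as a small decision function. This is an illustrative encoding only: the `Resource` fields and the outcome strings are hypothetical names, it covers just the questions listed, and it is not legal advice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    """Hypothetical description of a data resource for the license checklist."""
    copyrightable: bool
    license_name: Optional[str] = None   # None means no license was found
    open_access: bool = False            # license allows access to anyone
    commercial_reuse: bool = False       # license allows commercial reuse


def reuse_assessment(resource):
    """Walk the checklist in order and return a short outcome string."""
    if not resource.copyrightable:
        return "not subject to copyright: end"
    if resource.license_name is None:
        return "no license: contact the creators"
    if resource.open_access and resource.commercial_reuse:
        return "license allows access to anyone and commercial reuse"
    return "license restricts reuse: review terms or request permission"


# Usage with made-up resources.
print(reuse_assessment(Resource(copyrightable=False)))
print(reuse_assessment(Resource(copyrightable=True)))
print(reuse_assessment(Resource(True, "CC BY 4.0", True, True)))
print(reuse_assessment(Resource(True, "CC BY-NC 4.0", True, False)))
```

Keeping the checks in the same order as the list means the earliest blocking question determines the outcome.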
At the start of this class, every pupil was asked to list 3 databases / datasets / data resources that they have used in their research.
Report progress to git.io/vPQjW
Open Data Science