Mercè Crosas, Institute for Quantitative Social Science, Harvard University
"Every two years, the amount of digitized data is equal to all of the data ever collected before. The world’s knowledge is at our fingertips, and data science allows us to effectively and efficiently make use of that knowledge. This is facilitating a societal shift as big as the Industrial Revolution."
Data Science Director, UVA
Former Associate Director for Data Science, NIH
UVAToday Q&A, August 21, 2017
Data sharing is "the release of research data, associated metadata, accompanying documentation, and software code for re-use and analysis in such a manner that they can be discovered on the Web and referred to in a unique and persistent way."
Data Publishing Group, 2015
(motto of the Royal Society, founded in 1660,
launched first scientific journal in 1665)
Since the Beginning of Modern Science ...
University of California Curation Center, DataPub blog, August 2017
The ability of a researcher to duplicate the results of a prior study
... using the same materials and procedures used by the original investigator. (reproducibility)
... if the same procedures are followed but new data are collected. (replication)
Empirical: data and collection details are made freely available
Computational: code, software, hardware and implementation details are provided
Statistical: details on the choice of statistical tests and model parameters are provided
Begley & Ellis, Nature, 2012
(from Scientists at Amgen biotechnology)
21% of Literature Data are in line with in-house data
Prinz, Schlange, Asadullah, 2011, Nature Reviews Drug Discovery
Aims to release results by end of 2017
Independently replicating a subset of experimental results from 50 high-profile papers in the field of cancer biology published between 2010 and 2012
Lab released the first publicly available Ebola sequences (on GenBank), and clinical data (on Harvard Dataverse).
"We were amazed by the surge of collaboration that followed"
Yozwiak, Shaffner, Sabeti, 2015 "Make Outbreak Research Open Access" Nature
Gaps in data sharing during the peak of the Ebola outbreak
Image source: Andres Colubri, Sabeti Lab
Data published at Harvard Dataverse
Bicycle data released by BARI was the centerpiece of Boston Mayor's Bike Safety Report
Castro, Crosas, Garnett, Sheridan, Altman, 2017, Journal of Scholarly Publishing
"We believe that both as a matter of fairness and as a matter of providing an incentive for data sharing, the persons who initially gathered the data should receive appropriate and standardized credit that can be used for academic advancement, for grant applications, and in broader situations."
Open-source software to share, cite, and find data.
Developed at Harvard's Institute for Quantitative Social Science
2006 (we started)
Harvard Dataverse Repository
> 70,000 datasets total
> 49,000 datasets uploaded to Harvard Dataverse repository
> 340,000 files
> 2.5 M downloads
King, 1995, Replication, Replication
Altman and King, 2007, A Proposed Standard for the Scholarly Citation of Quantitative Data
Altman et al, 2001, A Digital Library for the Dissemination and Replication of Quantitative Social Science
King, 2007, An Introduction to the Dataverse Network as an Infrastructure for Data Sharing
Crosas, Honaker, King, Sweeney, 2015, Automating Open Science for Big Data
Crosas, 2012, The Dataverse Network: an open source application for sharing, discovering, and preserving research data
Altman and Crosas, 2013, The Evolution to Data Citation: from principles to implementation
Crosas, 2013, A Data Sharing Story
2014, Joint Declaration of Data Citation Principles
Pepe et al, 2014, How Do Astronomers Share Data?
Goodman et al, 2014, Ten Simple Rules for the Care and Feeding of Scientific Data
Castro et al, 2015, Achieving Human and Machine Accessibility of Cited Data
Sweeney, Crosas, Bar-Sinai, 2015, Sharing Sensitive Data with Confidence: The DataTags System
Meyer et al, 2016, Data Publication with the Structural Biology Data Grid Supports Live Analysis
Wilkinson et al, 2016, The FAIR Guiding Principles for Scientific Data Management and Stewardship
Bierer, Crosas, Pierce, 2017, Data Authorship as an Incentive to Data Sharing
Data should be ...
Wilkinson et al, 2016, "The FAIR Guiding Principles for Scientific Data Management and Stewardship" Nature Scientific Data
Data Licenses, User Agreements
Data Citation with Persistent Identifier (DOI)
Harvard Data Privacy Tools Project: privacytools.seas.harvard.edu
DataTags Project: datatags.org
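Data citation with a persistent identifier can be illustrated with a minimal sketch. It assumes a citation is assembled from author, year, title, DOI, repository name and version, and that the DOI is made resolvable by prefixing it with https://doi.org/. The Dataset class, the format_citation function, the field order and the sample values are all hypothetical and are not taken from the Dataverse software.

```python
from dataclasses import dataclass
from typing import Tuple

# Public DOI resolver; a DOI appended to this prefix becomes a persistent link.
DOI_RESOLVER = "https://doi.org/"


@dataclass(frozen=True)
class Dataset:
    """Hypothetical minimal metadata needed to cite a dataset."""
    authors: Tuple[str, ...]
    year: int
    title: str
    doi: str          # bare DOI, without the resolver prefix
    repository: str
    version: str


def format_citation(ds: Dataset) -> str:
    """Build a human-readable citation that ends in a resolvable DOI link.

    The field order here is an assumption chosen for illustration.
    """
    authors = "; ".join(ds.authors)
    return (
        f'{authors}, {ds.year}, "{ds.title}", '
        f"{DOI_RESOLVER}{ds.doi}, {ds.repository}, {ds.version}"
    )


if __name__ == "__main__":
    # Placeholder values only; this is not a real dataset or DOI.
    example = Dataset(
        authors=("Doe, Jane", "Roe, Richard"),
        year=2017,
        title="Example Survey Data",
        doi="10.5072/FK2/EXAMPLE",
        repository="Example Repository",
        version="V1",
    )
    print(format_citation(example))
```

Keeping the version in the citation lets a reader retrieve exactly the data that a published result relied on, even after the dataset is updated.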
Pasquier, Lau, Trisovic, Boose, Couturier, Crosas, Ellison, Gibson, Jones, Seltzer, 2017, If These Data Could Talk, Nature Scientific Data