Data Sharing for Better Science
Mercè Crosas, Institute for Quantitative Social Science, Harvard University
@mercecrosas
Max Planck Institute for Radio Astronomy, September 12, 2017
The Dataverse Project
This Talk
- Importance of Data Sharing
  - Reproducibility to verify science
  - Reuse to advance science and evidence-based policy
- Enabling Data Sharing
  - Data Policies from journals and funding agencies
  - Data Citation to find datasets, give credit to data authors
  - Data Repositories as publishers of data
Data Sharing, Data Publishing
Data sharing is "the release of research data, associated metadata, accompanying documentation, and software code for re-use and analysis in such a manner that they can be discovered on the Web and referred to in a unique and persistent way."
Data Publishing Group, 2015
Nullius in Verba
"Take nobody's word for it"
(motto of the Royal Society, founded in 1660,
launched first scientific journal in 1665)
Since the Beginning of Modern Science ...
University of California Curation Center, DataPub blog, August 2017
Reproducibility and Replication
(by the National Science Foundation):
The ability of a researcher to duplicate the results of a prior study
... using the same materials and procedures used by the original investigator. (reproducibility)
... if the same procedures are followed but new data are collected. (replication)
Empirical, Computational, and Statistical Reproducibility (Stodden, 2014):
Empirical: data and collection details are made freely available
Computational: code, software, hardware, and implementation details are provided
Statistical: details on the choice of statistical tests and model parameters are provided
Reproducibility
Reproducibility Crisis?
Trust, but verify
6 (11%) out of 53 landmark cancer biology studies could be reproduced.
39 out of 100 psychology studies could be reproduced.
Washington Post, Joel Achenbach, August 28, 2015
Nature, 2016, "1,500 scientists lift the lid on reproducibility", vol 533, issue 7604
Nature's survey of 1,576 researchers:
703 Biology
106 Chemistry
95 Earth and Environmental
203 Medicine
236 Physics and Engineering
233 Other
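As a quick consistency check, the discipline counts above add up to the 1,576 respondents. A minimal Python sketch (the variable names are illustrative) that totals the counts and computes each discipline's share:

```python
# Respondent counts by discipline, as listed above for the Nature survey.
survey_counts = {
    "Biology": 703,
    "Chemistry": 106,
    "Earth and Environmental": 95,
    "Medicine": 203,
    "Physics and Engineering": 236,
    "Other": 233,
}

# Total number of respondents across all disciplines.
total = sum(survey_counts.values())

# Share of each discipline, as a percentage rounded to one decimal place.
shares = {field: round(100 * n / total, 1) for field, n in survey_counts.items()}

for field, pct in shares.items():
    print(f"{field}: {survey_counts[field]} ({pct}%)")
print("Total respondents:", total)
```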
"In the Wf4Ever project we propose to improve the quality of science with metrics based on reproducibility and reuse, preserving decomposable thoroughly curated digital artefacts that enhances reproducibility and visibility of the experiment, as well as allowing more accurate mechanisms for credit attribution."
Sharing Data, Code, and Workflows Facilitates Reproducibility and Reduces Duplication
But data sharing is more than posting your data on your website
External links in all articles published between 1997 and 2008 in the four main astronomy journals published by the American Astronomical Society.
More than half of links to data in articles from 15 years ago are broken
70% of links to personal websites from articles published in 1997 are broken
How can we improve data sharing?
- New Norms
- New Incentives
- New Technology
Castro, Crosas, Garnett, Sheridan, Altman, 2017, Journal of Scholarly Publishing
Formal Data-Sharing Policies Are Applied in Journals Across Disciplines
Many Funders Require Data Sharing & Open Data
PRIVATE RESEARCH FUNDERS
- Bill and Melinda Gates Foundation Information Sharing Approach
- Sloan Foundation Data Sharing Policy
- Wellcome Trust Data Sharing Policy
- Arnold Foundation
- Moore Foundation
- Robert Wood Johnson Foundation
- HHMI Policy on the Sharing of Publication-Related Materials, Data and Software
PUBLIC RESEARCH FUNDERS
- Department of Agriculture
- Department of Commerce
- Department of Defense
- Department of Education
- Department of Energy
- Department of Health and Human Services
  - Agency for Healthcare Research and Quality (AHRQ)
  - Assistant Secretary for Preparedness and Response (ASPR)
  - Centers for Disease Control and Prevention (CDC)
  - Food and Drug Administration (FDA)
  - National Institutes of Health (NIH)
- Department of Homeland Security
- Department of Housing and Urban Development
- Department of the Interior
- Department of Labor
- Department of Transportation
- Department of Veterans Affairs
- Environmental Protection Agency (EPA)
"We believe that both as a matter of fairness and as a matter of providing an incentive for data sharing, the persons who initially gathered the data should receive appropriate and standardized credit that can be used for academic advancement, for grant applications, and in broader situations."
From 10,555 studies with gene expression microarray data:
- Studies that shared data received 9% more citations
- Data reuse by other researchers continued for 6 years
Data sharing increases citations
Piwowar and Vision (2013), Data reuse and the open data citation advantage. PeerJ 1:e175; DOI 10.7717/peerj.175
Our Institute Provides a Technology Solution to Data Sharing
Open-source software to share, cite, and find data.
Developed at Harvard's Institute for Quantitative Social Science
2006 (we started)
2017
dataverse.org
How Researchers Share & Use Data with Dataverse
Harvard Dataverse Repository
> 70,000 datasets total
> 49,000 datasets uploaded to Harvard Dataverse repository
200 datasets/month
> 340,000 files
4,000 files/month
> 2.5 M downloads
60,000 downloads/month
Datasets Added
Downloads
dataverse.harvard.edu
King, 1995, Replication, Replication
Altman and King, 2007, A Proposed Standard for the Scholarly Citation of Quantitative Data
Altman et al, 2001, A Digital Library for the Dissemination and Replication of Quantitative Social Science
King, 2007, An Introduction to the Dataverse Network as an Infrastructure for Data Sharing
Crosas, Honaker, King, Sweeney, 2015, Automating Open Science for Big Data
Crosas, 2012, The Dataverse Network: an open source application for sharing, discovering, and preserving research data
Altman and Crosas, 2013, The Evolution to Data Citation: from principles to implementation
Crosas, 2013, A Data Sharing Story
2014, Joint Declaration of Data Citation Principles
Pepe et al, 2014, How Do Astronomers Share Data?
Goodman et al, 2014, Ten Simple Rules for the Care and Feeding of Scientific Data
Castro et al, 2015, Achieving Human and Machine Accessibility of Cited Data
Sweeney, Crosas, Bar-Sinai, 2015, Sharing Sensitive Data with Confidence: The DataTags System
Meyer et al, 2016, Data Publication with the Structural Biology Data Grid Supports Live Analysis
Wilkinson et al, 2016, The FAIR Guiding Principles for Scientific Data Management and Stewardship
Bierer, Crosas, Pierce, 2017, Data Authorship as an Incentive to Data Sharing
Our Contributions to Enhance Data Sharing
2017
Data should be ...
- Findable
- Accessible
- Interoperable
- Reusable
Wilkinson et al., 2016, "The FAIR Guiding Principles for Scientific Data Management and Stewardship", Nature Scientific Data
FAIR Data in Dataverse
Data Files
Metadata
Data Licenses, User Agreements
Dataset Versions
Data Citation with Persistent Identifier (DOI)
A Dataverse is a container of Datasets and a Dataset is a container of data files, documentation, and code
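The container relationship described above can be sketched as a small data model. This is an illustration only: the class names, field names, citation layout, and the placeholder DOI are assumptions made for the example, not the actual Dataverse schema or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataFile:
    name: str
    kind: str  # hypothetical label: "data", "documentation", or "code"


@dataclass
class Dataset:
    title: str
    authors: List[str]
    year: int
    doi: str  # persistent identifier
    version: str = "V1"
    files: List[DataFile] = field(default_factory=list)

    def citation(self, repository: str) -> str:
        # Illustrative citation layout; the real format may differ.
        names = "; ".join(self.authors)
        return (f'{names}, {self.year}, "{self.title}", '
                f"https://doi.org/{self.doi}, {repository}, {self.version}")


@dataclass
class Dataverse:
    name: str
    datasets: List[Dataset] = field(default_factory=list)


# Placeholder dataset; the DOI below is not a real identifier.
example = Dataset(
    title="Example Survey Data",
    authors=["Doe, Jane"],
    year=2017,
    doi="10.5072/FK2/EXAMPLE",
    files=[
        DataFile("survey.tab", "data"),
        DataFile("codebook.pdf", "documentation"),
        DataFile("analysis.R", "code"),
    ],
)
collection = Dataverse(name="Example Dataverse", datasets=[example])
print(example.citation(collection.name))
```

Keeping the version and the DOI on the dataset, rather than on individual files, mirrors the idea that the dataset is the citable unit.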
Dataverse supports Astronomy Data
- Supports default astronomy metadata fields (based on virtual observatory schema)
- Extracts header metadata from FITS files upon ingest
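To illustrate what extracting header metadata from a FITS file involves, here is a minimal sketch using only the Python standard library. It relies on the FITS layout of 80-character header cards stored in 2880-byte blocks and ending with an END card. It is not how Dataverse implements ingest, it ignores details such as doubled quotes inside strings and continued values, and the sample header is made up for the example; real code would more likely use a dedicated FITS library.

```python
CARD_LEN = 80      # every FITS header card is 80 ASCII characters
BLOCK_LEN = 2880   # headers are padded to a multiple of 2880 bytes


def parse_fits_header(raw: bytes) -> dict:
    """Return keyword -> value (as strings) from a primary FITS header."""
    header = {}
    for start in range(0, len(raw), CARD_LEN):
        card = raw[start:start + CARD_LEN].decode("ascii")
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] != "= ":
            continue  # COMMENT, HISTORY, and blank cards carry no value
        value = card[10:]
        if value.lstrip().startswith("'"):
            # String value: keep the text between the single quotes.
            body = value.lstrip()[1:]
            end = body.find("'")
            text = body if end == -1 else body[:end]
            header[keyword] = text.rstrip()
        else:
            # Other values: drop the optional "/ comment" part.
            header[keyword] = value.split("/", 1)[0].strip()
    return header


def make_card(text: str) -> bytes:
    """Pad a card to 80 characters (helper for the sample header only)."""
    return text.ljust(CARD_LEN).encode("ascii")


# Made-up header for illustration, not taken from a real observation.
cards = [
    "SIMPLE  =                    T / conforms to FITS standard",
    "BITPIX  =                  -32 / bits per data value",
    "NAXIS   =                    2 / number of axes",
    "OBJECT  = 'M31     '           / target name",
    "COMMENT illustrative header only",
    "END",
]
raw = b"".join(make_card(c) for c in cards).ljust(BLOCK_LEN, b" ")
header = parse_fits_header(raw)
print(header)
```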
Dataverse used by Max-Planck Institute ...
Dataverse in the astronomy news ...
More than 12,000 downloads!
What are we working on now?
Data Provenance
Track the original source of a dataset
Pasquier, Lau, Trisovic, Boose, Couturier, Crosas, Ellison, Gibson, Jones, Seltzer, 2017, If These Data Could Talk, Nature Scientific Data (Data Provenance examples from CERN and Harvard Forest)
Cloud Dataverse
Combine data repositories with cloud computing
Data Privacy
Classify and handle datasets based on their privacy level
Harvard Data Privacy Tools Project: privacytools.seas.harvard.edu
DataTags Project: datatags.org
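The idea of classifying a dataset and then handling it according to its privacy level can be sketched as follows. The three levels, the two classification questions, and the handling rules here are hypothetical placeholders chosen for the example; they are not the DataTags levels or any actual policy.

```python
from enum import IntEnum


class PrivacyLevel(IntEnum):
    # Hypothetical levels, ordered from least to most restrictive.
    PUBLIC = 0
    RESTRICTED = 1
    SENSITIVE = 2


# Hypothetical handling rules attached to each level.
HANDLING = {
    PrivacyLevel.PUBLIC: {"encrypt_at_rest": False, "access": "open download"},
    PrivacyLevel.RESTRICTED: {"encrypt_at_rest": True, "access": "registered users"},
    PrivacyLevel.SENSITIVE: {"encrypt_at_rest": True, "access": "approved requests only"},
}


def classify(contains_personal_data: bool, is_deidentified: bool) -> PrivacyLevel:
    """Assign a level from two yes/no questions (illustrative logic only)."""
    if not contains_personal_data:
        return PrivacyLevel.PUBLIC
    if is_deidentified:
        return PrivacyLevel.RESTRICTED
    return PrivacyLevel.SENSITIVE


level = classify(contains_personal_data=True, is_deidentified=False)
policy = HANDLING[level]
print(level.name, policy)
```

Using an ordered enum makes it easy to compare levels, for example to require that a storage system supports at least the level assigned to a dataset.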
Integration with Tools
Dataverse as part of the data lifecycle
Dataverse Community
49 software contributors
Bi-weekly Community Calls
235 attendees
26 organizations/universities
11 countries
Annual Community Meeting
Next: June 13, 14, 15, 2018
Thanks
@mercecrosas
scholar.harvard.edu/mercecrosas
dataverse.org