Data Sharing for Better Science and Better Health

Mercè Crosas, Institute for Quantitative Social Science, Harvard University
@mercecrosas

XXXVII Jornadas de Economía de la Salud, Barcelona, 6-8 September

The Dataverse Project

This Talk

  • Importance of Data Sharing
    • Reproducibility to verify science
    • Reuse to advance science and evidence-based policy
  • Enabling Data Sharing
    • Data Policies from journals and funding agencies
    • Data Citation to find datasets, give credit to data authors
    • Data Repositories as publishers of data

Data Science,
Big Data

"Every two years, the amount of digitized data is equal to all of the data ever collected before. The world’s knowledge is at our fingertips, and data science allows us to effectively and efficiently make use of that knowledge. This is facilitating a societal shift as big as the Industrial Revolution. "

 

 

Phil Bourne
Data Science Director, UVA
Former Associate Director for Data Science, NIH

UVAToday Q&A, August 21, 2017

Data Sharing,
Data Publishing

Data sharing is "the release of research data, associated metadata, accompanying documentation, and software code for re-use and analysis in such a manner that they can be discovered on the Web and referred to in a unique and persistent way."

Data Publishing Group, 2015

Nullius in verba

"Take nobody's word for it"

(motto of the Royal Society, founded in 1660, which launched the first scientific journal in 1665)

Since the Beginning of Modern Science ...

University of California Curation Center, DataPub blog, August 2017

Reproducibility and Replication
(by the National Science Foundation):

The ability of a researcher to duplicate the results of a prior study 

... using the same materials and procedures used by the original investigator. (reproducibility)

... if the same procedures are followed but new data are collected. (replication)

 

Empirical, Computational, and Statistical Reproducibility (Stodden, 2014):

Empirical: data and collection details are made freely available

Computational: code, software, hardware, and implementation details are provided

Statistical: details on the choice of statistical tests and model parameters are provided

Reproducibility

Reproducibility in
Cancer Studies

Only 6 (11%) out of 53 landmark studies that claim to treat cancer could be reproduced

Begley & Ellis, Nature, 2012

(from Scientists at Amgen biotechnology)

Study from Bayer: only 14 (21%) of 67 publications could be reproduced

21% of literature data are in line with in-house data

Prinz, Schlange, Asadullah, 2011, Nature Reviews Drug Discovery
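The percentages quoted for the Amgen and Bayer studies follow directly from the reported counts. A minimal sketch of that arithmetic in Python; the function name is illustrative and does not come from either paper.

```python
def reproduced_share(reproduced: int, total: int) -> float:
    """Return the percentage of studies that could be reproduced."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * reproduced / total


# Counts as reported on the slides.
amgen = reproduced_share(6, 53)    # Begley & Ellis, 2012
bayer = reproduced_share(14, 67)   # Prinz, Schlange, Asadullah, 2011

print(f"Amgen: {amgen:.1f}%")
print(f"Bayer: {bayer:.1f}%")
```

Rounded to whole numbers, the two values match the 11% and 21% shown above.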

Aims to release results by end of 2017

 

 

Independently replicating a subset of experimental results from 50 high-profile papers in the field of cancer biology published between 2010 and 2012

 

DATA SHARING TO ADVANCE SCIENCE AND POLICY MAKING

  • Outbreak Data
  • City Data

Sabeti Lab Shared Ebola Data During Outbreak

Lab released the first publicly available Ebola sequences (on GenBank), and clinical data (on Harvard Dataverse).

"We were amazed by the surge of collaboration that followed"

Yozwiak, Schaffner, Sabeti, 2015, "Make Outbreak Research Open Access", Nature

But We Don't Share Data Often Enough

Gaps in data sharing during the peak of the Ebola outbreak

Why doesn't data sharing happen more often during outbreaks?

One reason is concern about patient privacy

Image source: Andres Colubri, Sabeti Lab

Bicycle Accident Data Publicly Released: Collisions in Boston, 2009-2012

Data published at Harvard Dataverse

Bicycle data released by the Boston Area Research Initiative (BARI) was the centerpiece of the Boston Mayor's Bike Safety Report

How Can We Increase Data Sharing?

  • New Norms
  • New Incentives
  • New Technology

Castro, Crosas, Garnett, Sheridan, Altman, 2017, Journal of Scholarly Publishing

Journal Data Policies Applied Across Disciplines

Many Funders Require Data Sharing and Open Data

PRIVATE RESEARCH FUNDERS

  • Bill and Melinda Gates Foundation Information Sharing Approach
  • Sloan Foundation Data Sharing Policy
  • Wellcome Trust Data Sharing Policy
  • Arnold Foundation
  • Moore Foundation
  • Robert Wood Johnson Foundation
  • HHMI Policy on the Sharing of Publication-Related Materials, Data and Software

 

PUBLIC RESEARCH FUNDERS

  • Department of Agriculture
  • Department of Commerce
  • Department of Defense
  • Department of Education
  • Department of Energy
  • Department of Health and Human Services
    • Agency for Healthcare Research and Quality (AHRQ)
    • Assistant Secretary for Preparedness and Response (ASPR)
    • Centers for Disease Control and Prevention (CDC)
    • Food and Drug Administration (FDA)
    • National Institutes of Health (NIH)
  • Department of Homeland Security
  • Department of Housing and Urban Development
  • Department of the Interior
  • Department of Labor
  • Department of Transportation
  • Department of Veterans Affairs
  • Environmental Protection Agency (EPA)

"We believe that both as a matter of fairness and as a matter of providing an incentive for data sharing, the persons who initially gathered the data should receive appropriate and standardized credit that can be used for academic advancement, for grant applications, and in broader situations."

 

Our Institute Provides a Technology Solution to Data Sharing

Open-source software to share, cite, and find data.

Developed at Harvard's Institute for Quantitative Social Science

2006 (we started)

2017

dataverse.org
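As an illustration of the "find data" part, the sketch below only builds a search URL for a Dataverse installation; it does not contact any server. It assumes the installation exposes the Dataverse Search API at /api/search with the q, type, and per_page parameters, which should be checked against the API guide for the version in use. The function name build_search_url is hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, query: str, per_page: int = 10) -> str:
    """Build a dataset search URL for a Dataverse installation.

    Assumption: the installation serves the Search API at /api/search
    and accepts the q, type and per_page query parameters.
    """
    params = {"q": query, "type": "dataset", "per_page": per_page}
    return f"{base_url.rstrip('/')}/api/search?{urlencode(params)}"


# Example: search the Harvard Dataverse repository for Ebola datasets.
url = build_search_url("https://dataverse.harvard.edu", "ebola")
print(url)
```

The resulting URL can be opened in a browser or passed to any HTTP client; the response format is not shown here because it depends on the installation.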

How Researchers Share and Use Data with Dataverse

 

Harvard Dataverse Repository

 

> 70,000 datasets total
> 49,000 datasets uploaded to Harvard Dataverse repository

200 datasets/month

 

> 340,000 files

4,000 files/month

 

> 2.5 M downloads

60,000 downloads/month

 

[Charts: Datasets Added; Downloads]

dataverse.harvard.edu

King, 1995, Replication, Replication

Altman and King, 2007, A Proposed Standard for the Scholarly Citation of Quantitative Data

Altman et al, 2001, A Digital Library for the Dissemination and Replication of Quantitative Social Science

King, 2007, An Introduction to the Dataverse Network as an Infrastructure for Data Sharing

Crosas, Honaker, King, Sweeney, 2015, Automating Open Science for Big Data

Crosas, 2012, The Dataverse Network: an open source application for sharing, discovering, and preserving research data

Altman and Crosas, 2013, The Evolution to Data Citation: from principles to implementation

Crosas, 2013, A Data Sharing Story

2014, Joint Declaration of Data Citation Principles

Pepe et al, 2014, How Do Astronomers Share Data?

Goodman et al, 2014, Ten Simple Rules for the Care and Feeding of Scientific Data

Castro et al, 2015, Achieving Human and Machine Accessibility of Cited Data

Sweeney, Crosas, Bar-Sinai, 2015, Sharing Sensitive Data with Confidence: The DataTags System

Meyer et al, 2016, Data Publication with the Structural Biology Data Grid Supports Live Analysis

Wilkinson et al, 2016, The FAIR Guiding Principles for Scientific Data Management and Stewardship

Bierer, Crosas, Pierce, 2017, Data Authorship as an Incentive to Data Sharing

Our Contributions to Enhance Data Sharing

2017

Data should be ...

Findable
Accessible
Interoperable
Reusable

Wilkinson et al., 2016, "The FAIR Guiding Principles for Scientific Data Management and Stewardship", Nature Scientific Data

FAIR Data in Dataverse

Data Files

Metadata

Data Licenses, User Agreements

Dataset Versions

Data Citation with Persistent Identifier (DOI)
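To make the idea of a data citation with a persistent identifier concrete, here is a small sketch that formats a citation string from dataset metadata and turns a DOI into a resolvable URL. The field layout is an assumption made for illustration, not the exact format Dataverse generates, and the dataset, authors, and DOI below are placeholders rather than a real record.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Dataset:
    authors: List[str]
    year: int
    title: str
    doi: str          # persistent identifier, e.g. "10.xxxx/yyyy"
    repository: str
    version: str


def doi_url(doi: str) -> str:
    """Turn a bare DOI into a URL that resolves through doi.org."""
    return f"https://doi.org/{doi}"


def format_citation(ds: Dataset) -> str:
    """Format a citation string (illustrative layout, an assumption)."""
    authors = "; ".join(ds.authors)
    return (
        f'{authors}, {ds.year}, "{ds.title}", '
        f"{doi_url(ds.doi)}, {ds.repository}, {ds.version}"
    )


# Placeholder metadata: not a real dataset or DOI.
example = Dataset(
    authors=["Doe, Jane", "Roe, Richard"],
    year=2017,
    title="Example Survey Data",
    doi="10.5072/FK2/EXAMPLE",
    repository="Example Dataverse",
    version="V1",
)
print(format_citation(example))
```

Including the version alongside the DOI lets a reader identify which release of the dataset was used, which ties the citation to the dataset versions listed above.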

What Are We Working on Now?

Data Privacy

Classify and handle datasets based on their privacy level

Harvard Data Privacy Tools Project: privacytools.seas.harvard.edu

DataTags Project: datatags.org

Cloud Dataverse

Combine data repositories with cloud computing

Data Provenance

Track the original source of a dataset

Pasquier, Lau, Trisovic, Boose, Couturier, Crosas, Ellison, Gibson, Jones, Seltzer, 2017, If These Data Could Talk, Nature Scientific Data

INTEGRATION WITH TOOLS

Dataverse as part of the data lifecycle

 

Dataverse Community

 

 

49 software contributors

Bi-weekly Community Calls

 

235 attendees
26 organizations/universities
11 countries

Annual Community Meeting

Next: June 13, 14, 15, 2018


Thanks

@mercecrosas

scholar.harvard.edu/mercecrosas

dataverse.org
