Cloud Dataverse

Mercè Crosas, Institute for Quantitative Social Science, Harvard University

MOC Workshop, October 3, 2017, Boston University

Our Institute Provides a Technology Solution to Data Sharing

Institute for Quantitative Social Science, Harvard University


Open-source software to share, cite, and find data.

Developed at Harvard's Institute for Quantitative Social Science

with the contribution of an active and growing community.

2006 (we started)


26 Dataverse installations serving hundreds of institutions

How Researchers Share & Use Data with Dataverse

Harvard Dataverse Repository

A public repository for research data


> 70,000 datasets total
> 49,000 datasets uploaded to Harvard Dataverse repository

200 datasets/month


> 340,000 files

4,000 files/month


> 2.5 M downloads

60,000 downloads/month

Datasets Added


King, 1995, Replication, Replication

Altman and King, 2007, A Proposed Standard for the Scholarly Citation of Quantitative Data

Altman et al, 2001, A Digital Library for the Dissemination and Replication of Quantitative Social Science

King, 2007, An Introduction to the Dataverse Network as an Infrastructure for Data Sharing

Crosas, Honaker, King, Sweeney, 2015, Automating Open Science for Big Data

Crosas, 2012, The Dataverse Network: an open source application for sharing, discovering, and preserving research data

Altman and Crosas, 2013, The Evolution to Data Citation: from principles to implementation

Crosas, 2013, A Data Sharing Story

2014, Joint Declaration of Data Citation Principles

Pepe et al, 2014, How Do Astronomers Share Data?

Goodman et al, 2014, Ten Simple Rules for the Care and Feeding of Scientific Data

Castro et al, 2015, Achieving Human and Machine Accessibility of Cited Data

Sweeney, Crosas, Bar-Sinai, 2015, Sharing Sensitive Data with Confidence: The DataTags System

Meyer et al, 2016, Data Publication with the Structural Biology Data Grid Supports Live Analysis

Wilkinson et al, 2016, The FAIR Guiding Principles for Scientific Data Management and Stewardship

Bierer, Crosas, Pierce, 2017, Data Authorship as an Incentive to Data Sharing

Our Contributions to Enhance Data Sharing



Data should be ...

Wilkinson et al., 2016, "The FAIR Guiding Principles for Scientific Data Management and Stewardship", Nature Scientific Data

FAIR Data in Dataverse

Data Files


Data Licenses, User Agreements,


Data Citation with Persistent Identifier




Cloud Dataverse combines the power of cloud computing and storage with access to thousands of datasets from a feature-rich data repository platform

Why Cloud Dataverse?

  • Big Data should also be FAIR Data
  • Datasets are replicated to the Cloud for efficient access and reuse
  • Computing on a dataset is enabled directly from any repository

What we have built

  • Dataverse integration with Swift storage
  • Compute access to MOC from a dataset page in Dataverse
  • Temporary URL to access restricted files in MOC
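As an illustration of the temporary URL item above, here is a minimal sketch of how an OpenStack Swift TempURL can be signed. It is not the Cloud Dataverse implementation. The function name make_temp_url, the host, the object path, and the key are all hypothetical, and the sketch assumes the classic HMAC-SHA1 digest (some Swift deployments only accept SHA-256 or SHA-512).

```python
import hmac
import time
from hashlib import sha1


def make_temp_url(host, path, key, ttl_seconds, method="GET", now=None):
    """Return a time-limited Swift URL for one object (illustrative sketch).

    host, path and key are placeholders; `now` exists only so the
    expiry time can be fixed in tests.
    """
    issued_at = int(time.time()) if now is None else int(now)
    expires = issued_at + ttl_seconds
    # Swift verifies an HMAC over "<METHOD>\n<expires>\n<path>" computed
    # with the secret TempURL key set on the account or container.
    message = f"{method}\n{expires}\n{path}"
    signature = hmac.new(key.encode(), message.encode(), sha1).hexdigest()
    return f"{host}{path}?temp_url_sig={signature}&temp_url_expires={expires}"


if __name__ == "__main__":
    # Hypothetical restricted file, readable for five minutes.
    print(make_temp_url(
        "https://swift.example.org",
        "/v1/AUTH_demo/dataset-container/restricted.csv",
        "not-a-real-key",
        ttl_seconds=300,
    ))
```

Anyone holding the resulting URL can read the file until it expires, so a short lifetime is the usual choice for restricted data.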

In progress

  • Replicate data from any Dataverse to Cloud Dataverse
  • Upload data directly to Swift; publish dataset from Swift to Dataverse
  • Implement Swift Access Control Lists (ACLs) for file restriction
  • Support InCommon so that MOC uses the same credentials as Dataverse
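For the ACL item above, a minimal sketch of what a Swift container read ACL looks like may help. Swift stores read permissions in the X-Container-Read container header, and with Keystone authentication each grant takes the form project-id:user-id. The helper name read_acl_header and the project and user identifiers below are hypothetical. How Cloud Dataverse will map file restrictions onto containers is not described here.

```python
def read_acl_header(grants):
    """Build the X-Container-Read header for a list of grants.

    grants: iterable of (project_id, user_id) pairs; the values used
    in the example below are placeholders, not real identifiers.
    """
    entries = [f"{project_id}:{user_id}" for project_id, user_id in grants]
    # Swift expects a comma-separated list of ACL elements.
    return {"X-Container-Read": ",".join(entries)}


if __name__ == "__main__":
    # Allow two hypothetical users of one project to read the container.
    print(read_acl_header([("proj-123", "alice"), ("proj-123", "bob")]))
```

The resulting header would be sent in a POST to the container URL by a client that is allowed to change the container's metadata.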

Integration with Other Projects

Billion Object Platform

Big geodata exploration and analytics

Data Provenance

Track the original source of a dataset

Pasquier, Lau, Trisovic, Boose, Couturier, Crosas, Ellison, Gibson, Jones, Seltzer, 2017, If These Data Could Talk, Nature Scientific Data (Data Provenance examples from CERN and Harvard Forest)

Data Privacy

Classify and handle datasets based on their privacy level

Harvard Data Privacy Tools Project:

DataTags Project:





MOC Workshop - Cloud Dataverse

By Mercè Crosas
