Dataverse

Mercè Crosas, IQSS, Harvard University   @mercecrosas

Dataverse Is a Repository

for finding, citing, and publishing data

Dataverse Is a Platform

for building your own data repository

Dataverse Is a Community

which facilitates data access and data sharing around the world

Community
Features
Data

Projects 

A Growing, Engaged Community

dataverse.org

33 Dataverse repository sites around the world

Dataverse Community Growth

  • 2006: Dataverse development starts at Harvard's Institute for Quantitative Social Science (IQSS); 2 Dataverse sites
  • 2015: 14 Dataverse sites
  • 2016: 18 Dataverse sites
  • 2017: 23 Dataverse sites
  • 2018: 33 Dataverse sites

Milestones and totals shown on the timeline:

  • First Annual Dataverse Community Meeting
  • 4 developers
  • First release
  • 74 contributors
  • 30 releases
  • 12,807 commits

Global Dataverse Community Consortium

In 2018, a new international consortium is formed to support and coordinate efforts across Dataverse Repositories.

http://dataversecommunity.global (coming soon!)

Building an Active, Engaged Community with:

  • Transparency and Common Knowledge
  • Process and Tools
  • Human Touch

Transparency and Common Knowledge

  • High-level goals and roadmap on the dataverse.org site
  • Development status in Waffle
  • Issue discussions on GitHub
  • General discussions in Google Groups (mailing list)

Dataverse Waffle Board + dataverse.org

Process to Support Agile Development

Engage early with contributors on technical design and user testing:

  1. Pull Request 
  2. Code Review 
  3. QA 
  4. Release

Tools:

  • Waffle
  • GitHub
  • Google Groups
  • IRC
  • Slack

The Human Touch

  • Annual Community Meeting:
    • ~150 people
    • Organizations from ~15 countries
  • Quick replies on the mailing list (Google Groups) and IRC
  • Biweekly calls (last year):
    • 23 calls
    • 228 participants
    • 18 organizations

Dataverse World Cup!

A Rich Set of User-Friendly Features

Data Citation:

Credit as an Incentive to Share Data

  • A formal data citation automatically generated
  • Attribution to data creators and data providers 
  • Persistent identifier (e.g., DOI) resolves to dataset landing page
  • Version in citation
  • Universal Numerical Fingerprint (UNF):  a checksum independent of file format, for tabular data files
  • Compliant with the Joint Declaration of Data Citation Principles
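
The citation elements listed above can be sketched in code. This is a minimal illustration only, not Dataverse's actual citation generator: the `DatasetCitation` class, the `format_citation` function, the field order, and all sample values are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetCitation:
    """Hypothetical container for the citation elements named on the slide."""
    authors: List[str]          # data creators
    year: int
    title: str
    doi: str                    # persistent identifier, without resolver prefix
    publisher: str              # data provider (the hosting repository)
    version: str                # version shown in the citation, e.g. "V2"
    unf: Optional[str] = None   # fingerprint, present only for tabular data


def format_citation(c: DatasetCitation) -> str:
    """Join the elements into one citation string (assumed layout)."""
    parts = [
        "; ".join(c.authors),
        str(c.year),
        f'"{c.title}"',
        f"https://doi.org/{c.doi}",  # resolves to the dataset landing page
        c.publisher,
        c.version,
    ]
    if c.unf:
        parts.append(c.unf)
    return ", ".join(parts)


# Made-up sample values, for illustration only.
example = DatasetCitation(
    authors=["Doe, Jane", "Roe, Richard"],
    year=2018,
    title="Example Survey Data",
    doi="10.1234/EXAMPLE",
    publisher="Example Dataverse",
    version="V2",
)
print(format_citation(example))
```

Keeping the elements in a structured record, rather than a pre-built string, is what makes it possible to export the same citation in several reference-manager formats.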

Download a data citation ready for use in a reference manager

Metadata to Find and Reuse Data

At multiple levels:

  • Citation metadata
  • Custom metadata
  • File metadata
  • Variable-level metadata 

With multiple standards:

  • Data Documentation Initiative (DDI)
  • Dublin Core
  • Schema.org 

Download metadata in multiple formats

Schema.org Used by Google Dataset Search

  • Schema.org JSON-LD embedded in the HTML of the dataset landing page
  • Datasets become discoverable through Google Dataset Search
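
As a rough sketch of what embedding Schema.org JSON-LD in a landing page involves, the snippet below builds a `Dataset` description and wraps it in a `<script type="application/ld+json">` tag. The function names and sample values are hypothetical, and the set of properties Dataverse actually exports may differ; only standard Schema.org property names are used here.

```python
import json


def dataset_jsonld(name, description, doi_url, authors, version):
    """Build a Schema.org Dataset description as a plain dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": doi_url,
        "version": version,
        "creator": [{"@type": "Person", "name": a} for a in authors],
    }


def script_tag(doc):
    """Serialize the dict into a JSON-LD script tag for the page's HTML."""
    payload = json.dumps(doc, indent=2)
    # Escape "</" so a value can never close the script element early;
    # "\/" is a valid JSON escape, so parsers read back the same text.
    payload = payload.replace("</", "<\\/")
    return '<script type="application/ld+json">' + payload + "</script>"


# Made-up sample values, for illustration only.
doc = dataset_jsonld(
    name="Example Survey Data",
    description="A hypothetical dataset used to illustrate the markup.",
    doi_url="https://doi.org/10.1234/EXAMPLE",
    authors=["Jane Doe"],
    version="2",
)
print(script_tag(doc))
```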

Metadata from schema.org in Dataverse dataset landing page

Versioning of Datasets and Files

  • Major and minor versions
  • Major versions show in the data citation
  • Tracks both metadata changes and file changes
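
The major/minor scheme can be illustrated with a small sketch. The `Version` class below is hypothetical, and the decision about which changes count as major is left to the caller as a flag, since the slide does not state that policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """Hypothetical major.minor dataset version."""
    major: int
    minor: int = 0

    def bump(self, major_change: bool) -> "Version":
        # A major change starts a new major version and resets the minor
        # counter; any other change increments the minor counter only.
        if major_change:
            return Version(self.major + 1, 0)
        return Version(self.major, self.minor + 1)

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}"

    def citation_label(self) -> str:
        # Only the major version appears in the data citation.
        return f"V{self.major}"


v1 = Version(1)
v1_1 = v1.bump(major_change=False)   # e.g. a small metadata correction
v2 = v1_1.bump(major_change=True)    # e.g. a change the depositor deems major
print(v1_1, v1_1.citation_label(), v2, v2.citation_label())
```

Because minor versions leave the citation label unchanged, existing citations keep pointing at the same major version after small corrections.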

 

Tiered Access to Data

  • Default access is public, with CC0 waiver
  • Allow public and restricted files
  • Descriptive metadata always public for discoverability
  • Custom Terms of Use, when needed 
  • Optional Guestbook to collect information from users

Public

Restricted

Tabular data exploration

  • Variable metadata automatically extracted
  • Descriptive statistics automatically computed

Metadata Extraction from Astronomy Files

Metadata (instrument information) is extracted automatically from the FITS file header upon data upload

Metadata from FITS Header

Manage and Customize Your Own Dataverse

  • Create a dataverse to manage your own collection of datasets
  • Brand your dataverse or embed it in your website

Extensive API to Enable Tool Integration

http://guides.dataverse.org
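
As a small example of tool integration, the sketch below builds a request against the Search API. It assumes the `/api/search` endpoint and its `q`, `type`, and `per_page` parameters as described in the guides; check the guides for the version your installation runs. The helper names `search_url` and `fetch_json` are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def search_url(base_url, query, item_type="dataset", per_page=10):
    """Build a Search API URL for a Dataverse installation."""
    params = urlencode({"q": query, "type": item_type, "per_page": per_page})
    return f"{base_url.rstrip('/')}/api/search?{params}"


def fetch_json(url, timeout=30):
    """Fetch a URL and decode the JSON response (requires network access)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


url = search_url("https://dataverse.harvard.edu", "census data")
print(url)
# To run the query against a live installation:
# results = fetch_json(url)
```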

A Wide Variety of Data and Dataverses

  • Dataverse for Journals
  • Dataverse for Researchers
  • Dataverse for Research Communities
  • Dataverse for one or multiple Institutions

Data Policies in Social Science Journals

Crosas, Gautier, Karcher, Kirilova, Otalora, Schwartz, 2018. Data Policies of highly-ranked social science journals

More than 50% of the top 50 journals in anthropology, economics, psychology, and political science have data policies that either encourage or require authors to share the data associated with the article.

Dataverse for a Journal

Hosted at the Harvard Dataverse repository (80 journal dataverses)

Dataverse for a Researcher

Hosted at the Harvard Dataverse repository

Dataverse for a Research Community: Structural Biology

Hosted by the SBGrid Consortium, Harvard Medical School

Dataverse for Multiple Universities

Hosted by Texas Digital Library, a consortium of Texas higher-education institutions

Harvard Dataverse:
Open to All of the Research Community

Hosted by Harvard University, in collaboration with Harvard Library, HUIT, and IQSS

http://dataverse.harvard.edu

  • Datasets deposited at Harvard Dataverse: 29,256
  • Average released datasets per month (in 2018): 247
  • Total downloads: 4M
  • Average downloads per month (in 2018): 150,000

Ongoing Projects

  • Large data
  • Sensitive data
  • Data quality, reproducibility, reusability
  • Open Source 'Health' Index

Large Data

More ways to upload data:

  • rsync

More ways to access data:

  • Local access
  • Compute in the cloud
  • Compute in institutional research computing portals
  • Integration w/ Globus?

More storage options:

  • Remote secure storage; data enclaves

Funded by the Helmsley Charitable Trust, with a focus on biomedical data, in collaboration with Piotrek Sliz

Sensitive Data: DataTags

Funded by the National Science Foundation, in collaboration with Latanya Sweeney

Standardize data security and access levels

Sensitive Data: Privacy Preserving Tools

Funded by the National Science Foundation, in collaboration with the Harvard Privacy Tools Project

Integration with Reproducibility Tools:
Code Ocean

Funded by Sloan Foundation, in collaboration with Code Ocean

Integration with Reproducibility Tools:
Encapsulator

Funded by Sloan Foundation, in collaboration with Margo Seltzer

Integration with Reproducibility Tools:
CORE2

Funded by Sloan Foundation, in collaboration with the Odum Institute at UNC Chapel Hill

Open Source 'Health' Index

  • A quantitative study to determine a health index for open source projects
  • Leverage previous work (e.g., the LYRASIS project and Qualification and Selection of Open Source Software (QSOS))

Funded by IMLS

Thank You!

 

dataverse.org

dataverse.harvard.edu

The Dataverse Team @IQSS
https://groups.google.com/forum/#!forum/dataverse-community

 

scholar.harvard.edu/mercecrosas

@mercecrosas
