Dataverse
Mercè Crosas, IQSS, Harvard University @mercecrosas
Dataverse is a Repository
for finding, citing, and publishing data
Dataverse is a Platform
for building your own data repository
Dataverse is a Community
which facilitates data access and data sharing around the world
Community
Features
Data
Projects
A Growing, Engaged Community
Dataverse.org
33 Dataverse repository sites around the world
Dataverse Community Growth
- 2006: Dataverse development starts at Harvard's Institute for Quantitative Social Science (IQSS)
- 2 Dataverse sites
- 2015: 14 Dataverse sites
- 2016: 18 Dataverse sites
- 2017: 23 Dataverse sites
- 2018: 33 Dataverse sites
Milestones and totals:
- First Annual Dataverse Community Meeting
- 4 developers
- First release
- 74 contributors
- 30 releases
- 12,807 commits
Global Dataverse Community Consortium
In 2018, a new international consortium is formed to support and coordinate efforts across Dataverse Repositories.
http://dataversecommunity.global (coming soon!)
Building an Active, Engaged Community with:
- Transparency and Common Knowledge
- Process and Tools
- Human Touch
Transparency and Common Knowledge
- High-level goals and roadmap on the dataverse.org site
- Development status in Waffle
- Issue discussions on GitHub
- General discussions in Google Groups (mailing list)
Dataverse Waffle Board + Dataverse.org
Process to Support Agile Development
Engage early with contributors on technical design and user testing:
- Pull Request
- Code Review
- QA
- Release
Tools:
- Waffle
- GitHub
- Google Groups
- IRC
- Slack
The Human Touch
- Annual Community Meeting:
  - ~150 people
  - Organizations from ~15 countries
- Quick replies on the mailing list (Google Groups) and IRC
- Biweekly Call (last year):
  - 23 calls
  - 228 participants
  - 18 organizations
Dataverse World Cup!
A Rich Set of User-Friendly Features
Data Citation: Credit as an Incentive to Share Data
- A formal data citation automatically generated
- Attribution to data creators and data providers
- Persistent identifier (e.g., DOI) resolves to dataset landing page
- Version in citation
- Universal Numerical Fingerprint (UNF): a checksum independent of file format, for tabular data files
- Compliant with the Joint Declaration of Data Citation Principles
Download a data citation ready to use in a reference manager
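To illustrate what "a checksum independent of file format" can mean, here is a minimal sketch in Python. It is not the UNF algorithm itself, which has its own specification and differs in detail. The sketch only shows the underlying idea: hash normalized data values instead of file bytes, so the same table stored as CSV or as TSV yields the same fingerprint. The function names and the choice of SHA-256 and seven decimal digits are assumptions made for this example.

```python
import csv
import hashlib
import io


def normalize(value: str) -> str:
    """Map a cell to a canonical text form (illustrative rule, not UNF's)."""
    try:
        # Numbers are rendered in one fixed exponential notation,
        # so "3.5", "3.50" and "3.500" all normalize identically.
        return format(float(value), ".7e")
    except ValueError:
        return value.strip()


def fingerprint(rows) -> str:
    """Hash normalized cell values rather than the raw file bytes."""
    digest = hashlib.sha256()
    for row in rows:
        for cell in row:
            digest.update(normalize(cell).encode("utf-8"))
            digest.update(b"\x00")  # cell separator
        digest.update(b"\n")  # row separator
    return digest.hexdigest()


def read_delimited(text: str, delimiter: str):
    """Parse delimited text into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# The same made-up table in two file formats, with different number formatting.
csv_text = "id,score\n1,3.50\n2,4\n"
tsv_text = "id\tscore\n1\t3.5\n2\t4.0\n"

same = fingerprint(read_delimited(csv_text, ",")) == fingerprint(
    read_delimited(tsv_text, "\t")
)
print(same)
```

Because the hash is computed over the values, a citation that carries such a fingerprint can still be verified after the file is converted to another format.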
Metadata to Find and Reuse Data
At multiple levels:
- Citation metadata
- Custom metadata
- File metadata
- Variable-level metadata
With multiple standards:
- Data Documentation Initiative (DDI)
- Dublin Core
- Schema.org
Download metadata in multiple formats
Schema.org Used by Google Dataset Search
- Schema.org JSON-LD embedded in the HTML of the dataset landing page
- Datasets become discoverable through Google Dataset Search
Schema.org metadata in a Dataverse dataset landing page
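As a rough illustration of what embedding Schema.org JSON-LD in a landing page involves, the following Python sketch builds a small `Dataset` description and wraps it in a script tag. The field selection, the helper names, and the sample values (including the DOI, which uses a test prefix) are assumptions for this example and do not reproduce the exact metadata Dataverse emits.

```python
import json


def dataset_jsonld(name, description, doi, authors, version):
    """Build a minimal Schema.org Dataset description (illustrative subset)."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": f"https://doi.org/{doi}",
        "version": version,
        "creator": [{"@type": "Person", "name": author} for author in authors],
    }


def script_tag(metadata):
    """Wrap the JSON-LD in the script element that crawlers look for."""
    # Escape "</" so the payload cannot close the script element early.
    payload = json.dumps(metadata, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Hypothetical dataset used only for demonstration.
example = dataset_jsonld(
    name="Example Survey Data",
    description="A made-up dataset used to demonstrate JSON-LD embedding.",
    doi="10.5072/FK2/EXAMPLE",
    authors=["Doe, Jane"],
    version="1",
)
print(script_tag(example))
```

A crawler that understands Schema.org can read this block from the page source without parsing the visible HTML.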
Versioning of Datasets and Files
- Major and minor versions
- Major versions show in the data citation
- Track both metadata changes and file changes
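The major/minor scheme can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the caller decides whether a change counts as major, and the `V<major>` label is a hypothetical rendering of the version that appears in a citation.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int


def publish(current: Version, major_change: bool) -> Version:
    """Return the next version; what counts as 'major' is up to the caller."""
    if major_change:
        return Version(current.major + 1, 0)
    return Version(current.major, current.minor + 1)


def citation_version(version: Version) -> str:
    """Only the major version is shown in the citation (hypothetical label)."""
    return f"V{version.major}"


v = Version(1, 0)
v = publish(v, major_change=False)  # small metadata fix
v = publish(v, major_change=True)   # files replaced
print(v, citation_version(v))
```

Keeping minor versions out of the citation means small corrections do not invalidate existing references to the dataset.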
Tiered Access to Data
- Default access is public, with CC0 waiver
- Allow public and restricted files
- Descriptive metadata always public for discoverability
- Custom Terms of Use, when needed
- Optional Guestbook to collect information from users
Public
Restricted
Tabular Data Exploration
- Variable metadata automatically extracted
- Descriptive statistics automatically computed
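A minimal sketch of this kind of processing, using only the Python standard library: it reads a delimited file, infers whether each variable is numeric, and computes a few summary statistics. The function name, the type labels, and the choice of statistics are assumptions for illustration; they are not the set Dataverse computes.

```python
import csv
import io
import statistics


def summarize(text: str) -> dict:
    """Infer a type for each column and compute simple summary statistics."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    if not rows:
        return {}
    summary = {}
    for name in reader.fieldnames:
        values = [row[name] for row in rows]
        try:
            numbers = [float(value) for value in values]
        except ValueError:
            # At least one value is not a number: treat the column as text.
            summary[name] = {"type": "character", "unique": len(set(values))}
            continue
        summary[name] = {
            "type": "numeric",
            "min": min(numbers),
            "max": max(numbers),
            "mean": statistics.mean(numbers),
            "median": statistics.median(numbers),
        }
    return summary


# Made-up sample data.
sample = "age,country\n31,ES\n45,US\n38,US\n"
print(summarize(sample))
```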
Metadata Extraction from Astronomy Files
Metadata (instrument information) is extracted automatically from the FITS file header when data is uploaded
Metadata from FITS Header
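A FITS header is a sequence of 80-character ASCII cards, each holding a keyword, an optional value, and an optional comment. The sketch below shows how instrument metadata could be pulled out of such a header in plain Python. It is a simplified reader written for this example, not the extraction code Dataverse uses: it keeps every value as a string, ignores continuation cards, and does not handle doubled quotes inside strings. The sample values are made up. In practice a dedicated FITS library would be the more robust choice.

```python
CARD_LENGTH = 80


def parse_fits_header(raw: bytes) -> dict:
    """Extract keyword/value pairs from a FITS header block (simplified)."""
    header = {}
    for start in range(0, len(raw), CARD_LENGTH):
        card = raw[start:start + CARD_LENGTH].decode("ascii")
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] != "= ":
            continue  # COMMENT, HISTORY, or blank card: no value to record
        field = card[10:].lstrip()
        if field.startswith("'"):
            # Quoted string: take the text up to the closing quote.
            value = field[1:].split("'", 1)[0].rstrip()
        else:
            # Other values end at the comment separator, if present.
            value = field.split("/", 1)[0].strip()
        header[keyword] = value
    return header


# Made-up header cards, each padded to 80 characters.
cards = [
    "SIMPLE  =                    T / conforms to FITS standard",
    "TELESCOP= 'DEMO-TEL'           / telescope name (made up)",
    "INSTRUME= 'DEMO-CAM'",
    "EXPTIME =                120.0 / exposure time in seconds",
    "COMMENT made-up example header",
    "END",
]
raw_header = "".join(card.ljust(CARD_LENGTH) for card in cards).encode("ascii")
print(parse_fits_header(raw_header))
```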
Manage and Customize Your Own Dataverse
- Create a dataverse to manage your own collection of datasets
- Brand your dataverse or embed it in your website
Extensive API to Enable Tool Integration
http://guides.dataverse.org
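As a hedged sketch of what tool integration might look like, the Python code below builds a request against the search endpoint and reads the matching items. The `/api/search` path, the `q`, `type`, and `per_page` parameters, and the `data.items` response shape reflect my understanding of the Search API and should be checked against the guides before use. The helper names are hypothetical. Only the URL builder runs here; `search` performs a network call and is defined but not invoked.

```python
import json
import urllib.parse
import urllib.request


def build_search_url(base_url: str, query: str,
                     item_type: str = "dataset", per_page: int = 10) -> str:
    """Build a search URL (endpoint and parameter names are assumptions)."""
    params = urllib.parse.urlencode(
        {"q": query, "type": item_type, "per_page": per_page}
    )
    return f"{base_url.rstrip('/')}/api/search?{params}"


def search(base_url: str, query: str) -> list:
    """Fetch matching items; assumes a JSON body with data.items."""
    with urllib.request.urlopen(build_search_url(base_url, query)) as response:
        body = json.load(response)
    return body.get("data", {}).get("items", [])


print(build_search_url("https://dataverse.harvard.edu", "climate data"))
```

Calling `search("https://dataverse.harvard.edu", "climate data")` would then return a list of result records, assuming the endpoint behaves as described above.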
A Wide Variety of Data and Dataverses
- Dataverse for Journals
- Dataverse for Researchers
- Dataverse for Research Communities
- Dataverse for One or Multiple Institutions
Data Policies in Social Science Journals
Crosas, Gautier, Karcher, Kirilova, Otalora, Schwartz, 2018. Data policies of highly-ranked social science journals
More than 50% of the top 50 journals in anthropology, economics, psychology, and political science have data policies that either encourage or require sharing the data associated with an article.
Dataverse for a Journal
Hosted at Harvard Dataverse repository (80 journal dataverses)
Dataverse for a Researcher
Hosted at Harvard Dataverse repository
Dataverse for a Research Community: Structural Biology
Hosted by SBGrid Consortium, Harvard Medical School
Dataverse for Multiple Universities
Hosted by Texas Digital Library, a consortium of Texas higher-education institutions
Harvard Dataverse: Open to the Entire Research Community
Hosted by Harvard University, in collaboration with Harvard Library, HUIT, and IQSS
http://dataverse.harvard.edu
Number of datasets deposited at Harvard Dataverse: 29,256
Average released datasets per month (in 2018): 247
Total number of downloads: 4M
Average downloads per month (in 2018): 150,000
Ongoing Projects
- Large data
- Sensitive data
- Data Quality, Reproducibility, Reusability
- Open Source 'Health' Index
Large Data
More ways to upload data:
- rsync
More ways to access data:
- Local access
- Compute in the cloud
- Compute in institutional research computing portals
- Integration with Globus?
More storage options:
- Remote secure storage; data enclaves
Funded by the Helmsley Charitable Trust, with a focus on biomedical data, in collaboration with Piotrek Sliz
Sensitive Data: DataTags
Funded by the National Science Foundation, in collaboration with Latanya Sweeney
Standardize data security and access levels
Sensitive Data: Privacy-Preserving Tools
Funded by the National Science Foundation, in collaboration with the Harvard Privacy Tools Project
Integration with Reproducibility Tools: Code Ocean
Funded by the Sloan Foundation, in collaboration with Code Ocean
Integration with Reproducibility Tools: Encapsulator
Funded by the Sloan Foundation, in collaboration with Margo Seltzer
Integration with Reproducibility Tools: CORE2
Funded by the Sloan Foundation, in collaboration with the Odum Institute at UNC Chapel Hill
Open Source 'Health' Index
- A quantitative study to determine a health index for open source projects
- Leverage previous work (e.g., the LYRASIS project and Qualification and Selection of Open Source Software (QSOS))
Funded by IMLS
Thank You!
dataverse.org
dataverse.harvard.edu
The Dataverse Team @IQSS
https://groups.google.com/forum/#!forum/dataverse-community
scholar.harvard.edu/mercecrosas
@mercecrosas