Research Data Management @Harvard,
Data Sharing,
and the Dataverse Project
Mercè Crosas, IQSS, Harvard University @mercecrosas
Research Data Are an Asset to the University, and Need to Be Handled with Care
Library
Information Sciences
Handling Data with Care,
A Collaborative Effort
Information Technology
Research
Vice-Provost for Research
Data Governance
Authorship, Data Citation
Data Science
Security
Storage
Research Computing
Repositories
Sponsored Research
Data Agreements
Software
Tools
Privacy
Research data management concerns the organization of data, from its entry to the research cycle through the dissemination and archiving of valuable results. It aims to ensure reliable verification of results, and permits new and innovative research built on existing information.
Whyte, A., Tedds, J. (2011). ‘Making the Case for Research Data Management’. DCC Briefing Papers. Edinburgh: Digital Curation Centre
Research Data Management: Aims
Must accommodate differences across research domains, data types, and methodologies:
- natural, physical, social sciences, humanities, health, biomedicine
- qualitative vs quantitative data and methods
- small, medium (MB to GB) vs large (TB to PB)
- structured vs unstructured, static vs streaming
Storage and Analysis
Data Sharing and Archiving
Planning
Research Data Management: Scope
(site under development)
Planning Your Data Management:
Key Points
- A Data Management Plan (DMP) helps you think through how to organize, store, and share the data from your research project.
- A DMP is required by most research funding organizations.
- A DMP is a living document.
- The DMPTool assists you with requirements and templates.
In 2018, International DMPTool Launches
https://dmptool.org/
First, rigorously collected, well-preserved data sets — including meaningful descriptors or metadata — will help the data owners to reach solid, meaningful results. Second, they will help future investigators to make sense of and reuse data, thereby enhancing utility and reproducibility. Preserving comprehensive data, ideally for many years, also reduces the risk of duplicating science done by others.
Data Collection and Acquisition:
Key Points
- When data are collected by the researcher, consider using data collection tools (e.g., Electronic Lab Notebooks).
- When data are acquired from a third party, consider defining a Data Use Agreement.
Data Use Agreement, as Defined at Harvard
Data Use Agreements govern access to and treatment of data:
- provided by an outside organization to your organization for use in your organization’s research, or
- provided by your organization to an outside organization for use in its research.
Storage and Computation:
Key Points
- Provide secure storage for sensitive datasets.
- A hybrid model for research computing is increasingly popular: self-owned + public/private cloud options.
- Consider offering consulting and training for research computing and data science (statistical models, machine learning, computational tools).
Data Sharing:
Key Points
- Make open research data the default, "as open as possible, as closed as needed" (Horizon 2020).
- Comply with funding organization and journal requirements.
- Reward researchers when sharing data by giving them credit.
- Use trusted data repositories aligned with FAIR data principles.
https://www.nature.com/articles/sdata201618
FAIR Principles:
4 principles, 15 sub-principles
FAIR Principles For Humans and Machines:
In Brief
Findable
- Globally unique, resolvable, and persistent identifier
- Machine-readable descriptions to support structured search
Accessible
- Metadata is accessible beyond the lifetime of the dataset
- Clearly defined access and security protocols (FAIR != Open)
Interoperable
- Extensible machine-interpretable formats for data + metadata
- Use FAIR vocabularies and link to other resources
Reusable
- Provide licensing and provenance, and use community standards
FAIR slides acknowledgement: Michel Dumontier
Technology Can Help
A Solution for Data Sharing and Archiving
Aligned with FAIR data principles
Dataverse Is a Repository
for finding, citing, and publishing data
Dataverse Is a Platform
for building your own data repository
Dataverse Is a Community
which facilitates data access and data sharing around the world
Community
Features
Data
Projects
Dataverse.org
35 Dataverse repository sites around the world
Dataverse Open-Source Community Growth
- 2006: Dataverse development starts at Harvard's Institute for Quantitative Social Science (IQSS)
- 2 Dataverse sites
- 2015: 14 Dataverse sites
- 2016: 18 Dataverse sites
- 2017: 23 Dataverse sites
- 2018: 35 Dataverse sites
First Annual Dataverse Community Meeting
4 developers
First release
74 development contributors
30 releases
12,807 commits
Global Dataverse Community Consortium
In 2018, a new international consortium is formed to support and coordinate efforts across Dataverse Repositories.
http://dataversecommunity.global (coming soon!)
Data Citation:
Credit as an Incentive to Share Data
- A formal data citation automatically generated
- Attribution to data creators and data providers
- Persistent identifier (e.g., DOI) resolves to dataset landing page
- Version in citation
- Universal Numerical Fingerprint (UNF): a checksum independent of file format, for tabular data files
- Compliant with the Joint Declaration of Data Citation Principles
Download data citation ready to be used in reference manager
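The citation elements listed above can be illustrated with a minimal sketch. It assumes a hypothetical `DatasetRecord` type and a simple comma-separated layout; the field names, the placeholder DOI, and the ordering are illustrative assumptions, not the exact format Dataverse generates.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical record holding the fields a data citation needs."""
    authors: List[str]
    year: int
    title: str
    doi: str                    # placeholder value below, not a real DOI
    publisher: str              # the repository acting as data provider
    major_version: int
    unf: Optional[str] = None   # fingerprint string, tabular data only


def format_citation(record: DatasetRecord) -> str:
    """Assemble a citation string from the record's fields."""
    parts = [
        "; ".join(record.authors),
        str(record.year),
        f'"{record.title}"',
        f"https://doi.org/{record.doi}",  # persistent identifier as a resolvable URL
        record.publisher,
        f"V{record.major_version}",       # only the major version is cited
    ]
    if record.unf:
        parts.append(record.unf)
    return ", ".join(parts)


example = DatasetRecord(
    authors=["Doe, Jane", "Roe, Richard"],
    year=2018,
    title="Example Survey Data",
    doi="10.1234/EXAMPLE",
    publisher="Example Dataverse",
    major_version=2,
)
print(format_citation(example))
```

The point of the sketch is that every element in the bullets (creators, provider, identifier, version, optional fingerprint) maps to one field, so the citation can be produced without manual editing.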
Metadata to Find and Reuse Data
At multiple levels:
- Citation metadata
- Custom metadata
- File metadata
- Variable-level metadata
With multiple standards:
- Data Documentation Initiative (DDI)
- Dublin Core
- Schema.org
Download metadata in multiple formats
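As a rough sketch of what downloading metadata in multiple formats can look like programmatically, the helper below builds a request URL for a metadata export. The `/api/datasets/export` path and the exporter names (`ddi`, `dcterms`, `schema.org`) follow the Dataverse API guide as best understood here; the server address and DOI are placeholders, and the details should be checked against the guide for the installed version.

```python
from urllib.parse import urlencode


def export_url(base_url: str, persistent_id: str, exporter: str) -> str:
    """Build the URL for exporting one dataset's metadata in a given format.

    base_url and persistent_id are caller-supplied; the values used below
    are placeholders, not a real installation or dataset.
    """
    query = urlencode({"exporter": exporter, "persistentId": persistent_id})
    return f"{base_url.rstrip('/')}/api/datasets/export?{query}"


# One URL per metadata standard named on the slide (exporter names assumed).
for fmt in ("ddi", "dcterms", "schema.org"):
    print(export_url("https://demo.example.org", "doi:10.1234/EXAMPLE", fmt))
```

Only URL construction is shown; fetching the result is an ordinary HTTP GET.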
Schema.org Used by Google Dataset Search
- Schema.org JSON-LD embedded in HTML of dataset landing page
- Datasets become discoverable through Google Dataset Search
Metadata from schema.org in Dataverse dataset landing page
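To make the embedding concrete, here is a minimal sketch of a Schema.org `Dataset` description serialized as JSON-LD and wrapped in a script tag. The property selection and all values are illustrative assumptions; the JSON-LD that a Dataverse installation actually emits contains more fields.

```python
import json


def dataset_jsonld(name, description, doi, creators, version):
    """Return a Schema.org Dataset description as a plain dict."""
    url = f"https://doi.org/{doi}"
    return {
        "@context": "http://schema.org",
        "@type": "Dataset",
        "@id": url,
        "identifier": url,
        "name": name,
        "description": description,
        "version": version,
        "creator": [{"@type": "Person", "name": c} for c in creators],
    }


def as_script_tag(doc):
    """Wrap the JSON-LD in the script element a landing page would embed."""
    # Escape "</" so a value can never close the script element early.
    body = json.dumps(doc, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


doc = dataset_jsonld(
    name="Example Survey Data",          # hypothetical dataset
    description="Hypothetical dataset used for illustration.",
    doi="10.1234/EXAMPLE",               # placeholder DOI
    creators=["Doe, Jane"],
    version="2",
)
print(as_script_tag(doc))
```

A crawler that understands Schema.org can read this block from the page source without parsing the visible HTML.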
Versioning of Datasets and Files
- Major and minor versions
- Major versions show in the data citation
- Track both metadata changes and files changes
Tiered Access to Data
- Default access is public, with CC0 waiver
- Allow public and restricted files
- Descriptive metadata always public for discoverability
- Custom Terms of Use, when needed
- Optional Guestbook to collect information from users
Public
Restricted
Tabular data exploration
- Variable metadata automatically extracted
- Descriptive statistics automatically computed
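The idea of extracting variable metadata and computing descriptive statistics from a tabular file can be sketched with the standard library alone. This is an illustration of the concept under simplifying assumptions (CSV input, a column is numeric only if every non-empty value parses as a number); it is not the ingest procedure Dataverse uses.

```python
import csv
import io
import statistics


def summarize(csv_text: str) -> dict:
    """Return per-variable type and summary statistics for CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    summary = {}
    for name in reader.fieldnames or []:
        values = [row[name] for row in rows if row[name] not in (None, "")]
        if not values:
            summary[name] = {"type": "empty", "count": 0}
            continue
        try:
            numbers = [float(v) for v in values]
        except ValueError:
            # At least one value is not numeric: treat as a character variable.
            summary[name] = {
                "type": "character",
                "count": len(values),
                "unique": len(set(values)),
            }
            continue
        summary[name] = {
            "type": "numeric",
            "count": len(numbers),
            "min": min(numbers),
            "max": max(numbers),
            "mean": statistics.mean(numbers),
        }
    return summary


sample_csv = "age,city\n30,Lisbon\n40,Porto\n50,Lisbon\n"  # made-up data
for variable, stats in summarize(sample_csv).items():
    print(variable, stats)
```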
Metadata Extraction from Astronomy files
Metadata (instrument information) is extracted automatically from the header of FITS files upon data upload
Metadata from FITS Header
Manage and Customize Your Own Dataverse
- Create a dataverse to manage your own collection of datasets
- Brand your dataverse or embed in your website
Extensive API to Enable Tool Integration
http://guides.dataverse.org
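As one example of tool integration, the sketch below builds a query against the Search API and pulls dataset names and identifiers out of a response. The `/api/search` path, the `q`, `type`, and `per_page` parameters, and the `data.items` response layout follow the API guide as best understood here and should be verified there; the server address and the sample response are placeholders, and `fetch_json` is shown but not called.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def search_url(base_url: str, query: str, per_page: int = 5) -> str:
    """Build a dataset search URL for a Dataverse installation."""
    params = urlencode({"q": query, "type": "dataset", "per_page": per_page})
    return f"{base_url.rstrip('/')}/api/search?{params}"


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """Perform the GET request and decode the JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def dataset_names(response: dict) -> list:
    """Extract (name, global_id) pairs from a search response."""
    items = response.get("data", {}).get("items", [])
    return [(item.get("name"), item.get("global_id")) for item in items]


# Made-up response in the assumed shape, so the sketch runs offline.
sample_response = {
    "status": "OK",
    "data": {
        "total_count": 1,
        "items": [
            {"name": "Example Survey Data", "type": "dataset",
             "global_id": "doi:10.1234/EXAMPLE"},
        ],
    },
}
print(search_url("https://demo.example.org", "climate change"))
print(dataset_names(sample_response))
```

Against a live installation, `dataset_names(fetch_json(search_url(...)))` would chain the three steps.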
A Wide Variety of Data and Dataverses
- Dataverse for Journals
- Dataverse for Researchers
- Dataverse for Research Communities
- Dataverse for one or multiple Institutions
Data Policies in Social Science Journals
Crosas, Gautier, Karcher, Kirilova, Otalora, Schwartz, 2018. Data Policies of highly-ranked social science journals
More than 50% of the top 50 journals in anthropology, economics, psychology, and political science have data policies that either encourage or require sharing the data associated with the article.
Dataverse for a Journal
Hosted at Harvard Dataverse repository (80 journal dataverses)
Dataverse for a Researcher
Hosted at Harvard Dataverse repository
Dataverse for a Research Community: Structural Biology
Hosted by SBGrid Consortium, Harvard Medical School
Dataverse for Multiple Universities
Hosted by Texas Digital Library, a consortium of Texas higher-education institutions
DataverseNL supports research data for 13 institutions in the Netherlands
Harvard Dataverse:
Open to All the Research Community
Hosted by Harvard University, in collaboration with Harvard Library, HUIT, and IQSS
http://dataverse.harvard.edu
# Datasets deposited at Harvard Dataverse: 30,000
Average # released datasets per month (in 2018): 250
# of total downloads: 5M
Average # downloads per month (in 2018): 150,000
Ongoing Projects
- Large data
- Sensitive data
- Data Quality, Reproducibility, Reusability
- Open Source 'Health' Index
Large Data
More ways to upload data:
- rsync
More ways to access data:
- Local access
- Compute in the cloud
- Compute in institutional research computing
More storage options:
- Remote secure storage; data enclaves
Funded partially by Helmsley Charitable Trust, with focus on biomedical data, in collaboration with Piotrek Sliz
Sensitive Data: DataTags
Funded partially by the National Science Foundation, in collaboration with Latanya Sweeney
Standardize data security and access levels
Integration with Reproducibility Tools:
Code Ocean
Funded by Sloan Foundation, in collaboration with Code Ocean
Integration with Reproducibility Tools:
Encapsulator
Funded by Sloan Foundation, in collaboration with Margo Seltzer
Integration with Reproducibility Tools:
CORE2
Funded by Sloan Foundation, in collaboration with the Odum Institute at UNC Chapel Hill
Open Source 'Health' Index
- A quantitative study to determine a health index for open source projects
- Leverage previous work (e.g., the LYRASIS project and Qualification and Selection of Open Source Software (QSOS))
Funded by IMLS
Top 5 Reasons Why ...
...you should have a Dataverse repository in Portugal:
- Be compliant with research funding organizations that require FAIR data (e.g., Horizon 2020).
- Be compliant with journals that require submission of supporting data files to accompany manuscripts.
- Give credit to Portuguese researchers for sharing data.
- Archive and maintain full control of a critical research asset for Portugal: your data.
- Improve collaboration and improve science by sharing your data with the world.
Obrigada!
dataverse.org | dataverse.harvard.edu | The Dataverse Team @IQSS
https://groups.google.com/forum/#!forum/dataverse-community
scholar.harvard.edu/mercecrosas @mercecrosas