Supporting Citation of Subsetted Data:
A Data Center’s Perspective
Shannon Rauch
Summer ESIP 2015
14 July 2015
Why Cite Data?
"Data citation is intended to help guard the integrity of scholarly conclusions"
(Starr et al. 2015)
- Transparency
- Verification and reproducibility of results
- Attribution
- Discovery
- Improved re-use of scholarly data
The Value of Data Citation is Well Known
... As evidenced by the number of working groups,
guidelines, best practices, etc.
8 Core Principles of Data Citation
Joint Declaration of Data Citation Principles (JDDCP), FORCE11
How?
Most guidelines suggest something like this at a minimum:
Creator (Year). Title. Provider. Identifier.
Also recommended: URL, version/edition, date accessed, description of subset used.
Adapted from: http://datapub.cdlib.org/
Zwally, H., R. Schutz, C. Bentley, J. Bufton, T. Herring, J. Minster, J. Spinhirne, and R. Thomas. 2011. GLAS/ICESat L1A Global Altimetry Data. Version 33. Boulder, Colorado USA: NASA National Snow and Ice Data Center Distributed Active Archive Center. DOI: 10.5067/ICESAT/GLAS/DATA121
Siegel, David A. 2006. VERTIGO project Niskin bottle sample data from KM0414 and RR_K2 cruises. Biological and Chemical Oceanography Data System. BCO-DMO, WHOI. Accessed: 13 July 2015. DOI: 10.1575/1912/4199
Thurnherr, Andreas. 2007. Raw CTD Data from the East Pacific Rise at 9N acquired during the Atlantis expedition AT15-12 (2006). IEDA. Accessed 13 January 2015 from http://www.marine-geo.org/tools/search/Files.php?data_set_uid=6210
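The minimum pattern above (Creator, Year, Title, Provider, Identifier, plus the optional extras) can be sketched as a small formatter. This is only an illustration: the class name `DataCitation`, its field names, and the ordering of the optional parts are assumptions made for this example, not a format prescribed by any of the guidelines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataCitation:
    """Hypothetical container for the citation elements listed above."""
    creator: str
    year: int
    title: str
    provider: str
    identifier: str                  # DOI, URL, or other identifier
    version: Optional[str] = None    # optional: version/edition
    subset: Optional[str] = None     # optional: description of subset used
    accessed: Optional[str] = None   # optional: date accessed

    def render(self) -> str:
        # Required elements first, optional ones only when supplied.
        parts = [f"{self.creator} ({self.year})", self.title]
        if self.version:
            parts.append(f"Version {self.version}")
        if self.subset:
            parts.append(f"Subset: {self.subset}")
        parts.append(self.provider)
        if self.accessed:
            parts.append(f"Accessed: {self.accessed}")
        parts.append(self.identifier)
        # Strip trailing periods so joining never produces "..".
        return ". ".join(p.rstrip(".") for p in parts) + "."


citation = DataCitation(
    creator="Siegel, David A.",
    year=2006,
    title="VERTIGO project Niskin bottle sample data from KM0414 and RR_K2 cruises",
    provider="BCO-DMO, WHOI",
    identifier="DOI: 10.1575/1912/4199",
    accessed="13 July 2015",
)
print(citation.render())
```

Keeping the optional elements as separate fields means a version or subset description can be added later without changing how the basic citation is built.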
Easier SAID than DONE?
Data repository mentioned in text and supplemental materials...
...or in table and figure captions...
...or in Acknowledgements.
On the right track
Wait...
Is that really sufficient?
Data Are Not (usually) Static
Versioning: Datasets change (additions, deletions, edits/corrections)
Subsetting: Datasets are sub-selected/filtered in various ways (locally, by the end user, or at the data repository using tools, workbenches, and other black boxes)
Two Flavors of Dynamic:
Versioning & Subsetting
Ideally, can we keep the same URL for the dataset that it already has? It’s been disseminated pretty widely by now, and it would be tough at this stage to change it in our upcoming pub.
Which brings up another question — is there a way to put this dataset in version control? Our next paper will build on it and probably expand it — would be convenient to keep all versions of the dataset in the same “box” so searchers can find the exact version used for a particular analysis, but can also see the latest version if they’re interested in using it for their own purposes.
Sorry if my Github-iness is tough for BCO-DMO. If we can’t keep the same URL, can we at least keep the old version up and include a link to the new and improved dataset?
A (Recent) Real Email from a Researcher
A Case for Versioning
Subsetting
"Deep Citation"
Deep citation refers to citation of subsets of datasets
(like referencing page numbers in a book)
- Subsets by row, column, or both (e.g. time period, location, species, cruises, parameter values, etc.)
- Subsets used by researchers are often described in text in their paper's Methods section, or Supplemental Material...
- ...but subsets can be cited more formally.
- Save each subset associated with a study/publication as a data object with unique identifier.
  - redundant? scalability issues?
- Cite entire dataset, provide text description of subset used.
  - specific enough to enable reproducibility?
- Assign unique, citable identifier to query used to produce the subset.
Approaches to Citing Subsets
Things to consider:
Level of granularity needed
Human needs vs machine needs
Approaches to Citing Subsets
Example format suggested by NSIDC:
Author's Name. Year of Publication. Title of Data Set and Version Number, [indicate subset used]. Boulder, Colorado USA: National Snow and Ice Data Center. DOI.
Njoku, Eni. 2004, updated daily. AMSR-E/Aqua L2B Surface Soil Moisture, Ancillary Parms, & QC EASE-Grids V002, March to June 2004. Boulder, Colorado USA: National Snow and Ice Data Center. http://dx.doi.org/10.7265/N5.
Other groups/authors recommend assigning unique persistent identifiers (PIDs) (DOI, URI, ARK, etc.) to the subset and/or query used to generate the subset...
RDA Working Group on Data Citation (WGDC) Recommends:
(1) Store data in a versioned, time-stamped manner.
(2) Assign persistent identifiers (PIDs) to time-stamped queries that can be re-executed.
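The two recommendations can be sketched in a few lines: rows carry validity timestamps, and a query is stored together with its execution timestamp so it can be re-run later against the data as it was then. Everything here is hypothetical. The class and method names are invented for illustration, the "PID" is just a truncated hash standing in for a real persistent identifier, and the depth values are made up.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    values: dict
    valid_from: int                 # timestamp when the row was added
    valid_to: Optional[int] = None  # timestamp when it was deleted/superseded


class VersionedTable:
    """Toy versioned store: nothing is overwritten, rows are only closed."""

    def __init__(self):
        self.rows = []
        self.queries = {}  # stand-in PID -> stored query record

    def insert(self, values, ts):
        self.rows.append(Row(dict(values), ts))

    def delete(self, match, ts):
        # Mark matching live rows as no longer valid from ts onward.
        for row in self.rows:
            live = row.valid_to is None
            if live and all(row.values.get(k) == v for k, v in match.items()):
                row.valid_to = ts

    def snapshot(self, ts):
        # Rows that were valid at time ts.
        return [r.values for r in self.rows
                if r.valid_from <= ts and (r.valid_to is None or ts < r.valid_to)]

    def register_query(self, filters, ts):
        # Store the query with its timestamp; derive a stand-in identifier.
        record = {"filters": filters, "timestamp": ts}
        digest = hashlib.sha256(json.dumps(record, sort_keys=True).encode())
        pid = digest.hexdigest()[:12]
        self.queries[pid] = record
        return pid

    def resolve(self, pid):
        # Re-execute the stored query against the data as of its timestamp.
        record = self.queries[pid]
        return [v for v in self.snapshot(record["timestamp"])
                if all(v.get(k) == val for k, val in record["filters"].items())]


table = VersionedTable()
table.insert({"cruise": "X0804", "depth": 10}, ts=1)
table.insert({"cruise": "X0705", "depth": 20}, ts=2)
pid = table.register_query({"cruise": "X0804"}, ts=2)

# A later correction replaces the X0804 row...
table.delete({"cruise": "X0804"}, ts=3)
table.insert({"cruise": "X0804", "depth": 11}, ts=3)

# ...but the cited query still returns the data as originally used.
print(table.resolve(pid))
```

Only the versioned rows and the small query record are stored; the subset itself is re-created on demand.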
"It allows identifying, retrieving and citing the precise data set with minimal storage overhead by only storing the versioned data and the queries used for creating the data set... Data sets can be re-created on demand."
The BCO-DMO Use Case
The Two Challenges:
Support citation of versioned and subsetted data
The Biological and Chemical Oceanography Data Management Office (BCO-DMO) works with investigators to serve data online from research projects funded by the National Science Foundation's (NSF) Biological and Chemical Oceanography Sections (OCE) and the Division of Polar Programs Antarctic Organisms & Ecosystems Program (PLR), and to support other NSF-funded marine ecosystems researchers.
BCO-DMO Data Holdings
BCO-DMO serves:
7,500+ datasets
from 500+ projects
involving 1,800+ researchers
Current versioning and subsetting practices:
- Only the most current version of each dataset is available online (though previous versions are retained on BCO-DMO servers)
- Simple subsetting tools are available to end-users (though most users download the full dataset and work with the data locally)
Previous Data Citation Work
Leadbetter, A., Raymond, L., Chandler, C., Pikula, L., Pissierssens, P., Urban, E. (2013) Ocean Data Publication Cookbook. Paris: UNESCO, 41 pp. & annexes. (Manuals and Guides. Intergovernmental Oceanographic Commission, 64), (IOC/MG/64)
“Cookbook” written for data managers and librarians who are interested in assigning a permanent identifier to a dataset for the purposes of publishing that dataset online and for the citation of that dataset within the scientific literature.
Currently, we can suggest citation in this format:
Buesseler, K. 2006. VERTIGO RR_K2 Cruise Event Log. Biological and Chemical Oceanography Data System. BCO-DMO, WHOI. Accessed: 14 February 2006. http://www.bco-dmo.org/dataset-deployment/451622
Maas, Amy E., 2012. Pteropod respiration rates from NW Atlantic and NE Pacific; OC473 (2011) and NH1208 (2012). Biological and Chemical Oceanography Data System. BCO-DMO, WHOI. DOI: 10.1575/1912/6421
Can we / should we accommodate citation of versions and subsets?
One Example of Versioning
Links to previous versions are provided in the dataset header/metadata.
Previous versions link back to newer versions.
Is this sufficient?
Citation examples:
Morris, J.J. 2015. OA Literature Review; Literature review of Ocean Acidification (OA) effects on phytoplankton. Biological and Chemical Oceanography Data Management Office (BCO-DMO). Version date 07 July 2015. URL: http://www.bco-dmo.org/dataset/554221.
Morris, J.J. 2015. OA Literature Review; Literature review of Ocean Acidification (OA) effects on phytoplankton. Biological and Chemical Oceanography Data Management Office (BCO-DMO). Version date 26 March 2015. URL: http://www.bco-dmo.org/dataset/554221.
Morris, J.J. 2015. OA Literature Review; Literature review of Ocean Acidification (OA) effects on phytoplankton. Biological and Chemical Oceanography Data Management Office (BCO-DMO). Version date 26 March 2015. URL: http://dmoserv3.bco-dmo.org/jg/serv/BCO-DMO/P-ExpEv/OA_Lit_Review_03262015.html0
Note: the first two citations share the same URL (the metadata landing page) and differ only in version date.
Note: the URL in the third citation goes to a specific data version rather than the landing page, which is not ideal.
So does each version need its own landing page?
One Example of Subsetting
Subsetted URL:
http://data.bco-dmo.org/jg/serv/BCO/DOPUtilization/Biogeochemistry.html0?(CruiseId=X0804)%7C(CruiseId=X0705)
Full dataset URL: http://data.bco-dmo.org/jg/serv/BCO/DOPUtilization/Biogeochemistry.html0
Is this URL citable? Maybe. But it is disconnected from the metadata landing page.
This could be given a PID and cited with the landing page URL. But is there a need? Ultimately, most researchers are still subsetting data locally on their own machines.
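To see what the subsetted URL actually encodes, it can be split into the full-dataset URL and the filter clauses. This is a rough sketch using only the Python standard library; the function name is hypothetical, and reading the encoded `|` (`%7C`) as a separator between alternative clauses is an assumption based on the example URL, not on documentation of the data server.

```python
from urllib.parse import urlsplit, unquote


def split_subset_url(url):
    """Return (full-dataset URL, list of decoded filter clauses)."""
    parts = urlsplit(url)
    base = f"{parts.scheme}://{parts.netloc}{parts.path}"
    query = unquote(parts.query)  # "%7C" decodes to "|"
    # Assumption: clauses are parenthesized and separated by "|".
    clauses = [c.strip("()") for c in query.split("|")] if query else []
    return base, clauses


subset_url = ("http://data.bco-dmo.org/jg/serv/BCO/DOPUtilization/"
              "Biogeochemistry.html0?(CruiseId=X0804)%7C(CruiseId=X0705)")
base, clauses = split_subset_url(subset_url)
print(base)
print(clauses)
```

The query string alone carries the subset definition, which is why it, rather than a saved copy of the subset, is the natural thing to attach an identifier to.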
The Future?
Maas, Amy E., 2012. Pteropod respiration rates from NW Atlantic and NE Pacific; OC473 (2011) and NH1208 (2012). Biological and Chemical Oceanography Data System. BCO-DMO, WHOI. Version 1.0. PID: 123456. DOI: 10.1575/1912/6421.
Is this the type of citation BCO-DMO should be moving toward enabling?
Bringing it all together...
We (data managers, repositories, librarians, etc.) understand the importance of data citation and there is general consensus about how to do so.
Versioning should be accounted for, though many repositories have traditionally provided online access to only the most current version.
Citation of data subsets is complex. No one-size-fits-all solution. Different approaches for different communities of researchers?
Changing the Culture
(1) Information managers recognize the importance of data citation and have developed many recommendations.
But are those being put to use by researchers?
(2) If we are not seeing basic citation of datasets in a standard format,
how can we encourage citation of versioned and subsetted data?
References
Altman et al. (2007) A proposed standard for the scholarly citation of quantitative data. D-Lib Magazine 13(3/4). doi: 10.1045/march2007-altman
Data Citation Synthesis Group (2014) Joint Declaration of Data Citation Principles. Martone, M. (ed.) San Diego, CA: FORCE11. https://www.force11.org/datacitation
Rauber, A. et al. (2015) Data Citation of Evolving Data, Recommendations of the Working Group on Data Citation (WGDC). 9 June 2015. https://www.rd-alliance.org/system/files/documents/RDA-DC-Recommendations_150609.pdf
Starr, J. et al. (2015) Achieving human and machine accessibility of cited data in scholarly publications. PeerJ Comp Sci 1:e1 https://dx.doi.org/10.7717/peerj-cs.1
Tilmes, C. (2011) Data Identifiers and Citations Enable Reproducible Science. AGU Fall Meeting 2011. http://wiki.esipfed.org/images/b/b6/TilmesAGU.pdf
Black Boxes
Do we cite the bread? The toast? The toaster?
What about the toaster settings?
Presentation given at Summer ESIP. July 14, 2015.