Shannon Rauch
Summer ESIP 2015
14 July 2015
"Data citation is intended to help guard the integrity of scholarly conclusions"
(Starr et al. 2015)
... as evidenced by the number of working groups, guidelines, best practices, etc.
Creator (Year). Title. Provider. Identifier.
Adapted from: http://datapub.cdlib.org/
Zwally, H., R. Schutz, C. Bentley, J. Bufton, T. Herring, J. Minster, J. Spinhirne, and R. Thomas. 2011. GLAS/ICESat L1A Global Altimetry Data. Version 33. Boulder, Colorado USA: NASA National Snow and Ice Data Center Distributed Active Archive Center. DOI: 10.5067/ICESAT/GLAS/DATA121
Siegel, David A. 2006. VERTIGO project Niskin bottle sample data from KM0414 and RR_K2 cruises. Biological and Chemical Oceanography Data System. BCO-DMO, WHOI. Accessed: 13 July 2015. DOI: 10.1575/1912/4199
Thurnherr, Andreas. 2007. Raw CTD Data from the East Pacific Rise at 9N acquired during the Atlantis expedition AT15-12 (2006). IEDA. Accessed 13 January 2015 from http://www.marine-geo.org/tools/search/Files.php?data_set_uid=6210
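As a minimal sketch, the basic pattern above (Creator (Year). Title. Provider. Identifier.) can be assembled from a handful of metadata fields. The function name `format_citation` and its parameters are hypothetical, chosen for illustration only; the example values are taken from the BCO-DMO citation above.

```python
def format_citation(creator, year, title, provider, identifier, accessed=None):
    """Assemble a citation in the 'Creator (Year). Title. Provider. Identifier.' pattern.

    Hypothetical helper for illustration; not part of any repository's tooling.
    """
    parts = [f"{creator} ({year}).", f"{title}.", f"{provider}."]
    if accessed:
        # An access date is useful when the identifier is a URL to data that may change.
        parts.append(f"Accessed: {accessed}.")
    parts.append(f"{identifier}.")
    return " ".join(parts)


citation = format_citation(
    creator="Siegel, David A.",
    year=2006,
    title="VERTIGO project Niskin bottle sample data from KM0414 and RR_K2 cruises",
    provider="BCO-DMO, WHOI",
    identifier="DOI: 10.1575/1912/4199",
    accessed="13 July 2015",
)
print(citation)
```

Keeping the fields separate until the last step makes it easier to render the same metadata in a different journal style later.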
Data repository mentioned in text and supplemental materials...
...or in table and figure captions...
...or in Acknowledgements.
Wait...
Is that really sufficient?
Versioning: Datasets change (additions, deletions, edits/corrections)
Subsetting: Datasets are sub-selected/filtered in various ways (locally, by the end user, or at the data repository using tools, workbenches, and other black boxes)
Ideally, can we keep the same URL for the dataset that it already has? It’s been disseminated pretty widely by now, and it would be tough at this stage to change it in our upcoming pub.
Which brings up another question — is there a way to put this dataset in version control? Our next paper will build on it and probably expand it — would be convenient to keep all versions of the dataset in the same “box” so searchers can find the exact version used for a particular analysis, but can also see the latest version if they’re interested in using it for their own purposes.
Sorry if my Github-iness is tough for BCO-DMO. If we can’t keep the same URL, can we at least keep the old version up and include a link to the new and improved dataset?
A (Recent) Real Email from a Researcher
"Deep Citation"
Deep citation refers to citation of subsets of datasets
(like referencing page numbers in a book)
Things to consider:
Level of granularity needed
Human needs vs machine needs
Example format suggested by NSIDC:
Author's Name. Year of Publication. Title of Data Set and Version Number, [indicate subset used]. Boulder, Colorado USA: National Snow and Ice Data Center. DOI.
Njoku, Eni. 2004, updated daily. AMSR-E/Aqua L2B Surface Soil Moisture, Ancillary Parms, & QC EASE-Grids V002, March to June 2004. Boulder, Colorado USA: National Snow and Ice Data Center. http://dx.doi.org/10.7265/N5.
Other groups/authors recommend assigning unique persistent identifiers (PIDs) (DOI, URI, ARK, etc.) to the subset and/or query used to generate the subset...
(1) Store data in a versioned, time-stamped manner.
(2) Assign persistent identifiers (PIDs) to time-stamped queries that can be re-executed.
"It allows identifying, retrieving and citing the precise data set with minimal storage overhead by only storing the versioned data and the queries used for creating the data set... Data sets can be re-created on demand."
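The two recommendations above can be sketched in a few lines, assuming a toy in-memory store. Everything here is hypothetical and for illustration only: the `VersionedStore` class, its method names, the station keys, and the values are invented, and the hash-based identifier is only a stand-in for a real PID (DOI, ARK, etc.) that a registration service would mint.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Row:
    key: str
    value: float
    valid_from: datetime            # when this version of the record was added
    valid_to: Optional[datetime]    # when it was superseded (None = still current)


class VersionedStore:
    """Toy store: records are never overwritten, only closed and superseded."""

    def __init__(self):
        self.rows = []
        self.queries = {}  # identifier -> (selected keys, timestamp)

    def upsert(self, key, value, when):
        # (1) Versioned, time-stamped storage: close the current row, append a new one.
        for i, r in enumerate(self.rows):
            if r.key == key and r.valid_to is None:
                self.rows[i] = Row(r.key, r.value, r.valid_from, when)
        self.rows.append(Row(key, value, when, None))

    def snapshot(self, when):
        # The dataset as it existed at time `when`.
        return {r.key: r.value for r in self.rows
                if r.valid_from <= when and (r.valid_to is None or when < r.valid_to)}

    def register_query(self, keys, when):
        # (2) Identify the time-stamped query, not a stored copy of its result.
        keys = tuple(sorted(keys))
        pid = hashlib.sha256(f"{keys}|{when.isoformat()}".encode()).hexdigest()[:12]
        self.queries[pid] = (keys, when)
        return pid

    def resolve(self, pid):
        # Re-execute the stored query against the data as of its timestamp.
        keys, when = self.queries[pid]
        snap = self.snapshot(when)
        return {k: snap[k] for k in keys if k in snap}


store = VersionedStore()
t1 = datetime(2015, 3, 26, tzinfo=timezone.utc)
t2 = datetime(2015, 7, 7, tzinfo=timezone.utc)
store.upsert("station_1", 4.2, t1)
store.upsert("station_2", 3.9, t1)
pid = store.register_query(["station_1"], t1)
store.upsert("station_1", 4.5, t2)  # a later correction
print(store.resolve(pid))           # still the value that was cited at t1
```

Only the versioned rows and the small query record are stored; the cited subset is re-created on demand, which is the storage argument made in the quotation above.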
Support citation of versioned and subsetted data
The Biological and Chemical Oceanography Data Management Office (BCO-DMO) works with investigators to serve data online from research projects funded by the National Science Foundation's (NSF) Biological and Chemical Oceanography Sections (OCE) and the Division of Polar Programs Antarctic Organisms & Ecosystems Program (PLR), and to support other NSF-funded marine ecosystems researchers.
BCO-DMO serves:
7,500+ datasets
from 500+ projects
involving 1,800+ researchers
Current versioning and subsetting practices:
Leadbetter, A., Raymond, L., Chandler, C., Pikula, L., Pissierssens, P., Urban, E. (2013) Ocean Data Publication Cookbook. Paris: UNESCO, 41 pp. & annexes. (Manuals and Guides. Intergovernmental Oceanographic Commission, 64), (IOC/MG/64)
“Cookbook” written for data managers and librarians who are interested in assigning a permanent identifier to a dataset for the purposes of publishing that dataset online and for the citation of that dataset within the scientific literature.
We can suggest citation in this format:
Buesseler, K. 2006. VERTIGO RR_K2 Cruise Event Log. Biological and Chemical Oceanography Data System. BCO-DMO, WHOI. Accessed: 14 February 2006. http://www.bco-dmo.org/dataset-deployment/451622
Maas, Amy E., 2012. Pteropod respiration rates from NW Atlantic and NE Pacific; OC473 (2011) and NH1208 (2012). Biological and Chemical Oceanography Data System. BCO-DMO, WHOI. DOI: 10.1575/1912/6421
Can we / should we accommodate citation of versions and subsets?
Links to previous versions provided in dataset header/metadata
Previous versions link back to newer versions.
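One way to picture the two-way linking described above is a simple chain in which each version record points to both its predecessor and its successor. This is a sketch under stated assumptions: the `DatasetVersion` class, the helper functions, and the example.org URLs are hypothetical and do not describe how BCO-DMO stores its metadata.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)  # compare by identity; the records reference each other
class DatasetVersion:
    version_date: str
    url: str
    previous: Optional["DatasetVersion"] = None  # link in the header to the older version
    newer: Optional["DatasetVersion"] = None     # link from the older version forward


def add_version(latest, version_date, url):
    """Append a new version and wire up the links in both directions."""
    new = DatasetVersion(version_date, url, previous=latest)
    if latest is not None:
        latest.newer = new
    return new


def latest_of(version):
    """Follow the 'newer' links from any cited version to the current one."""
    while version.newer is not None:
        version = version.newer
    return version


# Hypothetical URLs; the version dates echo the citation examples in this talk.
v1 = add_version(None, "26 March 2015", "https://example.org/dataset/1/v1")
v2 = add_version(v1, "07 July 2015", "https://example.org/dataset/1/v2")

print(latest_of(v1).version_date)  # a reader of the older citation can find the newest data
print(v2.previous.version_date)    # a reader of the newest data can find what was cited
```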
Citation examples:
Morris, J.J. 2015. OA Literature Review; Literature review of Ocean Acidification (OA) effects on phytoplankton. Biological and Chemical Oceanography Data Management Office (BCO-DMO). Version date 07 July 2015. URL: http://www.bco-dmo.org/dataset/554221.
Morris, J.J. 2015. OA Literature Review; Literature review of Ocean Acidification (OA) effects on phytoplankton. Biological and Chemical Oceanography Data Management Office (BCO-DMO). Version date 26 March 2015. URL: http://www.bco-dmo.org/dataset/554221.
Morris, J.J. 2015. OA Literature Review; Literature review of Ocean Acidification (OA) effects on phytoplankton. Biological and Chemical Oceanography Data Management Office (BCO-DMO). Version date 26 March 2015. URL: http://dmoserv3.bco-dmo.org/jg/serv/BCO-DMO/P-ExpEv/OA_Lit_Review_03262015.html0
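Note: the first two citations share the same URL (the metadata landing page), so the URL alone does not distinguish the versions.
Note: the URL in the third citation goes to a specific data version rather than the landing page, which is not ideal.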
So does each version need its own landing page?
Subsetted URL:
http://data.bco-dmo.org/jg/serv/BCO/DOPUtilization/Biogeochemistry.html0?(CruiseId=X0804)%7C(CruiseId=X0705)
Full dataset URL: http://data.bco-dmo.org/jg/serv/BCO/DOPUtilization/Biogeochemistry.html0
Is this URL citeable? Maybe. But disconnected from metadata landing page.
This could be given a PID and cited with the landing page URL. But is there a need? Ultimately, most researchers are still subsetting data locally on their own machines.
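To make the relationship between the two URLs concrete, the sketch below splits the subsetted URL into the full dataset URL and the selection that was applied. It uses only the Python standard library. Reading `%7C` (the encoded `|` character) as a logical OR between the two cruise filters is an assumption about the server's query syntax, not something documented here.

```python
import re
from urllib.parse import unquote, urlsplit

subset_url = ("http://data.bco-dmo.org/jg/serv/BCO/DOPUtilization/"
              "Biogeochemistry.html0?(CruiseId=X0804)%7C(CruiseId=X0705)")

parts = urlsplit(subset_url)

# Everything before the '?' is the full dataset URL.
full_dataset_url = f"{parts.scheme}://{parts.netloc}{parts.path}"

# The query string carries the subset selection; %7C decodes to '|'.
selection = unquote(parts.query)

# Pull out the cruise identifiers named in the selection.
cruise_ids = re.findall(r"CruiseId=([^)]+)", selection)

print(full_dataset_url)
print(selection)
print(cruise_ids)
```

A citation that recorded the dataset identifier plus the decoded selection would describe the subset without depending on this particular URL layout.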
Maas, Amy E., 2012. Pteropod respiration rates from NW Atlantic and NE Pacific; OC473 (2011) and NH1208 (2012). Biological and Chemical Oceanography Data System. BCO-DMO, WHOI. Version 1.0. PID: 123456. DOI: 10.1575/1912/6421.
Is this the type of citation BCO-DMO should be moving toward enabling?
We (data managers, repositories, librarians, etc.) understand the importance of data citation, and there is general consensus about how to do it.
Versioning should be accounted for, though many repositories have traditionally provided online access to only the most current version.
Citation of data subsets is complex. No one-size-fits-all solution. Different approaches for different communities of researchers?
(1) Information managers recognize the importance of data citation and have developed many recommendations.
But are those being put to use by researchers?
(2) If we are not seeing basic citation of datasets in a standard format,
how can we encourage citation of versioned and subsetted data?
Altman, M. and King, G. (2007) A proposed standard for the scholarly citation of quantitative data. D-Lib Magazine 13(3/4). doi: 10.1045/march2007-altman
Data Citation Synthesis Group: Joint Declaration of Data Citation Principles. Martone M. (ed.) San Diego CA: FORCE11; 2014 [https://www.force11.org/datacitation].
Rauber, A. et al. (2015) Data Citation of Evolving Data, Recommendations of the Working Group on Data Citation (WGDC). 9 June 2015. https://www.rd-alliance.org/system/files/documents/RDA-DC-Recommendations_150609.pdf
Starr, J. et al. (2015) Achieving human and machine accessibility of cited data in scholarly publications. PeerJ Comp Sci 1:e1 https://dx.doi.org/10.7717/peerj-cs.1
Tilmes, C. (2011) Data Identifiers and Citations Enable Reproducible Science. AGU Fall Meeting 2011. http://wiki.esipfed.org/images/b/b6/TilmesAGU.pdf
Do we cite the bread? The toast? The toaster?
What about the toaster settings?