Supporting Citation of Subsetted Data:
A Data Center’s Perspective

Shannon Rauch

 

Summer ESIP 2015

14 July 2015

 

Why Cite Data?

"Data citation is intended to help guard the integrity of scholarly conclusions"

(Starr et al. 2015)

  • Transparency
  • Verification and reproducibility of results
  • Attribution
  • Discovery 
  • Improved re-use of scholarly data

The Value of Data Citation is Well Known

... As evidenced by the number of working groups,
guidelines, best practices, etc.

8 Core Principles of Data Citation

Joint Declaration of Data Citation Principles (JDDCP), FORCE11

How?

Most guidelines suggest something like this at a minimum:

Creator (Year). Title. Provider. Identifier. 

Also recommended: URL, version/edition, date accessed, description of subset used.

Adapted from: http://datapub.cdlib.org/
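
The minimum format above can be sketched as a small formatter. This is only an illustration: the class name `DataCitation`, its field names, and the ordering of the optional elements are assumptions, not something prescribed by any guideline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataCitation:
    """Hypothetical container for the citation elements listed above."""
    creator: str
    year: int
    title: str
    provider: str
    identifier: str                  # DOI or URL, written out as it should appear
    version: Optional[str] = None    # recommended extras
    accessed: Optional[str] = None
    subset: Optional[str] = None

    def format(self) -> str:
        # Minimum elements: Creator (Year). Title. Provider. Identifier.
        parts = [f"{self.creator} ({self.year})", self.title]
        if self.version:
            parts.append(f"Version {self.version}")
        if self.subset:
            parts.append(f"Subset: {self.subset}")
        parts.append(self.provider)
        if self.accessed:
            parts.append(f"Accessed: {self.accessed}")
        parts.append(self.identifier)
        return ". ".join(parts) + "."


citation = DataCitation(
    creator="Siegel, David A.",
    year=2006,
    title="VERTIGO project Niskin bottle sample data from KM0414 and RR_K2 cruises",
    provider="BCO-DMO, WHOI",
    identifier="DOI: 10.1575/1912/4199",
    accessed="13 July 2015",
)
print(citation.format())
```

Keeping the optional elements as separate fields means a version or subset description can be added later without changing the minimum form.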

Zwally, H., R. Schutz, C. Bentley, J. Bufton, T. Herring, J. Minster, J. Spinhirne, and R. Thomas. 2011. GLAS/ICESat L1A Global Altimetry Data. Version 33. Boulder, Colorado USA: NASA National Snow and Ice Data Center Distributed Active Archive Center. DOI: 10.5067/ICESAT/GLAS/DATA121

 

Siegel, David A. 2006. VERTIGO project Niskin bottle sample data from KM0414 and RR_K2 cruises. Biological and Chemical Oceanography Data System. BCO-DMO, WHOI. Accessed: 13 July 2015. DOI: 10.1575/1912/4199

 

Thurnherr, Andreas. 2007. Raw CTD Data from the East Pacific Rise at 9N acquired during the Atlantis expedition AT15-12 (2006). IEDA. Accessed 13 January 2015 from http://www.marine-geo.org/tools/search/Files.php?data_set_uid=6210

Easier SAID than DONE?

Data repository mentioned in text and supplemental materials...

...or in table and figure captions...

... or in Acknowledgements.

On the right track

Wait...

Is that really sufficient?

Data Are Not (usually) Static

Versioning: Datasets change (additions, deletions, edits/corrections)

 

Subsetting: Datasets are sub-selected/filtered in various ways (locally, by the end user, or at the data repository using tools, workbenches, and other black boxes)

Two Flavors of Dynamic:

Versioning & Subsetting

Ideally, can we keep the same URL for the dataset that it already has? It’s been disseminated pretty widely by now, and it would be tough at this stage to change it in our upcoming pub.

 

Which brings up another question — is there a way to put this dataset in version control?  Our next paper will build on it and probably expand it — would be convenient to keep all versions of the dataset in the same “box” so searchers can find the exact version used for a particular analysis, but can also see the latest version if they’re interested in using it for their own purposes.

 

Sorry if my Github-iness is tough for BCO-DMO.  If we can’t keep the same URL, can we at least keep the old version up and include a link to the new and improved dataset?

A (Recent) Real Email from a Researcher

A Case for Versioning

Subsetting

"Deep Citation"

Deep citation refers to citation of subsets of datasets
(like referencing page numbers in a book)

 

  • Subsets by row, column, or both (e.g. time period, location, species, cruises, parameter values, etc.)
  • Subsets used by researchers are often described in text in their paper's Methods section, or Supplemental Material...
  • ...but subsets can be cited more formally.

 

  1. Save each subset associated with a study/publication as a data object with unique identifier.
    • redundant? scalability issues?
  2. Cite entire dataset, provide text description of subset used.
    • specific enough to enable reproducibility?
  3. Assign unique, citable identifier to query used to produce the subset.

Approaches to Citing Subsets

Things to consider:

Level of granularity needed

Human needs vs machine needs

Example format suggested by NSIDC:

Author's Name. Year of Publication. Title of Data Set and Version Number, [indicate subset used]. Boulder, Colorado USA: National Snow and Ice Data Center. DOI.

 

Njoku, Eni. 2004, updated daily. AMSR-E/Aqua L2B Surface Soil Moisture, Ancillary Parms, & QC EASE-Grids V002, March to June 2004. Boulder, Colorado USA: National Snow and Ice Data Center. http://dx.doi.org/10.7265/N5.

Other groups/authors recommend assigning unique persistent identifiers (PIDs) (DOI, URI, ARK, etc.) to the subset and/or query used to generate the subset...

RDA Working Group on Data Citation (WGDC) Recommends:

(1) Store data in a versioned, time-stamped manner.

(2) Assign persistent identifiers (PIDs) to time-stamped queries that can be re-executed.

"It allows identifying, retrieving and citing the precise data set with minimal storage overhead by only storing the versioned data and the queries used for creating the data set... Data sets can be re-created on demand."

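A minimal sketch of the two recommendations above, assuming an in-memory store. Everything here is hypothetical: the class and method names, the integer timestamps, the `value` column and its numbers, and the hash-derived identifier, which only stands in for a real PID such as a DOI or ARK.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowVersion:
    key: str
    values: dict
    valid_from: int                  # time this row version was written
    valid_to: Optional[int] = None   # None while the version is still current


class VersionedStore:
    def __init__(self):
        self.rows = []      # (1) nothing is overwritten; edits close old versions
        self.queries = {}   # (2) PID -> (field, value, timestamp)

    def upsert(self, key, values, ts):
        for row in self.rows:
            if row.key == key and row.valid_to is None:
                row.valid_to = ts
        self.rows.append(RowVersion(key, values, ts))

    def snapshot(self, ts):
        # Rows as they existed at time ts.
        return [r for r in self.rows
                if r.valid_from <= ts and (r.valid_to is None or ts < r.valid_to)]

    def register_query(self, field, value, ts):
        # Stand-in PID: a short hash of the query text plus its timestamp.
        pid = hashlib.sha256(f"{field}={value}@{ts}".encode()).hexdigest()[:12]
        self.queries[pid] = (field, value, ts)
        return pid

    def resolve(self, pid):
        # Re-execute the stored query against the data as of its timestamp.
        field, value, ts = self.queries[pid]
        return [r.values for r in self.snapshot(ts) if r.values.get(field) == value]


store = VersionedStore()
store.upsert("s1", {"cruise": "X0804", "value": 0.12}, ts=1)   # made-up numbers
store.upsert("s2", {"cruise": "X0705", "value": 0.30}, ts=1)
pid = store.register_query("cruise", "X0804", ts=2)            # subset cited here
store.upsert("s1", {"cruise": "X0804", "value": 0.15}, ts=3)   # later correction
print(store.resolve(pid))
```

Resolving the PID after the correction still returns the row as it was when the query was registered; only the versioned rows and the query are stored, not a copy of the subset.
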
The BCO-DMO Use Case

The Two Challenges:

Support citation of versioned and subsetted data

The Biological and Chemical Oceanography Data Management Office (BCO-DMO) works with investigators to serve data online from research projects funded by the National Science Foundation's (NSF) Biological and Chemical Oceanography Sections (OCE) and the Division of Polar Programs Antarctic Organisms & Ecosystems Program (PLR), and to support other NSF-funded marine ecosystems researchers.

BCO-DMO Data Holdings

BCO-DMO serves:

7,500+ datasets

from 500+ projects

involving 1,800+ researchers

 

Current versioning and subsetting practices:

  • Only the most current version of each dataset is available online (though previous versions are retained on BCO-DMO servers)
  • Simple subsetting tools are available to end-users (though most users download the full dataset and work with the data locally)

Previous Data Citation Work

Leadbetter, A., Raymond, L., Chandler, C., Pikula, L., Pissierssens, P., Urban, E. (2013) Ocean Data Publication Cookbook. Paris: UNESCO, 41 pp. & annexes. (Manuals and Guides. Intergovernmental Oceanographic Commission, 64), (IOC/MG/64)

“Cookbook” written for data managers and librarians who are interested in assigning a permanent identifier to a dataset for the purposes of publishing that dataset online and for the citation of that dataset within the scientific literature. 

Currently, we can suggest citation in this format:

Buesseler, K. 2006. VERTIGO RR_K2 Cruise Event Log. Biological and Chemical Oceanography Data System. BCO-DMO, WHOI. Accessed: 14 February 2006. http://www.bco-dmo.org/dataset-deployment/451622

 

Maas, Amy E., 2012. Pteropod respiration rates from NW Atlantic and NE Pacific; OC473 (2011) and NH1208 (2012). Biological and Chemical Oceanography Data System. BCO-DMO, WHOI. DOI: 10.1575/1912/6421

 

Can we / should we accommodate citation of versions and subsets?

One Example of Versioning

Links to previous versions provided in dataset header/metadata

Previous versions link back to newer versions.

Is this sufficient?

Citation examples:

Morris, J.J. 2015.  OA Literature Review; Literature review of Ocean Acidification (OA) effects on phytoplankton. Biological and Chemical Oceanography Data Management Office (BCO-DMO). Version date 07 July 2015. URL: http://www.bco-dmo.org/dataset/554221.

 

Morris, J.J. 2015.  OA Literature Review; Literature review of Ocean Acidification (OA) effects on phytoplankton. Biological and Chemical Oceanography Data Management Office (BCO-DMO). Version date 26 March 2015. URL: http://www.bco-dmo.org/dataset/554221.

 

Morris, J.J. 2015.  OA Literature Review; Literature review of Ocean Acidification (OA) effects on phytoplankton. Biological and Chemical Oceanography Data Management Office (BCO-DMO). Version date 26 March 2015. URL: http://dmoserv3.bco-dmo.org/jg/serv/BCO-DMO/P-ExpEv/OA_Lit_Review_03262015.html0

Note: the first two citations share the same URL (the metadata landing page); only the version date differs.

Note: the URL in the third citation goes to a specific data version rather than the landing page, which is not ideal.

So does each version need its own landing page?
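
One way to picture the linking described above, where each version points to its predecessor and its successor, is a doubly linked chain of version records. This is a sketch only: the `DatasetVersion` class and the helper functions are hypothetical and do not describe how BCO-DMO stores versions. The version dates and URLs are taken from the citation examples.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)   # compare by identity; records reference each other
class DatasetVersion:
    version_date: str
    url: str
    previous: Optional["DatasetVersion"] = None
    newer: Optional["DatasetVersion"] = None


def add_version(current: DatasetVersion, version_date: str, url: str) -> DatasetVersion:
    # The new version links back to the old one, and the old one forward to it.
    new = DatasetVersion(version_date, url, previous=current)
    current.newer = new
    return new


def latest(version: DatasetVersion) -> DatasetVersion:
    # Follow the forward links from any cited version to the most current one.
    while version.newer is not None:
        version = version.newer
    return version


v1 = DatasetVersion(
    "26 March 2015",
    "http://dmoserv3.bco-dmo.org/jg/serv/BCO-DMO/P-ExpEv/OA_Lit_Review_03262015.html0",
)
v2 = add_version(v1, "07 July 2015", "http://www.bco-dmo.org/dataset/554221")
print(latest(v1).version_date)
```

A reader arriving from an older citation can reach the current data, while the exact version used in an analysis remains retrievable.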

One Example of Subsetting

Subsetted URL:
http://data.bco-dmo.org/jg/serv/BCO/DOPUtilization/Biogeochemistry.html0?(CruiseId=X0804)%7C(CruiseId=X0705)

Full dataset URL: http://data.bco-dmo.org/jg/serv/BCO/DOPUtilization/Biogeochemistry.html0

Is this URL citable? Maybe. But it is disconnected from the metadata landing page.

This could be given a PID and cited with the landing page URL. But is there a need? Ultimately, most researchers are still subsetting data locally on their own machines.
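
The relationship between the two URLs above can be sketched with the standard library. The constraint syntax, parenthesized `field=value` terms joined by an encoded `|`, is inferred only from this one example URL; the function names are hypothetical and this is not a documented BCO-DMO interface.

```python
from urllib.parse import quote, unquote

FULL_DATASET_URL = (
    "http://data.bco-dmo.org/jg/serv/BCO/DOPUtilization/Biogeochemistry.html0"
)


def subset_url(base_url, field, values):
    # Join one "(field=value)" term per value with "|", then percent-encode
    # the separator ("|" becomes "%7C") while leaving "(", ")" and "=" as-is.
    constraint = "|".join(f"({field}={value})" for value in values)
    return f"{base_url}?{quote(constraint, safe='()=')}"


def describe_subset(url):
    # Split a subsetted URL back into the full-dataset URL and its terms,
    # e.g. to record a text description of the subset next to a citation.
    base, _, query = url.partition("?")
    terms = unquote(query).split("|") if query else []
    return base, terms


url = subset_url(FULL_DATASET_URL, "CruiseId", ["X0804", "X0705"])
print(url)
print(describe_subset(url))
```

Recovering the base URL from the subsetted one is what would let a citation pair the subset with the dataset's landing page.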

The Future?

Maas, Amy E., 2012. Pteropod respiration rates from NW Atlantic and NE Pacific; OC473 (2011) and NH1208 (2012). Biological and Chemical Oceanography Data System. BCO-DMO, WHOI. Version 1.0. PID: 123456. DOI: 10.1575/1912/6421.

 

Is this the type of citation BCO-DMO should be moving toward enabling?

Bringing it all together...

We (data managers, repositories, librarians, etc.) understand the importance of data citation and there is general consensus about how to do so.

 

Versioning should be accounted for, though many repositories have traditionally provided online access to only the most current version.

 

Citation of data subsets is complex. No one-size-fits-all solution. Different approaches for different communities of researchers?

Changing the Culture

(1) Information managers recognize the importance of data citation and have developed many recommendations.  
But are those being put to use by researchers?

 

(2) If we are not seeing basic citation of datasets in a standard format,
how can we encourage citation of versioned and subsetted data?

References

Altman, M. and King, G. (2007) A proposed standard for the scholarly citation of quantitative data. D-Lib Magazine 13(3/4). doi: 10.1045/march2007-altman

Data Citation Synthesis Group (2014) Joint Declaration of Data Citation Principles. Martone, M. (ed.) San Diego, CA: FORCE11. https://www.force11.org/datacitation

Rauber, A. et al. (2015) Data Citation of Evolving Data: Recommendations of the Working Group on Data Citation (WGDC). 9 June 2015. https://www.rd-alliance.org/system/files/documents/RDA-DC-Recommendations_150609.pdf

Starr, J. et al. (2015) Achieving human and machine accessibility of cited data in scholarly publications. PeerJ Comp Sci 1:e1 https://dx.doi.org/10.7717/peerj-cs.1

Tilmes, C. (2011) Data Identifiers and Citations Enable Reproducible Science. AGU Fall Meeting 2011. http://wiki.esipfed.org/images/b/b6/TilmesAGU.pdf

Black Boxes

Do we cite the bread? The toast? The toaster?
What about the toaster settings?

Presentation given at Summer ESIP. July 14, 2015.
