How to Lose a Legacy

Software Citation in Astronomy

Daina Bouquin

Harvard-Smithsonian Center for Astrophysics

daina.bouquin@cfa.harvard.edu

@dainabouquin

I'm a librarian.

Harvard University

Smithsonian Institution

 

Some things that I work on:

  • arXiv Next Generation IT Advisory Group
  • CfA Scientific Computation Advisory Committee
  • Mozilla Foundation Open Leaders Advisor
  • Software Preservation Network Steering Committee
  • Unified Astronomy Thesaurus Steering Committee

Semantics

The relationships between signifiers

and what they stand for in reality.

 

How we understand what something means.

Lexicon

Vocabulary of a person, language, or branch of knowledge.

 

(contains the signifiers)

Legacy

  1. Something superseded but difficult to replace.
  2. Something received from an ancestor or predecessor.
  3. Having a privilege or special status.

Copernicus, N. (1543). Nicolai Copernici Torinensis De revolutionibus orbium cœlestium libri vi. Norimbergae: Apud Ioh. Petreium.

Sometimes all three

  • Superseded but difficult to replace.
  • Received from an ancestor or predecessor.
  • Has a privilege or special status.

Galilei, G. (1610). Osservazioni e calcoli relativi ai Pianeti Medicei.

Galileo (67 years later)

 

Threatened with torture

 

Imprisoned for life

 

Burned his books

Galileo didn't know his chicken scratch would be important.

(Largely seen as the birth of observational astronomy and the scientific method)

 

People didn't care that much about Copernicus' model. 

(It was easy to dismiss)

Meaning is collective agreement about a specific thing at a specific time.

 

Semantic meaning is not static.

Humphrey, S.D. (1849). Multiple Exposures of the Moon: Nine Exposures, daguerreotype. http://id.lib.harvard.edu/images/olvwork124646/catalog

Sometimes it's more about privilege.

 

Earliest image of the moon extant.

earliest surviving image

 

Now it's art.

Provenance

means context

Daguerreotype "Recipe book"

Matters because of its relationship to astro daguerreotypes.

 

Provenance guides prioritization for curation.

 

Curation is work.

Metadata

Mechanisms for modeling relationships between the information gathered from provenancial sources.

The creators of these objects did not need to care about the historic meaning of their work.

 

Provenance could be determined, so we gave these things meaning and prioritized them for curation.

 

We know what to call these things and

we know how to take care of them.

 

These items are part of your astronomical heritage.

 

We don't have norms yet for how to give things like this semantic meaning.

 

  • Superseded.
  • Received from a predecessor.
  • Privileged.

 

Knowledge is more than books and articles.

The fact that a thing exists in a place at a time does not give it meaning or make it identifiable.

 

I can describe this thing but give it little meaning.

 

Cultural norms prevent me from throwing this away.

(I would feel bad)

"I bet there's a paper."


A paper could provide some provenance.

 

Our record should definitely have a field

where we can identify a relevant paper.

 

Remember though:

  • It would take time and effort to find a paper.
  • If the paper exists, it is probably behind a paywall (privilege).
  • I might not be legally able to own or distribute the paper (publishing models).

Who got to be an author on the paper?

 

Who didn't?

 

Is the author of the paper identical to

the creator of this thing?

 

Who gets credit?

This object is not a paper.

 

Disambiguation

We need to be able to directly identify the object to distinguish between the object and our sources of provenance.

 

What are the nodes in our

semantic network?

 

What does this have to do with citations?

What if I wanted to cite the daguerreotype?

 

give credit to Samuel Dwight Humphrey

(the photographer)

 

help someone else find the daguerreotype

(it's a physical thing)

 

expand the object's semantic network
(allow its meaning to change)

 

I would not cite

Humphrey's handbook

Not the thing I want to cite

 

Wrong year

 

Humphrey gets credit, but

not for his daguerreotype of the moon

Humphrey, S. D. (1858). American Hand Book of the Daguerreotype. (5th ed.)

Humphrey, S.D. (1849). Multiple Exposures of the Moon: Nine Exposures, daguerreotype. http://id.lib.harvard.edu/images/olvwork124646/catalog

What if we were talking about software?

 

Software was not valued the way that papers and data are

(still are not)

but people wanted to give

software credit and software authors wanted credit, so they hacked the system.

No one writes "IRREPLACEABLE" on all the important software and gives it to a librarian.

(and that's never going to happen)

 "preferred citations"

Citing Something Else

Remember why you wouldn't do this for the daguerreotype?

 

authorship ambiguity

Different software versions have different authors. How many papers would the software authors need to write?

 

makes locating software more difficult over time

Links break (if they exist at all)

 

can put open source documentation behind paywalls

Remember privilege? Not all software papers are OA

 

Software citations are made indistinguishable

from citations made for other purposes

Mentioning Software Doesn't Work Either

(acknowledgements)

 

Machines can't find these types of "citations"

(humans can just read them)

Titles are ambiguous  

(software also has many "aliases")

 

ADS search for software called "Stingray" returns papers about:

  • the Corvette
  • actual stingrays (i.e., animals)
  • stingray-shaped objects
  • the Stingray Nebula
  • multiple instruments named "STINGRAY"
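
A minimal sketch of why title matching fails, using made-up records. The titles, identifiers, and field names below are hypothetical and are not real ADS results. A search on a name returns everything that shares the word, while a lookup on an identifier returns exactly one thing.

```python
# Hypothetical records; none of these are real ADS entries or real identifiers.
RECORDS = [
    {"id": "example:0001", "title": "Stingray: timing software", "kind": "software"},
    {"id": "example:0002", "title": "Aerodynamics of the Corvette Stingray", "kind": "article"},
    {"id": "example:0003", "title": "Feeding behavior of stingrays", "kind": "article"},
    {"id": "example:0004", "title": "The Stingray Nebula", "kind": "article"},
]


def search_by_title(records, word):
    """Return every record whose title contains the word (case-insensitive)."""
    word = word.lower()
    return [r for r in records if word in r["title"].lower()]


def lookup_by_id(records, identifier):
    """Return the single record with this identifier, or None if absent."""
    for r in records:
        if r["id"] == identifier:
            return r
    return None


hits = search_by_title(RECORDS, "stingray")
exact = lookup_by_id(RECORDS, "example:0001")
print(len(hits), "title matches; identifier match:", exact["title"])
```

The title search cannot tell the software apart from the car, the animal, or the nebula. The identifier lookup can.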

"It's on GitHub."

 

just means it's in a place right now

What about pointing to the software's location?

(the repo)

This page doesn't exist there anymore.

 

URL
Uniform Resource Locator

 

Locations change.

Provenance changes.

Metadata changes.

Meaning changes.

 

URL

https://github.com/dfm/corner.py

was

https://github.com/dfm/triangle.py

 

 

Changes over time.

 

The meaning you are trying to express now will be different from what will be located at this URL later.

 

This is not what you cite because

it is fragile and has no unambiguous meaning.

https://github.com/dfm/triangle.py

This page has a URL: https://zenodo.org/record/53155

This page is an interface where metadata is displayed.

 

The metadata is stored

with the identifier (DOI).

 

The URL is just another piece of metadata.

Identification

Unambiguous way to point at a specific thing in a specific place at a specific time.

 

(DOI, URI, Bibcode, arXiv ID, etc.)

Location

Where the thing you are pointing at is at a specific time.

 

 

(URL)
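
One way to picture the difference, as a simplified sketch rather than how any particular archive works: the identifier is the key that never changes, and the URL is a metadata field that can be updated. The class, the registry, and the identifier string below are placeholders invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ArchivedRecord:
    identifier: str  # persistent; never reassigned
    title: str
    url: str  # current location; just another piece of metadata
    former_urls: list = field(default_factory=list)

    def relocate(self, new_url):
        """Record a move: the location changes, the identifier does not."""
        self.former_urls.append(self.url)
        self.url = new_url


# Placeholder identifier, not a real DOI.
record = ArchivedRecord(
    identifier="doi:10.0000/example.1",
    title="triangle.py",
    url="https://github.com/dfm/triangle.py",
)
registry = {record.identifier: record}


def resolve(identifier):
    """Return wherever the identified thing lives right now."""
    return registry[identifier].url


record.relocate("https://github.com/dfm/corner.py")
print(resolve("doi:10.0000/example.1"))
```

A citation that stored only the old URL would now point at nothing. A citation that stored the identifier still resolves.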

Archives mint identifiers
and curate metadata to ensure your work is findable and has meaning that can change over time.

I want you to have a scientific legacy.

 

Software will be the foundation on which future generations must build new knowledge.

 

Your work is someone's heritage.

  • You control your metadata.

  • You are your own cataloger.

software authors

CodeMeta

Machine-actionable metadata about your software.

https://codemeta.github.io/index.html

 

Creating a CodeMeta file gives your software provenance, so when you deposit it in an archive, the archive knows how to interpret it and take care of it.
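
As a rough sketch of what such a file can contain, the snippet below builds a small codemeta.json with Python's standard json module. The property names come from the CodeMeta 2.0 vocabulary. Every value (software name, author, ORCID iD, repository, DOI) is a placeholder.

```python
import json

# Every value below is a placeholder; replace with your software's real details.
codemeta = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": "examplepkg",
    "version": "1.0.0",
    "identifier": "https://doi.org/10.0000/example.1",
    "codeRepository": "https://example.org/examplepkg",
    "license": "https://spdx.org/licenses/MIT",
    "author": [
        {
            "@type": "Person",
            "@id": "https://orcid.org/0000-0000-0000-0000",
            "givenName": "Ada",
            "familyName": "Author",
        }
    ],
}

# Serialize to text; saving this text as codemeta.json in the repository is the usual step.
codemeta_text = json.dumps(codemeta, indent=2)
print(codemeta_text)
```

The CodeMeta file generator listed in the resources produces the same kind of file through a web form.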

CodeMeta uses JSON-LD
(JSON linked data)


Lets us translate our lexicon from one schema to another.

 

Enables interoperability and further contextualization.

 

Identifiers can be mapped to other identifiers.
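
Translating a lexicon can be sketched as a crosswalk: a table that says which term in one schema corresponds to which term in another. The target field names below belong to an imagined archive schema and are assumptions for illustration only, as are the record values.

```python
# Hypothetical crosswalk: CodeMeta term -> field name in an imagined archive schema.
CROSSWALK = {
    "name": "title",
    "version": "version",
    "identifier": "doi",
    "codeRepository": "related_url",
}


def translate(record, crosswalk):
    """Rename keys according to the crosswalk and report keys with no mapping."""
    translated, unmapped = {}, []
    for key, value in record.items():
        if key in crosswalk:
            translated[crosswalk[key]] = value
        else:
            unmapped.append(key)
    return translated, unmapped


# Placeholder record using CodeMeta property names.
codemeta_record = {
    "name": "examplepkg",
    "version": "1.0.0",
    "identifier": "https://doi.org/10.0000/example.1",
    "programmingLanguage": "Python",
}

archive_record, unmapped = translate(codemeta_record, CROSSWALK)
print(archive_record)
print("no mapping for:", unmapped)
```

Reporting the unmapped terms instead of dropping them silently keeps the gaps between schemas visible.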

Things you can do

right now

Software Authors

Cont.

  • Preferred citations/instructions about attribution
    • Make sure these enable direct access to the software using your DOI
  • Authorship
    • If you have many versions of software, decide who the authors are for each version (also get ORCID iDs).

Article Authors

  • Unambiguous, direct software citation
    • If the preferred citation is not to the software, cite the software and the other thing.
    • Always cite the archival copy of the code when it exists
      • You might need to look for it, so just do a quick check before you cite a development repo
      • If you need to cite the repo, cite a specific release  
  • Consider the version that you are citing.
    • Who are you trying to give credit to?
  • Put software citations in the references section
  • Cite your own code in a software paper
    • tells others how you want it cited
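
The points above about versions and credit can be sketched as follows. The package, authors, archive name, and DOIs are placeholders, and the reference format is only one plausible style, not a required one.

```python
# Placeholder metadata for two archived releases of an imaginary package.
RELEASES = {
    "1.0.0": {"authors": ["Author, A."], "year": 2018, "doi": "10.0000/example.1"},
    "2.0.0": {
        "authors": ["Author, A.", "Coder, B."],
        "year": 2020,
        "doi": "10.0000/example.2",
    },
}


def cite(name, version, releases, archive="Example Archive"):
    """Build a reference-list entry for one specific archived release."""
    release = releases[version]
    authors = ", ".join(release["authors"])
    return (
        f"{authors} ({release['year']}). {name} (Version {version}) "
        f"[Computer software]. {archive}. https://doi.org/{release['doi']}"
    )


print(cite("examplepkg", "1.0.0", RELEASES))
print(cite("examplepkg", "2.0.0", RELEASES))
```

Citing version 1.0.0 credits one person and citing version 2.0.0 credits two, which is why the version you cite matters.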

A case study - Credit Lost

Two Decades of Software Citation in Astronomy

https://doi.org/10.3847/1538-4365/ab7be6

CodeMeta file generator

https://codemeta.github.io/codemeta-generator/

SSI Guidance for Archiving Software

http://doi.org/10.5281/zenodo.1327325

Archiving software using Zenodo/GitHub

https://guides.github.com/activities/citable-code/

Software Citation Checklist

http://doi.org/10.5281/zenodo.3479199

In-text software citation examples

https://www.astrobetter.com/blog/2019/07/01/citing-astronomy-software-inline-text-examples/

And yet it moves.

 

 

We have a complete history of nothing.

 

Some things get a legacy and some things don't.

 

Your work matters.
