Legacy

Preservation and Scientific Software

Daina Bouquin

Harvard-Smithsonian Center for Astrophysics

daina.bouquin@cfa.harvard.edu

@dainabouquin

I'm a librarian.

Harvard University

Smithsonian Institution

 

Some things that I work on:

  • arXiv Next Generation IT Advisory Group
  • CfA Scientific Computation Advisory Committee
  • Harvard University Science Libraries Council
  • Mozilla Foundation Open Leaders Advisor
  • Software Preservation Network Steering Committee
  • Unified Astronomy Thesaurus Steering Committee

Semantics

The relationships between signifiers

and what they stand for in reality.

 

How we understand what something means.

Lexicon

Vocabulary of a person, language, or branch of knowledge.

 

(contains the signifiers)

Legacy

  1. Something superseded but difficult to replace.
  2. Something received from an ancestor or predecessor.
  3. Having a privilege or special status.

Copernicus, N. (1543). Nicolai Copernici Torinensis De revolutionibus orbium cœlestium libri vi. Norimbergae: Apud Ioh. Petreium.

Sometimes all three

  • Superseded but difficult to replace.
  • Received from an ancestor or predecessor.
  • Has a privilege or special status.

Galilei, G. (1610). Osservazioni e calcoli relativi ai Pianeti Medicei.

Galileo (67 years later)

 

Threatened with torture

 

Imprisoned for life

 

Burned his books

Galileo didn't know his chicken scratch would be important.

(Largely seen as the birth of observational astronomy and the scientific method)

 

People didn't care that much about Copernicus' model. 

(It was easy to dismiss)

Meaning is collective agreement about a specific thing at a specific time.

 

Semantic meaning is not static.

Humphrey, S.D. Multiple Exposures of the Moon: Nine Exposures, daguerreotype, 1849.

Sometimes it's more about privilege.

 

Earliest extant image of the moon.

 

There could have been other images of the moon.

Gift to the President of Harvard at the time.

(This is it on my desk.)

Provenance

means context

Daguerreotype "Recipe book"

Matters because of its relationship to the daguerreotype.

 

Provenance guides prioritization for curation.

 

Curation is work.

All objects need curation.

Everything will break.

Things need to be reformatted.

 

Entire fields are being developed in response:

Digital Forensics

Stabilizing and recovering data from digital media.

The creators of these objects did not need to care about the historic meaning of their work.

 

Provenance could be determined, so we gave these things meaning and prioritized them for curation.

 

We know what to call these things and

we know how to take care of them.

We don't have norms yet for how to give things like this semantic meaning.

 

  • Superseded.
  • Received from a predecessor.

 

Knowledge is more than books and articles.

I have very little provenance.

 

When does something like this matter?


Who decides?

 

How do we semantically link this to anything?

 

How would someone find it?

(What do I call it?)

Metadata

Mechanisms for modeling relationships between the information gathered from provenancial sources.

Schema

Logical framework where

semantic metadata can be recorded.

The fact that a thing exists in a place at a time does not give it meaning or make it identifiable.

 

I can describe this thing but give it little meaning.

 

Cultural norms prevent me from throwing this away.

(I would feel bad)

"I bet there's a paper."


A paper could provide some provenance.

 

Our schema should definitely have a field

where we can identify a relevant paper.
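A minimal sketch of what such a schema could look like, written in Python. The field names (`identifier`, `name`, `creators`, `related_paper`) are hypothetical and not taken from any existing standard; the point is only that the object gets its own record and the paper is one optional field within it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ObjectRecord:
    """Hypothetical schema for describing a research object."""

    identifier: str  # points at the object itself, not at a paper
    name: str  # what we decide to call the object
    creators: List[str] = field(default_factory=list)  # may differ from paper authors
    related_paper: Optional[str] = None  # identifier of a paper that gives provenance


# Placeholder values for illustration only.
record = ObjectRecord(identifier="object-001", name="Unlabeled object")
record.related_paper = "paper-identifier-if-found"
```

Keeping `creators` separate from `related_paper` leaves room for the people who made the object to differ from the people who wrote about it.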

 

Remember though:

  • It would take time and effort to find a paper.
  • If the paper exists, it is probably behind a paywall (privilege).
  • I might not be legally able to own or distribute the paper (publishing models).

Who got to be an author on the paper?

 

Who didn't?

 

Is the "author" of the paper identical to

the "author" of this thing?

 

Who gets credit?

This object is not a paper.

 

Disambiguation

We need to be able to directly identify the object to distinguish between the object and our sources of provenance.

 

What are the nodes in our

semantic network?

What if this thing were software?

Some Human Readable Metadata

What makes something citable?

I want you to have a scientific legacy.

 

Software will be the foundation on which future generations must build new knowledge.

 

Your work is someone's heritage.

Code is speech.

"It's on GitHub."

 

Just means it's in a place right now.

Identification

Unambiguous way to point at a specific thing in a specific place at a specific time.

Location

Where the thing you are pointing at is at a specific time.

Born Networked

 

Exists in many ways

in many places over time.

The daguerreotype is also on Pinterest.

This page doesn't exist there anymore.

It also didn't tell me where the real thing is.

Is it on my desk or in a vault?

URL
Uniform Resource Locator

 

Locations change.

Provenance changes.

Meaning changes.

 

Identification

attached to machine-actionable metadata

Identification uses an identifier: DOI, URI, Bibcode, arXiv ID, etc.

Location uses a locator: URL

URL

https://github.com/dfm/corner.py was https://github.com/dfm/triangle.py

Changes over time.

The meaning you are trying to express now will be different from what will be located at this URL later.

This is not what you cite because this has no unambiguous meaning.

Cite the DOI for the specific version of the thing you want to cite.

You already do this with papers.

This page has a URL: https://zenodo.org/record/53155

This page is an interface where metadata is displayed.

 

The metadata is stored

with the identifier (DOI).

 

The URL is just another piece of metadata.
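A toy sketch of that idea in Python: the identifier stays fixed while the URL is an editable field in the record. `ToyRegistry`, the `example:0001` identifier, and the `archive.example` URLs are all invented for illustration and do not describe how any real DOI service is implemented.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Record:
    title: str
    url: Optional[str]  # the locator is one more metadata field


class ToyRegistry:
    """Illustrative stand-in for an identifier registry (not a real service)."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}

    def mint(self, identifier: str, title: str, url: str) -> None:
        # An identifier is assigned once and never reused.
        if identifier in self._records:
            raise ValueError("identifier already minted")
        self._records[identifier] = Record(title=title, url=url)

    def update_location(self, identifier: str, url: Optional[str]) -> None:
        # Curating location metadata: only the URL field changes.
        self._records[identifier].url = url

    def resolve(self, identifier: str) -> str:
        record = self._records[identifier]
        if record.url is None:
            # No location left: return a notice instead of failing silently.
            return f"tombstone: '{record.title}' is no longer available"
        return record.url


registry = ToyRegistry()
registry.mint("example:0001", "Example software v1.0", "https://archive.example/records/1")
first = registry.resolve("example:0001")

registry.update_location("example:0001", "https://archive.example/moved/1")
second = registry.resolve("example:0001")

registry.update_location("example:0001", None)
third = registry.resolve("example:0001")
```

The same identifier resolves to three different answers over time, which is why the identifier, not the URL, is the thing to cite.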

DOIs are not magic

DOIs are resolvable.

They are bound to metadata.

They are minted by a registry responsible for curating location metadata.
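One way to see that a DOI is bound to metadata is to ask the resolver for metadata instead of a landing page. The sketch below only builds such a request with Python's standard library and does not send it. It assumes the DOI's registration agency supports content negotiation for the CSL JSON media type; the helper name `build_doi_metadata_request` is made up for this example, and the DOI is the Software Deposit guidance cited elsewhere in this deck.

```python
import urllib.request


def build_doi_metadata_request(doi: str) -> urllib.request.Request:
    """Build (but do not send) a request for a DOI's machine-readable metadata."""
    url = f"https://doi.org/{doi}"
    # Assumption: the registration agency honours this Accept header.
    headers = {"Accept": "application/vnd.citationstyles.csl+json"}
    return urllib.request.Request(url, headers=headers)


request = build_doi_metadata_request("10.5281/zenodo.1327325")

# Sending it requires network access, for example:
#   with urllib.request.urlopen(request) as response:
#       metadata = response.read()
```

Without the `Accept` header, the same URL would normally redirect a browser to the human-readable landing page.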

 

Resolves to a tombstone.

Archives mint identifiers and curate metadata to ensure your work is findable and has meaning that can change over time.

Summary: Identifiers let us point at things unambiguously and assign semantic meaning with metadata.

 

We can use metadata to make it clear that this is a record for software and not a paper.

ADS needs to index curated metadata about your work.

 

They can only work with the metadata they are given.

 

When we enrich metadata, new connections are possible.

Who does the work? 

 

Libraries and archives aren't the direct

stewards of your work anymore.

 

We need to be able to find your work though.

You need to be able to make informed choices about it.

Our bibliographies represent your work.

We need to work together.

Software authors:

  • You control your metadata.
  • You are your own cataloger.

We can give you tools but you need to make choices.

You need to know when you're making choices that will impact your legacy.

Two different papers.

(Not the code)

Software DOIs don't guarantee software citation

Complicated or conflicting author instructions.

You cannot assume archival repositories know what to ask you for.

 

Systems need to change.


People who write software

need to decide what matters.

 

 

But we have started to define our lexicon.

Citation File Format

A human- and machine-readable file format that provides citation metadata for software.
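A rough sketch of what a minimal CFF file contains, generated here from Python. The keys follow the Citation File Format 1.2.0 schema as I understand it (`cff-version`, `message`, `title` and `authors` are the required ones); every value, including the DOI, is a placeholder, and `minimal_cff` is a hypothetical helper rather than part of any CFF tooling. It also assumes the values contain no quotation marks that would need escaping.

```python
def minimal_cff(title, version, doi, date_released, authors):
    """Return the text of a small CITATION.cff file (YAML)."""
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it using these metadata."',
        f'title: "{title}"',
        f'version: "{version}"',
        f"doi: {doi}",
        f"date-released: {date_released}",
        "authors:",
    ]
    for family, given in authors:
        lines.append(f'  - family-names: "{family}"')
        lines.append(f'    given-names: "{given}"')
    return "\n".join(lines) + "\n"


# Placeholder values: replace them with the metadata for your own release.
cff_text = minimal_cff(
    title="example-package",
    version="1.0.0",
    doi="10.5281/zenodo.0000000",
    date_released="2019-01-01",
    authors=[("Doe", "Jane")],
)
print(cff_text)
```

The resulting text would be saved as `CITATION.cff` at the top level of the repository and deposited along with the release.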

CodeMeta

More than citation metadata.

CodeMeta uses JSON-LD (JSON for Linked Data)


Lets us translate our lexicon from one schema to another.

 

Enables interoperability and further contextualization.

 

Identifiers can be mapped to other identifiers.
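A small sketch of what translating between schemas can look like: a crosswalk from a few CFF-style keys to CodeMeta terms, emitted as JSON-LD. The mapping table is a simplified assumption covering only a handful of fields, and the input record uses placeholder values; the CodeMeta project's own crosswalk tables are the authority on how terms correspond.

```python
import json

# Simplified, assumed crosswalk: CFF key -> CodeMeta term.
CFF_TO_CODEMETA = {
    "title": "name",
    "version": "version",
    "doi": "identifier",
    "repository-code": "codeRepository",
}


def cff_to_codemeta(cff: dict) -> dict:
    """Translate a CFF-style dictionary into a CodeMeta JSON-LD document."""
    doc = {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
    }
    for cff_key, codemeta_key in CFF_TO_CODEMETA.items():
        if cff_key in cff:
            doc[codemeta_key] = cff[cff_key]
    # People are nodes too: each author becomes a typed Person object.
    doc["author"] = [
        {
            "@type": "Person",
            "givenName": person["given-names"],
            "familyName": person["family-names"],
        }
        for person in cff.get("authors", [])
    ]
    return doc


# Placeholder record, not a real deposit.
cff_record = {
    "title": "example-package",
    "version": "1.0.0",
    "doi": "10.5281/zenodo.0000000",
    "repository-code": "https://code.example/example-package",
    "authors": [{"family-names": "Doe", "given-names": "Jane"}],
}

codemeta = cff_to_codemeta(cff_record)
codemeta_json = json.dumps(codemeta, indent=2)
print(codemeta_json)
```

Because the output carries an `@context`, other systems can interpret `name`, `identifier` and `author` as shared terms rather than as local field labels.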

We're working on defining new metadata architectures, e.g. SigMF (Signal Metadata Format).

Hardware is provenance

SSI/Jisc Guidance for Software Deposit

Jackson, M. (2018b). Software Deposit: What to deposit (Version 1.0). http://doi.org/10.5281/zenodo.1327325

 

 

Example: Jupyter Notebooks

Bouquin, D., Hou, S., Benzing, M., Wilson, L. (2019). Jupyter Notebooks: A Primer for Curators (Version v1.0).

http://doi.org/10.5281/zenodo.2591580

Working on Guidance

(building discipline specific resources too)

Things you can do

right now

Software Authors

  • Mint a software DOI
    • deposit a release of your software and metadata files (Zenodo, Figshare, an institutional repository, etc.)
  • Create a CFF file (minimal metadata for identification)
  • License your data and code explicitly
  • Update and check your metadata
    • Check it again
  • Link documentation to the source code directly
  • Ensure that preferred citations and any attribution instructions give direct access to the software itself through your DOI.
  • If you have many versions of software, decide who the authors are for each version (also get ORCID iDs).
  • Cite archived software directly.

Article authors:

  • No one else will catch mistakes.
  • You are your own copy editor.

Article Authors

  • Unambiguous, direct software citation
    • If the preferred citation is not to the software, cite the software and the other thing.
    • Always cite the archival copy of the code when it exists
      • You might need to look for it.
  • Consider the version that you are citing.
    • Who are you trying to credit?
  • Put software citations in the references section
  • Cite your own code in a software paper
    • Tells others how you want it cited
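A short sketch of assembling a direct, versioned citation from a record's metadata. `format_citation` is a hypothetical helper, the layout simply mirrors the references used in this deck, and the example values come from the Jupyter Notebooks primer cited earlier, standing in for a software record.

```python
def format_citation(authors, year, title, version, doi):
    """Build a reference that names the version and points at the archived copy."""
    author_text = ", ".join(authors)
    return f"{author_text} ({year}). {title} (Version {version}). https://doi.org/{doi}"


citation = format_citation(
    authors=["Bouquin, D.", "Hou, S.", "Benzing, M.", "Wilson, L."],
    year=2019,
    title="Jupyter Notebooks: A Primer for Curators",
    version="v1.0",
    doi="10.5281/zenodo.2591580",
)
print(citation)
```

Including both the version and the DOI makes clear which release, and therefore which set of authors, is being credited.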

And yet it moves.

 

 

We have a complete history of nothing.

Some things get a legacy and some things don't.

Your work matters.
