Considerations for Sharing and Preserving Research Software and Data

Daina Bouquin

Head Librarian

Harvard-Smithsonian Center for Astrophysics

Scientific Legacy


Your work will be the foundation on which the next generation must build an improved understanding of how the Universe works


But we are creating holes in the scientific record


Take the perspective of an institution of memory

An example:

Machine Learning

Software is inseparable from "the data"


ML frameworks trade off exact numeric determinism for performance and often require remote computing resources


Even if you copy development steps there will be tiny differences in the end results
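A minimal sketch of where such tiny differences can come from, assuming only standard floating-point arithmetic (no ML framework is involved, and the helper name `running_sum` is made up for illustration). Floating-point addition is not associative, so a reduction whose order changes between runs, as can happen on parallel hardware, may not give bit-identical results:

```python
import random


def running_sum(values):
    """Add values left to right, the way a simple serial loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


# Floating-point addition is not associative: grouping changes the last bits.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right, abs(left - right))

# The same numbers summed in two different orders, standing in for a
# reduction that gets scheduled differently from one run to the next.
rng = random.Random(0)  # fixed seed so this illustration itself is repeatable
values = [rng.uniform(-1.0, 1.0) for _ in range(100_000)]
shuffled = values[:]
rng.shuffle(shuffled)
order_difference = abs(running_sum(values) - running_sum(shuffled))
print(order_difference)
```

The discrepancies are tiny in a single sum, but a training run chains a very large number of such operations, which is one plausible reason exact reproduction of a result is hard even with identical code and data.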


This is the future (emergent reality) of scientific research

Digital Forensics

Stabilizing and recovering data from digital media

Whole fields are being born and augmented in response

Best practices are still developing and will need to incorporate discipline-specific culture and values

Sharing your research is challenging in new ways.

This is not a "problem"


We need to acknowledge that changes are needed though

[Citation Needed]

  • need for a complete record of the research process
  • need to enable software discoverability
  • importance of research reproducibility
  • need to give credit to academic researchers at all levels for the software that they develop

Native data and software citation are vitally important, but what should be cited to properly give author(s) credit?


The astronomy community doesn't agree on how much someone should contribute to a code before that person is considered an author.

  • Astronomers will often cite a paper rather than natively citing code, whether or not the code they want to cite is the same version as the code discussed in the paper
    • May contribute to software paper authors receiving disproportionate credit and current contributors not receiving any
  • Acknowledgement for contributions that might not fit a definition of "authorship"
    • giving all contributors equal credit as authors may serve to dilute the perceived importance of authoring software


Complicates authorship issues and issues pertaining to dependencies and documentation (metadata)


How do we deal with multiple forks? 


How should citations be calculated across different types of digital objects and versions of those objects?



e.g. IDL, IRAF, MATLAB, etc.


Restrictive licenses and no on-going support


We're still defining what "Fair Use" is in this landscape


Proceedings underway regarding filings with the US Copyright Office for Anti-Circumvention Exemption 


  • Papers 

    • Some publishers have more comprehensive citation/publishing policies (e.g. AAS)
  • GitHub/Zenodo integration

    • Native software citation
    • Versioned DOIs
  • Journal of Open Source Software

    • Peer review of code
    • Establishing partnership with AAS Publishing
  • ASCL

    • Index but no persistent IDs


NSF does not require software management in the same way it requires data management

NASA does not specify software requirements (yet) but explicitly requires "data management"

Code requirements "are governed by guidance at the directorate, division, and program levels"

Investigators are "encouraged to consult with the cognizant program officer"

a backup is not an archive

Long-term persistence and the development/improvement of metadata standards are essential
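One practice commonly associated with archives rather than plain backups is fixity checking: recording a checksum for every file and re-verifying it over time. The following is a minimal sketch under that assumption, using only the Python standard library. The function names and the sample file `spectrum.dat` are hypothetical, and a real archive would also store the manifest separately from the data it describes.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root, manifest):
    """List files whose current checksum differs from the recorded one."""
    current = build_manifest(root)
    return sorted(name for name in manifest if current.get(name) != manifest[name])


# Demonstration on a throwaway directory with one hypothetical data file.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "spectrum.dat"
    data_file.write_text("1.0 2.0 3.0\n")
    manifest = build_manifest(tmp)
    unchanged = changed_files(tmp, manifest)   # nothing has been modified yet
    data_file.write_text("1.0 2.0 3.1\n")      # simulate silent corruption
    altered = changed_files(tmp, manifest)

print(unchanged, altered)
```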

Reality hits Re: reproducibility

If the cost of replicability was 1x (or more) the cost of the original work...

How do we balance this cost vs. the lost opportunity of doing new research?

Advocacy and community culture/values need to be incorporated into goals

Long term preservation is not typically the role of research computing groups


  • Scalable institutional support

  • consideration for long term curatorial needs


Must be developed in collaboration with institutions of memory and stakeholders throughout the scientific "lifecycle"

Big vs. Small(er) Projects

Having a well-funded archive and team of researchers helps to make all needed information artifacts accessible and usable. Many projects, though, do not have these resources, and many have even more data or more dynamic software.

Who takes responsibility for managing data/code long term?

"Best Practices" and standards will differ between domains and missions

NASA/CXC/PSU/L.Townsley et al

Ongoing Projects

(heroic efforts)


FAIR Software too






Metadata for Software

  • credit for academic software
    • citation metadata
  • replicate some analysis
    • versions and dependencies
  • discover software you don’t already know
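As a rough illustration of how one metadata record could serve all three purposes above (credit, replication, discovery), here is a sketch that writes a small software description as JSON. The field names follow my understanding of the CodeMeta 2.0 vocabulary and should be checked against the current schema. Every value (package name, version, DOI, ORCID, author, dependencies) is a made-up placeholder.

```python
import json

# All values below are hypothetical placeholders for illustration only.
record = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    # Discovery: a name and description that a search index can pick up.
    "name": "example-spectra-tools",
    "description": "Placeholder description of what the software does.",
    # Credit: who wrote it and how to cite this exact release.
    "author": [
        {
            "@type": "Person",
            "@id": "https://orcid.org/0000-0000-0000-0000",  # placeholder ORCID
            "givenName": "Ada",
            "familyName": "Example",
        }
    ],
    "identifier": "https://doi.org/10.5281/zenodo.0000000",  # placeholder DOI
    "license": "https://spdx.org/licenses/MIT",
    # Replication: the version and the dependencies it was run with.
    "version": "1.2.0",
    "softwareRequirements": ["numpy>=1.20", "astropy>=4.0"],
}

codemeta_json = json.dumps(record, indent=2)
print(codemeta_json)
```

Keeping a file like this in the repository next to the code means the citation, version, and dependency information travels with every fork and archived release.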

Software Preservation Network


Software Sustainability Institute

US Research Software Sustainability Institute

Things you can do right now

  • License your data and code openly
  • Create persistent IDs for data and code (and authors!)
  • Think critically about code authorship
    • Make these "non-traditional" research artifacts important in career advancement settings
  • Incorporate code into your DMP
  • Use an open language/open dev protocols
  • Learn about and try containerizing your code (Docker)
  • If you have proprietary code consider options for emulation
  • Advocate for computing and archiving infrastructure
  • Advocate for publishing practices/policies that appropriately respond to changing landscape
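A small step short of full containerization is to record the software environment an analysis actually ran in and archive that record with the results. The sketch below assumes a Python-based workflow and uses only the standard library. The function name `snapshot_environment` and the shape of the output are illustrative choices, not an established standard.

```python
import json
import platform
from importlib import metadata


def snapshot_environment():
    """Collect the interpreter, platform, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with missing or broken metadata
            packages[name] = dist.version
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


snapshot = snapshot_environment()
snapshot_json = json.dumps(snapshot, indent=2)
print(snapshot_json[:300])  # preview; the full text would be saved with the results
```

A record like this does not make the environment rebuildable on its own, but it tells a future reader which versions and dependencies to look for, and it is cheap enough to generate on every run.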



Open Scientific Culture


"[...] as much about getting consensus on the best practices and educating the community as it will be about the tools we come up with"

Pete Warden


Learn to problem solve in this landscape and advocate for resources and infrastructure to support your goals


(Libraries can and should help)

Special Invited Talk at the 15th International HITRAN (high-resolution transmission molecular absorption database) Conference, 13-15 June 2018. Abstract:
