The future of scholarly publication: automated, transparent, and open

Daniel Himmelstein (@dhimmel)

HighWire’s Lunch & Learn

IET London: Savoy Place

June 14th, 2019 11:15 AM

slides released under CC BY 4.0

abstract (website)

  • Focus: Modernisation, automation, machine-readability, voice search, real-time interaction, discovery
  • Key takeaway: Each pain point in the current publishing process is an opportunity that new technology and workflows can solve

 

Abstract

Scholarly publishing is far from perfect, but nonetheless plays a crucial role in the dissemination of knowledge. How can we modernise publishing to increase its benefits while decreasing its inefficiency?

Daniel will discuss how publishing can be automated, reducing the imperfections and delays that manual steps inevitably cause. In addition, we’ll discuss machine-readability, a precursor to effective voice search, as well as living literature that users can interact with and improve in real time.
What is the ideal system for scholarly publication in the future? How can publishers use automation and standards to ensure machine-readability and increased interaction with scholarly literature?

 

About the Speaker
Daniel is a data scientist at the University of Pennsylvania. He performs large-scale data analysis to uncover trends in scholarly publishing. For example, Daniel has investigated the time from submission to publication at thousands of journals, what percent of the literature is in Sci-Hub, how preprints are licensed, and what bibliographic styles journals have applied over time. Currently, Daniel researches human disease and leads development of Manubot, a tool for open scholarly writing on GitHub.

every problem is an opportunity

publishing is far from perfect

Problem: Publishing Times

Illustration by Matt Murphy.

© Nature News doi.org/f3mn4t

[Figure panels: submission to acceptance; acceptance to publication]

https://blog.dhimmel.com/history-of-delays/

https://blog.dhimmel.com/plos-and-publishing-delays
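
The two delays on these slides (submission to acceptance, acceptance to publication) can be computed from three dates per article. A minimal Python sketch, assuming a hypothetical `delays` helper and made-up example dates; none of the values below come from the linked blog posts:

```python
from datetime import date
from statistics import median


def delays(submitted: date, accepted: date, published: date) -> tuple[int, int]:
    """Return (acceptance delay, publication delay) in days."""
    return (accepted - submitted).days, (published - accepted).days


# Illustrative dates only, not real article histories.
articles = [
    (date(2018, 1, 10), date(2018, 4, 20), date(2018, 5, 5)),
    (date(2018, 2, 1), date(2018, 8, 1), date(2018, 8, 15)),
    (date(2018, 3, 15), date(2018, 6, 13), date(2018, 7, 13)),
]

acceptance_delays = [delays(*article)[0] for article in articles]
publication_delays = [delays(*article)[1] for article in articles]

# Medians are less sensitive than means to a few very slow articles.
median_acceptance = median(acceptance_delays)
median_publication = median(publication_delays)
```

Applied per journal and per year, the same two numbers would give the kind of trend lines discussed in the posts above.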

solution: automation

when authors submit a manuscript, can they immediately be shown a fully rendered proof? 

formatting problems: let the authors fix them. suggest changes from automated checks.

lossless submission

Reproducible Document Stack: towards a scalable solution for reproducible articles
Giuliano Maciocci, Emmy Tsang, Nokome Bentley and Michael Aufreiter

eLife Labs (2019-05-22)

As a first step, eLife aims to publish reproducible articles as companions of already accepted papers. We will endeavour to accept submissions of reproducible manuscripts in the form of DAR files by the end of 2019.

single shared source

The Deep Review

  • review article on deep learning in precision medicine
  • 27 authors from 20 different institutions
  • readers appreciate the breadth of perspectives

most viewed bioRxiv preprint of 2017

citations, references & bibliographies

problem:

bibliographic busywork

What is a persistent identifier?

a long-lasting, standardized reference to a citable work

Vision

The only manual bibliographic step in the publication workflow, from authoring to production, is when an author chooses which work to cite.

citation by persistent identifier

This is a sentence with 5 citations [
  @doi:10.1038/nbt.3780;
  @pmid:29424689;
  @pmcid:PMC5938574;
  @arxiv:1407.3561;
  @url:https://greenelab.github.io/meta-review/
].

References

  1. Reproducibility of computational workflows is automated using continuous analysis
    Brett K Beaulieu-Jones, Casey S Greene
    Nature Biotechnology (2017-03-13) https://doi.org/f9ttx6
    DOI: 10.1038/nbt.3780 · PMID: 28288103 · PMCID: PMC6103790
     
  2. Sci-Hub provides access to nearly all scholarly literature.
    Daniel S Himmelstein, Ariel Rodriguez Romero, Jacob G Levernier, Thomas Anthony Munro, Stephen Reid McLaughlin, Bastian Greshake Tzovaras, Casey S Greene
    eLife (2018-03-01) https://www.ncbi.nlm.nih.gov/pubmed/29424689
    DOI: 10.7554/elife.32822 · PMID: 29424689 · PMCID: PMC5832410
     
  3. Opportunities and obstacles for deep learning in biology and medicine
    Travers Ching, Daniel S. Himmelstein, Brett K. Beaulieu-Jones, Alexandr A. Kalinin, Brian T. Do, Gregory P. Way, Enrico Ferrero, Paul-Michael Agapow, Michael Zietz, Michael M. Hoffman, … Casey S. Greene
    Journal of the Royal Society Interface (2018-04) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5938574/
    DOI: 10.1098/rsif.2017.0387 · PMID: 29618526 · PMCID: PMC5938574
     
  4. IPFS - Content Addressed, Versioned, P2P File System
    Juan Benet
    arXiv (2014-07-14) https://arxiv.org/abs/1407.3561v1
     
  5. Open collaborative writing with Manubot
    Daniel S. Himmelstein, David R. Slochower, Venkat S. Malladi, Casey S. Greene, Anthony Gitter
    (2018-08-03) https://greenelab.github.io/meta-review/
This is a sentence with 5 citations [1,2,3,4,5].
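
How the source sentence becomes the numbered one can be sketched in a few lines of Python. This is an illustration only, not Manubot's actual implementation; the `number_citations` function and the rule that a key ends at whitespace, `;` or `]` are assumptions made for this example:

```python
import re

# Matches keys such as @doi:10.1038/nbt.3780 or @url:https://example.org/
# Assumption: a key runs until whitespace, ';' or ']'.
CITATION_KEY = re.compile(r"@(doi|pmid|pmcid|arxiv|url):([^\s;\]]+)")


def number_citations(text: str) -> tuple[str, list[str]]:
    """Replace each bracketed group of citation keys with reference numbers."""
    order: list[str] = []  # citation keys in order of first appearance

    def replace_group(match: re.Match) -> str:
        numbers = []
        for key in CITATION_KEY.finditer(match.group(0)):
            citekey = f"{key.group(1)}:{key.group(2)}"
            if citekey not in order:
                order.append(citekey)
            numbers.append(str(order.index(citekey) + 1))
        # Leave brackets without a recognised key untouched.
        return "[" + ",".join(numbers) + "]" if numbers else match.group(0)

    # A group is a bracket pair containing an '@' and no nested brackets.
    rendered = re.sub(r"\[[^\[\]]*@[^\[\]]*\]", replace_group, text)
    return rendered, order
```

The returned `order` list is what a later step would resolve into bibliographic metadata, so the author never types a reference by hand.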

Goals of citation by persistent ID

  • unambiguous references
  • lossless publishing workflows
  • easy retrieval of cited works
  • automated metadata generation
  • automated bibliographies
  • machine readability
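
As a rough illustration of an automated bibliography, the sketch below renders one reference from a CSL JSON style record into the three-line layout used on the References slide. The `format_reference` helper is hypothetical; the field names (`title`, `author`, `container-title`, `issued`, `URL`) follow CSL JSON, and the example values are copied from reference 1 above:

```python
def format_reference(item: dict) -> str:
    """Render a CSL JSON style record as title, authors, venue/date/URL lines."""
    authors = ", ".join(
        f"{a.get('given', '')} {a.get('family', '')}".strip()
        for a in item.get("author", [])
    )
    date_parts = item.get("issued", {}).get("date-parts", [[]])[0]
    # Zero-pad month and day so [2017, 3, 13] renders as 2017-03-13.
    issued = "-".join(f"{part:02d}" for part in date_parts)
    venue = item.get("container-title", "")
    last_line = f"{venue} ({issued}) {item.get('URL', '')}".strip()
    return "\n".join([item["title"], authors, last_line])


reference = format_reference(
    {
        "title": "Reproducibility of computational workflows is automated "
        "using continuous analysis",
        "author": [
            {"given": "Brett K", "family": "Beaulieu-Jones"},
            {"given": "Casey S", "family": "Greene"},
        ],
        "container-title": "Nature Biotechnology",
        "issued": {"date-parts": [[2017, 3, 13]]},
        "URL": "https://doi.org/f9ttx6",
    }
)
```

In a real workflow the record would be retrieved from the persistent identifier rather than typed in, and a CSL processor would apply the journal's style.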

science is continuous

publications are fixed

my first paper

Erratum:

After publication of this article [1], it has been noticed that Figs. 1 and 3 (Figs. 1 and 2 respectively here) had been incorrectly reverted in the original article [1].

Using PeerJ's comment feature to flag an error

hypothes.is journal integration

the future: continuous but versioned

“Finally, we estimate that over a six-month period in 2015–2016, Sci-Hub provided access for 99.3% of valid incoming requests.”

— DOI: 10.7287/peerj.preprints.3100v1

“In the first version of this study, we mistakenly treated the log events as requests rather than downloads. Fortunately, Sci-Hub reviewed the preprint in a series of tweets, and pointed out the error…”

— DOI: 10.7287/peerj.preprints.3100v2

Timestamped on the Bitcoin blockchain via OpenTimestamps

timestamping
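
Timestamping a manuscript starts from a cryptographic digest of the file, not the file itself. The sketch below shows only that first step using Python's standard library; it does not call any OpenTimestamps client, and the `manuscript_digest` name and sample bytes are made up for illustration:

```python
import hashlib


def manuscript_digest(content: bytes) -> str:
    """Return the SHA-256 digest of a manuscript version as a hex string."""
    return hashlib.sha256(content).hexdigest()


# Hypothetical manuscript contents; any one-byte change alters the digest.
version_1 = manuscript_digest(b"Sci-Hub provided access for most requests.")
version_2 = manuscript_digest(b"Sci-Hub provided access for most requests!")
versions_differ = version_1 != version_2
```

Because each version has its own digest, a timestamp on the digest shows that this exact version existed by that time.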

(update slide post this issue)

insufficient visibility & discoverability

Beyond the PDF First Day Notes

By De Jongens van de Tekeningen

Licensed under CC BY 3.0

Modified to invert colors

<meta name="DC.Format" content="text/html" />
<meta name="DC.Language" content="en" />
<meta name="DC.Title" content="Tracking the popularity and outcomes of all bioRxiv preprints" />
<meta name="DC.Identifier" content="10.1101/515643" />
<meta name="DC.Date" content="2019-04-02" />
<meta name="DC.Publisher" content="Cold Spring Harbor Laboratory" />
<meta name="DC.Rights" content="© 2019, Posted by Cold Spring Harbor Laboratory. This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at http://creativecommons.org/licenses/by/4.0/" />
<meta name="DC.AccessRights" content="restricted" />
<meta name="DC.Description" content="Researchers in the life sciences are posting work to preprint servers at an unprecedented and increasing rate, sharing papers online before (or instead of) publication in peer-reviewed journals. Though the increasing acceptance of preprints is driving policy changes for journals and funders, there is little information about their usage. Here, we collected and analyzed data on all 37,648 preprints uploaded to bioRxiv.org, the largest biology-focused preprint server, in its first five years. We find preprints are being downloaded more than ever before (1.1 million tallied in October 2018 alone) and that the rate of preprints being posted has increased to a recent high of 2,100 per month. We also find that two-thirds of preprints posted before 2017 were later published in peer-reviewed journals, and find a relationship between journal impact factor and preprint downloads. Lastly, we developed Rxivist.org, a web application providing multiple ways of interacting with preprint metadata." />
<meta name="DC.Contributor" content="Richard J. Abdill" />
<meta name="DC.Contributor" content="Ran Blekhman" />
<meta name="article:published_time" content="2019-04-02" />
<meta name="article:section" content="New Results" />
<meta name="citation_title" content="Tracking the popularity and outcomes of all bioRxiv preprints" />
<meta name="citation_abstract" lang="en" content="&lt;h3&gt;Abstract&lt;/h3&gt;
&lt;p&gt;Researchers in the life sciences are posting work to preprint servers at an unprecedented and increasing rate, sharing papers online before (or instead of) publication in peer-reviewed journals. Though the increasing acceptance of preprints is driving policy changes for journals and funders, there is little information about their usage. Here, we collected and analyzed data on all 37,648 preprints uploaded to bioRxiv.org, the largest biology-focused preprint server, in its first five years. We find preprints are being downloaded more than ever before (1.1 million tallied in October 2018 alone) and that the rate of preprints being posted has increased to a recent high of 2,100 per month. We also find that two-thirds of preprints posted before 2017 were later published in peer-reviewed journals, and find a relationship between journal impact factor and preprint downloads. Lastly, we developed Rxivist.org, a web application providing multiple ways of interacting with preprint metadata.&lt;/p&gt;" />
<meta name="citation_journal_title" content="bioRxiv" />
<meta name="citation_publisher" content="Cold Spring Harbor Laboratory" />
<meta name="citation_publication_date" content="2019/01/01" />
<meta name="citation_mjid" content="biorxiv;515643v2" />
<meta name="citation_id" content="515643v2" />
<meta name="citation_public_url" content="https://www.biorxiv.org/content/10.1101/515643v2" />
<meta name="citation_abstract_html_url" content="https://www.biorxiv.org/content/10.1101/515643v2.abstract" />
<meta name="citation_full_html_url" content="https://www.biorxiv.org/content/10.1101/515643v2.full" />
<meta name="citation_pdf_url" content="https://www.biorxiv.org/content/biorxiv/early/2019/04/02/515643.full.pdf" />
<meta name="citation_doi" content="10.1101/515643" />
<meta name="citation_num_pages" content="65" />
<meta name="citation_article_type" content="Article" />
<meta name="citation_section" content="New Results" />
<meta name="citation_firstpage" content="515643" />
<meta name="citation_author" content="Richard J. Abdill" />
<meta name="citation_author_institution" content="Department of Genetics, Cell Biology, and Development, University of Minnesota" />
<meta name="citation_author_orcid" content="http://orcid.org/0000-0001-9565-5832" />
<meta name="citation_author" content="Ran Blekhman" />
<meta name="citation_author_institution" content="Department of Genetics, Cell Biology, and Development, University of Minnesota" />
<meta name="citation_author_institution" content="Department of Ecology, Evolution, and Behavior, University of Minnesota" />
<meta name="citation_author_email" content="blekhman@umn.edu" />
<meta name="citation_author_orcid" content="http://orcid.org/0000-0003-3218-613X" />

metadata in the HTML <head> of a bioRxiv preprint
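
Tags like these are what make a page machine-readable: a program can recover the title, DOI, and authors without parsing the visible article. A minimal Python sketch using only the standard library; the `CitationMetaParser` class is hypothetical, and the `head` string is a shortened excerpt of the listing above:

```python
from collections import defaultdict
from html.parser import HTMLParser


class CitationMetaParser(HTMLParser):
    """Collect the content of every <meta name="citation_..."> tag."""

    def __init__(self):
        super().__init__()
        # A list per name, because tags such as citation_author repeat.
        self.metadata = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.startswith("citation_"):
            self.metadata[name].append(attributes.get("content", ""))


head = """
<meta name="DC.Title" content="Tracking the popularity and outcomes of all bioRxiv preprints" />
<meta name="citation_doi" content="10.1101/515643" />
<meta name="citation_author" content="Richard J. Abdill" />
<meta name="citation_author" content="Ran Blekhman" />
"""

parser = CitationMetaParser()
parser.feed(head)
metadata = dict(parser.metadata)
```

Here `metadata["citation_author"]` holds both authors in page order, and the `DC.*` tags are skipped because the sketch filters on the `citation_` prefix.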

hyperlinks in manuscripts

wanted since Y2K

hyperlinks in manuscripts

access status data from

Time from submission to acceptance for 3,330,333 articles since 1965

https://blog.dhimmel.com/history-of-delays/

open peer review

no assessment wasted

peer review is slow

In addition, a member of Reviewer #2's lab reviewed the manuscript for the life sciences overlay biOverlay. The most important comments in that review are included in the points below. If there are additional comments from biOverlay that you wish to address in the revision please highlight these in your author response.

Peer Review Report for "Tracking the popularity and outcomes of all bioRxiv preprints" in eLife.

Source: PrePubMed, released under MIT License

Thanks!

@dhimmel

0000-0002-3012-7446

Slides
https://slides.com/dhimmel/highwire

Packing List
https://lighterpack.com/r/pzft6

Extra Slides

Outline

  • problem: delays
    solution: immediate publication
  • problem: introduction of errors
    solution: replace lossy manual pipelines with automated ones, DAR?
  • problem: bibliographic busywork
    solution: citation by identifier and automation / CSL
  • problem: erratum & corrigendum
    solution: versioning & inline comments
  • problem: insufficient visibility & discoverability
    solution: best practices for machine readability & voice search
  • What cannot be automated?
    peer review. How to make peer review more efficient?
    No reviews wasted. Link to preprints (and links between preprint and journals)

input

output

manubot process

Submitted to journal:
GEO refers to the Gene Expression Omnibus [165,166].

 

Published as:

GEO refers to the Gene Expression Omnibus (Edgar et al., 2002; Barrett et al., 20122013)

HighWire Event: The future of scholarly publication: automated, transparent, and open

By Daniel Himmelstein

Presentation by Daniel Himmelstein at HighWire's London Lunch & Learn session on 2019-06-14. This presentation is released under a CC BY 4.0 License. A recording is available at https://vimeo.com/344994584.
