FAIR Principles & Persistent Identifiers

TWO GREAT TASTES!

mike nason
open scholarship & publishing librarian, unb libraries
crossref/metadata specialist, pkp

you got chocolate in my peanut butter!

introduction(s)

i'm the open scholarship & publishing librarian (aka, i guess, "scholarly communications") at what, to most of you, would be a pretty small school in atlantic canada (university of new brunswick).

i also work for pkp as a member of their publishing services team, where i am the crossref/metadata liaison.

i stand at a pretty interesting intersection between publishing, authoring, and discovery.

that means i more or less never shut up about open scholarly infrastructure!

like a lot of people who don't shut up, i look like this:

i am a [white, cis] settler from the unceded (aka, stolen) territory of the mi'kmaq-wolastoqey peoples just a short hop from the wolastoq river. settlers to the region renamed this river the “saint john river”, a testament to both their repression and lack of creativity.

lots of people (even, to my perpetual dismay, many canadians) do not know where new brunswick is. it is up here, next to maine.

it is north of nova scotia and west of prince edward island.

i won't sugarcoat this. i am about to cover a lot of ground, very fast.

buckle up!

what is FAIR?

not life, certainly...

FAIR

FAIR is a set of principles – created by a diverse group of stakeholders across the scholarly research landscape – meant to address "an urgent need to improve the infrastructure supporting the reuse of scholarly data".

... "specific emphasis on enhancing the ability of machines to automatically find and use the data".

Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016).

10.1038/sdata.2016.18

unFAIR

researchers generate a profound volume of data. this volume and variety of data can make it hard to:

discover,
understand usage/access rights,
connect to other products of scholarship,
properly assert or assign credit,
correlate with institutions or funders.

basically, there are so many products of research spread across varied, disparate piles.

it is difficult to see relationships, find associated materials, and follow the narratives of research.

and! even if you can find something, you might not be at all sure whether you can access or reuse it.

you're, like, "i am not a data scientist"

you just said this out loud, maybe

"...we use the phrase ‘(meta)data’ in cases where the principles should be applied to both metadata and data."

i assume FAIR is an acronym?

this is you again. and uhhh, you bet it is

findability
accessibility
interoperability
reuse

fair enough!

i could not resist

findable

metadata and data should be easy to find for both humans and computers. machine-readable metadata are essential for automatic discovery of datasets and services.

(meta)data are assigned a globally unique and persistent identifier
data are described with rich metadata
metadata clearly and explicitly include the identifier of the data they describe
(meta)data are registered or indexed in a searchable resource

accessible

once the user finds the required data, they need to know how they can be accessed, possibly including authentication and authorisation.

(meta)data are retrievable by their identifier using a standardized communications protocol
metadata are accessible, even when the data are no longer available

interoperable

data usually need to be integrated with other data. data need to interoperate with applications or workflows for analysis, storage, and processing.

(meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
(meta)data use vocabularies that follow FAIR principles
(meta)data include qualified references to other (meta)data

reusable

metadata and data should be well-described so that they can be replicated and/or combined in different settings.

(meta)data are richly described with a plurality of accurate and relevant attributes

(meta)data are released with a clear and accessible data usage license
(meta)data are associated with detailed provenance
(meta)data meet domain-relevant community standards

what does this have to do with pids?

i am so, so glad you asked

let's (very quickly) review how pids work

we'll use the doi as our example

dois

dois are ubiquitous. we see them all over the place:

in references/bibliographies
on article/journal websites
in repositories
on published datasets
links

and, we probably know one handy thing about them:

if you click on a doi that looks like a link, it will take you to the thing.

dois

dois are the most prominent persistent identifier.

they are also, arguably, the most important persistent identifier.

how a doi works

a doi is made up of two chunks, and they mean different things.

prefix

10.4324

a prefix is usually associated with a publisher or organization. dois for that organization will usually have the same prefix.

suffix

9780203051238-5

a suffix is meant to be a machine-readable (not human-readable), opaque, unique string that is specific to the singular work to which it is assigned.
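
a minimal sketch of the two chunks, assuming the only rules we care about are "starts with 10." and "split at the first slash". the function name `split_doi` is made up for this example, and real doi syntax has more wrinkles than this.

```python
def split_doi(doi: str) -> tuple[str, str]:
    """Split a doi into (prefix, suffix) at the first slash."""
    prefix, slash, suffix = doi.partition("/")
    # a suffix may itself contain slashes, so only the first one matters
    if not slash or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"this does not look like a doi: {doi!r}")
    return prefix, suffix


prefix, suffix = split_doi("10.4324/9780203051238-5")
print(prefix)  # 10.4324
print(suffix)  # 9780203051238-5
```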

how a doi works

if i prepend a doi with https://doi.org/, it turns into a url.

https://doi.org/10.4324/9780203051238-5

clicking this will redirect me to the publication this doi is associated with.

the process of a doi redirecting you to a publication is called resolution.
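
the "prepend https://doi.org/" step, as a small sketch. `doi_to_url` is a hypothetical helper name. the percent-encoding is an assumption on my part about what is safe for dois with unusual characters, and it does nothing to the example doi here.

```python
from urllib.parse import quote

RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a bare doi into a resolvable url."""
    # keep "/" as-is; percent-encode characters that are unsafe in a url path
    return RESOLVER + quote(doi, safe="/")


print(doi_to_url("10.4324/9780203051238-5"))
# https://doi.org/10.4324/9780203051238-5
```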

how a doi works

dois aren't just like a bit.ly or tinyurl link.

if you're not familiar with these services, they swap out a very large and unwieldy link, so you can share something that isn't enormous.

for example: https://bit.ly/35LaEXj
this is a bit.ly link for a talk i did on osi

the actual url for that talk is: https://slides.com/ahemnason/persistent-identifiers-pids-and-open-scholarly-infrastructure/

bit.ly and tinyurl are both basic redirects.

how a doi works

a doi is a lot more than a redirect.

any single doi is a reference to an entire publication record. that publication record is full of metadata.

and, one of these metadata elements is the publication's url.

when you resolve a doi by clicking on it:

the record is accessed
the stored url is retrieved
you are sent to the stored url

the url can be updated by the publisher. the doi stays the same.
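
the three steps above can be sketched with a toy, in-memory registry. everything here is invented for illustration (`ToyRegistry`, its methods, and the example.org url), and it is not how any real agency stores or serves records. the point is only that the url lives inside the record, so it can change while the doi does not.

```python
class ToyRegistry:
    """A stand-in for a registration agency's database (not a real api)."""

    def __init__(self):
        self._records = {}

    def register(self, doi, metadata):
        self._records[doi] = dict(metadata)

    def update_url(self, doi, new_url):
        # the publisher moves the content and updates the record
        self._records[doi]["resource"] = new_url

    def resolve(self, doi):
        record = self._records[doi]  # 1. the record is accessed
        url = record["resource"]     # 2. the stored url is retrieved
        return url                   # 3. a real resolver would redirect you here


registry = ToyRegistry()
doi = "10.4324/9780203051238-5"
registry.register(doi, {
    "title": "Understanding Metadata and Metadata Schemes",
    "resource": "https://www.taylorfrancis.com/books/9781136435843/chapters/10.4324/9780203051238-5",
})

before = registry.resolve(doi)
registry.update_url(doi, "https://example.org/new-home/chapter-5")  # hypothetical new location
after = registry.resolve(doi)

print(before != after)  # True: same doi, different destination
```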

<?xml version="1.0" encoding="UTF-8"?>
<crossref_result version="3.0" xsi:schemaLocation="http://www.crossref.org/qrschema/3.0 http://www.crossref.org/schemas/crossref_query_output3.0.xsd">
  <query_result>
    <head>
      <doi_batch_id>none</doi_batch_id>
    </head>
    <body>
      <query status="resolved">
        <doi type="book_content">10.4324/9780203051238-5</doi>
        <crm-item name="publisher-name" type="string">Informa UK Limited</crm-item>
        <crm-item name="prefix-name" type="string">Informa UK (Routledge)</crm-item>
        <crm-item name="member-id" type="number">301</crm-item>
        <crm-item name="citation-id" type="number">122425695</crm-item>
        <crm-item name="book-id" type="number">1477192</crm-item>
        <crm-item name="deposit-timestamp" type="number">2020122110554080199</crm-item>
        <crm-item name="owner-prefix" type="string">10.4324</crm-item>
        <crm-item name="last-update" type="date">2020-12-21T15:07:00Z</crm-item>
        <crm-item name="created" type="date">2020-12-21T15:06:59Z</crm-item>
        <crm-item name="citedby-count" type="number">0</crm-item>
        <doi_record>
          <crossref xsi:schemaLocation="http://www.crossref.org/xschema/1.1 http://doi.crossref.org/schemas/unixref1.1.xsd">
            <book book_type="other">
              <book_metadata language="en">
                <contributors>
                  <person_name sequence="first" contributor_role="author">
                    <given_name>Richard</given_name>
                    <surname>Smiraglia</surname>
                  </person_name>
                </contributors>
                <titles>
                  <title>Metadata</title>
                  <subtitle>A Cataloger's Primer</subtitle>
                </titles>
                <edition_number>0</edition_number>
                <publication_date media_type="online">
                  <month>11</month>
                  <day>12</day>
                  <year>2012</year>
                </publication_date>
                <isbn media_type="electronic">9780203051238</isbn>
                <publisher>
                  <publisher_name>Routledge</publisher_name>
                </publisher>
                <doi_data>
                  <doi>10.4324/9780203051238</doi>
                  <timestamp>2020122110554078499</timestamp>
                  <resource>https://www.taylorfrancis.com/books/9781136435843</resource>
                </doi_data>
              </book_metadata>
              <content_item component_type="chapter" publication_type="full_text" language="en">
                <titles>
                  <title>Understanding Metadata and Metadata Schemes</title>
                </titles>
                <publication_date>
                  <year>2012</year>
                  <month>11</month>
                  <day>12</day>
                </publication_date>
                <pages>
                  <first_page>25</first_page>
                  <last_page>44</last_page>
                </pages>
                <doi_data>
                  <doi>10.4324/9780203051238-5</doi>
                  <timestamp>2020122110554080199</timestamp>
                  <resource>https://www.taylorfrancis.com/books/9781136435843/chapters/10.4324/9780203051238-5</resource>
                </doi_data>
              </content_item>
            </book>
          </crossref>
        </doi_record>
      </query>
    </body>
  </query_result>
</crossref_result>

i'm sorry to do this to you.

there's a lot of information here:

publisher
deposit and update timestamp
book type
contributors (first, role=author)
title and subtitle
publication date
doi for the book
link for the book
chapter title
doi for the chapter
link for the chapter
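
if you'd rather have a machine squint at the xml for you, here is a rough sketch using python's standard library. `SAMPLE` is a trimmed-down copy of the record above (schema attributes and most fields removed so it parses on its own), and `summarize` is a made-up helper. treat it as an illustration, not a general crossref parser.

```python
import xml.etree.ElementTree as ET

# a trimmed copy of the record above, cut down to the parts we read
SAMPLE = """
<book book_type="other">
  <book_metadata language="en">
    <titles>
      <title>Metadata</title>
      <subtitle>A Cataloger's Primer</subtitle>
    </titles>
    <doi_data>
      <doi>10.4324/9780203051238</doi>
      <resource>https://www.taylorfrancis.com/books/9781136435843</resource>
    </doi_data>
  </book_metadata>
  <content_item component_type="chapter">
    <titles>
      <title>Understanding Metadata and Metadata Schemes</title>
    </titles>
    <doi_data>
      <doi>10.4324/9780203051238-5</doi>
      <resource>https://www.taylorfrancis.com/books/9781136435843/chapters/10.4324/9780203051238-5</resource>
    </doi_data>
  </content_item>
</book>
"""


def summarize(xml_text):
    """Pull title, doi, and link for both the book and the chapter."""
    book = ET.fromstring(xml_text.strip())
    summary = {}
    for label, path in (("book", "book_metadata"), ("chapter", "content_item")):
        node = book.find(path)
        summary[label] = {
            "title": node.findtext("titles/title"),
            "doi": node.findtext("doi_data/doi"),
            "url": node.findtext("doi_data/resource"),
        }
    return summary


info = summarize(SAMPLE)
print(info["book"]["doi"])     # 10.4324/9780203051238
print(info["chapter"]["doi"])  # 10.4324/9780203051238-5
```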

<doi_data>
  <doi>10.4324/9780203051238</doi>
  <timestamp>2020122110554078499</timestamp>
  <resource>https://www.taylorfrancis.com/books/9781136435843</resource>
</doi_data>
<content_item component_type="chapter" publication_type="full_text" language="en">
  <titles>
    <title>Understanding Metadata and Metadata Schemes</title>
  </titles>
  <publication_date>
    <year>2012</year>
    <month>11</month>
    <day>12</day>
  </publication_date>
  <pages>
    <first_page>25</first_page>
    <last_page>44</last_page>
  </pages>
  <doi_data>
    <doi>10.4324/9780203051238-5</doi>
    <timestamp>2020122110554080199</timestamp>
    <resource>https://www.taylorfrancis.com/books/9781136435843/chapters/10.4324/9780203051238-5</resource>
  </doi_data>
</content_item>

<!-- urls are part of the metadata of a doi. -->
<!-- when you change the location of content, you update your doi with the new location. everyone who uses the doi gets to the content no matter where you put it, so long as that doi is updated. this means the doi is persistent. -->

publication metadata

unlike publications themselves, metadata is typically free. and, we can learn a lot from it. crossref, for example, can store the following things (not an exhaustive list) as publicly accessible metadata:

title
subtitle
authors
orcids
affiliation

copyright license
funder/grant ids
languages
ror
references
resource location
version
publisher
journal/volume/issue
related dois
dates
abstracts

every doi registered with crossref or datacite (the two most prominent for scholarly works) is attached to a metadata record for that work!

both agencies maintain a public api that allows users to:

resolve dois
pull/view all registered metadata
push that metadata elsewhere

like, for example, ORCID.
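
a hedged sketch of "pull/view all registered metadata", using crossref's public rest api at https://api.crossref.org/works/ followed by the doi. the helper names are my own, the request needs a network connection, and the fields a record contains vary from work to work, so nothing is actually fetched until you call it yourself.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi):
    """Build the crossref rest api url for one doi."""
    return CROSSREF_WORKS + quote(doi, safe="/")


def fetch_crossref_record(doi, timeout=10):
    """Fetch the public metadata record for a doi (needs network access)."""
    with urlopen(crossref_url(doi), timeout=timeout) as response:
        # crossref wraps the record itself in a "message" envelope
        return json.load(response)["message"]


print(crossref_url("10.4324/9780203051238-5"))
# https://api.crossref.org/works/10.4324/9780203051238-5

# to actually pull a record (not run here):
# record = fetch_crossref_record("10.4324/9780203051238-5")
# print(record.get("title"), record.get("publisher"))
```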

this metadata is, as you might have guessed, hugely useful!

congratulations, you now know more about dois than a frankly surprising number of people.

and, by extension, you now know more about pids than a frankly surprising number of people.

no pid is an island

often, folks use the phrase "minting a doi" to describe the assignment of a doi to a work. i see this a lot. a journal editor might say to me, "i made all these dois but they don't work! i just get an error!"

a publisher can "mint" pids and provide them to you, but they need a third party to be at all useful.

to be at all useful, a pid needs to be registered with a pid registration agency.
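
here is that editor's problem as a toy sketch. the `REGISTERED` table, the `resolve` function, and the "minted only" doi are all invented, and the status codes are just there to suggest "redirect" versus "not found". a string that looks like a doi but was never registered has nothing behind it to resolve to.

```python
# pretend this is what the registration agency knows about
REGISTERED = {
    "10.4324/9780203051238-5":
        "https://www.taylorfrancis.com/books/9781136435843/chapters/10.4324/9780203051238-5",
}


def resolve(doi):
    """Return (status, location) roughly the way a resolver might answer."""
    if doi in REGISTERED:
        return 302, REGISTERED[doi]  # found: redirect to the stored url
    return 404, None                 # never registered: nowhere to go


minted_only = "10.5555/made-up-and-never-registered"  # hypothetical doi

print(resolve("10.4324/9780203051238-5")[0])  # 302
print(resolve(minted_only))                   # (404, None)
```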

registration agencies

persistent identifiers are managed by registration agencies (typically international not-for-profits) that store records/metadata, facilitate resolution requests, and may or may not offer other services based on membership. they do much of this through APIs.

there are a lot of registration agencies!

each agency may differ in mandate, governance, scope, service, supported objects, membership terms, and feature set.

they also, often, work together and share data.

pids for scholarly works

Crossref (DOI)

most scholarly publishers are crossref members. at the time i wrote this (1pm, april 6th) crossref had 134,294,189 dois registered with their service.

crossref are a big deal.

articles
proceedings
monographs
*datasets
funding agencies
grants
reports
standards
preprints

pids for scholarly works

DataCite (DOI)

while some scholarly publishers use datacite for article dois, it is much more commonly used in data/institutional/disciplinary repositories. datacite and crossref work together to connect research data to publications.

software
datasets
collections
audio/visual
event
model
*publications

pids for researchers

ORCID (ISNI)
Scopus ID
WoS Researcher ID

orcid are the go-to here, with scopus and wos offerings both restricted to publications present on those platforms. however, these services can share data between them.

researchers

pids for organizations

ROR
GRID
ISNI

the predominant use-case for organizational ids is in strengthening connections between records using open scholarly infrastructure.

organizations

registration agencies

registration agencies provide metadata schema through which users can describe the objects they are registering pids for.

as you can imagine, you'd describe a person differently than you'd describe a dataset, or a journal article, or an organization. even when agencies use the same type of pid (like the doi), the schema they use may vary.
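
to make "the schema may vary" a bit more concrete, here is a deliberately tiny sketch. the field names in `REQUIRED_FIELDS` are invented stand-ins and not any agency's actual schema (real schemas are much richer). the idea is just that each kind of object asks for different things.

```python
# hypothetical, heavily simplified "schemas" for three kinds of object
REQUIRED_FIELDS = {
    "researcher": {"given_name", "family_name"},
    "journal_article": {"title", "contributors", "publication_date", "resource"},
    "organization": {"name", "country"},
}


def missing_fields(kind, record):
    """List the required fields a record of this kind has not filled in."""
    return sorted(REQUIRED_FIELDS[kind] - set(record))


article = {"title": "an example article", "resource": "https://example.org/article/1"}
print(missing_fields("journal_article", article))
# ['contributors', 'publication_date']
```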

what a persistent identifier does

pids make things easier to find, track, share, and access!

if my orcid id is present as metadata in the dois of the work i publish, i can pull my publication record easily and add it to my orcid profile

if my articles have dois, i can provide persistent links to their most recent location, which will ensure ease of access and citation

if a funding agency can pull metadata from my orcid profile, they can acquire all of my publication metadata without me having to fill out a pile of forms
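
a small sketch of the first scenario, assuming we already have a pile of work records in hand. the records in `WORKS` are entirely made up (the dois are placeholders, and 0000-0002-1825-0097 is the orcid id of orcid's fictitious example researcher). the takeaway is that matching on an identifier beats matching on a name.

```python
# made-up records standing in for metadata pulled from a registration agency
WORKS = [
    {"doi": "10.5555/example-1",
     "authors": [{"name": "researcher a", "orcid": "0000-0002-1825-0097"}]},
    {"doi": "10.5555/example-2",
     "authors": [{"name": "researcher b"}]},  # no orcid in the metadata
    {"doi": "10.5555/example-3",
     "authors": [{"name": "researcher b"},
                 {"name": "r. a.", "orcid": "0000-0002-1825-0097"}]},
]


def dois_for_orcid(orcid, works):
    """Return the dois of every work that lists this orcid id as an author."""
    return [
        work["doi"]
        for work in works
        if any(author.get("orcid") == orcid for author in work["authors"])
    ]


print(dois_for_orcid("0000-0002-1825-0097", WORKS))
# ['10.5555/example-1', '10.5555/example-3']
```

note that the third record spells the name differently and still matches, which is more or less the whole pitch.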

this is open scholarly infrastructure

pids are in the drinking water of scholarly publishing

open scholarly infrastructure

this network of APIs is like a municipal water system (get it?). it is, increasingly, infrastructure relied upon by researchers and institutions whether or not they are really aware of it.

almost all open scholarly infrastructure is based around APIs.

open scholarly infrastructure is a network of scholarly-research-focused open-source platforms, service providers, and APIs that work in concert to share data, illuminate relationships, and make research more discoverable.

https://openscholarlyinfrastructure.org/

api

quite possibly the most-used and least-understood acronym in modern librarianship, api stands for:

Application
Programming
Interface

an API is, basically, a set of rules for interacting with software.

think of it as being a little like a translator working as an intermediary between two people who don't speak the same language.

api

APIs are everywhere. when my calendar app tells me today's forecast, it's accessing that information using the AccuWeather API. when my watch vibrates because i got a text message, that's because Garmin's API is communicating with Apple's notifications API.

APIs are how disparate systems, built by different people, using different languages and definitions, find common ground and share information.

crossref

datacite

orcid

elsevier

t&f

sage

ror

github

dataverse

zenodo

arxiv

mendeley

zotero

cris systems

funders

openaire

google scholar

unpaywall

share your paper

plos

...

pids

in concert with open scholarly infrastructure, pids allow us to see the big picture through these connections and interactions. they can expose relationships between data and research or institutions and outcomes. they can make research outcomes more discoverable.

when we talk about pids, we’re talking about supporting open infrastructure and free exchange of metadata.

this all sounds FAIR to me!

findable

metadata and data should be easy to find for both humans and computers. machine-readable metadata are essential for automatic discovery of datasets and services.

✅ (meta)data are assigned a globally unique and persistent identifier
data are described with rich metadata
✅ metadata clearly and explicitly include the identifier of the data they describe
✅ (meta)data are registered or indexed in a searchable resource

accessible

once the user finds the required data, they need to know how they can be accessed, possibly including authentication and authorisation.

✅ (meta)data are retrievable by their identifier using a standardized communications protocol
✅ metadata are accessible, even when the data are no longer available

interoperable

data usually need to be integrated with other data. data need to interoperate with applications or workflows for analysis, storage, and processing.

SORTA (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
✅ (meta)data use vocabularies that follow FAIR principles
✅ (meta)data include qualified references to other (meta)data

reusable

metadata and data should be well-described so that they can be replicated and/or combined in different settings.

(meta)data are richly described with a plurality of accurate and relevant attributes

✅ (meta)data are released with a clear and accessible data usage license
(meta)data are associated with detailed provenance
✅ (meta)data meet domain-relevant community standards

we embrace FAIR just by using/supporting PIDs!

FAIR is not-so-secretly all about open scholarly infrastructure!

BUT THERE IS AN ELEPHANT IN THE ROOM

metadata

the metadata we get out of these systems, and its utility, is very much dependent on its quality.

we have a general expression in the metadata universe. that's "garbage in, garbage out."
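
one small way to keep garbage from going in: check identifiers before you deposit them. orcid ids end in a check character computed with iso 7064 mod 11-2, so a typo is usually detectable. this is a sketch with a made-up function name, and it only checks the shape and the check character, not whether the id belongs to a real person.

```python
def is_valid_orcid(orcid: str) -> bool:
    """Check the format and check character of an orcid id like 0000-0002-1825-0097."""
    chars = orcid.replace("-", "")
    if len(chars) != 16 or not chars[:15].isdigit():
        return False
    # iso 7064 mod 11-2 over the first fifteen digits
    total = 0
    for digit in chars[:15]:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return chars[15].upper() == expected


print(is_valid_orcid("0000-0002-1825-0097"))  # True
print(is_valid_orcid("0000-0002-1825-0098"))  # False (last character is wrong)
```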

metadata is kind of everyone's responsibility. researchers, librarians, publishers, registration agencies... everyone has a stake in accurate, usable metadata.

metadata is a very complicated topic i could talk about for twice the length of this talk. any time! let me know! i'll do it.

anyway, listen, i'm very sorry about how fast this all was (not really)!

thanks!