Trove Data Guide

ARDC Community Data Lab

CDL work package

  • Trove Data Guide
  • RO-Crate integrations

but also...

Trove API v3

RO-Crate

  • RO-Crate: packaging research objects with JSON-LD metadata
  • used by other projects within HASS RDC, such as LDaCA
  • supports discovery, reuse, integration

RO-Crate integrations

Trove Newspaper Harvester

  • every harvest generates an RO-Crate file
  • new config file with tool and query parameters
  • links tool, configuration, query, and dataset
  • captures the context of a harvest
  • easy to re-run a harvest
{
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "about": {
                "@id": "./"
            },
            "conformsTo": {
                "@id": "https://w3id.org/ro/crate/1.1"
            },
            "license": {
                "@id": "https://creativecommons.org/publicdomain/zero/1.0/"
            }
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "datePublished": "2023-10-23T05:02:01+00:00",
            "description": "This dataset of digitised newspaper articles from Trove was created using the Trove Newspaper Harvester. Details of the search query used to generate this dataset can be found in the harvester_config.json file.",
            "hasPart": [
                {
                    "@id": "harvester_config.json"
                },
                {
                    "@id": "text"
                },
                {
                    "@id": "results.csv"
                }
            ],
            "mainEntity": {
                "@id": "#harvester_run"
            },
            "name": "Dataset of digitised newspaper articles harvested from Trove on 23 October 2023"
        },
        {
            "@id": "harvester_config.json",
            "@type": "File",
            "encodingFormat": "application/json",
            "name": "Trove Newspaper Harvester configuration file"
        },
        {
            "@id": "text",
            "@type": [
                "File",
                "Dataset"
            ],
            "dateCreated": "2023-10-23T16:02:30.929438+11:00",
            "description": "There is one text file per article. The file titles include basic article metadata \u2013 the date of the article, the id number of the newspaper, and the id number of the article.",
            "license": {
                "@id": "http://rightsstatements.org/vocab/CNE/1.0/"
            },
            "name": "Text files harvested from articles",
            "size": 272
        },
        {
            "@id": "results.csv",
            "@type": [
                "File",
                "Dataset"
            ],
            "contentSize": 80336,
            "dateCreated": "2023-10-23T16:02:30.944094+11:00",
            "encodingFormat": "text/csv",
            "license": {
                "@id": "http://rightsstatements.org/vocab/NKC/1.0/"
            },
            "name": "Metadata of harvested articles in CSV format",
            "size": 272
        },
        {
            "@id": "#harvester_run",
            "@type": "CreateAction",
            "actionStatus": {
                "@id": "http://schema.org/CompletedActionStatus"
            },
            "endDate": "2023-10-23T16:02:30.929438+11:00",
            "instrument": {
                "@id": "https://github.com/wragge/trove-newspaper-harvester"
            },
            "name": "Run of harvester",
            "object": {
                "@id": "harvester_config.json"
            },
            "result": [
                {
                    "@id": "text"
                },
                {
                    "@id": "results.csv"
                }
            ],
            "startDate": "2023-10-23T16:02:01.306088+11:00"
        },
        {
            "@id": "https://github.com/wragge/trove-newspaper-harvester",
            "@type": "SoftwareApplication",
            "description": "The Trove Newspaper (& Gazette) Harvester makes it easy to download large quantities of digitised articles from Trove\u2019s newspapers and gazettes.",
            "documentation": "https://wragge.github.io/trove-newspaper-harvester/",
            "name": "Trove Newspaper and Gazette Harvester",
            "softwareVersion": "0.7.2",
            "url": "https://github.com/wragge/trove-newspaper-harvester"
        },
        {
            "@id": "http://rightsstatements.org/vocab/NKC/1.0/",
            "@type": "CreativeWork",
            "description": "The organization that has made the Item available reasonably believes that the Item is not restricted by copyright or related rights, but a conclusive determination could not be made.",
            "name": "No Known Copyright",
            "url": "http://rightsstatements.org/vocab/NKC/1.0/"
        },
        {
            "@id": "http://rightsstatements.org/vocab/CNE/1.0/",
            "@type": "CreativeWork",
            "description": "The copyright and related rights status of this Item has not been evaluated.",
            "name": "Copyright Not Evaluated",
            "url": "http://rightsstatements.org/vocab/CNE/1.0/"
        },
        {
            "@id": "https://creativecommons.org/publicdomain/zero/1.0/",
            "@type": "CreativeWork",
            "name": "CC0 Public Domain Dedication",
            "url": "https://creativecommons.org/publicdomain/zero/1.0/"
        }
    ]
}
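
Because the crate links tool, configuration, query, and dataset, the context of a harvest can be recovered with a few lines of code. The following is a minimal sketch, not part of the harvester itself: the function names (`index_graph`, `as_id`, `describe_harvest`) are hypothetical, and the `sample` dictionary is a trimmed copy of the example above. In practice the dictionary would be loaded from a harvest's `ro-crate-metadata.json` file. The sketch accepts references written either as plain strings or as `{"@id": ...}` objects.

```python
import json


def index_graph(crate: dict) -> dict:
    """Map each entity's @id to the entity itself."""
    return {entity["@id"]: entity for entity in crate["@graph"]}


def as_id(value) -> str:
    """Return an identifier from a plain string or an {"@id": ...} reference."""
    return value["@id"] if isinstance(value, dict) else value


def describe_harvest(crate: dict) -> dict:
    """Summarise the harvester run recorded as the crate's main entity."""
    entities = index_graph(crate)
    # The metadata descriptor points at the root dataset via "about".
    root = entities[as_id(entities["ro-crate-metadata.json"]["about"])]
    run = entities[as_id(root["mainEntity"])]
    return {
        "tool": as_id(run["instrument"]),
        "config": as_id(run["object"]),
        "results": [as_id(result) for result in run["result"]],
        "started": run["startDate"],
        "ended": run["endDate"],
    }


# Trimmed copy of the example crate (illustrative only).
sample = {
    "@graph": [
        {"@id": "ro-crate-metadata.json", "about": {"@id": "./"}},
        {"@id": "./", "@type": "Dataset", "mainEntity": {"@id": "#harvester_run"}},
        {
            "@id": "#harvester_run",
            "@type": "CreateAction",
            "instrument": "https://github.com/wragge/trove-newspaper-harvester",
            "object": "harvester_config.json",
            "result": [{"@id": "text"}, {"@id": "results.csv"}],
            "startDate": "2023-10-23T16:02:01.306088+11:00",
            "endDate": "2023-10-23T16:02:30.929438+11:00",
        },
    ]
}

summary = describe_harvest(sample)
print(json.dumps(summary, indent=4))
```

The configuration file named in the summary holds the query parameters, so it is the starting point for re-running a harvest.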

GLAM Workbench repos

  • template repository generates skeleton RO-Crate
  • script to update RO-Crate after changes
  • reads metadata from notebooks
  • working through existing repositories
{
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "./",
            "@type": "Dataset",
            "author": [
                {
                    "@id": "https://orcid.org/0000-0001-7956-4498"
                }
            ],
            "datePublished": "2023-10-25",
            "description": "A GLAM Workbench repository",
            "hasPart": [
                {
                    "@id": "newspaper_harvester_app.ipynb"
                },
                {
                    "@id": "Using-TroveHarvester-to-get-newspaper-articles-in-bulk.ipynb"
                },
                {
                    "@id": "Explore-harvested-text-files.ipynb"
                },
                {
                    "@id": "display_harvest_results_using_datasette.ipynb"
                },
                {
                    "@id": "Exploring-your-TroveHarvester-data.ipynb"
                },
                {
                    "@id": "harvest-specific-days.ipynb"
                }
            ],
            "license": {
                "@id": "https://spdx.org/licenses/MIT"
            },
            "name": "trove-newspaper-harvester",
            "url": "https://github.com/GLAM-Workbench/trove-newspaper-harvester",
            "version": "v2.0.1"
        },
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "about": {
                "@id": "./"
            },
            "conformsTo": {
                "@id": "https://w3id.org/ro/crate/1.1"
            },
            "license": {
                "@id": "https://creativecommons.org/publicdomain/zero/1.0/"
            }
        },
        {
            "@id": "newspaper_harvester_app.ipynb",
            "@type": [
                "File",
                "SoftwareSourceCode"
            ],
            "author": [
                {
                    "@id": "https://orcid.org/0000-0001-7956-4498"
                }
            ],
            "codeRepository": "https://github.com/GLAM-Workbench/trove-newspaper-harvester",
            "conformsTo": {
                "@id": "https://purl.archive.org/textcommons/profile#Notebook"
            },
            "description": "",
            "encodingFormat": "application/x-ipynb+json",
            "name": "Trove Newspaper & Gazette Harvester",
            "programmingLanguage": {
                "@id": "https://www.python.org/downloads/release/python-31012/"
            }
        },
        {
            "@id": "Using-TroveHarvester-to-get-newspaper-articles-in-bulk.ipynb",
            "@type": [
                "File",
                "SoftwareSourceCode"
            ],
            "author": [
                {
                    "@id": "https://orcid.org/0000-0001-7956-4498"
                }
            ],
            "codeRepository": "https://github.com/GLAM-Workbench/trove-newspaper-harvester",
            "conformsTo": {
                "@id": "https://purl.archive.org/textcommons/profile#Notebook"
            },
            "description": "",
            "encodingFormat": "application/x-ipynb+json",
            "name": "Using TroveHarvester to get newspaper and gazette articles in bulk",
            "programmingLanguage": {
                "@id": "https://www.python.org/downloads/release/python-31012/"
            }
        },
        {
            "@id": "Explore-harvested-text-files.ipynb",
            "@type": [
                "File",
                "SoftwareSourceCode"
            ],
            "author": [
                {
                    "@id": "https://orcid.org/0000-0001-7956-4498"
                }
            ],
            "codeRepository": "https://github.com/GLAM-Workbench/trove-newspaper-harvester",
            "conformsTo": {
                "@id": "https://purl.archive.org/textcommons/profile#Notebook"
            },
            "description": "",
            "encodingFormat": "application/x-ipynb+json",
            "name": "Explore harvested text files",
            "programmingLanguage": {
                "@id": "https://www.python.org/downloads/release/python-31012/"
            }
        },
        {
            "@id": "display_harvest_results_using_datasette.ipynb",
            "@type": [
                "File",
                "SoftwareSourceCode"
            ],
            "author": [
                {
                    "@id": "https://orcid.org/0000-0001-7956-4498"
                }
            ],
            "codeRepository": "https://github.com/GLAM-Workbench/trove-newspaper-harvester",
            "conformsTo": {
                "@id": "https://purl.archive.org/textcommons/profile#Notebook"
            },
            "description": "",
            "encodingFormat": "application/x-ipynb+json",
            "name": "Display the results of a harvest as a searchable database using Datasette",
            "programmingLanguage": {
                "@id": "https://www.python.org/downloads/release/python-31012/"
            }
        },
        {
            "@id": "Exploring-your-TroveHarvester-data.ipynb",
            "@type": [
                "File",
                "SoftwareSourceCode"
            ],
            "author": [
                {
                    "@id": "https://orcid.org/0000-0001-7956-4498"
                }
            ],
            "codeRepository": "https://github.com/GLAM-Workbench/trove-newspaper-harvester",
            "conformsTo": {
                "@id": "https://purl.archive.org/textcommons/profile#Notebook"
            },
            "description": "",
            "encodingFormat": "application/x-ipynb+json",
            "name": "Exploring your harvested data",
            "programmingLanguage": {
                "@id": "https://www.python.org/downloads/release/python-31012/"
            }
        },
        {
            "@id": "harvest-specific-days.ipynb",
            "@type": [
                "File",
                "SoftwareSourceCode"
            ],
            "author": [
                {
                    "@id": "https://orcid.org/0000-0001-7956-4498"
                }
            ],
            "codeRepository": "https://github.com/GLAM-Workbench/trove-newspaper-harvester",
            "conformsTo": {
                "@id": "https://purl.archive.org/textcommons/profile#Notebook"
            },
            "description": "",
            "encodingFormat": "application/x-ipynb+json",
            "name": "Harvesting articles that mention \"Anzac Day\" on Anzac Day",
            "programmingLanguage": {
                "@id": "https://www.python.org/downloads/release/python-31012/"
            }
        },
        {
            "@id": "https://orcid.org/0000-0001-7956-4498",
            "@type": "Person",
            "name": "Sherratt, Tim"
        },
        {
            "@id": "https://spdx.org/licenses/MIT",
            "@type": "CreativeWork",
            "name": "MIT License",
            "url": "https://spdx.org/licenses/MIT.html"
        },
        {
            "@id": "https://creativecommons.org/publicdomain/zero/1.0/",
            "@type": "CreativeWork",
            "name": "CC0 Public Domain Dedication",
            "url": "https://creativecommons.org/publicdomain/zero/1.0/"
        },
        {
            "@id": "https://www.python.org/downloads/release/python-31012/",
            "@type": [
                "ComputerLanguage",
                "SoftwareApplication"
            ],
            "name": "Python 3.10.12",
            "url": "https://www.python.org/downloads/release/python-31012/",
            "version": "3.10.12"
        },
        {
            "@id": "#create_version_v2_0_1",
            "@type": "UpdateAction",
            "actionStatus": {
                "@id": "http://schema.org/CompletedActionStatus"
            },
            "endDate": "2023-10-25",
            "name": "Create version v2.0.1"
        }
    ]
}
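
As a rough illustration of reading metadata from notebooks, the sketch below builds a notebook entity like those in the example above. It is not the GLAM Workbench update script. It assumes that a notebook's name is taken from its first level-one Markdown heading, and the helper names (`first_heading`, `notebook_entity`) and the sample notebook are hypothetical.

```python
import json


def first_heading(notebook: dict) -> str:
    """Return the first level-one Markdown heading, or an empty string."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        # Notebook files store cell source as a string or a list of lines.
        text = "".join(source) if isinstance(source, list) else source
        for line in text.splitlines():
            if line.startswith("# "):
                return line[2:].strip()
    return ""


def notebook_entity(filename: str, notebook: dict, repo_url: str) -> dict:
    """Build an RO-Crate entity describing a single notebook."""
    return {
        "@id": filename,
        "@type": ["File", "SoftwareSourceCode"],
        "codeRepository": repo_url,
        "encodingFormat": "application/x-ipynb+json",
        # Assumption: the title comes from the notebook's first heading.
        "name": first_heading(notebook),
    }


# A tiny in-memory notebook standing in for a file read with json.load().
notebook = {
    "cells": [
        {
            "cell_type": "markdown",
            "source": ["# Explore harvested text files\n", "\n", "Some intro text."],
        },
        {"cell_type": "code", "source": ["print('hello')"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

entity = notebook_entity(
    "Explore-harvested-text-files.ipynb",
    notebook,
    "https://github.com/GLAM-Workbench/trove-newspaper-harvester",
)
print(json.dumps(entity, indent=4))
```

Other properties shown in the example, such as `author` and `programmingLanguage`, would need to come from additional notebook or repository metadata.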

Trove Data Guide

The guide to Trove data will document in detail the types of content available through Trove, and what data is accessible for each content type. It will describe both the possibilities and limits of Trove data, enabling researchers to develop a critical understanding of Trove as a source for digital research.

Content types

  • Explanation: why is Trove like this?
  • Documentation: what you need to know
  • How to: complete a specific task
  • Tutorials: learn methods, develop skills
  • inspired by Diátaxis

Explanation

Documentation

How to

Tutorials

  • integrate with other CDL and HASS RDC projects
  • respond to researcher needs in phase 2?

Technical details

  • created using Jupyter Book
  • generates a static site from Jupyter notebooks
  • mix of narrative and executable content with embedded code examples and visualisations

Updates and versions

User feedback/engagement

  • bug reports and enhancements via GitHub issues
  • annotations using Hypothesis
  • comments via Utterances
  • direct contributions via GitHub
  • open license to encourage reuse

Trove Data Guide

By Tim Sherratt
