Understanding Trove

https://slides.com/wragge/aha-2024

please steal these slides!

https://headlineroulette.net/?id=213102227

2009–2012

2013–2015

2016–

Trove has a history

Single Business Discovery Project

Trove

structures change

https://tdg.glam-workbench.net/what-is-trove/categories-and-zones.html

2008

2024

content changes

2022

2011

Trove is constructed

using critically

context?

content?

https://tdg.glam-workbench.net/

Trove Data Guide content

Explanation – why is Trove like this?
Documentation – what you need to know
How to – complete a specific task
Tutorials – learn methods, develop skills
inspired by Diátaxis

https://ardc.edu.au/services/ardc-community-data-lab/

https://glam-workbench.net/

Community Data Lab

Trove Data Guide

GLAM Workbench

architectures

standards

technologies

principles

context & content

https://tdg.glam-workbench.net/what-is-trove/links-and-identifiers.html

how many digitised newspaper articles are currently in Trove?

https://tdg.glam-workbench.net/understanding-search/search-hacks.html

try it!

go to Trove's newspapers category
enter any keyword (it doesn't matter what it is)
look at the url in your browser's location bar and find the part of the url that looks like:
?keyword=[your keyword]
delete the part after the = sign and hit enter

bonus points!

add &pageSize=100 to the url in your browser's location bar and hit enter
what happens?
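
The same two URL edits can be scripted. This is a minimal sketch using only Python's standard library. The example search URL and the function name `hack_search_url` are assumptions for illustration; only the `keyword` and `pageSize` parameters come from the steps above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def hack_search_url(url, page_size=None):
    """Blank the keyword in a Trove search URL and optionally set pageSize."""
    parts = urlsplit(url)
    # keep_blank_values so parameters that are already empty are not dropped
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params["keyword"] = ""  # same as deleting everything after the = sign
    if page_size is not None:
        params["pageSize"] = str(page_size)
    return urlunsplit(parts._replace(query=urlencode(params)))


if __name__ == "__main__":
    # Assumed example of what the location bar might show after a search
    start = "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge"
    print(hack_search_url(start))
    print(hack_search_url(start, page_size=100))
```

Paste the printed URLs back into the browser to see the results; the script only rewrites the URL and does not contact Trove.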

official
Trove
hacker

handy with lists

an aggregation of collection metadata
a repository of digitised content
an archive of Australian web content from 1996 onwards
aggregated identity records for people and organisations
born-digital publications submitted via eLegal Deposit
a platform for user engagement
a series of APIs for delivering machine-actionable data

Trove is not one thing...

Trove's categories

starting at the top!

Books & Libraries
Diaries, Letters & Archives
Images, Maps & Artefacts
Lists
Magazines & Newsletters
Music, Audio & Video
Newspapers & Gazettes
People & Organisations
Research & Reports
Websites

separate systems /

specific types of things

Books & Libraries
Diaries, Letters & Archives
Images, Maps & Artefacts
Lists
Magazines & Newsletters
Music, Audio & Video
Newspapers & Gazettes
People & Organisations
Research & Reports
Websites

aggregated metadata & digitised resources

Books & Libraries
Diaries, Letters & Archives
Images, Maps & Artefacts
Lists
Magazines & Newsletters
Music, Audio & Video
Newspapers & Gazettes
People & Organisations
Research & Reports
Websites

like newspapers, but not...

formats by category

https://tdg.glam-workbench.net/what-is-trove/categories-and-zones.html

categories are containers

categories are contexts for discovery

Trove is designed for discovery not analysis

works and versions

https://tdg.glam-workbench.net/what-is-trove/works-and-versions.html

https://trove.nla.gov.au/work/158465667

the wrong Wiggles

https://trove.nla.gov.au/work/195172587

one work, 106 different press conferences

https://trove.nla.gov.au/work/10431978

the same, but different....

collection items as 'versions'

https://trove.nla.gov.au/work/163048354

collections within collections

https://tdg.glam-workbench.net/what-is-trove/collections.html

https://nla.gov.au/nla.obj-147116770

https://nla.gov.au/nla.obj-147116890

https://nla.gov.au/nla.obj-140670968

does this matter?

using critically

context?

content?

understanding search

https://tdg.glam-workbench.net/understanding-search/index.html

search is a research method

Understand the technical context — How does it work? Consult the documentation (and this Guide) to understand your options
Be creative and strategic — Solve your puzzle by experimenting and looking for clues in the results
Stay critical — Always assume that Trove isn’t telling you everything

https://tdg.glam-workbench.net/understanding-search/index.html

simple search isn't...

de-fuzzify searches

https://tdg.glam-workbench.net/understanding-search/simple-search-options.html

"isPartOf": [
  {
    "value": "Australian ephemera collection (Programs and invitations)",
    "type": "series"
  }
]

using indexes

https://tdg.glam-workbench.net/what-is-trove/collections.html

search the isPartOf values for "ephemera"
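
If you already have work records saved as JSON, the same idea can be applied locally. This is a rough sketch, not Trove's own index search: the helper name `is_part_of_matches` and the two sample records are hypothetical, and only the `isPartOf` structure is taken from the example above.

```python
def is_part_of_matches(record, term):
    """Return True if any isPartOf value in the record contains the term."""
    term = term.lower()
    return any(
        term in part.get("value", "").lower()
        for part in record.get("isPartOf", [])
    )


# Hypothetical records; the isPartOf entry mirrors the example above
records = [
    {
        "title": "Hypothetical theatre program",
        "isPartOf": [
            {
                "value": "Australian ephemera collection (Programs and invitations)",
                "type": "series",
            }
        ],
    },
    {"title": "Hypothetical monograph"},  # no isPartOf field at all
]

matches = [r["title"] for r in records if is_part_of_matches(r, "ephemera")]
print(matches)
```

Records without an `isPartOf` field are skipped rather than raising an error, since not every record will belong to a collection.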

https://tdg.glam-workbench.net/understanding-search/simple-search-options.html

https://tdg.glam-workbench.net/understanding-search/date-searches.html

date searches

what are we searching?

https://tdg.glam-workbench.net/newspapers-and-gazettes/newspaper-corpus.html

change over time

https://wragge.github.io/trove-newspaper-totals/

7,518,764 articles added in 2023

https://updates.timsherratt.org/2024/01/02/trove-newspapers-in.html

https://wragge.github.io/trove-newspaper-totals/

https://troveplaces.herokuapp.com/map/

newspaper locations

what's missing?

https://tdg.glam-workbench.net/newspapers-and-gazettes/newspaper-corpus.html

https://glam-workbench.net/trove-newspapers/Analysing_OCR_corrections/

OCR corrections

https://gist.github.com/wragge/9aa385648cff5f0de0c7d4837896df97

non-English language newspapers

not just newspapers

20,000 books (and ephemera)
900 periodicals containing 37,000 issues
30,000 maps
24,000 Parliamentary Papers
6,000 oral histories
85,000 web page titles
7,000 born-digital periodicals containing 150,000 issues

more than...

where are they?

https://tdg.glam-workbench.net/other-digitised-resources/index.html

try it!

go to the Images, Maps & Artefacts category
search for "nla.obj" (with the quotes)
select 'Online' from the 'Access' facet
add additional keywords or facets!
for example here are digitised posters

books

21,218 'books'
17,695 with OCR
1,473,339 pages

https://tdg.glam-workbench.net/other-digitised-resources/books/overview.html

🔭 explore

periodicals

https://tdg.glam-workbench.net/other-digitised-resources/periodicals/overview.html

908 titles
37,015 issues

🔭 explore

6,202 online
1,781 transcripts
15,107 hours

oral histories

https://tdg.glam-workbench.net/other-digitised-resources/oral-histories/overview.html

🔭 explore

Parliamentary Papers

24,990 publications
2,448,522 pages
4 GB of OCR'd text

https://tdg.glam-workbench.net/other-digitised-resources/parliamentary-papers/overview.html

🔭 explore

Finding Parliamentary Papers

https://tdg.glam-workbench.net/other-digitised-resources/parliamentary-papers/finding-pp.html

maps

35,042 'single' maps
30,344 high-res TIFFs
14.41 TB of images
28,205 with coordinates

https://glam-workbench.net/trove-maps/

🔭 explore

NED periodicals

7,973 periodicals
156,151 issues
154,976 PDFs
138,557 full access

🔭 explore

https://glam-workbench.net/trove-journals/harvest-ned-periodicals/

websites

> 8 billion pages
87,757 selected titles
149 subjects
1,920 collections

🔭 explore

https://glam-workbench.net/trove-web-archives/

BREAK

what data?

metadata
text
images
sound
born digital objects
user generated
system statistics

metadata

catalogue entries
authority records
library holdings
results of processing (eg OCR coordinates)

text

created by OCR / HTR
corrected by users
extracted from web pages
oral history transcripts
titles, abstracts

images

created by digitisation (photos, maps, book pages, manuscripts)
born digital (via Flickr)

sound

digitised and born digital oral history recordings

born digital

web pages (including images, PDFs, videos)
web harvest metadata
ePubs (via legal deposit)

user generated

tags
comments
lists
corrections

system stats

infer totals from search results
contributors
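
Inferring a total from search results might look something like the sketch below. Everything about the request is an assumption to check against the API documentation: the `/v3/result` endpoint, the `q`, `category` and `encoding` parameters, the `X-API-KEY` header, and the `category[].records.total` shape of the response. You also need your own API key.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed search endpoint; confirm against the Trove API documentation
API_URL = "https://api.trove.nla.gov.au/v3/result"


def build_request(api_key, query="", category="newspaper"):
    """Build (but do not send) a search request for one category."""
    params = {"q": query, "category": category, "encoding": "json"}
    return Request(f"{API_URL}?{urlencode(params)}", headers={"X-API-KEY": api_key})


def total_results(data):
    """Map each category code to its total, assuming the response shape above."""
    return {c["code"]: c["records"]["total"] for c in data["category"]}


def fetch_totals(api_key, query="", category="newspaper"):
    """Send the request and return the totals (needs network access and a key)."""
    with urlopen(build_request(api_key, query, category)) as response:
        return total_results(json.load(response))
```

Calling `fetch_totals("YOUR_API_KEY")` with an empty query is the API equivalent of deleting the keyword in the web interface.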

exploring scale
analysing content
annotation and enrichment
creating collections

beyond Trove's web interface 🚀

why data?

https://glam-workbench.net/trove-newspapers/querypic/

Querypic

19 million articles

https://updates.timsherratt.org/2023/08/08/exploring-the-front.html

https://tdg.glam-workbench.net/pathways/text/newspapers-keywords.html

https://tdg.glam-workbench.net/pathways/images/examples.html

image workspaces

https://wragge.github.io/federation-papers/

try it!

https://tdg.glam-workbench.net/pathways/geospatial/maps-to-ghap.html

https://tdg.glam-workbench.net/pathways/collections/collectionbuilder.html

accessing data

https://tdg.glam-workbench.net/accessing-data/using-web-interface.html

https://www.zotero.org/

data from the web interface

downloading as 'image' delivers an HTML page

limit of 20
backs missing
(no sub-collections)

low resolution (1000px x 1588px)

missing metadata

limited metadata
no full text
< 1 million results

Scaling up?

text from all articles in a newspaper search
all covers from a journal
all images from a finding aid
text from all issues in a journal
all digitised maps of Australia

creating datasets

{
  "id": "61389505",
  "url": "https://api.trove.nla.gov.au/v3/newspaper/61389505",
  "heading": "MR. WRAGGE'S \"WRAGGE.\"",
  "category": "Article",
  "title": {
    "id": "64",
    "title": "Clarence and Richmond Examiner (Grafton, NSW : 1889 - 1915)"
  },
  "date": "1902-07-15",
  "page": "4",
  "pageSequence": "4",
  "troveUrl": "https://nla.gov.au/nla.news-article61389505"
}
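
A record like this is easy to work with once it is loaded as data. This is a small sketch using Python's standard library: the record is copied from the slide, while the variable names and the citation format are just illustrative choices.

```python
import json
from datetime import date

# The article record shown above, as a raw JSON string
RECORD = r'''
{
  "id": "61389505",
  "url": "https://api.trove.nla.gov.au/v3/newspaper/61389505",
  "heading": "MR. WRAGGE'S \"WRAGGE.\"",
  "category": "Article",
  "title": {
    "id": "64",
    "title": "Clarence and Richmond Examiner (Grafton, NSW : 1889 - 1915)"
  },
  "date": "1902-07-15",
  "page": "4",
  "pageSequence": "4",
  "troveUrl": "https://nla.gov.au/nla.news-article61389505"
}
'''

article = json.loads(RECORD)

# Dates arrive as ISO strings, so they can be parsed and compared
published = date.fromisoformat(article["date"])

# Build a simple citation from the structured fields
citation = (
    f'{article["heading"]}, {article["title"]["title"]}, '
    f'{published.isoformat()}, page {article["page"]}'
)

# The numeric id also appears at the end of the persistent troveUrl
link_id = article["troveUrl"].rsplit("nla.news-article", 1)[-1]

print(citation)
print(link_id == article["id"])
```

Note that values such as `page` and `id` are strings, not numbers, so convert them before sorting or doing arithmetic.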

https://tdg.glam-workbench.net/accessing-data/trove-api-intro.html

Trove API

use the API, but...

skills?
documentation?
examples?
tools?

🤯

API limits

items in a digitised collection
links to download images
text from books or periodical issues
text from Australian Women's Weekly

🐞 and bugs!

🪠 data plumbing

gaps & blockages

researchers 👩‍🔧

collaboration

https://wragge.github.io/trove_newspaper_images/

https://tdg.glam-workbench.net/newspapers-and-gazettes/data/articles.html

articles as images

try it!

find a newspaper article (here's one)
copy the url of the newspaper article from your browser's location bar
go to the newspaper image app in the GLAM Workbench
paste the url into the box
click on Get images
select 'Save image as...' from the right click menu

check 'mask image' and try again – what changes?

https://tdg.glam-workbench.net/other-digitised-resources/how-to/download-images.html

https://tdg.glam-workbench.net/accessing-data/how-to/download-higher-resolution-images.html

save high-res images

try it!

search for a digitised photo (remember the "nla.obj" trick)
here's an example if you can't think of anything to search for
click on the View link to load the image in the digitised item viewer
change the url suffix from /view to /image in your browser's location bar
click enter to view the image
choose 'Save image as...' from the right click menu to download the image
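
The URL change in the steps above can be written as a small helper, which is handy if you have a list of viewer URLs. This is a minimal sketch: the function name `view_to_image_url` is hypothetical, and the identifier in the example is borrowed from earlier in these slides purely to show the URL pattern, not because it is known to be a photo.

```python
from urllib.parse import urlsplit, urlunsplit


def view_to_image_url(url):
    """Swap the /view suffix of a digitised item viewer URL for /image."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if not path.endswith("/view"):
        raise ValueError(f"Expected a URL ending in /view, got: {url}")
    new_path = path[: -len("/view")] + "/image"
    return urlunsplit(parts._replace(path=new_path))


if __name__ == "__main__":
    # Illustrative identifier only
    print(view_to_image_url("https://nla.gov.au/nla.obj-140670968/view"))
```

The helper only rewrites the URL; downloading the image is still a separate step.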

archived websites by topic

harvesting digitised resources

https://tdg.glam-workbench.net/other-digitised-resources/how-to/harvest-digitised-resources.html

https://tdg.glam-workbench.net/pathways/index.html

more data sources

🔭 explore

https://glam-workbench.net/trove-journals/bulletin-cartoons-collection/

3,471 cartoons
1886 to 1962

you can help!

add notes, tags, links....

add ideas!

https://github.com/wragge/trove-data-guide/discussions/categories/ideas

https://github.com/wragge/trove-data-guide/issues

report problems!

share everything!

openly licensed to encourage reuse
use for your own research
use in teaching & research training
no need to ask for permission

steal everything!

pay it forward!

stay connected!

https://ardc.edu.au/hass-and-indigenous-research-data-commons/

stay connected!

https://updates.timsherratt.org/

Tim Sherratt

email: tim@timsherratt.au

web: timsherratt.au

mastodon: @wragge@hcommons.social

updates: https://updates.timsherratt.org/

Understanding Trove workshop

By Tim Sherratt

Presented at the Australian Historical Association Annual Conference, 1 July 2024

Historian and hacker. All the slide decks available here are licensed under a Creative Commons Attribution 4.0 International License. Feel free to reuse and share!
