The American World Gazetteer
or too much text, too little time...
Tim Sherratt · @wragge
OCRd text from 83,135 digitised books (30GB)
Never mind the metadata – just give me everything!
The dataset
What's an API anyway?
Where's the API documentation?
Used the manifest to get list of files
Downloading was straightforward – 83,135 text files (30GB) in 5 days (download sketch below)
Accessing the dataset
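A minimal sketch of how the bulk download might have worked, assuming the collection manifest is a JSON list of plain-text file URLs. The manifest URL and field names here are placeholders, not the actual LoC values.

```python
from pathlib import Path

import requests

MANIFEST_URL = "https://example.org/collection/manifest.json"  # placeholder, not the real LoC manifest
OUTPUT_DIR = Path("loc_texts")
OUTPUT_DIR.mkdir(exist_ok=True)

# Assumed manifest structure: {"files": [{"url": "https://.../book.txt"}, ...]}
manifest = requests.get(MANIFEST_URL, timeout=60).json()

for item in manifest["files"]:
    url = item["url"]
    destination = OUTPUT_DIR / url.rsplit("/", 1)[-1]
    if destination.exists():
        # Skip files already on disk so a long-running download can resume
        continue
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    destination.write_bytes(response.content)
```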
Encoding of text files (encoding sketch below)
How many files are there?
OCR quality (of course!)
Not enough time!
Biggest obstacle
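One way the encoding problem could be handled when reading the OCR text files: try UTF-8 first and fall back to Latin-1, which accepts any byte sequence. The fallback choice is an assumption for illustration, not necessarily what was done here.

```python
from pathlib import Path


def read_ocr_text(path: Path) -> str:
    """Return the file's text, tolerating inconsistent encodings."""
    raw = path.read_bytes()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 never fails to decode, so nothing is lost outright,
        # though some characters may be misinterpreted.
        return raw.decode("latin-1")


for text_file in Path("loc_texts").glob("*.txt"):
    text = read_ocr_text(text_file)
    # ... downstream processing goes here
```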
Exploring new data!
The challenges of scale (on a budget)
I learnt heaps!
What did I enjoy?
More time!
Clearer processing pipelines
Disambiguation and data cleaning?
What would I have done differently?
The world as seen from America?
7,762,992 places from 34,570 books
Limited to 737 'countries' & linked to Wikidata
1,393,125 sentences extracted from 10,155 books
Book metadata, countries, & sentences saved to SQLite, searchable through Datasette (extraction sketch below)
American World Gazetteer app – random sentences from Datasette
How did you choose to represent the dataset?
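A rough sketch of the kind of pipeline this implies: split each book's OCR text into sentences, keep those that mention a known country name, and write the results to SQLite ready for Datasette. The country list, the simple string matching, and the table and column names are illustrative assumptions, not the project's exact code.

```python
from pathlib import Path

import sqlite_utils
from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt")

# Illustrative subset only; the real project used 737 'countries' linked to Wikidata.
COUNTRIES = {
    "France": "Q142",
    "Japan": "Q17",
    "Brazil": "Q155",
}

db = sqlite_utils.Database("gazetteer.db")

rows = []
for text_file in Path("loc_texts").glob("*.txt"):
    text = text_file.read_text(encoding="utf-8", errors="replace")
    for sentence in sent_tokenize(text):
        for country, wikidata_id in COUNTRIES.items():
            if country in sentence:
                rows.append(
                    {
                        "book_id": text_file.stem,  # assumes the filename carries the book identifier
                        "country": country,
                        "wikidata_id": wikidata_id,
                        "sentence": sentence.strip(),
                    }
                )

# Save everything to a SQLite table that Datasette can publish and search.
db["sentences"].insert_all(rows)
```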
What does 'access' mean, and how does it change?
What's possible? Does large-scale analysis need large-scale resources?
Did you have a specific research question?
More LoC examples in the GLAM Workbench
Mashing up LoC books with OCRd text from 26,762 Trove books
Disambiguation & data cleaning...
More links (to Wikidata and beyond)
If you had endless amounts of time and money...
Jupyter notebooks for data download and processing
Code repository for the app
Datasette instance (on Google Cloud Run)
App (on GitHub Pages; query sketch below)
Outputs
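The app's random-sentence feature presumably queries the Datasette instance's JSON API, which (when enabled) accepts read-only SQL. A Python sketch of that kind of request; the instance URL, database, table, and column names are placeholders.

```python
import requests

DATASETTE_URL = "https://example-datasette.a.run.app/gazetteer.json"  # placeholder instance & database

params = {
    "sql": "select book_id, country, sentence from sentences order by random() limit 1",
    "_shape": "array",  # ask Datasette for a plain JSON array of row objects
}

row = requests.get(DATASETTE_URL, params=params, timeout=30).json()[0]
print(f"{row['country']}: {row['sentence']}")
```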