ANU Archives

Sydney Stock Exchange

Tim Sherratt ・@wragge

These slides

https://slides.com/wragge/stock-exchange-project

Experimenting with

Jupyter
computer vision (OpenCV)
Cloudstor (Sync, WebDAV, rclone)
Transkribus (app & API)
Zooniverse (app & API)
Tesseract
Textract (API)

Challenges

Scale – 199 volumes, 70,000 TIFF files at 100 MB each
Structure – rows & columns, print & handwriting
Variability – changes in column numbers, handwriting, ink, paper

The same,

but different...

Two sets of data

page metadata (volumes, dates, sessions)
page content (print, handwriting, grid)

To make structured data

Compile metadata
Extract text (print & handwritten)
Preserve table grid

Text + grid

Extract grid from page features & use it to guide the OCR/HTR?
Use positional data from OCR/HTR to reconstruct grid?
Or a bit of both?
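
The second option, rebuilding rows from OCR/HTR positions, can be sketched roughly as below. This is only an illustration, not the project's code: the word dictionaries (`text`, `left`, `top`, `height`) and the `tolerance` value are assumed shapes and numbers, and real OCR output would need converting into this form first.

```python
def group_into_rows(words, tolerance=10):
    """Group word boxes into rows by comparing vertical centres.

    `words` is a list of dicts with hypothetical keys
    'text', 'left', 'top' and 'height' (pixel units).
    """
    def centre(word):
        return word["top"] + word["height"] / 2

    rows = []
    for word in sorted(words, key=centre):
        c = centre(word)
        if rows and abs(c - rows[-1]["centre"]) <= tolerance:
            row = rows[-1]
            row["words"].append(word)
            # Keep a running mean so a slightly skewed line still groups together
            row["centre"] += (c - row["centre"]) / len(row["words"])
        else:
            rows.append({"centre": c, "words": [word]})
    # Within each row, order the words left to right
    return [sorted(row["words"], key=lambda w: w["left"]) for row in rows]


sample = [
    {"text": "12/6", "left": 200, "top": 104, "height": 20},
    {"text": "Gas", "left": 10, "top": 150, "height": 20},
    {"text": "Bank", "left": 10, "top": 100, "height": 20},
]
rows = group_into_rows(sample)
row_texts = [[w["text"] for w in row] for row in rows]
```

A fixed pixel tolerance is the weak point here: it would probably need tuning per volume, given the variability in handwriting and layout noted earlier.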

Finding columns

page width

column widths

gutter
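
One simple way to estimate column positions is a vertical projection profile: count the dark pixels in each pixel column and treat the tall peaks as ruled lines. The sketch below is an assumption about how this could be done, not the method used in the project (which experimented with OpenCV). It expects a binarised page as a list of rows of 0/1 values, and `min_fraction` is an illustrative threshold.

```python
def find_vertical_lines(binary, min_fraction=0.5):
    """Return the x position of each ruled vertical line.

    `binary` is a list of rows, each a list of 0/1 values (1 = dark pixel).
    A pixel column counts as part of a line when dark pixels cover at
    least `min_fraction` of the page height.
    """
    height = len(binary)
    dark_per_column = [sum(column) for column in zip(*binary)]
    is_line = [count >= min_fraction * height for count in dark_per_column]

    lines = []
    start = None
    for x, flag in enumerate(is_line):
        if flag and start is None:
            start = x  # a run of line pixels begins
        elif not flag and start is not None:
            lines.append((start + x - 1) // 2)  # midpoint of the run
            start = None
    if start is not None:
        lines.append((start + len(is_line) - 1) // 2)
    return lines


def column_widths(lines):
    """Distances between neighbouring vertical lines."""
    return [right - left for left, right in zip(lines, lines[1:])]


# Tiny synthetic page: 100 rows by 60 columns with three ruled lines
page = [[0] * 60 for _ in range(100)]
for y in range(100):
    page[y][10] = page[y][11] = 1   # a two-pixel-wide line
    page[y][30] = 1
    if y < 80:
        page[y][55] = 1             # a line that stops short of the bottom

lines = find_vertical_lines(page)
widths = column_widths(lines)
```

On real scans the page would first need deskewing and thresholding, and handwriting that crosses the rules will add noise to the profile.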

Column widths

Volumes 1 – 100

Problems?

Using headers

session

date

Predicted pages

+ pages per day
- known holidays
= ?
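
The arithmetic on this slide can be sketched as a small function. The pages-per-weekday figures and the single holiday below are placeholders for illustration, not values from the project; the real numbers would come from the compiled page metadata and the NSW holiday list.

```python
from datetime import date, timedelta


def predicted_pages(start, end, pages_by_weekday, holidays=()):
    """Sum the expected pages for every day from start to end inclusive.

    `pages_by_weekday` maps date.weekday() values (Monday = 0) to an
    expected page count; missing weekdays count as zero. Days listed
    in `holidays` are skipped entirely.
    """
    holidays = set(holidays)
    total = 0
    day = start
    while day <= end:
        if day not in holidays:
            total += pages_by_weekday.get(day.weekday(), 0)
        day += timedelta(days=1)
    return total


# Hypothetical pattern: 3 pages Monday to Friday, 1 on Saturday, none on Sunday
pattern = {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 1}

with_holiday = predicted_pages(
    date(1901, 1, 1), date(1901, 1, 7), pattern, holidays=[date(1901, 1, 1)]
)
without_holiday = predicted_pages(date(1901, 1, 1), date(1901, 1, 7), pattern)
```

Comparing a prediction like this with the number of pages actually digitised gives a quick check for missing or duplicated pages.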

pretty close?

Checking dates

NSW holidays
1901-1950

https://glam-workbench.net/anu-archives/

Pages, sessions, & dates

https://glam-workbench.net/anu-archives/

Pages per day

https://glam-workbench.net/anu-archives/

Find pages by date

https://glam-workbench.net/anu-archives/

Tesseract OCR & rows

Textract OCR, HTR, & rows

Try Textract

Tesseract

Textract

$

Problems

the upside down
seeing too much
what's actually going on?

Line

Word

Extracting

structures

Rows & columns

Visualising activity

?

Datasette on GCloud

https://glam-workbench.net/anu-archives/

https://github.com/wragge/sydney-stock-exchange

More...

Stock exchange project

By Tim Sherratt


Historian and hacker. All the slide decks available here are licensed under a Creative Commons Attribution 4.0 International License. Feel free to reuse and share!