ANU Archives

Sydney Stock Exchange

These slides

Experimenting with

Challenges

  • Scale – 199 volumes, 70,000 TIFF files at 100MB each
  • Structure – rows & columns, print & handwriting
  • Variability – changes in column numbers, handwriting, ink, paper

The same, but different...

Two sets of data

  • page metadata (volumes, dates, sessions)
  • page content (print, handwriting, grid)

To make structured data

  • Compile metadata
  • Extract text (print & handwritten)
  • Preserve table grid

Text + grid

  • Extract grid from page features & use it to guide the OCR/HTR?
  • Use positional data from OCR/HTR to reconstruct grid?
  • Or a bit of both?
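A minimal sketch of the second option, rebuilding rows from positional data. It assumes the OCR/HTR output has already been reduced to (text, x, y) tuples per word; the function name, the tuple layout, and the tolerance value are hypothetical and not taken from the project.

```python
def group_rows(words, tolerance=10):
    """Group (text, x, y) word tuples into rows of text.

    Words whose y position is within `tolerance` pixels of the first
    word in the current row are treated as being on the same row.
    """
    rows = []
    for text, x, y in sorted(words, key=lambda w: w[2]):
        if rows and abs(y - rows[-1]["y"]) <= tolerance:
            rows[-1]["words"].append((x, text))
        else:
            # Start a new row anchored at this word's y position
            rows.append({"y": y, "words": [(x, text)]})
    # Within each row, order words left to right
    return [[text for _, text in sorted(row["words"])] for row in rows]


# Hypothetical sample data, not real page content
sample = [("Tin", 10, 100), ("5/6", 200, 103), ("BHP", 12, 150), ("40/-", 198, 148)]
rows = group_rows(sample)
```

A fixed tolerance is the simplest choice; skewed or warped pages would probably need something more forgiving.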

Finding columns

page width

column widths

gutter
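One way to compare column layouts across pages of different sizes is to express each column width as a fraction of the page width. The sketch below assumes the x positions of the vertical column lines have already been detected; the function name and the sample numbers are hypothetical.

```python
def column_widths(line_xs, page_width):
    """Return column widths as fractions of the page width.

    `line_xs` holds the x positions (in pixels) of detected vertical
    column lines; each adjacent pair bounds one column.
    """
    xs = sorted(line_xs)
    return [round((right - left) / page_width, 3) for left, right in zip(xs, xs[1:])]


# Hypothetical line positions on a 2000 pixel wide page
widths = column_widths([100, 600, 850, 1100], 2000)
```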

Column widths

Volumes 1 – 100

Problems?

Using headers

session

date

Predicted pages

  • + pages per day
  • - known holidays
  • = ?
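The arithmetic above can be sketched as follows. The function name, the pages-per-day figure, and the default of Monday to Friday trading are all assumptions made for illustration; the real trading calendar may differ.

```python
from datetime import date, timedelta


def predicted_pages(start, end, pages_per_day, holidays,
                    trading_weekdays=frozenset(range(5))):
    """Estimate the number of pages between two dates (inclusive).

    Counts `pages_per_day` for every day that falls on a trading
    weekday (assumed Monday=0 to Friday=4) and is not a known holiday.
    """
    total = 0
    day = start
    while day <= end:
        if day.weekday() in trading_weekdays and day not in holidays:
            total += pages_per_day
        day += timedelta(days=1)
    return total


# First week of 1901, with New Year's Day as the only holiday
# and a hypothetical 3 pages per trading day
estimate = predicted_pages(date(1901, 1, 1), date(1901, 1, 7), 3, {date(1901, 1, 1)})
```

Comparing this estimate with the actual page count for a volume is a quick way to spot missing pages or unexpected non-trading days.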

pretty close?

Checking dates

NSW holidays, 1901–1950

Pages, sessions, & dates

Pages per day

Find pages by date

Tesseract OCR & rows

Textract OCR, HTR, & rows

Try Textract

Tesseract

Textract

$

Problems

  • the upside down
  • seeing too much
  • what's actually going on?

Line

Word

Extracting structures

Rows & columns

Visualising activity

?

Datasette on GCloud

More...

Stock exchange project

By Tim Sherratt
