ANU Archives

Sydney Stock Exchange

These slides

Aims

  • Create a dataset that supports computational analysis
  • Extract content of pages while preserving the structure

Challenges

  • Scale – 199 volumes, 70,000 TIFF files of 100 MB each
  • Structure – rows & columns, print & handwriting
  • Variability – changes in column numbers, handwriting, ink, paper

The same, but different...

Two sets of data

  • page metadata (images, volumes, dates, sessions)
  • page content (print, handwriting, grid)

To make structured data

  • Compile metadata
  • Extract text (print & handwritten)
  • Reassemble table grid

Finding columns

page width

column widths

gutter
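The three measurements above can be combined to compare column layouts across pages of different sizes. The following is a minimal sketch only, not the project's actual method: the function name `column_widths`, the idea of passing in already-detected column boundaries, and the choice to subtract the gutter from the page width are all assumptions made for illustration.

```python
def column_widths(boundaries, page_width, gutter=0.0):
    """Return each column's width as a fraction of the usable page width.

    boundaries: x-positions (in pixels) of the vertical rules, including the
                left and right edges of the table (assumed to be detected
                by some earlier step).
    page_width: full width of the page image in pixels.
    gutter:     width of the binding margin in pixels (assumption: it is
                excluded so that left and right pages are comparable).
    """
    usable = page_width - gutter
    if usable <= 0:
        raise ValueError("gutter must be smaller than the page width")
    xs = sorted(boundaries)
    # Each adjacent pair of boundaries encloses one column.
    return [(right - left) / usable for left, right in zip(xs, xs[1:])]


# Hypothetical page: 1100 px wide, 100 px gutter, three columns.
widths = column_widths([100, 600, 800, 1000], page_width=1100, gutter=100)
```

Normalised widths like these could then be compared between pages to spot changes in the number or arrangement of columns.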

Column widths

Problems?

Dates & sessions

session

date

Checking dates

NSW holidays, 1901-1950
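One way to check page dates against a holiday list is to generate the days on which trading would be expected and compare them with the dates found in the page metadata. This is a sketch under stated assumptions: the function names are hypothetical, the holiday set is a single illustrative entry rather than the real NSW list, and treating only Sunday as a closed day is a guess exposed as a parameter.

```python
from datetime import date, timedelta


def expected_trading_days(start, end, holidays, closed_weekdays=(6,)):
    """List the dates from start to end (inclusive) that are neither
    holidays nor closed weekdays. Monday is 0, Sunday is 6."""
    days = []
    current = start
    while current <= end:
        if current.weekday() not in closed_weekdays and current not in holidays:
            days.append(current)
        current += timedelta(days=1)
    return days


def check_dates(page_dates, start, end, holidays):
    """Return (missing, unexpected): expected days with no pages, and
    page dates that fall on a holiday or closed day."""
    expected = set(expected_trading_days(start, end, holidays))
    found = set(page_dates)
    return sorted(expected - found), sorted(found - expected)


# Illustrative data only: one holiday and a week of hypothetical page dates.
holidays = {date(1901, 1, 1)}
page_dates = [date(1901, 1, d) for d in (2, 3, 5, 6, 7)]
missing, unexpected = check_dates(
    page_dates, date(1901, 1, 1), date(1901, 1, 7), holidays
)
```

Dates reported as missing or unexpected are candidates for manual checking rather than confirmed errors.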

Pages, sessions, & dates

Pages per day
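Counting pages per day is a simple grouping over the page metadata. The sketch below assumes each page record is a dictionary with hypothetical `date` and `session` fields; the real metadata schema may differ.

```python
from collections import Counter


def pages_per_day(pages):
    """Count how many page records share each date."""
    return Counter(page["date"] for page in pages)


def pages_per_session(pages):
    """Count page records for each (date, session) pair."""
    return Counter((page["date"], page["session"]) for page in pages)


# Hypothetical page metadata records.
pages = [
    {"date": "1901-01-02", "session": "morning"},
    {"date": "1901-01-02", "session": "morning"},
    {"date": "1901-01-02", "session": "afternoon"},
    {"date": "1901-01-03", "session": "morning"},
]
per_day = pages_per_day(pages)
per_session = pages_per_session(pages)
```

Days whose counts differ sharply from their neighbours are a natural place to look for missing or misdated pages.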

Extracting content & structure

Try Textract

Line

Word
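Amazon Textract returns its results as a list of blocks, each with a `BlockType` such as `LINE` or `WORD`, the recognised `Text`, and a bounding box. The sketch below separates the two block types and looks up the words belonging to a line. The function names are hypothetical and the sample response is hand-written for illustration, not real output from these pages.

```python
def split_blocks(response):
    """Separate a Textract response into LINE blocks and WORD blocks."""
    lines, words = [], []
    for block in response.get("Blocks", []):
        if block.get("BlockType") == "LINE":
            lines.append(block)
        elif block.get("BlockType") == "WORD":
            words.append(block)
    return lines, words


def words_in_line(line, words_by_id):
    """Return the text of the WORD blocks listed as children of a LINE."""
    ids = [
        child_id
        for rel in line.get("Relationships", [])
        if rel.get("Type") == "CHILD"
        for child_id in rel.get("Ids", [])
    ]
    return [words_by_id[i]["Text"] for i in ids if i in words_by_id]


# Hand-written sample in the shape of a Textract response (illustrative).
response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Bank of NSW",
         "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2", "w3"]}]},
        {"BlockType": "WORD", "Id": "w1", "Text": "Bank"},
        {"BlockType": "WORD", "Id": "w2", "Text": "of"},
        {"BlockType": "WORD", "Id": "w3", "Text": "NSW"},
    ]
}
lines, words = split_blocks(response)
words_by_id = {w["Id"]: w for w in words}
line_words = words_in_line(lines[0], words_by_id)
```

Working with words rather than lines matters for tables, because a single detected line can run across several columns.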

Extracting structures

Rows & columns
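Once words have positions, they can be placed back into a grid: rows by grouping words with similar vertical positions, columns by comparing each word's horizontal centre with the column boundaries. This is a minimal sketch, assuming normalised coordinates (0 to 1) and hypothetical field names `text`, `left`, `top`, and `width`. The tolerance value is arbitrary and the approach is illustrative, not necessarily the one used in the project.

```python
from bisect import bisect_right


def assign_column(x_centre, boundaries):
    """Return the index of the column whose boundaries enclose x_centre.
    Values outside the table are clamped to the first or last column."""
    index = bisect_right(boundaries, x_centre) - 1
    return min(max(index, 0), len(boundaries) - 2)


def group_rows(words, tolerance=0.01):
    """Group words into rows: a word joins the current row if its top is
    within `tolerance` of the previous word's top (sorted by top)."""
    rows = []
    for word in sorted(words, key=lambda w: w["top"]):
        if rows and abs(word["top"] - rows[-1][-1]["top"]) <= tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return rows


def build_grid(words, boundaries, tolerance=0.01):
    """Return a list of rows, each a dict mapping column index to text."""
    grid = []
    for row in group_rows(words, tolerance):
        cells = {}
        for word in sorted(row, key=lambda w: w["left"]):
            col = assign_column(word["left"] + word["width"] / 2, boundaries)
            cells.setdefault(col, []).append(word["text"])
        grid.append({col: " ".join(texts) for col, texts in sorted(cells.items())})
    return grid


# Hypothetical words and column boundaries.
boundaries = [0.0, 0.5, 0.75, 1.0]
words = [
    {"text": "Bank", "left": 0.05, "top": 0.100, "width": 0.10},
    {"text": "NSW", "left": 0.20, "top": 0.102, "width": 0.10},
    {"text": "45", "left": 0.55, "top": 0.101, "width": 0.05},
    {"text": "Gas", "left": 0.05, "top": 0.150, "width": 0.10},
    {"text": "12", "left": 0.80, "top": 0.151, "width": 0.05},
]
grid = build_grid(words, boundaries)
```

A fixed tolerance is fragile on skewed or handwritten pages, so a real pipeline would probably need to adapt it per page.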

Visualising activity

?

Datasette on GCloud

But beware!

  • interface under development
  • many problems with the data
  • things will change!

Next steps...

More...