Research Data Management for Humanists

Raw Data

Easy way - Tropy

  • Released late 2017
  • Same folks as Zotero and Omeka
  • Out-of-the-box management of archival photographs

Hard way

  • OCR
  • Automatic classification/metadata creation (topic models?)
  • Browse/full text search interface
  • Test collection: ~16,000 jpegs, chronologically ordered

OCR - Tesseract

  • Version 4 much better than Version 3
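A minimal sketch of batch OCR over a folder of JPEGs, assuming the `tesseract` command-line tool is installed and on the PATH. The folder names, the `eng` language choice, and the helper names `build_command` and `ocr_collection` are illustrative assumptions, not part of the original workflow.

```python
import subprocess
from pathlib import Path


def build_command(image: Path, out_dir: Path, lang: str = "eng") -> list[str]:
    """Build the tesseract call for one image.

    Tesseract takes an input image and an output *base* name and
    appends ".txt" itself, so the base carries no extension.
    """
    out_base = out_dir / image.stem
    return ["tesseract", str(image), str(out_base), "-l", lang]


def ocr_collection(image_dir: Path, out_dir: Path, lang: str = "eng") -> int:
    """Run tesseract over every .jpg in image_dir, in filename order.

    Returns the number of images processed. Sorting by name keeps the
    output in the same (chronological) order as the scans, assuming the
    filenames sort that way.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for image in sorted(image_dir.glob("*.jpg")):
        subprocess.run(build_command(image, out_dir, lang), check=True)
        count += 1
    return count
```

Calling `ocr_collection(Path("scans"), Path("ocr"))` would leave one `.txt` file per image in `ocr/`, ready for topic modelling or indexing.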

OCR Preprocessing

ScanTailor

  • GUI or command line
  • Autocrops and handles book images
  • Unrotates
  • Converts to B&W
  • Not maintained :(

Topic Modelling - Mallet

OCR quality pretty low

But, topic modelling actually kind of worked

0    0.5    tre whe pus pur wor ant war hhe art ane pue ore oun thy ami oat ter tie wat ame
1    0.5    chinese year china labour malaya work opium malay states tin years number state perak selangor immigration rubber report coolies price
2    0.5    tho tha wore thoe lhe party local whioh wan tang thore kmt min kuo aro ard amt thoy aml bean
3    0.5    time place house day made small road side river miles sea men island people country back found man water town
4    0.5    tae und tne tie long white car black red tre sort top ani thin open trees blue aad men snd
5    0.5    chinese time good day evening club dinner home morning man people house round amoy night party left bit didn't sunday
6    0.5    chinese government governor ordinance council state dated sir straits singapore subject settlements despatches despatch law time report present reply made
7    0.5    societies police year society secret members chinese report annual persons penang men cases singapore gang banished crime number reports trouble
8    0.5    banishment july reply june china tan singapore reports april chinese feb jan lim list sept dec colony governor nov life
9    0.5    letter time don't i'm dad mother betty week good things letters i've home give write dear book days love office
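Some of the topics above are clearly OCR noise rather than subject matter. A rough sketch of how such topics could be flagged automatically: parse each line as topic id, weight, and top words, then measure the share of very short words. The short-word heuristic, the 0.5 threshold, and all function names are assumptions for illustration, not something Mallet provides.

```python
from typing import Iterable, NamedTuple


class Topic(NamedTuple):
    topic_id: int
    weight: float
    words: list[str]


def parse_topic_line(line: str) -> Topic:
    """Parse one line: topic id, weight, then whitespace-separated top words."""
    parts = line.split()
    return Topic(int(parts[0]), float(parts[1]), parts[2:])


def short_word_share(topic: Topic, max_len: int = 3) -> float:
    """Fraction of a topic's top words that are max_len characters or fewer."""
    if not topic.words:
        return 0.0
    short = sum(1 for word in topic.words if len(word) <= max_len)
    return short / len(topic.words)


def flag_noisy_topics(lines: Iterable[str], threshold: float = 0.5) -> list[int]:
    """Return ids of topics whose short-word share exceeds the threshold.

    Hypothetical heuristic: badly OCR'd text tends to break into short
    junk tokens, so topics dominated by them are probably noise.
    """
    flagged = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between topics
        topic = parse_topic_line(line)
        if short_word_share(topic) > threshold:
            flagged.append(topic.topic_id)
    return flagged
```

On the ten topics listed above, a 0.5 threshold should single out topics 0, 2, and 4, which are the ones that read as OCR debris. It is a crude filter; a dictionary lookup would likely be more reliable.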

Needs

  • Automatic OCR Enhancement
  • Index, full-text search for collection of .txt files
    • fuzzy search esp. important
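As a rough illustration of the second need, here is a small inverted index over plain-text files with fuzzy term matching, using only the Python standard library (`difflib`). The function names, the 0.75 similarity cutoff, and the `ocr` folder in the usage note are assumptions; a real collection of this size would probably call for a dedicated search engine.

```python
import difflib
import re
from pathlib import Path


def load_texts(folder: Path) -> dict[str, str]:
    """Read every .txt file in folder into a {filename: text} mapping."""
    return {
        path.name: path.read_text(encoding="utf-8", errors="replace")
        for path in sorted(folder.glob("*.txt"))
    }


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercased word to the set of documents containing it."""
    index: dict[str, set[str]] = {}
    for name, text in docs.items():
        for word in set(re.findall(r"[a-z]+", text.lower())):
            index.setdefault(word, set()).add(name)
    return index


def fuzzy_search(index: dict[str, set[str]], query: str,
                 cutoff: float = 0.75, max_terms: int = 5) -> list[str]:
    """Return documents containing a term similar to the query.

    difflib.get_close_matches compares the query against every indexed
    word, so OCR misreadings such as "governmont" can still match
    "government".
    """
    terms = difflib.get_close_matches(
        query.lower(), list(index), n=max_terms, cutoff=cutoff
    )
    hits: set[str] = set()
    for term in terms:
        hits |= index[term]
    return sorted(hits)
```

For example, `fuzzy_search(build_index(load_texts(Path("ocr"))), "government")` would list the files containing that word or a near miss. Lowering `cutoff` catches worse OCR errors at the cost of more false matches.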

Links

Slides: slides.com/jdingle/c4l18n

By jdingle