Research Data Management for Humanists

Raw Data

Easy way - Tropy
- Released late 2017
- Same folks as Zotero and Omeka
- Out of box management of archival photographs
Hard way
- OCR
- Automatic classification/metadata creation (topic models?)
- Browse/full text search interface
- Test collection: ~16,000 JPEGs, chronologically ordered
OCR - Tesseract
- Version 4 much better than Version 3
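A minimal sketch of batch OCR over a folder of photographs, assuming the `tesseract` command-line program (version 4) is installed and on the PATH. The folder layout, the `.jpg` extension and the function names are illustrative assumptions, not the workflow actually used for the test collection.

```python
import subprocess
from pathlib import Path


def tesseract_command(image: Path, out_dir: Path, lang: str = "eng") -> list[str]:
    """Build the Tesseract CLI call for one image.

    Tesseract appends ".txt" to the output base on its own.
    """
    out_base = out_dir / image.stem
    # --oem 1 asks for the LSTM engine that arrived with Tesseract 4.
    return ["tesseract", str(image), str(out_base), "-l", lang, "--oem", "1"]


def ocr_folder(image_dir: Path, out_dir: Path) -> int:
    """OCR every .jpg in image_dir in filename order; return how many were processed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    images = sorted(image_dir.glob("*.jpg"))  # assumed extension
    for image in images:
        # check=True stops the batch on the first failed page.
        subprocess.run(tesseract_command(image, out_dir), check=True)
    return len(images)
```

Sorting by filename keeps the chronological order of the collection, provided the filenames sort that way.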
OCR Preprocessing


ScanTailor
- GUI or command line
- Autocrops and handles book images
- Unrotates
- Convert to B&W
- Not maintained :(
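To illustrate what "Convert to B&W" involves, here is a sketch of Otsu thresholding in plain Python. This is a generic textbook technique swapped in for illustration, not ScanTailor's code, and it assumes the image has already been loaded as a flat list of 0-255 grey values.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0      # number of pixels at or below the candidate threshold
    sum_bg = 0    # sum of their grey values
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (unnormalised); larger means cleaner separation.
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]
```

A single global threshold like this struggles with unevenly lit archive photographs, which is one reason dedicated tools are attractive.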
Topic Modelling - Mallet
OCR quality pretty low
But, topic modelling actually kind of worked
0 0.5 tre whe pus pur wor ant war hhe art ane pue ore oun thy ami oat ter tie wat ame
1 0.5 chinese year china labour malaya work opium malay states tin years number state perak selangor immigration rubber report coolies price
2 0.5 tho tha wore thoe lhe party local whioh wan tang thore kmt min kuo aro ard amt thoy aml bean
3 0.5 time place house day made small road side river miles sea men island people country back found man water town
4 0.5 tae und tne tie long white car black red tre sort top ani thin open trees blue aad men snd
5 0.5 chinese time good day evening club dinner home morning man people house round amoy night party left bit didn't sunday
6 0.5 chinese government governor ordinance council state dated sir straits singapore subject settlements despatches despatch law time report present reply made
7 0.5 societies police year society secret members chinese report annual persons penang men cases singapore gang banished crime number reports trouble
8 0.5 banishment july reply june china tan singapore reports april chinese feb jan lim list sept dec colony governor nov life
9 0.5 letter time don't i'm dad mother betty week good things letters i've home give write dear book days love office
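Some topics above are clearly OCR noise while others are coherent. One rough way to tell them apart automatically is sketched below: parse the topic-keys lines (topic id, weight, top words) and measure what share of each topic's words appears in a word list. The `vocabulary` argument and both function names are hypothetical; any English word list could stand in.

```python
def parse_topic_keys(text):
    """Parse lines of 'id weight word word ...' into (id, weight, words) tuples."""
    topics = []
    for line in text.splitlines():
        parts = line.split(None, 2)  # split on any whitespace, at most twice
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        topic_id, weight, words = parts
        topics.append((int(topic_id), float(weight), words.split()))
    return topics


def known_word_share(words, vocabulary):
    """Fraction of a topic's top words found in the given vocabulary."""
    if not words:
        return 0.0
    return sum(word in vocabulary for word in words) / len(words)
```

Topics with a low share are candidates for "OCR garbage" and can be ignored when generating metadata.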
Needs
- Automatic OCR Enhancement
- Index, full-text search for collection of .txt files
- fuzzy search esp. important
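A small sketch of what such an index could look like, using only the Python standard library: an inverted index over the .txt files plus `difflib` for fuzzy term matching, so that a query can still find OCR misreadings. The function names and the 0.8 cutoff are assumptions for illustration; a real collection of this size would probably want a dedicated search engine.

```python
import difflib
import re
from collections import defaultdict
from pathlib import Path

TOKEN = re.compile(r"[a-z]+")


def load_texts(folder):
    """Read every .txt file in folder into a {filename: text} dict."""
    return {
        path.name: path.read_text(encoding="utf-8", errors="replace")
        for path in sorted(Path(folder).glob("*.txt"))
    }


def build_index(texts):
    """Map each lowercase token to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in texts.items():
        for token in TOKEN.findall(text.lower()):
            index[token].add(name)
    return index


def fuzzy_search(index, query, cutoff=0.8):
    """Return documents containing a token similar to the query word."""
    # get_close_matches ranks indexed tokens by similarity to the query.
    matches = difflib.get_close_matches(query.lower(), list(index), n=10, cutoff=cutoff)
    hits = set()
    for token in matches:
        hits |= index[token]
    return sorted(hits)
```

Lowering the cutoff catches more OCR errors at the cost of more false hits.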
Links
Slides: slides.com/jdingle/c4l18n
By jdingle