Analyzing Bulk OCR Results Among Mixed Typed and Handwritten Documents

Tommy Keswick
Caltech Library

Problems

Errant HTML in OCR text

HTML in Search Results

Problems

Errant HTML in OCR text
Bias towards typed documents in search results

Approaches

analyze existing Tesseract results
- run statistics with dictionary words
- throw out junk OCR
generate better OCR/HTR with new tools
- Google Cloud Vision
- Microsoft Azure Computer Vision

Results

custom analysis
- plotted dictionary results
- spot-checked graphs
- decided on some thresholds
online services
- early exploration
- custom software

Script Data

Ignore Percentage

Dictionary Percentage (of total)

Dictionary Percentage (of non-ignored)
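
The three plotted measures could be computed roughly as in the sketch below. This is an illustration only, not the code from the ocr-plotting repository: the rule for what counts as an "ignored" token (fewer than three characters, or containing anything other than letters), the tiny `DICTIONARY` set, and the names `is_ignored` and `ocr_stats` are all assumptions made for the example.

```python
# Hypothetical stand-in for a real word list loaded from a dictionary file.
DICTIONARY = {"the", "letter", "was", "sent", "from", "pasadena"}


def is_ignored(token):
    """Assumed rule: ignore tokens shorter than 3 characters or not purely alphabetic."""
    return len(token) < 3 or not token.isalpha()


def pct(part, whole):
    """Percentage of part in whole, with 0.0 for an empty whole."""
    return 100.0 * part / whole if whole else 0.0


def ocr_stats(text, dictionary):
    """Return ignore and dictionary percentages for one page of OCR text."""
    tokens = text.split()
    total = len(tokens)
    kept = [t.lower() for t in tokens if not is_ignored(t)]
    hits = sum(1 for t in kept if t in dictionary)
    return {
        "ignore_pct": pct(total - len(kept), total),
        "dict_pct_total": pct(hits, total),
        "dict_pct_non_ignored": pct(hits, len(kept)),
    }


stats = ocr_stats("The letter was sent from Pasadena in 1923 ~~ xq", DICTIONARY)
print(stats)
```

A real run would also need to strip punctuation from tokens and use a full word list. Both are left out to keep the sketch short.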

Thresholds

dictionary percentage of non-ignored
- > 55%
dictionary percentage of total
- > 35%
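
A minimal sketch of applying these thresholds to per-page statistics follows. The slide does not say whether the two thresholds are combined with "and" or "or". This example assumes a page must pass both. The page names and percentages are invented for illustration and are not project results.

```python
# Thresholds from the slide, in percent.
NON_IGNORED_THRESHOLD = 55.0
TOTAL_THRESHOLD = 35.0


def is_usable_ocr(dict_pct_non_ignored, dict_pct_total):
    """Assumption: OCR is kept only when BOTH percentages exceed their thresholds."""
    return (
        dict_pct_non_ignored > NON_IGNORED_THRESHOLD
        and dict_pct_total > TOTAL_THRESHOLD
    )


# Hypothetical pages: (dictionary % of non-ignored, dictionary % of total).
pages = {
    "page_001": (82.0, 61.5),
    "page_002": (40.2, 12.9),
    "page_003": (58.0, 30.0),
}

usable = [
    name
    for name, (non_ignored, total) in pages.items()
    if is_usable_ocr(non_ignored, total)
]
print(usable)
```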

Online Services

https://hale.archives.caltech.edu/islandora/object/hale:50433/datastream/JPG/view

Next Steps

better text analysis
rerun HTR
convert JSON to hOCR
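
The JSON-to-hOCR step might look something like the sketch below. The input shape (a list of objects with `text` and `bbox` keys) is a hypothetical simplification and not the actual response format of Google Cloud Vision or Azure Computer Vision. The function name `words_to_hocr` is likewise made up. The output is only an hOCR page fragment using the standard `ocr_page` and `ocrx_word` classes with `bbox` in the `title` attribute, not a complete hOCR file.

```python
import html
import json


def words_to_hocr(words, page_width, page_height):
    """Build a minimal hOCR page fragment from word boxes (assumed input shape)."""
    spans = []
    for i, word in enumerate(words, start=1):
        x0, y0, x1, y1 = word["bbox"]
        # Escape the recognized text so stray markup cannot leak into the output.
        text = html.escape(word["text"])
        spans.append(
            f"<span class='ocrx_word' id='word_1_{i}' "
            f"title='bbox {x0} {y0} {x1} {y1}'>{text}</span>"
        )
    body = " ".join(spans)
    return (
        f"<div class='ocr_page' id='page_1' "
        f"title='bbox 0 0 {page_width} {page_height}'>{body}</div>"
    )


# Hypothetical word-level JSON, already reduced to text plus a pixel bounding box.
sample = json.loads(
    '[{"text": "Dear", "bbox": [10, 20, 60, 40]},'
    ' {"text": "R&D", "bbox": [70, 20, 120, 40]}]'
)
hocr = words_to_hocr(sample, 800, 1000)
print(hocr)
```

Escaping the text on the way out also guards against the errant-HTML problem noted earlier.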

Collaboration

tkeswick@caltech.edu
https://github.com/caltechlibrary/ocr-plotting
https://github.com/caltechlibrary/ocre
https://github.com/caltechlibrary/handprint
