Dr. Philipp Zumstein
Mannheim University Library
2017-03-15 Social Science Data Lab, Mannheim
Slides are Open Access, reuse them as
(this does not necessarily cover all the pictures; see individual attributions)
online data
collected data
other data
Copyright User (2013-06): Text and Data Mining (Original Illustration by Davide Bonazzi) http://copyrightuser.org/topics/text-and-data-mining/
§§
?
time
methods
Digitization, OCR, Structuring
(infrastructure for research)
A1-Scanner for newspapers etc.
V-Scanner for rare, old, fragile books
Infrastructure digitization of printed material:
Expertise:
Ancien Droit: digitizing 800 books from the 17th/18th century from the collection of Desbillon with focus on the history of the "Ancien Droit" https://digi.bib.uni-mannheim.de/
Aktienführer I+II: digitizing the annually published books "Aktienführer", extracting the data into a database https://digi.bib.uni-mannheim.de/aktienfuehrer/
Reichsanzeiger: German newspaper (government gazette) from 1819 to 1945 https://digi.bib.uni-mannheim.de/periodika/reichsanzeiger/
LOC-DB: open, distributed infrastructure for cataloguing of citations https://locdb.bib.uni-mannheim.de/
Infolis I+II: connect research data and publications, text mining scientific articles, integration into different retrieval systems http://infolis.github.io/
Baierer, Konstantin; Zumstein, Philipp (2016). Verbesserung der OCR in digitalen Sammlungen von Bibliotheken. 027.7 Zeitschrift für Bibliothekskultur / Journal for Library Culture, v. 4, n. 2, p. 72-83. https://doi.org/10.12685/027.7-4-2-155
deskew
dewarp
binarize
denoise
despeckle
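Of the preprocessing steps listed above, binarization is the easiest to illustrate in a few lines. The following is a minimal sketch, assuming Otsu's global threshold (the slides do not name a specific method) and a tiny hypothetical grayscale image given as nested lists instead of a real scan:

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximizes the between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate threshold
    sum_dark = 0  # sum of their gray values
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map every pixel to 0 (ink) or 255 (paper) using one global threshold."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[0 if p <= t else 255 for p in row] for row in image]


# Hypothetical 3x4 grayscale patch: dark stroke on light paper.
image = [
    [250, 245, 30, 240],
    [248, 20, 25, 243],
    [251, 247, 35, 239],
]
binary = binarize(image)
```

A global threshold is only a starting point: unevenly lit or stained pages usually need a locally adaptive method.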
a) character-based recognition
b) line-based recognition
"ē" : 88%
"é" : 85%
"e" : 73%
"c" : 71%
...
LSTM
"mit Weglassung solcher Verse"
Screenshot from PoCoTo used as CC-BY-SA published in:
non-words
=
possible errors
words from the dictionary
=
possible corrections
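The idea above (non-words are possible errors, dictionary words are possible corrections) can be sketched with Python's standard library. This is an illustration only, not how PoCoTo works internally; the word list, the misread token "Weglassnng" and the similarity cutoff are assumptions made for the example:

```python
import difflib
import re


def suggest_corrections(text, dictionary, cutoff=0.75):
    """Return {non-word: [close dictionary words]} for every token not in the dictionary."""
    vocab = sorted({w.lower() for w in dictionary})
    known = set(vocab)
    suggestions = {}
    # Tokens are runs of letters (Unicode-aware), so umlauts are kept intact.
    for token in re.findall(r"[^\W\d_]+", text):
        word = token.lower()
        if word in known:
            continue  # a dictionary word: not flagged as a possible error
        # Non-word: a possible OCR error; look up similar dictionary words.
        suggestions[token] = difflib.get_close_matches(word, vocab, n=3, cutoff=cutoff)
    return suggestions


# Hypothetical dictionary and OCR output with one misread word.
dictionary = ["mit", "Weglassung", "solcher", "Verse"]
ocr_text = "mit Weglassnng solcher Verse"
result = suggest_corrections(ocr_text, dictionary)
```

Note the limits of this approach: a misreading that happens to be a valid word is never flagged, and correct words missing from the dictionary (names, historical spellings) are flagged wrongly.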
Advice: Judge the errors with regard to your application (fuzzy search, topic modeling, extracting exact numbers)
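One way to put a number on OCR errors before judging them is the character error rate (CER): the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The metric is an addition here, not taken from the slides, and the misread sample string is hypothetical:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


reference = "mit Weglassung solcher Verse"
ocr_output = "mit Weglassnng solcher Verfe"  # hypothetical misreadings: u->n, s->f
cer = character_error_rate(reference, ocr_output)
```

The same CER can be harmless for fuzzy search or topic modeling and unacceptable when exact numbers are extracted, which is why the figure has to be read against the application.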
ABBYY FineReader
e.g. FineReader Engine 11 CLI for Linux (on one server/PC): 120'000 pages / year for 999 EUR
Tesseract
OCRopus
etc.
tesseract input.jpg output \
-l eng+deu \
--oem 1 --psm 7 \
hocr
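For batch processing, the Tesseract call above can be wrapped in a small script. A minimal sketch, assuming the `tesseract` binary is on the PATH; the function names are hypothetical, and the flags are exactly those from the command above:

```python
import shutil
import subprocess


def tesseract_hocr_command(image, out_base, langs=("eng", "deu"), oem=1, psm=7):
    """Build the argument list for the Tesseract call shown above."""
    return [
        "tesseract", image, out_base,
        "-l", "+".join(langs),  # several language models are joined with '+'
        "--oem", str(oem),      # 1 = LSTM engine only
        "--psm", str(psm),      # 7 = treat the image as a single text line
        "hocr",                 # config name: write hOCR output
    ]


def run_tesseract(image, out_base):
    """Run Tesseract and return the path of the hOCR file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    subprocess.run(tesseract_hocr_command(image, out_base), check=True)
    return out_base + ".hocr"


command = tesseract_hocr_command("input.jpg", "output")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.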
abbyyocr11 -rl German \
-if input.jpg \
-f PDF -of output.pdf
./ocropus-nlbin tests/ersch.png -o ersch
./ocropus-gpageseg ersch/*.bin.png
./ocropus-rpred ersch/*/*.bin.png \
-m models/fraktur.pyrnn.gz
./ocropus-hocr ersch/*.bin.png
Example hOCR file (other OCR formats: ALTO, PAGE XML, ABBYY XML, TEI, GCV):
...
<p class='ocr_par' lang='deu' title="bbox930">
<span class='ocr_line' title="bbox 348 797 1482 838; baseline -0.009 -6">
<span class='ocrx_word' title='bbox 348 805 402 832; x_wconf 93'>Die</span>
<span class='ocrx_word' title='bbox 421 804 697 832; x_wconf 90'>Darlehenssumme</span>
<span class='ocrx_word' title='bbox 717 803 755 831; x_wconf 96'>ist</span>
<span class='ocrx_word' title='bbox 773 803 802 831; x_wconf 96'>in</span>
<span class='ocrx_word' title='bbox 821 803 917 830; x_wconf 96'>ihrem</span>
<span class='ocrx_word' title='bbox 935 799 1180 838; x_wconf 95'>ursprünglichen</span>
<span class='ocrx_word' title='bbox 1199 797 1343 832; x_wconf 95'>Umfange</span>
<span class='ocrx_word' title='bbox 1362 805 1399 823; x_wconf 95'>zu</span>
<span class='ocrx_word' title='bbox 1417 x_wconf 96'>ver-</span>
</span>
...
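The word-level information in such an hOCR file (text, bounding box, confidence) can be read with Python's standard library. A minimal sketch: the sample reuses three words from the excerpt above, while the class name and the confidence threshold of 92 are assumptions chosen for illustration:

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect text, bounding box and confidence of every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_word":
            title = attrs.get("title", "")
            bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            conf = re.search(r"x_wconf (\d+)", title)
            self._current = {
                "text": "",
                # Incomplete title attributes yield None instead of a guess.
                "bbox": tuple(int(v) for v in bbox.groups()) if bbox else None,
                "conf": int(conf.group(1)) if conf else None,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


sample = """<span class='ocr_line' title="bbox 348 797 1482 838; baseline -0.009 -6">
<span class='ocrx_word' title='bbox 348 805 402 832; x_wconf 93'>Die</span>
<span class='ocrx_word' title='bbox 421 804 697 832; x_wconf 90'>Darlehenssumme</span>
<span class='ocrx_word' title='bbox 717 803 755 831; x_wconf 96'>ist</span>
</span>"""

parser = HocrWordParser()
parser.feed(sample)
words = parser.words

# Words below the (assumed) threshold are candidates for manual review.
low_confidence = [w["text"] for w in words if w["conf"] is not None and w["conf"] < 92]
```

The bounding boxes make it possible to highlight a search hit on the page image, and the confidence values help to prioritize manual correction.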
(*) The quality here is not yet optimal, but it shows the possibilities of the tools and data around OCR.
OCRopus run-test executes nlbin, gpageseg, rpred