Building Infrastructure for Data-Driven Research

Dr. Philipp Zumstein

Mannheim University Library

2017-03-15 Social Science Data Lab, Mannheim

Slides are Open Access, reuse them as

(this does not necessarily cover all the pictures; see individual attributions)

https://github.com/SocialScienceDataLab/building-infrastructure-for-data-driven-research

Overview

Data-driven Research
Building Infrastructure
OCR Workflow
OCR Software
Applications

Data-driven Research

online data

collected data

other data

Copyright User (2013-06): Text and Data Mining (Original Illustration by Davide Bonazzi) http://copyrightuser.org/topics/text-and-data-mining/

§§

time

methods

What do you do with images containing text or printed books/newspapers as input?

Digitization, OCR, Structuring

(infrastructure for research)

Building Infrastructure

Science Support from Library

Infrastructure for Scanning

A1-Scanner for newspapers etc.

V-Scanner for rare, old, fragile books

Digitization, "Data-ization"

Infrastructure for digitizing printed material:

V-scanner
A1-scanner
A2-scanner
A3-scanner
conservation checks and fixes

Expertise:

scanning workflow
(manual) double-key-methods
automatic text recognition (OCR)
digitizing microfiche, microfilm
extracting information from CDs to a database
structuring information
metadata formats

Infrastructure Projects

Ancien Droit: digitizing 800 books from the 17th/18th century from the collection of Desbillon with focus on the history of the "Ancien Droit" https://digi.bib.uni-mannheim.de/

Aktienführer I+II: digitizing the annually published books "Aktienführer", extracting the data into a database https://digi.bib.uni-mannheim.de/aktienfuehrer/

Reichsanzeiger: German newspaper (government gazette) from 1819 to 1945 https://digi.bib.uni-mannheim.de/periodika/reichsanzeiger/

LOC-DB: open, distributed infrastructure for cataloguing of citations https://locdb.bib.uni-mannheim.de/

Infolis I+II: connect research data and publications, text mining scientific articles, integration into different retrieval systems http://infolis.github.io/

OCR Workflow

Workflow of OCR-Process

Baierer, Konstantin; Zumstein, Philipp (2016). Verbesserung der OCR in digitalen Sammlungen von Bibliotheken. 027.7 Zeitschrift für Bibliothekskultur / Journal for Library Culture, v. 4, n. 2, p. 72-83. https://doi.org/10.12685/027.7-4-2-155

Image Processing

deskew

dewarp

binarize

denoise

despeckle
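As an illustration of the "binarize" step, here is a minimal pure-Python sketch of global thresholding with Otsu's method. It is not the method of any particular OCR package mentioned in this talk; the function names and the tiny sample image are made up for the example, and real tools usually apply more robust, locally adaptive variants.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that best separates dark from light pixels."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with gray value <= t
    sum0 = 0  # sum of the gray values of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance (up to a constant factor)
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map a grayscale image (list of rows) to 1 = ink, 0 = background."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]


# Hypothetical 3x3 scan: dark stroke on the right, light paper on the left
scan = [
    [250, 245, 30],
    [240, 20, 25],
    [255, 250, 35],
]
print(binarize(scan))
```

The threshold is chosen from the image histogram alone, so no manual parameter is needed, but uneven lighting or stained paper can defeat a single global threshold.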

Layout Analysis

text vs. image classification
header, footer, headings
multi-columns, reading order
line recognition
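One simple way to approach line recognition on a binarized page is a horizontal projection profile: rows that contain ink belong to a text line, empty rows separate lines. The sketch below assumes an already deskewed image given as rows of 0/1 values; `find_text_lines` and `min_ink` are names chosen for this example, and production layout analysis is considerably more involved (columns, images, reading order).

```python
def find_text_lines(binary, min_ink=1):
    """Return (top, bottom) row indices for each run of rows containing ink."""
    lines = []
    start = None
    for y, row in enumerate(binary):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # the current text line ends
            start = None
    if start is not None:                  # page ends inside a text line
        lines.append((start, len(binary) - 1))
    return lines


# Hypothetical page: two text lines separated by two empty rows
page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
    [1, 0, 1],
]
print(find_text_lines(page))
```

Raising `min_ink` makes the split less sensitive to isolated specks, which is one reason denoising usually comes before layout analysis.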

Text Recognition

a) character-based recognition

"ē" : 88%
"é" : 85%
"e" : 73%
"c" : 71%
...

b) line-based recognition (LSTM)

"mit Weglassung solcher Verse"

Computational Linguistics Methods

dictionary
bigrams, trigrams, etc. for letters and words
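To make the idea of letter bigrams concrete, the following sketch trains a tiny bigram model on a word list and scores candidate readings; an implausible letter sequence gets a lower score. The training words, the `alphabet_size` smoothing constant and the function names are assumptions for this illustration, not part of any OCR engine named in this talk.

```python
import math
from collections import Counter


def train_bigrams(words):
    """Count letter bigrams; ^ and $ mark the start and end of a word."""
    pair_counts, first_counts = Counter(), Counter()
    for word in words:
        padded = "^" + word + "$"
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return pair_counts, first_counts


def log_score(word, model, alphabet_size=30):
    """Average log-probability per bigram, with add-one smoothing."""
    pair_counts, first_counts = model
    padded = "^" + word + "$"
    pairs = list(zip(padded, padded[1:]))
    total = 0.0
    for a, b in pairs:
        p = (pair_counts[(a, b)] + 1) / (first_counts[a] + alphabet_size)
        total += math.log(p)
    return total / len(pairs)


# Toy training data; a real model would be trained on a large corpus
model = train_bigrams(["mit", "weglassung", "solcher", "verse"])
for candidate in ["verse", "vcrse"]:
    print(candidate, round(log_score(candidate, model), 3))
```

With this toy model "verse" scores higher than "vcrse", so the language model can help choose between visually similar candidates such as "e" and "c".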

Screenshot from PoCoTo used as CC-BY-SA published in:

CIS München (2016): Abschlussbericht zum Projekt "Ausbau und Erweiterung eines Open-Source-Tools zur Nachkorrektur historischer OCR-erfasster Texte" der CLARIN-D Facharbeitsgruppe 4-3 “Klassische Philologie”

non-words

possible errors

words from the dictionary

possible corrections
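The dictionary-based view (non-words as possible errors, dictionary words as possible corrections) can be sketched with Python's standard `difflib` module. This is only an illustration of the principle and not how PoCoTo works internally; the word list, the sample tokens and the `cutoff` value are assumptions.

```python
import difflib


def suggest_corrections(tokens, dictionary, n=3, cutoff=0.7):
    """Map each token that is not in the dictionary to similar dictionary words."""
    vocabulary = sorted(dictionary)
    suggestions = {}
    for token in tokens:
        key = token.lower()
        if key in dictionary:
            continue  # known word, not flagged
        suggestions[token] = difflib.get_close_matches(
            key, vocabulary, n=n, cutoff=cutoff
        )
    return suggestions


# Toy dictionary and a made-up OCR output with "s" misread as "f"
dictionary = {"mit", "weglassung", "solcher", "verse"}
ocr_tokens = ["mit", "Weglaffung", "solcher", "Verfe"]
print(suggest_corrections(ocr_tokens, dictionary))
```

A non-word is only a possible error: names, historical spellings and rare words are missing from most dictionaries, so suggestions like these are best reviewed rather than applied blindly.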

Recognition Errors

OCR results have errors
errors can occur in each step
- scanning errors
- segmentation/layout errors
- recognition errors
- errors in dictionaries
- untrained characters

Advice: Judge the errors with regard to your application (fuzzy search, topic modeling, extracting exact numbers)
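One common way to quantify recognition errors is the character error rate (CER): the edit distance between the OCR output and a manually corrected ground truth, divided by the length of the ground truth. The sketch below is a plain Levenshtein implementation; the sample OCR string is made up for illustration.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(
                prev[j] + 1,         # delete ca
                cur[j - 1] + 1,      # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = cur
    return prev[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance relative to the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)


truth = "mit Weglassung solcher Verse"
ocr = "mit Weglaffung folcher Verfe"  # made-up OCR output
print(f"CER: {character_error_rate(ocr, truth):.1%}")
```

Whether a given CER is acceptable depends on the application: fuzzy search tolerates far more errors than the extraction of exact numbers.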

OCR Software

Commercial OCR Software

Open Source OCR Software

ABBYY FineReader

e.g. FineReader Engine 11 CLI for Linux (on one server/pc): 120'000 pages / year for 999 EUR

Tesseract

started in 1985 at HP Labs
open source since 2005
supported by Google since 2006

OCRopus

started 2007
founded and maintained by Prof. Breuel (DFKI, Google, Nvidia)

etc.


ABBYY FineReader

abbyyocr11 -rl German \
    -if input.jpg \
    -f PDF -of output.pdf

Normally good results
Closed source, limited options to change behaviour
Strong emphasis on language-dependent dictionaries

Tesseract

tesseract input.jpg output \
    -l eng+deu \
    --oem 1 --psm 7 \
    hocr

Until 2016 character-based text recognition only, now also neural-network-based text recognition
Less emphasis on language-dependent dictionaries
github.com/tesseract-ocr/tesseract, included in Linux distributions
For Windows: github.com/UB-Mannheim/tesseract/wiki
For R: github.com/ropensci/tesseract
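For batch processing, the Tesseract call from this section can be wrapped in a small script. The sketch below only uses the options that appear in the command above (`-l`, `--oem`, `--psm`, `hocr`); the function names are made up, and it assumes a `tesseract` executable on the PATH.

```python
import shutil
import subprocess


def build_tesseract_command(image, output_base, languages, oem=1, psm=7, fmt="hocr"):
    """Build the argument list for a Tesseract call."""
    return [
        "tesseract", image, output_base,
        "-l", "+".join(languages),  # e.g. eng+deu
        "--oem", str(oem),          # 1 = neural network (LSTM) engine only
        "--psm", str(psm),          # 7 = treat the image as a single text line
        fmt,                        # output configuration, here hOCR
    ]


def run_tesseract(image, output_base, languages):
    """Run Tesseract and return the name of the hOCR file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image, output_base, languages), check=True)
    return output_base + ".hocr"


print(" ".join(build_tesseract_command("input.jpg", "output", ["eng", "deu"])))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. For full pages, a different `--psm` value than 7 is usually more appropriate.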

OCRopus

neural network algorithm since 2013
training is key feature
different models for scripts (not languages)
no dictionary
modular scripts (Unix philosophy)


./ocropus-nlbin tests/ersch.png

./ocropus-gpageseg ersch/*.bin.png

./ocropus-rpred ersch/*/*.bin.png \
    -m models/fraktur.pyrnn.gz

./ocropus-hocr ersch/*.bin.png

OCR Fileformats

recognized text
position of the words, lines, characters (bounding boxes)
confidence values
text direction, recognized language, formats, ...

Other OCR formats: ALTO, PAGE XML, ABBYY XML, TEI, GCV

https://github.com/UB-Mannheim/ocr-fileformat

e.g. hOCR file:

...
<p class='ocr_par' lang='deu' title="bbox930">
  <span class='ocr_line' title="bbox 348 797 1482 838; baseline -0.009 -6">
    <span class='ocrx_word' title='bbox 348 805 402 832; x_wconf 93'>Die</span> 
    <span class='ocrx_word' title='bbox 421 804 697 832; x_wconf 90'>Darlehenssumme</span> 
    <span class='ocrx_word' title='bbox 717 803 755 831; x_wconf 96'>ist</span> 
    <span class='ocrx_word' title='bbox 773 803 802 831; x_wconf 96'>in</span> 
    <span class='ocrx_word' title='bbox 821 803 917 830; x_wconf 96'>ihrem</span> 
    <span class='ocrx_word' title='bbox 935 799 1180 838; x_wconf 95'>ursprünglichen</span> 
    <span class='ocrx_word' title='bbox 1199 797 1343 832; x_wconf 95'>Umfange</span> 
    <span class='ocrx_word' title='bbox 1362 805 1399 823; x_wconf 95'>zu</span> 
    <span class='ocrx_word' title='bbox 1417 x_wconf 96'>ver-</span> 
  </span>
  ...
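Because hOCR is HTML with the OCR information stored in `class` and `title` attributes, the words, bounding boxes and confidence values can be read with Python's standard library. The following is a minimal sketch, not a complete hOCR reader; the class and function names are chosen for this example, and incomplete `title` values yield `None` instead of a guess.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect text, bounding box and confidence of every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            title = attrs.get("title") or ""
            bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            conf = re.search(r"x_wconf (\d+)", title)
            self._current = {
                "text": "",
                "bbox": tuple(int(v) for v in bbox.groups()) if bbox else None,
                "conf": int(conf.group(1)) if conf else None,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


def parse_hocr_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


sample = """
<span class='ocr_line' title="bbox 348 797 1482 838; baseline -0.009 -6">
  <span class='ocrx_word' title='bbox 348 805 402 832; x_wconf 93'>Die</span>
  <span class='ocrx_word' title='bbox 421 804 697 832; x_wconf 90'>Darlehenssumme</span>
</span>
"""
for word in parse_hocr_words(sample):
    print(word["text"], word["bbox"], word["conf"])
```

Filtering on the confidence value (for example, listing all words below some threshold) is a simple way to direct manual correction to the most doubtful spots.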

Applications

-> View this query online

Ngram Viewer (Google Books)

Number of Women on the Supervisory Boards of DAX-30 Companies 1979-1999

Go to the "Aktienführer Datenarchiv" and there to "Export"
Increase the number of results to 50, search for "DAX", click on "select all visible" (38 results)
Adjust the year range
Select the category "Supervisory Board"
Export the CSV data
Open in Excel, mark the female names
Finally make a pivot table
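The last two steps (marking names and building a pivot table) can also be scripted. The sketch below assumes the export has been saved as CSV and that a column with the manual marking has been added; the column names `year`, `name` and `female` and the sample rows are hypothetical and do not reflect the actual export format of the Aktienführer Datenarchiv.

```python
import csv
import io
from collections import Counter


def count_marked_per_year(csv_text, year_col="year", flag_col="female"):
    """Count the rows marked with 1 in flag_col, grouped by year."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row[flag_col].strip() == "1":
            counts[int(row[year_col])] += 1
    return dict(sorted(counts.items()))


# Made-up rows with placeholder names, only to show the expected shape
export = """year,name,female
1979,Person A,0
1979,Person B,1
1980,Person C,1
1980,Person D,1
1980,Person E,0
"""
print(count_marked_per_year(export))
```

The result is the same kind of year-by-count table that the Excel pivot produces, and the script can be rerun whenever the marking is corrected.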

Number and Age of German Voters in the 1989 European Election

Go to digizeitschriften.de and then to the Statistisches Jahrbuch für die Bundesrepublik Deutschland 1990
Download the pdf of the chapter "Wahlen" starting from page 76
Open the PDF in PDF-XChange Viewer, run OCR and save it (or use one of the alternatives mentioned earlier)
Download Tabula http://tabula.technology/, install it and run it
Open the PDF in Tabula, select the table and extract the data as CSV

(*) The quality here is not yet optimal, but it shows the possibilities of the tools and data around OCR.

Number of German Emigrants from 1870 until 1880

Go to the Reichsanzeiger
Search for "Auswanderer"
Be lucky
Go to the result

Discussion, Questions?

OCRopus run-test executes nlbin, gpageseg, rpred

List of Images

Slide 1: https://pixabay.com/de/hong-kong-stadt-st%C3%A4dtischen-1990268/ (CC0)
Slide 3.2: Copyright User (2013-06): Text and Data Mining (Original Illustration by Davide Bonazzi) http://copyrightuser.org/topics/text-and-data-mining/ (CC-BY)
Slide 3.4: Copyright User (2013-06): Text and Data Mining (Original Illustration by Davide Bonazzi) http://copyrightuser.org/topics/text-and-data-mining/ (CC-BY), https://pixabay.com/de/b%C3%BCcher-stapel-bildung-lesung-41930/ (CC0), https://pixabay.com/de/zeitung-artikel-zeitschrift-154444/ (CC0)
Slide 4.3: The two images of our scanners are made by the Mannheim University Library 2017 (can be used as CC-BY)
Slide 5.2: Baierer, Konstantin; Zumstein, Philipp (2016). Verbesserung der OCR in digitalen Sammlungen von Bibliotheken. 027.7 Zeitschrift für Bibliothekskultur / Journal for Library Culture, v. 4, n. 2, p. 72-83. https://doi.org/10.12685/027.7-4-2-155 (CC-BY)
Slide 5.3 and 5.4: Images created for this talk (CC0)
Slide 5.5: LSTM http://www.asimovinstitute.org/neural-network-zoo/ (CC0)
Slide 5.6: Screenshot from PoCoTo (CC-BY-SA) published in: CIS München (2016): Abschlussbericht zum Projekt "Ausbau und Erweiterung eines Open-Source-Tools zur Nachkorrektur historischer OCR-erfasster Texte" der CLARIN-D Facharbeitsgruppe 4-3 “Klassische Philologie”
Slide 5.7: Baierer, Konstantin; Zumstein, Philipp (2016). Verbesserung der OCR in digitalen Sammlungen von Bibliotheken. 027.7 Zeitschrift für Bibliothekskultur / Journal for Library Culture, v. 4, n. 2, p. 72-83. https://doi.org/10.12685/027.7-4-2-155 (CC-BY)
Slide 6.2: https://pixabay.com/de/beutel-geld-reichtum-einnahmen-147782/ (CC0), https://pixabay.com/de/quell-offene-software-offene-software-1518247/ (CC0), https://pixabay.com/de/sicher-metall-metallischen-ger%C3%A4t-298244/ (CC0)
Several logos and screenshots