Master's Thesis
Author: Bc. Marián Skrip
Supervisor: Mgr. Juraj Holas
Processing of administrative documents using machine learning
Goals
Design and develop a system that separates documents during bulk scanning of administrative documents, reducing the number of manual steps that traditional methods of digitizing paper documents require
Results
Promising Flags
- CWSSIM
- Page number
- Page look
CWSSIM flag
- single-page documents
- long documents
- biased towards over-segmenting
Confusion matrix for CWSSIM flag
Page number flag
- single-page documents -> no page number
- mainly starting pages correct
Confusion matrix for Page number flag
Page look flag
- non-analytical
- validation accuracy 85.3%
- not precise real-world predictions
Epoch accuracy of PageLookFlag during training
Confusion matrix for Page look flag
Unsuccessful Flags
- Barcode flag - sparse useful barcodes
- Header / Footer flag - adds nothing over the CWSSIM flag
Confusion matrix for Barcode flag
Master model
Master model accuracy during training on base dataset
Master model accuracy during training on even dataset
- Final accuracy of 81.25%
- Not usable in production
- Lower accuracy than the separate flags
Conclusion
- Not deployable for production (cost-to-savings ratio)
- Promising usability (more flags)
- Easily scalable
Designed system
Flags
- Even page flag
- Barcode flag
- Page number flag
- Page similarity flag
- Logo flag
- etc.
- is first page
- is last page
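The "is first page" output is what turns per-page predictions into separated documents. A minimal sketch of that last step, assuming the system produces one boolean "is first page" prediction per scanned page, in scan order. The function name `split_into_documents` and the list-of-booleans input are illustrative assumptions, not taken from the thesis:

```python
from typing import List


def split_into_documents(is_first_page: List[bool]) -> List[List[int]]:
    """Group page indices into documents from per-page "is first page" predictions.

    Hypothetical helper: a new document starts at every page predicted to be
    a first page. The very first scanned page always opens a document, even
    if its prediction is False.
    """
    documents: List[List[int]] = []
    for page_index, starts_document in enumerate(is_first_page):
        if starts_document or not documents:
            documents.append([page_index])
        else:
            documents[-1].append(page_index)
    return documents


# Five scanned pages; pages 0, 2 and 3 are predicted to start a document.
print(split_into_documents([True, False, True, True, False]))
# [[0, 1], [2], [3, 4]]
```

In this sketch a false positive splits one document in two, and a false negative merges two documents into one.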
Master model
- simple NN
- combining data from flags
- dynamic design
- false-positives elimination
- training on real-world data
Dataset
- in-house dataset of 23561 real-world documents (39895 pages)
- uneven data distribution
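The slide does not say along which dimension the data is uneven. One imbalance follows from the quoted counts alone: every document has exactly one first page, so the counts fix both the average document length and the share of pages labelled as first pages. A small arithmetic sketch, with helper names that are purely illustrative:

```python
# Figures quoted on the slide: 23561 documents, 39895 pages.
NUM_DOCUMENTS = 23561
NUM_PAGES = 39895


def average_pages_per_document(num_documents: int, num_pages: int) -> float:
    """Mean document length in pages."""
    return num_pages / num_documents


def first_page_share(num_documents: int, num_pages: int) -> float:
    """Fraction of pages that are first pages (exactly one per document)."""
    return num_documents / num_pages


print(f"{average_pages_per_document(NUM_DOCUMENTS, NUM_PAGES):.2f} pages per document")
# 1.69 pages per document
print(f"{first_page_share(NUM_DOCUMENTS, NUM_PAGES):.1%} of pages are first pages")
# 59.1% of pages are first pages
```

An average this close to one page suggests the dataset is dominated by short documents.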
Other design choices
- configurability
- scalability
- usable in production
- logging
- caching
Technology
- Python
- TensorFlow & Keras
- Tesseract OCR
- zbar
Applicable methods
Thank you
Opponent's review
1. Network architecture
Page look flag
- loss - Categorical Crossentropy
Master model
- loss - Binary Crossentropy
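To show how the two losses named here differ, the sketch below gives plain-Python versions of their standard definitions. These are not the Keras implementations, and the example targets and predictions are made up for illustration. Binary cross-entropy scores each output as an independent yes/no decision. Categorical cross-entropy scores a single choice among mutually exclusive classes.

```python
import math
from typing import Sequence


def binary_crossentropy(y_true: Sequence[float], y_pred: Sequence[float],
                        eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over independent yes/no outputs."""
    total = 0.0
    for target, prob in zip(y_true, y_pred):
        prob = min(max(prob, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))
    return total / len(y_true)


def categorical_crossentropy(y_true: Sequence[float], y_pred: Sequence[float],
                             eps: float = 1e-7) -> float:
    """Cross-entropy for one sample: one-hot target, class probabilities summing to 1."""
    return -sum(target * math.log(max(prob, eps))
                for target, prob in zip(y_true, y_pred))


# Two independent outputs; the second output's target is 0.
print(binary_crossentropy([1.0, 0.0], [0.9, 0.2]))
# One sample, three mutually exclusive classes; the true class is index 1.
print(categorical_crossentropy([0.0, 1.0, 0.0], [0.1, 0.7, 0.2]))
```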
Opponent's review
2. Metrics
Master model (base dataset, is_last)
- Precision: 58.14%
- Recall: 50%
- F1: 53.76%
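The F1 value above is the harmonic mean of the reported precision and recall. A short check of that arithmetic (the function name is illustrative):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the master model (base dataset, is_last).
print(f"{f1_score(0.5814, 0.50):.2%}")
# 53.76%
```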