Master's Thesis

Author: Bc. Marián Skrip

Supervisor: Mgr. Juraj Holas

Processing of administrative documents using machine learning

Goals

Design and develop a system that separates individual documents during bulk scanning of administrative documents, reducing the number of manual steps required by traditional methods of digitizing paper documents.

Results

Promising Flags

  • CWSSIM
  • Page number
  • Page look
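The CWSSIM flag compares the structure of neighbouring pages. As a rough illustration only, the sketch below computes plain single-window SSIM over two equally sized grayscale pages given as flat lists of 0-255 intensities. It is not the complex-wavelet variant (CW-SSIM) named above; the function name `global_ssim` is hypothetical and the constants follow the commonly used SSIM defaults.

```python
def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM between two equally sized grayscale images.

    x and y are flat sequences of pixel intensities. This is plain SSIM
    computed over the whole image, not CW-SSIM.
    """
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and equally sized")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Stabilising constants (assumed defaults: K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)) / (
        (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    )
```

A score near 1 suggests two pages share a layout; a low score between consecutive pages could hint at a document boundary, subject to a threshold that would have to be tuned on real data.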

CWSSIM flag

  • single-page documents
  • long documents
  • biased towards over-segmenting

Confusion matrix for CWSSIM flag

Page number flag

  • single-page documents -> no page number
  • mainly starting pages correct

Confusion matrix for Page number flag

Page look flag

  • non-analytical
  • validation accuracy 85.3%
  • not precise real-world predictions

Epoch accuracy of PageLookFlag during training

Confusion matrix for Page look flag

Unsuccessful Flags

  • Barcode flag - useful barcodes are too sparse
  • Header / Footer flag - no improvement over the CWSSIM flag

Confusion matrix for Barcode flag

Master model

Master model accuracy during training on base dataset

Master model accuracy during training on even dataset

  • Final accuracy of 81.25%
  • Not usable in production
  • Lower accuracy than separate flags

Conclusion

  • Not deployable for production (cost-to-savings ratio)
  • Promising usability (more flags)
  • Easily scalable

Designed system

Flags

  • Even page flag
  • Barcode flag
  • Page number flag
  • Page similarity flag
  • Logo flag
  • etc.

Outputs

  • is first page
  • is last page
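To make the idea of a flag concrete, here is a minimal sketch of what a page number flag might look like, assuming the OCR text of a page is already available as a string. The pattern, the function name `page_number_flag`, and the returned dictionary are all hypothetical and are not taken from the thesis.

```python
import re

# Hypothetical pattern: matches "Page 2 of 5", "page 2/5", or a bare "2 / 5".
PAGE_PATTERN = re.compile(r"(?:page\s+)?(\d+)\s*(?:of|/)\s*(\d+)", re.IGNORECASE)


def page_number_flag(ocr_text):
    """Return first/last-page hints from a printed page number, or None.

    None means the flag has no opinion, e.g. a page without a page number.
    """
    match = PAGE_PATTERN.search(ocr_text)
    if match is None:
        return None
    current, total = int(match.group(1)), int(match.group(2))
    if current < 1 or current > total:
        return None  # implausible numbering, treat as no signal
    return {"is_first": current == 1, "is_last": current == total}
```

A pattern this loose would also match dates such as "12/2020", so a real flag would likely need stricter rules or a positional check (for example, only looking at the header and footer regions).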

Master model

  • simple NN
  • combining data from flags
  • dynamic design
  • false-positives elimination
  • training on real-world data
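The bullets above describe a simple neural network that combines the outputs of the flags. A minimal sketch of such a forward pass, in plain Python, is given below. The layer sizes, the ReLU hidden activation, the two sigmoid outputs, and the parameter values in `EXAMPLE_PARAMS` are assumptions made for illustration; they are untrained and do not reproduce the thesis model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W @ inputs + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def master_model(flag_values, params):
    """Combine flag outputs into two boundary probabilities."""
    hidden = dense(flag_values, params["w1"], params["b1"], lambda z: max(0.0, z))
    is_first, is_last = dense(hidden, params["w2"], params["b2"], sigmoid)
    return {"is_first": is_first, "is_last": is_last}


# Hypothetical, untrained parameters: three flag inputs, two hidden units.
EXAMPLE_PARAMS = {
    "w1": [[1.0, -1.0, 0.5], [-0.5, 1.0, 1.0]],
    "b1": [0.0, 0.0],
    "w2": [[2.0, -1.0], [-1.0, 2.0]],
    "b2": [-0.5, -0.5],
}
```

Because `dense` only depends on the shapes of the weight lists, adding a flag amounts to adding one column to `w1`, which is one possible reading of the "dynamic design" bullet.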

Dataset

  • in-house dataset of 23561 real-world documents (39895 pages)
  • uneven data distribution
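The dataset counts above already imply some basic statistics. The short calculation below derives them; it only uses the two reported numbers and the fact that every document has exactly one first page, and it says nothing about how the unevenness was measured in the thesis.

```python
documents = 23561
pages = 39895

# Average document length in pages (about 1.69).
pages_per_document = pages / documents

# Every document contributes exactly one first page, so this is the share
# of pages that start a document (about 0.59).
first_page_share = documents / pages
```

An average well below two pages per document suggests that short documents dominate, which would fit the "uneven data distribution" bullet.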

Other design choices

  • configurability
  • scalability
  • usable in production
    • logging
    • caching

Technology

Applicable methods

  • Image thresholding
  • OCR
  • SSIM / CWSSIM
  • Levenshtein distance
  • Keyword-matching
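Two of the methods listed above combine naturally: Levenshtein distance can make keyword matching tolerant of OCR errors. The sketch below shows the standard dynamic-programming edit distance and a hypothetical helper, `contains_keyword`, that accepts words within a small distance of the keyword. The helper name and the default threshold are assumptions.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def contains_keyword(ocr_words, keyword, max_distance=1):
    """True if any OCR word is within max_distance edits of the keyword."""
    keyword = keyword.lower()
    return any(
        levenshtein(word.lower(), keyword) <= max_distance for word in ocr_words
    )
```

For example, the OCR misreading "lnvoice" is one substitution away from "invoice" and would still count as a match.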

Thank you

Opponent's review

1. Network architecture

Page look flag

Master model

  • loss - Categorical Crossentropy
  • loss - Binary Crossentropy
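For reference, the two losses named above can be written out in a few lines. This is a plain-Python sketch of the standard definitions, with a small clipping constant `eps` added as an assumption to avoid taking the logarithm of zero; it makes no claim about which network in the thesis uses which loss.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over independent 0/1 targets."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy for one one-hot target and one probability vector."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))
```

Binary cross-entropy treats each output as its own yes/no question, which suits independent outputs; categorical cross-entropy assumes exactly one class is correct.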

Opponent's review

2. Metrics

Master model (base dataset, is_last)

  • Precision: 58.14%
  • Recall: 50%
  • F1: 53.76%
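The reported F1 follows from the precision and recall above, since F1 is their harmonic mean. A short check, with `f1_score` as a helper name chosen here for illustration:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision 58.14% and recall 50% from the slide, as fractions.
reported_f1 = f1_score(0.5814, 0.50)  # about 0.5376, i.e. 53.76%
```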

Master's Thesis Defense

By Marián Skrip