Szabo Roland Teodor

UBB

 A Novel Machine Learning Based Approach for Retrieving Information from Receipt Images

Our Aims


  • investigate machine learning algorithms for creating an Optical Character Recognition engine tailored for receipts
  • use that OCR engine to create an application that simplifies personal finance management
  • compare to other OCR engines

Problem Statement


OCR




Best free and open-source OCR engine: Tesseract
A Mathematical Theory of Communication, C.E. Shannon

The recent development of various methods of modulation such as PCM and PPM which exchange bandwidth for signal-to-noise ratio has ....




Steps:
  • document layout analysis
  • character segmentation
  • character recognition
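
As an illustration of the first step only, the sketch below splits a binarized image into text lines with a horizontal projection profile. This is a common textbook technique and an assumption here: the slides do not say how layout analysis was actually done. The function name `split_rows` and the toy image are hypothetical.

```python
def split_rows(image):
    """Return (start, end) row ranges that contain ink; end is exclusive.

    `image` is a list of rows, each a list of 0 (background) or 1 (ink).
    """
    has_ink = [any(row) for row in image]
    ranges, start = [], None
    for y, ink in enumerate(has_ink):
        if ink and start is None:
            start = y                    # a text line begins
        elif not ink and start is not None:
            ranges.append((start, y))    # a text line ends at a blank row
            start = None
    if start is not None:                # ink reaches the bottom edge
        ranges.append((start, len(image)))
    return ranges


# Hypothetical 5x4 binary image with two text lines separated by a blank row.
image = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
text_lines = split_rows(image)
print(text_lines)
```

Each returned range could then be passed on to character segmentation and recognition.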

Our Model

Using raw pixel data, no feature extraction
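
A minimal sketch of what "raw pixel data" could mean in practice: the grayscale patch is flattened row by row into one feature vector, with no hand-crafted features. Scaling to the range 0..1 is an assumption, as are the name `to_feature_vector` and the toy patch.

```python
def to_feature_vector(patch):
    """Flatten a 2D grayscale patch (values 0..255) into a flat list.

    Dividing by 255 rescales each pixel to 0..1; this scaling is an
    assumption, the slides only state that raw pixels were used.
    """
    return [pixel / 255.0 for row in patch for pixel in row]


# Hypothetical 2x3 patch.
patch = [
    [0, 255, 0],
    [255, 0, 255],
]
vector = to_feature_vector(patch)
print(len(vector), vector)
```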

Random forests for character segmentation
  • tried between 150 and 250 trees
  • number of features used between 8 and 120
Support Vector Machine for character recognition
  • linear and RBF kernel
  • regularization from 0.01 to 10000 on a log scale
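
The search space above can be written out explicitly. This is a sketch under assumptions: "regularization" is taken to be the SVM penalty parameter (called `C` in scikit-learn's `SVC`), the log scale is sampled once per power of ten, and the step sizes for trees and features are hypothetical, since the slides give only the ranges.

```python
from itertools import product

# SVM: both kernels, regularization from 0.01 to 10000 on a log scale.
# One value per decade is an assumption; only the range is given.
C_VALUES = [10.0 ** k for k in range(-2, 5)]
KERNELS = ["linear", "rbf"]
svm_grid = [{"kernel": k, "C": c} for k, c in product(KERNELS, C_VALUES)]

# Random forest: 150 to 250 trees, 8 to 120 features per split.
# The individual steps below are hypothetical.
TREE_COUNTS = [150, 200, 250]
FEATURE_COUNTS = [8, 16, 32, 64, 120]
forest_grid = [
    {"n_trees": n, "n_features": f}
    for n, f in product(TREE_COUNTS, FEATURE_COUNTS)
]

print(len(svm_grid), "SVM configurations")
print(len(forest_grid), "random forest configurations")
```

Each dictionary would then be used to train one model and score it on the validation set.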

Dataset

  • 20 receipts were annotated by hand
  • ~7000 characters
  • 74 classes (different characters)

Built Using:



All three are free, open-source libraries for scientific computing, machine learning and computer vision

Results

Character Recognition

  • best accuracy with RBF kernel and regularization of 100: 91.01% on validation set
    • regularization is a must for RBF - using a value of 0.01 led to an accuracy of 9.16%
    • it matters less for linear kernel - accuracy between 70% and 89.5% 
Most Common Misclassifications

  • , and . - 40 times
  • O and 0 - 39 times
  • 1, l and I - 14 times
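
Counts like these can be obtained by tallying unordered (true, predicted) pairs over the validation set. The sketch below shows one way to do it; the function name `confused_pairs` and the toy labels are hypothetical, not the actual data.

```python
from collections import Counter


def confused_pairs(true_labels, predicted_labels):
    """Count misclassifications per unordered pair of characters."""
    pairs = Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        if truth != guess:
            # Sort so that (O, 0) and (0, O) land in the same bucket.
            pairs[tuple(sorted((truth, guess)))] += 1
    return pairs


# Hypothetical labels, for illustration only.
true_labels = ["O", "0", "O", ".", ",", "1", "A"]
predicted = ["0", "O", "0", ",", ".", "l", "A"]

pairs = confused_pairs(true_labels, predicted)
for pair, count in pairs.most_common():
    print(pair, count)
```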


     

Character Segmentation

  • best F1 score of 87.936% on validation set
  • varying the number of trees or features changed the score by < 0.1%
  • consistent with theory established by Leo Breiman
Confusion Matrix

                      No split    Split
Predicted no split        4556      363
Predicted split            255     2546


The model is more specific than sensitive
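
The statement about specificity and sensitivity can be checked against the four counts of the confusion matrix. The sketch assumes that "split" is the positive class and that the column labels are the actual classes.

```python
# Counts from the confusion matrix; "split" is assumed to be the positive class.
true_negatives = 4556   # actual no split, predicted no split
false_negatives = 363   # actual split, predicted no split
false_positives = 255   # actual no split, predicted split
true_positives = 2546   # actual split, predicted split

# Sensitivity: share of actual splits that were found.
sensitivity = true_positives / (true_positives + false_negatives)
# Specificity: share of actual non-splits that were left alone.
specificity = true_negatives / (true_negatives + false_positives)

print(f"sensitivity = {sensitivity:.3f}")
print(f"specificity = {specificity:.3f}")
```

Under these assumptions specificity comes out higher than sensitivity, so the model misses a split more often than it invents one.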

Comparison to Other Results

  • recognition results obtained by others on MNIST are better
    • SVM with RBF kernel - 98.6% accuracy
    • state-of-the-art - 99.7%
    • 10x bigger dataset
    • 7x fewer classes
  • segmentation results for license plates are better
    • 96% accuracy
    • fixed number of letters to be segmented
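
The dataset gap is easier to see per class. The short calculation below uses the approximate receipt figures from the Dataset slide and the standard MNIST size of 70,000 images in 10 classes, which is not stated on the slides.

```python
# Receipt dataset (approximate figures from the Dataset slide).
receipt_samples = 7000
receipt_classes = 74

# MNIST: 70,000 images (training plus test), 10 digit classes.
mnist_samples = 70000
mnist_classes = 10

receipt_per_class = receipt_samples / receipt_classes
mnist_per_class = mnist_samples / mnist_classes

print(f"receipts: {receipt_per_class:.1f} samples per class")
print(f"MNIST:    {mnist_per_class:.1f} samples per class")
print(f"ratio:    {mnist_per_class / receipt_per_class:.1f}x")
```

Receipt characters are also unevenly distributed across classes, so rare characters have far fewer examples than this average suggests.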
Tesseract

    S.C. HRTIHH 5.9.
    ELUJ NHPUCH. STR. BUCEGI. NR. 19

    9.9.1. 99 11735629
    RUN
    1.999 x 3.19
    BRTISTE N9Z.CLHS|C3S 3.19 9
    1.999 x 2.59
    STICKLETTI CHRTUF 99 2.59 9
    9.399 x 7.99
    HHNDHRINE 3.12 H
    9.446 x 7.99
    99511 3.56 9
    1.999 x 11.29
    SHLHH C959 USCHT 299 11.29 9
    SUBTUTHL 23.5?
    SUBTUTRL _____ ‘-29:99
    TUTHL 23.57

ReceiptBudget

    S.C.   ARTIMA  S.A.
    CLUJMAPOCA, STR. BUCEGI, NR.19
    C  .U.1,R O1   1735628
    RON
    1.000  x  3,19
    BATISTE NAZ.CLASIC3S 3,19%
    1,000x   2,50
    STICNLETTI  CARTOF 80 2,50 A
    0,390  x  7,99
    MANDARINE   3,12A
    0,446  x  7,99
    R0SI 3,56  A
    1.000  x 11,20
    SALANIASAOSCAT  290  11,20A
    SUBTOTAL   23,57
    SUBTOTAL 23,57
    TOTAL   27,57
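
One way to put a number on such a comparison is the edit distance between each engine's output and the true text. The sketch below does this for the final line of both outputs. The ground truth `TOTAL 23,57` is an assumption inferred from the subtotal lines, since the slides do not include the reference text, and the helper names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalize(line):
    """Collapse runs of whitespace so spacing is not counted as an error."""
    return " ".join(line.split())


# Assumed ground truth; not given on the slides.
truth = "TOTAL 23,57"
tesseract_line = "TUTHL 23.57"
receiptbudget_line = "TOTAL   27,57"

tesseract_errors = levenshtein(normalize(tesseract_line), truth)
receiptbudget_errors = levenshtein(normalize(receiptbudget_line), truth)
print("Tesseract:", tesseract_errors, "ReceiptBudget:", receiptbudget_errors)
```

Summing such distances over all lines and dividing by the length of the reference text would give a character error rate for each engine.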

The Application

  • ReceiptBudget has an interactive dashboard
  • The goal is to get some insight into spending patterns

The Map


The Charts


Built Using




Conclusion

  • OCR results are better than those obtained by using off-the-shelf components
  • but there is still some work to do
    • deep learning shows promise
    • need a larger dataset
  • I saved some money by looking at those graphs :)

Questions?

SCSS

By rolisz