Szabo Roland Teodor

UBB

A Novel Machine Learning Based Approach for Retrieving Information from Receipt Images

OuR Aims

investigate machine learning algorithms for creating an Optical Character Recognition engine tailored for receipts
use that OCR engine to create an application that simplifies personal finance manangement
compare to other OCR engines

Problem Statement

OCR

Best free and open source OCR engine: Tesseract

A Mathematical Theory of Communication, C.E. Shannon

The recent development of various methods of modulation such as PCM and PPM which exchange bandwith for signal-to-noise ratio has ....

Steps:

document layout analysis
character segmentation
character recognition

Our Model

Using raw pixel data, no feature extraction

Random forests for character segmentation

tried between 150 and 250 trees
number of features used between 8 and 120

Suppor Vector Machine for character recognition

linear and RBF kernel
regularization from 0.01 to 10000 on a log scale

Dataset

20 receipts were annotated by hand
~7000 characters
74 classes (different characters)

BUILT USING:

All three free, open-source libraries for scientific computing, machine learning and computer vision

results

Character Recognition

best accuracy with RBF kernel and regularization of 100: 91.01% on validation set

regularization is a must for RBF - using a value of 0.01 lead to accuracy of 9.16%
it matters less for linear kernel - accuracy between 70% and 89.5%

Most common missclassifications

, and . - 40 times
O and 0 - 39 times
1, l and I - 14 times

Character segmentation

best F1 score of 87.936% on validation set
number of trees or features influenced < 0.1%
consistent with theory established by Leo Breiman

Confusion Matrix

No split

Split

Predicted no split

4556

363

Predicted split

255

2546

Model is more specific, rather than sensitive

Comparison To other REsults

recognition results obtained by others on MNIST are better

SVM with RBF kernel - 98.6% accuracy
state-of-the-art - 99.7%
10x bigger dataset
7x fewer classes

segmentation results for license plate are better

96% accuracy
fixed number of letters to be segmented

Tesseract

ReceiptBudget

S.C. HRTIHH 5.9.
ELUJ NHPUCH. STR. BUCEGI. NR. 19

9.9.1. 99 11735629
RUN
1.999 x 3.19
BRTISTE N9Z.CLHS|C3S 3.19 9
1.999 x 2.59
STICKLETTI CHRTUF 99 2.59 9
9.399 x 7.99
HHNDHRINE 3.12 H
9.446 x 7.99
99511 3.56 9
1.999 x 11.29
SHLHH C959 USCHT 299 11.29 9
SUBTUTHL 23.5?
SUBTUTRL _____ ‘-29:99
TUTHL 23.57

S.C.   ARTIMA  S.A.
CLUJMAPOCA, STR. BUCEGI, NR.19
C  .U.1,R O1   1735628
RON
1.000  x  3,19
BATISTE NAZ.CLASIC3S 3,19%
1,000x   2,50
STICNLETTI  CARTOF 80 2,50 A
0,390  x  7,99
MANDARINE   3,12A
0,446  x  7,99
R0SI 3,56  A
1.000  x 11,20
SALANIASAOSCAT  290  11,20A
SUBTOTAL   23,57
SUBTOTAL 23,57
TOTAL   27,57

The Application

ReceiptBudget has an interactive dashboard
The goal is to get some insight into spending patterns

THE map

The charts

Built USING

d3.js

dc.js

crossfilter

Conclusion

OCR results are better than by using off-the-shelf components
but there is still some work to do

deep learning shows promise
need a larger dataset

I saved some money by looking at those graphs :)

Questions?

SCSS

By rolisz

SCSS

1,632

A Novel Machine Learning Based Approach for Retrieving Information from Receipt Images

OuR Aims

Problem Statement

OCR

Our Model

Dataset

BUILT USING:

results

Comparison To other REsults

The Application

THE map

The charts

Built USING

Conclusion

Questions?

SCSS

More from rolisz