Szabo Roland Teodor

UBB

Obtaining personal Finance data from receipts using machine learning

Motivation and Problem Statement
Related work
ReceiptBudget - the application
Conclusions

The results obtained in this paper were also presented at the Scientific Communication Session for Students

Motivation

came to college in 2011

money seemed to evaporate

needed a tool to keep track of it

existing ones weren't good enough

Aims

create a new personal finance management tool
develop an OCR engine tailored for receipts
show interactive visualizations of expenses

Optical character recognition

Steps performed

document layout analysis

character segmentation

character recognition

Existing OCR ENGINEs

OCROpus

poorer recognition performance
does sophisticated document layout analysis

Tesseract

good general performance
no document layout analysis

Our OCR engine

knows layout of receipts
specially trained for receipt font

which is weird, compressed, broken

Dataset

20 receipts were annotated by hand
~7000 characters
74 classes (different characters)

BUILT USING:

All three free, open-source libraries for scientific computing, machine learning and computer vision

Results

Character Recognition

SVM baseline with RBF kernel - 91.01%
deep learning with neural networks - 98.5%

Model	Mean accuracy	Std deviation
SVM	91.01%	0.126
SAU	95.625%	0.387
MLP2	97.988%	1.244
MLP3	97.576%	1.768
MLP4	98.506%	0.256

Most common missclassifications

, and . - 40 times
O and 0 - 39 times
1, l and I - 14 times

Character segmentation

Random forests
tried between 150 and 250 trees
number of features used between 8 and 120
best F1 score of 87.9%

Confusion Matrix

No split

Split

Predicted no split

4556

363

Predicted split

255

2546

Model is more specific, rather than sensitive

Line Classification

mostly regular expressions to match various patterns
92.8% accuracy
depends a lot on results from previous step:

T0TAL
STA REPUBLICII

Comparison to Other Results

recognition results obtained by others on MNIST are better

SVM with RBF kernel - 98.6% accuracy
state-of-the-art - 99.7%
10x bigger dataset
7x fewer classes

segmentation results for license plate are better

96% accuracy
fixed number of letters to be segmented

Tesseract

ReceiptBudget

S.C. HRTIHH 5.9.
ELUJ NHPUCH. STR. BUCEGI. NR. 19

9.9.1. 99 11735629
RUN
1.999 x 3.19
BRTISTE N9Z.CLHS|C3S 3.19 9
1.999 x 2.59
STICKLETTI CHRTUF 99 2.59 9
9.399 x 7.99
HHNDHRINE 3.12 H
9.446 x 7.99
99511 3.56 9
1.999 x 11.29
SHLHH C959 USCHT 299 11.29 9
SUBTUTHL 23.5?
SUBTUTRL _____ ‘-29:99
TUTHL 23.57

S.C.   ARTIMA  S.A.
CLUJMAPOCA, STR. BUCEGI, NR.19
C  .U.1,R O1   1735628
RON
1.000  x  3,19
BATISTE NAZ.CLASIC3S 3,19%
1,000x   2,50
STICNLETTI  CARTOF 80 2,50 A
0,390  x  7,99
MANDARINE   3,12A
0,446  x  7,99
R0SI 3,56  A
1.000  x 11,20
SALANIASAOSCAT  290  11,20A
SUBTOTAL   23,57
SUBTOTAL 23,57
TOTAL   27,57

The Application

ReceiptBudget has an interactive dashboard
The goal is to get some insight into spending patterns

THE map

The charts

Built USING

d3.js

dc.js

crossfilter

Conclusion

OCR results are better than by using off-the-shelf components

incorporating domain specific knowledge helps

interactive dashboard is helpful

I saved some money:)

Questions?

Licenta

By rolisz

Licenta

11 years ago
1,838

Obtaining personal Finance data from receipts using machine learning

Contents

Motivation

Aims

Optical character recognition

Existing OCR ENGINEs

Our OCR engine

Dataset

BUILT USING:

Results

Character Recognition

Line Classification

Comparison to Other Results

The Application

THE map

The charts

Built USING

Conclusion

Questions?

Licenta

More from rolisz