Guided by:
Dr. Mitesh M. Khapra
Presented by:
Jaya Ingle
Date: 7/5/19
Department of Computer Science and Engineering, IIT Madras
Objectives:
1. To detect the text regions in natural scene images.
2. To recognize the text in a cropped word image.
Example cropped word images: सेल्फी, जिला, भारत
Introduction
Challenges involved
Lighting conditions
Varying orientation
Perspective distortion
Complex background
Before going on: to solve any machine learning problem, we need
1. Data
2. Model
3. Parameters
4. Learning Algorithm
5. Loss Function
Specifically, in the case of Indian languages (Hindi):
1. Data
   1. Synthetic dataset for text detection (Shubham's thesis)
   2. Synthetic dataset for text recognition (Hindi) (in this work)
   3. For test images, we manually captured the pictures and annotated them.
So, we have generated synthetic (artificial) datasets for Indian languages (Tamil, Telugu, Punjabi, Malayalam, Hindi).
Things required:
1. Text (from the Hindi Wiki book database)
2. Font (Google Fonts)
3. Color of text (using a dictionary of background and foreground colors)
4. Background images (freely available on the internet)
5. Rendering method (discussed on the next slide)
Rendering method:
Unlike English, Unicode-level information is not sufficient to render Hindi text on an image; we need glyph-level information.
Before rendering, we select the text, font, and background. With the help of a few Python rendering libraries, we render the text on the image:
1. We store the glyphs and their positions in a buffer.
2. Until the buffer is empty, we keep rendering the glyphs onto the surface; the rendered surface is finally converted to an image.
(A simplified sketch follows after the example below.)
For example, the syllable कि consists of the glyphs क and ि, with an ordering between them; together they will be rendered as कि.
We also applied distortion and noise to a randomly selected 80% of the images.
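A minimal sketch of this generation step, assuming a word list, a Devanagari .ttf font, and a folder of background crops already exist; the paths, word list, and color dictionary below are placeholders, and Pillow stands in for the glyph-buffer renderer described above (Pillow shapes Devanagari conjuncts correctly only when built with a shaping engine such as libraqm).

```python
# Sketch of the synthetic word-image generator: pick text, font, colors,
# and a background, then render the word and return its ground-truth label.
import random
from PIL import Image, ImageDraw, ImageFont

WORDS = ["भारत", "जिला", "सेल्फी"]            # sampled from the Hindi text corpus
FONTS = ["fonts/Lohit-Devanagari.ttf"]        # hypothetical font path
BACKGROUNDS = ["backgrounds/wall.jpg"]        # hypothetical background path
FG_COLORS = [(20, 20, 20), (250, 250, 250)]   # illustrative foreground colors

def render_word_image(out_path="sample.png"):
    word = random.choice(WORDS)
    font = ImageFont.truetype(random.choice(FONTS), size=48)
    fg = random.choice(FG_COLORS)

    # Load and resize a background crop, then draw the word onto it.
    bg = Image.open(random.choice(BACKGROUNDS)).convert("RGB").resize((320, 96))
    draw = ImageDraw.Draw(bg)
    draw.text((10, 20), word, font=font, fill=fg)
    bg.save(out_path)
    return word   # ground-truth label for this image

if __name__ == "__main__":
    print(render_word_image())
```

Distortion and noise (applied to 80% of the images) would be added as a post-processing step on the saved image.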
Before going into the literature: to solve any machine learning problem, we need
1. Data
2. Model
3. Parameters
4. Learning Algorithm
5. Loss Function
Specifically, in the case of Indian languages:
1. Data
2. Model (Detection / Recognition)
Solutions for Text Detection
1. Regression-based solutions
2. Segmentation-based solutions
[Figure: ground-truth bounding box vs. anchor box, with per-box "text?" predictions]
Regression-based Model
Outputs: a text score map and a geometry map (distances to the top, bottom, left, and right box edges, plus a rotation angle).
It directly predicts words or text lines, which are then passed to Non-Maximum Suppression (NMS).
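For reference, a minimal NumPy sketch of standard axis-aligned NMS. EAST itself merges rotated geometries with a locality-aware NMS, so this plain version and its default threshold are only illustrative.

```python
# Greedy NMS: keep the highest-scoring box, drop boxes that overlap it heavily.
import numpy as np

def nms(boxes, scores, iou_thresh=0.4):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou < iou_thresh]   # discard heavily overlapping boxes
    return keep
```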
Network
Loss
\[ L = L_{score} + \lambda_g\, L_{geometry} \]
\[ L_{score} = \text{balanced cross entropy between the predicted and ground-truth score maps} \]
\[ L_{geometry} = \text{IoU loss} + \big(1 - \cos(\theta_{pred} - \theta_{gt})\big) \]
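A hedged PyTorch sketch of this loss, assuming per-pixel outputs: a score map of shape (N, H, W), four distance maps (top/right/bottom/left) of shape (N, 4, H, W), and an angle map of shape (N, H, W). The balancing factor and the masking scheme are assumptions for illustration, not the exact thesis code.

```python
import torch

def east_loss(score_pred, score_gt, geo_pred, geo_gt, angle_pred, angle_gt,
              lambda_g=1.0, eps=1e-6):
    # Balanced cross entropy on the score map: weight positives by their rarity.
    beta = 1.0 - score_gt.mean()
    score_loss = -(beta * score_gt * torch.log(score_pred + eps)
                   + (1 - beta) * (1 - score_gt)
                   * torch.log(1 - score_pred + eps)).mean()

    # IoU loss on the axis-aligned geometry (distances to the four box edges).
    top_p, right_p, bottom_p, left_p = geo_pred.unbind(dim=1)
    top_g, right_g, bottom_g, left_g = geo_gt.unbind(dim=1)
    area_p = (top_p + bottom_p) * (right_p + left_p)
    area_g = (top_g + bottom_g) * (right_g + left_g)
    h_inter = torch.min(top_p, top_g) + torch.min(bottom_p, bottom_g)
    w_inter = torch.min(right_p, right_g) + torch.min(left_p, left_g)
    inter = h_inter * w_inter
    iou_loss = -torch.log((inter + eps) / (area_p + area_g - inter + eps))

    # Angle term: 1 - cos of the difference between predicted and true angles.
    angle_loss = 1 - torch.cos(angle_pred - angle_gt)

    # Only pixels inside ground-truth text regions contribute to the geometry loss.
    mask = score_gt
    geometry_loss = ((iou_loss + angle_loss) * mask).sum() / (mask.sum() + eps)
    return score_loss + lambda_g * geometry_loss
```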
Results
Implementation details
Training
Testing
| Scale | Precision | Recall | F-measure |
|---|---|---|---|
| 512x512 | 0.6445 | 0.4441 | 0.5258 |
| Nearest multiple of 32 | 0.6788 | 0.4045 | 0.5069 |

* with IoU threshold of 0.4
Results
| Scale | Precision | Recall | F-measure |
|---|---|---|---|
| 1024x1024 | 0.91100 | 0.68159 | 0.77977 |
| 512x512 | 0.954898 | 0.74334 | 0.83594 |
| 256x256 | 0.9421052 | 0.650121 | 0.76934 |
| Nearest multiple of 32 | 0.93323 | 0.76150 | 0.83866 |

* with IoU threshold of 0.4
Solutions for Text Detection
1. Regression-based solutions
2. Segmentation-based solutions
[Figure: ground-truth bounding box vs. anchor box, with per-box "text?" predictions]
Segmentation-based Model
Unlike the regression method, which performs two types of tasks, here only a text / non-text prediction is required. Two types of predictions are made:
1. positive / negative pixel
2. positive / negative link
Network
Loss
\[ L = L_{link} + \lambda\, L_{pixel} \]
\[ L_{pixel} = \text{instance-balanced cross entropy on the pixel predictions} \]
\[ L_{link} = \text{cross entropy on the link predictions, weighted by } W \text{ for positive and negative links} \]
\[ \lambda = 2 \]
* with VGG-16 as the base network
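A hedged PyTorch sketch of this combined loss. The tensor shapes, the weight tensors (instance-balancing weights for pixels, W for links), the number of link channels, and the joint normalization over positive and negative links are simplifications assumed for illustration; the PixelLink paper normalizes positive and negative link losses separately.

```python
import torch
import torch.nn.functional as F

def pixel_link_loss(pixel_logits, pixel_gt, pixel_weights,
                    link_logits, link_gt, link_weights, lam=2.0, eps=1e-6):
    """pixel_logits: (N, 2, H, W); link_logits: (N, 2, K, H, W) for K neighbours.
    *_gt are class-index tensors; *_weights carry the balancing weights."""
    # Instance-balanced cross entropy on pixel (text / non-text) predictions.
    pix_ce = F.cross_entropy(pixel_logits, pixel_gt, reduction="none")
    pixel_loss = (pix_ce * pixel_weights).sum() / (pixel_weights.sum() + eps)

    # Weighted cross entropy on the link predictions (simplified joint
    # normalization instead of separate +ve / -ve normalization).
    n, _, k, h, w = link_logits.shape
    link_ce = F.cross_entropy(link_logits.reshape(n, 2, -1),
                              link_gt.reshape(n, -1), reduction="none")
    link_ce = link_ce.reshape(n, k, h, w)
    link_loss = (link_ce * link_weights).sum() / (link_weights.sum() + eps)

    return link_loss + lam * pixel_loss
```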
Implementation details
Results
Training
Testing
| Precision | Recall | F-measure |
|---|---|---|
| 0.4393 | 0.4534 | 0.42313 |

* IoU threshold = 0.4, link and pixel thresholds = 0.5

After fine-tuning on 300 real images:

| Precision | Recall | F-measure |
|---|---|---|
| 0.52 | 0.5573 | 0.5380 |
Effect of changing the pixel threshold (link threshold = 0.5, IoU threshold = 0.4):

| Threshold | Precision | Recall | F-measure |
|---|---|---|---|
| 0.5 | 0.520053 | 0.55730 | 0.53803 |
| 0.6 | 0.53269 | 0.57368 | 0.50153 |
| 0.7 | 0.538461 | 0.58166 | 0.55922 |
| 0.8 | 0.58377 | 0.62893 | 0.60551 |
| 0.9 | 0.57490 | 0.616604 | 0.59433 |

Effect of changing the link threshold (pixel threshold = 0.8, IoU threshold = 0.4):

| Threshold | Precision | Recall | F-measure |
|---|---|---|---|
| 0.5 | 0.52005 | 0.55730 | 0.53803 |
| 0.6 | 0.58115 | 0.63610 | 0.60738 |
| 0.7 | 0.57657 | 0.64183 | 0.60745 |
| 0.8 | 0.56045 | 0.63753 | 0.59651 |
Qualitative results
[Comparison figures: EAST vs. PixelLink detections on sample test images]
भारत
Unlike the close relationship between object detection and text detection, there are significant differences between object recognition and text recognition.
In text recognition, we have to predict a series of labels, which is why this problem is posed as a sequence recognition problem.
For example, भारत is the label sequence भ + ा + र + त.
Model
1. Input image
2. VGG-16 as feature extractor → feature maps
3. Height-normalized feature maps → map to sequence
4. BLSTM × 2 → per-frame predictions
5. Transcription → label sequence (e.g., जिला)
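A minimal PyTorch sketch of this CRNN-style pipeline. The small CNN below is only a stand-in for the VGG-16 feature extractor, the layer sizes are assumptions, and the 129 output classes follow the label count reported later.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes=129, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(                      # stand-in for VGG-16
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),           # collapse height to 1
        )
        self.rnn = nn.LSTM(256, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                              # x: (N, 1, H, W)
        feats = self.cnn(x)                            # (N, 256, 1, W')
        seq = feats.squeeze(2).permute(0, 2, 1)        # map to sequence: (N, W', 256)
        seq, _ = self.rnn(seq)                         # BLSTM x 2
        return self.fc(seq)                            # per-frame logits: (N, W', C)
```

The per-frame logits are then handed to the transcription layer described next.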
Transcription
To solve this, we use a framework known as CTC (Connectionist Temporal Classification).
The goal is to find the label sequence with the highest probability conditioned on the per-frame predictions.
How do we find this conditional probability?
We need a mapping function that maps an output sequence to a label sequence. For example, the mapping function will map
ि ि ि-----जजज---लल----ा
to जिला.
So, the conditional probability is framed as
\[ P(\text{label seq} \mid \text{per-frame predictions})^{*} = \sum_{\text{output seq} \,\mapsto\, \text{label seq}} P(\text{output seq} \mid \text{per-frame predictions}) \]
i.e., the sum runs over all output sequences that can be mapped to the label sequence.
While decoding, we return the label sequence with the highest probability.
* In reality, this is difficult to compute directly, so we use the forward-backward algorithm.
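A small Python sketch of the mapping function and of greedy (best-path) decoding. The blank index and the example label indices are illustrative; exact decoding sums over all paths via the forward-backward algorithm mentioned in the footnote.

```python
BLANK = 0   # assumed index of the CTC blank label

def ctc_collapse(path):
    """Map an output (per-frame) sequence to a label sequence:
    merge consecutive repeats, then drop blanks."""
    labels, prev = [], None
    for p in path:
        if p != prev and p != BLANK:
            labels.append(p)
        prev = p
    return labels

def greedy_decode(per_frame_probs):
    """per_frame_probs: list of per-frame probability lists over the labels.
    Best-path approximation: take the argmax at every frame and collapse."""
    path = [max(range(len(frame)), key=frame.__getitem__)
            for frame in per_frame_probs]
    return ctc_collapse(path)

# e.g. the path [5, 5, 0, 0, 9, 9, 9, 0, 3] collapses to the label sequence
# [5, 9, 3], just as ि ि ि-----जजज---लल----ा collapses to जिला above.
print(ctc_collapse([5, 5, 0, 0, 9, 9, 9, 0, 3]))   # -> [5, 9, 3]
```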
Results
Implementation Details
Training
Testing
| Mean per-character accuracy | Mean full-sequence accuracy | Mean edit distance |
|---|---|---|
| 0.651006 | 0.419540 | 0.2968 |
Total number of labels = 129
[ 128 (according to Unicode representation) + 1 (unknown label) ]
Qualitative Results
अधिनियम
कालोनी
स्वल्पाहार
नित्यानंद
विभागाध्यक्ष
मंगलमय
Mistakes
इटास्पी
इलेक्टनिक्स
अभिवान
इबलिंग
डेन
सर्वशेष्ठ
विशाम
अघिनर्थ
कों
पथम
रकूल
ज
दं
द
सं
It's not only the model's mistake!
Distorted images