ICCV 2019

Submission Timeline:

Paper Registration Deadline       March 15, 2019 (11:59PM PST)
Paper Submission Deadline         March 22, 2019 (11:59PM PST)
Supplementary Materials Deadline  March 29, 2019 (11:59PM PST)
Reviews Released to Authors       June 14, 2019 (11:59PM PST)
Rebuttals Due                     June 26, 2019 (11:59PM PST)
Final Decisions to Authors        July 22, 2019

Goals

1) Show that reasoning over plots cannot be treated as a Multi-Class Classification (MCC) problem

2) Motivation for a new dataset (DIP)

      2.1) DIP is a harder dataset than the existing datasets

3) Motivation for a new method (Our Pipeline)

      3.1) Our pipeline performs better than the existing SOTA methods

March 12, 2019

How we are achieving these goals:

An Intuitive Story

While doing Question-Answering over scientific plots, the answer:

1) comes from a fixed vocabulary

e.g. Yes, No, horizontal, ...

2) comes from the plot itself (plot specific)

e.g. title of the graph,

      textual content which is not in the vocabulary, ...

3) needs to be calculated/generated

e.g. average of numbers, difference of numbers, ...

A multi-class classification formulation addresses only the 1st case.

Goal 1 is achieved here (i.e., we cannot treat reasoning over plots as an MCC problem)


Datasets compared against these answer types: FigureQA, DVQA, DIP

Goals 2 and 2.1 are achieved (i.e., a harder dataset like DIP is required to do reasoning over plots)


Methods compared against these answer types: SAN, SANDY, Our Model, Hybrid, MOM

Goals 3 and 3.1 are achieved (i.e., a method which addresses all types of answers is needed)

How we are achieving these goals:

An Intuitive Story

Existing datasets contain only those question types for which answers can either be classified from a fixed vocabulary or are plot specific.

FigureQA : All the answers are Yes/No.

DVQA : Out of 25 templates,

  • answers for 14 templates come from a fixed vocabulary
  • 9 templates can be answered based on the plot vocabulary
  • answers for only 2 templates need to be calculated

DIP : Out of 74 templates,

  • answers for 36 templates come from a fixed vocabulary
  • 15 templates can be answered based on the plot vocabulary
  • answers for 23 templates need to be calculated

Goals 2 and 2.1 are achieved here.
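To make the comparison between the two datasets concrete, the template counts listed above can be turned into shares per answer type. This is only a small sketch: the dictionary keys and the function name `template_shares` are illustrative choices, and the counts are the ones quoted on this slide.

```python
# Template counts per answer type, copied from the DVQA and DIP lists above.
TEMPLATE_COUNTS = {
    "DVQA": {"fixed vocabulary": 14, "plot vocabulary": 9, "calculated": 2},
    "DIP": {"fixed vocabulary": 36, "plot vocabulary": 15, "calculated": 23},
}


def template_shares(counts):
    """Return each answer type's share of the templates, in percent."""
    total = sum(counts.values())
    return {answer_type: 100.0 * n / total for answer_type, n in counts.items()}


for dataset, counts in TEMPLATE_COUNTS.items():
    shares = template_shares(counts)
    print(dataset, {k: round(v, 1) for k, v in shares.items()})
```

Under these counts, calculated answers make up 2 of 25 templates in DVQA and 23 of 74 templates in DIP, which is the gap the slide is pointing at.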


Can we merge the 2 templates of DVQA with other answer types?

How we are achieving these goals:

An Intuitive Story

Existing methods address only those question types for which answers can either be classified from a fixed vocabulary or are plot specific.

SAN (classifies the answer from a fixed vocabulary)

SANDY, MOM (classify the answer from a fixed vocabulary and the plot vocabulary)

Our proposed model (addresses only those questions for which the answer needs to be calculated)

Hybrid model (addresses Yes/No answers, fixed vocab answers, plot-specific answers, calculated answers)

Goals 3 and 3.1 are achieved here.


How we are achieving these goals:

An Empirical Story

Methods \ Datasets   DVQA      DIP
SAN                  36.04%    45.7%
Our model            48.62%    16.3%
Hybrid model-1       -         48.35%
Hybrid model-2       57.99%    49.71%

* All accuracies are calculated with exact match.

If we can show that the numbers in the red part are greater than those in the green part:

  • we can say that DVQA is comparatively easier than DIP

If we can show that the numbers in the yellow part are less than those in the blue part:

  • we can say that our method is better than the existing SOTA methods

These numbers don't tell us the full story because the question distributions of the two datasets are not equal.

We cannot conclude anything unless we run the Hybrid model on DVQA.

Exact match is not a good metric for comparing floating point numbers.
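The point about exact match can be illustrated with a tolerance-based comparison. The sketch below assumes that the 0%, 1%, 5% and 10% thresholds used on the later slides are relative error tolerances on numeric answers; the slides do not define them, so this reading, along with the function names `exact_match` and `tolerant_match`, is an assumption.

```python
def exact_match(predicted: str, gold: str) -> bool:
    """String-level exact match, ignoring surrounding whitespace."""
    return predicted.strip() == gold.strip()


def tolerant_match(predicted: str, gold: str, threshold: float = 0.05) -> bool:
    """Exact match for text answers; relative-error match for numeric answers.

    `threshold` is the allowed relative error (0.05 means 5%). This is an
    assumed interpretation of the thresholds in the slides.
    """
    try:
        p, g = float(predicted), float(gold)
    except ValueError:
        # At least one answer is not a number: fall back to exact match.
        return exact_match(predicted, gold)
    if g == 0.0:
        return p == 0.0
    return abs(p - g) <= threshold * abs(g)


print(exact_match("3.14", "3.1400"))          # same value, different strings
print(tolerant_match("3.14", "3.1400", 0.0))  # numeric comparison
print(tolerant_match("10.4", "10", 0.05))     # within 5% of the gold answer
```

With a string-level exact match, two renderings of the same number count as a miss, whereas the numeric comparison accepts them even at a 0% threshold.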


meta-classifier based on the templates

meta-classifier based on the predictions

How we are achieving these goals:

An Empirical Story

Hybrid Model (meta-classifier based on the predictions): accuracy on DIP

Threshold   DIP
0%          49.71%
1%          50.77%
5%          53.96%
10%         55.96%

How we are achieving these goals:

An Empirical Story

The rest of the slides contain the following results:

  • Zooming in on the accuracy of different models (Question wise)
  • Zooming in on the accuracy of different models (Answer wise)
  • Zooming in on the accuracy of different models (Question wise and Answer wise)


Template wise accuracy of different models

Template wise Accuracy

Datasets: DVQA
Methods \ Templates   Structure   Data      Reasoning
SAN                   94.71%      18.78%    37.29%
Our Model             60.53%      45.56%    47.52%
Hybrid Model-1        -           -         -
Hybrid Model-2        85.44%      51.04%    55.35%
SANDY (OCR)           96.47%      37.82%    41.5%

Datasets: DIP
Methods \ Templates   Structure   Data      Reasoning
SAN                   83.6%       37.93%    24.71%
Our Model             26.39%      20.06%    7.64%
Hybrid Model-1        83.61%      45.59%    26.34%
Hybrid Model-2        86.3%       47.25%    26.67%
SANDY (OCR)           -           -         -

Template Distribution (TEST)

Datasets \ Templates  Structure   Data      Reasoning
DVQA                  13.48%      31.93%    54.59%
DIP                   30.37%      23.97%    45.66%

* All accuracies are calculated with exact match.
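Because the template distributions of the two datasets differ, an overall accuracy is effectively a distribution-weighted average of the template wise accuracies. The sketch below works through that arithmetic for DIP using the figures on this slide; the names `TEMPLATE_DISTRIBUTION`, `DIP_TEMPLATE_ACCURACY` and `overall_accuracy` are illustrative.

```python
# Template distribution on the TEST split (Structure, Data, Reasoning), in percent.
TEMPLATE_DISTRIBUTION = {
    "DVQA": (13.48, 31.93, 54.59),
    "DIP": (30.37, 23.97, 45.66),
}

# Template wise accuracies on DIP (Structure, Data, Reasoning), in percent.
DIP_TEMPLATE_ACCURACY = {
    "SAN": (83.6, 37.93, 24.71),
    "Our Model": (26.39, 20.06, 7.64),
    "Hybrid Model-1": (83.61, 45.59, 26.34),
    "Hybrid Model-2": (86.3, 47.25, 26.67),
}


def overall_accuracy(distribution, accuracy):
    """Weight each template type's accuracy by its share of the test questions."""
    return sum(d * a for d, a in zip(distribution, accuracy)) / 100.0


for model, accuracy in DIP_TEMPLATE_ACCURACY.items():
    overall = overall_accuracy(TEMPLATE_DISTRIBUTION["DIP"], accuracy)
    print(f"{model}: {overall:.2f}%")
```

The weighted averages land close to the overall DIP accuracies reported earlier in the deck (45.7%, 16.3%, 48.35% and 49.71%), which is why a model's headline number depends on the question mix as much as on per-template skill.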

Template wise accuracy of different models

Zooming in on the accuracy of different models

Template wise Accuracy: Hybrid Model (meta-classifier based on the predictions), DIP

Threshold \ Templates   Structure   Data      Reasoning
0%                      86.3%       47.25%    26.67%
1%                      86.3%       47.96%    28.61%
5%                      86.3%       50.05%    34.49%
10%                     86.3%       50.92%    37.15%

Answer wise accuracy of different models

Answer wise Accuracy

Datasets: DVQA
Methods \ Answers   Yes/No    Fixed Vocab   OOV
SAN                 -         -             -
Our Model           -         -             -
Hybrid Model-1      -         -             -
Hybrid Model-2      -         -             -

Datasets: DIP
Methods \ Answers   Yes/No    Fixed Vocab   OOV
SAN                 83.00%    49.84%        0.00%
Our Model           0.00%     33.02%        4.14%
Hybrid Model-1      83.00%    53.07%        4.14%
Hybrid Model-2      83.00%    56.04%        4.14%

Answer Distribution (TEST)

Datasets \ Answers  Yes/No    Fixed Vocab   OOV
DVQA                23.46%    76.53%        0.00%
DIP                 27.46%    46.1%         26.4%

* All accuracies are calculated with exact match.

Answer wise accuracy of different models

Zooming in on the accuracy of different models

Answer wise Accuracy: Hybrid Model (meta-classifier based on the predictions), DIP

Threshold \ Answers   Yes/No    Fixed Vocab   OOV
0%                    83.00%    56.04%        4.14%
1%                    83.00%    56.04%        8.14%
5%                    83.00%    56.04%        20.19%
10%                   83.00%    56.04%        25.54%

Accuracy of different models

Zooming in on the accuracy of different models

Template wise Answer Distribution (TEST)

Datasets: DVQA
Answer \ Template   Structure   Data      Reasoning
Yes/No              82.13%      15.02%    14%
Fixed Vocab         17.8%       84.98%    85.91%
OOV                 0.00%       0.00%     0.00%

Datasets: DIP
Answer \ Template   Structure   Data      Reasoning
Yes/No              37.59%      20.85%    24.18%
Fixed Vocab         62.4%       56.3%     29.89%
OOV                 0.00%       22.84%    45.92%

Accuracy of different models

Zooming in on the accuracy of different models

Template wise Answer Accuracy (TEST)

Datasets: DVQA (model unspecified)
Answer \ Template   Structure   Data      Reasoning
Yes/No              92.15%      86.49%    50.55%
Fixed Vocab         54.57%      5.01%     23.96%
OOV                 0.00%       0.00%     0.00%

Datasets: DIP

SAN
Answer \ Template   Structure   Data      Reasoning
Yes/No              94.01%      95.35%    66.02%
Fixed Vocab         77.3%       32.06%    29.27%
OOV                 NA          0.00%     0.00%

Our Model
Answer \ Template   Structure   Data      Reasoning
Yes/No              0.00%       0.00%     0.00%
Fixed Vocab         42.29%      27.61%    25.48%
OOV                 NA          19.77%    0.054%

Hybrid Model-1
Answer \ Template   Structure   Data      Reasoning
Yes/No              94.01%      95.35%    66.02%
Fixed Vocab         77.33%      37.63%    34.63%
OOV                 NA          19.77%    0.054%

Hybrid Model-2
Answer \ Template   Structure   Data      Reasoning
Yes/No              94.01%      95.35%    66.0%
Fixed Vocab         81.66%      40.6%     35.74%
OOV                 NA          19.77%    0.054%

* All accuracies are calculated with exact match.

Accuracy of different models

Zooming in on the accuracy of different models

Template wise Answer Accuracy (TEST): Hybrid Model (meta-classifier based on the predictions), DIP

Threshold: 0%
Answer \ Template   Structure   Data      Reasoning
Yes/No              94.01%      95.35%    66%
Fixed Vocab         81.66%      40.6%     35.74%
OOV                 NA          19.77%    0.054%

Threshold: 1%
Answer \ Template   Structure   Data      Reasoning
Yes/No              94.01%      95.35%    66%
Fixed Vocab         81.66%      40.6%     35.74%
OOV                 NA          22.88%    4.29%

Accuracy of different models

Zooming in on the accuracy of different models

Template wise Answer Accuracy (TEST): Hybrid Model (meta-classifier based on the predictions), DIP

Threshold: 5%
Answer \ Template   Structure   Data      Reasoning
Yes/No              94.01%      95.35%    66%
Fixed Vocab         81.66%      40.6%     35.74%
OOV                 NA          32.06%    17.1%

Threshold: 10%
Answer \ Template   Structure   Data      Reasoning
Yes/No              94.01%      95.35%    66%
Fixed Vocab         81.66%      40.6%     35.74%
OOV                 NA          35.82%    22.87%

Motivation for Hybrid Model

Shortcomings of our proposed model are:

  1. Fails to answer Yes/No questions.
  2. Fails to answer questions for which the grammar is not defined. Example: In how many countries is the CO2 emission greater than the average CO2 emission taken over all countries?
  3. Positional information is lost in the tables. Hence, our method fails to answer structural and data questions where the image is required to answer correctly. Example: How many bars are there on the 4th tick from the right?

That's why we need a Hybrid model.


Analysis done in last meeting

In our last meeting, these TODOs were decided:

  1. Report SEMPRE accuracy at different thresholds
  2. Report SEMPRE accuracy on ORACLE tables
  3. Train SEMPRE only on Reasoning templates having OOV answers (Running)
  4. Train meta-classifier using predictions rather than question templates
  5. Calculate OCR accuracy on Mask-RCNN detections
  6. Hybrid model for DVQA
  7. Generate high quality plots

March 13, 2019

Stage-wise Analysis of our pipeline

Stage: Visual Elements Detection (VED)
Method: Mask-RCNN
Accuracy:
  Trained with 0.5 overlap     94.21% (mAP@IOU=0.5)   80.44% (mAP@IOU=0.8)   ab.cd% (mAP@IOU=0.9)
  Trained with 0.75 overlap    92.69% (mAP@IOU=0.5)   79.58% (mAP@IOU=0.8)   39.88% (mAP@IOU=0.9)

Stage: Optical Character Recognition (OCR)
Method: Tesseract
Accuracy:
  97.06% (Oracle bounding boxes)
  93.1% (bounding boxes after VED)

Stage: Semi-structured information extraction (SIE)
Method: Rule based
Accuracy:
  ab.cd%

Stage: Table Question Answering (QA)
Method: SEMPRE
Accuracy:
  32.55% (Oracle Tables), 5% Threshold
  20.22% (Tables generated after VED+OCR), 5% Threshold

Stage-wise Analysis of our pipeline

Visual Element Detection (VED)

MASK-RCNN trained with 0.5 overlap: TEST_FAMILIAR (DIP)

Class          IoU@0.5   IoU@0.8   IoU@0.9
bar            96.24%    77.57%    47.54%
dot-line       95.05%    62.48%    4.96%
legend-label   99.77%    98.13%    50.83%
line           51.83%    23.52%    5.83%
preview        99.87%    89.64%    32.43%
title          99.91%    81.41%
xlabel         99.94%    94.43%    46.12%
xticklabel     99.75%    89.90%    33.72%
ylabel         99.97%    98.86%    80.53%
yticklabel     99.84%    88.42%    36.31%
mAP            94.21%    80.44%    ab.cd%
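The detection numbers above are reported at IoU thresholds of 0.5, 0.8 and 0.9. As a reminder of what that overlap measure computes, here is a minimal sketch of intersection over union for two axis-aligned boxes. The `(x_min, y_min, x_max, y_max)` box format and the function name `iou` are assumptions for illustration, not the evaluation code used for these slides.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# A wide box of height 2 shifted vertically by 1 unit.
thin_gold = (0, 0, 100, 2)
thin_pred = (0, 1, 100, 3)
print(iou(thin_gold, thin_pred))
```

In this toy case a one-unit vertical shift of a thin box already drops the IoU to one third, so a detection that looks nearly right can still fail the 0.5 threshold. Whether this explains the low scores for thin elements in the table is not established by the slides.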

Stage-wise Analysis of our pipeline

Optical Character Recognition (OCR)

Textual Elements Oracle bounding boxes Bounding boxes after VED
xlabel
ylabel
yticklabel
xticklabel
title
legend-label
Overall
OCR Accuracy

Stage-wise Analysis of our pipeline

Semi-structured information extraction (SIE)

Stage-wise Analysis of our pipeline

Table Question Answering (SEMPRE)

  • Oracle Tables are the tables generated directly from the image annotations
  • Generated Tables are the tables generated after passing the image through our pipeline, i.e., through the VED, OCR and SIE stages

Overall accuracy on Oracle and Generated tables

Threshold \ Tables Oracle Tables Generated Tables
0% 31.8% 16.3%
1% 32.33% 17.2%
5% 32.55% 20.2%
10% 32.74% 21.47%
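One way to read the table above is as the accuracy lost to the upstream stages (VED, OCR and SIE): the difference between running SEMPRE on Oracle Tables and on Generated Tables. The sketch below computes that gap per threshold from the figures in the table; the names `OVERALL_ACCURACY` and `oracle_gap` are illustrative.

```python
# Overall SEMPRE accuracy in percent: threshold (%) -> (Oracle Tables, Generated Tables).
OVERALL_ACCURACY = {
    0: (31.8, 16.3),
    1: (32.33, 17.2),
    5: (32.55, 20.2),
    10: (32.74, 21.47),
}


def oracle_gap(threshold):
    """Accuracy points lost when moving from Oracle Tables to Generated Tables."""
    oracle, generated = OVERALL_ACCURACY[threshold]
    return oracle - generated


for threshold in OVERALL_ACCURACY:
    print(f"{threshold}%: gap of {oracle_gap(threshold):.2f} points")
```

On these numbers the gap narrows as the threshold is relaxed, while the Oracle Tables accuracy itself barely moves.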

Stage-wise Analysis of our pipeline

Table Question Answering (SEMPRE)

Template wise Accuracy on Oracle and Generated tables

Tables: Oracle Tables
Threshold \ Templates   Structure   Data      Reasoning
0%                      26.54%      42.18%    29.88%
1%                      26.54%      42.42%    30.87%
5%                      26.54%      42.49%    31.33%
10%                     26.54%      42.54%    32%

Tables: Generated Tables
Threshold \ Templates   Structure   Data      Reasoning
0%                      26.39%      20.06%    7.64%
1%                      26.39%      20.77%    9.36%
5%                      26.39%      22.88%    14.71%
10%                     26.39%      23.74%    17%

Stage-wise Analysis of our pipeline

Table Question Answering (SEMPRE)

Answer wise Accuracy on Oracle and Generated tables

Tables: Oracle Tables
Threshold \ Answers   Yes/No    Fixed Vocab   OOV
0%                    0.00%     43.86%        43.74%
1%                    0.00%     43.86%        45.64%
5%                    0.00%     43.86%        46.49%
10%                   0.00%     43.86%        47.19%

Tables: Generated Tables
Threshold \ Answers   Yes/No    Fixed Vocab   OOV
0%                    0.00%     33.02%        4.14%
1%                    0.00%     33.02%        7.74%
5%                    0.00%     33.02%        18.89%
10%                   0.00%     33.02%        23.64%

Stage-wise Analysis of our pipeline

Table Question Answering (SEMPRE)

Template wise answer accuracy on Oracle and Generated tables

Threshold: 0%

Generated Tables
Answer \ Template   Structure   Data      Reasoning
Yes/No              0.00%       0.00%     0.00%
Fixed Vocab         42.29%      27.61%    25.48%
OOV                 NA          22.89%    3.78%

Oracle Tables
Answer \ Template   Structure   Data      Reasoning
Yes/No              0.00%       0.00%     0.00%
Fixed Vocab         42.55%      41.79%    47.74%
OOV                 NA          81.44%    33.88%

Threshold: 1%

Generated Tables
Answer \ Template   Structure   Data      Reasoning
Yes/No              0.00%       0.00%     0.00%
Fixed Vocab         42.29%      27.61%    25.48%
OOV                 NA          22.89%    3.78%

Oracle Tables
Answer \ Template   Structure   Data      Reasoning
Yes/No              0.00%       0.00%     0.00%
Fixed Vocab         42.55%      41.79%    47.74%
OOV                 NA          82.48%    36.01%

Stage-wise Analysis of our pipeline

Table Question Answering (SEMPRE)

Template wise answer accuracy on Oracle and Generated tables

Threshold: 5%

Oracle Tables
Answer \ Template   Structure   Data      Reasoning
Yes/No              0.00%       0.00%     0.00%
Fixed Vocab         42.55%      41.79%    47.74%
OOV                 NA          83%       36.99%

Generated Tables
Answer \ Template   Structure   Data      Reasoning
Yes/No              0.00%       0.00%     0.00%
Fixed Vocab         42.29%      27.61%    25.48%
OOV                 NA          32%       15.44%

Threshold: 10%

Generated Tables
Answer \ Template   Structure   Data      Reasoning
Yes/No              0.00%       0.00%     0.00%
Fixed Vocab         42.29%      27.61%    25.48%
OOV                 NA          35.90%    20.43%

Oracle Tables
Answer \ Template   Structure   Data      Reasoning
Yes/No              0.00%       0.00%     0.00%
Fixed Vocab         42.55%      41.79%    47.74%
OOV                 NA          83%       37.84%

ICCV 2019 (Final)

By Nitesh Methani
