Suman Banerjee
Department of Computer Science and Engineering,
Indian Institute of Technology Madras
Outline
Dialogue Systems
Siri
Cortana
Bixby
Google Assistant
Alexa
Apple HomePod
Amazon Echo
Google Home
Modular Architecture
User utterance: I need a cheap chinese restaurant in the north of town.
Language Understanding produces the Semantic Frame: request_rest(cuisine=chinese, price=cheap, area=north)
Dialogue Manager: Dialogue State Tracking passes the Dialogue State to the Policy Optimizer, with access to the Knowledge Base
Policy Optimizer produces the System Action: request_people( )
Language Generation produces the System response: Sure, for how many people?
Probabilistic methods in spoken-dialogue systems, Steve J. Young, Philosophical Transactions: Mathematical, Physical and Engineering Sciences, 2000
End-to-End Architecture
User utterance: I need a cheap chinese restaurant in the north of town.
End-to-End Dialogue System (with access to the Knowledge Base)
System response: Sure, for how many people?
Goal-oriented Dialogue phases
Pre-KB
System: Hello, what can I help you with today?
User: Hi!
User: I'd like to book a table for six people in an expensive restaurant with British food.
System: Where should it be?
User: Somewhere in the north of the town.
System: ok I am on it!
System: API_CALL (British, north, six, expensive)
KB
The_Place R_cuisine British
The_Place R_location north
The_Place R_address The_Place_Address
The_Place R_price expensive
The_Place R_rating 10
Fancy_pub R_cuisine British
Fancy_pub R_location north
Fancy_pub R_address Fancy_pub_Address
Fancy_pub R_price expensive
Fancy_pub R_rating 8
Post-KB
System: How about The_Place?
User: No I don't like it.
System: Ok, then, what do you think about Fancy_Pub?
User: Yes! let's do it.
System: Great! I'll do the reservation.
User: What is the address?
System: Here it is: Fancy_Pub_Address
User: Thank You
System: You're welcome
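The three phases can be recovered mechanically from a transcript in the format shown on this slide. The following is a minimal sketch, assuming that speaker turns start with "User:" or "System:", that every other non-empty line is a whitespace-separated KB triple, and that the API_CALL turn closes the Pre-KB phase. The function name `split_dialogue` is hypothetical and is not taken from the original work.

```python
def split_dialogue(lines):
    """Split a transcript into Pre-KB turns, KB triples and Post-KB turns."""
    pre_kb, kb, post_kb = [], [], []
    seen_api_call = False
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if not line.startswith(("User:", "System:")):
            # Assumption: anything that is not a speaker turn is a KB triple.
            subject, relation, obj = line.split(maxsplit=2)
            kb.append((subject, relation, obj))
        elif seen_api_call:
            post_kb.append(line)
        else:
            pre_kb.append(line)
            # The API call is the last turn of the Pre-KB phase.
            if "API_CALL" in line:
                seen_api_call = True
    return pre_kb, kb, post_kb


dialogue = [
    "User: Somewhere in the north of the town.",
    "System: API_CALL (British, north, six, expensive)",
    "The_Place R_cuisine British",
    "The_Place R_rating 10",
    "System: How about The_Place?",
    "User: No I don't like it.",
]
pre_kb, kb, post_kb = split_dialogue(dialogue)
```

Keeping the triples as tuples rather than raw strings makes it straightforward to feed them to a memory over the KB later on.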
Structural Information
Dependency Parse of sentences
Knowledge Graph
Code Mixing
Speaker 1: Hi, can you help me with booking a table at a restaurant?
Speaker 2: Sure, would you like something in cheap, moderate or expensive?
Speaker 1: Hi, kya tum ek restaurant mein table book karne mein meri help karoge?
Speaker 2: Sure, aap ko kaunsi price range mein chahiye, cheap, moderate ya expensive?
Speaker 1: Hi, tumi ki ekta restaurant ey table book korte amar help korbe?
Speaker 2: Sure, aapni kon price range ey chaan, cheap, moderate na expensive?
Problem
Single Attention Distribution
Sequential Attention
History split into Pre-KB, KB and Post-KB parts
Post-KB: Post-KB RNN and Query RNN; attention weights \(\boldsymbol \alpha_t\) give \(\mathbf{h}_{post}\)
KB: KB Memory Network
Pre-KB: attention weights \(\boldsymbol \beta_t\) give \(\mathbf{h}_{pre}\)
End-to-End Network
Graph Convolutional Network (GCN)
Semi-supervised classification with graph convolutional networks. Kipf and Welling, ICLR, 2017.
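For reference, one graph-convolution step in the style of Kipf and Welling can be sketched in NumPy as follows: add self-loops, normalize the adjacency matrix symmetrically, then apply a linear map and a ReLU. The toy graph, the feature sizes and the random weights are illustrative assumptions, not values from this work.

```python
import numpy as np


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix A."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(a_norm, h, w):
    """One GCN hop: aggregate neighbour features, project, apply ReLU."""
    return np.maximum(a_norm @ h @ w, 0.0)


# Toy example: a 3-node path graph 0 - 1 - 2 with 4-dim node features.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
features = rng.normal(size=(3, 4))
weights = rng.normal(size=(4, 2))
hidden = gcn_layer(normalize_adjacency(adj), features, weights)  # shape (3, 2)
```

Stacking k such layers lets each node see its k-hop neighbourhood, which is what "GCN hops" refers to.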
Problem
Syntactic GCNs with RNN
RNN-Encoder
GCN
RNN-GCN
GCN with Sequential Attention
Query Attention
\[ \alpha_{jt} = f_1(\mathbf{c}^f_j, \mathbf{d}_{t-1}) \]
\[ \mathbf{h}^Q_t =\sum_{j'=1}^{|Q|} \alpha_{j't}\mathbf{c}_{j'}^f \]
History Attention
\[\beta_{jt} = f_2(\mathbf{a}^f_j, \mathbf{d}_{t-1}, \mathbf{h}^Q_t)\]
\[ \mathbf{h}^H_t = \sum_{j'=1}^{|H|} \beta_{j't}\mathbf{a}_{j'}^f \]
KB Attention
\[ \gamma_{jt} = f_3(\mathbf{r}^f_j,\mathbf{d}_{t-1}, \mathbf{h}^Q_t,\mathbf{h}^H_t)\]
\[ \mathbf{h}^K_t = \sum_{j'=1}^m \gamma_{j't}\mathbf{r}_{j'}^f \]
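The chain of three attentions can be sketched in NumPy as below. The slides leave the scoring functions \(f_1, f_2, f_3\) unspecified, so this sketch assumes a simple stand-in: a softmax over dot products between each memory vector and the sum of the conditioning vectors, with all vectors sharing one dimension. The names `attend` and `sequential_attention` are hypothetical.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())   # shift for numerical stability
    return e / e.sum()


def attend(memory, *conditioning):
    """memory: (n, d); each conditioning vector: (d,).

    Assumed scoring: dot product with the summed conditioning vectors.
    Returns the attention weights (n,) and the context vector (d,).
    """
    query = np.sum(conditioning, axis=0)
    weights = softmax(memory @ query)
    return weights, weights @ memory


def sequential_attention(c, a, r, d_prev):
    """c: query states, a: history states, r: KB states, d_prev: decoder state."""
    alpha, h_q = attend(c, d_prev)             # query attention
    beta, h_h = attend(a, d_prev, h_q)         # history attention, sees h_q
    gamma, h_k = attend(r, d_prev, h_q, h_h)   # KB attention, sees h_q and h_h
    return (alpha, beta, gamma), (h_q, h_h, h_k)
```

The point of the ordering is that each later attention is conditioned on the context vectors computed before it, rather than one distribution being spread over query, history and KB at once.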
GCNs for code-mixed utterances
\(^1\)Word association norms, mutual information, and lexicography, Church and Hanks, Computational Linguistics, 1990
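When no dependency parser exists for the code-mixed language, a word graph can be built from co-occurrence statistics instead. The sketch below is one possible reading, with its assumptions stated: two words co-occur if they appear in the same utterance, probabilities are estimated as fractions of utterances, and an edge is kept when the positive PMI exceeds a threshold. The function name `ppmi_edges`, the utterance-level window, the default threshold and the toy utterances are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def ppmi_edges(utterances, threshold=0.0):
    """Return {(word1, word2): ppmi} for word pairs with PPMI above threshold."""
    word_counts, pair_counts = Counter(), Counter()
    for utterance in utterances:
        words = sorted(set(utterance.lower().split()))
        word_counts.update(words)
        # Sorted words give each unordered pair one canonical key.
        pair_counts.update(combinations(words, 2))
    n = len(utterances)
    edges = {}
    for (w1, w2), count in pair_counts.items():
        p_joint = count / n
        p_w1, p_w2 = word_counts[w1] / n, word_counts[w2] / n
        ppmi = max(math.log(p_joint / (p_w1 * p_w2)), 0.0)
        if ppmi > threshold:
            edges[(w1, w2)] = ppmi
    return edges


utterances = [
    "aap ko cheap restaurant chahiye",
    "aap ko expensive restaurant chahiye",
    "cheap price range",
    "thank you",
]
graph = ppmi_edges(utterances)
```

Compared with raw frequencies, PPMI discounts pairs that co-occur only because both words are common.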
Modified DSTC2
System: Hello, what can I help you with today?
User: Hi!
User: I'd like to book a table for six people in an expensive restaurant with British food.
System: Where should it be?
User: Somewhere in the north of the town.
System: ok I am on it!
System: API_CALL (British, north, six, expensive)
The_Place R_cuisine British
The_Place R_location north
The_Place R_address The_Place_Address
The_Place R_price expensive
The_Place R_rating 10
Fancy_pub R_cuisine British
Fancy_pub R_location north
Fancy_pub R_address Fancy_pub_Address
Fancy_pub R_price expensive
Fancy_pub R_rating 8
System: How about The_Place?
User: No I don't like it.
System: Ok, then, what do you think about Fancy_Pub?
User: Yes! let's do it.
System: Great! I'll do the reservation.
User: What is the address?
System: Here it is: Fancy_Pub_Address
User: Thank You
System: You're welcome
Learning end-to-end goal-oriented dialog, Bordes et al., ICLR, 2017.
Code-mixed Data Collection
1. Start from the English dialogue data
2. Extract unique utterances
3. Replace entities with placeholders to get utterance templates
4. Crowdsource code-mixing of the templates to get code-mixed templates
5. Replace placeholders with entities to get code-mixed utterances
6. Replace utterances back into the dialogues to get code-mixed dialogue data
Example of template extraction:
Sorry there is no Chinese restaurant in the west part of town
Sorry there is no Italian restaurant in the north part of town
Sorry there is no <CUISINE> restaurant in the <AREA> part of town
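The templating and re-filling steps can be sketched as follows. This is an illustration only: the entity lists in `ENTITIES`, the function names, the assumption of at most one entity per slot in an utterance, and the code-mixed template used in the usage example are all hypothetical, not taken from the collected data.

```python
import re

# Hypothetical entity lists; a real pipeline would read them from the KB.
ENTITIES = {
    "CUISINE": ["chinese", "italian", "british"],
    "AREA": ["north", "south", "east", "west"],
}


def to_template(utterance, entities=ENTITIES):
    """Replace the first entity of each slot with a <SLOT> placeholder."""
    template, values = utterance, {}
    for slot, candidates in entities.items():
        pattern = r"\b(?:" + "|".join(map(re.escape, candidates)) + r")\b"
        match = re.search(pattern, template, flags=re.IGNORECASE)
        if match:
            values[slot] = match.group(0)
            template = template[:match.start()] + f"<{slot}>" + template[match.end():]
    return template, values


def fill_template(template, values):
    """Put the original entities back into a (code-mixed) template."""
    for slot, value in values.items():
        template = template.replace(f"<{slot}>", value)
    return template


template, values = to_template(
    "Sorry there is no Chinese restaurant in the west part of town")
# Hypothetical crowdsourced code-mixed version of the same template.
mixed_template = "Sorry, town ke <AREA> part mein koi <CUISINE> restaurant nahi hai"
mixed_utterance = fill_template(mixed_template, values)
```

Because many utterances collapse to the same template, only the unique templates need to be sent to crowd workers.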
Results on En-DSTC2
RNN + CROSS-GCN-SeA
Results on code-mixed data
Effect of using more GCN hops
PPMI vs Raw Frequencies
Is dependency or co-occurrence structure really needed?
Dependency edges
Random edges
Ablations
Encoder: RNN, GCN, RNN-GCN
Attention: Bahdanau, Sequential
GCNs do not outperform RNNs independently:
performance of GCN-Bahdanau attention < RNN-Bahdanau attention
Our Sequential attention outperforms Bahdanau attention:
GCN-Bahdanau attention < GCN-Sequential attention
RNN-Bahdanau attention < RNN-Sequential attention (BLEU & ROUGE)
RNN+GCN-Bahdanau attention < RNN+GCN-Sequential attention
Combining GCNs with RNNs helps:
RNN-Sequential attention < RNN+GCN-Sequential attention
Best results are always obtained by the final model which combines RNN, GCN and Sequential attention
Results on Human-Human Dialogue Datasets
Results on Cam676
Results on MultiWOZ
Human Evaluations
Conclusion
A single attention distribution overburdens the attention mechanism
Separated the history into Pre-KB, KB and Post-KB parts and attended sequentially over them
Showed that structure-aware representations are useful in goal-oriented dialogue
Used GCNs to infuse structural information of dependency graphs into the learned representations
When dependency parsers are not available, we used word co-occurrence frequencies and PPMI values to extract a contextual graph
Obtained state-of-the-art performance on four datasets.
Future Work
Extend the model to multidomain goal-oriented dialogue (restaurants, hotels, taxi)
Conditional code-mixed response generation
Use the whole knowledge graph instead of dialogue-specific KB triples
Use semantic graphs along with dependency parse trees
Publications
Thank You
Language Understanding
Intent Classification
User utterance: I need a cheap chinese restaurant in the north of town.
Classifier: Predicted Intent
Semantic Frame: request_rest(cuisine=chinese, price=cheap, area=north)
Slot Filling
User utterance: I need a cheap chinese restaurant in the north of town.
Predicted Tags:
I | <null> |
need | <null> |
a | <null> |
cheap | <price> |
chinese | <cuisine> |
restaurant | <null> |
in | <null> |
the | <null> |
north | <area> |
of | <null> |
town | <null> |
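Given a predicted intent and the per-token tags above, the semantic frame can be assembled by keeping only the tokens whose tag is not `<null>`. The sketch below is a minimal illustration, assuming one token per slot; the function name `build_frame` and the dictionary layout of the frame are hypothetical.

```python
def build_frame(intent, tokens, tags):
    """Combine a predicted intent with slot tags into a semantic frame."""
    slots = {
        tag.strip("<>"): token
        for token, tag in zip(tokens, tags)
        if tag != "<null>"
    }
    return {"intent": intent, "slots": slots}


tokens = "I need a cheap chinese restaurant in the north of town".split()
tags = ["<null>", "<null>", "<null>", "<price>", "<cuisine>", "<null>",
        "<null>", "<null>", "<area>", "<null>", "<null>"]
frame = build_frame("request_rest", tokens, tags)
```

Here `frame["slots"]` holds the same three slot-value pairs as the frame `request_rest(cuisine=chinese, price=cheap, area=north)` on the slide.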
Dialogue Management
User: Book a table at Prezzo for 5.
System: How many people?
User: For 3.
Slots: #People, Time
Language Generation
System action: inform(rest=Prezzo, cuisine=italian)
System response: Prezzo is a nice restaurant which serves italian.
[Figure residue (object detection on plots): IOU examples on PASCAL VOC and PlotQA; detection outputs of SSD, YOLO-v3, RetinaNet, FRCNN, FrRCNN, Mask R-CNN and PlotNet; the PlotNet architecture diagram (feature extractor, ROI Align, AN-ROI layer, with classification scores, bounding box coordinates and linking scores as outputs); and a plot of mAP (in %) against inference time (in ms).]