Incorporating Domain Knowledge in Multilingual, Goal-oriented Neural Dialogue Models

Suman Banerjee

Department of Computer Science and Engineering,

Indian Institute of Technology Madras

Outline

Introduction
Problem Statement
Proposed Models
- Sequential Attention Network
- GCN with Sequential Attention
Proposed Dataset
- Code-mixed dialogue
Results and Analysis
Conclusion

Outline

Introduction
Problem Statement
Proposed Models
- Sequential Attention Network
- GCN with Sequential Attention
Proposed Dataset
- Code-mixed dialogue
Results and Analysis
Conclusion

Dialogue Systems

Siri

Cortana

Bixby

Google assistant

Alexa

Apple Homepod

Amazon Echo

Google Home

Modular Architecture

Language Understanding

Dialogue State Tracking

Policy Optimizer

Language Generation

User utterance

System response

Semantic Frame

System Action

Dialogue State

I need a cheap chinese restaurant in the north of town.

request_rest(cuisine=chinese, price=cheap, area=north)

Knowledge Base

request_people( )

Sure, for how many people?

Dialogue Manager

Probabilistic methods in spoken-dialogue systems, Steve J. Young, Philosophical Transactions: Mathematical, Physical and Engineering Sciences, 2000

End-to-End Architecture

Can be directly trained on utterance- response pair data, no intermediate supervision required
Can be easily scaled to new domains
No fixed assumption on the dialogue state structure

User utterance

I need a cheap chinese restaurant in the north of town.

System response

Sure, for how many people?

Knowledge Base

End-to-End Dialogue System

Outline

Introduction
Problem Statement
Proposed Models
- Sequential Attention Network
- GCN with Sequential Attention
Proposed Dataset
- Code-mixed dialogue
Results and Analysis
Conclusion

Goal-oriented Dialogue phases

System: Hello, what can I help you with today?

User: Hi!

User: I'd like to book a table for six people in an expensive restaurant with British food.

System: where should it be ?

User: Somewhere in the north of the town.

System: ok I am on it!

System: API_CALL (British, north, six, expensive)

System: How about The_Place?

User: No I don't like it.

System: Ok, then, what do you think about Fancy_Pub?

User: Yes! lets do it.

System: Great! I'll do the reservation.

User: What is the address?

System: Here it is: Fancy_Pub_Address

User: Thank You

System: You're welcome

The_Place R_cuisine British

The_Place R_location north

The_Place R_address The_Place_Address

The_Place R_price expensive

The_Place R_rating 10

Fancy_pub R_cuisine British

Fancy_pub R_location north

Fancy_pub R_address Fancy_pub_Address

Fancy_pub R_price expensive

Fancy_pub R_rating 8

Goal-oriented Dialogue phases

Pre-KB

System: Hello, what can I help you with today?

User: Hi!

User: I'd like to book a table for six people in an expensive restaurant with British food.

System: where should it be ?

User: Somewhere in the north of the town.

System: ok I am on it!

System: API_CALL (British, north, six, expensive)

System: How about The_Place?

User: No I don't like it.

System: Ok, then, what do you think about Fancy_Pub?

User: Yes! lets do it.

System: Great! I'll do the reservation.

User: What is the address?

System: Here it is: Fancy_Pub_Address

User: Thank You

System: You're welcome

Post-KB

The_Place R_cuisine British

The_Place R_location north

The_Place R_address The_Place_Address

The_Place R_price expensive

The_Place R_rating 10

Fancy_pub R_cuisine British

Fancy_pub R_location north

Fancy_pub R_address Fancy_pub_Address

Fancy_pub R_price expensive

Fancy_pub R_rating 8

Structural Information

Dependency Parse of sentences

Knowledge Graph

Current state of the art models ignore this rich structural information
We exploit this structural information in our model
Couple it with the sequential attention mechanism
Empirically show that such structural information aware representations improve the response generation task

Code Mixing

Speaker 1: Hi, can you help me with booking a table at a restaurant?

Speaker 2: Sure, would you like something in cheap, moderate or expensive?

Speaker 1: Hi, kya tum ek restaurant mein table book karne mein meri help karoge?

Speaker 2: Sure, aap ko kaunsi price range mein chahiye, cheap, moderate ya expensive?

Speaker 1: Hi, tumi ki ekta restaurant ey table book korte amar help korbe?

Speaker 2: Sure, aapni kon price range ey chaan, cheap, moderate na expensive?

Problem

Dialogue of n turns : \( \{ (u_1,s_1),(u_2,s_2),...,(k_1,k_2,...,k_e),...,(u_n,s_n) \}\)
Phases:
- Pre-KB : \( (u_1,s_1,u_2,s_2,...,u_i,s_i ) \)
- KB : \( (k_1,k_2,...,k_e) \)
- Post-KB : \( (u_{i+1},s_{i+1},...,u_n,s_n ) \)
At \( t^{th}\) turn the user utterance \( u_t \) is considered as the query to the system
The system is supposed to generate the response: \( s_t \)

Models:
- Sequential Attention Network - To handle the three phases sequentially
- Graph Convolutional Network based model with Sequential Attention - To incorporate structural information
Dataset
- Code-mixed dialogue dataset

Outline

Introduction
Problem Statement
Proposed Models
- Sequential Attention Network
- GCN with Sequential Attention
Proposed Dataset
- Code-mixed dialogue
Results and Analysis
Conclusion

Single Attention Distribution

Current models use a single long RNN or Memory Network to encode the entire context
Compute a single attention distribution over these representations
This way of computing the attention overburdens the attention mechanism

Sequential Attention

Pre-KB

Post-KB

Sequential Attention

Post-KB RNN

Post-KB

\( \mathbf{h}^P_t = BiRNN(\mathbf{h}^P_{t-1},\mathbf{g}_t) \)
\( \mathbf{h}^Q_t = BiRNN(\mathbf{h}^Q_{t-1},\mathbf{q}_t) \)
\( \boldsymbol{\alpha}_t = f(\mathbf{h}^P_t,\mathbf{d}_{t-1},\mathbf{h}^Q_T)\)
\( \mathbf{h}_{post} = \sum_{j=1}^{p} \alpha_{jt}\mathbf{h}_j^P \)

Query RNN

\(\boldsymbol \alpha_t\)

\(\Big\{\)

\(\mathbf{h}_{post}\)

Sequential Attention

\(\boldsymbol \alpha_t\)

\(\Big\{\)

\(\mathbf{h}_{post}\)

\( \mathbf{p}^b_i = softmax(\mathbf{u}^bW^b_m\mathbf{h}^K_i)\)
\( \mathbf{o}^b = \sum_{j=1}^e p^b_j\mathbf{h}^K_j \)
\( \mathbf{u}^{b+1} = \mathbf{u}^b + \mathbf{o}^b \)
Final hop's output vector: \( \mathbf{h}_{kb} = \mathbf{o}^{B+1} \)

KB Memory Network

Sequential Attention

\(\boldsymbol \alpha_t\)

\(\Big\{\)

\(\mathbf{h}_{post}\)

Pre-KB

\( \mathbf{h}_t^R = BiRNN(\mathbf{h}_{t-1}^R,\mathbf{x}_t)\)
\( \boldsymbol \beta_t = f(\mathbf{h}_j^R , \mathbf{d}_{t-1} , \mathbf{h}^Q_T , \mathbf{h}_{post} , \mathbf{h}_{kb})\)
\( \mathbf{h}_{pre} = \sum_{j=1}^r \beta_{jt}\mathbf{h}^R_j \)

\(\boldsymbol \beta_t\)

\(\Big\}\)

\(\mathbf{h}_{pre}\)

End-to-End Network

Outline

Introduction
Problem Statement
Proposed Models
- Sequential Attention Network
- GCN with Sequential Attention
Proposed Dataset
- Code-mixed dialogue
Results and Analysis
Conclusion

Graph Convolutional Network (GCN)

GCN computes representations for the nodes of a graph by looking at the neighbourhood of the node
Formally, \( \mathcal{G = (V,E)}\) be a graph
Let \( \mathcal{X} \in \mathbb{R}^{n \times m}\) be the input feature matrix for \(n\) nodes
Each node (\(\mathbf{x}_u \in \mathbb{R}^m\)) is an m-dimensional feature vector.
The output of 1-hop GCN is \( \mathcal{H} \in \mathbb{R}^{n \times d}\)
Each \( \mathbf{h}_v \in \mathbb{R}^{d}\) is a node representation that captures the 1-hop neighbourhood information

\mathbf{h}_v^{k+1} = ReLU \bigg( \sum_{u \in \mathcal{N}(v)} (W^k\mathbf{h}_u^{k} + \mathbf{b}^k) \bigg) \quad , \quad \forall v \in \mathcal{V}

Here \(k\) is the hop number
\(\mathbf{h}^1_u = \mathbf{x}_u \)
\( \mathcal{N}(v)\) is the set of neighbours of node \(v\)

Semi-supervised classification with graph convolutional networks. Kipf and Welling, ICLR, 2017.

Problem

Dialogue : \( \{ (u_1,s_1),(u_2,s_2),...,(k_1,k_2,...,k_e),...,(u_n,s_n) \} \)
Each \(k_i\) is of the form: (entity\(_1\), relation, entity\(_2\))
The KB triples can be represented as a graph : \( \mathcal{G}_k = (\mathcal{V}_k,\mathcal{E}_k) \)
- where \(\mathcal{V}_k\) is the set of entities in the KB triples
- \( \mathcal{E}_k\) is the set of edges where each edge is of the form : (entity\(_1\), entity\(_2\), relation)
At the \( t^{th}\) turn of the dialogue, given the:
- Dialogue History: H = \( (u_1,s_1,...,s_{t-1}) \)
- The current user utterance as query: Q = \( u_t\)
- The knowledge graph \(\mathcal{G}_k\)
The task is to generate the current system utterance \(s_t\)

Syntactic GCNs with RNN

\mathbf{a}_v^{k+1} = ReLU \bigg( \sum_{u \in \mathcal{N}(v)} (V^{k}_{dir(u,v)}\mathbf{a}_u^{k} + \mathbf{o}^{k}_{dir(u,v)}) \bigg), \forall v \in \mathcal{V}_H

Obtain the dependency graph for the dialogue history : \( \mathcal{G}_H =(\mathcal{V}_H,\mathcal{E}_H)\)

RNN-Encoder

Syntactic GCNs with RNN

\mathbf{a}_v^{k+1} = ReLU \bigg( \sum_{u \in \mathcal{N}(v)} (V^{k}_{dir(u,v)}\mathbf{a}_u^{k} + \mathbf{o}^{k}_{dir(u,v)}) \bigg), \forall v \in \mathcal{V}_H

\mathbf{s}_t = BiRNN_H(\mathbf{s}_{t-1},\mathbf{p}_t)

GCN

RNN - GCN

GCN with Sequential Attention

Query Attention

\[ \alpha_{jt} = f_1(\mathbf{c}^f_j, \mathbf{d}_{t-1}) \]

\[ \mathbf{h}^Q_t =\sum_{j'=1}^{|Q|} \alpha_{j't}\mathbf{c}_{j'}^f \]

History Attention

\[\beta_{jt} = f_2(\mathbf{a}^f_j, \mathbf{d}_{t-1}, \mathbf{h}^Q_t)\]

\[ \mathbf{h}^H_t = \sum_{j'=1}^{|H|} \beta_{j't}\mathbf{a}_{j'}^f \]

KB Attention

\[ \gamma_{jt} = f_3(\mathbf{r}^f_j,\mathbf{d}_{t-1}, \mathbf{h}^Q_t,\mathbf{h}^H_t)\]

\[ \mathbf{h}^K_t = \sum_{j'=1}^m \gamma_{j't}\mathbf{r}_{j'}^f \]

GCNs for code-mixed utterances

Dependency Parsers are not available for code-mixed sentences
Need an alternate way of extracting structural information
Create a global word co-occurrence matrix from the entire corpus
The context window of a word is the entire length of its sentence
Connect edges between two words if their co-occurrence frequency is above a certain threshold value
We experiment with raw frequency and Positive Pointwise Mutual Information (PPMI)\(^1\) values
Decided the threshold by taking the median of non-zero entries in the matrix

\(^1\)Word association norms mutual information, and lexicography, Church and Hanks, Computational Linguistics, 1990

Outline

Introduction
Problem Statement
Proposed Models
- Sequential Attention Network
- GCN with Sequential Attention
Proposed Dataset
- Code-mixed dialogue
Results and Analysis
Conclusion

Modified DSTC2

System: Hello, what can I help you with today?

User: Hi!

User: I'd like to book a table for six people in an expensive restaurant with British food.

System: where should it be ?

User: Somewhere in the north of the town.

System: ok I am on it!

System: API_CALL (British, north, six, expensive)

The_Place R_cuisine British

The_Place R_location north

The_Place R_address The_Place_Address

The_Place R_price expensive

The_Place R_rating 10

Fancy_pub R_cuisine British

Fancy_pub R_location north

Fancy_pub R_address Fancy_pub_Address

Fancy_pub R_price expensive

Fancy_pub R_rating 8

System: How about The_Place?

User: No I don't like it.

System: Ok, then, what do you think about Fancy_Pub?

User: Yes! lets do it.

System: Great! I'll do the reservation.

User: What is the address?

System: Here it is: Fancy_Pub_Address

User: Thank You

System: You're welcome

Learning end-to-end goal-oriented dialog, Bordes et. al. , ICLR, 2017.

Code-mixed Data Collection

Build on top of modified DSTC2 dataset
Collect code-mixed dialogues using this dataset in four languages :
- Hindi-English, Bengali-English, Gujarati-English and Tamil-English.

Extract Unique

Utterances

Replace entities

with placeholders

Unique

utterances

Crowdsource

code-mixing

utterance templates

Replace placeholders

with entities

code-mixed templates

Replace utterances

back into dialogue

code-mixed

utterances

code-mixed dialogue data

English

dialogue data

Sorry there is no Chinese restaurant in the west part of town

Sorry there is no Italian restaurant in the north part of town

Sorry there is no <CUISINE> restaurant in the <AREA> part of town

Outline

Introduction
Problem Statement
Proposed Models
- Sequential Attention Network
- GCN with Sequential Attention
Proposed Dataset
- Code-mixed dialogue
Results and Analysis
Conclusion

Results on En-DSTC2

RNN + CROSS -GCN-SeA

Connect edges between query/history words and KB entities if they exactly match
Creates one global graph, encoded using one GCN
Then separated into different contexts to perform the sequential attention

Results on code-mixed data

Effect of using more GCN hops

PPMI vs Raw Frequencies

Using the PPMI scores gives a better contextual graph compared to using raw frequency
Evident across all other languages except English

Dependency or co-occurrence structure really needed ?

Dependency edges

Random edges

Ablations

RNN

GCN

RNN-GCN

Encoder :

Attention :

Bahdanau

Sequential

GCN

RNN-GCN

Sequential

RNN

GCN

RNN-GCN

Ablations

GCNs do not outperform RNNs independently:
- performance of GCN-Bahdanau attention < RNN-Bahdanau attention
Our Sequential attention outperforms Bahdanau attention:
- GCN-Bahdanau attention < GCN-Sequential attention
- RNN-Bahdanau attention < RNN-Sequential attention (BLEU & ROUGE)
- RNN+GCN-Bahdanau attention < RNN+GCN-Sequential attention
Combining GCNs with RNNs helps:
- RNN-Sequential attention < RNN+GCN-Sequential attention
Best results are always obtained by the final model which combines RNN, GCN and Sequential attention

Results on Human-Human Dialogue Datasets

Two dialogue datasets with human-human conversations
- Cam676: 676 dialogues from the restaurant domain
- MultiWOZ: Each dialogue can span multiple domains
  - SNG: 3406 dialogues in single domain
  - MUL: 10,438 dialogues in multiple domains
Two Metrics for goal completeness:
- Match: Correct entity was suggested according to dialogue goal
- Success: If it is a match and correct requestable slots were produced

Results on Human-Human Dialogue Datasets

Results on Cam676

Results on MultiWOZ

Human Evaluations

Randomly chosen responses from 200 dialogues of En-DSTC2 and Cam676
Pairwise comparison task between our model and best baseline
Evaluators could choose tie if both responses were good or bad

Conclusion

A single attention distribution overburdens the attention mechanism
Separated the history into Pre-KB, KB and Post-KB parts and attended sequentially over them
Showed that structure-aware representations are useful in goal-oriented dialogue
Used GCNs to infuse structural information of dependency graphs into the learned representations
Introduced a goal-oriented code-mixed dialogue dataset for four languages
When dependency parsers are not available, we used word co-occurrence frequencies and PPMI values to extract a contextual graph
Obtained state-of-the-art performance on four datasets.

Future Work

Extend the model to multidomain goal-oriented dialogue (restaurants, hotels, taxi)
Conditional code-mixed response generation
Use the whole Knowledge-Graph instead of dialogue specific KB triples
Use semantic graphs along with dependency parse trees

Publications

Suman Banerjee, Nikita Moghe, Siddhartha Arora and Mitesh M. Khapra, "A Dataset for building Code-Mixed Goal Oriented Conversation Systems", In the Proceedings of the 27th International Conference on Computational Linguistics, COLING, Santa Fe, New-Mexico, USA, August 2018.

Suman Banerjee and Mitesh M. Khapra, "Graph Convolutional Network with Sequential Attention For Goal-Oriented Dialogue Systems", Transactions of the Association for Computational Linguistics (TACL), 2019.

Thank You

Language Understanding

User utterance

Predicted

Intent

I need a cheap chinese restaurant in the north of town.

Intent Classification

request_rest
request_address
request_phone
.
.
.
book_table

Given a collection of utterances \( u_i \) and intent labels \( l_i : D = \{ (u_1,l_1),(u_2,l_2), \dots, (u_n,l_n) \} \), train a model to predict the intent for each utterance.

Classifier