Incorporating Domain Knowledge in Multilingual, Goal-oriented Neural Dialogue Models

Suman Banerjee

 

Department of Computer Science and Engineering,

Indian Institute of Technology Madras

Outline

  • Introduction
  • Related Work
    • Modular Architecture
    • End-to-End Architecture
    • Existing Neural Models
  • Problem Statement
  • Proposed Models
    • Sequential Attention Network
    • GCN with Sequential Attention 
  • Proposed Dataset
    • Code-mixed dialogue
    • Unstructured Knowledge
  • Results and Analysis
  • Conclusion


Dialogue Systems

Examples: Siri, Cortana, Bixby, Google Assistant, Alexa

Devices: Apple HomePod, Amazon Echo, Google Home

Two Paradigms

Challenges

  • Variability in natural language
  • Representing the meaning of utterances
  • Maintaining context over long turns
  • Incorporating world/domain knowledge

Chit-Chat

  • Has no specific end goals
  • Used for user engagement
  • Focus on generating natural responses

Goal-Oriented

  • Used to carry out a particular task
  • Requires background knowledge
  • Focus on generating informative responses that lead to task completion


Outline

  • Introduction
  • Related Work
    • Modular Architecture
    • End-to-End Architecture
    • Existing Neural Models
  • Problem Statement
  • Proposed Models
    • Sequential Attention Network
    • GCN with Sequential Attention 
  • Proposed Dataset
    • Code-mixed dialogue
    • Unstructured Knowledge
  • Results and Analysis
  • Conclusion

Modular Architecture

Figure: the modular pipeline. The user utterance ("I need a cheap chinese restaurant in the north of town.") is mapped by the Language Understanding module to a semantic frame (request_rest(cuisine=chinese, price=cheap, area=north)). The Dialogue Manager (Dialogue State Tracking and Policy Optimizer) maintains the dialogue state, consults the Knowledge Base and selects a system action (request_people()), from which the Language Generation module produces the system response ("Sure, for how many people?").

Probabilistic methods in spoken-dialogue systems, Steve J. Young, Philosophical Transactions: Mathematical, Physical and Engineering Sciences, 2000

Drawbacks

  • Output labels must be annotated for each module
  • Fixed assumptions about the dialogue state structure
  • Difficult to scale to new domains
  • Errors propagate from earlier modules


Outline

  • Introduction
  • Related Work
    • Modular Architecture
    • End-to-End Architecture
    • Existing Neural Models
  • Problem Statement
  • Proposed Models
    • Sequential Attention Network
    • GCN with Sequential Attention 
  • Proposed Dataset
    • Code-mixed dialogue
    • Unstructured Knowledge
  • Results and Analysis
  • Conclusion

End-to-End Architecture

  • Can be directly trained on utterance-response pairs; no intermediate supervision required
  • Can be easily scaled to new domains
  • No fixed assumption on the dialogue state structure
  • Can handle out-of-vocabulary (OOV) words

Figure: an end-to-end dialogue system maps the user utterance ("I need a cheap chinese restaurant in the north of town.") and the Knowledge Base directly to the system response ("Sure, for how many people?").

Outline

  • Introduction
  • Related Work
    • Modular Architecture
    • End-to-End Architecture
    • Existing Neural Models
  • Problem Statement
  • Proposed Models
    • Sequential Attention Network
    • GCN with Sequential Attention 
  • Proposed Dataset
    • Modified DSTC2
    • Code-mixed dialogue
  • Results and Analysis
  • Conclusion

Sequence-to-Sequence Models

Sequence-to-Sequence Learning with Neural Networks, Sutskever et al., NeurIPS, 2014

Figure: encoder-decoder (sequence-to-sequence) architecture.

Sequence-to-Sequence Models

Attention:

  • \( \alpha_{it} = f(\mathbf{h}_{i},\mathbf{d}_{t-1})\)
  • \( \mathbf{c}_t = \sum^{k}_{j=1}\alpha_{jt}\mathbf{h}_j\)

  Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau et al., ICLR, 2015

Figure: the attention weights \( \alpha_{it}\) over the encoder states \( \mathbf{h}_{i}\) produce the context vector \( \mathbf{c}_{t}\), which is fed to the decoder state \( \mathbf{d}_{t}\).
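A minimal numpy sketch of this attention step (the additive score function and all parameter shapes are illustrative assumptions, not the exact setup of any particular model):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(H, d_prev, W_h, W_d, v):
    # Score each encoder state h_i against the previous decoder state d_{t-1},
    # then form the context vector c_t as the attention-weighted sum.
    scores = np.tanh(H @ W_h + d_prev @ W_d) @ v   # one score per encoder position
    alpha = softmax(scores)                        # attention weights alpha_{it}
    c_t = alpha @ H                                # c_t = sum_j alpha_{jt} h_j
    return alpha, c_t

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))                        # 5 encoder states of size 4
d_prev = rng.normal(size=4)                        # previous decoder state
alpha, c_t = attention(H, d_prev,
                       rng.normal(size=(4, 3)),
                       rng.normal(size=(4, 3)),
                       rng.normal(size=3))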

Hierarchical Recurrent Encoder Decoder

Building End-to-End Dialogue Systems Using Generative Hierarchical Neural Network Models, Serban et al., AAAI, 2016

Memory Networks

  • Dialogue History : 

Figure: end-to-end memory network. The query \( \mathbf{q}^b\) is matched against the memory to give attention weights \( \mathbf{p}^b\) and a read vector \( \mathbf{o}^b\); the updated query \( \mathbf{q}^{b+1} = \mathbf{o}^b + \mathbf{q}^b\) is passed to the next hop, and the final \( \mathbf{q}^{B+1}\) scores the candidate responses \( BOW(y_i)\) through \( W\) to produce the output distribution \( \mathbf{\hat{z}}\).

End-to-End Memory Networks, Sukhbaatar et al., NeurIPS, 2015

User \((u_1)\):  Hello!

System \((s_1)\):   How can I help you today?

...

System \((s_{t-1})\): How about Fancy_Pub?

  • Memory Representation: \( \mathbf{m}_2 = A \cdot BOW(s_1) \)
  • Current User Utterance (Query):

User \((u_t)\):  I don't like it

  • Query Representation: \( \mathbf{q} = C \cdot BOW(u_t) \)

  • Attention :
    • \( a^b_i        = \mathbf{q}^{bT}\mathbf{m}_i \)
    • \( \mathbf{p}^b       = softmax(\mathbf{a}^b) \)
    • \( \mathbf{o}^b       = \sum_{i=1}^e \mathbf{m}_i p^b_i\)
    • \( \mathbf{q}^{b+1}= \mathbf{o}^b + \mathbf{q}^b\)
  • Candidate responses : 

\(y_1\):  Let me find another one

\(y_2\):  How about The_Place

...

\(y_{c}\):  Sorry, there are no other pubs

  • Prediction :
    • \( e_i = \mathbf{q}^{B+1}\cdot W\cdot BOW(y_i)\)
    • \( \mathbf{\hat{z}} = softmax(\mathbf{e})\)
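A minimal numpy sketch of the multi-hop read and candidate scoring described above (memory, query and candidate vectors are assumed to be already embedded; all values here are random placeholders):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memnet_respond(M, q, C, W, hops=3):
    # Multi-hop memory read followed by candidate scoring.
    # M: (e, d) memory vectors m_i; q: (d,) query; C: (c, d) candidate vectors BOW(y_i).
    for _ in range(hops):
        p = softmax(M @ q)      # p^b = softmax(q^b . m_i), attention over memories
        o = p @ M               # o^b = sum_i p^b_i m_i
        q = q + o               # q^{b+1} = o^b + q^b
    scores = C @ (W @ q)        # e_i = q^{B+1} . W . BOW(y_i)
    return softmax(scores)      # z_hat: distribution over candidate responses

rng = np.random.default_rng(0)
z_hat = memnet_respond(M=rng.normal(size=(6, 8)), q=rng.normal(size=8),
                       C=rng.normal(size=(4, 8)), W=rng.normal(size=(8, 8)))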

Outline

  • Introduction
  • Related Work
    • Modular Architecture
    • End-to-End Architecture
    • Existing Neural Models
  • Problem Statement
  • Proposed Models
    • Sequential Attention Network
    • GCN with Sequential Attention 
  • Proposed Dataset
    • Code-mixed dialogue
    • Unstructured Knowledge
  • Results and Analysis
  • Conclusion


Goal-oriented Dialogue phases

Pre-KB

System: Hello, what can I help you with today?

 User:  Hi!

 User:  I'd like to book a table for six people in an expensive restaurant with British food.

System: where should it be ?

 User:  Somewhere in the north of the town.

System: ok I am on it!

System: API_CALL (British, north, six, expensive)

KB

 The_Place R_cuisine British

 The_Place R_location north

 The_Place R_address The_Place_Address

 The_Place R_price expensive

 The_Place R_rating 10

 Fancy_pub R_cuisine British

 Fancy_pub R_location north

 Fancy_pub R_address Fancy_pub_Address

 Fancy_pub R_price expensive

 Fancy_pub R_rating 8

Post-KB

System: How about The_Place?

 User:  No I don't like it.

System: Ok, then, what do you think about Fancy_Pub?

 User:  Yes! lets do it.

System: Great! I'll do the reservation.

 User:  What is the address?

System: Here it is: Fancy_Pub_Address

 User:  Thank You

System: You're welcome

Structural Information

Dependency Parse of sentences

Knowledge Graph

  • Current state-of-the-art models ignore this rich structural information
  • We exploit this structural information in our model
  • We couple it with a sequential attention mechanism
  • We empirically show that such structure-aware representations improve the response generation task

Code Mixing

Speaker 1: Hi, can you help me with booking a table at a restaurant?

Speaker 2: Sure, would you like something in cheap, moderate or expensive?

Speaker 1: Hi, kya tum ek restaurant mein table book karne mein meri help karoge?

Speaker 2: Sure, aap ko kaunsi price range mein chahiye, cheap, moderate ya expensive?

Speaker 1: Hi, tumi ki ekta restaurant ey table book korte amar help korbe?

Speaker 2: Sure, aapni kon price range ey chaan, cheap, moderate na expensive?

Problem

  • Dialogue of n turns : \( \{ (u_1,s_1),(u_2,s_2),...,(k_1,k_2,...,k_e),...,(u_n,s_n) \}\)
  • Phases:
    • Pre-KB : \( (u_1,s_1,u_2,s_2,...,u_i,s_i )  \) 
    • KB : \( (k_1,k_2,...,k_e) \)
    • Post-KB : \( (u_{i+1},s_{i+1},...,u_n,s_n ) \) 
  • At the \( t^{th}\) turn, the user utterance \( u_t \) is considered as the query to the system
  • The system is supposed to generate the response: \( s_t \)
  • Models:
    • Sequential Attention Network  - To handle the three phases sequentially
    • Graph Convolutional Network based model with Sequential Attention - To incorporate structural information
  • Dataset
    • Code-mixed dialogue dataset
    • Dialogue dataset with unstructured background knowledge

Outline

  • Introduction
  • Related Work
    • Modular Architecture
    • End-to-End Architecture
    • Existing Neural Models
  • Problem Statement
  • Proposed Models
    • Sequential Attention Network
    • GCN with Sequential Attention 
  • Proposed Dataset
    • Code-mixed dialogue
    • Unstructured Knowledge
  • Results and Analysis
  • Conclusion

Single Attention Distribution

  • Current models use a single long RNN or Memory Network to encode the entire context
  • Compute a single attention distribution over these representations
  • This way of computing the attention overburdens the attention mechanism

Sequential Attention

Figure: the dialogue context is split into three parts: Pre-KB, KB and Post-KB.

Sequential Attention

Post-KB

  • Post-KB RNN: \( \mathbf{h}^P_t  = BiRNN(\mathbf{h}^P_{t-1},\mathbf{g}_t) \)
  • Query RNN: \( \mathbf{h}^Q_t  = BiRNN(\mathbf{h}^Q_{t-1},\mathbf{q}_t) \)
  • \( \boldsymbol{\alpha}_t = f(\mathbf{h}^P_t,\mathbf{d}_{t-1},\mathbf{h}^Q_T)\)
  • \( \mathbf{h}_{post} = \sum_{j=1}^{p} \alpha_{jt}\mathbf{h}_j^P \)

Sequential Attention

KB (Memory Network)

  • \( \mathbf{p}^b_i = softmax(\mathbf{u}^bW^b_m\mathbf{h}^K_i)\)
  • \( \mathbf{o}^b = \sum_{j=1}^e p^b_j\mathbf{h}^K_j \)
  • \( \mathbf{u}^{b+1} = \mathbf{u}^b + \mathbf{o}^b \)
  • Final hop's output vector: \( \mathbf{h}_{kb} = \mathbf{o}^{B+1} \)

Sequential Attention

Pre-KB

  • \( \mathbf{h}_t^R = BiRNN(\mathbf{h}_{t-1}^R,\mathbf{x}_t)\)
  • \( \boldsymbol \beta_t = f(\mathbf{h}_j^R , \mathbf{d}_{t-1} , \mathbf{h}^Q_T , \mathbf{h}_{post} , \mathbf{h}_{kb})\)
  • \( \mathbf{h}_{pre} = \sum_{j=1}^r \beta_{jt}\mathbf{h}^R_j \)

End-to-End Network
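A minimal numpy sketch of the sequential attention flow above (Post-KB, then KB, then Pre-KB). The additive score function, the single-step KB attention in place of the multi-hop memory network, and all parameter values are simplifying assumptions for illustration:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(states, conditioners, W_s, W_c, v):
    # Additive attention: score each state against the concatenated conditioners.
    c = np.concatenate(conditioners)
    scores = np.tanh(states @ W_s + c @ W_c) @ v
    alpha = softmax(scores)
    return alpha @ states          # attention-weighted summary of the states

rng = np.random.default_rng(0)
d = 6
H_post, H_kb, H_pre = (rng.normal(size=(n, d)) for n in (4, 5, 3))
h_Q, d_prev = rng.normal(size=d), rng.normal(size=d)

def params(n_cond):
    # fresh (illustrative, untrained) parameters for one attention module
    return rng.normal(size=(d, d)), rng.normal(size=(n_cond * d, d)), rng.normal(size=d)

# 1) attend over Post-KB states, conditioned on the decoder state and the query summary
h_post = attend(H_post, [d_prev, h_Q], *params(2))
# 2) attend over the KB (one step here; the model uses a multi-hop memory network)
h_kb = attend(H_kb, [d_prev, h_Q, h_post], *params(3))
# 3) attend over Pre-KB states, conditioned on everything computed so far
h_pre = attend(H_pre, [d_prev, h_Q, h_post, h_kb], *params(4))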

Outline

  • Introduction
  • Related Work
    • Modular Architecture
    • End-to-End Architecture
    • Existing Neural Models
  • Problem Statement
  • Proposed Models
    • Sequential Attention Network
    • GCN with Sequential Attention
  • Proposed Dataset
    • Code-mixed dialogue
    • Unstructured Knowledge
  • Results and Analysis
  • Conclusion

Graph Convolutional Network (GCN)

  • A GCN computes a representation for each node of a graph by looking at the node's neighbourhood
  • Formally, let \( \mathcal{G} = (\mathcal{V},\mathcal{E})\) be a graph
  • Let \( \mathcal{X} \in \mathbb{R}^{n \times m}\) be the input feature matrix for the \(n\) nodes
  • Each node \(u\) has an \(m\)-dimensional feature vector \(\mathbf{x}_u \in \mathbb{R}^m\)
  • The output of 1-hop GCN is \( \mathcal{H} \in \mathbb{R}^{n \times d}\)
  • Each \( \mathbf{h}_v \in \mathbb{R}^{d}\) is a node representation that captures the 1-hop neighbourhood information
\[ \mathbf{h}_v^{k+1} = ReLU \bigg( \sum_{u \in \mathcal{N}(v)} (W^k\mathbf{h}_u^{k} + \mathbf{b}^k) \bigg) \quad , \quad \forall v \in \mathcal{V} \]
  • Here \(k\) is the hop number
  • \(\mathbf{h}^1_u = \mathbf{x}_u \)
  • \( \mathcal{N}(v)\) is the set of neighbours of node \(v\)

Semi-supervised classification with graph convolutional networks. Kipf and Welling, ICLR, 2017.
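A minimal numpy sketch of the 1-hop GCN update above (the toy graph, the self-loops and the absence of any normalisation are illustrative choices):

import numpy as np

def gcn_layer(X, A, W, b):
    # One GCN hop: h_v = ReLU( sum_{u in N(v)} (W x_u + b) ).
    # X: (n, m) node features; A: (n, n) adjacency, A[v, u] = 1 if u is a neighbour of v.
    return np.maximum(0, A @ (X @ W + b))

# toy graph on 3 nodes (edges 0-1 and 1-2, self-loops included as a common choice)
A = np.array([[1., 1., 0.],
              [1., 1., 1.],
              [0., 1., 1.]])
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
H = gcn_layer(X, A, rng.normal(size=(4, 5)), np.zeros(5))   # (3, 5) node representations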

Problem

  • Dialogue :  \( \{ (u_1,s_1),(u_2,s_2),...,(k_1,k_2,...,k_e),...,(u_n,s_n) \} \)
  • Each \(k_i\) is of the form: (entity\(_1\), relation, entity\(_2\))
  • The KB triples  can be represented as a graph : \( \mathcal{G}_k = (\mathcal{V}_k,\mathcal{E}_k) \)
    • where \(\mathcal{V}_k\) is the set of entities in the KB triples
    • \( \mathcal{E}_k\) is the set of edges where each edge is of the form : (entity\(_1\), entity\(_2\), relation)
  • At the \( t^{th}\) turn of the dialogue, given the:
    • Dialogue History: H = \( (u_1,s_1,...,s_{t-1}) \)
    • The current user utterance as query: Q = \( u_t\)
    • The knowledge graph \(\mathcal{G}_k\)
  • The task is to generate the current system utterance \(s_t\)

Syntactic GCNs with RNN

\[ \mathbf{a}_v^{k+1} = ReLU \bigg( \sum_{u \in \mathcal{N}(v)} (V^{k}_{dir(u,v)}\mathbf{a}_u^{k} + \mathbf{o}^{k}_{dir(u,v)}) \bigg), \quad \forall v \in \mathcal{V}_H \]
  • Obtain the dependency graph for the dialogue history : \( \mathcal{G}_H =(\mathcal{V}_H,\mathcal{E}_H)\)

 

RNN-Encoder

Syntactic GCNs with RNN

\[ \mathbf{a}_v^{k+1} = ReLU \bigg( \sum_{u \in \mathcal{N}(v)} (V^{k}_{dir(u,v)}\mathbf{a}_u^{k} + \mathbf{o}^{k}_{dir(u,v)}) \bigg), \quad \forall v \in \mathcal{V}_H \]

\[ \mathbf{s}_t = BiRNN_H(\mathbf{s}_{t-1},\mathbf{p}_t) \]

Figure: GCN stacked on top of the RNN encoder (RNN-GCN).

GCN with Sequential Attention

Query Attention

\[ \alpha_{jt} = f_1(\mathbf{c}^f_j, \mathbf{d}_{t-1}) \]

\[ \mathbf{h}^Q_t =\sum_{j'=1}^{|Q|} \alpha_{j't}\mathbf{c}_{j'}^f \]

History Attention

\[\beta_{jt} = f_2(\mathbf{a}^f_j, \mathbf{d}_{t-1}, \mathbf{h}^Q_t)\]

\[    \mathbf{h}^H_t =  \sum_{j'=1}^{|H|} \beta_{j't}\mathbf{a}_{j'}^f \]

KB Attention

\[ \gamma_{jt} = f_3(\mathbf{r}^f_j,\mathbf{d}_{t-1}, \mathbf{h}^Q_t,\mathbf{h}^H_t)\]

\[ \mathbf{h}^K_t =   \sum_{j'=1}^m \gamma_{j't}\mathbf{r}_{j'}^f \]

GCNs for code-mixed utterances

  • Dependency Parsers are not available for code-mixed sentences
  • Need an alternate way of extracting structural information
  • Create a global word co-occurrence matrix from the entire corpus
  • The context window of a word is the entire length of its sentence
  • Connect two words with an edge if their co-occurrence weight is above a certain threshold
  • We experiment with raw frequency and Positive Pointwise Mutual Information (PPMI)\(^1\) values
  • The threshold is the median of the non-zero entries in the matrix (see the sketch below)

\(^1\)Word association norms mutual information, and lexicography, Church and Hanks, Computational Linguistics, 1990
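A minimal sketch of the contextual-graph construction described above (the whitespace tokenisation, the particular PPMI estimate and the toy sentences are illustrative assumptions):

import numpy as np
from collections import Counter
from itertools import combinations

def contextual_graph(sentences, use_ppmi=True):
    # Build word-word edges from sentence-level co-occurrence; keep an edge only if
    # its weight (raw count or PPMI) is above the median of the non-zero weights.
    pair_counts, word_counts, total = Counter(), Counter(), 0
    for sent in sentences:
        words = sent.split()
        word_counts.update(words)
        total += len(words)
        # the context window of a word is its entire sentence
        pair_counts.update(frozenset(p) for p in combinations(sorted(set(words)), 2))
    n_pairs = sum(pair_counts.values())

    def weight(pair, count):
        if not use_ppmi:
            return count
        w1, w2 = tuple(pair)
        pmi = np.log2((count / n_pairs) /
                      ((word_counts[w1] / total) * (word_counts[w2] / total)))
        return max(pmi, 0.0)                       # positive PMI

    weights = {pair: weight(pair, c) for pair, c in pair_counts.items()}
    threshold = np.median([w for w in weights.values() if w > 0])
    return {tuple(pair) for pair, w in weights.items() if w > threshold}

edges = contextual_graph([
    "kya tum ek restaurant mein table book karne mein meri help karoge",
    "aap ko kaunsi price range mein chahiye cheap moderate ya expensive",
])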

  • The_Place is a nice restaurant that serves British food.
  • Fancy_pub is a nice restaurant that serves British food.
  • Prezzo is a nice restaurant that serves Italian food.
  • <restaurant> is a nice restaurant that serves <cuisine> food.
  • To inject the correct entities into the responses we need a different mechanism
  • The backbone of the sentences should be generated from the vocabulary
  • The correct entities should be copied from the KB into the correct position of the generated response
y: The_Place serves British food and the prices are expensive

\( labelc \) = 7 9 1 9 9 9 9 9 4   (one memory position per word of y; position 9, the '#' sentinel, means "not copied")

\( Memory = \) Chinese (0), British (1), Italian (2), cheap (3), expensive (4), moderate (5), Fancy_Pub (6), The_Place (7), Prezzo (8), # (9)

Copy Mechanism

Copy Mechanism

Memory Network

\( P^k = softmax(r^fC^kq^k)\)

\( q^{k+1} =  q^k + \sum_{j=0}^{m}P^k_jr^f_j\)

\( q^{1} = d_{t}\)

\[ \hat{y}_t = \left\{ \begin{array}{c l} argmax(P_{vocab}) &: \text{if } argmax(P_{copy}) = \text{'\#'} \\ argmax(P_{copy}) &: \text{otherwise} \end{array}\right. \]

\( P_{vocab} = softmax(V'd_{t} + b')\)

\[ \mathscr{L}_t = -\log P_{vocab}(y_{t}) - \log P_{copy}(labelc_t) \]
  • \(r^f\) : KB-GCN representations 
  • \(C^k\) : Parameter Matrix

Mem2seq: Effectively incorporating knowledge bases into end-to-end task oriented dialog systems, Madotto et al., ACL, 2018
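A minimal sketch of one decoding step with the copy mechanism above (using the last hop's copy distribution and plain argmax decoding are simplifying assumptions):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def copy_step(d_t, R, C_hops, V, b, vocab, memory_tokens, sentinel='#'):
    # One decoding step: generate from the vocabulary unless the copy distribution
    # points at a KB entity; the sentinel '#' in the memory means "do not copy".
    # R: (m, d) KB-GCN representations r^f; C_hops: list of (d, d) hop matrices C^k.
    p_vocab = softmax(V @ d_t + b)            # P_vocab = softmax(V' d_t + b')
    q = d_t                                   # q^1 = d_t
    for C_k in C_hops:
        p_copy = softmax(R @ (C_k @ q))       # P^k = softmax(r^f C^k q^k)
        q = q + p_copy @ R                    # q^{k+1} = q^k + sum_j P^k_j r^f_j
    i = int(np.argmax(p_copy))                # last hop's copy distribution
    if memory_tokens[i] == sentinel:
        return vocab[int(np.argmax(p_vocab))] # generate a vocabulary word
    return memory_tokens[i]                   # copy the entity from the memory

During training, \( P_{vocab}\) and \( P_{copy}\) are supervised jointly with the loss above, using \( labelc_t\) as the target memory position.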

Copy Mechanism

Figure: heat map of \( P_{copy}\) over the memory across all decoder time steps of the generated response.

Outline

  • Introduction
  • Related Work
    • Modular Architecture
    • End-to-End Architecture
    • Existing Neural Models
  • Problem Statement
  • Proposed Models
    • Sequential Attention Network
    • GCN with Sequential Attention 
  • Proposed Dataset
    • Code-mixed dialogue
    • Unstructured Knowledge
  • Results and Analysis
  • Conclusion

Modified DSTC2

System: Hello, what can I help you with today?

 User:  Hi!

 User:  I'd like to book a table for six people in an expensive restaurant with British food.

System: where should it be ?

 User:  Somewhere in the north of the town.

System: ok I am on it!

System: API_CALL (British, north, six, expensive)

 

 The_Place R_cuisine British

 The_Place R_location north

 The_Place R_address The_Place_Address

 The_Place R_price expensive

 The_Place R_rating 10

 Fancy_pub R_cuisine British

 Fancy_pub R_location north

 Fancy_pub R_address Fancy_pub_Address

 Fancy_pub R_price expensive

 Fancy_pub R_rating 8

 

System: How about The_Place?

 User:  No I don't like it.

System: Ok, then, what do you think about Fancy_Pub?

 User:  Yes! lets do it.

System: Great! I'll do the reservation.

 User:  What is the address?

System: Here it is: Fancy_Pub_Address

 User:  Thank You

System: You're welcome

Learning end-to-end goal-oriented dialog, Bordes et al., ICLR, 2017.

Code-mixed Data Collection

  • Built on top of the modified DSTC2 dataset
  • Collect code-mixed dialogues using this dataset in four languages :
    • Hindi-English, Bengali-English, Gujarati-English and Tamil-English.

Pipeline: English dialogue data → extract unique utterances → replace entities with placeholders (utterance templates) → crowdsource code-mixing of the templates → replace placeholders back with entities (code-mixed utterances) → stitch the utterances back into dialogues (code-mixed dialogue data).

Sorry there is no Chinese restaurant in the west part of town

Sorry there is no Italian restaurant in the north part of town

Sorry there is no <CUISINE> restaurant in the <AREA> part of town

Quantification of Code-mixing

  • Code-mixing : Foreign language words embedded into a Native (Matrix) language sentence.
\[ C_u(x) =\left\{ \begin{array}{ll} \frac{N(x)-\underset{L_{i} \in \mathcal{L}}{\max} \{ t_{L_i} \}}{N(x)} &: N(x)>0\\ 0 &: N(x)=0 \end{array}\right. \]
  • where \( x\) is an utterance
  • \( N(x) \) is the number of language specific tokens in \(x\)
  • \( \mathcal{L} \) is the set of all languages and \( t_{L_i}\) is the number of tokens of language \(L_i\)
  • Consider the number of language switch points in an utterance : \( \frac{P(x)}{N(x)}\) 
\[ C_u(x) = 100\cdot\frac{N(x) - \underset{L_{i} \in \mathcal{L}}{\max} \{ t_{L_i} \} + P(x)}{2N(x)} \quad (\textit{if}~N(x) >0) \]

\[ C_{avg} = \frac{1}{U}\sum_{i=1}^{U}C_u(x_i) \]

Comparing the level of code-switching in corpora, Björn Gambäck and Amitava Das, LREC, 2016

  • "Prezzo ek accha restaurant hain in the north part of town jo tasty chinese food serve karta hain."

Quantification of Code-mixing

Comparing the level of code-switching in corpora, Björn Gambäck and Amitava Das, LREC, 2016

  • We replace the \( \text{max} \) with the following:
\[ \mathit{native}(x) = \left\{ \begin{array}{c l} t_{L_n} &: t_{L_n} > 0 \\ N(x) &: t_{L_n} = 0 \end{array}\right. \]

\[ C_c=\frac{100}{U}\left[ \frac{1}{2} \sum_{i=1}^{U} \left( 1-\frac{\mathit{native}(x)+P(x)}{N(x)} + \delta(x) \right) + \frac{5}{6}S\right] \]
  • With two other factors:
    • \( \delta(x) \) : for inter-utterance mixing
    • \(S\) : number of code-mixed utterances out of \(U\) utterances.

Outline

  • Introduction
  • Related Work
    • Modular Architecture
    • End-to-End Architecture
    • Existing Neural Models
  • Problem Statement
  • Proposed Models
    • Sequential Attention Network
    • GCN with Sequential Attention 
  • Proposed Dataset
    • Code-mixed dialogue
    • Unstructured Knowledge
  • Results and Analysis
  • Conclusion

Self-chat for Movie Dialogue

PLOT

The British navy requisitions civilian vessels that can get close to the beach. In Weymouth, Mr. Dawson and his son Peter set out on his boat Moonstone rather than let the navy take it. Impulsively, their teenage friend George joins them. At sea, they rescue a shell-shocked officer from a wrecked ship.

REVIEW

Dunkirk is edge of your seat filmmaking.They're all great but Dunkirk is not about any one solider. Also 'Dunkirk' is another brilliant collaboration between Nolan & Hans Zimmer. The way he mixes in a ticking clock with score is nail biting. Dunkirk relies on very little dialogue

 

COMMENTS

This is a very important movie, because it doesn't glamorize or glorify war.

 

Just awesome! Simply awesome!

 

Hans Zimmer did really great with the score

 

The movie was brilliant

DIALOGUE

Speaker1 (N): What do you think about the movie?

Speaker2 (C): I think the movie was brilliant.

Speaker1 (N): Agreed! One of the finest in this genre.

Speaker2 (C): I believe the best part about the movie is that it doesn't glamorize or glorify war.

Speaker1 (N): Totally! Oh by the way do you remember the name of the ship headed by Mr.Dawson ?

Speaker2 (P): Yes. It was Moonstone

Speaker1 (N): Right. I am always impressed by the Nolan - Hans Zimmer collaboration.

Speaker2 (R): The way he mixes in a ticking clock with score is nail biting.

Speaker1 (N): That's an interesting way to put it. Would recommend any other movies ?

Speaker2 (F): I think you will enjoy Saving Private Ryan

FACT TABLE

Box Office: $520,102,431
Similar Movies: Saving Private Ryan, Interstellar

Self-chat for Products

DESCRIPTION

 

. . . Dual Selfie Camera. One for Selfie, One for Groupfie The OPPO F3 Plus sports dual front cameras for best selfies - A 16 MP camera and a 8 MP 120 wide angle camera. Meanwhile, beautify 4.0 gives even more options to capture head-turning selfies.. The camera uses a tailor-made 6P lens module, which adjusts the path of light to minimize image distortion and maximize clarity. . . .

POSITIVE REVIEW

 

It has Dual Selfie mode with Wide viewing angle which is best in its class.

 

Speaker has quite clean and powerful sound.

NEGATIVE REVIEWS

 

It feels heavy in hands as well as in pocket with 185 grams weight.

 

No support for FM radio and NFC

DIALOGUE

Buyer(N): Hi, Can  you recommend me some good phone to buy?

Seller(N): Hi, Sure I would like to recommend you oppo-f3-plus.

Buyer(RP): Tell me more about the Dual Selfie with Wide viewing angle.

Seller(RP+D): It has Dual Selfie Camera. One for Selfie, One for Groupfie The OPPO F3 Plus sports dual front cameras for best selfies - A 16 MP camera and a 8 MP 120 wide angle camera.

Buyer(D): Ok. How is the performance of the camera?

Seller(D): The camera uses a tailor-made 6P lens module, which adjusts the path of light to minimize image distortion and maximize clarity.

Buyer(RN): As good as the product looks but it has No support for FM radio and NFC.

Seller(RP): No it does not but the Speaker has quite clean and powerful sound.

Buyer(F): Well how much does it cost?

Seller(F): Just 30,990

Buyer(N): Well, ok I will give it a try.

Seller(N): You will surely enjoy the product.

Buyer(N): Buy

Seller(N): Sell

FEATURE TABLE

Battery: 4000mAh
RAM: 4GB
Price: 30,990
Internal Memory: 64GB
Screen: 6 inches

Outline

  • Introduction
  • Related Work
    • Modular Architecture
    • End-to-End Architecture
    • Existing Neural Models
  • Problem Statement
  • Proposed Models
    • Sequential Attention Network
    • GCN with Sequential Attention 
  • Proposed Dataset
    • Code-mixed dialogue
    • Unstructured Knowledge
  • Results and Analysis
  • Conclusion

Baseline Results

  • Established initial baseline results using two generation based models:
    • Sequence-to-sequence with attention
    • Hierarchical Recurrent Encoder-Decoder (HRED)
  • Per-response accuracy: exact-match accuracy between the generated response and the ground truth
  • BLEU, ROUGE: n-gram overlap based metrics that evaluate generation quality
  • Entity F1: micro-average of F1 scores between ground-truth entities and generated entities
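A minimal sketch of the two dialogue-specific metrics (whitespace tokenisation and exact entity matching are simplifying assumptions; BLEU and ROUGE come from standard libraries):

def per_response_accuracy(predictions, references):
    # Fraction of generated responses that exactly match the ground truth.
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

def entity_f1(predictions, references, kb_entities):
    # Micro-averaged F1 over KB entities appearing in the responses.
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        pred_ents = set(pred.split()) & kb_entities
        ref_ents = set(ref.split()) & kb_entities
        tp += len(pred_ents & ref_ents)
        fp += len(pred_ents - ref_ents)
        fn += len(ref_ents - pred_ents)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)) if precision + recall else 0.0

refs = ["how about The_Place ?", "here it is : Fancy_Pub_Address"]
preds = ["how about Fancy_Pub ?", "here it is : Fancy_Pub_Address"]
kb = {"The_Place", "Fancy_Pub", "Fancy_Pub_Address", "The_Place_Address"}
print(per_response_accuracy(preds, refs), entity_f1(preds, refs, kb))   # -> 0.5 0.5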

Baseline Results

  • Baselines from three paradigms:
    • Pure generation based - HRED
    • Generation based model that learns to copy from the background knowledge - GTTP
    • Span prediction based model - BiDAF

Results on En-DSTC2

RNN+CROSS-GCN-SeA

  • Connect edges between query/history words and KB entities if they exactly match
  • Creates one global graph, encoded using one GCN
  • Then separated into different contexts to perform the sequential attention
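A minimal sketch of the exact-match edge construction (token positions and the toy inputs are illustrative):

def cross_edges(context_tokens, kb_entities):
    # Add an edge between a dialogue-context token and a KB entity node
    # whenever the two match exactly (token positions keep duplicates apart).
    return {(i, tok) for i, tok in enumerate(context_tokens) if tok in kb_entities}

print(cross_edges("what do you think about Fancy_Pub".split(),
                  {"The_Place", "Fancy_Pub"}))
# -> {(5, 'Fancy_Pub')}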

Results on code-mixed data

Effect of using more GCN hops

PPMI vs Raw Frequencies

  • Using PPMI scores gives a better contextual graph than using raw frequencies
  • This is evident for all languages except English

Is dependency or co-occurrence structure really needed?

Dependency edges

Random edges

Ablations

Ablation grid: Encoder ∈ {RNN, GCN, RNN-GCN}, Attention ∈ {Bahdanau, Sequential}

Ablations

  • GCNs do not outperform RNNs independently:

    • performance of GCN-Bahdanau attention < RNN-Bahdanau attention

  • Our Sequential attention outperforms Bahdanau attention:

    • GCN-Bahdanau attention < GCN-Sequential attention

    • RNN-Bahdanau attention < RNN-Sequential attention (BLEU & ROUGE)

    • RNN+GCN-Bahdanau attention < RNN+GCN-Sequential attention

  • Combining GCNs with RNNs helps:

    • RNN-Sequential attention < RNN+GCN-Sequential attention

  • Best results are always obtained by the final model which combines RNN, GCN and Sequential attention

Conclusion

  • A single attention distribution overburdens the attention mechanism

  • Separated the history into Pre-KB, KB and Post-KB parts and attended sequentially over them

  • Showed that structure-aware representations are useful in goal-oriented dialogue

  • Used GCNs to infuse structural information of dependency graphs into the learned representations

  • Introduced a goal-oriented code-mixed dialogue dataset for four languages
  • Quantified the amount of code-mixing present in the dataset
  • Introduced a dialogue dataset with a mix of structured and unstructured background knowledge
  • When dependency parsers are not available, we used word co-occurrence frequencies and PPMI values to extract a contextual graph

  • Obtained state-of-the-art performance on the modified DSTC2 dataset and its code-mixed versions

Future Work

  • Extend the model to multi-domain goal-oriented dialogue (restaurants, hotels, taxis)

  • Conditional code-mixed response generation

  • Better copy mechanism

  • Use the whole Knowledge-Graph instead of dialogue specific KB triples

  • Use semantic graphs along with dependency parse trees

Publications

  • A Dataset for building Code-Mixed Goal Oriented Conversation Systems, Suman Banerjee, Nikita Moghe, Siddhartha Arora and Mitesh M. Khapra, In the Proceedings of the 27th International Conference on Computational Linguistics, COLING, Santa Fe, New-Mexico, USA, August 2018.
  • Graph Convolutional Network with Sequential Attention For Goal-Oriented Dialogue Systems, Suman Banerjee and Mitesh M. Khapra, Transactions of the Association for Computational Linguistics (TACL), 2019. (Under Review)
  • Towards Exploiting Background Knowledge for Building Conversation Systems, Nikita Moghe, Siddhartha Arora, Suman Banerjee and Mitesh M. Khapra, In the Proceedings of Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, November 2018.

Questions?

Data Evaluation

  • We code-mixed individual utterances and stitched them back into dialogues
  • How do we ensure that each dialogue still makes sense?
  • We carried out human evaluations:
    • 100 randomly chosen dialogues
    • 3 evaluators per language
    • 3 criteria for evaluating a dialogue:
      • Colloquialism: the code-mixing is colloquial throughout the dialogue and not forced
      • Intelligibility: the dialogue can be understood by a bilingual speaker
      • Coherence: the dialogue is coherent, with each utterance fitting appropriately into its context

Language Understanding

  • Intent Classification

Figure: a classifier maps the user utterance ("I need a cheap chinese restaurant in the north of town.") to a predicted intent from a fixed set (request_rest, request_address, request_phone, ..., book_table).

  • Given a collection of utterances \( u_i \) and intent labels \( l_i : D = \{ (u_1,l_1),(u_2,l_2), \dots, (u_n,l_n) \} \), train a model to predict the intent for each utterance.
  • Slot Filling

Semantic Frame: request_rest(cuisine=chinese, price=cheap, area=north)


Language Understanding

Figure: a tagger maps each word of the user utterance ("I need a cheap chinese restaurant in the north of town.") to a predicted slot tag, as shown below.

  • Given a collection of tagged utterance words \( D = \{ ((u_{i1},u_{i2},...,u_{in_1}),(t_{i1},t_{i2},...,t_{in_1})) \}_{i=1}^n \), train a model to predict the tags for each word of the utterances.
  • Evaluation: Intent accuracy, slot tagging accuracy or frame accuracy

Slot Filling

I <null>
need <null>
a <null>
cheap <price>
chinese <cuisine>
restaurant <null>
in <null>
the <null>
north <area>
of <null>
town <null>

Dialogue Management

  • Dialogue State: Represents the system's belief about the user's goal at any turn in the dialogue.

User: Book a table at Prezzo for 5.

System: How many people?

User: For 3.

Figure: the dialogue state tracks slot values such as #People and Time across turns.

  • Dialogue State Tracking:
    • Used to generate API calls to the knowledge base (KB)
    • Provide the results of the KB lookup and the dialogue state to the policy optimizer
  • Policy Optimizer: Given the dialogue state (and additional inputs), generate the next system action.
  • Evaluation: 
    • Turn level: system action accuracy
    • Dialogue level: task completion rate 

Language Generation

System action: inform(rest=Prezzo, cuisine=italian)

System response: Prezzo is a nice restaurant which serves italian.

  • Template based:
    • Map stores keys as system actions and values as natural language patterns.
    • Replace the slots with the retrieved values
  • Recurrent Neural Network (RNN) based: 
    • Use an RNN language model to generate the response conditioned on the system action
  • Evaluation:
    • Subjective: Use human ratings on correctness, grammar, coherence, etc
    • Automatic: BLEU, ROUGE (word overlap based)
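A minimal sketch of the template-based approach (the template map and slot names are hypothetical):

# hypothetical system actions and templates, for illustration only
templates = {
    "inform(rest, cuisine)": "<rest> is a nice restaurant which serves <cuisine>.",
    "request(area)": "Which part of town do you have in mind?",
}

def realise(action, slots):
    # Template-based generation: look up the pattern for the system action
    # and replace the slot placeholders with the retrieved values.
    response = templates[action]
    for slot, value in slots.items():
        response = response.replace(f"<{slot}>", value)
    return response

print(realise("inform(rest, cuisine)", {"rest": "Prezzo", "cuisine": "italian"}))
# -> Prezzo is a nice restaurant which serves italian.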

Future Plans

  • Synopsis in March
  • Thesis submission before April
