Yugal Sharma
https://yugrocks.github.io/
Linkedin: linkedin.com/in/yugal-sharma-62855713a/
As we construct larger neural networks and train them with more and more data, their performance continues to increase. This is in contrast to other Machine Learning techniques, whose performance reaches a plateau.
Slide by Andrew Ng
This picture summarizes the main difference between ML and Deep Learning
Each neuron in a layer:
Computation inside a single unit (neuron)
A typical neural network, as an arrangement of units in different layers
This is done using the Gradient Descent algorithm with back-propagation.
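A minimal sketch of one such training loop for a tiny two-layer network with a mean-squared-error loss (illustrative only, not the exact setup from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 examples, 3 input features, 1 target value each
X = np.random.randn(4, 3)
y = np.random.randn(4, 1)

# Randomly initialised weights for a 3 -> 5 -> 1 network
W1, b1 = np.random.randn(3, 5), np.zeros(5)
W2, b2 = np.random.randn(5, 1), np.zeros(1)
lr = 0.1  # learning rate

for step in range(100):
    # Forward pass
    h = sigmoid(X @ W1 + b1)          # hidden activations
    y_hat = h @ W2 + b2               # predictions
    loss = np.mean((y_hat - y) ** 2)  # mean squared error

    # Backward pass (chain rule, i.e. back-propagation)
    d_yhat = 2 * (y_hat - y) / len(X)
    dW2 = h.T @ d_yhat
    db2 = d_yhat.sum(axis=0)
    d_h = d_yhat @ W2.T * h * (1 - h)  # sigmoid derivative
    dW1 = X.T @ d_h
    db1 = d_h.sum(axis=0)

    # Gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```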
Convolution operation by a single filter
Convolutional Neural Networks are widely used in computer vision and NLP
Fig1. Using Conv Nets for image classification
Fig2. Using Conv Nets for text classification
Recurrent Neural Networks
Enhancements in the vanilla RNN architecture
LSTM Architecture
Inside a single LSTM cell
The same cell takes a different input at each time step, carries out its computation, produces an output, and then uses the newly computed hidden state at the next time step
Gated Recurrent Unit (GRU)
GRU can be considered a variation on the LSTM, as both are designed similarly and, in some cases, produce equally good results.
Natural Language Processing is a field that aims to make computers understand and manipulate human language, so that we can interact with them more easily.
Examples of some successful NLP systems
Conversational interfaces like Alexa, Siri, Cortana, and Google Assistant leverage the power of NLP to interact with their users and fulfil their requests
Components of NLP
Natural Language Understanding (NLU)
Mapping the given input in natural language into useful representations
Natural Language Generation (NLG)
Producing meaningful phrases and sentences in natural language from some internal representation
How Deep Learning simplified NLP
Thanks to Deep Learning, we have moved on from rule-based and statistical models, which required a lot of preprocessing and modeling, to fully end-to-end and more accurate methods.
The process of classifying words into their parts-of-speech and labeling them accordingly is known as Part-of-Speech tagging, POS tagging, or simply tagging.
Some examples of POS tagging
My name is Tony Stark and I am Iron Man
My → Pronoun (PRP), name → Noun (NN), is → Verb (VBZ), Tony → Noun (NN), Stark → Noun (NN), and → Conjunction (CC), I → Pronoun, am → Verb, Iron → Noun, Man → Noun
Mary is reading a book
Mary → Noun (NN), is → Verb (VBZ), reading → Verb (VBG), a → Determiner (DT), book → Noun (NN)
Book a flight for me
Book → Verb (VB), a → Determiner (DT), flight → Noun (NN), for → Preposition (IN), me → Pronoun (PRP)
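As a quick illustration, similar tags can be obtained with an off-the-shelf tagger such as NLTK's (assuming the required NLTK data packages have been downloaded):

```python
import nltk

# One-time downloads of the tokenizer and tagger models
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("Mary is reading a book")
print(nltk.pos_tag(tokens))
# Expected output along the lines of:
# [('Mary', 'NNP'), ('is', 'VBZ'), ('reading', 'VBG'), ('a', 'DT'), ('book', 'NN')]
```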
There are many approaches that have been successful, such as the ones below; we will explore the last two in this talk.
Using Bidirectional LSTM
[Figure: a bidirectional LSTM tagger for the sentence "the dog jumped". Word representations (embeddings or one-hot) feed forward and backward LSTM cells at each position, and the resulting contextual representations are passed to a fully connected layer.]
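A minimal Keras sketch of such a bidirectional-LSTM tagger (the vocabulary size, tagset size, and sequence length are illustrative assumptions, not values from the talk):

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, EMBED_DIM, NUM_TAGS, MAX_LEN = 10000, 100, 45, 50  # illustrative sizes

model = keras.Sequential([
    keras.Input(shape=(MAX_LEN,)),                      # padded sequences of word ids
    layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True),  # word representations
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),  # contextual representation
    layers.TimeDistributed(layers.Dense(NUM_TAGS, activation="softmax")),  # tag distribution per word
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```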
Using character level features with the words
In order to make the model consider spelling-based features, we can embed character-level features as well, by concatenating the character representation vector with the input word vectors
From the paper "Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Recurrent Neural Network" by Peilu Wang, Yao Qian, Frank K. Soong, Lei He, and Hai Zhao
Using character representations along with word representations
How to compute Character level features?
(The diagrammatic representation is in the next slide)
This is how we compute character level features for a word
Using an ensemble: Bidirectional LSTM + CRF
In the output layer, a linear chain CRF can be used instead of a fully connected softmax layer.
Some results with the deep learning based methods
Huang, Zhiheng et al. “Bidirectional LSTM-CRF Models for Sequence Tagging.” CoRR abs/1508.01991 (2015): n. pag.
Some examples of NER
Mr Wayne is the owner of the Wayne Enterprises
Wayne → PERSON, Wayne Enterprises → ORGANIZATION
I have a meeting with Mr Bill Gates at California
Bill Gates → PERSON, California → LOCATION
I was born in August 1996
August 1996 → DATE
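For comparison, a pre-trained off-the-shelf NER model, here spaCy's small English model (assuming it is installed), produces similar entity labels; note spaCy uses GPE rather than LOCATION for places like California:

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("I have a meeting with Mr Bill Gates at California")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Bill Gates', 'PERSON'), ('California', 'GPE')]
```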
The Clinical Named Entity System (CliNER)
What it does
CliNER architecture
IOB examples
Mr/O Wayne/B-PERSON is/O the/O owner/O of/O the/O Wayne/B-ORG Enterprises/I-ORG
I/O have/O a/O meeting/O with/O Mr/O Bill/B-PER Gates/I-PER at/O California/B-LOCATION
How to do NER?
NER involves different tags which are in the IOB format
This problem can also be broken down into detection of the IOB tags, followed by classification of the detected tags into the corresponding categories (like I-PER, B-LOC).
With traditional Machine Learning approaches, the POS tags of all the words in the sentence can first be determined and passed as input features to the NER classifier.
However, while using Deep Learning for NER, this is not required. (The next slide)
(IO encoding)
The goal of Language Modeling is to build a language model that can estimate the distribution of natural language as accurately as possible.
A language model computes a probability for a sequence of words: P(w1, w2, ..., wT).
Traditional language models
Compute a probability distribution over n-grams (every sequence of n words) from a large corpus.
There is an incorrect but necessary Markov assumption involved: the probability of a word w_i, given the previous series of words w_1, w_2, ..., w_(i-1), is taken to be approximately equal to the probability of w_i given only the n most recent words.
Bigrams
Trigrams
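Under this assumption, the bigram and trigram probabilities are estimated from corpus counts (standard maximum-likelihood estimates, shown here for reference):

```latex
P(w_i \mid w_{i-1}) \approx \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}
\qquad
P(w_i \mid w_{i-2}, w_{i-1}) \approx \frac{\mathrm{count}(w_{i-2}, w_{i-1}, w_i)}{\mathrm{count}(w_{i-2}, w_{i-1})}
```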
Limitations of the above technique
Language Modeling using Recurrent Neural Networks
Given a list of word vectors: x1, x2, ..., xT
At a single time step, the recurrent cell updates its hidden state h_t from the current word vector x_t and the previous hidden state h_(t-1), and a softmax layer maps h_t to an output distribution: ŷ_t = softmax(W · h_t), where W maps the hidden state to vocabulary scores.
Here, ŷ_t is the probability distribution over the vocabulary of V words in total.
[Figure: the network unrolled over time. Word representations (embeddings or one-hot) feed LSTM cells, each followed by a softmax over the vocabulary.]
The cross-entropy loss function is used, but we are predicting words instead of classes.
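Written out (standard formulation, shown here for reference), the loss at time step t compares the predicted distribution with the one-hot vector y_t of the actual next word:

```latex
J^{(t)}(\theta) = -\sum_{j=1}^{V} y_{t,j} \, \log \hat{y}_{t,j},
\qquad
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)
```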
In short...
Word2Vec
Word2vec is a particularly computationally-efficient predictive model for learning word embeddings from raw text.
It comes in two flavors, the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model.
Word2vec can find relations between words. The vectors preserve syntactic as well as semantic information, which explains why they are also useful as features for many canonical NLP prediction tasks, such as part-of-speech tagging or named entity recognition.
Visualizing 300-dimensional word2vec embeddings by reducing their dimensionality to 2 dimensions using t-SNE or PCA
How does it work?
.....and a lot more.
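A minimal sketch of training word2vec embeddings with gensim (assuming the gensim 4.x API; the corpus and hyperparameters here are purely illustrative):

```python
from gensim.models import Word2Vec

# A tiny illustrative corpus: a list of tokenized sentences
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "jumped", "over", "the", "sofa"],
]

# sg=1 selects the Skip-Gram model; sg=0 would select CBOW
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

print(model.wv["king"].shape)                 # (100,) dense vector for "king"
print(model.wv.most_similar("king", topn=3))  # nearest neighbours in embedding space
```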
What is it?
Topic extraction and tag assignment (e.g. by Quora) use LSA or mainstream information extraction techniques.
Intent classification
How is it done?
The general pipeline for text classification
Text preprocessing
Convert to lower case, remove punctuation, and tokenize
Tokenized: despite | a | somewhat | too | tidy | ending | it | s | a | terrific | movie | beautifully | made
After stopword removal: despite | somewhat | tidy | ending | terrific | movie | beautifully | made
After stemming: despit | somewhat | tidy | end | terrif | movi | beautiful | made
After lemmatization: despit | somewhat | tidy | end | terrif | movi | beautiful | make
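A sketch of this preprocessing pipeline with NLTK (assuming the relevant NLTK data packages have been downloaded; the exact outputs may differ slightly from the slide):

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

text = "Despite a somewhat too tidy ending, it's a terrific movie, beautifully made."

# Lower-case, strip punctuation, tokenize
cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
tokens = nltk.word_tokenize(cleaned)

# Stopword removal
tokens = [t for t in tokens if t not in stopwords.words("english")]

# Stemming and lemmatization
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
stems = [stemmer.stem(t) for t in tokens]
lemmas = [lemmatizer.lemmatize(t, pos="v") for t in tokens]
print(tokens, stems, lemmas, sep="\n")
```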
Converting Preprocessed Text to Features
Count Vectorization
Doc1: hello how are you
Doc2: i am good how are you
Doc3: are you coming to college
| Vocabulary | hello | coming | to | how | you | are | good | college | i | am |
|---|---|---|---|---|---|---|---|---|---|---|
| Vector for Doc1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| Vector for Doc2 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| Vector for Doc3 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
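The same count vectors can be produced with scikit-learn's CountVectorizer (note that its vocabulary is ordered alphabetically, so the columns will not match the slide's order):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "hello how are you",
    "i am good how are you",
    "are you coming to college",
]

# token_pattern tweaked so that single-letter tokens like "i" are kept
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # one count vector per document
```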
Need for a new metric
Doc1: Tony Stark is a genius.
Doc2: Tony Stark is not a genius.
TF-IDF vectorization
(term frequency–inverse document frequency)
TF: Term Frequency: Measures how frequently a term occurs in a document.
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
IDF: Inverse Document Frequency: Measures how important a term is.
IDF(t) = log(Total number of documents / Number of documents with term t in it)
Finally: TF-IDF(t) = TF(t) * IDF(t)
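A quick sketch with scikit-learn's TfidfVectorizer (note that scikit-learn uses a smoothed IDF variant, so the exact values differ slightly from the formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Tony Stark is a genius",
    "Tony Stark is not a genius",
]

vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(docs)

# Words shared by both documents get a low IDF weight, while the
# discriminative word "not" gets a comparatively high weight in Doc2.
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```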
What after feature extraction?
Limitations of the traditional approaches
The movie was good; was not bad. (+ve)
vs
The movie was not good; was bad. (-ve)
The cat walks across the table.
or
The dog jumps across the sofa.
Using RNNs for text classification
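A minimal Keras sketch of an LSTM-based text classifier (the vocabulary size, sequence length, and binary sentiment output are illustrative assumptions, not the exact model from the talk):

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 20000, 128, 200  # illustrative sizes

model = keras.Sequential([
    keras.Input(shape=(MAX_LEN,)),            # padded sequences of word ids
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),  # learned word embeddings
    layers.LSTM(128),                         # final hidden state summarizes the text
    layers.Dense(1, activation="sigmoid"),    # e.g. positive vs negative sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, validation_split=0.1, epochs=3)  # with padded integer sequences
```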
Using Convolutional Neural Networks with word embeddings
From the research paper titled "UNITN: Training Deep Convolutional Neural Network for Twitter Sentiment Classification" (Severyn and Moschitti, 2015; Felbo et al., 2017)
How do they work?
Natural Language Generation:
With and without Deep Learning
Retrieval type chatbots
Generative type Chatbots
Gives suggestions for appropriate short responses to the received message
How Smart Reply works
Taken from a research paper by Google
The Dual Encoder LSTM
Used to learn a similarity function between a query and a response. The candidate responses are predefined.
Working of Dual encoder architecture
Both the embedded context and response are fed into the same Recurrent Neural Network word-by-word. The RNN generates a vector representation that, loosely speaking, captures the "meaning" of the context and response (c and r in the picture).
We multiply c with a matrix M to "predict" a response r'. If c is a 256-dimensional vector, then M is a 256×256 dimensional matrix, and the result is another 256-dimensional vector, which we can interpret as a generated response. The matrix M is learned during training.
We measure the similarity of the predicted response r' and the actual response r by taking the dot product of these two vectors. A large dot product means the vectors are similar and that the response should receive a high score.
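In code, the scoring step reduces to a bilinear product followed by a sigmoid; here is a toy numpy sketch with random vectors standing in for the real RNN encodings:

```python
import numpy as np

dim = 256
rng = np.random.default_rng(0)

c = rng.standard_normal(dim)         # RNN encoding of the context
r = rng.standard_normal(dim)         # RNN encoding of a candidate response
M = rng.standard_normal((dim, dim))  # learned during training

r_pred = M @ c                            # "predicted" response vector r'
score = 1 / (1 + np.exp(-(r_pred @ r)))   # sigmoid of the dot product, a probability-like score
print(score)
```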
An end-to-end technique for generating responses from scratch, given a query
How it works
Seq2Seq with Attention mechanism
Attention mechanism
Seq2seq model with attention
How the context is calculated
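In the standard formulation (shown here for reference, not copied from the slides), the context vector at decoder step t is a weighted sum of the encoder hidden states, with weights obtained from a softmax over alignment scores:

```latex
e_{t,s} = \mathrm{score}\left(h^{\mathrm{dec}}_{t}, \, h^{\mathrm{enc}}_{s}\right),
\qquad
\alpha_{t,s} = \frac{\exp(e_{t,s})}{\sum_{s'} \exp(e_{t,s'})},
\qquad
c_t = \sum_{s} \alpha_{t,s} \, h^{\mathrm{enc}}_{s}
```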
Drawbacks of using seq2seq for text generation
The goal is to convert a given piece of text from one language to another while preserving the syntax and meaning.
Lexical ambiguity:
book used as verb: book a flight => reservar (spanish)
book used as noun: read the book => libro (spanish)
Different word orders:
English word order is: subject-verb-object
Japanese word order is: subject-object-verb
Syntactic ambiguity can cause problems:
John hit the dog with a stick. - could have two meanings (he used a stick to hit the dog, or he hit the dog that had a stick)
Pronoun resolution:
The computer outputs the data; it is fast.
The computer outputs the data; it is stored in ascii.
In the first sentence, 'it' refers to the computer; in the second, it refers to the data.
Classical approaches to MT
We have a source language: f , eg. French
And we have a destination language: e, eg. English
We use a probabilistic (noisy channel) approach: the best translation ê = argmax over e of p(e|f) = argmax over e of p(f|e) * p(e).
p(f|e) is the translation model, which is trained on the parallel corpus.
And p(e) is the language model of the language 'e' (here English).
Beam search is used to find the best sentence among the possible choices.
Translation using Seq2Seq model with attention
This approach is similar to the one used to generate a target sentence given a source sentence; the only difference is that here the target sentence is in a different language rather than the same one.
Visualizing Attention
Creating a short, accurate, and fluent summary of a longer text document.
Automatic text summarization methods are greatly needed to address the ever-growing amount of text data available online, both to help discover relevant information and to consume it faster.
Using Seq2Seq model for text summarization
Visualizing Attention weights
Results with the deep learning approach
Other widely used approaches for text summarization
Extractive:
Abstractive:
A trainable method by Kupiec et al. (1995)
For every sentence, the method decides whether or not it should be included in the summary.
The features:
A Naive Bayes classifier is used, which assumes statistical independence of the features.
It is an extractive method.
Gong and Liu (2001)
[Figure: architecture diagrams. Embeddings feed two stacked LSTM layers, followed by a Linear layer and a softmax at each time step; a second variant uses Embeddings, two LSTM layers, a Linear layer, and a single softmax; a final LSTM produces classification scores.]
Zini's AI server (runs the code for the intelligence). Programming language: Python
Zini's Android server (handles user authentication and communication with the AI server). Programming language: ASP.NET
User's Android/iOS phone running the ZINI app
ZINI Server Communication architecture
ZINI high level architecture and working
[Figure: ZINI high-level architecture. The user's app sends a message to the Android server, which forwards it to the AI server and returns the response. Inside the AI server: Intent Recognition, Dialog Manager, Symptom assessment module, Specialist recommendation module, General Chat module, General Medical Advice & Info (GMAI) module, and an Additional conversation module (handles specific chat types, e.g. emotional responses), backed by a PostgreSQL database.]
mystring = "intuitive"

| character | i | n | t | u | i | t | i | v | e |
|---|---|---|---|---|---|---|---|---|---|
| index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| negative index | -9 | -8 | -7 | -6 | -5 | -4 | -3 | -2 | -1 |
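The same indexing in code:

```python
mystring = "intuitive"

print(mystring[0])    # 'i'  (first character, index 0)
print(mystring[8])    # 'e'  (last character, index 8)
print(mystring[-1])   # 'e'  (negative indices count from the end)
print(mystring[-9])   # 'i'  (-9 is the first character of this 9-character string)
print(mystring[2:5])  # 'tui' (slicing: indices 2, 3 and 4)
```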
[Flowchart: a and b are two real numbers. The test "is a > b?" leads to "a is greater than b" on Yes and "a is not greater than b" on No. The extended version checks a > b ("a is greater than b"), a == b ("a is equal to b"), and a < b ("a is less than b").]
[Figure: the 26 letters of the alphabet, A through Z.]
Self information = -log2(1/26) when each of the 26 letters is equally likely with probability p = 1/26; in general, self information = -log2(p).
Averaged over all outcomes: H = - Σ (i = 1 to n) p_i * log2(p_i), where n = No. of possible outcomes.
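A quick numeric check in Python (assuming 26 equally likely letters):

```python
import math

# Self-information of one letter out of 26 equally likely letters
p = 1 / 26
print(-math.log2(p))  # ~4.70 bits
```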
During insertion: for the pair { key: 'carryminati', value: '20m' }, the hash function maps the key to index 7, and the value '20m' is stored in slot 7 of a table with slots 0 to 9.
During lookup: the key 'carryminati' is passed through the same hash function, which again gives index 7, and the value '20m' is read from slot 7.
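In Python this is exactly what a dict does; below is a toy version of the mechanics (the slot index 7 from the slide is illustrative, since Python's built-in hash is salted per process):

```python
# Built-in dict: the hash function and table are handled for you
subscribers = {"carryminati": "20m"}
print(subscribers["carryminati"])  # '20m'

# A toy fixed-size hash table to show the idea
table = [None] * 10

def slot(key):
    # Map a key to one of the 10 slots (ignoring collisions for simplicity)
    return hash(key) % 10

table[slot("carryminati")] = "20m"  # during insertion
print(table[slot("carryminati")])   # during lookup -> '20m'
```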
| Message | Response |
|---|---|
| How are you? | I am fine, Thank you! |
| How to impress girls? | Be rich :P |
While training: each stored message is passed through the Universal Sentence Encoder, producing Vector1 (for "How are you?") and Vector2 (for "How to impress girls?").
During chatting: the incoming message "How you doin?" is encoded by the Universal Sentence Encoder into Vector3, and cosine similarity is computed against the stored vectors: similarity 1 = 0.9 and similarity 2 = 0.3. Since 0.9 > 0.3, the most similar question found is "How are you?".
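A sketch of this retrieval step with the Universal Sentence Encoder from TensorFlow Hub (assuming the tensorflow_hub package and the public USE model; the similarity numbers on the slide are illustrative and will differ in practice):

```python
import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

messages = ["How are you?", "How to impress girls?"]
responses = ["I am fine, Thank you!", "Be rich :P"]
stored = embed(messages).numpy()  # encode the stored messages once

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

query_vec = embed(["How you doin?"]).numpy()[0]
sims = [cosine(query_vec, v) for v in stored]  # one similarity per stored message
best = int(np.argmax(sims))
print("Most similar question:", messages[best])
print("Reply:", responses[best])
```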