Deep Learning and its applications in NLP
Yugal Sharma
https://yugrocks.github.io/
LinkedIn: linkedin.com/in/yugal-sharma-62855713a/
Outline
- What is Deep Learning (Overview)
- Introduction to Natural Language Processing (NLP)
- Basic Components: NLU and NLG
- POS tagging
- Information Extraction
- Language Modeling
- Word Embeddings : Word2Vec
- Some applications of NLP - (And how Deep Learning has invaded NLP)
- Text Classification
- Conversational Interfaces (NLU + NLG)
- Machine Translation
- Text summarization
What is Deep Learning
- It is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain called artificial neural networks.
- The word "Deep" describes the development of large artificial neural networks.
Why Deep Learning
As we construct larger neural networks and train them with more and more data, their performance continues to increase. This is in contrast to other machine learning techniques, whose performance reaches a plateau.
Slide by Andrew Ng
This picture summarizes the main difference between ML and Deep Learning
What is an Artificial Neural Network
- Neural Networks are powerful learning algorithms that are loosely inspired by the brain.
- They are capable of learning complex, hidden representations of the input data.
- Simply put, they take inputs, perform a complex computation, and output the required prediction for the input data.
- A typical neural network has anything from a few dozen to hundreds, thousands, or even millions of artificial neurons also called units arranged in a series of layers, each of which connects to the layers on either side.
Each neuron in a layer:
- Takes the inputs from previous layer.
- Calculates the weighted summation of the input.
- Applies some function (like sigmoid) to the weighted sum.
- Passes the calculated value on to the next layer of neurons
Computation inside a single unit (neuron)
A typical neural network, as an arrangement of units in different layers
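As a rough illustration of the computation inside a single unit, here is a minimal sketch in Python, assuming a sigmoid activation; the inputs, weights and bias are made-up numbers.

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# made-up inputs from the previous layer and this neuron's weights/bias
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.6])
b = 0.2

z = np.dot(w, x) + b      # weighted sum of the inputs
a = sigmoid(z)            # activation passed on to the next layer
print(a)
```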
How do Neural Networks learn
This is done using the gradient descent algorithm together with back-propagation, which computes how much each weight contributed to the error.
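As a rough sketch (not the full back-propagation derivation for a multi-layer network), here is gradient descent fitting a single linear neuron with a squared-error loss on made-up data; the gradient computed via the chain rule is the quantity back-propagation would supply for each weight.

```python
import numpy as np

# toy data: learn y = 2x from a handful of points
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([2.0, 4.0, 6.0, 8.0])

w, lr = 0.0, 0.01          # initial weight and learning rate
for step in range(200):
    y_pred = w * X                         # forward pass
    loss = np.mean((y_pred - Y) ** 2)      # mean squared error
    grad = np.mean(2 * (y_pred - Y) * X)   # dLoss/dw via the chain rule
    w -= lr * grad                         # gradient descent update
print(w)   # approaches 2.0
```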
Convolutional Neural Network (CNN)
- A CNN consists of an input and an output layer, as well as multiple hidden layers. The hidden layers of a CNN typically consist of convolutional layers, pooling layers, fully connected layers and normalization layers.
- Convolutional layers apply a convolution operation to the input, passing the result to the next layer. The convolution emulates the response of an individual neuron to visual stimuli.
Convolution operation by a single filter
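A minimal sketch of the convolution operation performed by a single filter sliding over a 2-D input (valid padding, stride 1); the image and filter values are made up.

```python
import numpy as np

def convolve2d(image, kernel):
    # valid convolution (really cross-correlation, as in most DL libraries)
    h, w = kernel.shape
    H, W = image.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
edge_filter = np.array([[1.0, -1.0],
                        [1.0, -1.0]])              # toy 2x2 filter
print(convolve2d(image, edge_filter))              # 3x3 feature map
```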
Convolutional Neural Networks are widely used in computer vision and NLP.
Fig1. Using Conv Nets for image classification
Fig2. Using Conv Nets for text classification
Recurrent Neural Networks
- Along with the present input, the hidden layer activations from the previous time steps are also fed into the neural network.
- This enables it to consider all previously given inputs, along with the current input, while making a decision.
- This is why they are used extensively in NLP, where sequence classification is required.
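A minimal sketch of one step of a vanilla RNN, showing how the previous hidden state is fed back in along with the current input; the sizes, weights and inputs are illustrative stand-ins.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # the new hidden state mixes the current input with the previous hidden state
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

input_dim, hidden_dim = 4, 3
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden_dim, input_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                       # initial hidden state
for x_t in rng.normal(size=(5, input_dim)):    # a toy sequence of 5 inputs
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)      # h carries past information forward
print(h)
```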
Enhancements in the vanilla RNN architecture
- The vanilla RNN suffers from the problem of vanishing gradients while training, which doesn't let it learn long term dependencies on previous inputs.
- There are two modified versions of the simple RNN which suffer much less from this problem and are capable of remembering longer sequences:
- Long Short Term Memory network (LSTM)
- Gated Recurrent Unit (GRU)
LSTM Architecture
Inside a single LSTM cell
The same cell takes a different input at each time step, carries out its computation, produces an output, and then uses the newly computed hidden state for the next time step.
Gated Recurrent Unit (GRU)
GRU can be considered as a variation on the LSTM, as both are designed similarly and, in some cases, produce equally excellent results.
Natural Language Processing
Natural Language Processing is a field that aims to make computers understand and manipulate human language, so we can interact with it more easily.
Examples of some successful NLP systems
Conversational interfaces like Alexa, Siri, Cortana and Google Assistant leverage the power of NLP to interact with their users and fulfil their requests.
Components of NLP
Natural Language Understanding (NLU)
Mapping the given input in natural language into useful representations
Natural Language Generation (NLG)
Producing meaningful phrases and sentences in the form of natural language from some internal representation
Why is NLU difficult?
- Human languages are complex.
- We use language creatively, and there are a lot of ambiguities.
- Language understanding needs contextual and general knowledge apart from linguistic knowledge.
How Deep Learning simplified NLP
We have moved on from rule-based and statistical models, which required a lot of preprocessing and modeling effort, to fully end-to-end and more accurate methods, thanks to Deep Learning.
POS tagging
The process of classifying words into their parts-of-speech and labeling them accordingly is known as Part-of-Speech tagging, POS tagging, or simply tagging.
Some examples of POS tagging:
- "My name is Tony Stark and I am Iron Man" → My/Pronoun (PRP), name/Noun (NN), is/Verb (VBZ), Tony/Noun (NN), Stark/Noun (NN), and/Conjunction (CC), I/Pronoun, am/Verb, Iron/Noun, Man/Noun
- "Mary is reading a book" → Mary/Noun (NN), is/Verb (VBZ), reading/Verb (VBG), a/Determiner (DT), book/Noun (NN)
- "Book a flight for me" → Book/Verb (VB), a/Determiner (DT), flight/Noun (NN), for/Preposition (IN), me/Pronoun (PRP)
HOW?
- Rule Based Methods
- Transformation based
- Traditional Machine Learning methods
- Using end-to-end Deep Learning
We will explore the last two methods in this talk.
POS Tagging using Machine Learning
- Treat it just like any other Supervised Learning classification problem.
- Take a tagged corpus, extract features, choose a classifier to train and evaluate.
- In practice, CRF (Conditional Random Fields), decision tree, and Naive Bayes classifiers have shown good results for this task.
Sources of Information (Features)
- Knowledge about individual words
- Lexical info
- Spelling (e.g. words ending in '-ing' are likely to be gerund verbs)
- Capitalization (e.g. IBM)
- Knowledge about neighbouring words
- the n words before and the n words after it
- POS tags of the previous n words (a small feature-extraction sketch follows this list)
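A hedged sketch of how such features could be extracted for one word of a tokenized sentence, for use with a traditional classifier; the feature names are made up for illustration.

```python
def word_features(sentence, i):
    """Features for the i-th word of a tokenized sentence (illustrative set)."""
    word = sentence[i]
    return {
        "word.lower": word.lower(),
        "suffix3": word[-3:],                      # e.g. 'ing' hints at a gerund verb
        "is_capitalized": word[0].isupper(),
        "is_all_caps": word.isupper(),             # e.g. IBM
        "prev_word": sentence[i - 1].lower() if i > 0 else "<START>",
        "next_word": sentence[i + 1].lower() if i < len(sentence) - 1 else "<END>",
    }

print(word_features("Mary is reading a book".split(), 2))
```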
HMM Tagging
The Deep Learning based Approaches
There are a lot of approaches that have been successful, such as:
- Sequence labeling using Bidirectional LSTM
- Ensemble of Bidirectional LSTM and CRF
- Vanilla Neural Network with word embeddings
Using Bidirectional LSTM
- While using a Bidirectional LSTM, we do not need to explicitly consider a window of the previous and next n words.
- A Bidirectional LSTM is a sequence model that considers the input series in a two-way manner: first it sweeps forward over the input, then it encodes the same input in the backward direction.
- The input: a series of words converted to their one-hot encodings or word embeddings (discussed later). The sentence length is fixed to a maximum value (say 50); sentences shorter than 50 words are padded with zero vectors.
- Output: at each time step, a vector of length k is output, containing the probabilities of the word belonging to each of the k classes. (A minimal model sketch follows below.)
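A minimal sketch of such a tagger, assuming TensorFlow/Keras is available; the vocabulary size, tag count and other hyper-parameters are illustrative, and the random batch below stands in for a real tagged corpus.

```python
import numpy as np
from tensorflow.keras import layers, models

MAX_LEN, VOCAB, N_TAGS, EMB = 50, 10000, 45, 100

model = models.Sequential([
    layers.Embedding(VOCAB, EMB, mask_zero=True),              # 0 is the padding id
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(N_TAGS, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# toy batch: padded word-id sequences and one tag id per time step
X = np.random.randint(1, VOCAB, size=(32, MAX_LEN))
y = np.random.randint(0, N_TAGS, size=(32, MAX_LEN))
model.fit(X, y, epochs=1, verbose=0)
```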
Fig. A bidirectional LSTM unrolled over the words "the dog jumped": each word representation (embeddings or one-hot) feeds a forward and a backward LSTM cell, and the resulting contextual representations go to a fully connected layer.
Using character level features with the words
In order to make the model consider spelling-based features like:
- Case of the whole word
- Case of the first letter of the word
- Suffixes (-ed, -ing), etc.
we can embed character-level features as well, by concatenating a character-representation vector with the input word vectors.
From the paper: Peilu Wang, Yao Qian, Frank K. Soong, Lei He, Hai Zhao, "Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Recurrent Neural Network".
Using character representations along with word representations
How to compute Character level features?
- One option is to use hand-crafted features, for example a feature that is 1 if the word starts with a capital letter and 0 otherwise.
- Another, fancier option is to use some kind of neural network to do this extraction automatically for us, say a bi-LSTM at the character level.
- Each character c_i of a word w = [c_1, …, c_p] is converted to a vector. We run a bi-LSTM over the sequence of character embeddings and concatenate the final states to obtain a fixed-size vector w_chars ∈ R^(d2). Intuitively, this vector captures the morphology of the word.
(The diagrammatic representation is on the next slide)
This is how we compute character level features for a word
Using an ensemble: Bidirectional LSTM + CRF
In the output layer, a linear chain CRF can be used instead of a fully connected softmax layer.
Some results with the deep learning based methods
Huang, Zhiheng et al. "Bidirectional LSTM-CRF Models for Sequence Tagging." arXiv:1508.01991 (2015).
Information Extraction
- Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents.
- Information Extraction has many applications, including business intelligence, resume harvesting, media analysis, sentiment detection, patent search, and email scanning.
- Tasks like named entity recognition, relation extraction, and co-reference resolution are examples of information extraction from unstructured text.
Named-Entity Recognition
- The goal of a named entity recognition (NER) system is to identify all textual mentions of the named entities.
- Named entities are definite noun phrases that refer to specific types of individuals, such as organizations, persons, dates, and so on.
- The same technique can be used in the medical domain to identify diseases, symptoms, medications, tests, etc. from text.
Some examples of NER:
- "Mr Wayne is the owner of the Wayne Enterprises" → Mr Wayne = PERSON, Wayne Enterprises = ORGANIZATION
- "I have a meeting with Mr Bill Gates at California" → Bill Gates = PERSON, California = LOCATION
- "I was born in August 1996" → August 1996 = DATE
The Clinical Named Entity System (CliNER)
- Clinical Named Entity Recognition system (CliNER) is an open-source natural language processing system for named entity recognition in clinical text of electronic health records. CliNER is designed to follow best practices in clinical concept extraction. (http://text-machine.cs.uml.edu/cliner/index.html)
What it does
CliNER architecture
IOB format
- The IOB format (short for inside, outside, beginning) is a common tagging format for tagging tokens in a chunking task in computational linguistics (e.g. NER).
- The B- prefix before a tag indicates that the tag is the beginning of a chunk, and an I- prefix before a tag indicates that the tag is inside a chunk.
- The B- tag is used only when a tag is followed by a tag of the same type without O tokens between them.
- An O tag indicates that a token belongs to no chunk.
IOB examples:
- "Mr Wayne is the owner of the Wayne Enterprises" → Mr/O Wayne/B-PERSON is/O the/O owner/O of/O the/O Wayne/B-ORG Enterprises/I-ORG
- "I have a meeting with Mr Bill Gates at California" → I/O have/O a/O meeting/O with/O Mr/O Bill/B-PER Gates/I-PER at/O California/B-LOCATION
How to do NER?
- The techniques for carrying out named entity recognition are very similar to those for POS tagging, since both are sequence tagging problems. So all the methods that can be used for POS tagging can also be used for NER.
- Some minor changes:
  - NER involves different tags, which are in the IOB format.
  - The problem can also be broken down into detection of the IOB tags, followed by classification of the tags into the corresponding categories (like I-PER, B-LOC).
  - While using traditional machine learning approaches, the POS tags of all the words of the sentence can first be determined and passed as input features to the NER classifier.
  - While using Deep Learning for NER, however, this is not required (see the next slide).
End-to-End NER with LSTM
(IO encoding)
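This is not the LSTM approach from the slides, but for a quick NER baseline one can use a pretrained pipeline such as spaCy (assuming the en_core_web_sm model has been downloaded):

```python
import spacy

# assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("I have a meeting with Mr Bill Gates at California")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Bill Gates PERSON, California GPE
```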
Language Modeling
The goal of language modeling is to build a language model that can estimate the distribution of natural language as accurately as possible.
What is a language model?
A language model computes a probability for a sequence of words: P(w_1, w_2, …, w_T).
Language models are used in
- Machine Translation
  - Word ordering: p(the puppy is cute) > p(cute is the puppy)
  - Word choice: p(i am going home) > p(i am going house)
- Speech recognition
- Handwriting recognition
- Parsing etc.
Traditional language models
Compute a probability distribution over n-grams (every sequence of n words) from a large corpus.
There is an incorrect but necessary Markov assumption involved: the probability of a word w_i given the previous series of words w_1, w_2, …, w_(i-1) is approximated by the probability of w_i given only the previous n words.
- So, to estimate probabilities, we generally count unigrams (only the current word), bigrams (conditioning on the previous word) or trigrams (conditioning on the previous two words).
- This is done to reduce the intensive computation involved with higher order word sequences.
Bigram estimate: P(w_i | w_(i-1)) = count(w_(i-1), w_i) / count(w_(i-1))
Trigram estimate: P(w_i | w_(i-2), w_(i-1)) = count(w_(i-2), w_(i-1), w_i) / count(w_(i-2), w_(i-1))
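A minimal sketch of estimating bigram probabilities by counting on a toy corpus (no smoothing):

```python
from collections import Counter

corpus = [
    "the puppy is cute".split(),
    "the puppy is small".split(),
    "the cat is cute".split(),
]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent            # sentence-start marker
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_bigram(w, prev):
    # P(w | prev) = count(prev, w) / count(prev)
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("puppy", "the"))   # 2/3
print(p_bigram("cute", "is"))     # 2/3
```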
Limitations to the above technique
- Gigantic RAM requirements, which make using more than 3- or 4-grams infeasible.
- This makes them impractical to store on devices with little RAM, like phones.
- There can be zero occurrences of a given sequence in the training corpus, which drives the multiplied probabilities to absolute zero. (Smoothing is used to deal with this.)
Language Modeling using Recurrent Neural Networks
- LSTMs are good at memorizing long sequences, and so there is no need to compute the n-grams explicitly.
- The RAM requirement only scales with the number of total words.
- Using word embeddings can make the model more accurate.
Given a list of word vectors x_1, x_2, …, x_T: at a single time step, the LSTM takes the word representation (embedding or one-hot) x_t and its previous hidden state as input, and a softmax layer over the new hidden state produces ŷ_t.
Fig. An LSTM unrolled over the input words, with a softmax output at each time step.
Here, ŷ_t is the probability distribution over the vocabulary of V words in total.
The cross entropy loss function is used, but we are predicting words instead of classes.
In short...
Word Embeddings
- Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.
- This has multiple advantages over the traditional one-hot-encoding of words.
Word2Vec
Word2vec is a particularly computationally-efficient predictive model for learning word embeddings from raw text.
It comes in two flavors, the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model.
Word2vec can find relations between words. The vectors preserve syntactic as well as semantic information. This explains why these vectors are also useful as features for many canonical NLP prediction tasks, such as part-of-speech tagging or named entity recognition.
Visualizing 300-dimensional Word2vec embeddings by reducing their dimensionality to 2 dimensions using t-SNE or PCA
3D visualization with Tensorboard
How does it work?
- We create a neural network with one hidden layer that aims to predict the context words from a center word w_t (or the center word from its context), in terms of word vectors.
- After many iterations over a large corpus of unstructured text, when the training loss finally stops decreasing, we stop training, and the weight matrix of the hidden layer becomes our embeddings.
- We can generate embeddings of as many dimensions as we want by adjusting the hidden layer size.
- Usually, the more dimensions, the more information they convey.
- In the picture on the previous slide, the size of the hidden layer is 300, so each word in our word embeddings will be a vector of length 300.
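A hedged sketch of training such embeddings with the gensim library; the parameter names follow recent gensim versions, and the corpus here is a toy list of tokenized sentences:

```python
from gensim.models import Word2Vec

sentences = [
    "the king rules the kingdom".split(),
    "the queen rules the kingdom".split(),
    "the dog chased the cat".split(),
]

# sg=1 selects the skip-gram flavour, sg=0 CBOW; vector_size is the embedding dimension
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

print(model.wv["king"].shape)          # (100,)
print(model.wv.most_similar("king"))   # nearest words in the embedding space
```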
Text Classification
- Document classification is assigning categories/labels to documents or strings of text, which could be web pages, media articles, customer reviews, etc.
- Has applications like
- Spam filtering
- Sentiment Analysis
- Intent recognition (chatbots)
- Personal twitter filter (will interest me/ won't)
- Language guessing
.....and a lot more.
What is it ?
Topic extraction and tag assignment (e.g. by Quora) use LSA or mainstream information extraction techniques.
Intent classification
How is it done ?
- There are a wide variety of methods for text classification, ranging from traditional TF-IDF features and Linear SVMs, to word embeddings and deep learning based classification techniques.
- There are also clustering methods for unsupervised data and finding similarity between documents.
- As we'll see, sometimes it is necessary to take into consideration the sequence of words or characters. That's where RNNs come into play.
The general pipeline for text classification
Text preprocessing
Convert to lower case, remove punctuation, and tokenize. Example: "despite a somewhat too tidy ending it s a terrific movie beautifully made"
- Tokens: despite, a, somewhat, too, tidy, ending, it, s, a, terrific, movie, beautifully, made
- After stopword removal: despite, somewhat, tidy, ending, terrific, movie, beautifully, made
- After stemming: despit, somewhat, tidy, end, terrif, movi, beautiful, made
- After lemmatization: despit, somewhat, tidy, end, terrif, movi, beautiful, make
Converting Preprocessed Text to Features
- The documents can be represented as fixed length vectors. This idea is known as the vector space model.
- Count vectorization
- TF-IDF vectorization
Count Vectorization
Doc1: hello how are you
Doc2: i am good how are you
Doc3: are you coming to college

Vocabulary: hello, coming, to, how, you, are, good, college, i, am

| | hello | coming | to | how | you | are | good | college | i | am |
|---|---|---|---|---|---|---|---|---|---|---|
| Doc1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| Doc2 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| Doc3 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
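The same idea with scikit-learn's CountVectorizer (the column order simply follows whatever vocabulary ordering the vectorizer picks; the token pattern is adjusted so one-letter words like "i" are kept):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "hello how are you",
    "i am good how are you",
    "are you coming to college",
]

vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")  # keep 1-letter tokens like "i"
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())   # one count vector per document
```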
Need for a new metric
- Sometimes certain stopwords can be useful.
- Words do not have the same weightage for each document
Doc1: Tony Stark is a genius.
Doc2: Tony Stark is not a genius.
TF-IDF vectorization
(term frequency–inverse document frequency)
TF: Term Frequency: Measures how frequently a term occurs in a document.
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
IDF: Inverse Document Frequency: Measures how important a term is.
IDF(t) = log(Total number of documents / Number of documents with term t in it)
Finally : Tf-Idf (t) = TF(t) * IDF(t)
What after feature extraction?
- Once the feature vector has been made, classic supervised learning algorithms like SVM, Naive Bayes, Logistic regression can be used for classification.
- Generally, for short documents (e.g. intent classification of short sentences), Naive Bayes has shown good results because of its ability to generalize better.
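A minimal sketch of that pipeline with scikit-learn: TF-IDF features followed by a Multinomial Naive Bayes classifier; the tiny training set and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["book a flight to delhi", "reserve a train ticket",
               "order a cheese pizza", "i want a veg pizza"]
train_labels = ["travel", "travel", "food", "food"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

print(clf.predict(["book me a flight"]))   # expected: ['travel']
```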
Limitations of the traditional approaches
- Don't take sequence into account
- Assume that words are independent of each other
- Don't generalize to words which were not seen during training
The movie was good; was not bad. (+ve)
vs
The movie was not good; was bad. (-ve)
The cat walks across the table.
or
The dog jumps across the sofa.
Using RNNs for text classification
- Using a many-to-one RNN(LSTM) for text classification.
- At each time step, a word from the sentence is passed to the LSTM as input, in the form of either a one-hot encoding or a word embedding
Using Convolutional Neural Networks with word embeddings
From the research paper "UNITN: Training Deep Convolutional Neural Network for Twitter Sentiment Classification" (Severyn and Moschitti, 2015; Felbo et al., 2017)
Conversational Interfaces (chatbots)
Purpose of a conversational interface
- These are made to serve certain purposes, for example booking a hotel, ordering a pizza, entertainment, etc.
- Nowadays, chatbots are also being used for emotional counselling of people.
How do they work ?
Natural Language Generation:
With and without Deep Learning
Retrieval type chatbots
- Responses are handmade by programmers
- The bot chooses the best possible response, given a query, with the help of some heuristic (such as tf-idf and cosine similarity)
- Can use deep learning methods like dual-encoder similarity.
Generative type Chatbots
- Generates responses from scratch, given a query.
- Rule-based systems were used in the early days.
- Now the popular Seq2Seq architecture is used.
"Smart Reply" by Google
Gives suggestions for appropriate short responses to the received message
How Smart Reply works
Taken from a research paper by Google
The Dual Encoder LSTM
Used to learn a similarity function between a query and a response. The candidate responses are prepared in advance.
Working of Dual encoder architecture
- Both the context and the response text are split into words, and each word is embedded into a vector.
- Both the embedded context and response are fed into the same Recurrent Neural Network word-by-word. The RNN generates a vector representation that, loosely speaking, captures the "meaning" of the context and response (c and r in the picture).
- We multiply c with a matrix M to "predict" a response r'. If c is a 256-dimensional vector, then M is a 256×256-dimensional matrix, and the result is another 256-dimensional vector, which we can interpret as a generated response. The matrix M is learned during training.
- We measure the similarity of the predicted response r' and the actual response r by taking the dot product of these two vectors. A large dot product means the vectors are similar and the response should receive a high score.
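A small sketch of the scoring step only: given already-encoded context and response vectors c and r, the score is sigmoid(c^T M r); the LSTM encoders are omitted and the vectors here are random stand-ins.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

dim = 256
rng = np.random.default_rng(0)

c = rng.normal(size=dim)                 # encoded context (stand-in for the LSTM output)
r = rng.normal(size=dim)                 # encoded candidate response
M = rng.normal(size=(dim, dim)) * 0.01   # learned during training

r_pred = M @ c                      # "predicted" response vector r'
score = sigmoid(r_pred @ r)         # large dot product -> high probability of a match
print(score)
```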
The Sequence to Sequence (Seq2seq) model
An end to end technique for generating responses from scratch, given a query
How it works
- It consists of two RNNs : An Encoder and a Decoder.
- The encoder takes a sequence (sentence) as input and processes one symbol (word) at each time step.
- Its objective is to convert the sequence of symbols into a fixed-size feature vector (called the context) that encodes only the important information in the sequence while losing the unnecessary information.
- From the context, the decoder generates another sequence, one symbol (word) at a time.
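A minimal sketch of such an encoder-decoder in Keras (trained with teacher forcing), assuming TensorFlow is available; the vocabulary and layer sizes are illustrative.

```python
from tensorflow.keras import layers, models

VOCAB, EMB, HID = 8000, 128, 256

# encoder: reads the source sequence and keeps only its final states (the "context")
enc_inputs = layers.Input(shape=(None,))
enc_emb = layers.Embedding(VOCAB, EMB, mask_zero=True)(enc_inputs)
_, state_h, state_c = layers.LSTM(HID, return_state=True)(enc_emb)

# decoder: generates the target sequence one word at a time, seeded with the context
dec_inputs = layers.Input(shape=(None,))
dec_emb = layers.Embedding(VOCAB, EMB, mask_zero=True)(dec_inputs)
dec_out, _, _ = layers.LSTM(HID, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
probs = layers.Dense(VOCAB, activation="softmax")(dec_out)

model = models.Model([enc_inputs, dec_inputs], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```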
Seq2Seq with Attention mechanism
- Attention is a mechanism that forces the model to learn to focus (or to attend) on specific parts of the input sequence when decoding, instead of relying only on the hidden vector of the decoder’s LSTM.
- The decoder does not take the single context vector computed by the encoder. Instead, it takes a context generated by applying the attention mechanism over all the encoder hidden-state outputs.
Attention mechanism
Seq2seq model with attention
How the context is calculated
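In rough numpy terms (using a simple dot-product score; the slides may use a different score function), the context at one decoder step is a softmax-weighted sum of all encoder hidden states:

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    # scores: how relevant each encoder state is to the current decoder state
    scores = encoder_states @ decoder_state          # (T,) dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax -> attention weights
    return weights @ encoder_states, weights         # context vector, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))   # 6 source time steps, hidden size 8
decoder_state = rng.normal(size=8)
context, weights = attention_context(decoder_state, encoder_states)
print(weights.round(2), context.shape)
```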
Drawbacks of using seq2seq for text generation
- This model can take the conversation in any direction, which is not controllable.
- Grammatical mistakes are common.
- May generate replies which are inconsistent with the previous replies.
Machine Translation
The goal is to convert a given piece of text from one language to another while preserving the syntax and meaning.
Challenges in MT
Lexical ambiguity:
book used as verb: book a flight => reservar (spanish)
book used as noun: read the book => libro (spanish)
Different word orders:
English word order is: subject-verb-object
Japanese word order is: subject-object-verb
Syntactic ambiguity can cause problems:
"John hit the dog with a stick." could have two meanings.
Pronoun resolution:
The computer outputs the data; it is fast.
The computer outputs the data; it is stored in ascii.
In the first sentence, 'it' refers to the computer; in the second, it refers to the data.
Classical approaches to MT
- Rule-based: These were called Direct MT systems. They converted the text word by word and then applied some rules to preserve the syntactic structure.
- Transfer-based approach: Parse the source-language sentence, convert the parse tree into a target-language parse tree, and finally generate the output sentence from that tree.
- Statistical MT systems: Use parallel corpora. Very complex to build.
Statistical MT systems
We have a source language: f , eg. French
And we have a destination language: e, eg. English
We use a probabilistic approach (the noisy-channel formulation): ê = argmax_e p(e | f) = argmax_e p(f | e) · p(e)
p(f | e) is the translation model, which is trained on the parallel corpus.
And p(e) is the language model of the language 'e' (here English).
Beam search is used for searching for best sentence out of the possible choices.
Translation using Seq2Seq model with attention
This approach is similar to the one used to generate a response given a query (the seq2seq chatbot). The only difference is that here the generated sentence is not in the same language as the source; it is in the target language.
Visualizing Attention
Text Summarization
Creating a short, accurate, and fluent summary of a longer text document.
Automatic text summarization methods are greatly needed to address the ever-growing amount of text data available online, both to help discover relevant information and to consume it faster.
Using Seq2Seq model for text summarization
- The same approach which was used for machine translation using seq2seq can be used for text summarization.
- As the input we provide the piece of text word by word (as embeddings) and encode it into a fixed length vector.
- Then the decoder decodes this vector to produce the summary (or headline) word by word.
- It is trained in a similar way on a huge dataset of text-summary pairs.
- Attention mechanism can be applied here. In practice, it has shown good results.
Visualizing Attention weights
Results with the deep learning approach
Other widely used approaches for text summarization
Extractive:
- Based on statistical foundation
- Looks at features like keywords, title, word distance, cue phrases (e.g. "significantly", "hardly"), and sentence position
Abstractive:
- Based on linguistic and mathematical foundation.
- Techniques like Lexical chains, Clustering, Non Negative Matrix factorization
A trainable method by
Kupiec et al. (1995)
For every sentence, it is decided whether or not it should be included in the summary.
The features:
- Sentence Length
- Presence of Upper case words (except common acronyms)
- Thematic words
- A set of 26 manually fixed phrases
- Sentence position in the paragraph
Naive Bayes classifier is used, which assumes statistical independence.
It is an extractive method.
Gong and Liu (2001)
- Using LSA (Latent Semantic Analysis)
- It works for both single-document and multi-document summarization
- No WordNet is used
- Each document is represented as a word by sentence matrix (row=word, column=sentence)
- TF*IDF weights in the matrix
- Apply SVD (Singular Value Decomposition) to get U S V^T
- The rows of V^T are independent topics
- Select sentences that cover these independent topics
Other Common Approaches
- Using LSA (Latent Semantic Analysis)
- Using Lexical Chains
Comparison
- The Deep Learning based approach is new, and it is much simpler and less complicated than the statistical and linguistic approaches to this problem.
- But the Deep Learning approach has limitations: it is essentially supervised and needs a lot of data to train on, and it cannot reliably generate summaries of long paragraphs.
ZINI Server Communication architecture (diagram): the user's Android/iOS phone running the ZINI app talks to ZINI's Android server (ASP.NET; handles user authentication and communication with the AI server), which in turn talks to ZINI's AI server (Python; runs the code for the intelligence).
ZINI high-level architecture and working (diagram): a message from the user's app passes through the Android server to the AI server, which provides the intelligence via intent recognition, a dialog manager, a symptom assessment module, a specialist recommendation module, a general chat module, a General Medical Advice & Info (GMAI) module, and an additional conversation module (handles specific chat types, e.g. emotional responses), backed by a PostgreSQL database; the response travels back the same way.
Retrieval with the Universal Sentence Encoder (diagram): a handmade table of message-response pairs (e.g. "How are you?" → "I am fine, Thankyou!", "How to impress girls?" → "Be rich :P"). While training, each stored message is encoded into a vector with the Universal Sentence Encoder. During chatting, the incoming query (e.g. "How you doin?") is encoded the same way and its cosine similarity with every stored message vector is computed (e.g. similarity 1 = 0.9 > similarity 2 = 0.3); the response paired with the most similar stored question ("How are you?") is returned.