#NLP
import nltk
nltk.download()
from nltk.tokenize import sent_tokenize, word_tokenize
# Best for European languages
text = "Hey Bob! What's the weather at 8 o'clock"
sent_tokenize(text)
# ['Hey Bob!', "What's the weather at 8 o'clock"]
word_tokenize(sent_tokenize(text)[1])
# ['What', "'s", 'the', 'weather', 'at', '8', "o'clock"]tokens = word_tokenize("I went to Paris to meet Bob")
nltk.pos_tag(tokens)
# [('I', 'PRP'),
# ('went', 'VBD'),
# ('to', 'TO'),
# ('Paris', 'NNP'),
# ('to', 'TO'),
# ('meet', 'VB'),
# ('Bob', 'NNP')]
nltk.ne_chunk(nltk.pos_tag(tokens), binary=True)
# Tree('S', [
# ('I', 'PRP'), ('went', 'VBD'), ('to', 'TO'),
# Tree('NE', [('Paris', 'NNP')]), ('to', 'TO'), ('meet', 'VB'),
# Tree('NE', [('Bob', 'NNP')]),
# ])
The POS tagger in NLTK isn't that great; if you want a better model, take a look at SyntaxNet.
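To pull the entity strings out of that tree, something like this works (a small sketch using nltk's Tree API):
tree = nltk.ne_chunk(nltk.pos_tag(tokens), binary=True)
[" ".join(word for word, tag in subtree.leaves()) for subtree in tree.subtrees(lambda t: t.label() == 'NE')]
# ['Paris', 'Bob']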
Word -> Stem (non-changing portion)
# The two most used stemmers
from nltk.stem import SnowballStemmer
from nltk.stem.porter import PorterStemmer
snow = SnowballStemmer('english')
snow.stem("own") == snow.stem("owning") == snow.stem("owned")
# True
snow.stem("entities") == snow.stem("entity")
# True
Speed: Snowball > Porter
Performance: Porter > Snowball
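A quick example where the two disagree (exact outputs may vary slightly between NLTK versions):
porter = PorterStemmer()
porter.stem("fairly"), snow.stem("fairly")
# typically ('fairli', 'fair')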
Word -> Lemma (dictionary form)
from nltk.stem import WordNetLemmatizer
wordnet = WordNetLemmatizer()
wordnet.lemmatize("women")
# u'woman'
wordnet.lemmatize("marketing")
# 'marketing'
wordnet.lemmatize("markets")
# u'market'
snow.stem("marketing")
# u'market'
snow.stem("markets")
# u'market'
/!\ Really slow /!\
from nltk.corpus import stopwords
len(stopwords.words('english'))
# 153
stopwords.words('english')[:20]
# [u'i',
# u'me',
# u'my',
# u'myself',
# u'we',
# u'our',
# u'ours',
# u'ourselves',
# u'you',
# u'your',
# u'yours',
# u'yourself',
# u'yourselves',
# u'he',
# u'him',
# u'his',
# u'himself',
# u'she',
# u'her',
# u'hers']
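Typical usage is just filtering them out of a token list (a minimal sketch):
stops = set(stopwords.words('english'))
words = word_tokenize("This is an example showing off stop word filtration")
[w for w in words if w.lower() not in stops]
# e.g. ['example', 'showing', 'stop', 'word', 'filtration']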
Most widely used representation: bag of words (occurrence counts)
Input: Corpus of text documents
Output: Matrix NxM with N = # of documents, M = # of unique words
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?',
]
vectorizer = CountVectorizer()
vectorizer.fit_transform(corpus).toarray()
# array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
# [0, 1, 0, 1, 0, 2, 1, 0, 1],
# [1, 0, 0, 0, 1, 0, 1, 1, 0],
# [0, 1, 1, 1, 0, 0, 1, 0, 1]])
vectorizer.get_feature_names()
# [u'and', u'document', u'first', u'is', u'one', u'second',
# u'the', u'third', u'this']Normalization of occurrence matrix:
Frequency of a word in a document, weighted by its rarity in the corpus:
tf: reward for a high number of occurrences in a document
idf: penalty for words that appear in too many documents of the corpus
(the log term is there because, most of the time, word frequencies across a corpus follow a power law)
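A minimal sketch of what TfidfVectorizer (used below) computes with its default settings, assuming scikit-learn's defaults of smoothed idf and L2 row normalization, and a dense counts matrix X of shape n_documents x n_words:
import numpy as np
def tfidf(X):
    n_docs = X.shape[0]
    df = (X > 0).sum(axis=0)                   # number of documents containing each word
    idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf: rarer words get a bigger weight
    weighted = X * idf                         # tf * idf
    return weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize each document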
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?',
]
vectorizer = TfidfVectorizer()
np.around(vectorizer.fit_transform(corpus).toarray(), decimals=2)
# array([[ 0. , 0.44, 0.54, 0.44, 0. , 0. , 0.36, 0. , 0.44],
# [ 0. , 0.27, 0. , 0.27, 0. , 0.85, 0.22, 0. , 0.27],
# [ 0.55, 0. , 0. , 0. , 0.55, 0. , 0.29, 0.55, 0. ],
# [ 0. , 0.44, 0.54, 0.44, 0. , 0. , 0.36, 0. , 0.44]])
vectorizer.get_feature_names()
# [u'and', u'document', u'first', u'is', u'one', u'second',
# u'the', u'third', u'this']
N-grams:
from sklearn.feature_extraction.text import CountVectorizer
text = "word1 word2 word3 word4 word5"
CountVectorizer(ngram_range=(1,4)).build_analyzer()(text)
# [u'word1',
# u'word2',
# u'word3',
# u'word4',
# u'word5',
# u'word1 word2',
# u'word2 word3',
# u'word3 word4',
# u'word4 word5',
# u'word1 word2 word3',
# u'word2 word3 word4',
# u'word3 word4 word5',
# u'word1 word2 word3 word4',
# u'word2 word3 word4 word5']
# Do the same in plain Python
def find_ngrams(input_list, n):
    return list(zip(*[input_list[i:] for i in range(n)]))
# find_ngrams("word1 word2 word3".split(), 2)
# [('word1', 'word2'), ('word2', 'word3')]
An embedding model learns to map each discrete word to a low-dimensional continuous vector space, based on the distributional properties observed in a raw text corpus.
Premise: words that appear next to each other are related
Two distinct models: CBOW (predict a word from its context) and skip-gram (predict the context from a word); the example below follows skip-gram with negative sampling.
Sentence: "The quick brown fox jumps over the lazy dog"
Window = 1, negative samples = 2
| Word | Context |
|---|---|
| the | quick |
| quick | the |
| quick | brown |
| brown | quick |
| brown | fox |
| [...] | [...] |
| Word | False Context |
|---|---|
| the | random_word1 |
| the | random_word2 |
| quick | random_word3 |
| quick | random_word4 |
| brown | random_word5 |
| [...] | [...] |
Positive Dataset D (label 1)
Negative Dataset D' (label 0)
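A small sketch of how such a dataset could be generated; drawing the negatives uniformly from the vocabulary is a simplification here (word2vec actually samples them from a smoothed unigram distribution):
import random
def training_pairs(tokens, window=1, n_negative=2):
    vocab = list(set(tokens))
    positives, negatives = [], []
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            positives.append((word, tokens[j], 1))                 # real (word, context) pair -> label 1
            for _ in range(n_negative):
                negatives.append((word, random.choice(vocab), 0))  # random false context -> label 0
    return positives, negatives
# training_pairs("The quick brown fox jumps over the lazy dog".lower().split())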
Considering:
Corpus of words w ∈ W and their context c ∈ C
Parameters θ controlling the distribution P(D = 1|w, c; θ)
Vectorial representation of w and c: v_w ∈ R^d and v_c ∈ R^d (these vectors are the parameters θ)
Probability that a couple (w, c) belongs to D: P(D = 1|w, c; θ) = σ(v_w · v_c) = 1 / (1 + e^(-v_w · v_c))
Objective: argmax_θ  Σ_{(w,c) ∈ D} log σ(v_w · v_c)  +  Σ_{(w,c) ∈ D'} log σ(-v_w · v_c)
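A toy numpy illustration of this objective (the vector couples are placeholders, not a full training loop):
import numpy as np
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
def sgns_objective(pos_pairs, neg_pairs):
    # pos_pairs / neg_pairs: lists of (v_w, v_c) vector couples taken from D and D'
    obj = 0.0
    for v_w, v_c in pos_pairs:
        obj += np.log(sigmoid(np.dot(v_w, v_c)))    # push sigma(v_w . v_c) towards 1
    for v_w, v_c in neg_pairs:
        obj += np.log(sigmoid(-np.dot(v_w, v_c)))   # push sigma(v_w . v_c) towards 0
    return obj                                       # training maximizes this w.r.t. the vectors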
Python library gensim: https://radimrehurek.com/gensim/models/word2vec.html
Main parameters: size (dimension of the vectors), window (context window), negative (number of negative samples), alpha (learning rate), min_count (ignore rarer words)
model = Word2Vec(sentences, size=100, window=5, negative=5, alpha=0.025, min_count=10)
# (in gensim >= 4.0, size was renamed to vector_size)
A lot of built-in functions:
model.wv.most_similar(positive=['woman', 'king'], negative=['man'])
# [('queen', 0.50882536), ...]
model.wv.doesnt_match("breakfast cereal dinner lunch".split())
# 'cereal'
Usually, people just use pre-trained Word2Vec models!
Another embedding method:
Count-based models:
Learn their vectors by doing some dimensionality reduction on the co-occurrence counts matrix.
Always the same objective: minimize some "reconstruction loss" when trying to find the lower-dimensional representation that explains most of the variance in the high-dimensional data.
tl;dr: normalizing the counts & log-smoothing
Weighting the counts around the window:
Sentence: "word1 word2 word3 word4"
Window: 2
|  | word1 | word2 | word3 | word4 |
|---|---|---|---|---|
| word1 | 0 | 1 | 0.5 | 0 |
| word2 | 1 | 0 | 1 | 0.5 |
| word3 | 0.5 | 1 | 0 | 1 |
| word4 | 0 | 0.5 | 1 | 0 |
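A small sketch that builds this matrix, assuming (as the table suggests) each co-occurrence is weighted by 1/distance within the window:
import numpy as np
def cooccurrence_matrix(tokens, window=2):
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                other = tokens[i + d]
                X[index[word], index[other]] += 1.0 / d   # weight decays with distance
                X[index[other], index[word]] += 1.0 / d   # symmetric matrix
    return vocab, X
# cooccurrence_matrix("word1 word2 word3 word4".split())[1] reproduces the table above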
Based on this matrix, vectors are built so that: w_i · c_j + b_i + b'_j ≈ log(X_ij)
Where X_ij is the element (i, j) of the co-occurrence matrix, w_i / c_j are the word / context vectors and b_i / b'_j their biases
Weight function g: g(x) = (x / x_max)^α if x < x_max, else 1
Cost function: J = Σ_ij g(X_ij) (w_i · c_j + b_i + b'_j - log X_ij)²
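A numpy sketch of that cost; x_max = 100 and α = 0.75 are the values suggested in the GloVe paper, everything else is a placeholder:
import numpy as np
def g(x, x_max=100, alpha=0.75):
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)   # caps the weight of very frequent pairs
def glove_cost(W, W_context, b, b_context, X):
    # W, W_context: word / context vectors; b, b_context: biases; X: co-occurrence matrix
    eps = 1e-8                                               # avoid log(0); g(0) = 0 zeroes those terms anyway
    diff = W @ W_context.T + b[:, None] + b_context[None, :] - np.log(X + eps)
    return np.sum(g(X) * diff ** 2)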
Python library & pre-trained models:
tl;dr: Siamese network
Using pre-trained GloVe vectors (from Stanford), feed (q1, q2) to an LSTM model, concatenate the two resulting vectors into one, and feed it to some fully connected layers.
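A hedged Keras sketch of that architecture; all the sizes, and the random matrix standing in for the real GloVe weights, are made-up placeholders:
import numpy as np
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.initializers import Constant
max_len, vocab_size, embed_dim = 40, 20000, 100
glove_weights = np.random.rand(vocab_size, embed_dim)        # placeholder for the real GloVe matrix
q1_in, q2_in = Input(shape=(max_len,)), Input(shape=(max_len,))
embed = Embedding(vocab_size, embed_dim, embeddings_initializer=Constant(glove_weights), trainable=False)
shared_lstm = LSTM(64)                                       # shared weights: same encoder for both questions
v1, v2 = shared_lstm(embed(q1_in)), shared_lstm(embed(q2_in))
merged = Concatenate()([v1, v2])                             # concatenate the two sentence vectors
hidden = Dense(64, activation='relu')(merged)                # fully connected layers
output = Dense(1, activation='sigmoid')(hidden)              # probability that q1 and q2 match
model = Model(inputs=[q1_in, q2_in], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy')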
#unofficial
Looking for a data scientist intern for a full journey into the DS world: