PREPROCESSING  102

#NLP

Outline

  1. Basics
  2. Count Features
    • TF-IDF
    • N-Grams
  3. Dense vectors
    • Word2Vec
    • GloVe
  4. Deep learning & NLP

Basics

Library NLTK

import nltk
nltk.download()

Tokenize

from nltk.tokenize import sent_tokenize, word_tokenize
# Best for European languages

text = "Hey Bob! What's the weather at 8 o'clock"
sent_tokenize(text)
# ['Hey Bob!', "What's the weather at 8 o'clock"]
word_tokenize(sent_tokenize(text)[1])
# ['What', "'s", 'the', 'weather', 'at', '8', "o'clock"]

Part Of Speech Tagging

tokens = word_tokenize("I went to Paris to meet Bob")
nltk.pos_tag(tokens)
# [('I', 'PRP'),
#  ('went', 'VBD'),
#  ('to', 'TO'),
#  ('Paris', 'NNP'),
#  ('to', 'TO'),
#  ('meet', 'VB'),
#  ('Bob', 'NNP')]

nltk.ne_chunk(nltk.pos_tag(tokens), binary=True)
# Tree('S', [
#     ('I', 'PRP'), ('went', 'VBD'), ('to', 'TO'),
#     Tree('NE', [('Paris', 'NNP')]), ('to', 'TO'), ('meet', 'VB'),
#     Tree('NE', [('Bob', 'NNP')]),
# ])

The POS tagger in NLTK isn't that great; if you want a better model, take a look at SyntaxNet.

Stemming

Word -> Stem (the part of the word that stays unchanged across its inflected forms)

# The two most used stemmers
from nltk.stem import SnowballStemmer
from nltk.stem.porter import PorterStemmer

snow = SnowballStemmer('english')

snow.stem("own") == snow.stem("owning") == snow.stem("owned")
# True

snow.stem("entities") == snow.stem("entity")
# True

Speed: Snowball > Porter

Stemming quality: Porter > Snowball

 

Lemmatisation

Word -> Lemma (dictionary form)

from nltk.stem import WordNetLemmatizer

wordnet = WordNetLemmatizer()

wordnet.lemmatize("women")
# u'woman'

wordnet.lemmatize("marketing")
# 'marketing'

wordnet.lemmatize("markets")
# u'market'

snow.stem("marketing")
# u'market'

snow.stem("markets")
# u'market'

/!\ Really slow /!\

Stop Words

from nltk.corpus import stopwords

len(stopwords.words('english'))
# 153

stopwords.words('english')[:20]
# [u'i',
#  u'me',
#  u'my',
#  u'myself',
#  u'we',
#  u'our',
#  u'ours',
#  u'ourselves',
#  u'you',
#  u'your',
#  u'yours',
#  u'yourself',
#  u'yourselves',
#  u'he',
#  u'him',
#  u'his',
#  u'himself',
#  u'she',
#  u'her',
#  u'hers']
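A minimal sketch of the typical usage, i.e. dropping those stop words from a tokenized sentence (the example sentence is made up):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop = set(stopwords.words('english'))

tokens = word_tokenize("This is an example showing off stop word filtering")
# Keep only the tokens that are not in the stop-word list
[t for t in tokens if t.lower() not in stop]
# ['example', 'showing', 'stop', 'word', 'filtering']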

String Metrics

 

Most widely used:

  1. Levenshtein distance (+++)
    • minimum number of character edits (insert, delete, substitute) to go from one word to the other (see the sketch below)
  2. Jaro-Winkler distance
    • similar, but gives more weight to matches at the beginning of the words
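
A pure-Python sketch of the Levenshtein distance (two-row dynamic programming); libraries such as python-Levenshtein or jellyfish offer faster implementations and also cover Jaro-Winkler:

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            substitution = previous[j - 1] + (ca != cb)
            current.append(min(insertion, deletion, substitution))
        previous = current
    return previous[-1]

levenshtein("kitten", "sitting")
# 3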

Basic TODO

  1. Lowercase
  2. Normalize the punctuation (pick one convention and stick to it), e.g.:
    • after a comma, always a space
    • every punctuation mark is turned into a space
  3. Normalize spaces: collapse multiple spaces into one
  4. Non-ASCII to ASCII: special characters, accents (a sketch combining these steps follows)
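
A possible sketch of that TODO list as a single helper, assuming the "every punctuation mark becomes a space" convention:

import re
import unicodedata

def basic_clean(text):
    # 1. Lowercase
    text = text.lower()
    # 4. Non-ASCII to ASCII: strip accents and special characters
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    # 2. Normalize punctuation: here, every punctuation mark becomes a space
    text = re.sub(r'[^\w\s]', ' ', text)
    # 3. Normalize spaces: collapse any whitespace run into a single space
    text = re.sub(r'\s+', ' ', text).strip()
    return text

basic_clean("  Héllo,world!!  How's   it going? ")
# 'hello world how s it going'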

Count Features

Count Vectorizer

Input: Corpus of text documents

Output: Matrix NxM with N = # of documents, M = # of unique words

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
     'This is the first document.',
     'This is the second second document.',
     'And the third one.',
     'Is this the first document?',
 ]
vectorizer = CountVectorizer()
vectorizer.fit_transform(corpus).toarray()
# array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
#        [0, 1, 0, 1, 0, 2, 1, 0, 1],
#        [1, 0, 0, 0, 1, 0, 1, 1, 0],
#        [0, 1, 1, 1, 0, 0, 1, 0, 1]])

vectorizer.get_feature_names()
# [u'and', u'document', u'first', u'is', u'one', u'second',
# u'the', u'third', u'this']

TF-IDF

tfidf(w, d) = tf(w, d) \cdot idf(w)

tf(w, d) = f_{w,d}

idf(w) = \log \frac{|D|}{|\{d \in D: w \in d\}|}

Normalization of the occurrence matrix: the frequency of a word in a document, weighted by its rarity across the corpus.

tf: rewards words that occur frequently within a document

idf: penalizes words that appear in too many documents of the corpus

(the log term is there because word frequency distributions across a corpus typically follow a power law)

TF-IDF

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
     'This is the first document.',
     'This is the second second document.',
     'And the third one.',
     'Is this the first document?',
 ]
vectorizer = TfidfVectorizer()
np.around(vectorizer.fit_transform(corpus).toarray(), decimals=2)

# array([[ 0.  ,  0.44,  0.54,  0.44,  0.  ,  0.  ,  0.36,  0.  ,  0.44],
#        [ 0.  ,  0.27,  0.  ,  0.27,  0.  ,  0.85,  0.22,  0.  ,  0.27],
#        [ 0.55,  0.  ,  0.  ,  0.  ,  0.55,  0.  ,  0.29,  0.55,  0.  ],
#        [ 0.  ,  0.44,  0.54,  0.44,  0.  ,  0.  ,  0.36,  0.  ,  0.44]])

vectorizer.get_feature_names()
# [u'and', u'document', u'first', u'is', u'one', u'second',
# u'the', u'third', u'this']

N-Grams

from sklearn.feature_extraction.text import CountVectorizer

text = "word1 word2 word3 word4 word5"
CountVectorizer(ngram_range=(1,4)).build_analyzer()(text)
# [u'word1',
#  u'word2',
#  u'word3',
#  u'word4',
#  u'word5',
#  u'word1 word2',
#  u'word2 word3',
#  u'word3 word4',
#  u'word4 word5',
#  u'word1 word2 word3',
#  u'word2 word3 word4',
#  u'word3 word4 word5',
#  u'word1 word2 word3 word4',
#  u'word2 word3 word4 word5']


# Pure-Python equivalent for a single n (wrap in list() under Python 3 to materialise the tuples)
def find_ngrams(input_list, n):
    return zip(*[input_list[i:] for i in range(n)])

Dense Vectors

Embedding

An embedding model learns to map each discrete word to a low-dimensional, continuous vector space, based on the distributional properties observed in a raw text corpus.

Word2Vec

Premise: words next to each other are related

 

Two distinct models:

  • CBOW (Continuous Bag of Words): given surrounding words, predict central word
  • SGNS (Skip-gram with Negative Sampling): given a word, predict its surrounding words

Word2Vec: SGNS

Sentence: "The quick brown fox jumps over the lazy dog"

Window = 1, negative samples = 2

Positive dataset D (label 1):

Word    Context
the     quick
quick   the
quick   brown
brown   quick
brown   fox
[...]   [...]

Negative dataset D' (label 0):

Word    False context
the     random_word1
the     random_word2
quick   random_word3
quick   random_word4
brown   random_word5
[...]   [...]
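
A tiny sketch of how those two datasets can be generated; here the negative words are drawn uniformly from the vocabulary, which is a simplification of the real frequency-based negative sampling:

import random

sentence = "the quick brown fox jumps over the lazy dog".split()
vocab = list(set(sentence))
window, n_neg = 1, 2

positives, negatives = [], []
for i, word in enumerate(sentence):
    # Positive pairs: (word, context) for every context word inside the window
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            positives.append((word, sentence[j]))
    # Negative pairs: (word, random word), sampled uniformly for simplicity
    for _ in range(n_neg):
        negatives.append((word, random.choice(vocab)))

positives[:4]
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick')]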

Word2Vec: SGNS

  • Considering:

    • Corpus of words w ∈ W and  their context c ∈ C

    • Parameters θ controlling the distribution  P(D = 1|w, c; θ)

Vectorial representations of w and c:

v_w \in R^d, \quad v_c \in R^d

Probability that a pair (w, c) belongs to D:

P\left(D=1\middle|w,c;\theta\right) = \frac{1}{1 + e^{-v_c \cdot v_w}}

Objective:

{arg\,max}_\theta \displaystyle\prod_{(w, c) \in D} P\left(D=1\middle|w,c;\theta\right) \displaystyle\prod_{(w, c) \in D'} P\left(D=0\middle|w,c;\theta\right)

= {arg\,max}_\theta \displaystyle\sum_{(w, c) \in D} \log \frac{1}{1 + e^{-v_c \cdot v_w}} + \displaystyle\sum_{(w, c) \in D'} \log \frac{1}{1 + e^{v_c \cdot v_w}}
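
A minimal numpy sketch of that objective for one positive and one negative pair, just to make the sigmoid explicit (the vectors are random placeholders, not trained embeddings):

import numpy as np

d = 50                          # embedding dimension
v_w = np.random.randn(d)        # word vector (placeholder)
v_c_pos = np.random.randn(d)    # context vector of a positive pair
v_c_neg = np.random.randn(d)    # context vector of a negative pair

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# P(D = 1 | w, c; theta) = sigmoid(v_c . v_w)
p_pos = sigmoid(np.dot(v_c_pos, v_w))

# Contribution of one positive and one negative pair to the objective:
# log sigmoid(v_c . v_w)  +  log sigmoid(-v_c' . v_w)
objective = np.log(p_pos) + np.log(sigmoid(-np.dot(v_c_neg, v_w)))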

Word2Vec: in practice

Python library gensim: https://radimrehurek.com/gensim/models/word2vec.html

Main parameters:

model = Word2Vec(sentences,
                 size=100,      # dimension of the embedding vectors
                 window=5,      # size of the context window
                 negative=5,    # number of negative samples per positive pair
                 alpha=0.025,   # initial learning rate
                 min_count=10)  # ignore words seen fewer than 10 times

A lot of built-in functions:

model.wv.most_similar(positive=['woman', 'king'], negative=['man'])
# [('queen', 0.50882536), ...]

model.wv.doesnt_match("breakfast cereal dinner lunch".split())
# 'cereal'

In practice, pre-trained Word2Vec models are used a lot!

GloVe

Another embedding method:

  • Word2vec is a "predictive" model
  • GloVe is a "count-based" model

Count-based models:

Learn their vectors by applying dimensionality reduction to the co-occurrence counts matrix.

 

Always the same objective: minimize some "reconstruction loss" when looking for the lower-dimensional representation that explains most of the variance in the high-dimensional data.

GloVe: quick insight

tl;dr: normalizing the counts & log-smoothing

 

Weighting the counts around the window:

Sentence: "word1 word2 word3 word4"

Window: 2

 

 

 

 

word1 word2 word3 word4
word1 0 1 0.5 0
word2 1 0 1 0.5
word3 0.5 1 0 1
word4 0 0.5 1 0
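
A small sketch that reproduces this weighted co-occurrence matrix, where a pair of words at distance d inside the window contributes 1/d:

from collections import defaultdict

sentence = "word1 word2 word3 word4".split()
window = 2

cooc = defaultdict(float)
for i, wi in enumerate(sentence):
    for d in range(1, window + 1):
        j = i + d
        if j < len(sentence):
            # Words at distance d contribute 1/d, symmetrically
            cooc[(wi, sentence[j])] += 1.0 / d
            cooc[(sentence[j], wi)] += 1.0 / d

cooc[('word1', 'word2')], cooc[('word1', 'word3')], cooc[('word1', 'word4')]
# (1.0, 0.5, 0.0)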

GloVe: quick insight

Based on this matrix, vectors are built so that:

w_i^T w_j + b_i + b_j = \log X_{i,j}

where X_{i,j} is the element (i, j) of the co-occurrence matrix.

Cost function:

\displaystyle\sum_{i, j = 1}^{V} g(X_{i,j}) \left(w_i^T w_j + b_i + b_j - \log X_{i,j}\right)^2

Weight function g:

g(X_{i,j}) = \frac{X_{i,j}}{x_{max}} \text{ if } X_{i,j} < x_{max} \text{ else } 1
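
A numpy sketch of that cost function on the toy matrix above; the vectors and biases are random/zero placeholders, and x_max is a hyper-parameter (100 in the original GloVe paper, which also raises the ratio to the 3/4 power; the slide's simpler g is used here):

import numpy as np

V, dim, x_max = 4, 10, 100.0     # vocabulary size, embedding dim, weighting cut-off

X = np.array([[0, 1, 0.5, 0],    # the weighted co-occurrence matrix from above
              [1, 0, 1, 0.5],
              [0.5, 1, 0, 1],
              [0, 0.5, 1, 0]])

W = 0.01 * np.random.randn(V, dim)   # word vectors (randomly initialised placeholders)
b = np.zeros(V)                      # biases

def g(x):
    # Weight function from the slide: x / x_max below the cut-off, 1 above
    return np.minimum(x / x_max, 1.0)

cost = 0.0
for i in range(V):
    for j in range(V):
        if X[i, j] > 0:              # only non-zero co-occurrences contribute
            diff = W[i] @ W[j] + b[i] + b[j] - np.log(X[i, j])
            cost += g(X[i, j]) * diff ** 2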

GloVe: in practice

Python library & pre-trained models:

https://github.com/stanfordnlp/GloVe
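
The pre-trained files are plain text, one word per line followed by its vector components; a minimal loader could look like this (the file name is only an example):

import numpy as np

def load_glove(path):
    """Parse a GloVe text file into a {word: vector} dictionary."""
    embeddings = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return embeddings

# vectors = load_glove('glove.6B.100d.txt')   # example file from the link above
# vectors['paris'].shape
# (100,)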

Deep Learning & NLP

Best course:

 

Youtube Playlist

Syllabus

 

Do it, do it, do it! (and do the maths)

Kaggle Quora:

 

How to easily get into the top 50

 

tl;dr: Siamese network

Using pre-trained GloVe embeddings (from Stanford), feed q1 and q2 through an LSTM, concatenate the two resulting vectors into one, and feed that to a few fully connected layers.
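
A hedged Keras sketch of that kind of Siamese architecture; the layer sizes, vocabulary size and sequence length are placeholders, and the embedding weights would be initialised from the pre-trained GloVe vectors (not shown here):

from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Concatenate
from tensorflow.keras.models import Model

max_len, vocab_size, embed_dim = 30, 50000, 100   # placeholder values

# Shared layers: both questions go through the same embedding + LSTM
embedding = Embedding(vocab_size, embed_dim)      # weights would come from GloVe
shared_lstm = LSTM(128)

q1_in = Input(shape=(max_len,))
q2_in = Input(shape=(max_len,))
q1_vec = shared_lstm(embedding(q1_in))
q2_vec = shared_lstm(embedding(q2_in))

# Concatenate the two sentence vectors and classify "duplicate or not"
merged = Concatenate()([q1_vec, q2_vec])
hidden = Dense(64, activation='relu')(merged)
output = Dense(1, activation='sigmoid')(hidden)

model = Model(inputs=[q1_in, q2_in], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])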

Bonus: Internship @RadiumOne

#unofficial

Looking for a data scientist intern for a full journey into the DS world:

  1. Data Engineering:
    • ~10^10 data points coming in per day
    • Hadoop/Spark cluster: 400-500 nodes worldwide
  2. Data Science:
    • Click/Conversion prediction
    • Churn, Upsell,...
    • Imbalanced learning
  3. Business:
    • Client facing
    • Close relationship with traders
    • Project management (training sessions,...)

Contact

Yann Carbonne

ycarbonne@radiumone.com

Slack: @yannc

Linkedin

 

=> Meetup 05/18 @Telecom <=
