How to set up an NLP analysis
in 15 min?
Use case: a Trello board
Antoine Toubhans, Paris NLP Meetup #6, July 26, 2017
Front-end
Back-end
Trello API
My server
browser
within Trello world
outside Trello
Chrome plugin
POST /train
GET /cards
POST /predict
Raw data
Clean data
Model
Score
Fetch
Preprocess
Train
Optimize
Clean
Predict
https://github.com/AntoineToubhans/trellearn
Front
Back
Trello API
cards.json
card.json
sorted_labels.json
My server (learning)
Preprocessing
Learning
Validating
[{
"id": "57c8222ca15bee7064d24487",
"closed": false,
"dateLastActivity": "2016-09-29T08:54:47.064Z",
"desc": "Trouver une image de singe amusant et (...)",
"idBoard": "526a7df0e4fb2b90030021ac",
"idList": "57ab3d43083b138ec2af5ef0",
"idAttachmentCover": null,
"name": "ETQU, je vois un singe amusant sur (...)",
(...)
"due": null,
"idChecklists": [],
"idMembers": [],
"labels": [
{
"id": "566adb33fb396fe706d2d5a9",
"idBoard": "526a7df0e4fb2b90030021ac",
"name": "Bug",
"color": "pink",
"uses": 127
},
{
"id": "56557ef4fb396fe706bac607",
"idBoard": "526a7df0e4fb2b90030021ac",
"name": "Régression",
"color": "black",
"uses": 26
},
{
"id": "57ad88a784e677fd36e6c417",
"idBoard": "526a7df0e4fb2b90030021ac",
"name": "Prochain sprint",
"color": "green",
"uses": 7,
}
],
"subscribed": false,
}, {
(...)
}]
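A minimal sketch of how cards shaped like the export above could be turned into texts and label lists. The sample cards and the helper name `extract_text_and_labels` are hypothetical; only the `name`, `desc` and `labels` fields are taken from the export.

```python
import json

# Hypothetical sample that mirrors the structure of the Trello export;
# the values are invented for illustration.
SAMPLE = """
[{"name": "Fix login page", "desc": "Button does nothing",
  "labels": [{"name": "Bug"}]},
 {"name": "Plan next release", "desc": "",
  "labels": [{"name": "Prochain sprint"}]}]
"""

def extract_text_and_labels(cards):
    """Return (texts, label_lists) from a list of Trello card dicts."""
    texts, label_lists = [], []
    for card in cards:
        # One document per card: title followed by description (if any).
        parts = (card.get("name", ""), card.get("desc", ""))
        texts.append(" ".join(part for part in parts if part))
        # Keep only the label names; ids and colors are not needed here.
        label_lists.append([label["name"] for label in card.get("labels", [])])
    return texts, label_lists

cards = json.loads(SAMPLE)
texts, label_lists = extract_text_and_labels(cards)
print(texts)
print(label_lists)
```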
x_i = [0, \ldots, 3, 2, \ldots]
y_i = [0, \ldots, 1, 1, 1]
Preprocessing
X = \left[ \begin{array}{c} x_0 \\ \vdots \\ x_N \end{array} \right]
Y = \left[ \begin{array}{c} y_0 \\ \vdots \\ y_N \end{array} \right]
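One possible way to build X and Y with scikit-learn, assuming each x_i is a vector of word counts and each y_i is a 0/1 vector with one entry per label. The texts and label names below are invented, and the choice of `CountVectorizer` and `MultiLabelBinarizer` is an assumption, not necessarily what the talk used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical card texts and their labels.
texts = [
    "the button is broken",
    "broken button broken page",
    "plan the next sprint",
]
label_lists = [["Bug"], ["Bug", "Regression"], ["Next sprint"]]

# Row i of X is x_i: how many times each vocabulary word occurs in card i.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Row i of Y is y_i: 1 where card i carries the label, 0 elsewhere.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(label_lists)

print(X.shape, Y.shape)
print(list(binarizer.classes_))
print(Y)
```

`X` is a sparse matrix, which scikit-learn estimators accept directly.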
Learning
(Figure: axes x_1 and x_2)
Logistic regression
w_0 + w_1 x_1 + w_2 x_2 \leq 0
w^T x \leq 0
\sigma: t \mapsto \frac{1}{1+e^{-t}}
C(w) = \sum_{i=0}^N (y_i - \sigma(w^T x_i))^2
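The cost C(w) above can be minimised numerically. Below is a small NumPy sketch using plain gradient descent on toy data. The data, the step size and the iteration count are arbitrary assumptions. Note that scikit-learn's `LogisticRegression` minimises a regularised log-loss rather than this squared error.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cost(w, X, y):
    """C(w) = sum_i (y_i - sigma(w^T x_i))^2."""
    return float(np.sum((y - sigmoid(X @ w)) ** 2))

def gradient(w, X, y):
    p = sigmoid(X @ w)
    # dC/dw = sum_i -2 (y_i - p_i) p_i (1 - p_i) x_i
    return X.T @ (-2.0 * (y - p) * p * (1.0 - p))

# Toy, linearly separable data (hypothetical).
# The leading column of ones carries the bias term w_0.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(3)
initial_cost = cost(w, X, y)
for _ in range(2000):
    w -= 0.1 * gradient(w, X, y)  # fixed step size, chosen by hand
final_cost = cost(w, X, y)

# Predict class 1 where sigma(w^T x) exceeds 0.5, i.e. where w^T x > 0.
predictions = (sigmoid(X @ w) > 0.5).astype(int)
print(initial_cost, final_cost, predictions)
```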
(Figure: axes x_1 and x_2)
Multi-class Learning
(Figure: one class vs {the two other classes}, repeated for each of the three classes)
Multi-label Learning
(Figure: axes x_1 and x_2; one probability per label: p=0.74, p=0.52, p=0.09)
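A hedged sketch of multi-label learning with scikit-learn: `OneVsRestClassifier` fits one binary logistic regression per label, and `predict_proba` returns one independent probability per label, as in the figure. The 2-D points and the three labels are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical points (x_1, x_2).
X = np.array([[0, 0], [1, 0], [0, 1],
              [4, 0], [5, 1],
              [0, 4], [1, 5],
              [4, 4], [5, 5]], dtype=float)

# One column per label; a row may carry several labels at once.
# Column 0: x_1 is large, column 1: x_2 is large, column 2: both are small.
Y = np.array([[0, 0, 1], [0, 0, 1], [0, 0, 1],
              [1, 0, 0], [1, 0, 0],
              [0, 1, 0], [0, 1, 0],
              [1, 1, 0], [1, 1, 0]])

clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, Y)

# One probability per label; in the multi-label case they need not sum to 1.
proba = clf.predict_proba(np.array([[5.0, 0.0]]))
print(proba)
```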
Naive Bayes
P(y|x) = \frac{P(x, y)}{P(x)} = \frac{P(x|y)P(y)}{P(x)}
Hypothesis:
P(x_i|y,x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_N) = P(x_i|y)
P(x,y) = P(x_1|x_2,\ldots,x_N,y) P(x_2,\ldots,x_N,y)
= P(x_1|y) P(x_2,\ldots,x_N,y)
= \ldots
= P(y) \prod_{i=1}^N P(x_i|y)
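To make the factorisation concrete, here is a small pure-Python sketch that scores a document with P(y) times the product of P(x_i|y), then normalises by P(x). The documents, the class names and the add-one (Laplace) smoothing with `alpha=1.0` are assumptions added for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels, alpha=1.0):
    """Estimate log P(y) and log P(word | y) with additive smoothing."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def posterior(doc, log_prior, log_likelihood):
    """Return P(y | x) for every class y."""
    scores = {}
    for c in log_prior:
        # log P(x, y) = log P(y) + sum_i log P(x_i | y); unknown words are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc.split() if w in log_likelihood[c]
        )
    # Divide by P(x) = sum over y of P(x, y).
    log_evidence = math.log(sum(math.exp(s) for s in scores.values()))
    return {c: math.exp(s - log_evidence) for c, s in scores.items()}

docs = ["broken button", "page is broken",
        "plan next sprint", "sprint planning meeting"]
labels = ["Bug", "Bug", "Planning", "Planning"]

log_prior, log_likelihood = train_naive_bayes(docs, labels)
post = posterior("broken page", log_prior, log_likelihood)
print(post)
```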
Validation
Overfitting!
Cross-validation
dataset
Learning
Validating
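A minimal cross-validation sketch with scikit-learn: the dataset is split into 5 folds, and each fold is used once for validating while the other 4 are used for learning. The synthetic dataset from `make_classification` is a stand-in, not the Trello data, and the fold count is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (stand-in for real features).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# One accuracy score per fold, each computed on data the model did not see.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)
print(scores.mean())
```

A large gap between the training score and these held-out scores is a typical sign of overfitting.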
Score function
S(y,\hat{y}) = \frac{1}{\#\{j\;\vert\; y_j = 1\}} \sum_{j\;\vert\;y_j = 1} \frac{\#\{k\;\vert\; y_k=1 \;\wedge\; \hat{y}_k \geq \hat{y}_j \}}{\#\{k\;\vert\; \hat{y}_k \geq \hat{y}_j \}}
S: [0,1]^N \times [0,1]^N \longrightarrow [0,1]
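The score function can be written directly from its definition. The sketch below is a pure-Python transcription for a single card. The function name `score` and the example vectors are illustrative, and raising an error when no label is set is an assumption, since the formula is undefined in that case.

```python
def score(y_true, y_pred):
    """S(y, y_hat): for each true label j, the fraction of true labels among
    the labels ranked at least as high as j, averaged over all true labels."""
    positives = [j for j, y in enumerate(y_true) if y == 1]
    if not positives:
        raise ValueError("score is undefined when no label is set")
    total = 0.0
    for j in positives:
        # Labels whose predicted probability is at least that of label j.
        ranked = [k for k, p in enumerate(y_pred) if p >= y_pred[j]]
        hits = sum(1 for k in ranked if y_true[k] == 1)
        total += hits / len(ranked)
    return total / len(positives)

# Both true labels are ranked first: best possible score.
perfect = score([0, 1, 1], [0.09, 0.74, 0.52])
# The only true label is ranked last of three.
poor = score([1, 0, 0], [0.09, 0.74, 0.52])
print(perfect, poor)
```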
Demo time!
Trello cards labelling with SciKit-Learn
By Antoine Toubhans