How to set up an NLP analysis

in 15 min?

Use case: a Trello board

Antoine Toubhans, Paris NLP Meetup #6, July 26, 2017

Front-end

Back-end

Trello API

My server

browser

within Trello world

outside Trello

Chrome plugin

POST /train

GET /cards

POST /predict

Raw data

Clean data

Model

Score

Fetch

Preprocess

Train

Optimize

Clean

Predict

https://github.com/AntoineToubhans/trellearn

Front

Back

Trello API

cards.json

card.json

sorted_labels.json

My server (learning)

Preprocessing

Learning

Validating

[{
    "id": "57c8222ca15bee7064d24487",
    "closed": false,
    "dateLastActivity": "2016-09-29T08:54:47.064Z",
    "desc": "Trouver une image de singe amusant et (...)",
    "idBoard": "526a7df0e4fb2b90030021ac",
    "idList": "57ab3d43083b138ec2af5ef0",
    "idAttachmentCover": null,
    "name": "ETQU, je vois un singe amusant sur (...)",
    (...)
    "due": null,
    "idChecklists": [],
    "idMembers": [],
    "labels": [
        {
            "id": "566adb33fb396fe706d2d5a9",
            "idBoard": "526a7df0e4fb2b90030021ac",
            "name": "Bug",
            "color": "pink",
            "uses": 127
        },
        {
            "id": "56557ef4fb396fe706bac607",
            "idBoard": "526a7df0e4fb2b90030021ac",
            "name": "Régression",
            "color": "black",
            "uses": 26
        },
        {
            "id": "57ad88a784e677fd36e6c417",
            "idBoard": "526a7df0e4fb2b90030021ac",
            "name": "Prochain sprint",
            "color": "green",
            "uses": 7
        }
    ],
    "subscribed": false
}, {
    (...) 
}]
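
The "Fetch" step can be sketched as building a request URL for the Trello REST API. This is only an illustration and not necessarily how the trellearn server does it: the helper name `build_cards_url`, the credentials, and the choice of `fields` are assumptions. No network call is made here.

```python
from urllib.parse import urlencode

TRELLO_API = "https://api.trello.com/1"

def build_cards_url(board_id, key, token):
    """Return the URL listing the cards of a board (hypothetical helper)."""
    query = urlencode({
        "key": key,                     # Trello API key (placeholder value below)
        "token": token,                 # Trello user token (placeholder value below)
        "fields": "name,desc,labels",   # assumed subset of card fields
    })
    return f"{TRELLO_API}/boards/{board_id}/cards?{query}"

url = build_cards_url("526a7df0e4fb2b90030021ac", "my-key", "my-token")
print(url)
```

The response of such a request would be a JSON array shaped like the payload above, which could then be saved as cards.json.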
x_i = [0, \ldots, 3, 2, \ldots]
y_i = [0, \ldots, 1, 1, 1]

Preprocessing

X = \left[ \begin{array}{c} x_0 \\ \vdots \\ x_N \end{array} \right]
Y = \left[ \begin{array}{c} y_0 \\ \vdots \\ y_N \end{array} \right]
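
A minimal sketch of how cards could be turned into the matrices X and Y, assuming a bag-of-words count over the card name and description and a 0/1 indicator per label. The slides do not say which vectorizer is used, so `CountVectorizer` and `MultiLabelBinarizer` are assumptions here, and the toy cards are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy cards, shaped like the Trello API payload.
cards = [
    {"name": "Login page crashes", "desc": "crash on submit",
     "labels": [{"name": "Bug"}]},
    {"name": "Add export button", "desc": "export cards to csv",
     "labels": [{"name": "Feature"}]},
    {"name": "Export crashes again", "desc": "",
     "labels": [{"name": "Bug"}, {"name": "Regression"}]},
]

# One text and one list of label names per card.
texts = [card["name"] + " " + card["desc"] for card in cards]
label_names = [[label["name"] for label in card["labels"]] for card in cards]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)       # sparse matrix, one row x_i per card

binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(label_names)  # 0/1 matrix, one row y_i per card

print(X.shape, Y.shape)
print(binarizer.classes_)
```

Each row of X counts word occurrences in one card, and each row of Y marks which labels that card carries.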

Learning

(Figure: axes x_1 and x_2)

Logistic regression

\omega_0 + \omega_1 x_1 + \omega_2 x_2 \leq 0
w^T x \leq 0
where w = (\omega_0, \omega_1, \omega_2) and x = (1, x_1, x_2)
\sigma: t \mapsto \frac{1}{1+e^{-t}}
C(w) = \sum_{i=0}^N (y_i - \sigma(w^T x_i))^2
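
The sigmoid and the cost C(w) above can be written down directly. This is a sketch of the formulas as they appear on the slide, not of scikit-learn's internal solver; the toy data is hypothetical and each row of X starts with a 1 so that the first weight plays the role of the intercept.

```python
import numpy as np

def sigmoid(t):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def cost(w, X, y):
    """Squared-error cost C(w), summed over all rows x_i of X."""
    return float(np.sum((y - sigmoid(X @ w)) ** 2))

# Hypothetical data: leading column of ones, then x_1 and x_2.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 2.0, 1.0],
              [1.0, -1.0, -2.0]])
y = np.array([0.0, 1.0, 0.0])
w = np.zeros(3)

print(sigmoid(0.0))   # 0.5
print(cost(w, X, y))  # 0.75: every prediction is 0.5, so each term is 0.25
```

Training then amounts to searching for the w that makes this cost as small as possible.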
(Figure: axes x_1 and x_2)

Multi-class Learning

(Figure: three one-vs-rest splits, each class vs. {the two other classes})
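
The one-vs-rest idea can be sketched with scikit-learn's `OneVsRestClassifier`, which fits one binary classifier per class. The data points below are hypothetical and chosen to be well separated; using logistic regression as the base classifier is an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical 2-D points for three classes.
X = np.array([[0.0, 0.0], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9],
              [0.0, 5.0], [0.1, 5.2]])
y = np.array([0, 0, 1, 1, 2, 2])

# One binary "this class vs. the others" model is trained per class.
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, y)

print(len(clf.estimators_))
print(clf.predict([[0.1, 0.0], [5.0, 5.1], [0.0, 4.8]]))
```

At prediction time the class whose binary model is the most confident wins.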

Multi-label Learning

(Figure: axes x_1 and x_2, with per-label probabilities p=0.74, p=0.52, p=0.09)
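
In the multi-label setting each label gets its own probability, and the probabilities need not sum to 1. A minimal sketch, assuming the same one-vs-rest wrapper fed with a 0/1 indicator matrix; the data is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [1.0, 1.0], [2.0, 2.0], [2.0, 0.0]])
# Indicator matrix: one column per label, several 1s allowed per row.
Y = np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0],
              [1, 1, 0], [1, 1, 0], [1, 0, 0]])

clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, Y)

# One independent probability per label for the new point.
proba = clf.predict_proba([[1.5, 0.5]])
print(proba.shape)
print(proba)
```

Labels can then be suggested by ranking these probabilities or by keeping those above a chosen threshold.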

Naive Bayes

P(y|x) = \frac{P(x, y)}{P(x)} = \frac{P(x|y)P(y)}{P(x)}

Hypothesis:

P(x_i|y,x_1,\ldots,x_{i-1},x_{i+1},\ldots) = P(x_i|y)
P(x,y) = P(x_1|x_2,\ldots,y) P(x_2,\ldots,y)
       = P(x_1|y) P(x_2,\ldots,y)
       = \prod_{i=1}^N P(x_i|y) \, P(y)
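
The factorization above can be checked on a tiny worked example. The probabilities below are hypothetical, with two binary features and a binary label; the code only applies Bayes' rule together with the independence hypothesis.

```python
from math import prod

# Hypothetical parameters.
prior = {0: 0.5, 1: 0.5}     # P(y)
likelihood = {               # P(x_i = 1 | y), one entry per feature
    0: [0.2, 0.5],
    1: [0.8, 0.5],
}

def joint(x, y):
    """P(x, y) = P(y) * prod_i P(x_i | y) under the independence hypothesis."""
    factors = [p if x_i == 1 else 1.0 - p for x_i, p in zip(x, likelihood[y])]
    return prior[y] * prod(factors)

def posterior(x, y):
    """P(y | x) = P(x, y) / P(x), where P(x) sums the joint over all labels."""
    evidence = sum(joint(x, label) for label in prior)
    return joint(x, y) / evidence

print(posterior([1, 0], 1))  # about 0.8: 0.2 / (0.2 + 0.05)
```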

Validation

Overfitting!

Cross-validation

dataset

Learning

Validating
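
Cross-validation splits the dataset into several folds, learning on all but one and validating on the one left out, in turn. A minimal sketch with scikit-learn's `cross_val_score`, on hypothetical synthetic data; the 5-fold split and the default accuracy score are assumptions, not the talk's settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical dataset: 60 points, label depends on the sign of x_1 + x_2.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Five rounds: learn on four fifths of the data, validate on the rest.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)

print(scores)
print(scores.mean())
```

A model that scores well on its training data but poorly on the held-out folds is overfitting.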

Score function

S(y,\hat{y}) = \frac{1}{\#\{j\;\vert\; y_j = 1\}} \sum_{j\;\vert\;y_j = 1} \frac{ \#\{k\;\vert\; y_k=1 \;\wedge\; \hat{y}_k \geq \hat{y}_j \} }{ \#\{k\;\vert\; \hat{y}_k \geq \hat{y}_j \} }
S: [0,1]^N \times [0,1]^N \longrightarrow [0,1]
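
The score function can be transcribed almost literally. The sketch below follows the formula for a single card: for each true label j, it measures the share of true labels among the labels scored at least as high as j, then averages over the true labels. The function name `score` and the example inputs are illustrative.

```python
def score(y, y_hat):
    """S(y, y_hat) for one card: y holds 0/1 labels, y_hat predicted scores."""
    positives = [j for j, y_j in enumerate(y) if y_j == 1]
    if not positives:
        raise ValueError("S is undefined when no label is set")
    total = 0.0
    for j in positives:
        # Labels ranked at or above label j.
        at_or_above = [k for k in range(len(y)) if y_hat[k] >= y_hat[j]]
        # Among those, the labels that are actually set.
        hits = [k for k in at_or_above if y[k] == 1]
        total += len(hits) / len(at_or_above)
    return total / len(positives)

print(score([0, 1, 1], [0.09, 0.74, 0.52]))  # 1.0: both true labels rank first
print(score([1, 0, 0], [0.09, 0.74, 0.52]))  # the only true label ranks last
```

For a single sample, this formula matches the definition of label ranking average precision given in the scikit-learn documentation.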

Demo time!

Trello cards labelling with SciKit-Learn

By Antoine Toubhans
