How to set up an NLP analysis in 15 min?

Use case: a Trello board

Antoine Toubhans, Paris NLP Meetup #6, July 26, 2017

Front-end

Back-end

Trello API

My server

browser

within Trello world

outside Trello

Chrome plugin

POST /train

GET /cards

POST /predict

Raw data

Clean data

Model

Score

Fetch

Preprocess

Train

Optimize

Clean

Predict

https://github.com/AntoineToubhans/trellearn

Front

Back

Trello API

cards.json

card.json

sorted_labels.json

My server (learning)

Preprocessing

Learning

Validating

[{
    "id": "57c8222ca15bee7064d24487",
    "closed": false,
    "dateLastActivity": "2016-09-29T08:54:47.064Z",
    "desc": "Trouver une image de singe amusant et (...)",
    "idBoard": "526a7df0e4fb2b90030021ac",
    "idList": "57ab3d43083b138ec2af5ef0",
    "idAttachmentCover": null,
    "name": "ETQU, je vois un singe amusant sur (...)",
    (...)
    "due": null,
    "idChecklists": [],
    "idMembers": [],
    "labels": [
        {
            "id": "566adb33fb396fe706d2d5a9",
            "idBoard": "526a7df0e4fb2b90030021ac",
            "name": "Bug",
            "color": "pink",
            "uses": 127
        },
        {
            "id": "56557ef4fb396fe706bac607",
            "idBoard": "526a7df0e4fb2b90030021ac",
            "name": "Régression",
            "color": "black",
            "uses": 26
        },
        {
            "id": "57ad88a784e677fd36e6c417",
            "idBoard": "526a7df0e4fb2b90030021ac",
            "name": "Prochain sprint",
            "color": "green",
            "uses": 7
        }
    ],
    "subscribed": false
}, {
    (...) 
}]
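A minimal sketch of how one entry of a cards.json export like the one above could be turned into a (text, label names) pair. The helper name `card_to_example`, the sample card contents, and the choice to join `name` and `desc` into one text are assumptions made for illustration; they are not necessarily what trellearn does.

```python
import json

# A tiny stand-in for cards.json, using only fields shown above.
# The card contents are made up for illustration.
SAMPLE_CARDS = json.loads("""
[
  {"name": "Login page crashes", "desc": "Stack trace on submit",
   "labels": [{"name": "Bug"}, {"name": "Régression"}]},
  {"name": "Plan next sprint", "desc": "", "labels": []}
]
""")


def card_to_example(card):
    """Return (text, label_names) for one Trello card dict."""
    # Assumption: the card title and description are joined into one text.
    text = (card.get("name", "") + " " + card.get("desc", "")).strip()
    label_names = [label["name"] for label in card.get("labels", [])]
    return text, label_names


examples = [card_to_example(card) for card in SAMPLE_CARDS]
```

Each pair then feeds the preprocessing step: the text becomes a feature vector and the label names become a target vector.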

x_i = [0, \ldots, 3, 2, \ldots]

y_i = [0, \ldots, 1, 1, 1]

Preprocessing

X = \left[ \begin{array}{c} x_0 \\ \vdots \\ x_N \end{array} \right]

Y = \left[ \begin{array}{c} y_0 \\ \vdots \\ y_N \end{array} \right]
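One way to read the preprocessing step: each x_i counts how often each vocabulary word occurs in a card's text, and each y_i marks which labels the card carries. The sketch below builds X and Y that way in plain Python. The tokenizer, the toy texts, and all function names are assumptions; scikit-learn offers `CountVectorizer` and `MultiLabelBinarizer` for the same job.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_vocabulary(texts):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vector(text, vocabulary):
    """x_i: number of occurrences of each vocabulary word in the text."""
    vec = [0] * len(vocabulary)
    for tok in tokenize(text):
        if tok in vocabulary:
            vec[vocabulary[tok]] += 1
    return vec


def label_vector(label_names, all_labels):
    """y_i: 1 where the card carries the label, 0 elsewhere."""
    return [1 if name in label_names else 0 for name in all_labels]


# Toy data, made up for illustration.
texts = ["bug bug on login", "plan the sprint"]
labels = [["Bug"], ["Prochain sprint"]]
all_labels = ["Bug", "Régression", "Prochain sprint"]

vocabulary = build_vocabulary(texts)
X = [count_vector(text, vocabulary) for text in texts]
Y = [label_vector(names, all_labels) for names in labels]
```

Words that were not seen when the vocabulary was built are simply ignored by `count_vector`, which is the usual behaviour at prediction time.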

Learning

[Figure: axes x_1 and x_2]

Logistic regression

w_0 + w_1 x_1 + w_2 x_2 \leq 0

w^T x \leq 0

\sigma: t \mapsto \frac{1}{1+e^{-t}}

C(w) = \sum_{i=0}^N (y_i - \sigma(w^T x_i))^2
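The formulas above can be tried out on a toy problem. The sketch below implements the sigmoid, the squared-error cost C(w) as written on the slide, and plain gradient descent on it. The data, the learning rate, and the number of steps are assumptions chosen for illustration. Note that scikit-learn's `LogisticRegression` minimises a regularised log loss rather than this squared error.

```python
import math


def sigmoid(t):
    """sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def predict_proba(w, x):
    """sigma(w^T x), with x[0] = 1 so that w[0] plays the role of the bias."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))


def cost(w, X, y):
    """C(w) = sum_i (y_i - sigma(w^T x_i))^2."""
    return sum((yi - predict_proba(w, xi)) ** 2 for xi, yi in zip(X, y))


def gradient_step(w, X, y, rate=0.5):
    """One gradient-descent step on C(w)."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        p = predict_proba(w, xi)
        # d/dw_j (y - p)^2 = -2 (y - p) p (1 - p) x_j
        factor = -2.0 * (yi - p) * p * (1.0 - p)
        for j, xj in enumerate(xi):
            grad[j] += factor * xj
    return [wj - rate * gj for wj, gj in zip(w, grad)]


# Toy data: each row is [1, x_1, x_2]; the leading 1 carries the bias w_0.
X = [[1, 0.0, 0.0], [1, 0.0, 1.0], [1, 1.0, 0.0], [1, 1.0, 1.0]]
y = [0, 0, 1, 1]

w = [0.0, 0.0, 0.0]
initial_cost = cost(w, X, y)
for _ in range(200):
    w = gradient_step(w, X, y)
final_cost = cost(w, X, y)
```

After training, points with w^T x \leq 0 get a probability of at most 0.5 and fall on the negative side of the boundary.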

[Figure: axes x_1 and x_2]

Multi-class Learning

[Figure: three panels, each showing one class vs { the other classes }]
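Assuming this slide illustrates the one-vs-rest reduction (one binary classifier per class, each separating that class from all the others), a minimal sketch of the two pieces involved could look as follows. The class names, scores, and function names are hypothetical.

```python
def one_vs_rest_targets(class_labels, classes):
    """For each class, build a binary target: 1 for that class, 0 for the rest."""
    return {c: [1 if label == c else 0 for label in class_labels] for c in classes}


def predict_class(scores_by_class):
    """Pick the class whose binary classifier gives the highest score."""
    return max(scores_by_class, key=scores_by_class.get)


# Toy multi-class problem, made up for illustration.
class_labels = ["red", "green", "blue", "green"]
classes = ["red", "green", "blue"]
targets = one_vs_rest_targets(class_labels, classes)

# Hypothetical scores returned by the three binary classifiers for one point.
winner = predict_class({"red": 0.2, "green": 0.7, "blue": 0.4})
```

Each entry of `targets` is an ordinary binary problem, so any binary learner such as logistic regression can be trained on it.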

Multi-label Learning

[Figure: axes x_1 and x_2, with per-label probabilities p=0.74, p=0.52, p=0.09]
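In the multi-label setting a card may carry several labels at once, so each label gets its own probability instead of competing for a single winner. The sketch below ranks labels by probability and keeps those above a threshold. The 0.5 threshold and the mapping of the three probabilities to label names are assumptions.

```python
def rank_labels(probabilities):
    """Sort label names by decreasing predicted probability."""
    return sorted(probabilities, key=probabilities.get, reverse=True)


def select_labels(probabilities, threshold=0.5):
    """Keep every label whose probability reaches the threshold."""
    return [name for name in rank_labels(probabilities)
            if probabilities[name] >= threshold]


# Probabilities as on the slide; which label each belongs to is made up.
probabilities = {"Bug": 0.74, "Régression": 0.52, "Prochain sprint": 0.09}

ranking = rank_labels(probabilities)
suggested = select_labels(probabilities)
```

Returning the full ranking rather than a thresholded set lets a front-end show the most likely labels first and leave the final choice to the user.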

Naive Bayes

P(y|x) = \frac{P(x, y)}{P(x)} = \frac{P(x|y)P(y)}{P(x)}

Hypothesis:

P(x_i|y,x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_N) = P(x_i|y)

P(x,y) = P(x_1|x_2,\ldots,x_N,y) P(x_2,\ldots,x_N,y)

= P(x_1|y) P(x_2,\ldots,x_N,y)

= \prod_{i=1}^N P(x_i|y) \, P(y)
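The factorisation above turns classification into counting. The sketch below is a small multinomial variant working on word-count vectors: every word occurrence is treated as an independent draw given y, so each P(x_i|y) is raised to the power of the word's count. The Laplace smoothing parameter `alpha`, the toy data, and the function names are assumptions added for illustration.

```python
import math


def train_naive_bayes(X, y, alpha=1.0):
    """Estimate log P(y) and log P(word | y) from count vectors X and labels y."""
    n_features = len(X[0])
    model = {}
    for c in sorted(set(y)):
        rows = [xi for xi, yi in zip(X, y) if yi == c]
        # Total count of each word over the documents of class c.
        word_counts = [sum(row[j] for row in rows) for j in range(n_features)]
        total = sum(word_counts) + alpha * n_features
        model[c] = {
            "log_prior": math.log(len(rows) / len(X)),
            # Assumption: Laplace smoothing (alpha) avoids log(0) for unseen words.
            "log_likelihood": [math.log((wc + alpha) / total) for wc in word_counts],
        }
    return model


def log_joint(model, x, c):
    """log P(x, y=c) = log P(y=c) + sum_i x_i * log P(word_i | y=c)."""
    params = model[c]
    return params["log_prior"] + sum(
        xi * ll for xi, ll in zip(x, params["log_likelihood"])
    )


def predict(model, x):
    """Return the class maximising P(x, y), hence P(y | x) since P(x) is constant."""
    return max(model, key=lambda c: log_joint(model, x, c))


# Toy count vectors over the vocabulary ["crash", "error", "sprint", "plan"].
X = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
y = [1, 1, 0, 0]  # 1 = carries the label, 0 = does not

model = train_naive_bayes(X, y)
```

Working with logarithms replaces the product by a sum, which avoids numerical underflow when a text contains many words.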

Validation

Overfitting!

Cross-validation

dataset

Learning

Validating
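Cross-validation splits the dataset several times into a learning part and a validating part, so that every sample is used for validation exactly once. A minimal sketch of a k-fold split follows. The function name is hypothetical, and it assumes the samples are already in random order, since no shuffling is done here.

```python
def k_fold_indices(n_samples, k):
    """Yield (learning, validating) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validating = indices[start:start + size]
        learning = indices[:start] + indices[start + size:]
        yield learning, validating
        start += size


folds = list(k_fold_indices(n_samples=7, k=3))
```

The model is trained on each `learning` part and scored on the matching `validating` part; averaging the k scores gives an estimate that is less sensitive to overfitting than a score measured on the training data.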

Score function

S(y,\hat{y}) = \frac{1}{\#\{j\;\vert\; y_j = 1\}} \sum_{j\;|\;y_j = 1} \frac{ \#\{k\;\vert\; y_k=1 \;\wedge\; \hat{y}_k \geq \hat{y}_j \} }{ \#\{k\;\vert\; \hat{y}_k \geq \hat{y}_j \} }

S: [0,1]^N \times [0,1]^N \longrightarrow [0,1]
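The score can be computed directly from its definition: for every label j that is really set, look at all labels ranked at least as high as j and measure which fraction of them are really set, then average over j. The sketch below does exactly that for one card. The function name and the example vectors are assumptions. The formula appears to coincide with the per-sample label ranking average precision that scikit-learn exposes as `label_ranking_average_precision_score`.

```python
def ranking_score(y, y_hat):
    """S(y, y_hat): mean, over true labels j, of the precision at j's rank."""
    true_labels = [j for j, yj in enumerate(y) if yj == 1]
    if not true_labels:
        raise ValueError("the score is undefined when no label is set")
    total = 0.0
    for j in true_labels:
        # Labels ranked at least as high as label j.
        at_least = [k for k, pk in enumerate(y_hat) if pk >= y_hat[j]]
        # Among those, the ones that are actually set.
        hits = [k for k in at_least if y[k] == 1]
        total += len(hits) / len(at_least)
    return total / len(true_labels)


# All true labels ranked first: the best possible score.
perfect = ranking_score([1, 0, 1], [0.9, 0.1, 0.8])
# One true label ranked last, behind a wrong one.
mixed = ranking_score([1, 0, 1], [0.74, 0.52, 0.09])
```

A score of 1 means every true label is ranked above every wrong one; the second example loses points because the label with probability 0.09 is ranked behind a label the card does not carry.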

Demo time!

Trello cards labelling with SciKit-Learn

By Antoine Toubhans
