*(Spoiler Alert: It's not that simple)*

What is this?

Car

Patterns from previous information

WTF is happening here?

Black Magic?

Neuroscience tries to understand

AI/ML/Mathematics/Computer Science try to imitate through Mathematical Models

Information Theory

Linear Algebra

Statistics

Probability

Model

Decision?

Judgment?

Sentence?

Court clerk's act?

Training Data

Learning Algorithm

Model

Incoming Data

Predictions
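The flow above (training data, learning algorithm, model, incoming data, predictions) can be sketched in a few lines of scikit-learn. This is only an illustration: the toy texts, the label names, and the choice of `CountVectorizer` with `LogisticRegression` are assumptions made for the example, not what the talk used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Training data: texts with known labels (toy examples invented for this sketch).
train_texts = [
    "the court grants the appeal",
    "the court denies the appeal",
    "the clerk schedules the hearing",
    "the clerk files the hearing",
]
train_labels = ["decision", "decision", "clerk_act", "clerk_act"]

# Learning algorithm + training data -> model.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
model = LogisticRegression().fit(X_train, train_labels)

# Incoming data -> predictions.
incoming_texts = ["the court grants the request", "the clerk schedules a meeting"]
predictions = model.predict(vectorizer.transform(incoming_texts))
print(list(predictions))
```

The same vectorizer that was fitted on the training data has to transform the incoming data, otherwise the model sees columns in a different order.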

Telepathy: Platform for data training

Xavier

Phase to select/extract features

Using TF-IDF to extract word importance

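```
from sklearn.feature_extraction.text import TfidfVectorizer

# stopwords is a helper module that is not shown on the slides;
# X is the list of raw document texts.
vectorizer = TfidfVectorizer(sublinear_tf=True,        # use 1 + log(tf) instead of raw counts
                             stop_words=stopwords.get_stop_words(),
                             token_pattern=r'\w{4,}',  # only tokens with 4+ word characters
                             max_features=10000,       # keep the 10,000 most frequent terms
                             ngram_range=(1, 1),       # unigrams only
                             strip_accents='unicode',
                             norm='l2')                # scale every row to unit length
Vectorized_X = vectorizer.fit_transform(X)
```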
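To make the weighting less of a black box, here is a minimal pure-Python sketch of what `sublinear_tf=True` and `norm='l2'` compute, using the smoothed idf formula that scikit-learn documents as its default. The three toy documents are invented for illustration, and tokenization is a plain whitespace split rather than the `token_pattern` above.

```python
import math
from collections import Counter

# Toy corpus, invented for this sketch.
docs = [
    "appeal granted court",
    "appeal denied court court",
    "hearing scheduled clerk",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Return {term: weight} for one document, L2-normalized."""
    counts = Counter(tokens)
    weights = {}
    for term, tf in counts.items():
        sublinear_tf = 1.0 + math.log(tf)                     # sublinear_tf=True
        idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0   # smoothed idf
        weights[term] = sublinear_tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))    # norm='l2'
    return {term: w / norm for term, w in weights.items()}


vectors = [tfidf(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in sorted(vec.items())})
```

In the first document, "granted" appears in only one document and therefore gets a higher weight than "court", which appears in two.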

Model

Use pipeline to unify the process

Use Grid Search to tune parameters

Main Pipeline

- Length Transformer
- Inner Pipeline
  - TFIDF vectorizer
  - Dense Transformer
- Estimators (calling transform())
  - Perceptron
  - Logistic Regression
  - LinearSVC
- Extra Randomized Trees

Extract features and vectorize them

Transform the feature vector

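```
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# LengthTransformer and DenseTransformer are project-specific transformers
# (not shown on the slides); vectorizer is the TfidfVectorizer defined earlier.
pipeline = Pipeline([
    # Stage 1: document length side by side with dense TF-IDF features
    ('features', FeatureUnion([
        ('lengthtransformer', LengthTransformer()),
        ('tfidf', Pipeline([
            ('vect', vectorizer),
            ('to_dense', DenseTransformer()),
        ])),
    ])),
    # Stage 2: three linear models used as transformers. FeatureUnion calls
    # transform() on each member, which these estimators only exposed in
    # older scikit-learn releases.
    ('estimators', FeatureUnion([
        ('perceptron', Perceptron(alpha=0.0001)),
        ('lr', LogisticRegression(C=5)),
        ('linearsvc', LinearSVC(dual=True, C=5)),
    ])),
    # Stage 3: final classifier
    ('clf', ExtraTreesClassifier(n_estimators=70))
])
```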
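The slides do not show `LengthTransformer` or `DenseTransformer`, and current scikit-learn estimators no longer expose `transform()`. The sketch below is one way the pipeline could be made to run today. The two transformer classes are hypothetical reconstructions, `SelectFromModel` is swapped in as the documented replacement for the old estimator `transform()`, and the toy texts and labels are invented for the example.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC


class LengthTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical: one feature per document, its length in characters."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(text)] for text in X], dtype=float)


class DenseTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical: turn a SciPy sparse matrix into a dense NumPy array."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.toarray()


pipeline = Pipeline([
    ('features', FeatureUnion([
        ('lengthtransformer', LengthTransformer()),
        ('tfidf', Pipeline([
            ('vect', TfidfVectorizer(sublinear_tf=True)),
            ('to_dense', DenseTransformer()),
        ])),
    ])),
    # Each linear model keeps the features it weights most heavily;
    # the three selections are concatenated for the final classifier.
    ('estimators', FeatureUnion([
        ('perceptron', SelectFromModel(Perceptron(alpha=0.0001))),
        ('lr', SelectFromModel(LogisticRegression(C=5))),
        ('linearsvc', SelectFromModel(LinearSVC(dual=True, C=5))),
    ])),
    ('clf', ExtraTreesClassifier(n_estimators=70, random_state=0)),
])

# Toy data, invented for this sketch.
texts = [
    "the court grants the appeal",
    "the court denies the appeal",
    "the court dismisses the claim",
    "the clerk schedules the hearing",
    "the clerk files the hearing notice",
    "the clerk mails the summons",
]
labels = ["decision"] * 3 + ["clerk_act"] * 3

pipeline.fit(texts, labels)
predicted = pipeline.predict(["the court grants the request"])
print(predicted)
```

With this wrapping, the inner model's parameters gain one more level in their names, for example `estimators__lr__estimator__C` instead of `estimators__lr__C`.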

Pipeline's interface is similar to an Estimator's interface: it has fit() and transform()

Tuning Pipeline's parameters

Grid Search

Useful to search best parameters automatically

It receives an estimator (the Pipeline in this case) and a dictionary of parameters, each with a list of values to try

Parameters Grid

Best parameters combination
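How a parameters grid expands into candidate combinations can be seen with scikit-learn's `ParameterGrid`. The grid values below are illustrative assumptions, not the ones used in the talk; only the key names follow the pipeline's step names.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: keys follow <step>__<sub-step>__<parameter>,
# values are the alternatives to try for each parameter.
parameters = {
    'features__tfidf__vect__max_features': [5000, 10000],
    'features__tfidf__vect__ngram_range': [(1, 1), (1, 2)],
    'clf__n_estimators': [50, 70, 100],
}

combinations = list(ParameterGrid(parameters))
print(len(combinations))  # 2 * 2 * 3 = 12 candidate settings
print(combinations[0])
```

Grid search fits the estimator once per combination and per cross-validation fold, so the cost grows multiplicatively with every parameter added.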

```
from sklearn.model_selection import GridSearchCV

def tune_parameters(clf, X, y):
    # Keys follow <step>__<sub-step>__<parameter>; each list holds the values to try.
    parameters = {
        'features__tfidf__vect__max_features': [10000],
        'features__tfidf__vect__ngram_range': [(1, 1)],
        'estimators__lr__C': [5],
        'estimators__linearsvc__dual': [True],
        'estimators__linearsvc__C': [5],
        'estimators__perceptron__alpha': [0.0001],
        'clf__n_estimators': [70],
    }
    # FOLDS is the number of cross-validation folds, defined elsewhere in the project.
    grid_search = GridSearchCV(clf, parameters, verbose=True, cv=FOLDS)
    grid_search.fit(X, y)
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
```

Grid Search runs tasks in parallel
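In scikit-learn the parallelism is controlled by the `n_jobs` argument of `GridSearchCV`. A minimal sketch, assuming a much smaller pipeline than the one in the talk plus invented toy data and grid values:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data, invented for this sketch.
texts = [
    "the court grants the appeal",
    "the court denies the appeal",
    "the court dismisses the claim",
    "the court upholds the ruling",
    "the clerk schedules the hearing",
    "the clerk files the hearing notice",
    "the clerk mails the summons",
    "the clerk records the filing",
]
labels = ["decision"] * 4 + ["clerk_act"] * 4

pipeline = Pipeline([
    ('vect', TfidfVectorizer()),
    ('lr', LogisticRegression()),
])

parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'lr__C': [1, 5],
}

# n_jobs=2 evaluates candidates in two worker processes; n_jobs=-1 uses all cores.
grid_search = GridSearchCV(pipeline, parameters, cv=2, n_jobs=2)
grid_search.fit(texts, labels)
print(grid_search.best_params_)
```

Every (combination, fold) pair is an independent fit, which is why the search parallelizes so easily.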

And it works

Ensemble Technique

*- Pedro Domingos, University of Washington*

Train more documents

Re-train wrong data

Approves!

Make it a service!

Expose API method to classify documents
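What such an API method could look like, as a rough sketch using only the Python standard library. The `/classify` path, the JSON shape, and the `classify()` stub are all assumptions made for illustration; in a real service the stub would call the trained pipeline.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def classify(text):
    """Placeholder for the trained model, e.g. pipeline.predict([text])[0]."""
    return "decision" if "court" in text.lower() else "clerk_act"


class ClassifyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Hypothetical endpoint: POST /classify with a body like {"text": "..."}
        if self.path != "/classify":
            self.send_error(404, "Unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
            text = payload["text"]
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "Expected a JSON body with a 'text' field")
            return
        body = json.dumps({"label": classify(text)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_server(host="127.0.0.1", port=8000):
    """Build the HTTP server; call serve_forever() on the result to run it."""
    return HTTPServer((host, port), ClassifyHandler)
```

Running `make_server().serve_forever()` starts the service; a client then posts `{"text": "..."}` to `/classify` and receives `{"label": "..."}` back.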