(Spoiler Alert: It's not that simple)
What is this?
Patterns from previous information
Car
WTF is happening here?
Black Magic?
Neuroscience tries to understand it
AI/ML/Mathematics/Computer Science try to imitate it through mathematical models
Information Theory
Linear Algebra
Statistics
Probability
Model
Decision?
Judgment?
Sentence?
Clerk's act?
Training Data
Learning Algorithm
Model
Incoming Data
Predictions
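The flow above (training data, learning algorithm, model, incoming data, predictions) can be sketched with a few lines of scikit-learn. The documents, labels, and the choice of CountVectorizer with LogisticRegression are illustrative assumptions, not the setup used in this talk.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Training data: toy documents with made-up labels (illustrative only).
train_docs = [
    "the court issued a decision",
    "final sentence of the court",
    "decision on the appeal",
    "sentence for the defendant",
]
train_labels = ["decision", "sentence", "decision", "sentence"]

# Learning algorithm: fit() turns the training data into a model.
vectorizer = CountVectorizer()
model = LogisticRegression()
model.fit(vectorizer.fit_transform(train_docs), train_labels)

# Incoming data: new text is vectorized with the same vocabulary,
# then the model produces predictions.
incoming = ["a new decision was published"]
predictions = model.predict(vectorizer.transform(incoming))
print(predictions)
```

The important detail is that incoming data goes through transform(), not fit_transform(), so it is mapped onto the vocabulary learned during training.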
Telepathy: Platform for data training
Xavier
Phase to select/extract features
Using TF-IDF to extract word importance
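from sklearn.feature_extraction.text import TfidfVectorizer

# `stopwords` is the project's own helper module (not shown in these slides).
vectorizer = TfidfVectorizer(
    sublinear_tf=True,                      # use 1 + log(tf) instead of raw counts
    stop_words=stopwords.get_stop_words(),
    token_pattern=r'\w{4,}',                # keep only tokens of 4+ word characters
    max_features=10000,                     # cap the vocabulary at the 10,000 most frequent terms
    ngram_range=(1, 1),                     # unigrams only
    strip_accents='unicode',
    norm='l2')                              # scale each document vector to unit length
# X is the list of raw document texts.
Vectorized_X = vectorizer.fit_transform(X)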
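To see what "word importance" means here, a minimal sketch on three made-up documents may help. Only sublinear_tf and norm are carried over from the vectorizer above; the documents are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "appeal" occurs in two documents, "denied" in only one.
docs = [
    "appeal appeal granted",
    "appeal denied",
    "sentence issued",
]

tfidf = TfidfVectorizer(sublinear_tf=True, norm="l2")
matrix = tfidf.fit_transform(docs)
vocab = tfidf.vocabulary_          # maps each term to its column index

# Weights for the second document, "appeal denied".
row = matrix[1].toarray().ravel()
print("appeal:", row[vocab["appeal"]])
print("denied:", row[vocab["denied"]])
```

Both words occur once in that document, but "denied" is rarer across the corpus, so it receives the larger weight. With norm="l2" the squared weights of each document sum to 1.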
Model
Use a Pipeline to unify the process
Use Grid Search to tune the parameters
Main Pipeline
  Length Transformer
  Inner Pipeline
    TFIDF vectorizer
    Dense Transformer
  Estimators (calling transform())
    Perceptron
    Logistic Regression
    LinearSVC
  Extra Randomized Trees
Extract features and vectorize them
Transform the feature vector
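from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import Perceptron, LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import ExtraTreesClassifier

# LengthTransformer and DenseTransformer are custom transformers (not shown in these slides).
# The 'estimators' union relies on transform() being available on the linear models,
# as it was in older scikit-learn releases (it selected features by coefficient weight);
# where it is no longer available, SelectFromModel is the usual replacement.
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('lengthtransformer', LengthTransformer()),
        ('tfidf', Pipeline([
            ('vect', vectorizer),
            ('to_dense', DenseTransformer()),
        ])),
    ])),
    ('estimators', FeatureUnion([
        ('perceptron', Perceptron(alpha=0.0001)),
        ('lr', LogisticRegression(C=5)),
        ('linearsvc', LinearSVC(dual=True, C=5)),
    ])),
    ('clf', ExtraTreesClassifier(n_estimators=70))
])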
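The slides do not show LengthTransformer or DenseTransformer. A minimal sketch of what such custom transformers might look like follows; the bodies are assumptions (document length in characters, sparse-to-dense conversion), not the talk's actual code.

```python
import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class LengthTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical: one numeric feature per document, its length in characters."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.array([[len(doc)] for doc in X], dtype=float)


class DenseTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical: convert a sparse matrix into a dense array."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X.toarray() if sparse.issparse(X) else np.asarray(X)


# Same shape as the 'features' step above, on two toy documents.
features = FeatureUnion([
    ("lengthtransformer", LengthTransformer()),
    ("tfidf", Pipeline([
        ("vect", TfidfVectorizer()),
        ("to_dense", DenseTransformer()),
    ])),
])
docs = ["short text", "a somewhat longer text"]
combined = features.fit_transform(docs)
print(combined.shape)
```

The first column of the result holds the length feature and the remaining columns hold the TF-IDF weights, which is how FeatureUnion places the outputs of its branches side by side.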
A Pipeline's interface is similar to an Estimator's interface:
it has transform() and fit()
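Because a Pipeline exposes the same methods as its parts, it can be nested inside another Pipeline like any other step. A small sketch, with toy documents and labels that are assumptions for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

docs = ["decision on the appeal", "sentence for the defendant",
        "the court issued a decision", "final sentence of the court"]
labels = ["decision", "sentence", "decision", "sentence"]

# A pipeline made only of transformers behaves like a transformer.
inner = Pipeline([("vect", TfidfVectorizer())])
vectors = inner.fit(docs).transform(docs)

# A pipeline ending in a classifier behaves like a classifier,
# and it accepts the inner pipeline as an ordinary step.
outer = Pipeline([("inner", inner), ("clf", LogisticRegression())])
outer.fit(docs, labels)
predicted = outer.predict(docs)

# Nested parameters are addressed as step__step__parameter.
print("inner__vect__ngram_range" in outer.get_params())
```

The double-underscore names returned by get_params() are the same names used as keys when tuning a pipeline.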
Tuning Pipeline's parameters
Grid Search
Useful for finding the best parameters automatically
It receives an estimator (a Pipeline in this case) and a dictionary of parameter values to try
Parameters Grid
Best parameters combination
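A parameter grid expands into every combination of the listed values, and Grid Search evaluates each one to find the best. The sketch below uses scikit-learn's ParameterGrid to list them; the parameter names follow the pipeline in this talk, but the candidate values are assumptions for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical candidate values; keys follow the pipeline's step names.
grid = {
    "features__tfidf__vect__ngram_range": [(1, 1), (1, 2)],
    "estimators__lr__C": [1, 5, 10],
    "clf__n_estimators": [70],
}

combinations = list(ParameterGrid(grid))
print(len(combinations))  # 2 * 3 * 1 combinations

# Each combination is a plain dict that could be passed to set_params().
for params in combinations[:2]:
    print(params)
```

The number of combinations is the product of the list lengths, so every extra candidate value multiplies the number of models that have to be trained.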
from sklearn.model_selection import GridSearchCV

def tune_parameters(clf, X, y):
    # Keys follow the pipeline's step names, joined by double underscores.
    parameters = {
        'features__tfidf__vect__max_features': [10000],
        'features__tfidf__vect__ngram_range': [(1, 1)],
        'estimators__lr__C': [5],
        'estimators__linearsvc__dual': [True],
        'estimators__linearsvc__C': [5],
        'estimators__perceptron__alpha': [0.0001],
        'clf__n_estimators': [70],
    }
    # FOLDS is the number of cross-validation folds, defined elsewhere.
    grid_search = GridSearchCV(clf, parameters, verbose=True, cv=FOLDS)
    grid_search.fit(X, y)
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
Grid Search runs tasks in parallel
And it works
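In scikit-learn, the parallelism of Grid Search is controlled by the n_jobs argument of GridSearchCV. The sketch below uses a synthetic dataset and a small pipeline, both assumptions chosen to keep the example short; it is not the talk's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the vectorized documents.
X, y = make_classification(n_samples=120, n_features=10, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("lr", LogisticRegression())])
param_grid = {"lr__C": [0.1, 1, 5]}

# n_jobs=2 evaluates parameter/fold combinations in two worker processes;
# n_jobs=-1 would use all available cores.
search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=2)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Each parameter combination and cross-validation fold is an independent task, which is why the search parallelizes well.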
Ensemble Technique
- Pedro Domingos, University of Washington
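One common ensemble technique is stacking, where the outputs of several base models become the inputs of a final model. The sketch below uses scikit-learn's StackingClassifier with the same model families named in this talk. It is a swapped-in illustration on synthetic data, not a reproduction of the pipeline shown earlier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.svm import LinearSVC

# Synthetic data standing in for the vectorized documents.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("perceptron", Perceptron()),
        ("lr", LogisticRegression(C=5)),
        ("linearsvc", LinearSVC(C=5)),
    ],
    # The final model is trained on the base models' cross-validated outputs.
    final_estimator=ExtraTreesClassifier(n_estimators=70, random_state=0),
    cv=3,
)
stack.fit(X, y)
print(stack.predict(X[:5]))
```

The final estimator never sees the raw features here, only what the linear models say about each sample.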
Train more documents
Re-train on misclassified data
Approves!
Make it a service!
Expose an API method to classify documents
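A minimal sketch of such a service, using only the Python standard library, could look like the following. The /classify path, the JSON shape, and the KeywordModel stand-in are all assumptions for illustration; in practice the model would be the trained pipeline.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class KeywordModel:
    """Hypothetical stand-in for a trained pipeline: anything with predict(texts)."""

    def predict(self, texts):
        return ["sentence" if "sentence" in t.lower() else "decision" for t in texts]


def classify_payload(model, body):
    """Parse a JSON body like {"documents": [...]} and return the predicted labels."""
    documents = json.loads(body)["documents"]
    return {"predictions": [str(label) for label in model.predict(documents)]}


def make_handler(model):
    class ClassifyHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/classify":  # hypothetical endpoint name
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            try:
                result = classify_payload(model, self.rfile.read(length))
            except (ValueError, KeyError, TypeError):
                self.send_error(400, "expected JSON with a 'documents' list")
                return
            data = json.dumps(result).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    return ClassifyHandler


def serve(model, port=8000):
    """Blocks forever, answering POST /classify on localhost."""
    HTTPServer(("127.0.0.1", port), make_handler(model)).serve_forever()
```

Calling serve(KeywordModel()) starts the server. Keeping classify_payload separate from the HTTP handler makes the classification logic easy to test without opening a socket.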