JusBrasil's Machine Learning Architecture to Predict Document Labels
What is Machine Learning?
The Resurrection of Artificial Intelligence
And many more.
Machine Learning is as simple as Human Learning
(Spoiler Alert: It's not that simple)
What is this?
Car
Patterns from previous information
WTF is happening here?
Black Magic?
Neuroscience tries to understand it
AI/ML/Mathematics/Computer Science try to imitate it through mathematical models
Machine Learning is a mix of many Sciences
Information Theory
Linear Algebra
Statistics
Probability
...And JusBrasil wants to do the same
Model
Decisão (decision)?
Julgamento (judgment)?
Sentença (sentence)?
Ato Serventuário (court clerk act)?
Supervised Learning
Training Data -> Learning Algorithm -> Model
Incoming Data -> Model -> Predictions
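The flow on this slide (labeled data goes into a learning algorithm, which produces a model, which then labels incoming data) can be sketched in a few lines. This is only an illustration: the word-overlap "algorithm", the toy documents and the label names `decision` and `clerk_act` are assumptions made for the example, not JusBrasil's data or method.

```python
from collections import Counter, defaultdict

def learn(training_data):
    """Learning algorithm: build one word-count profile per label."""
    model = defaultdict(Counter)
    for text, label in training_data:
        model[label].update(text.lower().split())
    return model

def predict(model, text):
    """Pick the label whose profile shares the most words with the text."""
    words = text.lower().split()

    def overlap(label):
        return sum(model[label][word] for word in words)

    return max(sorted(model), key=overlap)

# Training data: (document text, label) pairs. Toy examples, not real data.
training_data = [
    ("the court decides to grant the appeal", "decision"),
    ("the judge decides the motion", "decision"),
    ("the clerk certifies the filing date", "clerk_act"),
    ("the clerk records the notification", "clerk_act"),
]

model = learn(training_data)                      # training data -> model
incoming = ["the court decides the case", "the clerk certifies the record"]
predictions = [predict(model, doc) for doc in incoming]  # incoming data -> predictions
print(predictions)
```

A real learning algorithm replaces `learn` and `predict`, but the shape of the process stays the same.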
The problem?
Real world datasets are messy
All we want is the perfect fit
But overfitting and underfitting are also problems
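A small numeric sketch can make under- and overfitting concrete. It assumes a made-up ground truth (a noisy quadratic) and fits polynomials of degree 0, 2 and 9 with NumPy; none of this comes from the JusBrasil dataset.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def noisy_quadratic(x):
    # Assumed ground truth for the demo: y = x^2 plus small Gaussian noise.
    return x**2 + rng.normal(scale=0.05, size=x.shape)

x_train = np.linspace(-1, 1, 10)
y_train = noisy_quadratic(x_train)
x_test = np.linspace(-0.95, 0.95, 50)   # held-out points, never used for fitting
y_test = noisy_quadratic(x_test)

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

train_error, test_error = {}, {}
for degree in (0, 2, 9):                # too simple, about right, too flexible
    model = Polynomial.fit(x_train, y_train, degree)
    train_error[degree] = mse(model, x_train, y_train)
    test_error[degree] = mse(model, x_test, y_test)
    print(degree, train_error[degree], test_error[degree])
```

The training error can only shrink as the degree grows (degree 9 passes through all 10 training points), so it cannot tell a good fit from an overfit on its own. The held-out error is the number to watch.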
Keys to good Supervised Learning:
Right dataset
Right features
Right algorithms
Right techniques
A good architecture
Architecture for Automagic Text Classification
First we need a way to gather training data
Telepathy: platform for gathering training data
Xavier
Feature Engineering
Phase to select/extract features
Using TF-IDF to extract word importance
from sklearn.feature_extraction.text import TfidfVectorizer

# stopwords: helper module not shown in the slides (presumably project-specific)
vectorizer = TfidfVectorizer(sublinear_tf=True,        # use 1 + log(tf) instead of raw counts
                             stop_words=stopwords.get_stop_words(),
                             token_pattern=r'\w{4,}',  # keep only tokens with 4+ characters
                             max_features=10000,       # keep the 10,000 most frequent terms
                             ngram_range=(1, 1),       # unigrams only
                             strip_accents='unicode',
                             norm='l2')                # scale each document vector to unit length
# X is the collection of raw document texts
Vectorized_X = vectorizer.fit_transform(X)
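To show what these settings compute, here is a dependency-free sketch of TF-IDF weighting with sublinear term frequency, smoothed IDF and L2 normalization, which is how scikit-learn documents its defaults. Stop-word removal and accent stripping are left out, and the three toy documents are invented for the example.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"\w{4,}")  # same token pattern as the vectorizer above

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [Counter(TOKEN.findall(doc.lower())) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for counts in tokenized for term in counts)
    vectors = []
    for counts in tokenized:
        weights = {}
        for term, tf in counts.items():
            sublinear_tf = 1.0 + math.log(tf)                           # sublinear_tf=True
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0   # smoothed IDF
            weights[term] = sublinear_tf * idf
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0   # norm='l2'
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = [
    "court sentence appeal court",
    "court clerk notice",
    "court appeal denied",
]
vectors = tfidf(docs)
for vector in vectors:
    print({term: round(weight, 3) for term, weight in sorted(vector.items())})
```

In the second document, "clerk" ends up with a higher weight than "court" because "court" appears in every document and therefore says little about any one of them.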
Process to create the model
Strategy #1: Use a Pipeline to unify the process
Strategy #2: Use Grid Search to tune parameters
Main Pipeline
    Step 1, extract features and vectorize them:
        Length Transformer
        Inner Pipeline: TFIDF vectorizer, then Dense Transformer
    Step 2, transform the feature vector:
        Estimators (calling transform()): Perceptron, Logistic Regression, LinearSVC
    Step 3, final classifier:
        Extra Randomized Trees
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# LengthTransformer and DenseTransformer are custom transformers (not shown);
# vectorizer is the TfidfVectorizer from the Feature Engineering step.
# Placing classifiers inside a FeatureUnion relies on them exposing transform(),
# which these linear models only did in older scikit-learn releases.
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('lengthtransformer', LengthTransformer()),
        ('tfidf', Pipeline([
            ('vect', vectorizer),
            ('to_dense', DenseTransformer()),
        ])),
    ])),
    ('estimators', FeatureUnion([
        ('perceptron', Perceptron(alpha=0.0001)),
        ('lr', LogisticRegression(C=5)),
        ('linearsvc', LinearSVC(dual=True, C=5)),
    ])),
    ('clf', ExtraTreesClassifier(n_estimators=70))
])
A Pipeline's interface is similar to an Estimator's interface: it has fit() and transform()
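The slides do not show the custom transformers, so the following is a hedged sketch of what they might look like, not the original code. `LengthTransformer` and `DenseTransformer` are guessed from their names, and `ScoresAsFeatures` is a hypothetical wrapper added here so that classifiers can sit inside a `FeatureUnion` on current scikit-learn releases, where linear models no longer have `transform()`. The toy documents and labels are invented.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC


class LengthTransformer(TransformerMixin, BaseEstimator):
    """Assumed behavior: one feature per document, its length in characters."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(doc)] for doc in X], dtype=float)


class DenseTransformer(TransformerMixin, BaseEstimator):
    """Assumed behavior: convert the sparse TF-IDF matrix to a dense array."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.toarray()


class ScoresAsFeatures(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: expose a classifier's decision scores as features."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        scores = self.estimator_.decision_function(X)
        return scores.reshape(len(scores), -1)  # always 2-D, binary or multiclass


vectorizer = TfidfVectorizer(sublinear_tf=True, token_pattern=r"\w{4,}")
pipeline = Pipeline([
    ("features", FeatureUnion([
        ("lengthtransformer", LengthTransformer()),
        ("tfidf", Pipeline([("vect", vectorizer), ("to_dense", DenseTransformer())])),
    ])),
    ("estimators", FeatureUnion([
        ("perceptron", ScoresAsFeatures(Perceptron(alpha=0.0001))),
        ("lr", ScoresAsFeatures(LogisticRegression(C=5))),
        ("linearsvc", ScoresAsFeatures(LinearSVC(dual=True, C=5))),
    ])),
    ("clf", ExtraTreesClassifier(n_estimators=70, random_state=0)),
])

# Toy documents and labels, invented for the example.
docs = [
    "the court decides to grant the appeal",
    "the judge decides the pending motion",
    "the panel decides against the plaintiff",
    "the clerk certifies the filing date",
    "the clerk records the notification",
    "the clerk attaches the certificate",
]
labels = ["decision"] * 3 + ["clerk_act"] * 3
pipeline.fit(docs, labels)
predictions = pipeline.predict(["the court hereby decides the pending appeal"])
print(predictions)
```

With such a wrapper the grid-search keys gain one level, for example `estimators__lr__estimator__C`. scikit-learn also ships a `StackingClassifier` that feeds cross-validated predictions to a final estimator, which may be a cleaner fit for the same idea.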
Tuning Pipeline's parameters
Grid Search
Useful to search for the best parameters automatically
It receives an estimator (a Pipeline in this case) and a grid of parameters
Parameters Grid -> Best parameters combination
from sklearn.model_selection import GridSearchCV

def tune_parameters(clf, X, y):
    # Keys follow the pipeline's step names: <step>__<substep>__<parameter>.
    # Each list holds the candidate values to try; with one value per list
    # the grid contains a single combination.
    parameters = {
        'features__tfidf__vect__max_features': [10000],
        'features__tfidf__vect__ngram_range': [(1, 1)],
        'estimators__lr__C': [5],
        'estimators__linearsvc__dual': [True],
        'estimators__linearsvc__C': [5],
        'estimators__perceptron__alpha': [0.0001],
        'clf__n_estimators': [70],
    }
    # FOLDS (number of cross-validation folds) is defined elsewhere in the project.
    grid_search = GridSearchCV(clf, parameters, verbose=True, cv=FOLDS)
    grid_search.fit(X, y)
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
Grid Search can run tasks in parallel (set n_jobs on GridSearchCV)
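What Grid Search does can be sketched without scikit-learn: enumerate every combination in the grid, score each one, keep the best. In the sketch below `toy_score` is a made-up stand-in for a cross-validated score, and the grid values and the fold count are assumptions chosen for illustration.

```python
from itertools import product

def grid_search(score, grid):
    """Try every combination in `grid`; return (best_score, best_params, n_tried)."""
    names = sorted(grid)
    best_score, best_params, tried = float("-inf"), None, 0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        tried += 1
        current = score(params)
        if current > best_score:
            best_score, best_params = current, params
    return best_score, best_params, tried

def toy_score(params):
    # Hypothetical score that peaks at C=5 and n_estimators=70.
    return -abs(params["C"] - 5) - abs(params["n_estimators"] - 70) / 100

grid = {"C": [1, 5, 10], "n_estimators": [10, 70, 150]}
best_score, best_params, tried = grid_search(toy_score, grid)

folds = 5  # assumed number of cross-validation folds
print(best_params, "after", tried, "combinations =", tried * folds, "model fits")
```

The number of model fits is the product of all list lengths times the number of folds, which is why the grid grows expensive quickly and why running the fits in parallel matters.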
Process to create the model
Pipeline + Grid Search
Why Random Trees as the Final Classifier?
A Random Tree is a great method
Information Theory's Information Gain worked amazingly well
Variance Reduction worked well too!
Combining many trees is cool
And it works
Ensemble Technique
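Information Gain, mentioned above as a split criterion, is the drop in entropy obtained by splitting a set of labels into subsets. A minimal sketch of the computation, using invented label lists:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["decision"] * 4 + ["sentence"] * 4
perfect_split = [["decision"] * 4, ["sentence"] * 4]           # classes fully separated
useless_split = [["decision"] * 2 + ["sentence"] * 2] * 2      # same mix on both sides

print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A tree picks the split with the highest gain at each node; an ensemble then combines the votes of many such trees.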
Results
Previous dataset (highly biased) + previous process: <60% prediction precision
Previous dataset (highly biased) + new process: ~61% prediction precision
New dataset (re-trained from scratch) + new process: 85% prediction precision (and growing)
Lessons Learned
Unbalanced datasets suck.
Datasets with too few examples suck.
Low-quality datasets suck even more.
There are good algorithms and bad algorithms for a given situation. Choose wisely.
Grid Search can be *really* slow. Make sure to run it in parallel... otherwise it can be embarrassingly slow.
Feature Union is a great idea. Don't stick only to TF-IDF.
Do not test your model against data used to train the model.
Always cross-validate.
Intuition fails in high dimensions.
More data beats a cleverer algorithm.
- Pedro Domingos, University of Washington
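The two lessons about testing can be illustrated with a deliberately bad model. The `Memorizer` below is a hypothetical classifier that stores its training set verbatim; the documents and labels are invented. It looks perfect when tested on the data it was trained on, and k-fold cross-validation exposes it.

```python
from collections import Counter

class Memorizer:
    """A deliberately bad model: it memorizes the training set verbatim."""

    def fit(self, docs, labels):
        self.known = dict(zip(docs, labels))
        self.fallback = Counter(labels).most_common(1)[0][0]  # guess for unseen docs
        return self

    def predict(self, docs):
        return [self.known.get(doc, self.fallback) for doc in docs]

def accuracy(model, docs, labels):
    hits = sum(p == y for p, y in zip(model.predict(docs), labels))
    return hits / len(labels)

def k_fold_accuracy(make_model, docs, labels, k):
    """Average accuracy over k folds; each fold is tested on data it never saw."""
    pairs = list(zip(docs, labels))
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, len(pairs), k))
        train = [pair for i, pair in enumerate(pairs) if i not in test_idx]
        test = [pair for i, pair in enumerate(pairs) if i in test_idx]
        model = make_model().fit(*zip(*train))
        scores.append(accuracy(model, *zip(*test)))
    return sum(scores) / k

docs = [f"document {i}" for i in range(12)]
labels = ["decision", "sentence", "clerk_act"] * 4

train_accuracy = accuracy(Memorizer().fit(docs, labels), docs, labels)
cv_accuracy = k_fold_accuracy(Memorizer, docs, labels, k=4)
print(train_accuracy, cv_accuracy)
```

Here the memorizer scores 1.0 on its own training data and about 0.33 under 4-fold cross-validation, which is chance level for three balanced classes.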
Next Steps
Improve dataset:
Train (label) more documents
Re-train wrongly labeled data
Crowdsourcing?
Migrate Xavier to Tsuru:
Make it a service!
Expose an API method to classify documents
classified_doc = api.get(method='classify_doc', doc=doc)
Questions?
Thank you!
MachineLearningJusbrasilDocCLass
By Rodrigo Araújo