JusBrasil's Machine Learning Architecture to Predict Document Labels

What is Machine Learning?

The Resurrection of Artificial Intelligence

And many more.

Machine Learning is as simple as Human Learning

(Spoiler Alert: It's not that simple)

What is this?

Car

Patterns from previous information

WTF is happening here?

Black Magic?

Neuroscience tries to understand

AI/ML/Mathematics/Computer Science try to imitate through Mathematical Models

Machine Learning is a mix of many Sciences

Information Theory

Linear Algebra

Statistics

Probability

...And JusBrasil wants to do the same

Model

Decisão (decision)?

Julgamento (judgment)?

Sentença (final ruling)?

Ato Serventuário (court clerk's act)?

Supervised Learning

Training Data (labeled)

Learning Algorithm

Model

Incoming Data

Predictions
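The flow above (training data, learning algorithm, model, incoming data, predictions) can be sketched in a few lines of plain Python. This is a deliberately naive word-count classifier, not the method used at JusBrasil; the label names and example texts are made up for illustration.

```python
from collections import Counter, defaultdict


def train(labeled_docs):
    """Learning algorithm: build per-label word counts from labeled data."""
    model = defaultdict(Counter)
    for text, label in labeled_docs:
        model[label].update(text.lower().split())
    return model


def predict(model, text):
    """Score each label by how often it has seen the document's words."""
    words = text.lower().split()
    scores = {}
    for label, counts in model.items():
        total = sum(counts.values())
        scores[label] = sum(counts[w] / total for w in words)
    return max(scores, key=scores.get)


# Training data: (text, label) pairs. Both are hypothetical.
training_data = [
    ("the court issues this ruling", "decision"),
    ("the judge issues a ruling on the appeal", "decision"),
    ("the clerk certifies the filing", "clerk_act"),
    ("the clerk records the filing date", "clerk_act"),
]

model = train(training_data)

# Incoming data goes through the model to produce predictions.
for doc in ["clerk certifies filing date", "judge ruling on appeal"]:
    print(doc, "->", predict(model, doc))
```

The shape is what matters here: one function turns labeled examples into a model, another applies that model to unseen text.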

The problem?

Real world datasets are messy

All we want is the perfect fit

But over/under fitting is also a problem

Keys to good Supervised Learning:

Right dataset

Right features

Right algorithms

Right techniques

A good Architecture

Architecture for Automagic Text Classification

First we need a way to gather training data

Telepathy: Platform for data training

Xavier

Feature Engineering

Phase to select/extract features

Using TF-IDF to extract word importance

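from sklearn.feature_extraction.text import TfidfVectorizer

# `stopwords` is not shown in the slides; presumably a helper that returns the stop-word list.
vectorizer = TfidfVectorizer(sublinear_tf=True,        # use 1 + log(tf) instead of raw counts
                             stop_words=stopwords.get_stop_words(),
                             token_pattern=r'\w{4,}',  # keep only tokens with 4+ characters
                             max_features=10000,       # cap the vocabulary size
                             ngram_range=(1, 1),       # unigrams only
                             strip_accents='unicode',
                             norm='l2')                # scale each document vector to unit length

# X is the list of raw document texts.
Vectorized_X = vectorizer.fit_transform(X)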
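To make the weighting less of a black box, here is a plain-Python sketch of the TF-IDF variant configured above: sublinear term frequency, smoothed inverse document frequency, and L2 normalization. It follows the formulas documented for scikit-learn's defaults but skips stop words, accent stripping and the vocabulary cap; the example documents are made up.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"\w{4,}")  # same idea as token_pattern=r'\w{4,}'


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [Counter(TOKEN.findall(doc.lower())) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for counts in tokenized for term in counts)
    rows = []
    for counts in tokenized:
        row = {}
        for term, tf in counts.items():
            sub_tf = 1.0 + math.log(tf)                         # sublinear_tf=True
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0  # smoothed idf
            row[term] = sub_tf * idf
        norm = math.sqrt(sum(w * w for w in row.values())) or 1.0
        rows.append({term: w / norm for term, w in row.items()})
    return rows


docs = [
    "ruling issued from court",
    "decision issued from tribunal",
    "judgment postponed from tribunal",
]
weights = tfidf(docs)

# Terms that appear in fewer documents get a larger weight.
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "ruling" (unique to it) outweighs "issued" (two documents), which outweighs "from" (all three).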

Process to create the model

Strategy #1

Use pipeline to unify the process

Strategy #2

Use Grid Search to tune parameters

Main Pipeline
  Length Transformer
  Inner Pipeline
    TFIDF vectorizer
    Dense Transformer
  Estimators (calling transform())
    Perceptron
    Logistic Regression
    LinearSVC
  Extra Randomized Trees

Length Transformer + Inner Pipeline: extract features and vectorize them

Estimators: transform the feature vector

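from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# LengthTransformer and DenseTransformer are custom transformers (not shown in the slides).
# The 'estimators' union relies on linear models exposing transform() (feature selection),
# which scikit-learn removed in 0.19; newer versions need SelectFromModel wrappers instead.
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('lengthtransformer', LengthTransformer()),
        ('tfidf', Pipeline([
            ('vect', vectorizer),
            ('to_dense', DenseTransformer()),
        ])),
    ])),
    ('estimators', FeatureUnion([
        ('perceptron', Perceptron(alpha=0.0001)),
        ('lr', LogisticRegression(C=5)),
        ('linearsvc', LinearSVC(dual=True, C=5)),
    ])),
    ('clf', ExtraTreesClassifier(n_estimators=70)),
])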
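On scikit-learn 0.19 and later, linear models no longer expose transform(), so the 'estimators' union above fails as written. One way to get comparable behavior on current releases is to wrap each linear model in SelectFromModel, as in the sketch below. LengthTransformer and DenseTransformer are not shown in the slides, so the versions here are hypothetical stand-ins, and the example documents and labels are made up.

```python
import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC


class LengthTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in: one feature per document, its length in characters."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(doc)] for doc in X], dtype=float)


class DenseTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in: convert a scipy sparse matrix to a dense array."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.toarray() if sparse.issparse(X) else np.asarray(X)


pipeline = Pipeline([
    ('features', FeatureUnion([
        ('lengthtransformer', LengthTransformer()),
        ('tfidf', Pipeline([
            ('vect', TfidfVectorizer(sublinear_tf=True, token_pattern=r'\w{4,}')),
            ('to_dense', DenseTransformer()),
        ])),
    ])),
    # Each selector keeps the features its linear model weights most heavily.
    ('estimators', FeatureUnion([
        ('perceptron', SelectFromModel(Perceptron(alpha=0.0001))),
        ('lr', SelectFromModel(LogisticRegression(C=5))),
        ('linearsvc', SelectFromModel(LinearSVC(dual=True, C=5))),
    ])),
    ('clf', ExtraTreesClassifier(n_estimators=70, random_state=0)),
])

docs = [
    "court issues final ruling appeal",
    "judge grants ruling after appeal hearing",
    "court denies appeal with detailed ruling",
    "clerk certifies filing date",
    "clerk records filing with stamp",
    "clerk forwards filing records",
]
labels = ["decision"] * 3 + ["clerk_act"] * 3

pipeline.fit(docs, labels)
predictions = [str(p) for p in pipeline.predict(docs)]
print(predictions)
```

With these wrappers, grid-search keys gain one extra level, for example `estimators__lr__estimator__C` instead of `estimators__lr__C`.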

A Pipeline's interface is similar to an Estimator's interface: it has fit() and transform()
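A toy re-implementation shows why this shared interface matters: because a pipeline exposes the same fit() and transform() methods as its steps, it can itself be used as a step inside a union. The classes below are illustrative only and are not scikit-learn's implementation.

```python
class Lowercase:
    """Transformer: learns nothing, lowercases each text."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [text.lower() for text in X]


class CharCount:
    """Transformer: one feature per text, its length in characters."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[len(text)] for text in X]


class WordCount:
    """Transformer: one feature per text, its number of words."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[len(text.split())] for text in X]


class MiniPipeline:
    """Runs steps in sequence; offers the same interface as a single step."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y=None):
        for _, step in self.steps:
            X = step.fit(X, y).transform(X)
        return self

    def transform(self, X):
        for _, step in self.steps:
            X = step.transform(X)
        return X


class MiniFeatureUnion:
    """Runs transformers side by side and concatenates their feature rows."""

    def __init__(self, transformers):
        self.transformers = transformers

    def fit(self, X, y=None):
        for _, transformer in self.transformers:
            transformer.fit(X, y)
        return self

    def transform(self, X):
        blocks = [t.transform(X) for _, t in self.transformers]
        return [sum(parts, []) for parts in zip(*blocks)]


features = MiniFeatureUnion([
    ('chars', CharCount()),
    ('words', MiniPipeline([('lower', Lowercase()), ('count', WordCount())])),
])

docs = ["Court ruling issued", "Filed"]
rows = features.fit(docs).transform(docs)
print(rows)
```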

Tuning Pipeline's parameters

Grid Search

Useful to search best parameters automatically

It receives an estimator (a Pipeline in this case) and a grid of candidate parameters

Parameters Grid

Best parameters combination

from sklearn.model_selection import GridSearchCV

def tune_parameters(clf, X, y):
    # Each key addresses a nested step as <step>__<substep>__<parameter>.
    parameters = {
        'features__tfidf__vect__max_features': [10000],
        'features__tfidf__vect__ngram_range': [(1, 1)],
        'estimators__lr__C': [5],
        'estimators__linearsvc__dual': [True],
        'estimators__linearsvc__C': [5],
        'estimators__perceptron__alpha': [0.0001],
        'clf__n_estimators': [70],
    }

    # FOLDS (the number of cross-validation folds) is defined elsewhere.
    # n_jobs=-1 evaluates candidates in parallel on all available cores.
    grid_search = GridSearchCV(clf, parameters, verbose=True, cv=FOLDS, n_jobs=-1)

    grid_search.fit(X, y)

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

Grid Search can run its tasks in parallel
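Conceptually, a grid search expands the parameter grid into every combination, scores each one, and keeps the best. The sketch below shows that idea in plain Python with a thread pool; it is not how scikit-learn implements it (GridSearchCV parallelizes through joblib when n_jobs is set), and `toy_score` is a made-up stand-in for a cross-validated score.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def expand_grid(grid):
    """Yield every combination of the values in a {name: [values]} grid."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(score, grid, workers=4):
    """Score every combination in parallel; return (best_score, best_params)."""
    candidates = list(expand_grid(grid))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda params: score(**params), candidates))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return scores[best], candidates[best]


def toy_score(C, n_estimators):
    """Hypothetical objective that peaks at C=5, n_estimators=70."""
    return -abs(C - 5) - abs(n_estimators - 70) / 100


grid = {"C": [1, 5, 10], "n_estimators": [10, 70, 150]}
best_score, best_params = grid_search(toy_score, grid)
print(best_score, best_params)
```

The number of candidates is the product of the list lengths (3 x 3 = 9 here), which is why large grids get slow quickly. The grid shown on the slide lists a single value per parameter, so it describes exactly one candidate.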

Pipeline + Grid Search

Why Random Trees as Final Classifier?

Random Tree is a great method

Information Theory's Information Gain worked amazingly well

Variance Reduction worked well too!
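For reference, information gain measures how much a candidate split reduces label entropy. The sketch below is a minimal plain-Python version for a binary split; the labels are made up, and it makes no claim about which split criterion the ExtraTreesClassifier in the pipeline was configured with.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


parent = ["decision", "decision", "clerk_act", "clerk_act"]

# A split that separates the classes perfectly removes all uncertainty.
perfect = information_gain(parent, parent[:2], parent[2:])

# A split that leaves both children as mixed as the parent gains nothing.
useless = information_gain(parent, ["decision", "clerk_act"], ["decision", "clerk_act"])

print(perfect, useless)
```

Variance reduction plays the same role when the target is numeric: the split is scored by how much it lowers the variance of the target instead of the entropy of the labels.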

Combining many trees is cool

And it works

Ensemble Technique

Results

Previous dataset (highly biased) + previous process

<60% prediction precision

Previous dataset (highly biased) + new process

~61% prediction precision

New dataset (re-trained from scratch) + new process

85% prediction precision (and growing)

Lessons Learned

Unbalanced datasets suck.

Datasets with too few examples suck.

Low quality datasets suck even more.

There are good algorithms and bad algorithms for a given situation. Choose wisely.

Grid Search can be *really* slow. Make sure to run it in parallel... otherwise...

It can be embarrassingly slow

Feature Union is a great idea. Don't stick only to TF-IDF.

Do not test your model against data used to train the model.

Always cross validate.
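The last two lessons fit together: k-fold cross-validation guarantees that every sample is used for testing exactly once and never appears in the training set of the same round. A minimal sketch of the index bookkeeping, assuming the data has already been shuffled (the function name is illustrative):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    # Train on train_idx, evaluate on test_idx, then average the scores.
    print("train:", train_idx, "test:", test_idx)
```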

Intuition fails in high dimensions

More data beats a cleverer algorithm

- Pedro Domingos, University of Washington

Next Steps

Improve dataset:

Train more documents

Re-train wrong data

Crowd Sourcing?

Migrate Xavier to Tsuru:

Make it a service!

Expose API method to classify documents

classified_doc = api.get(method='classify_doc', doc=doc)

Questions?

Thank you! 

MachineLearningJusbrasilDocCLass

By Rodrigo Araújo
