Snorkel

Weak Supervision for NLP Tasks

HK ML Meetup

July 2020

Overview

Problem Space
Weak Supervision
Snorkel
Demo
Lessons Learned
Questions (time permitting)

Use ML for Cost and Consistency

Regulations → Standardize → Legal Review & Enhance → Final Product

Scott: ML Lead @ Ascent

Ascent: we sell regulatory compliance tools and knowledge
Product: a searchable, standardized database of regulatory text from around the world

Weak Supervision

Problem: For supervised ML, collecting labels can be extremely costly and/or prohibitively time consuming

Can we somehow encode guidelines for labeling data, and rapidly apply them to large amounts of unlabeled data?

Potential Solution: Weak Supervision - "noisy, limited, or imprecise sources are used to provide supervision signal for labeling large amounts of training data in a supervised learning setting"

Weak Supervision

Image Credit: Weak Supervision: A New Programming Paradigm for Machine Learning, Alex Ratner, http://ai.stanford.edu/blog/weak-supervision/

Weak Supervision

In short: collect a bunch of "noisy" labels using low cost shortcuts and then sort out the problems that arise with this approach later (e.g., conflicts, overlaps)

How to create noisy labels:
- Encode domain knowledge from experts as labeling "rules"
- Collect labels from Mechanical Turk workers / non-experts
- Use related information (e.g., knowledge bases) and some knowledge transfer to label
- Use specialized models for sub-tasks

Snorkel

Python library with a suite of tools to assist with weak supervision tasks; mostly focused on NLP
Started by Alex Ratner while at Stanford University, has since grown into a very active open source project
Used in industry[1] to great effect

[1]: https://ai.googleblog.com/2019/03/harnessing-organizational-knowledge-for.html, https://arxiv.org/pdf/1812.06176.pdf

Snorkel

High level process:[1]
- Incorporate domain knowledge into labeling functions
- Resolve overlaps and conflicts with a label model
- Use weighted labels to train final model
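
The three steps above can be sketched in plain Python. This is a toy illustration, not Snorkel's API: the labeling functions, the `-1` abstain convention, and the sample reviews are assumptions made for the example, and a simple vote fraction stands in for the label model (Snorkel's label model estimates labeling function accuracies rather than counting votes).

```python
# Label conventions (assumed for this sketch).
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Step 1: encode domain knowledge as small labeling functions (hypothetical rules).
def lf_contains_great(text):
    return POSITIVE if "great" in text.lower() else ABSTAIN

def lf_contains_boring(text):
    return NEGATIVE if "boring" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return POSITIVE if text.count("!") >= 2 else ABSTAIN

LFS = [lf_contains_great, lf_contains_boring, lf_many_exclamations]

def apply_lfs(texts, lfs=LFS):
    """Build the label matrix: one row per text, one column per labeling function."""
    return [[lf(t) for lf in lfs] for t in texts]

# Step 2: resolve overlaps and conflicts (vote fraction as a stand-in label model).
def vote_probability(row):
    """Return P(positive) from the non-abstain votes, or None if every LF abstained."""
    votes = [v for v in row if v != ABSTAIN]
    if not votes:
        return None
    return votes.count(POSITIVE) / len(votes)

reviews = [
    "Great cast, great story!!",
    "Boring and far too long.",
    "Great effects but a boring plot.",
    "I watched it on a plane.",
]
label_matrix = apply_lfs(reviews)

# Step 3: these soft labels would be passed to the final model as weighted targets.
soft_labels = [vote_probability(row) for row in label_matrix]
print(soft_labels)  # [1.0, 0.0, 0.5, None]
```

The third review shows a conflict (one positive vote, one negative vote), and the fourth shows a row with no information at all.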

Image Credit: Weak Supervision: A New Programming Paradigm for Machine Learning, Alex Ratner, http://ai.stanford.edu/blog/weak-supervision/

[1]: Check out this talk for a much more in depth explanation: https://www.datacouncil.ai/talks/accelerating-machine-learning-with-training-data-management and also this blog: http://ai.stanford.edu/blog/weak-supervision/

Snorkel: Demo[1]

Problem: Classic NLP IMDB movie review sentiment; given review text, determine if review is positive or negative

Twist:
- Let's assume we only start with 1000 labels; will use these as the test set
- Will use Snorkel to create the rest of our labeled data

[1]: There are much more comprehensive tutorials on the Snorkel website: https://www.snorkel.org/use-cases/. This demo is meant to be a very cursory introduction to the functionality; if you would like to learn more check out the docs.

Snorkel: Demo

Create a labeling function
Apply labeling function to unlabeled data
Iterate on labeling functions
Create a label model to resolve overlaps and conflicts[1]
Filter out any rows with no information
Train classification model (with or without probability weighted labels)
Follow along with the code here
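
To make the "iterate" and "filter" steps concrete, here is a minimal plain-Python sketch of the kind of per-labeling-function statistics that guide iteration (coverage, overlap, conflict), plus dropping rows where every function abstained. It does not use Snorkel; the function names and the small label matrix are hypothetical, and `-1` is assumed to mean "abstain".

```python
ABSTAIN = -1

def lf_stats(label_matrix):
    """Per-LF coverage, overlap, and conflict, each as a fraction of all rows."""
    n_rows = len(label_matrix)
    n_lfs = len(label_matrix[0])
    stats = []
    for j in range(n_lfs):
        covered = overlaps = conflicts = 0
        for row in label_matrix:
            if row[j] == ABSTAIN:
                continue
            covered += 1
            # Votes from the other labeling functions on the same row.
            others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
            if others:
                overlaps += 1
            if any(v != row[j] for v in others):
                conflicts += 1
        stats.append({
            "coverage": covered / n_rows,
            "overlap": overlaps / n_rows,
            "conflict": conflicts / n_rows,
        })
    return stats

def drop_uncovered(rows, label_matrix):
    """Keep only the (row, votes) pairs where at least one LF voted."""
    return [(r, votes) for r, votes in zip(rows, label_matrix)
            if any(v != ABSTAIN for v in votes)]

# Hypothetical label matrix: 4 reviews x 3 labeling functions.
L = [
    [1, -1, 1],
    [-1, 0, -1],
    [1, 0, -1],
    [-1, -1, -1],
]
stats = lf_stats(L)
kept = drop_uncovered(["r0", "r1", "r2", "r3"], L)
```

A labeling function with low coverage or high conflict is a natural candidate for the next iteration; the last row here carries no votes and is dropped before training.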

[1]: How this is accomplished is quite interesting. For a detailed view check out section 4 of: Training Complex Models with Multi-Task Weak Supervision

Snorkel: Demo

Other features
- spaCy Integration: can use NER, PoS tools to help build labels
- Transformation Functions: can create data augmentation functions to enhance data (e.g., synonym replacement)
- Slice-based Learning: focus on subsets of classes / specific subproblems and weight importance
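
As a rough idea of what a synonym-replacement transformation function does, here is a plain-Python sketch. It is not Snorkel's transformation function API: the synonym table, function names, and parameters are all made up for illustration, and a real table would come from a proper lexical resource.

```python
import random

# Tiny hypothetical synonym table, for illustration only.
SYNONYMS = {
    "great": ["excellent", "superb"],
    "boring": ["dull", "tedious"],
    "movie": ["film"],
}

def replace_synonyms(text, rng, max_swaps=1):
    """Return a copy of text with up to max_swaps known words swapped for a synonym.

    Matching is on whitespace-separated tokens, so words with attached
    punctuation (e.g. "great!") are left untouched in this sketch.
    """
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:max_swaps]:
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

def augment(texts_with_labels, n_copies=2, seed=0):
    """Return each original example followed by n_copies variants with the same label."""
    rng = random.Random(seed)
    out = []
    for text, label in texts_with_labels:
        out.append((text, label))
        for _ in range(n_copies):
            out.append((replace_synonyms(text, rng), label))
    return out

augmented = augment([("a great movie", 1), ("boring plot", 0)])
```

The key property is that the label is carried over unchanged, so the transformation should only make edits that preserve the class.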

Lessons Learned

Overall: 👍 recommended, worth at least exploring if you have high cost labeling scenarios
Potential to be useful in low-data scenarios, establishing baselines, small performance boosts on existing models
Great to gain a deeper understanding of a new problem space and/or new data
Can use other models as labeling functions, can combine signals
Can be used to pull in new data modes to existing models (e.g., caption text for image)
Works well for multi-task / ancillary tasks
Works well in conjunction with active learning

Lessons Learned

Return on time investment has high variance, not a slam dunk; getting to a useful output usually requires many iterative cycles
Performance gains will depend on the size of your unlabeled data, the quality of your labeling functions, and your ability to incorporate weighted labels
Need to do some accounting for sub-class scenarios; don't want to skew the distributions with homogeneous labeling functions. Ideally, labeling functions (LFs) are:
- many (more than 20 is good), and diverse
- mostly correct (50%+ accuracy), and conditionally independent
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Doesn't work well with tasks such as NER, where you need context
Best to use in conjunction with other orthogonal methods
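
One way to check the "mostly correct" guideline above is to score each labeling function against a small hand-labeled set. The sketch below is plain Python with hypothetical names and data; it assumes binary labels, `-1` for abstain, and measures accuracy only on rows where the function actually voted.

```python
ABSTAIN = -1

def lf_accuracy(label_matrix, gold):
    """Accuracy of each LF on the rows where it did not abstain (None if it never voted)."""
    n_lfs = len(label_matrix[0])
    accuracies = []
    for j in range(n_lfs):
        voted = [(row[j], y) for row, y in zip(label_matrix, gold) if row[j] != ABSTAIN]
        if not voted:
            accuracies.append(None)
        else:
            accuracies.append(sum(v == y for v, y in voted) / len(voted))
    return accuracies

def flag_weak_lfs(accuracies, threshold=0.5):
    """Indices of LFs at or below the threshold (no better than a coin flip for binary labels)."""
    return [j for j, acc in enumerate(accuracies) if acc is not None and acc <= threshold]

# Hypothetical dev set: 4 rows x 3 labeling functions, with gold labels.
L_dev = [
    [1, -1, 0],
    [1, 0, 1],
    [-1, 0, 1],
    [1, 0, 0],
]
y_dev = [1, 0, 0, 1]

acc = lf_accuracy(L_dev, y_dev)
weak = flag_weak_lfs(acc)
print(weak)  # [2]
```

Here the third function disagrees with the gold label every time it votes, so it is flagged for rewriting or removal.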

Summary

Weak supervision can be a useful tool in your ML toolkit, helping to lower the cost and reduce the time needed to collect labeled data
Snorkel is a well engineered, open source library that will help with the nuts and bolts of collecting noisy labels and augmenting your training data
You will get the most return on your time in scenarios where the problem space is new/novel, where expert knowledge is scarce / costly, or where there are large volumes of unlabeled data

Questions
