Weak Supervision for NLP Tasks


HK ML Meetup

July 2020


  • Problem Space
  • Weak Supervision
  • Snorkel
  • Demo
  • Lessons Learned
  • Questions (time permitting)

Use ML for Cost and Consistency

[Pipeline diagram: Legal Review & Enhance → Final Product]

Scott: ML Lead @ Ascent

  • Ascent: we sell regulatory compliance tools and knowledge
  • Product: a searchable, standardized database of regulatory text from around the world

Weak Supervision

  • Problem: For supervised ML, collecting labels can be extremely costly and/or prohibitively time consuming
  • Can we somehow encode guidelines for labeling data, and rapidly apply them to large amounts of unlabeled data?
  • Potential Solution: Weak Supervision - "noisy, limited, or imprecise sources are used to provide supervision signal for labeling large amounts of training data in a supervised learning setting"

Weak Supervision

Image Credit: Weak Supervision: A New Programming Paradigm for Machine Learning, Alex Ratner

Weak Supervision

  • In short: collect a bunch of "noisy" labels using low-cost shortcuts, then sort out the problems that arise with this approach (e.g., conflicts, overlaps) afterwards


  • How to create noisy labels:
    • Encode domain knowledge from experts as labeling "rules"
    • Collect labels from crowd workers (e.g., Mechanical Turk) / non-experts
    • Use related information (e.g., knowledge bases) and some knowledge transfer to label
    • Use specialized models for sub-tasks
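The first bullet above, encoding an expert rule as a labeling function, can be sketched in plain Python. The keyword lists and label constants here are illustrative assumptions, not part of any library; the key idea is that a rule either votes a label or abstains.

```python
import re

# Illustrative label constants; Snorkel uses -1 for "abstain" by convention
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_contains_superlative(review: str) -> int:
    # Expert rule: strong praise words usually indicate a positive review
    if re.search(r"\b(excellent|masterpiece|superb)\b", review.lower()):
        return POSITIVE
    return ABSTAIN

def lf_contains_pan(review: str) -> int:
    # Expert rule: strong criticism words usually indicate a negative review
    if re.search(r"\b(awful|boring|waste)\b", review.lower()):
        return NEGATIVE
    return ABSTAIN

lf_contains_superlative("An excellent film")  # -> POSITIVE (1)
lf_contains_pan("It was fine")                # -> ABSTAIN (-1): the rule stays silent
```

Abstaining is what makes this workable: each rule only fires where it is confident, and coverage comes from having many such rules.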

Snorkel

  • Python library with a suite of tools to assist with weak supervision tasks, mostly focused on NLP
  • Started by Alex Ratner while at Stanford University; it has since grown into a very active open source project
  • Used in industry[1] to great effect


  • High level process:[1]
    • Incorporate domain knowledge into labeling functions
    • Resolve overlaps and conflicts with a label model
    • Use weighted labels to train final model

Image Credit: Weak Supervision: A New Programming Paradigm for Machine Learning, Alex Ratner

[1]: Check out this talk for a much more in-depth explanation: and also this blog:

Snorkel: Demo[1]

  • Problem: the classic IMDB movie review sentiment task; given review text, determine whether the review is positive or negative


  • Twist:
    • Assume we start with only 1,000 labels, which we will hold out as the test set
    • We will use Snorkel to create the rest of our labeled data

[1]: There are much more comprehensive tutorials on the Snorkel website: This demo is meant to be a very cursory introduction to the functionality; if you would like to learn more, check out the docs.

Snorkel: Demo

  • Create a labeling function
  • Apply labeling function to unlabeled data
  • Iterate on labeling functions
  • Create a label model to resolve overlaps and conflicts[1]
  • Filter out any rows with no information
  • Train classification model (with or without probability weighted labels)
  • Follow along with the code here

[1]: How this is accomplished is quite interesting.  For a detailed view check out section 4 of: Training Complex Models with Multi-Task Weak Supervision
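The "apply labeling functions" and "filter out rows with no information" steps from the list above can be sketched without the Snorkel library at all (in Snorkel these roughly correspond to `PandasLFApplier` and `filter_unlabeled_dataframe`; the plain-list version below is a stdlib-only assumption for illustration):

```python
ABSTAIN = -1  # convention: -1 means the labeling function abstained

def apply_lfs(examples, lfs):
    """Build the label matrix: one row per example, one column per LF."""
    return [[lf(x) for lf in lfs] for x in examples]

def filter_uncovered(examples, label_matrix):
    """Drop rows where every LF abstained: they carry no supervision signal."""
    keep = [i for i, row in enumerate(label_matrix) if any(v != ABSTAIN for v in row)]
    return [examples[i] for i in keep], [label_matrix[i] for i in keep]

# Two toy keyword LFs, illustrative only
lfs = [lambda x: 1 if "great" in x else ABSTAIN,
       lambda x: 0 if "awful" in x else ABSTAIN]

docs = ["great movie", "awful plot", "it was fine"]
docs_kept, L_kept = filter_uncovered(docs, apply_lfs(docs, lfs))
# "it was fine" is dropped: no LF fired on it
```

Filtering matters because rows where every LF abstains would otherwise dilute the training set with uninformative examples.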

Snorkel: Demo

  • Other features
    • spaCy integration: can use NER and PoS tagging tools to help build labels
    • Transformation Functions: can create data augmentation functions to enhance data (e.g., synonym replacement)
    • Slice-based Learning: focus on subsets of classes / specific subproblems and weight their importance
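A transformation function of the synonym-replacement kind mentioned above can be sketched as follows. The tiny synonym table is an illustrative assumption (real setups typically draw synonyms from a resource like WordNet), and the function name is hypothetical:

```python
import random

# Illustrative synonym table; a real pipeline would use e.g. WordNet
SYNONYMS = {"great": ["excellent", "superb"], "bad": ["awful", "poor"]}

def tf_replace_synonym(text: str, rng: random.Random) -> str:
    """Return an augmented copy of `text` with one known word swapped for a synonym."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    if not candidates:
        return text  # nothing we know how to augment; return unchanged
    i = rng.choice(candidates)
    words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

tf_replace_synonym("a great movie", random.Random(0))
# -> "a excellent movie" or "a superb movie", label unchanged
```

The point of such transformations is label-preserving augmentation: the sentiment of the review should survive the swap, so each transformed copy is a cheap extra training example.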

Lessons Learned

  • Overall: 👍 recommended, worth at least exploring if you have high cost labeling scenarios

  • Potential to be useful in low-data scenarios, establishing baselines, small performance boosts on existing models

  • Great to gain a deeper understanding of a new problem space and/or new data

  • Can use other models as labeling functions, can combine signals

  • Can be used to pull in new data modes to existing models (e.g., caption text for image)

  • Works well for multi-task / ancillary tasks

  • Works well in conjunction with active learning

Lessons Learned

  • Return on time investment has high variance, not a slam dunk; getting to a useful output usually requires many iterative cycles

  • Performance gains will depend on the size of your unlabeled data, the quality of your labeling functions, and your ability to incorporate weighted labels

  • Need to do some accounting for sub-class scenarios; you don't want to skew the distributions with homogeneous labeling functions. Ideally, LFs are:

    • many (more than 20 is good) and diverse

    • mostly correct (50%+ accuracy) and conditionally independent

  • Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

  • Doesn't work well with tasks such as NER, where you need context

  • Best to use in conjunction with other orthogonal methods
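The LF quality criteria listed above (coverage, diversity, agreement) can be checked empirically from the label matrix; Snorkel's `LFAnalysis` reports exactly these kinds of statistics. A stdlib-only stand-in, for illustration:

```python
ABSTAIN = -1  # convention: -1 means the labeling function abstained

def lf_stats(label_matrix):
    """Per-LF coverage, overlap, and conflict rates over a label matrix
    (rows = examples, columns = labeling functions)."""
    n_rows, n_lfs = len(label_matrix), len(label_matrix[0])
    stats = []
    for j in range(n_lfs):
        # coverage: fraction of examples where LF j fired at all
        covered = [row for row in label_matrix if row[j] != ABSTAIN]
        # overlap: fraction where LF j fired AND at least one other LF also fired
        overlap = [row for row in covered
                   if any(row[k] != ABSTAIN for k in range(n_lfs) if k != j)]
        # conflict: fraction where another firing LF disagreed with LF j
        conflict = [row for row in overlap
                    if any(row[k] not in (ABSTAIN, row[j]) for k in range(n_lfs) if k != j)]
        stats.append({"coverage": len(covered) / n_rows,
                      "overlap": len(overlap) / n_rows,
                      "conflict": len(conflict) / n_rows})
    return stats

L = [[1, 1], [1, 0], [-1, 0], [-1, -1]]  # 4 examples, 2 LFs
s = lf_stats(L)
```

High conflict rates flag LFs worth revisiting, while near-total overlap between two LFs suggests they are redundant rather than diverse.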


Summary

  • Weak supervision can be a useful tool in your ML toolkit, helping to lower the cost and reduce the time needed to collect labeled data

  • Snorkel is a well engineered, open source library that will help with the nuts and bolts of collecting noisy labels and augmenting your training data

  • You will get the most return on your time in scenarios where the problem space is new/novel, where expert knowledge is scarce / costly, or where there are large volumes of unlabeled data


