Weak Supervision for NLP Tasks


HK ML Meetup

July 2020


  • Problem Space
  • Weak Supervision
  • Snorkel
  • Demo
  • Lessons Learned
  • Questions (time permitting)

Use ML for Cost and Consistency

[Pipeline diagram: Legal Review & Enhance → Final Product]

Scott: ML Lead @ Ascent

  • Ascent: we sell regulatory compliance tools and knowledge
  • Product: a searchable, standardized database of regulatory text from around the world

Weak Supervision

  • Problem: For supervised ML, collecting labels can be extremely costly and/or prohibitively time consuming
  • Can we somehow encode guidelines for labeling data, and rapidly apply them to large amounts of unlabeled data?
  • Potential Solution: Weak Supervision - "noisy, limited, or imprecise sources are used to provide supervision signal for labeling large amounts of training data in a supervised learning setting"

Weak Supervision

Image Credit: Weak Supervision: A New Programming Paradigm for Machine Learning, Alex Ratner

Weak Supervision

  • In short: collect a bunch of "noisy" labels using low-cost shortcuts, then sort out the problems that arise with this approach (e.g., conflicts, overlaps) afterwards


  • How to create noisy labels:
    • Encode domain knowledge from experts as labeling "rules"
    • Collect labels from crowd workers (e.g., Mechanical Turk) / non-experts
    • Use related information (e.g., knowledge bases) and some knowledge transfer to label
    • Use specialized models for sub-tasks
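The first bullet above, encoding an expert rule as a labeling function, can be sketched in plain Python. The keyword lists and label constants here are illustrative assumptions, not part of any library; the key idea is that a rule either votes a label or abstains.

```python
import re

# Illustrative label constants; Snorkel uses -1 for "abstain" by convention
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_contains_superlative(review: str) -> int:
    # Expert rule: strong praise words usually indicate a positive review
    if re.search(r"\b(excellent|masterpiece|superb)\b", review.lower()):
        return POSITIVE
    return ABSTAIN

def lf_contains_pan(review: str) -> int:
    # Expert rule: strong criticism words usually indicate a negative review
    if re.search(r"\b(awful|boring|waste)\b", review.lower()):
        return NEGATIVE
    return ABSTAIN

lf_contains_superlative("An excellent film")  # -> POSITIVE (1)
lf_contains_pan("It was fine")                # -> ABSTAIN (-1): the rule stays silent
```

Abstaining is what makes this workable: each rule only fires where it is confident, and coverage comes from having many such rules.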

Snorkel

  • Python library with a suite of tools to assist with weak supervision tasks, mostly focused on NLP
  • Started by Alex Ratner while at Stanford University; it has since grown into a very active open source project
  • Used in industry[1] to great effect


  • High level process:[1]
    • Incorporate domain knowledge into labeling functions
    • Resolve overlaps and conflicts with a label model
    • Use weighted labels to train final model

Image Credit: Weak Supervision: A New Programming Paradigm for Machine Learning, Alex Ratner

[1]: Check out this talk for a much more in-depth explanation: and also this blog:

Snorkel: Demo[1]

  • Problem: the classic IMDB movie review sentiment task; given review text, determine whether the review is positive or negative


  • Twist:
    • Assume we start with only 1,000 labels, which we will hold out as the test set
    • We will use Snorkel to create the rest of our labeled data

[1]: There are much more comprehensive tutorials on the Snorkel website: This demo is meant to be a very cursory introduction to the functionality; if you would like to learn more, check out the docs.

Snorkel: Demo

  • Create a labeling function
  • Apply labeling function to unlabeled data
  • Iterate on labeling functions
  • Create a label model to resolve overlaps and conflicts[1]
  • Filter out any rows with no information
  • Train classification model (with or without probability weighted labels)
  • Follow along with the code here

[1]: How this is accomplished is quite interesting.  For a detailed view check out section 4 of: Training Complex Models with Multi-Task Weak Supervision
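The "apply labeling functions" and "filter out rows with no information" steps from the list above can be sketched without the Snorkel library at all (in Snorkel these roughly correspond to `PandasLFApplier` and `filter_unlabeled_dataframe`; the plain-list version below is a stdlib-only assumption for illustration):

```python
ABSTAIN = -1  # convention: -1 means the labeling function abstained

def apply_lfs(examples, lfs):
    """Build the label matrix: one row per example, one column per LF."""
    return [[lf(x) for lf in lfs] for x in examples]

def filter_uncovered(examples, label_matrix):
    """Drop rows where every LF abstained: they carry no supervision signal."""
    keep = [i for i, row in enumerate(label_matrix) if any(v != ABSTAIN for v in row)]
    return [examples[i] for i in keep], [label_matrix[i] for i in keep]

# Two toy keyword LFs, illustrative only
lfs = [lambda x: 1 if "great" in x else ABSTAIN,
       lambda x: 0 if "awful" in x else ABSTAIN]

docs = ["great movie", "awful plot", "it was fine"]
docs_kept, L_kept = filter_uncovered(docs, apply_lfs(docs, lfs))
# "it was fine" is dropped: no LF fired on it
```

Filtering matters because rows where every LF abstains would otherwise dilute the training set with uninformative examples.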

Snorkel: Demo

  • Other features
    • spaCy integration: can use NER and PoS tagging tools to help build labels
    • Transformation Functions: can create data augmentation functions to enhance data (e.g., synonym replacement)
    • Slice-based Learning: focus on subsets of classes / specific subproblems and weight their importance
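A transformation function of the synonym-replacement kind mentioned above can be sketched as follows. The tiny synonym table is an illustrative assumption (real setups typically draw synonyms from a resource like WordNet), and the function name is hypothetical:

```python
import random

# Illustrative synonym table; a real pipeline would use e.g. WordNet
SYNONYMS = {"great": ["excellent", "superb"], "bad": ["awful", "poor"]}

def tf_replace_synonym(text: str, rng: random.Random) -> str:
    """Return an augmented copy of `text` with one known word swapped for a synonym."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    if not candidates:
        return text  # nothing we know how to augment; return unchanged
    i = rng.choice(candidates)
    words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

tf_replace_synonym("a great movie", random.Random(0))
# -> "a excellent movie" or "a superb movie", label unchanged
```

The point of such transformations is label-preserving augmentation: the sentiment of the review should survive the swap, so each transformed copy is a cheap extra training example.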

Lessons Learned

  • Overall: 👍 recommended, worth at least exploring if you have high cost labeling scenarios

  • Potential to be useful in low-data scenarios, establishing baselines, small performance boosts on existing models

  • Great to gain a deeper understanding of a new problem space and/or new data

  • Can use other models as labeling functions, can combine signals

  • Can be used to pull in new data modes to existing models (e.g., caption text for image)

  • Works well for multi-task / ancillary tasks

  • Works well in conjunction with active learning

Lessons Learned

  • Return on time investment has high variance, not a slam dunk; getting to a useful output usually requires many iterative cycles

  • Performance gains will depend on the size of your unlabeled data, the quality of your labeling functions, and your ability to incorporate weighted labels

  • Need to do some accounting for sub-class scenarios; you don't want to skew the distributions with homogeneous labeling functions. Ideally, LFs are:

    • many (more than 20 is good) and diverse

    • mostly correct (50%+ accuracy) and conditionally independent

  • Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

  • Doesn't work well with tasks such as NER, where you need context

  • Best to use in conjunction with other orthogonal methods
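The LF quality criteria listed above (coverage, diversity, agreement) can be checked empirically from the label matrix; Snorkel's `LFAnalysis` reports exactly these kinds of statistics. A stdlib-only stand-in, for illustration:

```python
ABSTAIN = -1  # convention: -1 means the labeling function abstained

def lf_stats(label_matrix):
    """Per-LF coverage, overlap, and conflict rates over a label matrix
    (rows = examples, columns = labeling functions)."""
    n_rows, n_lfs = len(label_matrix), len(label_matrix[0])
    stats = []
    for j in range(n_lfs):
        # coverage: fraction of examples where LF j fired at all
        covered = [row for row in label_matrix if row[j] != ABSTAIN]
        # overlap: fraction where LF j fired AND at least one other LF also fired
        overlap = [row for row in covered
                   if any(row[k] != ABSTAIN for k in range(n_lfs) if k != j)]
        # conflict: fraction where another firing LF disagreed with LF j
        conflict = [row for row in overlap
                    if any(row[k] not in (ABSTAIN, row[j]) for k in range(n_lfs) if k != j)]
        stats.append({"coverage": len(covered) / n_rows,
                      "overlap": len(overlap) / n_rows,
                      "conflict": len(conflict) / n_rows})
    return stats

L = [[1, 1], [1, 0], [-1, 0], [-1, -1]]  # 4 examples, 2 LFs
s = lf_stats(L)
```

High conflict rates flag LFs worth revisiting, while near-total overlap between two LFs suggests they are redundant rather than diverse.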


Summary

  • Weak supervision can be a useful tool in your ML toolkit, helping to lower the cost and reduce the time needed to collect labeled data

  • Snorkel is a well engineered, open source library that will help with the nuts and bolts of collecting noisy labels and augmenting your training data

  • You will get the most return on your time in scenarios where the problem space is new/novel, where expert knowledge is scarce / costly, or where there are large volumes of unlabeled data


