NLP Engineer
Context
You get to lead a project. Yay!
Your role is to try to figure out the problem.
It uses the same dataset as the first exercise and is similar to it, but some details are different.
The first part is to talk to the Client.
After getting some context from them,
continue to the next slide...
What is in the data?
The Client has started the timer.
You're going to start doing annotations, following the Annotator's lead, to see what information is in the data.
Focus on which fields are missing.
The Client will time-box it.
When the time runs out, go to the next slide.
Negotiate
As you look at the data, two fields are missing:
"number of employees" and "business owner".
Try to convince the client that both of them should be dropped. Keep in mind that they're paying your bills and what they say goes.
The Game Master will tell you when to move on.
What is a cake shop, a cafe, a deli, a bakery?
Can you come up with clear annotation rules that are consistent and make the client happy?
Click ahead
Coffee Shop Rules:
"You hired me because of my ruthless efficiency. Let's quickly come up with some rules for coffee shops. If we can't do it in 5 minutes, we will have to descope it from this sprint. Let's discuss them, one by one, to distinguish whether a business is a coffee shop or not."
Coffee Shop Rule 1:
It's a coffee shop if it has "cafe" in its name
Ask the client first.
If they agree, then ask the annotator what they think
Coffee Shop Rule 2:
That didn't work, let's try something else
It's a coffee shop if it has a picture of a coffee
Ask the client first.
If they agree, then ask the annotator what they think
Coffee Shop Rule 3:
Well this one should definitely work!
If they sell cakes, it's a cake shop, not a coffee shop
Ask the client first.
If they agree, then ask the annotator what they think
Negotiation
Uh, oh! This is difficult!
It turns out that the rules for the type of business are contradictory, so let's solve that in the "next sprint". Focus on Information Extraction now.
Let's find something we all agree on
- Postcode - we all know what a US postcode looks like. Relief!
- Business name
- Business owner name
How are we going to do that?
Click ahead
Let's use regex
{"label":"POSTCODE","pattern":[{"IS_DIGIT":true,"LENGTH":{"==":5}}]}
In the tool provided, we will use this regex-like token pattern to find postcodes: it matches any token made up of exactly 5 digits.
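To get a feel for what this rule catches, here is a minimal Python sketch of the same idea written as a plain regular expression. It is only an approximation: the real tool works on its own tokens, and the helper name `find_postcodes` and the sample sentence are made up for this example.

```python
import re

# Approximate regex equivalent of the token pattern:
# a run of exactly 5 digits that is not part of a longer number.
POSTCODE_RE = re.compile(r"(?<!\d)\d{5}(?!\d)")


def find_postcodes(text):
    """Return every standalone 5-digit number in `text` (hypothetical helper)."""
    return POSTCODE_RE.findall(text)


# Made-up sentence, not taken from the workshop dataset.
sample = "Sunny Cafe, 12 Main St, Springfield, IL 62704, tel 5550100"
print(find_postcodes(sample))
```

Note that a rule like this flags any 5-digit number, for example a street number, so the annotator still has to confirm each match.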
Let's use a pre-trained model
Business names are very similar to the ORG named entity type in spaCy's pre-trained Named Entity Recognition model. Let's use it!
A business owner is a sub-class of PERSON.
{"label":"BUSINESS_NAME","pattern":[{"ENT_TYPE":"ORG"}]}
{"label":"BUSINESS_NAME","pattern":[{"ENT_TYPE":"GPE"}]}
{"label":"BUSINESS_OWNER_PERSON","pattern":[{"ENT_TYPE":"PERSON"}]}
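The effect of these three rules can be sketched in plain Python. This is an illustration only, not how the tool is implemented: it assumes a pre-trained model has already tagged each token with an entity type, and the function `relabel` and the hand-written tags below are hypothetical.

```python
# Mapping taken from the three rules above: model entity type -> project label.
ENT_TYPE_TO_LABEL = {
    "ORG": "BUSINESS_NAME",
    "GPE": "BUSINESS_NAME",
    "PERSON": "BUSINESS_OWNER_PERSON",
}


def relabel(tagged_tokens):
    """Turn (token, ent_type) pairs into (text, label) suggestions.

    ent_type is "" for tokens outside any entity. Consecutive tokens with the
    same entity type are merged into one span (a simplification: two adjacent
    entities of the same type would be merged too).
    """
    spans = []
    prev_type = ""
    for token, ent_type in tagged_tokens:
        label = ENT_TYPE_TO_LABEL.get(ent_type)
        if label is None:
            prev_type = ""
            continue
        if ent_type == prev_type:
            text, _ = spans[-1]
            spans[-1] = (text + " " + token, label)
        else:
            spans.append((token, label))
        prev_type = ent_type
    return spans


# Hand-written tags standing in for model output (made-up sentence).
tagged = [
    ("Sunny", "ORG"), ("Cafe", "ORG"), ("is", ""), ("run", ""), ("by", ""),
    ("Maria", "PERSON"), ("Lopez", "PERSON"), ("in", ""), ("Springfield", "GPE"),
]
for text, label in relabel(tagged):
    print(label, "->", text)
```

In this made-up sentence the town name ends up suggested as a BUSINESS_NAME, and any person mentioned would be suggested as an owner. The rules only produce candidates; the annotator decides which ones to accept.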
Explain to the Annotator how to use the Annotation UI
The data will already be loaded into the tool
You will only see one sentence at a time so that it's easier to filter information
Annotation UI

Let's have another look at the tool we are going to use. Ask the Annotator to open the link to it.
What do the buttons do?
- When you are happy with the labels assigned, click the green button to accept
- To skip: grey stop button
- To go back: grey back arrow
Unnecessary today:
- Red button: for incorrectly marked-up data

Congrats!
Now you have a ground truth of 150 golden labels
In the future, we can test our rules/models on them
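One possible way to run such a test is to compare what a rule predicts against the golden labels and compute precision and recall. The sketch below assumes each label is stored as a (sentence id, text, label) triple; that format, the function name, and the toy data are assumptions made for illustration.

```python
def precision_recall(predicted, gold):
    """Compare two collections of (sentence_id, text, label) triples."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)  # predictions that exactly match a golden label
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Toy data, made up for the example.
gold = {
    (1, "62704", "POSTCODE"),
    (2, "10001", "POSTCODE"),
    (3, "Sunny Cafe", "BUSINESS_NAME"),
}
predicted = {
    (1, "62704", "POSTCODE"),
    (2, "10001", "POSTCODE"),
    (2, "55512", "POSTCODE"),  # a 5-digit number that is not a golden label
}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Precision tells us how many of the rule's suggestions were right; recall tells us how many of the golden labels the rule found.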
Part 2 over!
NLP Engineer part II
By Agata Sumowska