Cadet is a web application designed to simplify the addition of a new language to spaCy.
It does what?
When you use spaCy, you start with a Language object that provides a text-processing pipeline for your specific language.
The Language object holds data on specific attributes of your language.
Is your language right to left or left to right?
What script does it use?
Does it have distinct punctuation marks?
The Language object
Tokenization rules
Lookups
A dictionary to look up information about a token.
For example, we might look up the part of speech for "house" and get "NOUN".
Cadet lookups are stored in JSON files:
{
"house":"NOUN",
"fish":"VERB"
}
Lookups only work for unambiguous information. "Fish" can be a VERB or a NOUN; its meaning depends on context, so it should be annotated manually in INCEpTION.
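As a minimal sketch of how such a lookup behaves, the snippet below parses the JSON shown above with Python's standard `json` module and queries it. The helper name `lookup_pos` and the `None` fallback are illustrative assumptions, not part of Cadet.

```python
import json

# The same lookup data as in the JSON example above.
LOOKUP_JSON = '{"house": "NOUN", "fish": "VERB"}'

def lookup_pos(token, table):
    """Return the stored part of speech for a token, or None if it has no entry."""
    # Hypothetical helper: a lookup is a plain key-to-value mapping with no context.
    return table.get(token)

pos_lookup = json.loads(LOOKUP_JSON)

print(lookup_pos("house", pos_lookup))  # NOUN
print(lookup_pos("fish", pos_lookup))   # VERB, even in sentences where "fish" is a noun
print(lookup_pos("river", pos_lookup))  # None
```

Because the table always returns the same value for "fish", an ambiguous entry like this one is better left out of the lookup and handled during manual annotation.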
Frequent Terms
Annotation work ("tagging") can be a time- and labor-intensive process involving large teams of annotators.
One way to significantly reduce the time needed for annotation is to bulk-update terms that appear frequently and have consistent meaning.
Cadet identifies frequent terms in your corpus and provides an interface to add lemma and part of speech information.
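The idea of finding frequent terms can be sketched with `collections.Counter` from the standard library. This is only an illustration: it assumes lowercased, whitespace-separated tokens, and the function name `frequent_terms` is hypothetical. Cadet's own counting may work differently.

```python
from collections import Counter

def frequent_terms(texts, top_n=3):
    """Count lowercased, whitespace-separated tokens and return the top_n most common."""
    counts = Counter()
    for text in texts:
        # Naive tokenization (an assumption); real tokenization rules are language-specific.
        counts.update(text.lower().split())
    return counts.most_common(top_n)

corpus = [
    "the house by the river",
    "the fish in the river",
]
print(frequent_terms(corpus, top_n=2))  # [('the', 4), ('river', 2)]
```

Terms near the top of such a list are the ones where a single lemma or part-of-speech entry saves the most annotation effort.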
Export to INCEpTION
With everything in place, we're now ready to pass your corpus on to INCEpTION for annotation work.
For each of your texts, Cadet returns a file that has been tokenized and split into sentences, with frequent terms automatically annotated using your lookups.
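The pre-annotation step can be sketched roughly as follows. This is an illustration under simplifying assumptions: sentences are split on periods, tokens on whitespace, and the output is a plain tab-separated layout with `_` for fields left to the annotators. The function names `pre_annotate` and `to_tsv`, the column layout, and the lookup contents are all hypothetical; the actual file format Cadet produces is not described here.

```python
def pre_annotate(text, lemma_lookup, pos_lookup):
    """Return one list of (token, lemma, pos) rows per sentence."""
    sentences = []
    # Naive sentence split on "." (an assumption; real rules are language-specific).
    for raw in text.split("."):
        tokens = raw.split()
        if not tokens:
            continue
        rows = []
        for tok in tokens:
            key = tok.lower()
            # Tokens without a lookup entry get "_" and are left for manual annotation.
            rows.append((tok, lemma_lookup.get(key, "_"), pos_lookup.get(key, "_")))
        sentences.append(rows)
    return sentences

def to_tsv(sentences):
    """Serialize rows as tab-separated lines with a blank line between sentences."""
    blocks = ["\n".join("\t".join(row) for row in rows) for rows in sentences]
    return "\n\n".join(blocks)

lemmas = {"houses": "house"}
pos = {"houses": "NOUN", "house": "NOUN"}
annotated = pre_annotate("Houses stand. We fish.", lemmas, pos)
print(to_tsv(annotated))
```

Here "fish" has no lookup entry, so it is exported unannotated and decided in context by a human annotator.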
There's still a lot of work to do, but hopefully Cadet has significantly reduced the time needed to annotate your corpus.
It also provides the spaCy Language object that you'll need when you come back to train a model on your annotated texts.