Cadet is a web application designed to simplify the addition of a new language to spaCy.

  • Cadet generates a generic spaCy Language object with preset defaults.
  • Cadet helps to adjust those defaults to your language.
  • Cadet uses lookups data to automatically annotate frequent terms.
  • Cadet exports your tokenized texts for annotation in INCEpTION. 
  • Cadet prepares a new spaCy language object for use in model training.

It does what? 

When you use spaCy, you begin with a Language object that provides a text-processing pipeline for your specific language.

 

The Language object holds data on specific attributes of your language.

 

Is your language written right to left or left to right?

What script does it use? 

Does it have distinct punctuation marks?

The Language object

  1. splits the text into tokens (tokenizer)
  2. adds token attributes with rules and lookups
  3. orders and coordinates other pipeline components 
  4. transforms raw text into the Doc object
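As a rough illustration of the points above, the sketch below builds a blank Language object with spaCy's own `spacy.blank()` and inspects what it holds. It assumes spaCy v3 is installed and uses English ("en") only as a stand-in, since a new language would not yet have a code of its own.

```python
import spacy

# spacy.blank() returns a Language object with the tokenizer and defaults
# for the given language code, and no trained pipeline components.
nlp = spacy.blank("en")

# Calling the Language object on raw text returns a Doc.
doc = nlp("Cadet helps you add a language.")
tokens = [token.text for token in doc]

# The defaults record language attributes such as writing direction.
direction = nlp.Defaults.writing_system["direction"]

print(tokens)
print(direction)
print(nlp.pipe_names)
```

For a right-to-left language, the same `writing_system` setting would read "rtl" instead.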

Tokenization rules

  1. regular rules for prefixes, suffixes and infixes
  2. specific exceptions
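The interplay of the two kinds of rule can be sketched in plain Python. This is a simplified illustration, not spaCy's actual tokenizer: the patterns, the exception table and the `tokenize` helper are all hypothetical, chosen only to show prefixes and suffixes being peeled off until an exception or a plain word remains.

```python
import re

# Hypothetical rules for illustration; real rules depend on your language.
PREFIXES = re.compile(r"""^[(\["']""")        # opening brackets and quotes
SUFFIXES = re.compile(r"""[)\]"'.,!?]$""")    # closing brackets, quotes, punctuation
INFIXES = re.compile(r"(-)")                  # split on hyphens, keep the hyphen
EXCEPTIONS = {"don't": ["do", "n't"], "U.S.": ["U.S."]}


def tokenize(text):
    tokens = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        # Peel off one character at a time until the chunk is an
        # exception or no prefix/suffix rule matches any more.
        while chunk and chunk not in EXCEPTIONS:
            if PREFIXES.search(chunk):
                prefixes.append(chunk[0])
                chunk = chunk[1:]
            elif SUFFIXES.search(chunk):
                suffixes.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:
                break
        if chunk in EXCEPTIONS:
            middle = EXCEPTIONS[chunk]
        else:
            middle = [part for part in INFIXES.split(chunk) if part]
        tokens.extend(prefixes + middle + suffixes)
    return tokens


print(tokenize('(She said "don\'t" in the U.S.)'))
print(tokenize("well-known."))
```

Note that the exception check runs before the suffix rule, which is why "U.S." keeps its final period while "known." loses it.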

Lookups

A dictionary used to look up information about a token.

For example, we might look up the part of speech for "house" and get "NOUN".

 

Cadet lookups are stored in JSON files:

{
   "house":"NOUN",
   "fish":"VERB"
}

Lookups only work for unambiguous information. "Fish" can be a VERB or a NOUN, so an entry like the one above will sometimes be wrong. Words whose meaning depends on context should be annotated manually in INCEpTION.
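A minimal sketch of how such a table might be applied, using only the standard library. The `annotate` helper and the lowercasing of keys are assumptions made for illustration, not Cadet's documented behavior. Tokens missing from the table are left as None so they can be annotated by hand.

```python
import json

# A small lookup table in the same JSON shape as above (illustrative data).
LOOKUPS_JSON = '{"house": "NOUN", "the": "DET"}'
pos_lookup = json.loads(LOOKUPS_JSON)


def annotate(tokens, lookup):
    """Pair each token with its looked-up tag, or None if it has no entry."""
    return [(token, lookup.get(token.lower())) for token in tokens]


annotated = annotate(["The", "house", "fish"], pos_lookup)
print(annotated)
```

Leaving ambiguous words such as "fish" out of the table keeps the automatic annotations trustworthy.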

Frequent Terms 

Annotation work ("tagging") can be a time- and labor-intensive process involving large teams of annotators.

 

One way to significantly reduce annotation time is to bulk-update terms that appear frequently and have a consistent meaning.

 

Cadet identifies frequent terms in your corpus and provides an interface for adding lemma and part-of-speech information to them.
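Finding frequent terms can be sketched with a simple counter. This assumes whitespace tokenization and lowercasing for brevity; how Cadet itself counts terms is not described here, and `frequent_terms` is a hypothetical helper.

```python
from collections import Counter


def frequent_terms(texts, top_n=10):
    """Return the top_n most common lowercased tokens with their counts."""
    counts = Counter()
    for text in texts:
        counts.update(token.lower() for token in text.split())
    return counts.most_common(top_n)


corpus = ["the house is big", "the fish is small", "the house"]
print(frequent_terms(corpus, top_n=3))
```

The terms at the top of this list are the ones where a single lookup entry saves the most manual annotation.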

Export to INCEpTION

With everything in place, we're now ready to pass your corpus on to INCEpTION for annotation work.

For each of your texts, Cadet returns a file that has been tokenized and split into sentences, with frequent terms automatically annotated using your lookups.
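To make the shape of such a file concrete, here is a sketch that writes pre-annotated sentences in CoNLL-U, a tab-separated format that INCEpTION can import. The choice of CoNLL-U is an assumption for illustration, since the exact export format is not stated here, and `to_conllu` is a hypothetical helper.

```python
def to_conllu(sentences, pos_lookup, lemma_lookup):
    """Serialize tokenized sentences as CoNLL-U, filling LEMMA and UPOS
    from the lookups and leaving every other column as "_"."""
    lines = []
    for sentence in sentences:
        for index, token in enumerate(sentence, start=1):
            key = token.lower()
            lemma = lemma_lookup.get(key, "_")
            upos = pos_lookup.get(key, "_")
            # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
            lines.append("\t".join([str(index), token, lemma, upos] + ["_"] * 6))
        lines.append("")  # a blank line ends each sentence
    return "\n".join(lines) + "\n"


output = to_conllu(
    [["The", "houses"]],
    pos_lookup={"the": "DET", "houses": "NOUN"},
    lemma_lookup={"houses": "house"},
)
print(output)
```

Tokens without a lookup entry keep "_" in the lemma and tag columns, which marks them as still needing manual annotation.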

 

There's still a lot of work to do, but hopefully Cadet has significantly reduced the time needed to annotate your corpus.  

 

It also provides the spaCy Language object that you'll need when you come back to train a model on your annotated texts. 

deck

By Andrew Janco
