Cadet: A Tool to Add New Language Models to spaCy

Language Data + Models

Universal Dependencies

90 languages

StanfordNLP

53 languages

Polyglot

51 languages

DARIAH-RS

  • Serbian is a bigraphic language: it can be written in both Cyrillic and Latin, but current UD data only contains Cyrillic
  • Standard Serbian has two major dialects, Ekavian and Jekavian, which means that some words are not just pronounced but also written differently: млеко (mleko) vs. млијеко (mlijeko) = milk.
 

(Toma Tasovac)

To each project its own data

To each project its own model

Every NLP project can benefit from the fine-tuning of statistical language models on project materials. This is especially true for:

  • Languages without existing linguistic data or models
  • Historical and regional variations of language
  • Domain-specific language  
  • Multilingual texts 
 

Always in tune

🌘 Cadet

in development

 

https://inception.apjan.co/

Username: guest, password: ~monolingualism

  • Accessible to non-programmers
  • Provides a clear workflow for small teams to create new language and domain-specific models for spaCy

Interface to add or edit stop words, tokenization rules, lemmata, and normalization rules for the base language object
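As a rough illustration of the kind of language data such an interface edits, the sketch below customizes a blank spaCy pipeline by hand. It is not Cadet's own code. The blank multi-language code "xx", the example stop words, the abbreviation "нпр.", and the Jekavian-to-Ekavian normalization are all assumptions chosen for illustration. Lemmata are left out because lemma tables are configured differently across spaCy versions.

```python
import spacy
from spacy.symbols import NORM, ORTH

# A blank multi-language pipeline ("xx") stands in for a new language object.
nlp = spacy.blank("xx")

# Stop words: flag individual lexemes in the vocabulary.
for word in ("и", "у", "на"):
    nlp.vocab[word].is_stop = True

# Tokenization rule: keep the abbreviation "нпр." as a single token
# instead of splitting off the final period.
nlp.tokenizer.add_special_case("нпр.", [{ORTH: "нпр."}])

# Normalization rule: map the Jekavian spelling to the Ekavian one.
nlp.tokenizer.add_special_case("млијеко", [{ORTH: "млијеко", NORM: "млеко"}])

doc = nlp("нпр. млијеко и хлеб")
tokens = [token.text for token in doc]
stop_flags = [token.is_stop for token in doc]
norms = [token.norm_ for token in doc]
```

Inspecting `tokens`, `stop_flags`, and `norms` after each change is a quick way to check that a rule behaves as intended before it is saved with the model.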

 

 

Run as an external recommender for INCEpTION to generate annotation data
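A minimal sketch of what an external recommender service can look like, using only the Python standard library. It assumes the request shape described in the INCEpTION external recommender documentation as I understand it: a POST to `/predict` carrying `metadata`, `typeSystem`, and `document.xmi`, answered with a JSON object holding the annotated document, and a POST to `/train` answered with no content. These field names should be checked against the INCEpTION version in use. `annotate_xmi` is a hypothetical hook, not part of Cadet or INCEpTION, and here it returns the document unchanged.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def annotate_xmi(xmi, type_system, layer, feature):
    """Hypothetical hook: add suggestions to the document and return XMI.

    A real implementation would parse the XMI, run the spaCy model over
    the text, and add one annotation of type `layer` per predicted span.
    This placeholder returns the document unchanged.
    """
    return xmi


def build_predict_response(payload, annotate=annotate_xmi):
    """Turn a decoded /predict request body into a response body."""
    metadata = payload["metadata"]
    xmi = annotate(
        payload["document"]["xmi"],
        payload["typeSystem"],
        metadata["layer"],
        metadata["feature"],
    )
    return {"document": xmi}


class RecommenderHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        if self.path == "/predict":
            body = json.dumps(build_predict_response(payload)).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        elif self.path == "/train":
            # Training documents would be queued here; nothing is returned.
            self.send_response(204)
            self.end_headers()
        else:
            self.send_error(404)


def serve(host="127.0.0.1", port=5000):
    """Start the server; the host and port are placeholder values."""
    HTTPServer((host, port), RecommenderHandler).serve_forever()
```

Keeping `build_predict_response` separate from the HTTP handler makes it possible to test the suggestion logic with plain dictionaries, without starting a server.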

 

Actively update the spaCy model from annotations
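Before annotations can update a model, they have to be in a form spaCy can train on. The sketch below converts annotated character spans into the offset-based `(text, {"entities": [...]})` pairs that spaCy documents for named entity training data. The function name, the sample sentence, and the `LOC` label are assumptions for illustration, not Cadet's actual code.

```python
def to_training_example(text, spans):
    """Build a (text, {"entities": [...]}) pair from annotated spans.

    `spans` is an iterable of (start, end, label) character offsets.
    Raises ValueError for out-of-range or overlapping spans.
    """
    entities = []
    previous_end = 0
    for start, end, label in sorted(spans):
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if start < previous_end:
            raise ValueError(f"overlapping span: {(start, end, label)}")
        entities.append((start, end, label))
        previous_end = end
    return text, {"entities": entities}


text = "Млеко се продаје у Београду."
example = to_training_example(text, [(19, 27, "LOC")])
```

Rejecting overlapping or out-of-range spans at this stage catches annotation errors before they reach training.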

 

Debug and batch train on annotation data

 

Package and export customized spaCy model
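The export step can be approximated by hand: a spaCy pipeline is saved to a directory with `to_disk` and loaded back with `spacy.load`. The sketch below uses a blank "xx" pipeline and a temporary directory as stand-ins for a real customized model and output path. A saved directory like this is also the input that spaCy's `python -m spacy package` command expects when building an installable package.

```python
import tempfile
from pathlib import Path

import spacy

# A blank pipeline stands in for the customized model.
nlp = spacy.blank("xx")

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "custom_model"
    # Write the pipeline (metadata, vocabulary, tokenizer) to disk.
    nlp.to_disk(model_dir)
    saved_files = sorted(path.name for path in model_dir.iterdir())
    # Load it back to confirm the saved directory is usable.
    reloaded = spacy.load(model_dir)

reloaded_lang = reloaded.lang
```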

#TODO

Demo

  1. create project
  2. create custom language model
  3. serve the model to the web
  4. load INCEpTION and configure it to load suggestions from cadet.apjan.co
  5. add annotations

Cadet

By Andrew Janco

Disrupting Digital Monolingualism workshop, June 16, 2020
