The Samanantar example: putting it all together
1. Curate data from the web
Collect a large corpus of monolingual data in many languages (e.g. IndicCorp); a curation sketch follows below
Insert this into the data store with ULCA compliance*
*data partner's role
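A minimal sketch of this curation step, assuming hash-based deduplication and a crude script-range check as the language filter; the file names and the Devanagari heuristic are illustrative stand-ins, not the actual IndicCorp pipeline:

```python
# Deduplicate scraped sentences and keep only those written mostly in a
# target script, before ULCA-compliant insertion into the data store.
# File names and the Devanagari heuristic are illustrative assumptions.
import hashlib

def is_devanagari(text: str, threshold: float = 0.5) -> bool:
    """Crude filter: most alphabetic characters fall in the Devanagari block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    hits = sum(1 for c in letters if "\u0900" <= c <= "\u097f")
    return hits / len(letters) >= threshold

seen = set()
with open("crawl.txt", encoding="utf-8") as src, \
        open("hi_monolingual.txt", "w", encoding="utf-8") as out:
    for line in src:
        sent = line.strip()
        if not sent:
            continue
        digest = hashlib.sha1(sent.encode("utf-8")).hexdigest()
        if digest not in seen and is_devanagari(sent):
            seen.add(digest)
            out.write(sent + "\n")
```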
2. Mine parallel sentences from corpora
Egress data with ULCA compliance
Engineer a system to mine parallel sentences (e.g. Samanantar); a mining sketch follows below
Insert the mined parallel sentences with ULCA compliance
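The core idea behind Samanantar-style mining is to embed sentences from both languages into a shared space (LaBSE) and keep cross-lingual pairs whose similarity clears a threshold. A minimal sketch with sentence-transformers; the toy sentences and the 0.8 threshold are assumptions, and the real pipeline adds FAISS indexing and margin-based scoring at scale:

```python
# Embed both sides with LaBSE and keep nearest neighbours above a
# similarity threshold. Toy data and threshold are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

en_sents = ["The weather is nice today.", "I am reading a book."]
hi_sents = ["आज मौसम अच्छा है।", "मैं एक किताब पढ़ रहा हूँ।"]

en_emb = model.encode(en_sents, convert_to_tensor=True, normalize_embeddings=True)
hi_emb = model.encode(hi_sents, convert_to_tensor=True, normalize_embeddings=True)

scores = util.cos_sim(en_emb, hi_emb)  # len(en) x len(hi) similarity matrix
THRESHOLD = 0.8
for i in range(len(en_sents)):
    j = int(scores[i].argmax())        # best candidate for sentence i
    if scores[i][j] >= THRESHOLD:
        print(en_sents[i], "|||", hi_sents[j], float(scores[i][j]))
```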
3. Evaluate quality of the parallel dataset
Sample data from the parallel corpus with ULCA compliance
Standardise a tool (e.g. Karya) and metrics (e.g. SemEval) for evaluating semantic similarity
Collect annotations from human effort; an aggregation sketch follows below
Insert the annotated data back into the data repository with ULCA compliance
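A sketch of aggregating the human judgements, assuming a SemEval-style 0-5 semantic-similarity scale; the cutoffs and in-memory records are illustrative, and a real deployment would read Karya exports instead:

```python
# Aggregate per-pair annotator scores: accept high-scoring pairs, drop
# low ones, and flag high-disagreement pairs for re-annotation.
from statistics import mean, stdev

# sentence-pair id -> scores from three annotators on a 0-5 scale
annotations = {
    "pair-001": [4, 5, 4],
    "pair-002": [1, 2, 1],
    "pair-003": [3, 5, 1],  # annotators disagree strongly
}

KEEP_CUTOFF = 3.0    # mean score needed to accept a mined pair
MAX_STDEV = 1.5      # above this, send the pair back for review

for pair_id, scores in annotations.items():
    m, s = mean(scores), stdev(scores)
    verdict = "review" if s > MAX_STDEV else ("keep" if m >= KEEP_CUTOFF else "drop")
    print(pair_id, round(m, 2), round(s, 2), verdict)
```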
4. Train AI models
Access training data with ULCA compliance
Train AI models (e.g. IndicTrans); a training sketch follows below
Insert the model card and API access into the repository under ULCA compliance
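IndicTrans itself was trained with fairseq on the full mined corpus; purely as an illustration of this step, here is a minimal fine-tuning sketch with Hugging Face transformers, where the checkpoint, the toy pair, and the hyperparameters are all stand-in assumptions:

```python
# Fine-tune a seq2seq translation model on the mined pairs. The checkpoint
# and single training pair are illustrative stand-ins only.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

ckpt = "Helsinki-NLP/opus-mt-en-hi"  # assumed public en-hi checkpoint
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)

pairs = Dataset.from_dict({
    "src": ["The weather is nice today."],
    "tgt": ["आज मौसम अच्छा है।"],
})

def preprocess(batch):
    enc = tok(batch["src"], truncation=True, max_length=128)
    enc["labels"] = tok(text_target=batch["tgt"], truncation=True,
                        max_length=128)["input_ids"]
    return enc

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="out", num_train_epochs=1,
                                  per_device_train_batch_size=8),
    train_dataset=pairs.map(preprocess, batched=True,
                            remove_columns=["src", "tgt"]),
    data_collator=DataCollatorForSeq2Seq(tok, model=model),
)
trainer.train()
```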
5. Create benchmarks for MT
Sample data from the parallel corpus with ULCA compliance; a sampling sketch follows below
Standardise a tool (e.g. Karya) and rules (e.g. no NMT tool to be used)
Collect benchmark translations from human effort
Insert the data back into the data repository with ULCA compliance
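A small sketch of the sampling idea: draw a balanced benchmark set across domains, then hand those sentences to annotators working under the agreed rules. The domains, counts, and in-memory corpus are illustrative assumptions:

```python
# Draw a reproducible, domain-balanced sample of source sentences to send
# to human translators. Domains and counts are illustrative.
import random

corpus = {
    "news":   ["news sentence 1", "news sentence 2", "news sentence 3"],
    "health": ["health sentence 1", "health sentence 2", "health sentence 3"],
    "legal":  ["legal sentence 1", "legal sentence 2", "legal sentence 3"],
}

random.seed(0)       # fixed seed so the benchmark draw is reproducible
PER_DOMAIN = 2
benchmark_sources = [
    s for sents in corpus.values() for s in random.sample(sents, PER_DOMAIN)
]
print(benchmark_sources)
```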
6. Auto-evaluate AI models on benchmarks
Both benchmarks and AI models are already in the repository
Standardise metrics for evaluating AI models (e.g. BLEU, Prism, ...); a scoring sketch follows below
Publish the results on a public/private leaderboard
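Once both model outputs and benchmark references live in the repository, scoring can be fully automatic. A minimal BLEU sketch with sacrebleu; the toy hypothesis and reference are illustrative, and Prism or other metrics would slot in the same way:

```python
# Corpus-level BLEU with sacrebleu; the score can feed a leaderboard entry.
# Hypotheses and references here are toy examples.
import sacrebleu

hypotheses = ["आज मौसम अच्छा है।"]
references = [["आज का मौसम अच्छा है।"]]  # one reference stream, one ref per hypothesis

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 2))
```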
7. Evaluate AI models on tools
Sample sentences from the repository with ULCA compliance
Standardise tools (e.g. Karya with INMT) and metrics (e.g. average post-edit time); a metric sketch follows below
Insert the results into model cards with ULCA compliance
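The tool-side metric named above can be computed straight from interaction logs: how long a translator spends fixing each model suggestion inside a tool like Karya with INMT. The log records and model names below are illustrative assumptions:

```python
# Average post-edit time per model from (sentence, model, seconds) logs.
# Records and model names are illustrative.
from statistics import mean

edit_logs = [
    ("s1", "indictrans", 12.4),
    ("s2", "indictrans", 8.9),
    ("s1", "baseline", 21.7),
    ("s2", "baseline", 18.2),
]

by_model = {}
for _, model_id, secs in edit_logs:
    by_model.setdefault(model_id, []).append(secs)

for model_id, times in sorted(by_model.items()):
    print(model_id, "average post-edit time:", round(mean(times), 1), "s")
```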