The Samanantar example

Putting it all together

1. Curate data from the web

Collect a large corpus of monolingual data in many languages (e.g. IndicCorp); see the curation sketch below

Insert this into the data store with ULCA compliance*

*data partner's role
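A minimal sketch of this curation step, assuming a plain-text crawl. The file name, cleaning rules, and MD5 deduplication are illustrative placeholders, not IndicCorp's actual pipeline:

```python
import hashlib

def clean(line: str) -> str:
    """Normalise whitespace; real pipelines also strip markup and boilerplate."""
    return " ".join(line.split())

def curate(raw_lines, min_words=3):
    """Yield deduplicated, cleaned sentences from scraped text."""
    seen = set()
    for line in raw_lines:
        sent = clean(line)
        if len(sent.split()) < min_words:
            continue  # drop fragments and empty lines
        digest = hashlib.md5(sent.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact-duplicate removal
        seen.add(digest)
        yield sent

# "scraped_hi.txt" is a placeholder for crawled monolingual text.
with open("scraped_hi.txt", encoding="utf-8") as f:
    corpus = list(curate(f))
# The curated corpus is then inserted into the data store in a
# ULCA-compliant format (the data partner's role).
```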

2. Mine parallel sentences from corpora

Egress data with ULCA compliance

Engineer a system to mine parallel sentences (e.g. Samanantar); see the mining sketch below

Insert parallel sentences with ULCA compliance
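Samanantar mines parallel sentences by embedding both sides with LaBSE and matching nearest neighbours across languages. A minimal sketch of that idea with sentence-transformers; the toy sentences and the 0.8 threshold are placeholders, and the real pipeline runs FAISS at scale with margin-based scoring:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

en_sents = ["The weather is nice today.", "I like reading books."]
hi_sents = ["मुझे किताबें पढ़ना पसंद है।", "आज मौसम अच्छा है।"]

# LaBSE maps sentences from different languages into one shared vector space.
en_emb = model.encode(en_sents, normalize_embeddings=True)
hi_emb = model.encode(hi_sents, normalize_embeddings=True)

# With normalised embeddings, cosine similarity is a plain dot product.
sim = en_emb @ hi_emb.T

threshold = 0.8  # illustrative cut-off, not Samanantar's tuned value
for i, row in enumerate(sim):
    j = int(np.argmax(row))
    if row[j] >= threshold:
        print(en_sents[i], "|||", hi_sents[j], f"(sim={row[j]:.2f})")
```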

3. Evaluate quality of parallel dataset

Sample data from the parallel corpus with ULCA compliance (see the sampling sketch below)

Standardise a tool (e.g. Karya) and metrics (e.g. SemEval) for evaluating semantic similarity

Collect annotations through human effort

Insert data back into the data repository with ULCA compliance
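A sketch of drawing an audit sample for human semantic-similarity rating (e.g. on a SemEval-style scale in a tool like Karya). The file names, column layout, and sample size of 500 are assumptions:

```python
import csv
import random

random.seed(0)  # fixed seed so the audit sample is reproducible

# "mined_pairs.tsv" (source<TAB>target) is a placeholder file name.
with open("mined_pairs.tsv", encoding="utf-8") as f:
    pairs = list(csv.reader(f, delimiter="\t"))

sample = random.sample(pairs, k=min(500, len(pairs)))

with open("annotation_batch.tsv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["source", "target", "similarity_rating"])
    for src, tgt in sample:
        writer.writerow([src, tgt, ""])  # rating filled in by annotators
```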

4. Train AI models

Access training data with ULCA compliance

Train AI models (e.g. IndicTrans); see the fine-tuning sketch below

Insert the model card and API access into the repository with ULCA compliance
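A minimal fine-tuning sketch using Hugging Face Transformers. The Helsinki-NLP/opus-mt-en-hi checkpoint and the two toy pairs stand in for a real run; IndicTrans itself was trained at far larger scale with its own tooling:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "Helsinki-NLP/opus-mt-en-hi"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Toy parallel pairs; a real run streams the mined corpus in batches.
pairs = [("How are you?", "आप कैसे हैं?"),
         ("Thank you.", "धन्यवाद।")]

model.train()
for src, tgt in pairs:
    batch = tok(src, text_target=tgt, return_tensors="pt")
    loss = model(**batch).loss  # cross-entropy over target tokens
    loss.backward()
    opt.step()
    opt.zero_grad()
```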

5. Create benchmarks for MT

Sample data from the parallel corpus with ULCA compliance (see the filtering sketch below)

Standardise a tool (e.g. Karya) and rules (e.g. no NMT tool to be used)

Collect benchmark translations through human effort

Insert data back into the data repository with ULCA compliance
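Benchmark sentences should not appear in the data the models were trained on. A sketch of that overlap filter before sending candidates for human translation; the file names, and the filter itself as a QC step, are assumptions rather than something the deck specifies:

```python
def load_set(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

# Placeholder file names for the training sources and sampled candidates.
train = load_set("train_sources.txt")
candidates = load_set("benchmark_candidates.txt")

benchmark_sources = sorted(candidates - train)  # drop any overlap
with open("benchmark_batch.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(benchmark_sources))
# These sentences then go to human translators (no NMT tools allowed).
```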

6. Auto-evaluate AI models on benchmarks

Both benchmarks and AI models are in the repository

Standardise metrics for evaluating AI models (e.g. BLEU, Prism, ...); see the BLEU sketch below

Publish the results on a public/private leaderboard
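A minimal sketch of the auto-evaluation step with sacreBLEU; the hypothesis and reference strings are toy placeholders:

```python
import sacrebleu

hypotheses = ["the weather is nice today"]
references = [["the weather is good today"]]  # one list per reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
# Scores like this, computed per model per benchmark, feed the leaderboard.
```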

7. Evaluate AI models on tools

Sample sentences from the repository with ULCA compliance

Standardise tools (e.g. Karya with INMT) and metrics (e.g. average post-edit time; sketch below)

Insert data into model cards with ULCA compliance
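A sketch of the "average post-edit time" metric: the seconds annotators spend correcting a model's draft translation in a tool like Karya with INMT. The CSV log format and column name are assumptions:

```python
import csv
from statistics import mean

def avg_post_edit_time(log_path: str) -> float:
    """Average seconds per sentence spent post-editing model output."""
    with open(log_path, encoding="utf-8") as f:
        times = [float(row["edit_seconds"]) for row in csv.DictReader(f)]
    return mean(times)

# "karya_log.csv" is a placeholder for the annotation tool's export.
print(f"avg post-edit time: {avg_post_edit_time('karya_log.csv'):.1f}s")
# Lower is better: a stronger model needs fewer human corrections.
```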
