the Samanantar example

Putting it all together

1. Curate data from the web

Collect a large corpus of monolingual data in many languages (e.g. IndicCorp)

Insert this into the data store with ULCA compliance*

*data partner's role

2. Mine parallel sentences from corpora

Egress data with ULCA compliance

Engineer a system to mine parallel sentences (e.g. Samanantar)

Insert parallel sentences with ULCA compliance
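
The mining step above can be sketched as a mutual nearest-neighbour search over sentence embeddings. This is only an illustrative sketch, not the Samanantar pipeline itself: the toy 3-dimensional vectors stand in for embeddings from a multilingual sentence encoder, and `mine_pairs`, the `threshold` value and the example sentences are all assumptions made for the illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mine_pairs(src, tgt, threshold=0.8):
    """src, tgt: lists of (sentence, embedding).

    Keep a pair only if each sentence is the other's nearest neighbour
    and their similarity clears the (hypothetical) threshold.
    """
    if not src or not tgt:
        return []
    sims = [[cosine(se, te) for _, te in tgt] for _, se in src]
    best_tgt = [max(range(len(tgt)), key=lambda j: row[j]) for row in sims]
    best_src = [max(range(len(src)), key=lambda i: sims[i][j])
                for j in range(len(tgt))]
    pairs = []
    for i, j in enumerate(best_tgt):
        if best_src[j] == i and sims[i][j] >= threshold:
            pairs.append((src[i][0], tgt[j][0]))
    return pairs

# Toy embeddings (assumed); a real system would compute them with an encoder.
english = [("The sky is blue.", [0.9, 0.1, 0.0]),
           ("I like tea.", [0.0, 0.2, 0.9])]
hindi = [("मुझे चाय पसंद है।", [0.1, 0.1, 0.95]),
         ("आसमान नीला है।", [0.85, 0.2, 0.05]),
         ("आज सोमवार है।", [0.5, 0.5, 0.5])]

pairs = mine_pairs(english, hindi)
for en, hi in pairs:
    print(en, "<->", hi)
```

The brute-force similarity matrix is quadratic in corpus size, so at web scale it would need to be replaced by an approximate nearest-neighbour index.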

3. Evaluate quality of parallel dataset

Sample data from the parallel corpus with ULCA compliance

Standardise a tool (e.g. Karya) and metrics (e.g. SemEval) for evaluating semantic similarity

Collect annotations from human annotators

Insert data back into the data repository with ULCA compliance
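
A minimal sketch of the sample-and-annotate loop above, assuming annotators rate each sentence pair on the 0 to 5 scale used in the SemEval semantic textual similarity tasks. The function names, the pair ids and the ratings are hypothetical; nothing here reflects the actual Karya tooling.

```python
import random
from statistics import mean

def sample_pairs(corpus, k, seed=0):
    """Draw a reproducible random sample of k items for annotation."""
    rng = random.Random(seed)
    return rng.sample(corpus, min(k, len(corpus)))

def aggregate(annotations):
    """annotations: list of (pair_id, score) with score in 0..5.

    Returns the mean score per pair across annotators.
    """
    by_pair = {}
    for pair_id, score in annotations:
        if not 0 <= score <= 5:
            raise ValueError(f"score out of range for {pair_id}: {score}")
        by_pair.setdefault(pair_id, []).append(score)
    return {pair_id: mean(scores) for pair_id, scores in by_pair.items()}

# Hypothetical ratings from several annotators.
ratings = [("p1", 5), ("p1", 4), ("p2", 1), ("p2", 2), ("p2", 0)]
per_pair = aggregate(ratings)
corpus_quality = mean(per_pair.values())
print(per_pair, corpus_quality)
```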

4. Train AI models

Access training data with ULCA compliance

Train AI models (e.g. IndicTrans)

Insert the model card and API access into the repository with ULCA compliance

5. Create benchmarks for MT

Sample data from the parallel corpus with ULCA compliance

Standardise a tool (e.g. Karya) and rules (e.g. no NMT tool to be used)

Collect benchmarks from human translators

Insert data back into the data repository with ULCA compliance

6. Auto-evaluate AI models on benchmarks

Both benchmarks and AI models are in the repository

Standardise metrics for evaluating AI models (e.g. BLEU, Prism, ...)

Publish the results on a public/private leaderboard
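
To make the auto-evaluation step concrete, here is a rough sketch of corpus-level BLEU (up to 4-grams, one reference per sentence, whitespace tokenisation, no smoothing) followed by a leaderboard sort. The model names and sentences are made up for illustration, and a production setup would more likely rely on a standard BLEU implementation with fixed tokenisation so that scores are comparable across submissions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """Unsmoothed corpus BLEU on a 0-100 scale, one reference per hypothesis."""
    matches, totals = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            matches[n - 1] += sum((hc & rc).values())  # clipped matches
            totals[n - 1] += sum(hc.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalise output shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)

references = ["the cat sat on the mat",
              "it is raining heavily in the city today"]
# Hypothetical model outputs pulled from the repository.
outputs = {
    "model_a": ["the cat sat on the mat",
                "it is raining heavily in the city today"],
    "model_b": ["the cat sat on a mat",
                "it is raining in the city today"],
}

leaderboard = sorted(
    ((name, corpus_bleu(hyps, references)) for name, hyps in outputs.items()),
    key=lambda row: row[1], reverse=True)
for rank, (name, score) in enumerate(leaderboard, start=1):
    print(rank, name, round(score, 1))
```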

7. Evaluate AI models on tools

Sample sentences from the repository with ULCA compliance

Standardise tools (e.g. Karya with INMT) and metrics (e.g. average post-edit time)

Insert data into model cards with ULCA compliance
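
As a small illustration of the tool-based metric above, average post-edit time can be computed per model from logged editing sessions. The log format, model names and timings below are assumptions for the sketch, not the format of any actual tool.

```python
from statistics import mean

def average_post_edit_time(sessions):
    """sessions: list of (model_name, seconds spent post-editing one sentence).

    Returns the mean post-edit time per model; lower is better.
    """
    by_model = {}
    for model, seconds in sessions:
        by_model.setdefault(model, []).append(seconds)
    return {model: mean(times) for model, times in by_model.items()}

# Hypothetical session log.
log = [("model_a", 12.0), ("model_a", 8.0),
       ("model_b", 20.0), ("model_b", 25.0), ("model_b", 15.0)]
post_edit = average_post_edit_time(log)
print(post_edit)
```

The resulting per-model figures are the kind of value that would be written back into each model card.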

Samanantar slides

By One Fourth Labs

The AI4Bharat Initiative
