Putting it all together: the Samanantar example
1. Curate data from the web
Collect a large corpus of monolingual data in many languages (e.g. IndicCorp)
Insert this into the data store with ULCA compliance* (*the data partner's role)
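A minimal sketch of what such an insertion record might look like. The field names here are illustrative assumptions, not the actual ULCA schema:

```python
import json

# Hypothetical, simplified record for one monolingual corpus entry;
# field names are illustrative, not the real ULCA submission schema.
def make_record(text, lang, source, license_name):
    return {
        "text": text,
        "languageCode": lang,        # e.g. a BCP-47 code such as "hi"
        "collectionSource": source,  # where the sentence was collected from
        "license": license_name,     # the data partner asserts licensing terms
    }

record = make_record("यह एक वाक्य है", "hi", "news-crawl", "cc-by-4.0")
print(json.dumps(record, ensure_ascii=False))
```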
2. Mine parallel sentences from corpora
Egress data with ULCA compliance
Engineer a system to mine parallel sentences (e.g. Samanantar)
Insert parallel sentences with ULCA compliance
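The mining step can be sketched with toy vectors: embed sentences from both languages into a shared space, then keep cross-lingual pairs whose similarity clears a threshold. The vectors and threshold below are made up for illustration; a real system would use multilingual sentence embeddings over millions of sentences:

```python
from math import sqrt

def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Toy "embeddings" standing in for multilingual sentence vectors.
en = {"the cat sat": [0.9, 0.1, 0.0], "good morning": [0.1, 0.9, 0.1]}
hi = {"बिल्ली बैठी": [0.88, 0.15, 0.02], "शुभ प्रभात": [0.12, 0.85, 0.1]}

# For each English sentence, keep the best Hindi match above a threshold.
THRESHOLD = 0.8
pairs = []
for e_text, e_vec in en.items():
    h_text, h_vec = max(hi.items(), key=lambda kv: cos(e_vec, kv[1]))
    if cos(e_vec, h_vec) >= THRESHOLD:
        pairs.append((e_text, h_text))

print(pairs)
```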
3. Evaluate quality of the parallel dataset
Sample data from the parallel corpus with ULCA compliance
Standardise a tool (e.g. Karya) and metrics (e.g. SemEval) for evaluating semantic similarity
Collect annotations from human effort
Insert data back into the data repository with ULCA compliance
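Aggregating the human annotations might look like this. The records and the 0-4 rating scale (SemEval STS style) are assumptions for illustration:

```python
# Hypothetical annotation records: each sampled pair is rated by several
# annotators on a 0-4 semantic-similarity scale (SemEval STS style).
ratings = {
    ("sent_en_1", "sent_hi_1"): [4, 4, 3],
    ("sent_en_2", "sent_hi_2"): [1, 2, 1],
}

def mean(xs):
    return sum(xs) / len(xs)

# Keep only pairs whose average rating clears a quality bar (assumed value).
QUALITY_BAR = 3.0
accepted = [pair for pair, rs in ratings.items() if mean(rs) >= QUALITY_BAR]
print(accepted)
```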
4. Train AI models
Access training data with ULCA compliance
Train AI models (e.g. IndicTrans)
Insert the model card and API access into the repository under ULCA compliance
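A sketch of the model card that would be inserted into the repository. Every field name, identifier, and URL here is a placeholder assumption, not the actual ULCA model-card schema:

```python
import json

# Illustrative model card; all fields are assumptions, not the ULCA schema.
model_card = {
    "name": "indictrans-en-hi",         # hypothetical model identifier
    "task": "translation",
    "languages": {"source": "en", "target": "hi"},
    "trainingData": "parallel-corpus-v1",                      # placeholder tag
    "inferenceEndpoint": "https://example.org/api/translate",  # placeholder URL
}
print(json.dumps(model_card, indent=2))
```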
5. Create benchmarks for MT
Sample data from the parallel corpus with ULCA compliance
Standardise a tool (e.g. Karya) and rules (e.g. no NMT tool to be used)
Collect benchmarks from human effort
Insert data back into the data repository with ULCA compliance
6. Auto-evaluate AI models on benchmarks
Both the benchmarks and the AI models are in the repository
Standardise metrics for evaluating AI models (e.g. BLEU, Prism, ...)
Publish the results on a public/private leaderboard
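To make the metric step concrete, here is a deliberately minimal BLEU-like score: unigram precision with a brevity penalty. Real BLEU combines clipped 1- to 4-gram precisions and is normally computed with a standard implementation, so treat this only as a sketch of the idea:

```python
from collections import Counter
from math import exp

def unigram_bleu(hypothesis, reference):
    """Clipped unigram precision with a brevity penalty: a minimal
    stand-in for BLEU (real BLEU uses 1- to 4-gram precisions)."""
    hyp, ref = hypothesis.split(), reference.split()
    overlap = sum((Counter(hyp) & Counter(ref)).values())  # clipped counts
    precision = overlap / len(hyp)
    # Penalise hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else exp(1 - len(ref) / len(hyp))
    return bp * precision

score = unigram_bleu("the cat sat on the mat", "the cat is on the mat")
print(round(score, 3))  # 5 of 6 hypothesis words match the reference
```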
7. Evaluate AI models on tools
Sample sentences from the repository with ULCA compliance
Standardise tools (e.g. Karya with INMT) and metrics (e.g. average post-edit time)
Insert data into model cards with ULCA compliance
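The average post-edit time metric reduces to a simple aggregation over the tool's timing logs. The model names and timings below are invented for illustration:

```python
# Hypothetical tool logs: seconds an annotator spent post-editing each
# machine-translated sentence in the interface (e.g. Karya with INMT).
post_edit_seconds = {
    "model_a": [12.0, 8.5, 15.0],
    "model_b": [25.0, 30.5, 22.0],
}

def average_post_edit_time(times):
    return sum(times) / len(times)

# A lower average suggests the model's output needed less human correction.
averages = {m: average_post_edit_time(t) for m, t in post_edit_seconds.items()}
for model in sorted(averages):
    print(model, round(averages[model], 2))
```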
Samanantar slides
By One Fourth Labs
The AI4Bharat Initiative