Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Kumar Deepak, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, Mitesh Shantadevi Khapra

Given India's rich linguistic diversity, language technology is essential to magnify the reach and impact of AI solutions


The social view

Touch Points: Digital

Users: Multilingual

Language technology is absolutely essential to magnify reach and impact in the social sector

The commercial view

Multilingual chatbots

Sentiment Analysis

Content Moderation

Code mixed song search

Speech

QA

Multilingual Authoring Tools

... And the demand for these tools is increasing

A concrete use case

Original judgements are written in English

Translation essential for better information dissemination

Manual translation is tedious and time consuming


22 constitutional languages

What's the problem?


Well, just use an automatic translation system!

No free, open-source, accurate translation system for Indic languages!

What would it take to solve this problem?

DATA: A large number of parallel sentences between En and Indic languages

MODELS: Large-scale models with (minor) innovations specific to Indic languages

COMPUTE*: Modern infrastructure to train large-scale models with millions of data points

This talk covers the DATA and MODELS components.

*Most large-scale models (T5, mT5, mBART) come out of deep tech companies


How did we address the data problem?

Principle: Curate data from the web; manual collection is too expensive and time-consuming

WEB SOURCES

Comparable, machine-readable sources: e.g., the same news article published in En and हि

Comparable, non-machine-readable sources: e.g., scanned En and हि documents

Non-comparable sources: monolingual corpora (billions of tokens) in En, हि, ગુ, ...

How did we address the data problem?

Comparable, machine-readable sources: news articles published in both En (e.g., https://mykhel.com/) and हि (e.g., https://hindi.mykhel.com/); candidate sentence pairs are matched in a shared multilingual vector space*.

24 such news sources considered in this work, with data from 2010 onwards

* Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, Wei Wang, Language-agnostic BERT Sentence Embedding, arXiv 2020
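As a rough illustration of this matching step (not the exact mining pipeline), the sketch below embeds English and Hindi sentences with a publicly available LaBSE checkpoint and keeps pairs whose cosine similarity clears a threshold; the model name, example sentences, and threshold are assumptions made for illustration.

```python
# Sketch: align sentences from a comparable En/Hi article pair using LaBSE.
# Assumes the sentence-transformers LaBSE checkpoint; the threshold is illustrative.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

en_sents = ["India won the match by five wickets.", "The series is now level at 1-1."]
hi_sents = ["भारत ने मैच पाँच विकेट से जीता।", "सीरीज़ अब 1-1 से बराबर है।"]

# L2-normalised embeddings, so the dot product equals cosine similarity.
en_vecs = model.encode(en_sents, normalize_embeddings=True)
hi_vecs = model.encode(hi_sents, normalize_embeddings=True)

sim = en_vecs @ hi_vecs.T        # (num_en, num_hi) similarity matrix
best = sim.argmax(axis=1)        # best Hindi match for each English sentence

THRESHOLD = 0.80                 # illustrative cut-off, not the paper's value
pairs = [(en_sents[i], hi_sents[j], float(sim[i, j]))
         for i, j in enumerate(best) if sim[i, j] >= THRESHOLD]
print(pairs)
```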

How did we address the data problem?

Comparable, non-machine-readable sources: parallel government documents, e.g., en_budget_2020.pdf and ta_budget_2020.pdf from https://tn.gov.in/, are first digitised with OCR and then matched in the shared multilingual vector space*.

Legislative proceedings from Tamil Nadu, Andhra Pradesh, Telangana, West Bengal, and Bangladesh

* Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, Wei Wang, Language-agnostic BERT Sentence Embedding, arXiv 2020
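A minimal sketch of the OCR step, assuming pdf2image and Tesseract with the English and Tamil language packs installed; the file names reuse the slide's examples, and the exact OCR tools used in the actual pipeline are not specified here.

```python
# Sketch: extract text from a scanned bilingual budget PDF pair before alignment.
# Assumes pdf2image + pytesseract with the 'eng' and 'tam' Tesseract language packs.
from pdf2image import convert_from_path
import pytesseract

def ocr_pdf(path: str, lang: str) -> str:
    """Render each PDF page to an image and run Tesseract OCR on it."""
    pages = convert_from_path(path, dpi=300)
    return "\n".join(pytesseract.image_to_string(page, lang=lang) for page in pages)

en_text = ocr_pdf("en_budget_2020.pdf", lang="eng")
ta_text = ocr_pdf("ta_budget_2020.pdf", lang="tam")

# The extracted text would then be sentence-split and aligned in the shared
# multilingual (LaBSE) vector space, as in the previous sketch.
print(en_text[:500])
print(ta_text[:500])
```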

How did we address the data problem?

* Jeff Johnson, Matthijs Douze, Hervé Jégou, Billion-scale similarity search with GPUs, arXiv 2019

IndicCorp contains from 1.3M (Assamese) to 100M (English) monolingual sentences across English and 11 Indic languages

FAISS Index* for efficient indexing, clustering, semantic matching, and retrieval of dense vectors (1000 sent./sec)

Brute-force search over all pairs (e.g., 100M En sentences x 64M Hi sentences) is infeasible
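A rough sketch of how such a nearest-neighbour search can be set up with FAISS; the flat inner-product index, dimensions, and random stand-in embeddings below are assumptions for illustration, not the paper's configuration.

```python
# Sketch: nearest-neighbour retrieval of Hindi candidates for English queries.
# Uses a flat inner-product index for clarity; billion-scale setups would use
# an approximate index (e.g., IVF/PQ) instead.
import faiss
import numpy as np

d = 768                                   # LaBSE embedding dimension
rng = np.random.default_rng(0)

# Stand-ins for LaBSE embeddings of the monolingual corpora.
hi_vecs = rng.standard_normal((10_000, d)).astype("float32")
en_vecs = rng.standard_normal((1_000, d)).astype("float32")

# Normalise so that inner product equals cosine similarity.
faiss.normalize_L2(hi_vecs)
faiss.normalize_L2(en_vecs)

index = faiss.IndexFlatIP(d)              # exact search; swap for IndexIVFPQ at scale
index.add(hi_vecs)

scores, ids = index.search(en_vecs, 4)    # top-4 Hindi candidates per English sentence
print(scores[0], ids[0])
```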

How much data did we collect?

33M parallel sentences mined from the web (3X improvement)

Qualitative Analysis

10,000 samples manually evaluated by 30+ annotators across 11 languages

Average rating of sentence pairs around 4.17 (min: 3.83, max: 4.82)

Quality depends on resource size (lowest for As, Or; highest for Hi, Bn)

LaBSE alignment scores are negatively correlated with sentence length
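As a toy illustration of how such a correlation could be checked (the arrays below are illustrative, not the paper's data or analysis code):

```python
# Sketch: checking how LaBSE alignment scores relate to sentence length.
# 'labse_scores' and 'sentence_lengths' are hypothetical values over mined pairs.
import numpy as np
from scipy.stats import pearsonr

labse_scores = np.array([0.92, 0.88, 0.81, 0.76, 0.71])   # illustrative values
sentence_lengths = np.array([8, 14, 23, 35, 48])           # tokens per source sentence

r, p_value = pearsonr(sentence_lengths, labse_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")          # negative r => longer sentences score lower
```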

What is the model that we use?

[Transformer encoder-decoder diagram: the Gujarati sentence 'હું અહીં છું' ('I am here') is represented in a single Devanagari script as 'हुं अहीं छुं'; stacked blocks of multi-headed self-attention with add-and-normalise layers feed a stack of decoder layers that produces the translation.]
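The single-script idea from the diagram (mapping all Indic-script text to Devanagari before training) can be sketched with the open-source Indic NLP Library; the example below is illustrative and assumes its UnicodeIndicTransliterator API.

```python
# Sketch: script unification for Indic languages, assuming the Indic NLP Library
# (https://github.com/anoopkunchukuttan/indic_nlp_library).
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

gu_sentence = "હું અહીં છું"  # Gujarati: "I am here"

# Map the Gujarati text to Devanagari so that all Indic languages share one script,
# enabling transfer across related languages and a smaller shared vocabulary.
dev_sentence = UnicodeIndicTransliterator.transliterate(gu_sentence, "gu", "hi")
print(dev_sentence)  # expected Devanagari form, e.g. "हुं अहीं छुं"
```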

Highlights

Joint multilingual model for 11 Indic languages

3 models: En-X, X-En and X-X 

Single script (enables transfer, reduces vocabulary)

6 encoder layers, 6 decoder layers, 16 heads/layer

Wide models (ff_dim=4096, embedding_dim=1536); see the configuration sketch below
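A minimal PyTorch sketch of an encoder-decoder transformer with these dimensions, for illustration only; the actual IndicTrans models were trained with a full seq2seq toolkit and include embeddings, positional encodings, and the shared single-script vocabulary not shown here.

```python
# Sketch: an encoder-decoder transformer with the dimensions listed above.
# Illustrative only; not the authors' training code.
import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=1536,            # embedding_dim
    nhead=16,                # attention heads per layer
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=4096,    # ff_dim
    batch_first=True,
)

# Toy forward pass with already-embedded source/target sequences.
src = torch.randn(2, 10, 1536)   # (batch, src_len, d_model)
tgt = torch.randn(2, 7, 1536)    # (batch, tgt_len, d_model)
out = model(src, tgt)
print(out.shape)                 # torch.Size([2, 7, 1536])
```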


How do our models perform?

Highlights

State-of-the-art performance

Gains are higher for low-resource languages

Impact

Deployed for translating Supreme Court judgements

“The quality of translations is significantly improved. I would say this is more so for the legal document where there were complicated sentences/cards which were translated very well. The syntax mostly did not falter even in the face of multiple ideas/information contained in one sentence.”


 

“The amount of time spent on correcting/improving the translation has dropped.”


 

“THIS IS VERY PROMISING. AMAZED BY THE SPEED.”


 

“I TOOK A PRINTOUT AND WENT THROUGH EVERY LINE. THE TRANSLATION IS 98% ACCURATE AND HIGHLY SATISFIED.”

 

Summary

Samanantar: Largest Parallel Corpus for Indic Languages

IndicTrans: State-of-the-art translation models for En-X and X-En

https://github.com/AI4Bharat/indicTrans

https://indicnlp.ai4bharat.org/samanantar/

Contact Us

Let's make India ready for the AI age

Anoop Kunchukuttan 
Mitesh M. Khapra
Pratyush Kumar
anoop.kunchukuttan@gmail.com
miteshk@cse.iitm.ac.in
pratyush@cse.iitm.ac.in

indicnlp.ai4bharat.org: a hub for Indic NLP resources, datasets and tools