Shared multilingual vector space*

* Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, Wei Wang, Language-agnostic BERT Sentence Embedding, arXiv 2020

24 such news sources considered in this work with data from 2010 onwards

En

https://mykhel.com/

हि

https://hindi.mykhel.com/

Jan 2020

https://tn.gov.in/

* Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, Wei Wang, Language-agnostic BERT Sentence Embedding, arXiv 2020

Legislative proceedings from Tamil Nadu, Andhra Pradesh, Telangana, West Bengal, Bangladesh

en_budget_2020.pdf

ta_budget_2020.pdf

https://tn.gov.in/

OCR

OCR

En

Shared multilingual vector space*

Jeff Johnson, Matthijs Douze, Hervé Jégou, Billion-scale similarity search with GPUs, arXiv, 2019

IndicCorp contains 1.3M (Assamese) to 100M (English) monolingual sentences for 11 Indic languages

FAISS Index* for efficient indexing, clustering, semantic matching, and retrieval of dense vectors (1000 sent./sec)

En

100M

Hi

64M

Brute force search (100M x 64M) is infeasible)

33M parallel sentences mined from the web (3X improvement)

Qualitative Analysis

10000 samples manually evaluated using 30+ annotators across 11 languages

Average rating of sentence pairs around 4.17 (min:3.83, max:4.82)

Quality depends on resource size (lowest for As, Or; highest for Hi, Bn)

LaBSE alignment scores is negatively correlated with sentence length

 હું    અહીં    છું  

I am here

हुं     अहीं    छुं 

Transformer

Highlights

Joint multilingual model for 11 Indic languages

3 models: En-X, X-En and X-X 

Single script (enables transfer, reduces vocabulary)

6 encoder layers, 6 decoder layers, 16 heads/layer

wide models (ff_dim=4096, embedding_dim=1536)

বা

हि

తె

ગુ

ਪੰ

Highlights

State of the art performance

Gains are higher for low resource languages

Deployed for translating Supreme Court judgements

The quality of translations is significantly improved. I would say this is more so for the legal document where there were complicated sentences/cards which were translated very well. The syntax mostly did not falter even in the face of multiple ideas/information contained in one sentence.”


 

“The amount of time spent on correcting/improving the translation has dropped.”


 

“THIS IS VERY PROMISING. AMAZED BY THE SPEED.”


 

“I TOOK A PRINTOUT AND WENT THROUGH EVERY LINE. THE TRANSLATION IS 98% ACCURATE AND HIGHLY SATISFIED.”

 

Samanantar: Largest Parallel Corpus for Indic Languages

IndicTrans: State of the art translation models for En-X and X-En

https://github.com/AI4Bharat/indicTrans

https://indicnlp.ai4bharat.org/samanantar/

Contact Us

Let's make India ready for the AI age

Anoop Kunchukuttan 
Mitesh M. Khapra
Pratyush Kumar
anoop.kunchukuttan@gmail.com
miteshk@cse.iitm.ac.in
pratyush@cse.iitm.ac.in

Temp-Samanantar

By One Fourth Labs

Temp-Samanantar

A hub for Indic NLP resources, datasets and tools

  • 284