The AI4Bharat Initiative @ IIT Madras

Mission statement of the AI4Bharat Initiative

Bring parity in AI technology for Indian languages with English

Make fundamental contributions to the state of the art across language technologies: NLP, speech, sign language, OCR

Open source all datasets and AI models with permissive licenses

Build a coalition of partners to collect datasets and deploy models

Our Team

Mitesh M. Khapra
Associate Professor, IIT Madras
PhD, IIT Bombay
Areas - NLP, Deep Learning

Pratyush Kumar
Researcher, Microsoft Research
Ex-Assistant Professor, IIT Madras
PhD, ETH Zürich
Areas - Deep Learning, Systems

Anoop Kunchukuttan
Researcher, Microsoft
PhD, IIT Bombay
Areas - NLP

+ many hard-working students and volunteers

A conversation between a farmer in Tamil Nadu and a wholesaler in Maharashtra

Farmer (Tamil): 100 கிலோ அரிசியை விற்க விரும்புகிறேன்
Translated into Marathi: मला १०० किलो तांदूळ विकायचा आहे
(I want to sell 100 kg of rice)

Wholesaler (Marathi): मी तुम्हाला प्रतिकिलो 100 रुपये देऊ शकतो
Translated into Tamil: நான் உனக்கு 1 கிலோவுக்கு 100 ரூ தரேன்
(I can give you 100 Rs per kg)

Speech and Language Technologies are absolutely essential to bridge the language divide in India

বা ગુ हि ಕ മ म ने ਪੰ த తె ار


Touch points: Digital

Users: Multilingual

Speech and language technologies are absolutely essential to magnify reach and impact in the social sector

What are the technology pieces that we need?

Machine Translation

Speech Recognition

Speech Synthesis

Natural Language Understanding

We are far behind other languages!

What is the modern AI recipe?

Huge amounts of data
+
Deep neural networks
+
Lots of compute power

(same recipe applicable for all tasks)

Have we tried this recipe? (Machine Translation)

Mined 33M parallel sentences from the web
+
Trained a ~1B parameter model*
+
Using state-of-the-art GPU infrastructure

*One of the few attempts outside of DeepTech companies

Has someone tasted our recipe?

Deployed for translating Supreme Court judgements

“The quality of translations is significantly improved. I would say this is more so for the legal document where there were complicated sentences/cards which were translated very well. The syntax mostly did not falter even in the face of multiple ideas/information contained in one sentence.”

“The amount of time spent on correcting/improving the translation has dropped.”

“THIS IS VERY PROMISING. AMAZED BY THE SPEED.”

“I TOOK A PRINTOUT AND WENT THROUGH EVERY LINE. THE TRANSLATION IS 98% ACCURATE AND HIGHLY SATISFIED.”

বা हि తె कॉ ગુ ने कों सं ਪੰ सिं اُر मै

22 constitutional languages

So have we solved the problem?

NO! (far from it)

n languages \rightarrow {n \choose 2} language pairs

for language in languages:
    Build
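
The growth from n languages to n-choose-2 pairs can be made concrete with a short sketch. The language codes below are hypothetical placeholders rather than a list taken from the slides, and the count is of unordered pairs; counting each translation direction separately would double it.

```python
from itertools import combinations
from math import comb


def language_pairs(languages):
    """Return every unordered pair of distinct languages."""
    return list(combinations(languages, 2))


# Hypothetical sample of language codes, for illustration only.
sample = ["hi", "ta", "mr", "bn"]
print(language_pairs(sample))

# Pair counts for 10 languages and for the 22 constitutional languages.
for n in (10, 22):
    print(n, "languages ->", comb(n, 2), "pairs")
```

Even a modest number of languages produces many pairs, which is why a per-pair effort does not scale.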
    

What are our milestones?

Timeline: 6 months, 2 years, 5 years

Pretrained MT, NLU and speech models for 10 Indian languages (building blocks)

Demo for Hindi-Tamil conversation

বা हि ગુ ਪੰ తె

Demo for 50+ language pairs

Set up "The AI4Bharat Initiative" @ IIT Madras

What do we need?

Estimated Cost: 15 Cr INR per year for 5 years

Data: 40 Cr
Team: 20 Cr
Infrastructure: 14 Cr
Outreach: 1 Cr
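
As a quick consistency check, the four budget heads can be summed and compared with the estimated annual cost. This is a minimal sketch that assumes each head is paired with the amount listed in the same position on the slide; the variable names are illustrative.

```python
# Five-year budget heads in crore INR (Cr), as listed on the slide.
budget_cr = {
    "Data": 40,
    "Team": 20,
    "Infrastructure": 14,
    "Outreach": 1,
}
YEARS = 5

total_cr = sum(budget_cr.values())
per_year_cr = total_cr / YEARS

print(f"Total over {YEARS} years: {total_cr} Cr")
print(f"Average per year: {per_year_cr} Cr")
```

Under that pairing, the heads add up to the stated 15 Cr per year over 5 years.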

Cost Breakdown: Data

1 million parallel sentences in each of 10 languages @ INR 20 per sentence: 20 Cr
5000 hours of transcribed speech data in each of 10 languages @ INR 2000 per hour: 10 Cr
NLU benchmarks with a million sentences annotated in each of 10 languages @ INR 10 per sentence: 10 Cr

Data collection activity will be equally spaced over the next 5 years
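
The data figures can be reproduced from the unit costs. The sketch below assumes that each quantity applies per language (so it is multiplied by 10) and uses 1 Cr = 10^7 INR; the item labels and helper name are illustrative.

```python
CRORE = 10_000_000  # 1 Cr = 10^7 INR
LANGUAGES = 10

# (item, units per language, cost per unit in INR), as quoted on the slide.
items = [
    ("Parallel sentences", 1_000_000, 20),
    ("Transcribed speech (hours)", 5_000, 2_000),
    ("NLU benchmark sentences", 1_000_000, 10),
]


def cost_in_crore(units_per_language, inr_per_unit, languages=LANGUAGES):
    """Total cost in crore INR, assuming the quantity applies to each language."""
    return units_per_language * inr_per_unit * languages / CRORE


costs = {name: cost_in_crore(units, rate) for name, units, rate in items}
for name, cr in costs.items():
    print(f"{name}: {cr:g} Cr")
print(f"Total: {sum(costs.values()):g} Cr")
```

The per-language reading is the one that matches the 20 Cr, 10 Cr and 10 Cr amounts on the slide.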

Cost Breakdown: Team (annual salary)

CTO (PhD + 3-5 years experience): 60 L
3 ML Researchers (PhD): 72 L
8 AI Residents (B.Tech): 48 L
4 ML Engineers (B.Tech + 2-5 years experience): 36 L
3 Principal Investigators: 120 L
COO: 24 L
5 Admin Staff: 12 L
Chief Evangelist Officer: 24 L

~4 Cr per year for the next 5 years
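
The annual salary total can be checked by summing the line items. In this sketch the pairing of the first four roles with their amounts is an assumption based on the order in which they appear on the slide; the total does not depend on that pairing.

```python
# Annual salary per line item in lakh INR (L); 100 L = 1 Cr.
# The role-to-amount pairing for the first four rows is assumed from slide order.
salaries_lakh = {
    "CTO": 60,
    "3 ML Researchers": 72,
    "8 AI Residents": 48,
    "4 ML Engineers": 36,
    "3 Principal Investigators": 120,
    "COO": 24,
    "5 Admin Staff": 12,
    "Chief Evangelist Officer": 24,
}
YEARS = 5

annual_cr = sum(salaries_lakh.values()) / 100
five_year_cr = annual_cr * YEARS

print(f"Annual salary bill: {annual_cr:.2f} Cr")
print(f"Over {YEARS} years: {five_year_cr:.1f} Cr")
```

The items sum to just under 4 Cr per year, which is consistent with the rounded 20 Cr team budget over 5 years.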

Cost Breakdown: Infrastructure

5 DGX A100 machines: 7.5 Cr
Office infrastructure (space, desktops, laptops, printers, etc.): 3 Cr
Cloud infrastructure (for storage, hosting services): 3 Cr
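
A similar sketch for the infrastructure items shows the implied cost per machine and the itemised total. The dictionary keys are illustrative labels; the itemised total comes to 13.5 Cr, which the summary budget appears to round up to 14 Cr.

```python
# Infrastructure line items in crore INR (Cr), as listed on the slide.
infrastructure_cr = {
    "5 DGX A100 machines": 7.5,
    "Office infrastructure": 3.0,
    "Cloud infrastructure": 3.0,
}
DGX_COUNT = 5

total_cr = sum(infrastructure_cr.values())
per_dgx_cr = infrastructure_cr["5 DGX A100 machines"] / DGX_COUNT

print(f"Infrastructure total: {total_cr} Cr")
print(f"Implied cost per DGX A100 machine: {per_dgx_cr} Cr")
```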

Cost Breakdown: Outreach

10 workshops (2 per year): 1 Cr

Mission statement of the AI4Bharat Initiative

Lead the Language Technology Movement in the country

Open source all datasets and AI models with permissive licenses

Build a coalition of partners to collect datasets and deploy models

Datasets, Models, Tools

Boost the Indian academic ecosystem

Support to startups awaiting Indian language technology

Better e-governance and dissemination

Magnify reach and impact in social sector

Why is this a worthy mission?

- 22 official languages

- 1.3 billion speakers

By 2021, 75% of Indian internet users will use Indian languages*

*Source: Google-KPMG report, https://assets.kpmg/content/dam/kpmg/in/pdf/2017/04/Indian-languages-Defining-Indias-Internet.pdf

Rich demand for speech and text technology

કેમ છો (Gujarati: "How are you?")

कैसे हैं (Hindi: "How are you?")

Chat applications

Digital entertainment

Social media platforms

Digital news

Digital write-ups

Digital payments

e-governance services

e-commerce services


Why is this problem hard?


বা हि తె कॉ ગુ ने कों सं ਪੰ सिं اُر मै

Scale and Diversity

Unique language phenomena

mujhe bahut confusion hai (code-mixed Hindi-English: "I am very confused")

Scarcity of resources

Lack of basic speech and NLP tools

Named Entity Recognition

Sentiment Analysis

Topic Classification

Content Filters

Keyboards

Spell checkers

Longer utterances

हि

EN

Why is this problem hard?


Tasks

Languages

Domains

Multiplicity of languages, tasks and domains

... ... ...

বা हि తె कॉ ગુ ने कों सं ਪੰ सिं اُر मै En

Why won't a startup solve this?


Tasks

বা हि తె कॉ ગુ ने कों सं ਪੰ सिं اُر मै

Languages

Domains

Developing the full stack for multiple languages in multiple domains is the key!

En

... ... ...


Why are we uniquely positioned to solve this problem?

Proven academic track record

100+ papers

30+ patents

Delivery-oriented industry experience

Unique blend of research, engineering and systems skills

Research Excellence Awards

How will we ensure adoption and impact?

Partner with GOI's National Language Technology Mission

Partner with NGOs to enable last mile delivery/adoption of tools

Conduct biannual "Get started with Indian Language Technology" workshops for startups, academia and industry
