The AI4Bharat Initiative @ IIT Madras

Mission statement of the Initiative

Bring parity in AI technology for Indian languages with English

Make fundamental contributions to state-of-the-art across language technologies - NLP, Speech, Sign language, OCR

Open source all datasets and AI models with permissive licenses

Build a coalition of partners to collect datasets and deploy models

Our Team

Mitesh M. Khapra
Associate Professor, IIT Madras
PhD, IIT Bombay
Areas - NLP, Deep Learning

Pratyush Kumar
Assistant Professor, IIT Madras
PhD, ETH Zürich
Areas - Deep Learning, Systems

Anoop Kunchukuttan
Researcher, Microsoft
PhD, IIT Bombay
Areas - NLP

+ many hard-working students and volunteers

What have we done so far? (last 18 months)

IndicNLPSuite: Corpora, benchmarks, models for 11 Indic languages

Input Tools: Romanized keyboards for under-represented languages

INCLUDE: Datasets and efficient models for isolated Indian Sign Language

Samanantar: Parallel corpus, translation models between English & 11 Indic languages

What have we done so far?

IndicNLPSuite

Largest task-agnostic monolingual corpus

Pre-trained language models

Benchmark set on NLU tasks

9 billion tokens

14 tasks

10,000 downloads

Impact: Indian startups and academia can now use large pre-trained models to support downstream tasks such as sentiment analysis, question answering and semantic matching in 11 Indic languages

Across 11 Indian languages: বা ગુ हि ಕ म ਪੰ த తె ଓ മ অ

[Figure: pre-trained language model. Input sequence: [CLS], Tok 1 ... Tok N (Masked Sentence 1), [SEP], Tok 1 ... Tok M (Masked Sentence 2); input embeddings E_1 ... E_N, E_[SEP], E_1' ... E_M'; outputs T_1 ... T_N, T_[SEP], T_1' ... T_M']
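The input layout in the model diagram ([CLS], the tokens of the first masked sentence, [SEP], then the tokens of the second masked sentence) can be illustrated with a minimal sketch. This is not the IndicNLPSuite training code: the function name `pack_masked_pair`, the 15% masking rate, the seed, and the pre-tokenised inputs are all assumptions made for illustration.

```python
import random

CLS, SEP, MASK = "[CLS]", "[SEP]", "[MASK]"


def pack_masked_pair(sent_a, sent_b, mask_prob=0.15, seed=0):
    """Pack two tokenised sentences into one sequence and mask some tokens.

    Hypothetical helper for illustration only.
    Returns (tokens, labels): labels[i] holds the original token at every
    masked position and None everywhere else.
    """
    rng = random.Random(seed)
    # Layout follows the diagram: [CLS] Tok 1..N [SEP] Tok 1..M
    tokens = [CLS] + list(sent_a) + [SEP] + list(sent_b)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in (CLS, SEP):
            continue  # special tokens are never masked
        if rng.random() < mask_prob:
            labels[i] = tok      # what the model is asked to predict
            tokens[i] = MASK     # what the model actually sees
    return tokens, labels


if __name__ == "__main__":
    a = ["this", "is", "one", "sentence"]
    b = ["and", "here", "is", "another"]
    toks, labs = pack_masked_pair(a, b, mask_prob=0.3, seed=1)
    print(toks)
    print(labs)
```

During pre-training the model would be scored only on the positions where `labels` is not None, which is why the original tokens are kept alongside the masked sequence.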

What have we done so far?

Input Tools

User types using English script

Automatically converted to Maithili script

Impact: Input tools become significantly more efficient for a long list of Indian languages. Impacts all content creation including writing storybooks for children

Deployed to write storybooks at

What have we done so far?

INCLUDE

Impact: The first large-scale AI work on Indian Sign Language. Won the first AI4Accessibility global challenge run by Microsoft for scaling up data collection to continuous sign language through 2021-22

Largest public dataset on Indian Sign Language

263 signs, 4000+ videos

Efficient AI models for recognising signs

>90% accuracy

What have we done so far?

Samanantar

AI model for translating between English and 11 Indic languages

46M parallel sentences mined from the web (3X improvement)

Impact: The translation models (which are shown to be more accurate than commercial APIs) are being used to assist human translators in translating Supreme Court judgements with a significant increase in efficiency

What are some of our ongoing activities?

Creating a robust, standardised benchmark for Machine Translation

Mining high-quality speech-to-text data for building ASR models

Building models for continuous signs in video call apps

What next? (A long tail of NLP tasks)

What does it take to build a typical NLP application?

Full NLP stack

A chatbot for Aarogya Setu

Input Tools: Keyboards, Spell checkers, ....

Text Analysers: Sentiment Analysis, Content Filters, ....

Inference Engines: QA, Natural Language Inference, ....

Text Generators: Dialog, ....

Build this full stack for each Indic language

Text Generators: Translation, Dialog, Summarisation, ....

Inference Engines: QA, NLI, Paraphrase Detection, ....

Text Analysers: Named Entity Recognition, Sentiment Analysis, Topic Classification, Content Filters, ....

Input Tools: Keyboards, Spell checkers, ....

Track 1: Basic Building Blocks

Monolingual Corpus v 1.0

Pre-trained Model

Sep'20

v 2.0

Sep'21

IndicNLU Challenge: Benchmark for Indian languages

Mar'22

Impact

10,000 downloads

Mining translation pairs

Mining transliteration pairs

Building ASR systems

Building Input Tools

Track 2: Input Tools

(Potential) Impact

Creating stories

Dec'20

Project Karya (Indian Amazon Mechanical Turk)

Government crowdsourcing initiatives

Build a dataset for all 22 constitutionally recognised languages

Build a joint transliteration model for all 22 languages

Build romanised keyboards for all 22 languages

Build swipe based keyboards for all 22 languages

Dec'21

Mar'22

Jun'22

Sep'22

Track 3: Machine Translation

Impact

SUVAS (Translation of legal documents)

NPTEL translations

Universal Language Contribution API

IndicTranslate Challenge: A benchmark for 11 Indic languages

Dec'21

46M parallel sentences

Joint translation model competing with commercial systems

Apr'21

75M parallel sentences, 15 languages

Jun'22

A joint model for Indic-Indic translation v2.0

Mar'23

Track 4: Automatic Speech Recognition

Joint pre-trained models for 22 Indian languages

Sep'21

Dec'21

Mine and align speech-to-text parallel data for 22 Indian languages

ASR system for NPTEL videos

Dec'22

Dec'23

Speech interface for input tools

Track 5: Indian Sign Language Detection

2020

2021

2022

2023

Dataset

Benchmark

Model

Tools

green = eco-friendly = efficient

v 1.0, 9B tokens, 12 languages

v 1.0, 2 languages

v 1.0, 46M sentences, 12 languages

v 2.0, 20B tokens, 15 languages

v 1.0, 22 languages

v 2.0, 75M sentences, 15 languages

v 3.0, 25B tokens, 22 languages

v 1.0, 2000 hours, 22 languages

v 1.0, 22 languages

v 3.0, 15 languages, green translation

green language model

v 2.0, 4000 hours, 22 languages, green ASR Models

263 signs, 4000+ videos

swipe based keyboards, 22 languages

integration with video calling apps

Beyond 2023

What will we do?

A long tail of tasks and languages: Bundelkhandi, Gondi, Garhwali, ... ...

What will be our guiding principle?

Focus on delivery-oriented, cutting-edge research which is relevant to government, industry, academia and society

How will we sustain this?

Depend on a core set of mentors (Kris, EkStep) to broaden the reach of our work in sectors with a definite need (government, industry, startups)

What do we need?

Set up "The AI4Bharat Initiative" @

Yearly budget: 1M USD

House a best-in-class team of researchers, engineers and developers

Build capacity by conducting workshops for startups, industry and academia

Rent/set up world class compute infrastructure for building cutting edge AI models

Build datasets and benchmarks by partnering with annotation service providers
