The AI4Bharat Initiative @ IIT Madras

Mission statement of the Initiative

Bring parity in AI technology for Indian languages with English

Make fundamental contributions to state-of-the-art across language technologies - NLP, Speech, Sign language, OCR

Open source all datasets and AI models with permissible licenses

Build a coalition of partners to collect datasets and deploy models

Our Team

Mitesh M. Khapra
Associate Professor, IIT Madras
PhD, IIT Bombay
Areas - NLP, Deep Learning

Pratyush Kumar
Assistant Professor, IIT Madras
PhD, ETH Zürich
Areas - Deep Learning, Systems

Anoop Kunchukuttan
Researcher, Microsoft
PhD, IIT Bombay
Areas - NLP

+ many hard-working students and volunteers

What have we done so far (last 18 months)

IndicNLPSuite: Corpora, benchmarks, models for 11 Indic languages

Input Tools: Romanized keyboards for under-represented languages

INCLUDE: Datasets and efficient models for isolated Indian Sign Language

Samanantar: Parallel corpus, translation models between English & 11 Indic languages

[Figure: script icons for Bengali (বা), Gujarati (ગુ), Hindi (हि), Punjabi (ਪੰ) and Telugu (తె), shown alongside English (E)]

What have we done so far?

IndicNLPSuite

Largest task-agnostic monolingual corpus

Pre-trained language models

Benchmark set on NLU tasks

9 billion tokens

14 tasks

10,000 downloads

Impact: Indian startups and academia can now use large pre-trained models to support downstream tasks such as sentiment analysis, question answering and semantic matching in 11 Indic languages

Across 11 Indian languages:

[Figure: pre-trained language model diagram. Input row: [CLS], Tok 1 ... Tok N, [SEP], Tok 1 ... Tok M (Masked Sentence 1, Masked Sentence 2). Embedding row: E_1 ... E_N, E_[SEP], E_1' ... E_M'. Output row: C, T_1 ... T_N, T_[SEP], T_1' ... T_M'. Language icons: বা, ગુ, हि, ਪੰ, తె]
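The masked-sentence inputs in the figure can be illustrated with a minimal sketch of how a masked-language-model training example might be prepared. This is a generic illustration, not the IndicNLPSuite training code: the function names `mask_tokens` and `build_pair_input`, the 15% masking rate and the fixed seed are all assumptions made for the example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Randomly replace tokens with [MASK]; mask_prob is an assumed rate."""
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)    # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)   # this position is not scored
    return masked, labels

def build_pair_input(sent_a, sent_b, seed=0, mask_prob=0.15):
    """Lay out two masked sentences as [CLS] A [SEP] B, as in the figure."""
    rng = random.Random(seed)     # fixed seed keeps the sketch reproducible
    a, a_labels = mask_tokens(sent_a, rng, mask_prob)
    b, b_labels = mask_tokens(sent_b, rng, mask_prob)
    tokens = ["[CLS]"] + a + ["[SEP]"] + b
    labels = [None] + a_labels + [None] + b_labels
    return tokens, labels
```

Calling `build_pair_input(["this", "is", "one"], ["and", "another"])` returns the token sequence together with a parallel list of labels that holds the original token at every masked position and `None` elsewhere.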

What have we done so far?

User types using English script

Automatically converted to Maithili script
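The typing flow above can be sketched with a toy romanized-input converter. This is only an illustration under stated assumptions: the lookup table `ROMAN_TO_DEVANAGARI` and the function `transliterate` are hypothetical, the table covers a single example word, and Devanagari is used as one of the scripts in which Maithili is written. A deployed input tool would more plausibly rely on a learned transliteration model than on a hand-written table.

```python
# Hypothetical toy table: romanized chunk -> Devanagari chunk.
ROMAN_TO_DEVANAGARI = {
    "na": "न",
    "ma": "म",
    "ste": "स्ते",
}

def transliterate(word, table=ROMAN_TO_DEVANAGARI):
    """Greedy longest-match conversion; unknown characters pass through."""
    max_len = max(len(key) for key in table)
    out, i = [], 0
    while i < len(word):
        # Try the longest candidate chunk first, then shorter ones.
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            out.append(word[i])  # no rule matched: keep the character
            i += 1
    return "".join(out)
```

With this table, `transliterate("namaste")` yields `नमस्ते`, and any input with no matching rule is returned unchanged.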

Input Tools

Impact: Input tools become significantly more efficient for a long list of Indian languages. Impacts all content creation including writing storybooks for children

Deployed to write storybooks at

What have we done so far?

INCLUDE

Impact: The first large-scale AI work on Indian Sign Language. Won the first AI4Accessibility global challenge run by Microsoft for scaling up data collection to continuous sign language through 2021-22

Largest public dataset on Indian Sign Language

263 signs, 4000+ videos

Efficient AI models for recognising signs

>90% accuracy

What have we done so far?

AI model for translating between English and 11 Indic languages

Samanantar

46M parallel sentences mined from the web (3X improvement)

Impact: The translation models (which are shown to be more accurate than commercial APIs) are being used to assist human translators in translating supreme court judgements with a significant increase in efficiency
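The slide does not describe how the parallel sentences were mined. One common approach, sketched here as an assumption rather than as the Samanantar pipeline, is to embed sentences from both languages in a shared vector space and keep each source sentence's nearest neighbour when the similarity clears a threshold. The function names, the 0.8 threshold and the two-dimensional toy vectors are all made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mine_pairs(src, tgt, threshold=0.8):
    """src, tgt: dicts mapping sentence -> embedding (list of floats).

    Returns (source, target, score) for each source sentence whose best
    match in tgt scores at least `threshold` (an assumed cut-off).
    """
    pairs = []
    for s, s_vec in src.items():
        best, best_score = None, -1.0
        for t, t_vec in tgt.items():
            score = cosine(s_vec, t_vec)
            if score > best_score:
                best, best_score = t, score
        if best is not None and best_score >= threshold:
            pairs.append((s, best, best_score))
    return pairs

# Toy embeddings; a real system would obtain these from a multilingual encoder.
english = {"Hello": [1.0, 0.0], "Thank you": [0.0, 1.0], "Good night": [-1.0, 0.0]}
hindi = {"नमस्ते": [0.9, 0.1], "धन्यवाद": [0.1, 0.9]}
mined = mine_pairs(english, hindi)
```

In this toy data "Good night" has no close neighbour on the Hindi side, so the threshold filters it out instead of forcing a wrong pairing.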

What are some of our ongoing activities?

Creating a robust, standardised benchmark for Machine Translation

Mining high-quality speech-to-text data for building ASR models

Building models for continuous signs in video call apps

What next? (A long tail of NLP tasks)

What does it take to build a typical NLP application?

Full NLP stack

A chatbot for Aarogya Setu

Input Tools: Keyboards, Spell checkers, ...

Text Analysers: ..., Sentiment Analysis, Content Filters

Inference Engines: QA, Natural Language Inference, ...

Text Generators: ..., Dialog

Build this full stack for each Indic language

Text Generators: ..., Translation, Dialog, Summarisation

Inference Engines: QA, NLI, Paraphrase Detection, ...

Text Analysers: ..., Named Entity Recognition, Sentiment Analysis, Topic Classification, Content Filters

Input Tools: Keyboards, Spell checkers, ...

10,000 downloads

Impact

Mining translation pairs

Mining transliteration pairs

Building ASR systems

Building Input Tools

v 2.0

Sep'21

IndicNLU Challenge: Benchmark for Indian languages

Mar'22

Monolingual Corpus v 1.0

Pre-trained Model

Sep'20

Track 1: Basic Building Blocks

Track 2: Input Tools

(Potential) Impact

Creating stories

Dec'20

Project Karya

(Indian Amazon Mechanical Turk)

Government crowdsourcing initiatives

Build a dataset for all 22 constitutionally recognised languages 

Build a joint transliteration model for all 22 languages 

Build romanised keyboards for all 22 languages 

Build swipe based keyboards for all 22 languages 

Dec'21

Mar'22

Jun'22

Sep'22

Track 3: Machine Translation

Impact

SUVAS

NPTEL translations

Universal Language Contribution API

IndicTranslate Challenge: A benchmark for 11 Indic languages

Dec'21

(Translation of legal documents)

46M parallel sentences

Joint translation model competing with commercial systems

Apr'21

75M parallel sentences, 15 languages

Jun'22

A joint model for Indic-Indic translation v2.0

Mar'23

Track 4: Automatic Speech Recognition

Joint pre-trained models for 22 Indian languages

Sep'21

Dec'21

Mine and align speech-to-text parallel data for 22 Indian languages 

ASR system for NPTEL videos

Dec'22

Dec'23

Speech interface for input tools

Track 5: Indian Sign Language Detection

Timeline: 2020, 2021, 2022, 2023

Legend: Dataset, Benchmark, Model, Tools

green = eco-friendly = efficient

v 1.0, 9B tokens, 12 languages

v 1.0, 2 languages

v 1.0, 46M sentences, 12 languages

v 2.0, 20B tokens, 15 languages

v 1.0, 22 languages

v 1.0, 22 languages

v 2.0, 75M sentences, 15 languages

v 3.0, 25B tokens, 22 languages

v 1.0, 2000 hours, 22 languages

v 1.0, 22 languages

v 3.0, 15 languages, green translation 

green language model

v 2.0, 4000 hours, 22 languages, green ASR Models

263 signs, 4000+ videos

swipe based keyboards, 22 languages

integration with video calling apps

Beyond 2023

What will we do?

A long tail of tasks and languages

... ... ...

What will be our guiding principle?

Focus on delivery-oriented, cutting-edge research that is relevant to government, industry, academia and society

How will we sustain this?

Depend on a core set of mentors (Kris, EkStep) to broaden the reach of our work in sectors with a definitive need (government, industry, startups)

Bundelkhandi, Gondi, Garhwali, ... ...

What do we need?

Set up "The AI4Bharat Initiative"  @

Yearly budget
1M USD

House a best-in-class team of researchers, engineers and developers

Build capacity by conducting workshops for startups, industry and academia

Rent/set up world-class compute infrastructure for building cutting-edge AI models

Build datasets and benchmarks by partnering with annotation service providers

Yearly cost

Yearly budget: 915K USD

1 Post Doc

2 Senior Researchers

15 AI Residency Fellows

RnD

Outsourced (200 man months)

Labelling

30K

40K

150K

20K

15K

25K

120K

60K

250K

Physical Infrastructure

Space

Cloud Infrastructure

Compute

2 Software Engineers

1 Admin Head

1 System Admin

5 Support Staff

Admin

1 Outreach Lead

Outreach

Workshops/Contests

100K

Equipment

Laptops/Storage

50K

40K

15K
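As a quick check, the amounts listed on this slide can be added up and compared with the stated yearly figure. The flattened layout does not show which amount belongs to which line item, so the sketch below only verifies the total. Treating the separate "1M USD" figure as a rounded ask is an assumption.

```python
# Amounts in thousands of USD, in the order they appear on the slide.
amounts_k = [30, 40, 150, 20, 15, 25, 120, 60, 250, 100, 50, 40, 15]

total_k = sum(amounts_k)
print(f"Itemised total: {total_k}K USD")

# Assumption: the 1M USD yearly budget is a rounded-up version of this total.
headline_k = 1000
print(f"Difference from 1M USD: {headline_k - total_k}K USD")
```

The itemised amounts sum to 915K USD, matching the "Yearly budget: 915K USD" line and leaving 85K USD between that total and the 1M USD figure.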

Copy of AI4Bharat Initiative

By One Fourth Labs
