Language Technology for Social Good

Mitesh M. Khapra

Nilekani Centre at AI4Bharat, IIT Madras

मी तुम्हाला प्रतिकिलो 100 रुपये देऊ शकतो (Marathi: "I can give you 100 rupees per kilo")

100 கிலோ அரிசியை விற்க விரும்புகிறேன் (Tamil: "I want to sell 100 kg of rice")

What would it take for a farmer in Tamil Nadu to talk to a wholesaler in Maharashtra?

We need to solve four fundamental problems

Speech Recognition

Language Understanding

Machine Translation

Speech Synthesis
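
The four problems above chain naturally into one speech-to-speech pipeline. The following is a minimal sketch, assuming each stage can be treated as an opaque function. The `Utterance` type, `make_pipeline`, and the stub stages are hypothetical placeholders written for illustration, not AI4Bharat APIs; real systems would plug trained models into each slot.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Utterance:
    """Text plus the code of the language it is written in."""
    text: str
    lang: str


def make_pipeline(
    recognize: Callable[[bytes, str], Utterance],
    understand: Callable[[Utterance], Dict],
    translate: Callable[[Utterance, str], Utterance],
    synthesize: Callable[[Utterance], bytes],
) -> Callable[[bytes, str, str], Tuple[Dict, bytes]]:
    """Compose the four stages into one speech-to-speech function."""

    def run(audio: bytes, src_lang: str, tgt_lang: str) -> Tuple[Dict, bytes]:
        heard = recognize(audio, src_lang)      # speech recognition
        meaning = understand(heard)             # language understanding
        rendered = translate(heard, tgt_lang)   # machine translation
        return meaning, synthesize(rendered)    # speech synthesis

    return run


# Stub stages (assumptions) so the sketch runs end to end without any model.
def stub_recognize(audio: bytes, lang: str) -> Utterance:
    return Utterance(audio.decode("utf-8"), lang)


def stub_understand(utt: Utterance) -> Dict:
    return {"lang": utt.lang, "n_tokens": len(utt.text.split())}


def stub_translate(utt: Utterance, tgt_lang: str) -> Utterance:
    return Utterance(f"[{utt.lang}->{tgt_lang}] {utt.text}", tgt_lang)


def stub_synthesize(utt: Utterance) -> bytes:
    return utt.text.encode("utf-8")


pipeline = make_pipeline(stub_recognize, stub_understand, stub_translate, stub_synthesize)
meaning, audio_out = pipeline(
    "100 கிலோ அரிசியை விற்க விரும்புகிறேன்".encode("utf-8"), "ta", "mr"
)
```

Keeping the stages behind plain function signatures means any one of them can be swapped out (for example, a better translation model) without touching the others.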

Current setup

 

Hundreds of calls arrive daily in multiple languages and are answered manually by dedicated personnel

 

 

Challenges

 

Cannot scale effectively; some calls go unanswered; work is repeated unnecessarily

 

How can language technology help?

 

Automatic processing of calls using speech and language technology

આજે બજારમાં મકાઈના ભાવ કેટલા છે? (Gujarati: "What is the price of maize in the market today?")

 

 

Named Entity Recognition, Intent Identification, Natural Language Generation, ...
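
To make the components named above concrete, here is a toy sketch of intent identification, entity extraction, and template-based generation for a caller's question. It assumes the query has already been transcribed and translated to English; the keyword lists, `parse_query`, and `generate_reply` are hypothetical and stand in for trained models. No real price source is wired in, so the caller supplies the lookup.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical keyword lists; a deployed system would use trained models.
INTENT_KEYWORDS = {
    "price_query": ("price", "rate", "cost"),
    "weather_query": ("rain", "weather", "forecast"),
}
COMMODITIES = ("maize", "rice", "wheat")


def parse_query(text: str) -> Dict[str, Optional[str]]:
    """Return the first matching intent and commodity entity, if any."""
    tokens = re.findall(r"[a-z]+", text.lower())
    intent = next(
        (name for name, words in INTENT_KEYWORDS.items()
         if any(t in words for t in tokens)),
        None,
    )
    commodity = next((t for t in tokens if t in COMMODITIES), None)
    return {"intent": intent, "commodity": commodity}


def generate_reply(
    parsed: Dict[str, Optional[str]],
    lookup_price: Callable[[str], Optional[float]],
) -> str:
    """Template-based generation; lookup_price is supplied by the caller."""
    if parsed["intent"] != "price_query" or parsed["commodity"] is None:
        return "Sorry, I could not understand the question."
    price = lookup_price(parsed["commodity"])
    if price is None:
        return f"No price is available for {parsed['commodity']} today."
    return f"Today's price of {parsed['commodity']} is {price:.2f} rupees per kilo."


parsed = parse_query("What is the price of maize in the market today?")
reply = generate_reply(parsed, lambda commodity: None)  # no price source wired in
```

The reply would then be translated back and synthesized in the caller's language.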

Current setup

 

Hundreds of books are being translated manually into multiple languages (including Gondi, Santali, etc.)

Challenges

 

Requires skilled translators, lack of proper input and authoring tools 

How can language technology help?

 

Good input tools, authoring tools, and translation tools, especially for underrepresented languages, can help translation efforts scale

Transliteration, Translation, Spell check

Santali

Romanized keyboard

मक्का सुसतु झाला (Konkani: "I am tired")

Current setup

 

Surveys of hundreds of families are conducted manually in semi-urban and rural areas in multiple languages

Challenges

 

The process requires manual data entry, which is time-consuming and hence limits the number and length of surveys

How can language technology help?

 

Record conversations and then use speech and language technology to extract entities and relations for automatic form-filling

माझ्या वडिलांना हृदयरोग झाला होता पण माझ्या आईला ते नव्हते (Marathi: "My father had heart disease but my mother did not")

 

 

Information Extraction, Slot-filling

 

heart_condition (father, yes)

heart_condition (mother, no)
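
The form-filling step above can be sketched with a few hand-written rules. This is a toy illustration only: it assumes the recorded sentence has already been transcribed and translated to English, and the word lists and the `fill_slots` helper are hypothetical. A real system would use trained information-extraction models rather than keyword matching.

```python
import re
from typing import List, Optional, Tuple

RELATIVES = ("father", "mother", "brother", "sister")
# Hypothetical surface forms mapped to the slot name used on the form.
CONDITIONS = {"heart disease": "heart_condition", "diabetes": "diabetes"}
NEGATIONS = ("did not", "didn't", "never", " no ", " not ")


def fill_slots(sentence: str) -> List[Tuple[str, str, str]]:
    """Extract (slot, relative, yes/no) triples, one clause at a time."""
    triples = []
    last_slot: Optional[str] = None
    for clause in re.split(r"\bbut\b|\band\b|[,;.]", sentence.lower()):
        relative = next(
            (r for r in RELATIVES if re.search(rf"\b{r}\b", clause)), None
        )
        if relative is None:
            continue
        slot = next((s for phrase, s in CONDITIONS.items() if phrase in clause), None)
        # An elliptical clause ("my mother did not") inherits the previous slot.
        slot = slot or last_slot
        if slot is None:
            continue
        negated = any(n in f" {clause} " for n in NEGATIONS)
        triples.append((slot, relative, "no" if negated else "yes"))
        last_slot = slot
    return triples


triples = fill_slots("My father had heart disease but my mother did not.")
```

For the example sentence this yields one triple per family member, matching the two form entries shown above.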

Commercial applications

Multilingual chatbots

Sentiment Analysis

Content Moderation

Code-mixed song search

Speech

QA

Multilingual Authoring Tools

Confidential

By 2021, 75% of Indian internet users will use Indian languages*

- 22 official languages

- 1.3 billion speakers

Rich diversity, growth and demand

*Source: Google-KPMG report, https://assets.kpmg/content/dam/kpmg/in/pdf/2017/04/Indian-languages-Defining-Indias-Internet.pdf

Rich demand for speech and text technology

કેમ છો (Gujarati: "How are you?")

कैसे हैं (Hindi: "How are you?")

Chat applications

Digital entertainment

Social media platforms

Digital news

Digital write-ups

Digital payments

e-governance services

e-commerce services


Why is this problem hard?

Open-source all datasets and AI models under permissive licenses

Build a coalition of partners to collect datasets and deploy models

বা, हि, తె, कॉ, ગુ, ने, कों, सं, ਪੰ, सिं, اُر, मै (language names abbreviated in their own scripts)

Scale and Diversity

Unique language phenomena

mujhe bahut confusion hai (code-mixed Hindi-English: "I am very confused")

Scarcity of resources

Lack of basic speech and NLP tools

Named Entity Recognition

Sentiment Analysis

Topic Classification

Content Filters

Keyboards

Spell checkers

Indian languages lag on academic benchmarks

Speech Recognition

Language Understanding

Machine Translation

Speech Synthesis

Summary: Language technology is foundational for the social sector


Indian language + voice support = key to interfacing with Bharat


Creating solutions based on local needs and behaviour is critical to improving user engagement

 

Consumer Survey, Feb 2018, Bain Analysis

Voice could play a pivotal role in enabling e-governance and bringing the next 300 million Indians to digital platforms

Nasscom Survey, 2019

Indian language users are expected to account for 75% of India's internet user base

KPMG/Google Survey, 2018

Chat & entertainment

Social media & news

Digital write-ups, payments, e-governance, e-commerce, search

Summary: Language technology can unlock commercial value

What should we aim to do?

Bring AI technology for Indian languages to parity with English through open-source contributions

Be the Apache of the AI stack for Indian languages

Have we solved parts of this problem?

Speech Recognition

Language Understanding

Machine Translation

Speech Synthesis

Mined 33 million parallel sentences

Built billion parameter translation models

Our models for English to 11 Indian languages outperform all other models (including Google and Microsoft)

Who is using our translation models?

NGO for book translation

Govt. for website translation

Fiction book translation

Judgment translation

Feedback from users

 

“The quality of translations is significantly improved. I would say this is more so for the legal document where there were complicated sentences/cards which were translated very well. The syntax mostly did not falter even in the face of multiple ideas/information contained in one sentence.”
 

“The amount of time spent on correcting/improving the translation has dropped.”
 

“This is very promising. Amazed by the speed.”

 

“I took a printout and went through every line. The translation is 98% accurate and highly satisfied.”

What should we do in the next 5 years?

Language Understanding

Machine Translation

Speech Synthesis

Speech Recognition

Create and release datasets

Train and open source AI models

Build deployable tools

Outreach to drive use

Driving Science

Driving Adoption

Your Path to Language Technology

Python Packages

Prog. & Databases

Descriptive Statistics

Probability Theory

Inferential Statistics

Statistical Modelling

Functions

Calculus

Linear Algebra

Probability Theory (Adv.)

Optimisation

Machine Learning

Deep Learning

Pre-requisites

Foundations of DS

Foundations of ML

ML

DL

Information Theory

Language Technology

Mitesh M. Khapra
Associate Professor, IIT Madras
PhD, IIT Bombay

Pratyush Kumar
Researcher, Microsoft Research
Adjunct Professor, IIT Madras
PhD, ETH Zürich

Anoop Kunchukuttan
Researcher, Microsoft
PhD, IIT Bombay

100+ academic papers

30+ US patents

Recognized by

Experience

Team

Research expertise

Academic, industry, and startup experience

Language Technology for Social Good

By One Fourth Labs

Language Technology for Social Good

The AI4Bharat Initiative
