An AI4BHARAT INITIATIVE

indicnlp.ai4bharat.org

A HUB FOR INDIC NLP RESOURCES, DATASETS AND TOOLS

Why IndicNLP?

বা ગુ हि ಕ മ म ने ਪੰ த తె ار

The social view

Touch Points: Digital

Users: Multilingual

Language technology is absolutely essential to magnify reach and impact in the social sector

The commercial view

Multilingual chatbots

Sentiment Analysis

Content Moderation

Code mixed song search

Speech

QA

Multilingual Authoring Tools

... And the demand for these tools is increasing

*Source: Google-KPMG report, https://assets.kpmg/content/dam/kpmg/in/pdf/2017/04/Indian-languages-Defining-Indias-Internet.pdf

કેમ છો ("How are you?" in Gujarati)

कैसे हैं ("How are you?" in Hindi)

Chat applications

Digital entertainment

Social media platforms

Digital news

Digital write-ups

Digital payments

e-governance services

e-commerce services

22 official languages, 1.3 billion speakers

By 2025, 75% of Indian internet users will use Indian languages 

Demand for speech and text technology

Rich diversity, growth and demand

... But we are far behind

Poor in resources, tools and technology

Very few Wikipedia articles

No indigenous input tools

Poor speech and language technology

The need for Indian NLP is clear!

The fact that we have not fully succeeded yet is also clear!

How do we find the secret of success?

The Indian NLP story

The (not-so-secret) recipe of English NLP

Collect a huge amount of task-agnostic unsupervised data

Pre-train a model

Create a benchmark of NLU tasks

Fine-tune and track progress

This is where IndicNLP is lagging 

Curate: Curate datasets for Indian languages

Innovate: Create a common evaluation platform for tracking progress across multiple tasks in multiple languages

Build: Create an ecosystem of stakeholders to create and deploy solutions for the country at large

The way forward

Authors: Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, Pratyush Kumar

IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages. Findings of EMNLP (EMNLP-Findings), 2020

A small step

A large-scale, task-agnostic monolingual corpus

A pre-trained planet warmer (a.k.a. ALBERT)

A benchmark of NLU tasks

Contribute and track progress

বা

हि

Main Message

Invest in creating Indian language datasets. Today!

If we are serious about succeeding, this is the only way forward!

Serving underrepresented languages

Let's not forget them!

About Us

Launched in Jul 2019: Working on several open-source projects of social importance in AI

Founding team (indicnlp.ai4bharat.org)

Dr. Mitesh M. Khapra
Assistant Professor, IIT Madras
PhD from IIT Bombay
Experience in teaching DL to industry and academia
5 years of experience at IBM Research
40+ research papers
Google Faculty Award, Young Faculty Recognition Award

Dr. Pratyush Kumar
Assistant Professor, IIT Madras
BTech from IIT Bombay, PhD from ETH Zurich
Worked at IBM Research
DL consultant for startups
35 research papers, 18 patents

Dr. Anoop Kunchukuttan
Senior Applied Researcher, Microsoft India
PhD from IIT Bombay
Experience in machine translation, multilingual NLP, and building tools and resources for Indian NLP
2+ years of experience at Microsoft Translator group
35+ research papers

Contact Us

Let's make India ready for the AI age

Anoop Kunchukuttan: anoop.kunchukuttan@gmail.com
Mitesh M. Khapra: miteshk@cse.iitm.ac.in
Pratyush Kumar: pratyush@cse.iitm.ac.in

IndicNLP - Vaibhav Summit

By One Fourth Labs
