Fusing representations for code-mixed low resource language processing

Problem?

Low resource language data is now being generated at a large scale.
Major work so far covers only English-x code-mixed language models.
Untapped potential of low resource to low resource code-mixed language data.

Notable Work

Dravidian-CodeMix - FIRE 2021
GCM: A Toolkit for Generating Synthetic Code-mixed Text (Rizvi et al. 2021)
Transfer learning based code-mixed part-of-speech tagging using character level representations for Indian languages (Madasamy et al. 2021)

Proposed Solution

Generate data using GCM

GANFusion to fuse representations

Extract representations and use for a demo task

Repeat for multiple language combinations
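
The last step above, repeating the pipeline for multiple language combinations, can be sketched as a simple enumeration of unordered language pairs. This is only an illustration: the function name `language_pairs` and the language codes in the usage lines are placeholders, not languages or code named on the slides.

```python
from itertools import combinations


def language_pairs(languages):
    """Return every unordered pair of distinct languages, in sorted order."""
    unique = sorted(set(languages))
    return list(combinations(unique, 2))


# Placeholder language codes (an assumption, not taken from the slides).
pairs = language_pairs(["ta", "ml", "kn"])
for first, second in pairs:
    # Each pair would go through: generate data, fuse, extract, evaluate.
    print(f"{first}-{second}")
```

Each pair would then be run through the generate, fuse, and extract steps in turn.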

What is GANFusion?

Architecture from https://arxiv.org/pdf/2105.01129.pdf

Architecture from "Towards A Multi-agent System for Online Hate Speech Detection" (Sahu et al. 2021)

Pre-processing for the z-vector

Text cleaning using Ekphrasis
POS tagging to create a "(subject, object, verb, modifier)" pointer p_t
p_t is passed through an LSTM + word attention model to create the input to the GAN: z_t
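
The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the architecture from the cited paper: the tag names in `SLOT_OF_TAG`, the helper names, and the reading of p_t as the pointer and z_t as the pooled GAN input are all hypothetical, and random vectors stand in for the LSTM hidden states. The attention step is the common additive word attention (score each state, softmax, weighted sum).

```python
import math
import random

# Hypothetical mapping from tagger labels to pointer slots (an assumption).
SLOT_OF_TAG = {"SUBJ": "subject", "OBJ": "object", "VERB": "verb", "MOD": "modifier"}


def build_pointer(tagged_tokens):
    """Keep the first token seen for each (subject, object, verb, modifier) slot."""
    slots = {"subject": None, "object": None, "verb": None, "modifier": None}
    for token, tag in tagged_tokens:
        slot = SLOT_OF_TAG.get(tag)
        if slot is not None and slots[slot] is None:
            slots[slot] = token
    return (slots["subject"], slots["object"], slots["verb"], slots["modifier"])


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_attention(hidden_states, W, b, context):
    """Pool per-token states into one vector: u = tanh(W h + b), score = u . context."""
    scores = []
    for h in hidden_states:
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
             for row, bias in zip(W, b)]
        scores.append(sum(ui * ci for ui, ci in zip(u, context)))
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(a * h[k] for a, h in zip(alphas, hidden_states)) for k in range(dim)]
    return pooled, alphas


rng = random.Random(0)
tagged = [("she", "SUBJ"), ("quickly", "MOD"), ("reads", "VERB"), ("books", "OBJ")]
p_t = build_pointer(tagged)

dim = 4
# Random stand-ins for the LSTM output at each pointer position (an assumption).
hidden = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in p_t]
W = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
b = [0.0] * dim
context = [rng.uniform(-1, 1) for _ in range(dim)]

z_t, alphas = word_attention(hidden, W, b, context)
```

In a trained model, `W`, `b`, and `context` would be learned parameters and `hidden` would come from the LSTM rather than a random generator.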

Impact?

Low resource code-mixed language processing
Transfer Learning to English-x code-mixed tasks
Computational social science tasks based on social computing
Future tasks may include probing these models for possible learned biases

Thank you!

unicode-research-presentation

By deep1401

Presentation for Unicode Research project