Dimitrina Zlatkova (AI)

Daniel Kopev (AI)

Atanas Atanasov (AI)

SemEval-2018 Emoji Prediction

Data:

So this happened :) with my girls @user and erinblonshine. ️ #29 #sacredhearttattoo @ Sacred…

Label:

red_heart

two_hearts

blue_heart

purple_heart

camera_with_flash

camera

Classes

Our Plan

  1. Data
  2. Text Preprocessing
  3. Feature Engineering
  4. Classification
  5. Evaluation

Text Processing

Original:

So this happened :) with my girls @user and erinblonshine. ️ #29 #sacredhearttattoo @ Sacred…

Text Processing (2)

Pattern replace:

So this happened __smile__ with my girls @user and erinblonshine. ️ #29 #sacredhearttattoo @ Sacred…

Char filter:

So this happened __smile__ with my girls  and erinblonshine#sacredhearttattoo   Sacred

Tokenize:

['so', 'this', 'happened', '__smile__', 'with', 'my', 'girls', 'and', 'erinblonshine', 'sacred', 'heart', 'tat', 'too', 'sacred']
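The first three steps can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual code: the emoticon table, the regular expressions and all function names are assumptions, and the hashtag segmentation visible in the tokenized output ('sacred', 'heart', 'tat', 'too') is not reproduced here.

```python
import re

# Hypothetical emoticon table: the slides only show ":)" -> "__smile__".
EMOTICON_TOKENS = {":)": "__smile__"}


def pattern_replace(text):
    """Replace known emoticons with placeholder tokens."""
    for emoticon, token in EMOTICON_TOKENS.items():
        text = text.replace(emoticon, token)
    return text


def char_filter(text):
    """Drop @mentions, then every character that is not a letter, '_' or whitespace."""
    text = re.sub(r"@\w*", " ", text)
    return re.sub(r"[^A-Za-z_\s]", " ", text)


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def preprocess(text):
    """Run the three steps in order."""
    return tokenize(char_filter(pattern_replace(text)))
```

Calling `preprocess` on a tweet returns a list of lowercase tokens with the `__smile__` placeholder kept intact and hashtags left as single tokens.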

Text Processing (3)

Stop words:

['happened', 'smile', 'girls', 'erinblonshine', 'sacred', 'heart', 'tat', 'sacred']

Lemmatize:

['happen', 'smile', 'girl', 'erinblonshine', 'sacred', 'heart', 'tat', 'sacred']

POS Tagger:

[('so','RB'),('this','DT'),('happened','VBD'),('smile','NN'), ('with','IN'),('my','PRP$'), ('girls','NNS'),('and','CC'), ('erinblonshine','NN'),('sacred','JJ'),('heart','NN'),('tat','VB'),('too', 'RB'),('sacred','VBD')]
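The stop-word step can be sketched as follows. The stop-word set is a hypothetical stand-in, just large enough for the example tweet, and stripping the underscores from `__smile__` is inferred from the slide's output. Lemmatization and POS tagging need a linguistic resource (for example NLTK) and are not shown.

```python
# Hypothetical stop-word list; a real run would use a full list from an NLP library.
STOP_WORDS = {"so", "this", "with", "my", "and", "too"}


def remove_stop_words(tokens):
    """Strip placeholder underscores, then drop empty tokens and stop words."""
    cleaned = [token.strip("_") for token in tokens]
    return [token for token in cleaned if token and token not in STOP_WORDS]


tokens = ['so', 'this', 'happened', '__smile__', 'with', 'my', 'girls', 'and',
          'erinblonshine', 'sacred', 'heart', 'tat', 'too', 'sacred']
filtered = remove_stop_words(tokens)
```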

Vectorization

  • tf-idf
  • GloVe
  • word2vec
  • FB research (StarSpace)
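As an illustration of the first option, here is a textbook tf-idf weighting in plain Python. It is a sketch only: the function name is an assumption, and libraries such as scikit-learn apply a smoothed idf and normalization, so their numbers will differ.

```python
import math
from collections import Counter


def tfidf(documents):
    """Map a list of token lists to a list of {term: weight} dicts.

    weight = (term count / document length) * log(N / document frequency)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


vectors = tfidf([["sun", "fun"], ["sun", "park"]])
```

A term that appears in every document ("sun" here) gets weight zero, while terms unique to one tweet get positive weight.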

Feature Engineering

Text Features

  • words
  • unique words
  • stopwords
  • @user
  • Words Title
  • WORDS UPPER
  • mean word length
  • #
  • a b c
  • 1 2 3
  • $
  • %
  • !!!
  • ???
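One possible reading of this list as code is sketched below. The feature names, the stop-word set and the exact counting rules (for example, counting every "!" rather than runs of "!!!") are assumptions, not the team's definitions.

```python
# Hypothetical stop-word list used only for the "stopwords" count.
STOP_WORDS = {"so", "this", "with", "my", "and", "too", "the", "a"}


def text_features(text):
    """Return surface counts for a raw tweet, one entry per bullet above."""
    words = text.split()
    return {
        "n_words": len(words),
        "n_unique_words": len({w.lower() for w in words}),
        "n_stopwords": sum(w.lower() in STOP_WORDS for w in words),
        "n_user_mentions": words.count("@user"),
        "n_title_words": sum(w.istitle() for w in words),
        "n_upper_words": sum(w.isupper() for w in words),
        "mean_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        "n_hashtags": sum(w.startswith("#") for w in words),
        "n_letters": sum(c.isalpha() for c in text),
        "n_digits": sum(c.isdigit() for c in text),
        "n_dollar": text.count("$"),
        "n_percent": text.count("%"),
        "n_exclamation": text.count("!"),
        "n_question": text.count("?"),
    }


features = text_features("So HAPPY today @user #fun 100% !!!")
```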

Emotions

  • anger
  • sadness
  • disgust
  • fear
  • joy - fun, sun
  • anticipation - fun, sun
  • surprise - sun
  • trust - sun
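A lexicon lookup of this kind might look like the sketch below. The two-word lexicon is hypothetical: it assumes the slide's example words "fun" and "sun" attach to joy, anticipation, surprise and trust in that order. A real run would load a full emotion lexicon.

```python
from collections import Counter

EMOTIONS = ["anger", "sadness", "disgust", "fear",
            "joy", "anticipation", "surprise", "trust"]

# Hypothetical mini-lexicon inferred from the slide's examples.
EMOTION_LEXICON = {
    "fun": {"joy", "anticipation"},
    "sun": {"joy", "anticipation", "surprise", "trust"},
}


def emotion_counts(tokens):
    """Count lexicon hits per emotion, in the fixed order of EMOTIONS."""
    counts = Counter()
    for token in tokens:
        counts.update(EMOTION_LEXICON.get(token, ()))
    return [counts[emotion] for emotion in EMOTIONS]


vector = emotion_counts(["fun", "sun", "park"])
```

The same lookup pattern would apply to the color list, with a word-to-color lexicon in place of the emotion one.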

Colors

  • black
  • blue
  • brown
  • green
  • grey
  • orange
  • pink
  • purple
  • red
  • white
  • yellow

- sun

- sun

- park

Sentiment

Positive:

"pos_0", "pos_.15", "pos_.20", "pos_.27", "pos_.4", "pos_above"

Negative:

"neg_0", "neg_.15", "neg_.25", "neg_.35", "neg_.6", "neg_above"

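The bucket names suggest that continuous sentiment scores are binned into categorical features. A minimal sketch, assuming each name is the inclusive upper bound of its bin and that scores are non-negative (neither is stated on the slide):

```python
from bisect import bisect_left

POS_BOUNDS = [0.0, 0.15, 0.20, 0.27, 0.4]
POS_LABELS = ["pos_0", "pos_.15", "pos_.20", "pos_.27", "pos_.4", "pos_above"]

NEG_BOUNDS = [0.0, 0.15, 0.25, 0.35, 0.6]
NEG_LABELS = ["neg_0", "neg_.15", "neg_.25", "neg_.35", "neg_.6", "neg_above"]


def bucket(score, bounds, labels):
    """Return the label of the first bound that is >= score, else the last label."""
    return labels[bisect_left(bounds, score)]


pos_feature = bucket(0.18, POS_BOUNDS, POS_LABELS)
neg_feature = bucket(0.9, NEG_BOUNDS, NEG_LABELS)
```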
Hierarchical Twitter Clusters

^010011000 got qot gott g0t gotz qott gottt gawt ghot gotcho goht ggot
^111010100010 lmao lmfao lmaoo lmaooo lool rofl loool lmfaoo lmfaooo lmaoooo
^111010100011 haha hahaha hehe hahahaha hahah aha hehehe ahaha hah hahahah hahaa ahah
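Each row pairs a bit-string path in the cluster hierarchy with the word variants that fall under it. One common way to use such clusters is to emit path prefixes as features, so that related clusters share features. The sketch below assumes this approach; the prefix lengths and feature naming are hypothetical, and only three words from the rows above are included.

```python
# Word -> cluster path, copied from the rows shown above.
WORD_TO_PATH = {
    "got": "010011000",
    "lmao": "111010100010",
    "haha": "111010100011",
}

# Hypothetical prefix lengths; shorter prefixes give coarser clusters.
PREFIX_LENGTHS = (4, 8, 12)


def cluster_features(tokens):
    """Return the set of cluster-prefix features for the known tokens."""
    features = set()
    for token in tokens:
        path = WORD_TO_PATH.get(token)
        if path is None:
            continue  # out-of-vocabulary words contribute nothing
        for length in PREFIX_LENGTHS:
            features.add("cluster_" + path[:length])
    return features


shared = cluster_features(["lmao"]) & cluster_features(["haha"])
```

Here "lmao" and "haha" share their 4-bit and 8-bit prefixes and differ only at full depth, while "got" shares nothing with either.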

Experimental Results

Model                            Precision  Recall  F1 Macro (%)
Multinomial Naive Bayes          0.05       0.21    1.763
Logistic Regression with L-BFGS  0.22       0.28    13.16
MLP, 2 hidden layers, ReLU       0.26       0.26    17.898
Random Forest (50 estimators)    0.20       0.26    16.167
SVM, tf-idf                      0.23       0.27    19.554
SVM, Twitter embeddings          0.16       0.18    8.522
AdaBoost, Extra Tree base        0.15       0.19    7.825
SVM+AdaBoost+Random Forest       0.25       0.24    13.764
SVM+AdaBoost+MLP                 0.25       0.28    20.106

(10k train, 1k test)
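For reference, macro-averaged scores can be computed as sketched below. This is a generic implementation, not the task's official scorer; averaging over the union of gold and predicted labels is an assumption.

```python
def macro_scores(y_true, y_pred):
    """Return (macro precision, macro recall, macro F1) as fractions in [0, 1]."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


precision, recall, f1 = macro_scores(
    ["red_heart", "red_heart", "camera", "camera"],
    ["red_heart", "camera", "camera", "camera"],
)
```

Macro F1 is the mean of the per-class F1 scores, so it generally differs from the harmonic mean of the macro precision and macro recall columns.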

Deep Learning

CNN for Text Classification

LSTM Networks

Hierarchical Attention Neural Networks for Text Classification

Experimental Results

Model          Precision  Recall  F1 Macro (%)
CNN            0.15       0.14    12.034
RNN with LSTM  0.24       0.17    13.106
HANN           0.30       0.13    15.999

(10k train, 1k test)

Final Results

Model        Precision  Recall  F1 Macro (%)
SVM, tf-idf  0.30       0.33    23.3
HANN         0.30       0.13    22.518

(488k train, 50k test)

 valentine, loveofmylife, heart full, heart

cool kid, sunglasses, coolin, shade, cool, sunglass

ti season, christmastree, tree, christmas tree, merry christmas, merry, christmas

pretty pink, breast, pink, breast cancer

daze, beachin, sunshine state, fun sun, sunny day, sun, sunny, sunshine

veteran day, murica, veteran, america, ivoted, election, merica, vote, usa

Leaderboard

23.3

Thank you!
