Harnessing Character-level Dynamics and Word-level Semantics: A New Perspective on Twitter Sentiment Analysis
About the Dataset
Existing Works
Experiment 1
Use DMD embeddings as features for ML classifiers
DMD in Word Embeddings
The dynamics captured by Dynamic Mode Decomposition (DMD) correspond to the semantic relationships between characters within sentences. By analyzing the transitions from one character to the next, DMD identifies and represents the underlying dynamics that convey meaning; these dynamics encapsulate the relationships between characters and contribute to the overall semantic understanding of the sentence.
Simple Graph Construction
Nodes are characters
Edges are based on character co-occurrence
If an edge already exists, update its weight; otherwise add a new edge
Fit the co-occurrence matrix to get the DMD modes
Take the highest (dominant) mode (see the sketch below)
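A minimal sketch of this step, assuming "fitting" means exact (pseudo-inverse) DMD on the co-occurrence matrix with consecutive columns treated as snapshot pairs; the matrix C and the 4-character alphabet here are purely illustrative:

import numpy as np

def dmd_dominant_mode(C):
    # Consecutive columns of C act as before/after snapshots
    X, Y = C[:, :-1], C[:, 1:]
    # Least-squares linear operator mapping X to Y
    A = Y @ np.linalg.pinv(X)
    # Eigenvalues and eigenvectors of the fitted operator are the DMD modes
    eigvals, eigvecs = np.linalg.eig(A)
    # Keep the mode with the largest-magnitude eigenvalue
    return eigvecs[:, np.argmax(np.abs(eigvals))]

# Toy co-occurrence matrix over a 4-character alphabet
C = np.array([[0., 2., 1., 0.],
              [2., 0., 3., 1.],
              [1., 3., 0., 2.],
              [0., 1., 2., 0.]])
mode = dmd_dominant_mode(C)  # one embedding component per character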
Results
Experiment 2
Message Passing
Intuition
Spike a node of the graph; the edges denote the paths along which the signal can transfer through the graph
How the signal spreads through the graph is the message passing
A is the adjacency matrix
D is the degree matrix (number of connections per node)
Pass to GCN
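In matrix form, one message-passing step of a GCN layer combines A and D as H' = D^{-1/2}(A + I)D^{-1/2} H W (Kipf and Welling). A small NumPy sketch of that update, with illustrative names:

import numpy as np

def gcn_step(A, H, W):
    # Add self-loops so each node keeps its own signal
    A_hat = A + np.eye(A.shape[0])
    # Symmetric normalization by node degree
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Spread (message-pass) node features along edges, then transform
    return A_norm @ H @ W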
Experiment 3
From Characters to Words
AFINN
Lexicon-based approach for text scoring
The AFINN lexicon consists of over 3,300 English words and phrases, each assigned a sentiment score ranging from -5 to +5. The score indicates the degree of positive or negative sentiment associated with a word or phrase.
Words with positive sentiment get a positive score; an example follows.
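For instance, with the afinn package (the same one imported in the code later); the values come from the AFINN-111 word list:

from afinn import Afinn

afinn = Afinn()
afinn.score("great")      # 3.0  (positive word, positive score)
afinn.score("terrible")   # -3.0 (negative word, negative score)
afinn.score("great game, terrible lag")  # sums the scores of known words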
VADER
Valence Aware Dictionary and sEntiment Reasoner
A rule-based sentiment analysis tool
Comes with a pre-trained model that includes a lexicon of words and their associated sentiment scores
VADER assigns sentiment scores to individual words based on the lexicon and their contextual usage; these scores are combined into an overall (compound) sentiment score for a given text
Compound score ranges from -1 (extremely negative) to +1 (extremely positive)
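Using NLTK's implementation (the one imported in the code below); the 'compound' field is the normalized overall score used later:

from nltk.sentiment import SentimentIntensityAnalyzer
# import nltk; nltk.download('vader_lexicon')  # one-time lexicon download

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("I love this game!")
# scores is a dict with 'neg', 'neu', 'pos' and 'compound' keys
print(scores['compound'])  # in [-1, +1]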
GloVe
The GloVe model combines global word-word co-occurrence counts with local window-based approaches
Pre-trained vectors from Stanford, trained on a 6-billion-token English corpus
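One convenient way to load those pre-trained vectors is gensim's downloader (model name as published in the gensim-data repository):

import gensim.downloader as api

# 100-d GloVe vectors trained on the 6B-token corpus
glove = api.load("glove-wiki-gigaword-100")
glove["headache"]                    # 100-d embedding vector
glove.most_similar("happy", topn=3)  # nearest words by cosine similarity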
GloVe
Objective Function
P_ij = the observed co-occurrence count between word i and word j in the corpus
For each pair of words that can co-occur, minimize the squared distance between the inner product of the two word vectors and the log of their co-occurrence count
The squared distance is weighted by f(P_ij), which down-weights rare pairs and caps the weight of very frequent ones
If two words frequently co-occur in similar contexts, their inner product should be high, matching the high logarithm of their co-occurrence count. By minimizing the difference between the two, GloVe learns word vectors that capture the relationships between words based on their co-occurrence patterns.
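Written out, this is the standard GloVe objective (Pennington et al., 2014), with b_i and \tilde{b}_j as learned bias terms:

J = \sum_{i,j=1}^{V} f(P_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log P_{ij} \right)^2

f(x) = (x / x_{\max})^{\alpha} if x < x_{\max}, and 1 otherwise (the paper uses \alpha = 0.75)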
Graph Construction
"I am having headache"
Nodes: I, am, having, headache
k = 1
Each word is connected to its k nearest words; with k = 1 the example sentence gives the edges I-am, am-having, having-headache
Pipeline: Text → Embedding → GCN → Pooling → Logits
Results
import re
import string
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from afinn import Afinn
from nltk.sentiment import SentimentIntensityAnalyzer
from gensim.models import Word2Vec
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv
import torch
import torch.nn.functional as F
# Load the data
df = pd.read_csv("/content/twitter_validation.csv", usecols=[2, 3], names=['emotion', 'tweet'], header=0)
# Lowercase the tweets
df['tweet'] = df['tweet'].str.lower()
# Remove URLs
df['tweet'] = df['tweet'].apply(lambda x: re.sub(r'https?:\/\/\S+', '', x))
# Remove the punctuation
df['tweet'] = df['tweet'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
# Remove stopwords and keep words with length > 3
stop = stopwords.words('english')
df['tweet'] = df['tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop and len(word) > 3]))
from sklearn.preprocessing import LabelEncoder
# Convert emotion labels to numerical labels
label_encoder = LabelEncoder()
df['label'] = label_encoder.fit_transform(df['emotion'])
# Get the unique numerical labels
unique_labels = df['label'].unique()
# Instantiate sentiment intensity analyzer
sia = SentimentIntensityAnalyzer()
# Instantiate AFINN-111
afinn = Afinn()
# Tokenize the tweets
sentences = [word_tokenize(tweet) for tweet in df['tweet']]
# Train Word2Vec model
word2vec_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
# Define a function to get sentiment scores using AFINN
def get_affinn_sentiment(word):
    return afinn.score(word)

# Define a function to get sentiment scores using VADER
def get_vader_sentiment(word):
    sentiment = sia.polarity_scores(word)
    return sentiment['compound']

# Define a function to get Word2Vec embeddings
def get_word2vec_embedding(word):
    if word in word2vec_model.wv:
        return torch.from_numpy(word2vec_model.wv[word])
    else:
        return torch.zeros(word2vec_model.vector_size)
# Prepare the data
df['tweet'] = df['tweet'].apply(lambda x: word_tokenize(x))
df['affinn_sentiment'] = df['tweet'].apply(lambda x: [get_affinn_sentiment(word) for word in x])
df['vader_sentiment'] = df['tweet'].apply(lambda x: [get_vader_sentiment(word) for word in x])
df['word2vec_embedding'] = df['tweet'].apply(lambda x: [get_word2vec_embedding(word) for word in x])
Code
def create_graph(tweet, node_embeddings, label, k=2):
    # Skip tweets whose words produced no embeddings
    if len(node_embeddings) == 0:
        return None
    # Stack node embeddings into the node feature matrix
    x = torch.stack(node_embeddings, dim=0)
    y = torch.tensor([label], dtype=torch.long)
    # Connect each word to its k nearest following words
    edge_index = []
    for i in range(len(tweet)):
        for j in range(i + 1, min(i + k + 1, len(tweet))):
            edge_index.append((i, j))
    # Skip graphs with a single node or no edges
    if len(x) <= 1 or len(edge_index) == 0:
        return None
    edge_index = torch.tensor(edge_index, dtype=torch.long).t().contiguous()
    # Construct a PyG Data object
    return Data(x=x, edge_index=edge_index, y=y)
graph_data_list = []
for i, row in df.iterrows():
    tweet = row['tweet']
    # Sentiment scores are extracted here, but only the Word2Vec
    # embeddings are used as node features in this experiment
    affinn_sentiment = row['affinn_sentiment']
    vader_sentiment = row['vader_sentiment']
    word2vec_embedding = row['word2vec_embedding']
    label = row['label']
    graph_data = create_graph(tweet, word2vec_embedding, label)
    # Add the graph data object to the list
    if graph_data is not None:
        graph_data_list.append(graph_data)
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(graph_data_list, test_size=0.2, random_state=42)
def train(model, graph_data_list, criterion, optimizer):
    model.train()
    for data in graph_data_list:
        optimizer.zero_grad()  # Clear gradients
        # Forward propagation
        outputs = model(data.x, data.edge_index)
        # Convert data.y to a one-hot vector to match the logit shape
        y_onehot = F.one_hot(data.y, num_classes=4).float()
        # Compute loss
        loss = criterion(outputs.squeeze(), y_onehot.squeeze())
        # Backward propagation
        loss.backward()
        # Update weights
        optimizer.step()
def test(model, graph_data_list):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for data in graph_data_list:
            outputs = model(data.x, data.edge_index)
            predicted = outputs.argmax(dim=1)
            correct += (predicted == data.y).sum().item()
            total += len(data.y)
    accuracy = correct / total
    return accuracy
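The deck never shows the model, loss, or optimizer that the loop below expects. A minimal sketch consistent with the calls above (two GCNConv layers, mean pooling of node logits into one graph-level prediction, and a loss that accepts the one-hot targets) might be:

import torch.nn as nn

class GCN(nn.Module):
    def __init__(self, in_dim=100, hidden_dim=64, num_classes=4):
        super().__init__()
        # in_dim matches Word2Vec's vector_size; num_classes matches the one-hot size
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, num_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = self.conv2(x, edge_index)
        # Mean-pool node logits into a single graph-level prediction
        return x.mean(dim=0, keepdim=True)

model = GCN()
criterion = nn.BCEWithLogitsLoss()  # pairs with the one-hot targets in train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)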
for epoch in range(10):  # Number of epochs
    train(model, train_data, criterion, optimizer)
    test_acc = test(model, test_data)
    print(f"Epoch {epoch+1}, Test Accuracy: {test_acc}")
Code
By Incredeble us