2020 U.S. Presidential Election Analysis

Final project

CAP6307: Text Mining

by

Dharitrikumari Rathod

Lamia Alshahrani

Yuh Haur Chen

Introduction:

  • Analysis of the 2020 presidential debates and town hall meetings
  • Comparison of the 2020 and 2016 presidential debates

Research questions and methods:

  1. What were the most prominent words used by each candidate at each stage of the 2020 and 2016 presidential debates? Method: word cloud, bigrams
  2. Which candidate was the most talkative during the debates? Method: statistics
  3. How many times was each candidate interrupted by others? Method: heatmap
  4. What were the positive and negative sentiments of the 2020 debates? Method: sentiment analysis
  5. Which topics were discussed during the 2020 presidential debates? Method: TF-IDF, cosine similarity
  6. How similar is one candidate's speech to the other candidate's on the same topic? Method: TF-IDF, LDA model
  7. How similar are President Trump's speeches in 2020 and 2016? Method: TF-IDF, LDA model

Related work:

  • Data source: Presidential debate 2016, Presidential debate 2020, Rev.com
  • Knowledge: David M. Blei, LDA

  1. Sentiment analysis of presidential candidates (Indonesia) based on Twitter data, using the Naive Bayes algorithm [4]
  2. The TF-IDF algorithm applied to news documents, using the frequency of a word/term within a document (TF) and the frequency of the term in other documents (IDF) [3]

Data Processing:

  • Normalization
  • Tokenization
  • Stopword removal
  • Custom stopwords
  • Lemmatization
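
The five steps above can be sketched in a few lines of plain Python. This is only an illustration of the order of operations, not the project's actual code: the slides list NLTK for this work, while the stopword sets and the lemma lookup below are small hypothetical stand-ins so the example runs without any downloads.

```python
import re

# Tiny illustrative stopword set; the project uses NLTK's full English list.
BASE_STOPWORDS = {"i", "we", "the", "a", "an", "and", "of", "to", "in", "is", "are", "that", "it"}
# Hypothetical custom stopwords for debate transcripts (not taken from the project).
CUSTOM_STOPWORDS = {"crosstalk", "going", "said"}
# Toy lemma lookup standing in for a real lemmatizer.
LEMMAS = {"jobs": "job", "taxes": "tax", "ballots": "ballot", "businesses": "business"}


def preprocess(text):
    # Normalization: lowercase, then replace everything except letters,
    # apostrophes, and spaces with a space.
    normalized = re.sub(r"[^a-z' ]+", " ", text.lower())
    # Tokenization: split on whitespace.
    tokens = normalized.split()
    # Stopword removal: standard list plus custom additions.
    stop = BASE_STOPWORDS | CUSTOM_STOPWORDS
    kept = [t for t in tokens if t not in stop]
    # Lemmatization: map each token to its base form when one is known.
    return [LEMMAS.get(t, t) for t in kept]


print(preprocess("We are going to create jobs, and cut taxes in 2021!"))
# ['create', 'job', 'cut', 'tax']
```

Custom stopwords matter for transcripts because filler words and transcription markers would otherwise dominate the word clouds.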

Libraries & Packages & Data cleaning:

Tools: Python 3.8.2, Jupyter notebook

Method | Libraries & packages
Word cloud | NLTK, pandas, wordcloud
Bigrams | NLTK, pandas
Heatmaps | Matplotlib, pandas
Sentiment analysis | SentimentIntensityAnalyzer
TF-IDF model | scikit-learn, pandas, NLTK
LDA-BoW model | Gensim, NLTK, pandas, spaCy, pyLDAvis
LDA-TF-IDF model | Gensim, NLTK, pandas, spaCy, pyLDAvis

#1 Most prominent words

-  Word Cloud

 Most prominent words: 2020

Biden 2020

Trump 2020

 Most prominent words: 2016

Clinton 2016

Trump 2016

 Most prominent words During First Debate 2020

Biden 1st debate

Trump 1st debate

 Most prominent words During Second Debate 2020

Biden 2nd debate

Trump 2nd debate

 Most prominent words During TownHall Debate 2020

Biden TownHall

Trump TownHall

#1 Most prominent words

-  Bigram

  Most prominent topics: 2020

 Bigram: Joe Biden

Topics
Green infrastructure
Social security
Mail in ballot
Health insurance/ Affordable care act
First responders
Tax plans

  Most prominent topics:

 Bigram: Donald Trump

Extracted Topics
New york
Individual mandate
Law enforcement
Stock market
Justice reform
Economy
Small businesses
Oil industry
Forest Management
Obamacare/Health Insurance
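
Topic phrases like the ones listed above come from counting adjacent word pairs. A minimal sketch using only the standard library follows; the project itself lists NLTK for this step, and the token list here is invented for illustration.

```python
from collections import Counter


def top_bigrams(tokens, n=3):
    # Pair each token with its successor, then count the pairs.
    pairs = zip(tokens, tokens[1:])
    return Counter(pairs).most_common(n)


# Hypothetical, already-preprocessed tokens.
tokens = ["green", "infrastructure", "plan", "green", "infrastructure", "social", "security"]
print(top_bigrams(tokens, n=1))
# [(('green', 'infrastructure'), 2)]
```

Counting pairs after stopword removal is what lets phrases such as "green infrastructure" surface instead of pairs like "of the".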

 #2 Most talkative in 2020:

Debate_1

Debate_2
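
One way to measure talkativeness is to count turns, sentences, and words per speaker. The sketch below assumes the transcript is available as a list of (speaker, text) pairs; that layout, the function name, and the sample turns are assumptions for illustration, not the project's code.

```python
import re
from collections import defaultdict


def speaking_stats(turns):
    stats = defaultdict(lambda: {"turns": 0, "sentences": 0, "words": 0})
    for speaker, text in turns:
        entry = stats[speaker]
        entry["turns"] += 1
        # Rough sentence split on runs of ., ! or ?; empty pieces are ignored.
        entry["sentences"] += len([p for p in re.split(r"[.!?]+", text) if p.strip()])
        # Word count by whitespace.
        entry["words"] += len(text.split())
    return dict(stats)


# Hypothetical transcript fragment.
turns = [
    ("Speaker A", "Thank you. I have a plan."),
    ("Speaker B", "Wrong."),
    ("Speaker A", "Let me finish!"),
]
print(speaking_stats(turns))
```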

#3 Crosstalks

-  Heatmap

Crosstalk/Heat moment

Town Hall Crosstalks
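
A heatmap of crosstalk needs a speaker-by-speaker count matrix. The sketch below shows one possible way to build it, under two loudly stated assumptions: that the transcript marks overlapping speech with a bracketed "[crosstalk ...]" tag, and that a turn containing the tag was interrupted by whoever speaks next. Neither rule is confirmed by the slides.

```python
from collections import Counter


def crosstalk_counts(turns, marker="[crosstalk"):
    # Key is (interrupter, interrupted); the attribution rule is an assumption.
    counts = Counter()
    for (speaker, text), (next_speaker, _) in zip(turns, turns[1:]):
        if marker in text.lower() and next_speaker != speaker:
            counts[(next_speaker, speaker)] += 1
    return counts


def to_matrix(counts, speakers):
    # Rows are interrupters, columns are the speakers being interrupted.
    return [[counts[(row, col)] for col in speakers] for row in speakers]


# Hypothetical transcript fragment.
turns = [
    ("Moderator", "First question."),
    ("A", "My plan is [crosstalk 00:01:10]"),
    ("B", "That is not [crosstalk 00:01:12]"),
    ("A", "Let me finish."),
]
print(to_matrix(crosstalk_counts(turns), ["A", "B"]))
# [[0, 1], [1, 0]]
```

The resulting nested list can be handed to a plotting library's image or heatmap function to produce the figure.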

#4 Sentiment analysis

Sentiment Analysis:

1st_debate

Overall 2020

2nd_debate
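
The idea behind lexicon-based sentiment scoring can be shown with a toy example. This is a deliberately simplified stand-in, not the analyzer used in the project: the six-word lexicon and its weights are invented, and the averaging rule is an assumption made for illustration.

```python
import re

# Hypothetical mini-lexicon: word -> polarity weight.
LEXICON = {
    "great": 1.0, "good": 0.5, "tremendous": 1.0,
    "bad": -0.5, "disaster": -1.0, "terrible": -1.0,
}


def polarity(sentence):
    # Average the weights of the lexicon words found in the sentence.
    words = re.findall(r"[a-z']+", sentence.lower())
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def label(score, threshold=0.05):
    # Scores near zero are treated as neutral.
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


for s in ["It was a great and good plan", "A terrible disaster", "We met on Tuesday"]:
    print(s, "->", label(polarity(s)))
```

A production analyzer also handles negation, intensifiers, and punctuation, which this sketch ignores.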

#5&6 Topic similarity

- TF-IDF

TF-IDF vectorizer

TF-IDF 1st debate

Biden

Trump

TF-IDF 2nd debate

TF-IDF Topics

TF-IDF and cosine similarity

We built a TF-IDF model to check whether a given topic was discussed in the 2020 election, and used cosine similarity to compare topics across each stage of the 2020 presidential election.
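
To make the two ingredients concrete, here is a minimal standard-library sketch of TF-IDF weighting followed by cosine similarity. The project lists scikit-learn for this step; the version below uses a smoothed IDF and length-normalized term frequency, so its weights will not match a library's output exactly, and the three short documents are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document.
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    # Cosine of the angle between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical preprocessed documents.
docs = [
    ["health", "insurance", "plan"],
    ["health", "insurance", "tax"],
    ["oil", "industry", "energy"],
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Documents that share weighted terms score closer to 1, and documents with no terms in common score 0.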

Extracted Topics TFIDF

#5,6,7 Topic similarity

- LDA model

  • Each document is composed of several topics.

  • Each topic can be described by several important words, and the same word can appear in different topics at the same time.

Latent Dirichlet Allocation (LDA)
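
The two bullet points above correspond to the standard generative description of LDA. The notation below is not from the slides and is introduced here as an assumption: K topics, D documents, Dirichlet priors with parameters \alpha and \eta, per-document topic proportions \theta_d, per-topic word distributions \beta_k, and a topic assignment z_{d,n} for the n-th word w_{d,n} of document d.

```latex
\begin{align*}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Categorical}(\beta_{z_{d,n}})
\end{align*}
```

Because each \theta_d is a distribution over all K topics, a document mixes several topics, and because each \beta_k is a distribution over the whole vocabulary, the same word can carry weight in more than one topic.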

Dictionary | Corpus
corpora.Dictionary() | Dictionary.doc2bow(), models.TfidfModel()
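
To show what the dictionary and bag-of-words steps produce, here is a plain-Python imitation of the idea. It is not Gensim itself: the function names are hypothetical, and the integer ids a real Gensim dictionary assigns may differ from the first-appearance order used here.

```python
from collections import Counter


def build_dictionary(docs):
    # Assign each distinct token an integer id in order of first appearance.
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(token2id, doc):
    # Sparse bag-of-words: sorted (token_id, count) pairs; unknown tokens are skipped.
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


# Hypothetical preprocessed documents.
docs = [["ballot", "election", "ballot"], ["election", "vaccine"]]
token2id = build_dictionary(docs)
print(token2id)
# {'ballot': 0, 'election': 1, 'vaccine': 2}
print([to_bow(token2id, d) for d in docs])
# [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

A TF-IDF corpus has the same sparse shape, with the raw counts replaced by TF-IDF weights.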

[Diagram: LDA pipeline. For the first debate, the second debate, and the combined debates, a dictionary plus a BoW corpus or a TF-IDF corpus is used to train LDA models 1~3 of each kind (Debate1-BoW, Debate1-tfidf, Debate2-BoW, Debate2-tfidf, Debate1+2-BoW, Debate1+2-tfidf). Topics are then extracted and visualized with pyLDAvis.]

What topics?

BoW | TF-IDF
Second, insurance, healthcare, obamacare | Industry, child, website, reform, school, business
Industry, business, website, plague, enormous, nuclear, opportunity | Talk, totally, month, well, police, discredit
Family, clean, filthy, emission, police, scientist, pollutant, environmental, carbon | Family, immigration, political, protest, tremendous, democratic
Ballot, election, inaugural, racist, chance | People, million, evidence, healthcare, opposite, condition, fantastic, nuclear
Create, building, question, economic, energy, company | Election, chance, separate
Excuse, energy, subsidy, dangerous, federal, border, highway, economically | Ballot, number, answer
Statement, plant, pollute, chemical, refinery, superpredator | Crosstalk, economy, segment, disaster, vaccine
Crosstalk, Hispanic, environment, global warming, gasoline, unemployment | Second, dollar, radical, deserve
Dollar, fracke, billion, emission, climate, market | Support, energy, federal

Evaluation of LDA models: BoW vs. TF-IDF

[Figures: topic similarity plots for the BoW model and for the TF-IDF model]

[Diagram: topic similarity tests. Labels shown: Trump debate 1, Trump debate 2, Biden debate 1, Biden debate 2, Trump 2016-1, Trump 2016-2; test documents for Trump and Biden; LDA models trained on Debate 1, Debate 2, and all debates.]

Result

Question | Answer
Word cloud: the most prominent words in the presidential debates | People, going, president, country, said, opponent
Comparison of the topics covered by each candidate at different stages | Money, China, insurance, businesses
TF-IDF: prominent words with their assigned weights (people, president, ballots) | People, president, million, ballot, question, election
Bigrams: more meaningful topics discussed (green infrastructure, forest management) | Pandemic, green infrastructure, Affordable Care Act, stock market
Heatmap: crosstalk during the debates (most in debate 1) | Debate 1: 13, debate 2: 7
Word analysis: number of sentences spoken in each debate, by whom and how many | Debate 1, President Trump

Sentiment analysis: based on the words used during the debates.

LDA model: we trained LDA models on the BoW and TF-IDF corpora and compared their scores. The TF-IDF-based LDA model performed better at topic modeling.

Future Work:

  • To extend this work, we may train our models with more advanced machine learning algorithms to achieve higher accuracy.

  • Regular expressions can be used to remove more unnecessary words and numbers.

  • The TF-IDF model can be trained to find the most prominent words in the document.

  • LDA can be trained on bigrams for more accurate topic modeling.
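
As a sketch of the regex idea, the function below strips bracketed annotations, parenthesized timestamps, and standalone numbers. The specific patterns are assumptions about what transcript noise looks like, not rules taken from the project.

```python
import re


def clean_transcript(text):
    # Drop bracketed annotations such as "[crosstalk 00:12:34]" or "[inaudible]".
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Drop parenthesized timestamps such as (01:23) or (01:02:03).
    text = re.sub(r"\(\d{1,2}:\d{2}(?::\d{2})?\)", " ", text)
    # Drop standalone numbers, including ones with commas or decimal points.
    text = re.sub(r"\b\d[\d,.]*\b", " ", text)
    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()


print(clean_transcript("(01:23) We added 5,000 jobs [crosstalk 00:01:25] and 3 plants"))
# We added jobs and plants
```

Running this before tokenization keeps transcription markers out of the topic models.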

NLTK Stopwords

i, me, my, myself, we, our, ours, ourselves, themselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, against, between,
himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, he, him, his, their, theirs, what, which, who, whom, this, that, that'll, these, those,
were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, am, is, are, was, as, until, while, of, at, by, for, with, about,
into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mustn't, needn, needn't, shan, shan't, shouldn, shouldn't, wasn, wasn't, weren, weren't, won, won't, wouldn, wouldn't

Code Demo

LDA model

import gensim
from gensim import models
from gensim.matutils import cossim

# Train the LDA model on the TF-IDF corpus of all debates.
# corpus_All_tfidf and id2word_All come from the dictionary/corpus step.
lda_model_All_tfidf = gensim.models.LdaMulticore(corpus=corpus_All_tfidf,
                                                 id2word=id2word_All,
                                                 num_topics=10,
                                                 random_state=100,
                                                 chunksize=100,
                                                 passes=10,
                                                 per_word_topics=True)

# Model: Debate all, T vs B
# Trump: first 2020 debate vs. second 2020 debate
test_doc = [text_DT_2020_1, text_DT_2020_2]
test_doc = [doc.split() for doc in test_doc]
test_corpus = [id2word_All.doc2bow(doc) for doc in test_doc]
tfidf = models.TfidfModel(test_corpus)
corpus_tfidf = tfidf[test_corpus]

# Cosine similarity between the topic distributions of the two documents
doc1 = lda_model_All_tfidf.get_document_topics(corpus_tfidf[0], minimum_probability=0.1)
doc2 = lda_model_All_tfidf.get_document_topics(corpus_tfidf[1], minimum_probability=0.1)
print("Model: Debate all, Trump 2020")
print(cossim(doc1, doc2))

# Biden: first 2020 debate vs. second 2020 debate
test_doc2 = [text_JB_2020_1, text_JB_2020_2]
test_doc2 = [doc.split() for doc in test_doc2]
test_corpus2 = [id2word_All.doc2bow(doc) for doc in test_doc2]
tfidf2 = models.TfidfModel(test_corpus2)
corpus_tfidf2 = tfidf2[test_corpus2]

doc3 = lda_model_All_tfidf.get_document_topics(corpus_tfidf2[0], minimum_probability=0.1)
doc4 = lda_model_All_tfidf.get_document_topics(corpus_tfidf2[1], minimum_probability=0.1)
print("Model: Debate all, Biden 2020")
print(cossim(doc3, doc4))

Topic similarity

References:

[1] Meg Risdal. Kaggle. 2016 US presidential debates.

[2] Heads or Tails. Kaggle. US Election 2020 - Presidential Debates.

[3] Ari Aulia Hakim, Alva Erwin, Kho I Eng, Maulahikmah Galinium, Wahyu Muliady. Oct. 2014. “Automated document classification for news articles in Bahasa Indonesia based on term frequency-inverse document frequency (TF-IDF) approach”.

[4] Meylan Wongkar, Apriandy Angdresey. Oct. 2019. “Sentiment Analysis Using Naive Bayes Algorithm Of The Data Crawler: Twitter”.

[5] Yonghe Lu, Yawen Zheng. Nov. 2018. “Subject Analysis of the Microblog About US Presidential Election Based on LDA”.

[6] David M. Blei. Apr. 2012. “Probabilistic Topic Models”.

Thank you!

2020 U.S Presidential Election Analysis - Team 10

By jackiechen08


CAP6307 - Textmining (UCF MSDA)
