Sentiment analysis in R

opinion mining for fun and profit

http://www.rvl.io/cortex/text-analytics-in-r


useR Stockholm 2013-03-14

Me?

Joakim Lundborg
cortex@github
http://github.com/cortex

Text mining
Data analysis
Android
Random hacking

Sentiment analysis on the web

http://www.google.com/shopping


Sentiment analysis @ alaTest

http://alatest.com/reviews/cell-phone-reviews/google-nexus-4/po3-194109466,8/

millions of reviews +
multilingual opinion mining +
multilingual text generation
 = win

(I built this =))

Sentiment analysis is hard

  • Irony
  • POV
  • context
  • messy data

Sentiment analysis is easy

  • Not even humans agree
  • Law of large numbers
  • Results look cool
  • People like to find patterns and explanations for subjective errors
  • Tools are in place
  • Besides, who is going to read through those 1500+ reviews and tell me I'm wrong?

Text mining packages in R

tm
tui
openNLP

... all awesome & quirky, just like R

Why would I do this in R anyway?

Most good text analysis packages are written for other languages, primarily Python

but R is really good at munging data sets; given the right packages, the code is very elegant

...but don't try this in production =)

Getting reviews


library(RCurl)    # getURL()
library(XML)      # htmlParse(), getNodeSet()
library(stringr)  # str_trim()

# Scrape all review texts for one product page, 50 reviews at a time,
# advancing the "&s=" offset parameter until a page comes back empty
scrapePrisjakt <- function(.url){
  s <- 0
  out <- c()
  while(TRUE){
    url <- paste0(.url, "&s=", s)
    page <- getURL(url, .encoding="UTF-8")
    parsedPage <- htmlParse(page)
    reviewNodes <- getNodeSet(parsedPage,
      "//li[@class='opinion-row']//div[@class='text-truncated']")
    reviews <- lapply(reviewNodes, function(r){
      paste0(xmlApply(r, xmlValue), collapse="")
    })
    # collapse runs of newlines/tabs and trim the edges
    reviews <- lapply(reviews, function(r){
      r <- gsub("(\n)+", " ", r)
      r <- gsub("(\t)+", " ", r)
      str_trim(r)
    })
    if (length(reviews) == 0){break}
    out <- c(out, reviews)
    s <- s + 50
  }
  print(paste("Scraped", length(out), "reviews for", .url))
  unlist(out)
}

Processing

Split sentences

sentences <- unlist(lapply(iphone4, sentDetect))  
sentences.scored <- score.sentiment(sentences, pos, neg)
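score.sentiment() above is not defined on the slide; a minimal sketch of what such a function might look like, assuming the common approach of counting matches against positive and negative word lists (the name and signature here mirror the call above, but the body is an illustration, not the talk's actual implementation):

```r
# Hedged sketch: score each sentence as
# (# positive-list words) - (# negative-list words)
score.sentiment <- function(sentences, pos.words, neg.words) {
  scores <- sapply(sentences, function(sentence) {
    # normalise: lower-case, strip punctuation, split on whitespace
    words <- strsplit(tolower(gsub("[[:punct:]]", " ", sentence)), "\\s+")[[1]]
    words <- words[words != ""]
    sum(words %in% pos.words) - sum(words %in% neg.words)
  }, USE.NAMES = FALSE)
  data.frame(text = sentences, score = scores, stringsAsFactors = FALSE)
}
```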
...and do some counting:
  corpus <- Corpus(DataframeSource(data.frame(docs=.sentences)))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, tolower)
  corpus <- tm_map(corpus, function(x) removeWords(x, stopwords("swedish")))
  tdm <- TermDocumentMatrix(corpus)
  
  m <- as.matrix(tdm)
  v <- sort(rowSums(m),decreasing=TRUE)
  print(v[1:100])
  v <- v[match(names(v), .features, F)]  

Getting fancy

In real life you will probably want to spend a lot of time improving this:


  • Cleaning data
  • Better domain-specific lexica
  • POS-tagging
  • N-grams instead of single words
  • Language-dependent stemming
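The n-gram idea in the list above can be sketched with base R alone (no tm or RWeka assumed): pair each word with its successor to get bigrams, which often capture negations ("not good") that single-word lexica miss.

```r
# Hedged sketch: turn a word vector into adjacent-pair bigrams
bigrams <- function(words) {
  if (length(words) < 2) return(character(0))
  # zip each word with the next one
  paste(head(words, -1), tail(words, -1))
}
```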

Presenting

Everybody loves word clouds
  library(wordcloud)     # wordcloud()
  library(RColorBrewer)  # brewer.pal()

  d <- data.frame(word = names(v), freq = v)
  pal <- brewer.pal(9, .palette)
  pal <- pal[-(1:2)]  # drop the two lightest colours
  wordcloud(d$word, d$freq, scale = c(8, .3), min.freq = 2, max.words = 100,
            random.order = TRUE, rot.per = .15, colors = pal,
            vfont = c("sans serif", "plain"))
...but this is R, we can get normal graphs as well
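For instance, a plain barplot of the top terms in the frequency vector v from the counting slide works fine (toy frequencies here purely for illustration):

```r
# Hedged sketch: the sorted term-frequency vector v, plotted as a barplot
# (made-up example data standing in for the real rowSums output)
v <- c(bra = 42, batteri = 30, pris = 25, kamera = 18, dyr = 12)
barplot(v, las = 2, main = "Top terms", ylab = "frequency")
```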

...BUT WHERE DO WE GET THE LEXICA?


Lots of resources on the web, mostly in English:
SentiWordNet
WordNet-Affect
https://app.viralheat.com//developer/sentiment_api
http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar

...BUT I WANT IT IN SWEDISH

Do like the professionals: google it.

In this case, Google Translate it



Questions?


Happy Pi Day!