Sentiment analysis in R

opinion mining for fun and profit

http://www.rvl.io/cortex/text-analytics-in-r


useR Stockholm 2013-03-14

Me?

Joakim Lundborg
cortex@github
http://github.com/cortex

Text mining
Data analysis
Android
Random hacking

Sentiment analysis on the web

http://www.google.com/shopping


Sentiment analysis @ alaTest

http://alatest.com/reviews/cell-phone-reviews/google-nexus-4/po3-194109466,8/

millions of reviews +
multilingual opinion mining +
multilingual text generation
 = win

(I built this =))

Sentiment analysis is hard

  • Irony
  • POV
  • context
  • messy data

Sentiment analysis is easy

  • Not even humans agree
  • Law of large numbers
  • Results look cool
  • People like to find patterns and explanations for subjective errors
  • Tools are in place
  • Besides, who is going to read through those 1500+ reviews and tell me I'm wrong?

Text mining packages in R

tm
tui
openNLP

... all awesome & quirky, just like R

Why would I do this in R anyway?

Most good text analysis packages are written for other languages, primarily Python

but R is really good at munging data sets; given the right packages, the code is very elegant

...but don't try this in production =)

Getting reviews


library(RCurl)    # getURL()
library(XML)      # htmlParse(), getNodeSet()
library(stringr)  # str_trim()

# Scrape all review texts for one product page, 50 reviews at a time,
# advancing the "&s=" offset parameter until a page comes back empty
scrapePrisjakt <- function(.url){
  s <- 0
  out <- c()
  while(TRUE){
    url <- paste0(.url, "&s=", s)
    page <- getURL(url, .encoding="UTF-8")
    parsedPage <- htmlParse(page)
    reviewNodes <- getNodeSet(parsedPage,
      "//li[@class='opinion-row']//div[@class='text-truncated']")
    reviews <- lapply(reviewNodes, function(r){
      paste0(xmlApply(r, xmlValue), collapse="")
    })
    # collapse runs of newlines/tabs and trim the edges
    reviews <- lapply(reviews, function(r){
      r <- gsub("(\n)+", " ", r)
      r <- gsub("(\t)+", " ", r)
      str_trim(r)
    })
    if (length(reviews) == 0){break}
    out <- c(out, reviews)
    s <- s + 50
  }
  print(paste("Scraped", length(out), "reviews for", .url))
  unlist(out)
}

Processing

Split sentences

sentences <- unlist(lapply(iphone4, sentDetect))  
sentences.scored <- score.sentiment(sentences, pos, neg)
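score.sentiment() above is not defined on the slide; a minimal sketch of what such a function might look like, assuming the common approach of counting matches against positive and negative word lists (the name and signature here mirror the call above, but the body is an illustration, not the talk's actual implementation):

```r
# Hedged sketch: score each sentence as
# (# positive-list words) - (# negative-list words)
score.sentiment <- function(sentences, pos.words, neg.words) {
  scores <- sapply(sentences, function(sentence) {
    # normalise: lower-case, strip punctuation, split on whitespace
    words <- strsplit(tolower(gsub("[[:punct:]]", " ", sentence)), "\\s+")[[1]]
    words <- words[words != ""]
    sum(words %in% pos.words) - sum(words %in% neg.words)
  }, USE.NAMES = FALSE)
  data.frame(text = sentences, score = scores, stringsAsFactors = FALSE)
}
```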
...and do some counting:
  corpus <- Corpus(DataframeSource(data.frame(docs=.sentences)))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, tolower)
  corpus <- tm_map(corpus, function(x) removeWords(x, stopwords("swedish")))
  tdm <- TermDocumentMatrix(corpus)
  
  m <- as.matrix(tdm)
  v <- sort(rowSums(m),decreasing=TRUE)
  print(v[1:100])
  v <- v[match(names(v), .features, F)]  

Getting fancy

In real life you will probably want to spend a lot of time improving this:


  • Cleaning data
  • Better domain-specific lexica
  • POS-tagging
  • N-grams instead of single words
  • Language-dependent stemming
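The n-gram idea in the list above can be sketched with base R alone (no tm or RWeka assumed): pair each word with its successor to get bigrams, which often capture negations ("not good") that single-word lexica miss.

```r
# Hedged sketch: turn a word vector into adjacent-pair bigrams
bigrams <- function(words) {
  if (length(words) < 2) return(character(0))
  # zip each word with the next one
  paste(head(words, -1), tail(words, -1))
}
```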

Presenting

Everybody loves word clouds
  library(wordcloud)     # wordcloud()
  library(RColorBrewer)  # brewer.pal()

  d <- data.frame(word = names(v), freq = v)
  pal <- brewer.pal(9, .palette)
  pal <- pal[-(1:2)]  # drop the two lightest colours
  wordcloud(d$word, d$freq, scale = c(8, .3), min.freq = 2, max.words = 100,
            random.order = TRUE, rot.per = .15, colors = pal,
            vfont = c("sans serif", "plain"))
...but this is R, we can get normal graphs as well
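For instance, a plain barplot of the top terms in the frequency vector v from the counting slide works fine (toy frequencies here purely for illustration):

```r
# Hedged sketch: the sorted term-frequency vector v, plotted as a barplot
# (made-up example data standing in for the real rowSums output)
v <- c(bra = 42, batteri = 30, pris = 25, kamera = 18, dyr = 12)
barplot(v, las = 2, main = "Top terms", ylab = "frequency")
```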

...BUT WHERE DO WE GET THE LEXICA?


Lots of resources on the web, mostly in English:
SentiWordNet
WordNet-Affect
https://app.viralheat.com//developer/sentiment_api
http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar

...BUT I WANT IT IN SWEDISH

Do like the professionals: google it.

In this case, Google Translate it



Questions?


Happy Pi Day!