Sentiment analysis in R
opinion mining for fun and profit
useR Stockholm 2013-03-14
Me?
Joakim Lundborg
cortex@github
http://github.com/cortex
Text mining
Data analysis
Android
Random hacking
Sentiment analysis on the web
http://www.google.com/shopping
Sentiment analysis @ alaTest
http://alatest.com/reviews/cell-phone-reviews/google-nexus-4/po3-194109466,8/
millions of reviews +
multilingual opinion mining +
multilingual text generation
= win
(I built this =))
Sentiment analysis is hard
- Irony
- Point of view
- Context
- Messy data
Sentiment analysis is easy
- Not even humans agree
- Law of large numbers
- Results look cool
- People like to find patterns and explanations for subjective errors
- Tools are in place
- Besides, who is going to read through those 1500+ reviews and tell me I'm wrong?
Text mining packages in R
tm
tau
openNLP
Why would I do this in R anyway?
Most good text analysis packages are written for other languages, primarily Python,
but R is really good at munging data sets; given the right packages, the code is very elegant.
...but don't try this in production =)
Getting reviews
library(RCurl)    # getURL
library(XML)      # htmlParse, getNodeSet, xmlApply, xmlValue
library(stringr)  # str_trim

scrapePrisjakt <- function(.url) {
  s <- 0
  out <- c()
  while (TRUE) {
    # Fetch one page of reviews; Prisjakt paginates via the 's' offset parameter
    url <- paste0(.url, "&s=", s)
    page <- getURL(url, .encoding = "UTF-8")
    parsedPage <- htmlParse(page)
    # Pull out the (truncated) review texts
    reviewNodes <- getNodeSet(parsedPage, "//li[@class='opinion-row']//div[@class='text-truncated']")
    reviews <- lapply(reviewNodes, function(r) paste0(xmlApply(r, xmlValue), collapse = ""))
    # Collapse newlines/tabs and trim whitespace
    reviews <- lapply(reviews, function(r) {
      r <- gsub("(\n)+", " ", r)
      r <- gsub("(\t)+", " ", r)
      str_trim(r)
    })
    if (length(reviews) == 0) break  # empty page: no more reviews
    out <- c(out, reviews)
    s <- s + 50                      # 50 reviews per page
  }
  print(paste("Scraped", length(out), "reviews for", .url))
  unlist(out)
}
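Usage is then one call per product review page; the URL below is a placeholder, not the one from the talk:

iphone4 <- scrapePrisjakt("http://www.prisjakt.nu/<product-review-page>")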
Processing
Split sentences
library(openNLP)  # sentDetect, as shipped in openNLP at the time (2013)

sentences <- unlist(lapply(iphone4, sentDetect))
sentences.scored <- score.sentiment(sentences, pos, neg)
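score.sentiment isn't defined on the slides; a minimal sketch in the spirit of Jeffrey Breen's widely circulated version, scoring each sentence as matched positive words minus matched negative words (the data frame layout here is an assumption):

library(stringr)

# Sketch of score.sentiment (not shown in the talk): strip punctuation,
# lowercase, split into words, then count lexicon hits.
score.sentiment <- function(sentences, pos, neg) {
  scores <- sapply(sentences, function(sentence) {
    words <- unlist(str_split(str_trim(tolower(gsub("[[:punct:]]", "", sentence))), "\\s+"))
    sum(words %in% pos) - sum(words %in% neg)
  }, USE.NAMES = FALSE)
  data.frame(score = scores, text = sentences, stringsAsFactors = FALSE)
}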
And do some work counting
library(tm)

# Build a corpus from the sentences and normalise it
corpus <- Corpus(DataframeSource(data.frame(docs = .sentences)))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, function(x) removeWords(x, stopwords("swedish")))

# Count terms and sort by total frequency
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
print(v[1:100])

# Keep only the terms in the feature list
v <- v[names(v) %in% .features]
Getting fancy
In real life you will probably want to spend lots of time improving this:
- Cleaning data
- Better domain-specific lexica
- POS tagging
- N-grams instead of single words (see the sketch below)
- Language-dependent stemming
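For example, a bigram tokenizer can be plugged into tm's term-document matrix, and SnowballC provides Swedish stemming; the BigramTokenizer helper below is a hypothetical sketch, not from the talk:

library(tm)
library(SnowballC)  # stemming backend for stemDocument

# Hypothetical helper: pair up adjacent words so the TDM counts bigrams
BigramTokenizer <- function(x) {
  words <- unlist(strsplit(as.character(x), "\\s+"))
  if (length(words) < 2) return(character(0))
  paste(words[-length(words)], words[-1])
}

tdm.bigrams <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
corpus.stemmed <- tm_map(corpus, stemDocument, language = "swedish")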
Presenting
Everybody loves word clouds

library(wordcloud)
library(RColorBrewer)

d <- data.frame(word = names(v), freq = v)
pal <- brewer.pal(9, .palette)  # .palette is a Brewer palette name, e.g. "BuGn"
pal <- pal[-(1:2)]              # drop the lightest shades
wordcloud(d$word, d$freq, scale = c(8, .3), min.freq = 2, max.words = 100,
          random.order = T, rot.per = .15, colors = pal,
          vfont = c("sans serif", "plain"))
...but this is R, we can get normal graphs as well
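For instance, the per-sentence scores from score.sentiment can go straight into a histogram (a sketch, assuming the score column from the version above):

# Distribution of per-sentence sentiment scores
hist(sentences.scored$score,
     main = "Sentence sentiment",
     xlab = "score (positive - negative word matches)")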
...BUT WHERE DO WE GET THE lexica?
Lots of resources on the web, mostly in English:
SentiWordNet
WordNet-Affect
http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
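The Hu & Liu archive above unpacks to two plain word lists; a sketch of loading them (assuming the standard file names, whose headers are ';'-prefixed comment lines):

pos <- scan("positive-words.txt", what = "character", comment.char = ";")
neg <- scan("negative-words.txt", what = "character", comment.char = ";")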
...BUT I WANT IT IN SWEDISH
Do like the professionals: Google it.
In this case, Google Translate it.
Questions?
Happy pi day!