Goal
Retrieve the latest reviews of the Spartanburg Marriott from tripadvisor.com
Good
No click interactions required
Bad
CSS selectors required
CSS Selectors
In Chrome
Right-click on the first review's description paragraph
Select: "Inspect Element"
Notice
All reviews are inside a <div id="REVIEWS">...</div>
Each review is inside a <div class="innerBubble">...</div>
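The nesting noted above can be checked offline against a small hand-written fragment before touching the live page. This is a minimal sketch: the HTML snippet and the ids rn1, rn2, rn3 are made up for illustration and only mimic the structure seen in the inspector; they are not copied from tripadvisor.com.

```r
library(rvest)

# Hypothetical fragment that mimics the structure described above;
# the third bubble sits outside #REVIEWS on purpose
snippet <- '
<div id="REVIEWS">
  <div class="innerBubble"><div class="quote"><a id="rn1"><span>First</span></a></div></div>
  <div class="innerBubble"><div class="quote"><a id="rn2"><span>Second</span></a></div></div>
</div>
<div class="innerBubble"><div class="quote"><a id="rn3"><span>Outside</span></a></div></div>'

page <- read_html(snippet)

# Descendant selector: only .innerBubble divs inside #REVIEWS match
bubbles <- html_nodes(page, "div#REVIEWS div.innerBubble")

# html_node() returns one result per bubble, so vectors stay aligned
ids <- bubbles %>% html_node(".quote a") %>% html_attr("id")
print(ids)
```

The bubble outside #REVIEWS should not appear in the result, which is the reason for anchoring the selector on the id first.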
rvest
library(rvest)
library(magrittr)

# view the webpage below in your browser
url <- "http://www.tripadvisor.com/Hotel_Review-g54448-d288616-Reviews-Spartanburg_Marriott-Spartanburg_South_Carolina.html"

# Read html from url location,
# then extract all divs with class ".innerBubble"
# from within the item with id = "REVIEWS"
reviews <- url %>%
  read_html() %>%
  html_nodes("div#REVIEWS div.innerBubble")

# retrieve a unique identifier
id <- reviews %>%
  html_node(".quote a") %>%
  html_attr("id")

# retrieve single line quote
quote <- reviews %>%
  html_node(".quote span") %>%
  html_text()

# retrieve user rating
rating <- reviews %>%
  html_node(".rating .rating_s_fill") %>%
  html_attr("alt") %>%
  gsub(" of 5 stars", "", .) %>%
  as.integer()

# retrieve user review
review <- reviews %>%
  html_node(".entry .partial_entry") %>%
  html_text()

# Combine results
dt <- data.frame(
  id, quote, rating, review,
  stringsAsFactors = FALSE
)

head(dt)
           id          quote rating review
1 rn314739986     Nice Hotel      4 \nI was pleasantly surprised ...
2 rn314556052     Nice hotel      4 \nVery nice. We stayed one night on ...
3 rn313992798  Wofford visit      3 \nWe've been to this Marriott several...
4 rn307653018 Older Marriott      4 \nI was not impressed and I was not ...
5 rn307354894      Irritated      3 \nI had made my choice to stay at ...
6 rn303856071       Marginal      3 \nI don't typically stay at Marriott's...
(Review text cut off to save space)
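Each review in the printed output starts with a literal "\n", and the rating was obtained by stripping " of 5 stars" from the alt text. Both steps can be tried without a network connection. The following is a minimal sketch using made-up sample values shaped like the scraped columns; the variable names alt and review_clean are assumptions for illustration, not part of the original script.

```r
# Made-up sample values shaped like the scraped columns above
alt    <- c("4 of 5 stars", "3 of 5 stars")
review <- c("\nI was pleasantly surprised", "\nVery nice. ")

# Same idea as the gsub() step in the pipeline:
# drop the fixed suffix, then convert to integer
rating <- as.integer(gsub(" of 5 stars", "", alt, fixed = TRUE))

# trimws() removes leading and trailing whitespace, including "\n"
review_clean <- trimws(review)

print(rating)
print(review_clean)
```

Applied to the real data, the cleanup would be dt$review <- trimws(dt$review), assuming the review column only needs surrounding whitespace removed.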