Web Scraping Example
TripAdvisor Demo
- Goal
  - Retrieve the latest reviews of the Spartanburg Marriott from tripadvisor.com
- Good
  - No click interactions required
- Bad
  - CSS selectors required
CSS Selectors
- In Chrome
  - Right click on the first review description paragraph
  - Select: "Inspect Element"
- Notice
  - All reviews are inside a <div id="REVIEWS">...</div>
  - Each review is inside a <div class="innerBubble">...</div>
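As a rough illustration of how those two observations turn into a selector, the sketch below parses a small hand-written HTML string. The markup is a hypothetical miniature, not real TripAdvisor output; only the id "REVIEWS" and the class "innerBubble" come from the notes above.

```r
library(rvest)

# Hypothetical miniature of the structure described above
# (not real TripAdvisor markup)
html <- '
<div id="REVIEWS">
  <div class="innerBubble"><p>First review</p></div>
  <div class="innerBubble"><p>Second review</p></div>
</div>
<div class="innerBubble"><p>Not a review</p></div>'

page <- read_html(html)

# "div#REVIEWS div.innerBubble": every div.innerBubble nested inside div#REVIEWS
inside <- html_nodes(page, "div#REVIEWS div.innerBubble")

# "div.innerBubble" alone also matches the div outside the container
anywhere <- html_nodes(page, "div.innerBubble")

length(inside)
length(anywhere)
```

Here the scoped selector finds the two bubbles inside the container, while the unscoped one also picks up the third, which is why the id is included in the selector.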
rvest
library(rvest)
library(magrittr)

# view the webpage below in your browser
url <- "http://www.tripadvisor.com/Hotel_Review-g54448-d288616-Reviews-Spartanburg_Marriott-Spartanburg_South_Carolina.html"

## Read html from url location,
# then extract all divs with class ".innerBubble"
# from within the item with id = "REVIEWS"
reviews <- url %>%
  read_html() %>%
  html_nodes("div#REVIEWS div.innerBubble")
# retrieve a unique identifier
id <- reviews %>%
  html_node(".quote a") %>%
  html_attr("id")

# retrieve single line quote
quote <- reviews %>%
  html_node(".quote span") %>%
  html_text()

# retrieve user rating
rating <- reviews %>%
  html_node(".rating .rating_s_fill") %>%
  html_attr("alt") %>%
  gsub(" of 5 stars", "", .) %>%
  as.integer()
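The rating step can be tried on its own without touching the website. This is a minimal base R sketch; the alt-text values are made up for illustration, and the NA stands in for a review whose rating image was not found.

```r
# Hypothetical alt-text values; NA stands in for a review with no rating image
alt <- c("4 of 5 stars", "3 of 5 stars", NA)

# Drop the fixed suffix, then convert what is left to an integer
rating <- as.integer(gsub(" of 5 stars", "", alt, fixed = TRUE))

rating
```

A missing alt text stays NA through both steps, so the rating vector keeps one entry per review.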
# retrieve user review
review <- reviews %>%
  html_node(".entry .partial_entry") %>%
  html_text()

# Combine results
dt <- data.frame(
  id, quote, rating, review,
  stringsAsFactors = FALSE
)
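Combining the vectors with data.frame() only works if each one has exactly one entry per review. The sketch below suggests why html_node() (singular) is the safer choice for that: it returns one result per input node, with NA where nothing matches. The markup is hypothetical and only mimics the selectors used above.

```r
library(rvest)

# Hypothetical markup: the second review has no quote link
html <- '
<div id="REVIEWS">
  <div class="innerBubble">
    <div class="quote"><a id="rn1"><span>Nice hotel</span></a></div>
  </div>
  <div class="innerBubble">
    <p>No quote here</p>
  </div>
</div>'

reviews <- read_html(html) %>%
  html_nodes("div#REVIEWS div.innerBubble")

# html_node(): one result per review, NA where nothing matches
id_per_review <- reviews %>%
  html_node(".quote a") %>%
  html_attr("id")

# html_nodes(): only the matches, so the length can differ from length(reviews)
id_matches <- reviews %>%
  html_nodes(".quote a") %>%
  html_attr("id")

length(id_per_review)
length(id_matches)
```

With html_nodes() the shorter vector would no longer line up with the other columns.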
head(dt)
  id           quote           rating  review
1 rn314739986  Nice Hotel      4       \nI was pleasantly surprised ...
2 rn314556052  Nice hotel      4       \nVery nice. We stayed one night on ...
3 rn313992798  Wofford visit   3       \nWe've been to this Marriott several...
4 rn307653018  Older Marriott  4       \nI was not impressed and I was not ...
5 rn307354894  Irritated       3       \nI had made my choice to stay at ...
6 rn303856071  Marginal        3       \nI don't typically stay at Marriott's...
Cut off for spacing
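Each review string in the output starts with a newline. One possible cleanup, sketched here with made-up strings shaped like that output, is base R's trimws(); passing trim = TRUE to html_text() should have a similar effect at extraction time.

```r
# Hypothetical review strings shaped like the output above
review <- c("\nI was pleasantly surprised", "\nVery nice.  ")

# Strip leading and trailing whitespace, including the newline
clean <- trimws(review)

clean
```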
More Possibilities
- RSelenium / PhantomJS
  - Loads full webpage
    - including Flash and JavaScript calls
  - Interact with webpage
- rvest
  - Post-process information provided by RSelenium
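The hand-off from RSelenium to rvest could look roughly like the sketch below. It does not start a browser: the page_source string is a hand-written stand-in for rendered HTML, and extract_reviews() is a hypothetical helper, not part of either package. With RSelenium, the string would typically come from something like remDr$getPageSource()[[1]], assuming remDr is a remote driver that has already been opened and navigated to the page.

```r
library(rvest)

# Stand-in for the rendered page source (hypothetical markup).
# With RSelenium, this would typically be remDr$getPageSource()[[1]].
page_source <- '
<div id="REVIEWS">
  <div class="innerBubble">
    <div class="entry"><p class="partial_entry">Great stay</p></div>
  </div>
</div>'

# Hypothetical helper: parse rendered HTML and pull out the review text
extract_reviews <- function(html_source) {
  read_html(html_source) %>%
    html_nodes("div#REVIEWS div.innerBubble") %>%
    html_node(".entry .partial_entry") %>%
    html_text()
}

extract_reviews(page_source)
```

Keeping the parsing in a function that takes a plain string means the same rvest code can be used whether the HTML came from read_html(url) or from a browser session.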
By Barret Schloerke
Quick demo of rvest package