Web Scraping Example

TripAdvisor Demo

  • Goal

    • Retrieve the latest reviews of the Spartanburg Marriott from tripadvisor.com

  • Good

    • No click interactions required

  • Bad

    • CSS selectors required

 


CSS Selectors

  • In Chrome

    • Right click on the first review description paragraph

    • Select: "Inspect Element"

  • Notice

    • All reviews are inside a

      <div id="REVIEWS">...</div>

    • Each review is inside a

      <div class="innerBubble">...</div>
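
To see how these two observations combine into a single CSS selector, here is a minimal sketch on a made-up HTML fragment. The markup and review text are illustrative assumptions, not TripAdvisor's actual page.

```r
library(rvest)

# Hypothetical fragment that mimics the structure noted above
page <- read_html('
  <div id="REVIEWS">
    <div class="innerBubble"><p>First review</p></div>
    <div class="innerBubble"><p>Second review</p></div>
  </div>
  <div class="innerBubble"><p>Not a review</p></div>
')

# "div#REVIEWS" matches the div whose id is REVIEWS,
# "div.innerBubble" matches divs with class innerBubble,
# and the space between them means "descendant of"
bubbles <- html_nodes(page, "div#REVIEWS div.innerBubble")

# Text of the paragraphs inside the matched divs only
bubble_text <- html_text(html_nodes(bubbles, "p"))
```

The third div carries the same class but sits outside the REVIEWS container, so the selector skips it.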


rvest

library(rvest)
library(magrittr)

# view the webpage below in your browser
url <- "http://www.tripadvisor.com/Hotel_Review-g54448-d288616-Reviews-Spartanburg_Marriott-Spartanburg_South_Carolina.html"

# Read the HTML from the URL,
# then extract all divs with class "innerBubble"
# from within the element with id "REVIEWS"
reviews <- url %>%
  read_html() %>%
  html_nodes("div#REVIEWS div.innerBubble")

 


# retrieve a unique identifier
id <- reviews %>%
  html_node(".quote a") %>%
  html_attr("id")

# retrieve single line quote
quote <- reviews %>%
  html_node(".quote span") %>%
  html_text()

# retrieve user rating
rating <- reviews %>%
  html_node(".rating .rating_s_fill") %>%
  html_attr("alt") %>%
  gsub(" of 5 stars", "", .) %>%
  as.integer()
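
The rating conversion can be checked in isolation. This is a small sketch on sample alt text; the input strings are assumptions modelled on the " of 5 stars" pattern used above.

```r
library(magrittr)

# Hypothetical alt attributes as they might come back from html_attr("alt")
alt_text <- c("4 of 5 stars", "3 of 5 stars")

# The "." tells the pipe where to place the piped value,
# since gsub() takes the text as its third argument
rating_demo <- alt_text %>%
  gsub(" of 5 stars", "", .) %>%
  as.integer()
```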


# retrieve user review
review <- reviews %>%
  html_node(".entry .partial_entry") %>%
  html_text()

 

# Combine results
dt <- data.frame(
  id, quote, rating, review,
  stringsAsFactors = FALSE
)
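
The data.frame step relies on every vector having exactly one entry per review. Here is a small sketch of why html_node() (singular) fits that need better than html_nodes(), using made-up markup in which one review lacks a title.

```r
library(rvest)

# Hypothetical markup: the second review has no .quote element
page <- read_html('
  <div class="innerBubble"><div class="quote"><span>Nice Hotel</span></div></div>
  <div class="innerBubble"><p>No title here</p></div>
')
reviews_demo <- html_nodes(page, "div.innerBubble")

# html_node() returns one result per input node, NA when nothing matches
quote_one <- html_text(html_node(reviews_demo, ".quote span"))

# html_nodes() returns only the matches, so the length can differ
quote_all <- html_text(html_nodes(reviews_demo, ".quote span"))
```

Here quote_one has two entries (the second is NA) while quote_all has one, so only the first would line up row by row with the other columns.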


head(dt)

           id          quote rating review
1 rn314739986     Nice Hotel      4 \nI was pleasantly surprised ... 
2 rn314556052     Nice hotel      4 \nVery nice. We stayed one night on ...
3 rn313992798  Wofford visit      3 \nWe've been to this Marriott several... 
4 rn307653018 Older Marriott      4 \nI was not impressed and I was not ...
5 rn307354894      Irritated      3 \nI had made my choice to stay at ...
6 rn303856071       Marginal      3 \nI don't typically stay at Marriott's... 

(Review text cut off to fit the slide)
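
The leading "\n" visible in the review column is whitespace carried over from the page markup. One possible clean-up, sketched on made-up strings with base R's trimws():

```r
# Hypothetical review strings with the same leading newline as in the output above
review_raw <- c("\nI was pleasantly surprised", "\nVery nice. We stayed one night\n")

# trimws() removes leading and trailing whitespace, including newlines
review_clean <- trimws(review_raw)
```

The same call could be applied to the review column of dt before any further text processing.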

More Possibilities

  • RSelenium / PhantomJS
    • Loads the full webpage
      • including Flash and JavaScript calls
    • Can interact with the webpage
  • rvest
    • Post-processes the information provided by RSelenium
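
The division of labour in the list above might look roughly like the following. This is a sketch only: it assumes the RSelenium package plus a local browser and driver are installed, and fetch_rendered_html() is a hypothetical helper name, not part of either package.

```r
# Hypothetical helper: let a real browser render the page, then return its HTML.
# Nothing runs until the function is called.
fetch_rendered_html <- function(url, browser = "firefox") {
  driver <- RSelenium::rsDriver(browser = browser, verbose = FALSE)
  # Always shut down the browser session and the Selenium server
  on.exit({
    driver$client$close()
    driver$server$stop()
  })
  driver$client$navigate(url)
  # getPageSource() returns a list; its first element is the HTML string
  driver$client$getPageSource()[[1]]
}

# rvest then post-processes the rendered HTML as in the earlier slides:
# reviews <- fetch_rendered_html(url) %>%
#   read_html() %>%
#   html_nodes("div#REVIEWS div.innerBubble")
```

Because read_html() accepts an HTML string as well as a URL, the extraction code from the demo would not need to change.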