Scraping and Parsing Data

Robert J. McGrath

Assistant Professor

rmcgrat2@gmu.edu

What you wish your data looked like

What your data actually look like, if you're lucky


Unified Agenda

    - Bulk XML

    - Parse with Python (ElementTree)
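
A minimal sketch of the ElementTree step, assuming a simplified layout. The element names (AGENDA, ENTRY, RIN, AGENCY, TITLE) and the sample records are hypothetical stand-ins, not the actual Unified Agenda schema, so the tag names would need to be adjusted against the real bulk file.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a bulk XML file; the real Unified Agenda
# schema uses its own element names, so adjust the tags below.
SAMPLE_XML = """<AGENDA>
  <ENTRY>
    <RIN>0000-AA01</RIN>
    <AGENCY>Example Agency</AGENCY>
    <TITLE>Example Rule One</TITLE>
  </ENTRY>
  <ENTRY>
    <RIN>0000-AA02</RIN>
    <AGENCY>Example Agency</AGENCY>
    <TITLE>Example Rule Two</TITLE>
  </ENTRY>
</AGENDA>"""


def parse_entries(xml_text):
    """Return one dict per <ENTRY>; a missing child becomes ''."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.iter("ENTRY"):
        rows.append({
            "rin": (entry.findtext("RIN") or "").strip(),
            "agency": (entry.findtext("AGENCY") or "").strip(),
            "title": (entry.findtext("TITLE") or "").strip(),
        })
    return rows


rows = parse_entries(SAMPLE_XML)
for row in rows:
    print(row)
```

For a file on disk, `ET.parse(path).getroot()` would replace `ET.fromstring`, and `ET.iterparse` is worth considering when the bulk file is too large to hold in memory.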

Congressional hearings

    - No bulk XML

    - Scraped GPO's sitemap for HTML URLs (shell)

    - Parse with Python (BeautifulSoup)
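
The two hearing steps can be sketched roughly as follows. This is an illustration under several assumptions, not the workflow used for the project: the sitemap step is done in Python rather than shell, the standard library's `html.parser` stands in for BeautifulSoup so the example has no third-party dependencies, and the sample sitemap, URLs, and page markup are all hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sitemap in the standard sitemaps.org format.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.gov/hearings/hrg-001.htm</loc></url>
  <url><loc>https://example.gov/hearings/hrg-001.pdf</loc></url>
  <url><loc>https://example.gov/hearings/hrg-002.htm</loc></url>
</urlset>"""

# Hypothetical hearing page: a title plus preformatted transcript text.
SAMPLE_PAGE = (
    "<html><head><title>Example Hearing</title></head>"
    "<body><pre>Mr. SMITH. Thank you.\nMs. JONES. I yield back.</pre></body></html>"
)

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def html_urls(sitemap_text):
    """Collect every <loc> in the sitemap that points at an HTML page."""
    root = ET.fromstring(sitemap_text)
    locs = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]
    return [u for u in locs if u.endswith((".htm", ".html"))]


class HearingParser(HTMLParser):
    """Accumulate the text inside <title> and <pre> elements."""

    def __init__(self):
        super().__init__()
        self._current = None
        self.title = ""
        self.body = ""

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "pre"):
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "pre":
            self.body += data


urls = html_urls(SAMPLE_SITEMAP)
parser = HearingParser()
parser.feed(SAMPLE_PAGE)
parser.close()
print(urls)
print(parser.title)
```

With BeautifulSoup installed, the parsing class would typically collapse to a few lookups on the parsed tree; the structure of the task (list the URLs, fetch each page, pull out the fields) stays the same.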

State legislative directories

    - Scanned books to PDFs

    - Python program (regular expressions) to parse
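
A small sketch of the regular-expression step, assuming the text has already been extracted from the scanned PDFs. The line layout ("Last, First (Party) District N"), the sample entries, and the field names are hypothetical; a real directory would need its own pattern, and lines that fail to match are kept aside for manual review rather than silently dropped.

```python
import re

# Hypothetical OCR output: two directory entries and one stray page marker.
SAMPLE_TEXT = """Smith, John A. (R) District 12
Garcia, Maria (D) Dist. 7
-- page 14 --
"""

# Assumed layout: "Last, First (Party) District N" or "... Dist. N".
LINE_RE = re.compile(
    r"^(?P<last>[A-Z][A-Za-z'\-]+),\s+"
    r"(?P<first>[A-Z][A-Za-z. ]*?)\s+"
    r"\((?P<party>[DRI])\)\s+"
    r"Dist(?:rict|\.)?\s+(?P<district>\d+)$"
)


def parse_directory(text):
    """Split lines into parsed records and lines the pattern rejected."""
    records, unmatched = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            unmatched.append(line)
            continue
        record = match.groupdict()
        record["district"] = int(record["district"])
        records.append(record)
    return records, unmatched


records, unmatched = parse_directory(SAMPLE_TEXT)
for record in records:
    print(record)
print("needs review:", unmatched)
```

Tracking the unmatched lines matters with scanned sources, because OCR errors tend to show up as entries the pattern rejects.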

Tools

    - Scraping: browser add-ons, out-of-the-box applications, programming (Python, Perl, Ruby, PHP, R, shell)

    - Parsing: same programming tools as above

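# Start from a clean workspace
rm(list = ls())

# rvest fetches and parses the page, stringr handles the regular
# expressions, and maps draws the base map
library(rvest)
library(stringr)
library(maps)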


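# Download and parse the Wikipedia page
buildings_parsed <- read_html("https://en.wikipedia.org/wiki/List_of_tallest_buildings_in_the_United_States",
                             encoding = "UTF-8")

# Convert every HTML table on the page to a data frame
# (newer rvest releases always fill ragged rows and may warn that
# `fill` is deprecated)
tables <- html_table(buildings_parsed, fill = TRUE)

# The table position and column numbers below depend on the page's
# layout at the time of scraping; check names() before subsetting
buildings_table <- tables[[1]]
names(buildings_table)

buildings_table <- buildings_table[, c(1, 2, 4, 5)]
names(buildings_table)

colnames(buildings_table) <- c("rank", "name", "locn", "height")
buildings_table$name[1:5]
buildings_table$locn[1:5]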

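# The location cell ends with decimal coordinates of the form
# "/ <latitude>; <longitude>". reg_y matches from the slash to the
# semicolon (latitude); reg_x matches from the semicolon on (longitude).
reg_y <- "[/][ -]*[[:digit:]]*[.]*[[:digit:]]*[;]"
reg_x <- "[;][ -]*[[:digit:]]*[.]*[[:digit:]]*"

# Latitude: drop the leading "/ " and the trailing ";"
y_coords <- str_extract(buildings_table$locn, reg_y)
y_coords <- as.numeric(str_sub(y_coords, 3, -2))
buildings_table$y_coords <- y_coords

# Longitude: drop the leading "; "
x_coords <- str_extract(buildings_table$locn, reg_x)
x_coords <- as.numeric(str_sub(x_coords, 3, -1))
buildings_table$x_coords <- x_coords

# The raw location string is no longer needed
buildings_table$locn <- NULL

# Quick checks on the result
round(buildings_table$y_coords, 2)[1:3]

dim(buildings_table)
head(buildings_table)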



# Draw the state outlines, then plot each building as a triangle
map("state", col = "darkgrey", lwd = 0.5, mar = c(0.1, 0.1, 0.1, 0.1))
points(buildings_table$x_coords, buildings_table$y_coords, pch = 2)