Web Scraping with Python

Rohan Bidarkota

rbidarko@gmu.edu

Prerequisites

Some knowledge of Python
HTML/CSS

Why scrape a webpage?

Save data trapped in webpages
To obtain data in the absence of an API
Stay anonymous

Workshop Outline

BeautifulSoup & Requests
Scrape weather.gov
Save data to a CSV file

BeautifulSoup

Go to weather.gov
Find the "Local forecast by City, State, ZIP"
Enter Washington DC

BeautifulSoup

from bs4 import BeautifulSoup
import requests

# specify the URL you're visiting

link = "http://forecast.weather.gov/MapClick.php?lat=38.8951&lon=-77.0364#.WO-9S3UrLCI"

# request a web page!

page = requests.get(link)


# 200 - success, 404 - page not found, 500 - server error
print(page.status_code)

# show the raw HTML of the webpage (returned as bytes)
print(page.content)


soup = BeautifulSoup(page.content, 'html.parser')


# find the ID for the seven day forecast section of the page
# use the 'find' method to get that section

seven_day = soup.find(id='seven-day-forecast')

# class in HTML is an attribute that groups elements, usually so the CSS stylesheet can style them
# find - gets the first element matching the search term (or None if there is no match)
# find_all - gets a list of all elements matching the search term


forecast_items = seven_day.find_all(class_="tombstone-container")

print(forecast_items)

tonight = forecast_items[0]
print(tonight.prettify())


# Find the image in this forecast item using the 'img' tag

# The full description is stored in the image's title attribute

img = tonight.find("img")
desc = img['title']

print(desc)



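# Let's get the period names for every item in the 7-day forecast
# select - takes a CSS selector and returns a list of all matching elements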


period_tags = seven_day.select(".tombstone-container .period-name")

periods = [pt.get_text() for pt in period_tags]

print(periods)
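
The difference between find, find_all and select is easier to see on a small page that does not need a network connection. The sketch below runs the same three calls on a hypothetical HTML snippet; the markup and its text are made up for illustration and only imitate the ids and class names used above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML shaped like the forecast section; not copied from weather.gov.
SAMPLE_HTML = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <img src="night.png" title="Tonight: Mostly clear, with a low around 50.">
  </div>
  <div class="tombstone-container">
    <p class="period-name">Friday</p>
    <img src="day.png" title="Friday: Sunny, with a high near 72.">
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
section = soup.find(id="seven-day-forecast")

# find returns only the first match (or None if nothing matches).
first_item = section.find(class_="tombstone-container")
first_period = first_item.find(class_="period-name").get_text()

# find_all returns a list of every match.
all_items = section.find_all(class_="tombstone-container")

# select takes a CSS selector and also returns a list of matches.
period_names = [tag.get_text() for tag in section.select(".tombstone-container .period-name")]

# Attributes of a tag are read like dictionary keys.
first_title = first_item.find("img")["title"]

print(first_period)
print(len(all_items))
print(period_names)
print(first_title)
```

Because find can return None, calling a method on its result fails with an AttributeError when the id or class is not on the page, so it is worth checking the result before using it on a live site.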

Scraping multiple items

 
# Repeat the same step for each piece of text that we want to extract
# short_descs holds the short descriptions
# temps holds the temperatures
# descs holds the full descriptions

short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

print(short_descs)
print(temps)
print(descs)
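
Building each list separately assumes that every forecast item contains every element. If one item is missing, say, its temperature, the lists end up with different lengths and no longer line up by position. One possible alternative, sketched below on made-up HTML, is to loop over the items and pull each field from inside its own container; the helper name text_or_empty and the sample markup are assumptions for illustration, not part of the workshop code.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the second item deliberately has no .temp element.
SAMPLE_HTML = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p class="short-desc">Mostly Clear</p>
    <p class="temp">Low: 50 F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Friday</p>
    <p class="short-desc">Sunny</p>
  </div>
</div>
"""


def text_or_empty(parent, selector):
    """Return the text of the first match inside parent, or '' if there is none."""
    tag = parent.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else ""


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

rows = []
for item in soup.select("#seven-day-forecast .tombstone-container"):
    # One tuple per forecast item, so the fields of a row always belong together.
    rows.append((
        text_or_empty(item, ".period-name"),
        text_or_empty(item, ".short-desc"),
        text_or_empty(item, ".temp"),
    ))

print(rows)
```

Each tuple in rows corresponds to one forecast item, and a missing field shows up as an empty string instead of shifting the values that follow it.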

Making Lists out of Extracted Text

Writing data to a file



# zip - pairs up corresponding elements of multiple lists

# For example,
#    a = [1, 2, 3]
#    b = [4, 5, 6]
#    c = list(zip(a, b))
#    c looks like: [(1, 4), (2, 5), (3, 6)]

import csv

data = list(zip(periods, short_descs, temps, descs))


# Open a file in write mode - 'w'

with open('weather.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, delimiter=',')
    # each tuple in data becomes one row of the CSV file
    writer.writerows(data)
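
To check the zip and csv steps without scraping anything, the sketch below writes a few made-up rows to an in-memory buffer and reads them back. The sample values and the header names are assumptions for illustration; io.StringIO stands in for the real file so that nothing is written to disk.

```python
import csv
import io

# Made-up values standing in for the scraped lists.
periods = ["Tonight", "Friday"]
short_descs = ["Mostly Clear", "Sunny"]
temps = ["Low: 50 F", "High: 72 F"]

# zip pairs the lists up by position; list() turns the result into a list of tuples.
rows = list(zip(periods, short_descs, temps))

# In-memory stand-in for open('weather.csv', 'w', newline='', encoding='utf-8').
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["period", "short_desc", "temp"])  # header row; names are assumptions
writer.writerows(rows)

# Read the CSV text back to confirm what was written.
buffer.seek(0)
read_back = list(csv.reader(buffer))

print(rows)
print(read_back)
```

Note that zip stops at the shortest input, so lists of unequal length lose their trailing items without any error.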
