Web Scraping with Python
Bhavika Tekwani
Prerequisites
- Some knowledge of Python
- HTML/CSS
Why scrape a webpage?
- Save data trapped in webpages
- Obtain data in the absence of an API
- Stay anonymous
Workshop Outline
- BeautifulSoup & Requests
- Scrape weather.gov
- Save data to a CSV file
BeautifulSoup
- Go to weather.gov
- Find the "Local forecast by City, State, ZIP" search box
- Enter Washington, DC
BeautifulSoup
from bs4 import BeautifulSoup
import requests
# specify the URL you're visiting
link = "http://forecast.weather.gov/MapClick.php?lat=38.8951&lon=-77.0364#.WO-9S3UrLCI"
# request a web page!
page = requests.get(link)
# 200 - means success, 404 - page not found, 500 - server error
print(page.status_code)
# show the raw HTML of the webpage
print(page.content)
soup = BeautifulSoup(page.content, 'html.parser')
# find the ID for the seven day forecast section of the page
# use the 'find' method to get that section
seven_day = soup.find(id='seven-day-forecast')
# class in HTML is an attribute that the page's CSS stylesheet uses to style elements
# find - gets the first element matching the search term
# find_all - gets all elements matching the search term
forecast_items = seven_day.find_all(class_="tombstone-container")
print(forecast_items)
tonight = forecast_items[0]
print(tonight.prettify())
# Find the image in the section with the 'img' tag
# The title is connected to the image
img = tonight.find("img")
desc = img['title']
print(desc)
# Let's get the period names (e.g. Tonight, Saturday) for every item in the 7-day forecast
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
print(periods)
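The same find, find_all and select calls can be tried without a network connection. The sketch below runs them against a small HTML string. The markup is a hypothetical, simplified stand-in for the forecast section, so the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup loosely modeled on the forecast section.
# The real weather.gov page may differ.
html = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <img src="a.png" title="Tonight: Mostly clear, with a low around 49.">
    <p class="short-desc">Mostly Clear</p>
    <p class="temp">Low: 49 F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Saturday</p>
    <img src="b.png" title="Saturday: Sunny, with a high near 70.">
    <p class="short-desc">Sunny</p>
    <p class="temp">High: 70 F</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find: the first element with this id
section = soup.find(id="seven-day-forecast")

# find_all: every element with this class
items = section.find_all(class_="tombstone-container")

# Attribute access on a tag works like a dictionary lookup
first_title = items[0].find("img")["title"]

# select: CSS selector syntax, here "descendant with class period-name"
period_names = [tag.get_text() for tag in section.select(".tombstone-container .period-name")]

print(len(items))
print(first_title)
print(period_names)
```

Working on a fixed string like this makes it easier to check a selector before pointing it at a page that changes every day.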
Scraping multiple items
# Repeat the same step for text that we want to extract
# short_descs holds the short descriptions
# temps holds the temperatures
# descs holds the full descriptions (the title of each image)
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(short_descs)
print(temps)
print(descs)
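The separate list comprehensions above assume that every forecast item contains every element. If one item lacks, say, a temperature, the lists end up with different lengths and no longer line up. One possible alternative, sketched below, walks the items one at a time and substitutes an empty string for anything missing. The helper names text_or_empty and extract_rows, and the sample markup, are hypothetical and only for illustration.

```python
from bs4 import BeautifulSoup


def text_or_empty(parent, css):
    """Return the text of the first match for css under parent, or '' if none."""
    tag = parent.select_one(css)
    return tag.get_text(strip=True) if tag is not None else ""


def extract_rows(section):
    """Build one (period, short_desc, temp, desc) tuple per forecast item."""
    rows = []
    for item in section.select(".tombstone-container"):
        img = item.find("img")
        rows.append((
            text_or_empty(item, ".period-name"),
            text_or_empty(item, ".short-desc"),
            text_or_empty(item, ".temp"),
            img.get("title", "") if img is not None else "",
        ))
    return rows


# Hypothetical sample: the second item has no temperature and no image.
sample = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <img src="a.png" title="Tonight: Mostly clear.">
    <p class="short-desc">Mostly Clear</p>
    <p class="temp">Low: 49 F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Saturday</p>
    <p class="short-desc">Sunny</p>
  </div>
</div>
"""

section = BeautifulSoup(sample, "html.parser").find(id="seven-day-forecast")
rows = extract_rows(section)
for row in rows:
    print(row)
```

Each row stays aligned with its own forecast item, at the cost of a slightly longer loop.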
Making Lists out of Extracted Text
Writing data to a file
# zip - links corresponding elements of multiple lists together
# For example,
# a = [1, 2, 3]
# b = [4, 5, 6]
# c = list(zip(a, b))
# c looks like: [(1, 4), (2, 5), (3, 6)]
import csv
data = list(zip(periods, short_descs, temps, descs))
# Open a file in write mode - 'w'
with open('weather.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, delimiter=',')
    for row in data:
        writer.writerow(list(row))
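To see what zip and csv.writer produce without touching the disk, the following sketch writes to an in-memory buffer and reads the rows back. The sample values and the header row are assumptions added for illustration. The workshop code writes no header.

```python
import csv
import io

# Hypothetical sample values standing in for the scraped lists
periods = ["Tonight", "Saturday"]
short_descs = ["Mostly Clear", "Sunny"]
temps = ["Low: 49 F", "High: 70 F"]

# zip pairs up the i-th element of each list; it stops at the shortest list
rows = list(zip(periods, short_descs, temps))

# Write to an in-memory text buffer instead of a file
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["period", "short_desc", "temp"])  # header row (an addition)
writer.writerows(rows)

# Read the CSV text back, using the header row as dictionary keys
buffer.seek(0)
records = list(csv.DictReader(buffer))
print(records[0]["temp"])  # prints: Low: 49 F
```

A header row is optional, but it lets csv.DictReader or a spreadsheet label the columns when the file is opened later.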
Resources
Workshop at Digital Scholarship Center, Fenwick Library on 14th April, 2017.