Web scraping
with Python

Philo van Kemenade - @phivk

General Assembly London

Recap

Variables

What Types ?

Collections
Lists
Dictionaries

Recap

Loops

for

while

Conditional Statements

if, elif, else

Recap

Functions

take input as arguments

return output

Modules

extending behaviour

Script vs. Interpreter

Homework

pip install beautifulsoup4

This workshop

Getting data from the web

Web Scraping
is it weekend yet?
Books to scrape

Mini Project
Crypto converter
Your project

these slides

bit.ly/scrapingwithpython

Data on the web

Structured in databases

structured and indexed
hidden on the server-side of a web platform
may be accessible via API (Application Programming Interface)

Semi-structured on web pages

different structure for every page
available in your browser
extractable by scraping

Web Scraping

Extracting data from a web page’s source
For example: http://isitweekendyet.com/

Pseudo code

(0. Determine what data we are looking for)

1. Read page HTML
2. Parse raw HTML string into nicer format
3. Extract what we’re looking for
4. Process our extracted data
5. Store/print/action based on data

0. Determine what you're looking for

& where it lives on the page

Pro Tip: use your browser's inspector

1. Read page HTML

from urllib.request import urlopen
url = 'http://isitweekendyet.com/'
pageSource = urlopen(url).read()
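
The snippet above assumes the request always succeeds, and note that `.read()` returns bytes rather than text. A slightly more defensive sketch could look like the following; `fetch_html` is a hypothetical helper name (not part of urllib) and the 10 second timeout is an arbitrary assumption.

```python
from urllib.request import urlopen
from urllib.error import URLError


def fetch_html(url, timeout=10):
    """Return the page at `url` as text, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            # use the charset announced by the server, fall back to UTF-8
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except URLError as error:
        # HTTPError is a subclass of URLError, so 404s etc. land here too
        print("Could not read", url, "-", error)
        return None
```

Usage would be `pageSource = fetch_html('http://isitweekendyet.com/')`, followed by a check that the result is not `None` before parsing it.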

2. Parse from Page to soup

Beautiful Soup

You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects.

http://www.crummy.com/software/BeautifulSoup/

Powered by Beautiful Soup

Moveable Type at the NY Times (source)

2. Parse from Page to soup

from bs4 import BeautifulSoup
weekendSoup = BeautifulSoup(pageSource, "html.parser")

Now we have easy access to stuff like:

>>> weekendSoup.title
<title>Is it weekend yet?</title>

The Shape of HTML?

The Shape of HTML

Objects in Soup

Our soup is a tree of different types of objects:

Tags

Strings

Tag

an element in our HTML

>>> tag = weekendSoup.div
>>> tag
<div style="font-weight: bold; font-size: 120pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: black;">
YES!
</div>
>>> type(tag)
<class 'bs4.element.Tag'>

String

Text within a tag

>>> tag.string
'\nYES!\n'
>>> type(tag.string)
<class 'bs4.element.NavigableString'>
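
To try the two object types without hitting the network, here is a minimal sketch that parses a cut-down stand-in for the page. The `sample_html` string and the `sample_soup` name are assumptions made for illustration, not the live page source.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# a cut-down stand-in for the page source (an assumption, not the live page)
sample_html = """
<html><head><title>Is it weekend yet?</title></head>
<body>
<div id="answer" style="font-weight: bold;">
YES!
</div>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

tag = sample_soup.div
print(isinstance(tag, Tag))               # True
print(tag.name)                           # div
print(tag["id"])                          # answer (attributes work like a dictionary)

text = tag.string
print(isinstance(text, NavigableString))  # True
print(isinstance(text, str))              # True: it behaves like a normal string
print(repr(text))                         # '\nYES!\n'
```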

3. Extract what we’re looking for

Navigating the parsed HTML tree

Search functions

Bonus: CSS Selectors

Navigating the Soup Tree

.tagName

.string

.strings & .stripped_strings
.contents & .children
.descendants

.parent & .parents

.next_sibling(s)
.previous_sibling(s)

Going Down the Tree

.tagName

access a tag by its name

>>> weekendSoup.title
<title>Is it weekend yet?</title>

Going Down the Tree

.string

access the String inside a tag

>>> weekendSoup.title.string
'Is it weekend yet?'

more below for reference...

Going Down the Tree

tag.strings

tag.stripped_strings

>>> for ss in weekendSoup.body.div.stripped_strings:
...     print(ss)
...
YES!
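
The difference between `.strings` and `.stripped_strings` is easiest to see side by side. A small sketch on a stand-in HTML snippet (the `sample_html` content is an assumption modelled on the page above):

```python
from bs4 import BeautifulSoup

sample_html = """<body>
<div>
YES!
</div>
<div>
<a href="http://theweekendhaslanded.org">The weekend has landed!</a>
</div>
</body>"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# .strings yields every piece of text, including whitespace-only ones
all_strings = list(sample_soup.body.strings)
# .stripped_strings strips each piece and skips the ones that end up empty
clean_strings = list(sample_soup.body.stripped_strings)

print(all_strings)
print(clean_strings)    # ['YES!', 'The weekend has landed!']
```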

Going Down the Tree

.contents & .children

return a tag's direct children as list or generator

>>> bodyTag = weekendSoup.body
>>> bodyTag.contents
['\n', <div class="answer text" id="answer" style="font-weight: bold; font-size: 120pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: black;">
YES!
</div>, '\n', <div style="font-size: 5pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: gray;">
<a href="http://theweekendhaslanded.org">The weekend has landed!</a>
</div>, '\n']

>>> for child in bodyTag.children: print(child)
...

<div class="answer text" id="answer" style="font-weight: bold; font-size: 120pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: black;">
YES!
</div>


<div style="font-size: 5pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: gray;">
<a href="http://theweekendhaslanded.org">The weekend has landed!</a>
</div>
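
Because the newline strings between tags count as children too, it is often handy to keep only the `Tag` objects. A minimal sketch, using a stand-in snippet rather than the live page (`sample_html` and `body_tag` are illustrative names):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

sample_html = """<body>
<div id="answer">YES!</div>
<div><a href="http://theweekendhaslanded.org">The weekend has landed!</a></div>
</body>"""
body_tag = BeautifulSoup(sample_html, "html.parser").body

# .contents is a real list, so it supports len() and indexing
print(len(body_tag.contents))     # 5: two <div> tags plus three '\n' strings

# .children yields the same nodes one at a time; keep only the tags
child_tags = [child for child in body_tag.children if isinstance(child, Tag)]
print([child.name for child in child_tags])   # ['div', 'div']
```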

Going Down the Tree

.descendants

return all of the tag's children, grand-children (etc)

>>> for d in bodyTag.descendants: print(d)
...


<div class="answer text" id="answer" style="font-weight: bold; font-size: 120pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: black;">
YES!
</div>

YES!



<div style="font-size: 5pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: gray;">
<a href="http://theweekendhaslanded.org">The weekend has landed!</a>
</div>


<a href="http://theweekendhaslanded.org">The weekend has landed!</a>
The weekend has landed!

Going Up the Tree

.parent & .parents

access a tag's parent(s)

>>> weekendSoup.a.parent.name
'div'
>>> for p in weekendSoup.a.parents: print(p.name)
... 
div
body
html
[document]

Going Sideways

.next_sibling(s) & .previous_sibling(s)

access a tag's brothers and sisters

>>> weekendSoup.div.next_sibling
'\n'
>>> weekendSoup.div.next_sibling.next_sibling
<div style="font-size: 5pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: gray;">
<a href="http://theweekendhaslanded.org">The weekend has landed!</a>
</div>
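
Going up and sideways can be tried offline as well. The sketch below uses a stand-in snippet (the `id` values `answer` and `footer` are assumptions added to make the output easy to read):

```python
from bs4 import BeautifulSoup

sample_html = """<html><body>
<div id="answer">YES!</div>
<div id="footer"><a href="http://theweekendhaslanded.org">The weekend has landed!</a></div>
</body></html>"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# going up: collect the names of all ancestors of the link
link = sample_soup.a
parent_names = [parent.name for parent in link.parents]
print(parent_names)     # ['div', 'body', 'html', '[document]']

# going sideways: the direct neighbour is the newline between the two tags
first_div = sample_soup.div
print(repr(first_div.next_sibling))                 # '\n'
print(first_div.next_sibling.next_sibling["id"])    # footer
```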

Exercise

write a script:

Use Beautiful Soup to navigate to the answer to our question:

Is it weekend yet?

'''A simple script that tells us if it's weekend yet'''

# import modules
from urllib.request import urlopen
from bs4 import BeautifulSoup

# open webpage

# parse HTML into Beautiful Soup

# extract data from parsed soup

# print answer

Searching the Soup-Tree

Using a filter in a search function

to zoom into a part of the soup

String Filter

find by element name

>>> weekendSoup.find_all('div')
[<div style="font-weight: bold; font-size: 120pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: black;">
NO
</div>, <div style="font-size: 5pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: gray;">
don't worry, you'll get there!
</div>]

Search Functions

find_all()

find()

Using find_all()

find_all( name of tag )

find_all( attribute filter )

find_all( name of tag, attribute filter)

Tag name

>>> weekendSoup.find_all('div')
[<div style="font-weight: bold; font-size: 120pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: black;">
NO
</div>, <div style="font-size: 5pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: gray;">
don't worry, you'll get there!
</div>]

attribute filters

urlBooks = 'http://books.toscrape.com/'
pageSourceBooks = urlopen(urlBooks).read()
booksSoup = BeautifulSoup(pageSourceBooks, "html.parser")

soumissionLinks = booksSoup.find_all(
  'a',
  href='catalogue/soumission_998/index.html'
)
books = booksSoup.find_all('article', class_='product_pod')

be careful to use "class_" when filtering based on class name
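
A minimal offline sketch of the same filters. The `sample_html` below is a simplified stand-in for the book store markup; the second `href` is hypothetical.

```python
from bs4 import BeautifulSoup

sample_html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="catalogue/example-book_1/index.html">Example Book</a></h3>
</article>
<article class="product_pod">
  <p class="star-rating One"></p>
  <h3><a href="catalogue/soumission_998/index.html">Soumission</a></h3>
</article>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# keyword argument: attribute name = required value
links = sample_soup.find_all("a", href="catalogue/soumission_998/index.html")
# "class" is a reserved word in Python, hence class_
books = sample_soup.find_all("article", class_="product_pod")
# a single class name also matches tags that carry several classes
rated = sample_soup.find_all("p", class_="star-rating")
# the attrs dictionary is an equivalent spelling
same_books = sample_soup.find_all("article", attrs={"class": "product_pod"})

print(len(links), len(books), len(rated), len(same_books))   # 1 2 2 2
```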

find()

like find_all(), but limited to one result

>>> booksSoup.find('title')
<title>
    All products | Books to Scrape - Sandbox
</title>

find_all() returns a list

find() returns a single tag, or None if nothing matches
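
The difference matters most when nothing matches. A short sketch on a tiny stand-in document:

```python
from bs4 import BeautifulSoup

sample_soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

print(sample_soup.find_all("p"))        # [<p>one</p>, <p>two</p>]
print(sample_soup.find("p"))            # <p>one</p>
print(sample_soup.find_all("table"))    # []
print(sample_soup.find("table"))        # None

# check for None before using the result of find()
table = sample_soup.find("table")
if table is None:
    print("no table on this page")
else:
    print(table.get_text())
```

Calling `.string` or `.get_text()` on the result of `find()` without this check raises an `AttributeError` when the tag is missing.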

Search All the directions!

They work like find_all() & find()

find_parents()

find_parent()

find_next_siblings()

find_next_sibling()

find_previous_siblings()

find_previous_sibling()
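
A minimal sketch of the directional search functions on a stand-in snippet (the `id` values are assumptions for illustration). Unlike `.next_sibling`, these skip the newline strings between tags when a tag name is given.

```python
from bs4 import BeautifulSoup

sample_html = """<html><body>
<div id="answer">YES!</div>
<div id="footer"><a href="http://theweekendhaslanded.org">The weekend has landed!</a></div>
</body></html>"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# search upwards from the link for the nearest enclosing <div>
link = sample_soup.find("a")
print(link.find_parent("div")["id"])                  # footer

# search sideways for the next / previous <div> on the same level
answer = sample_soup.find("div", id="answer")
footer = sample_soup.find("div", id="footer")
print(answer.find_next_sibling("div")["id"])          # footer
print(footer.find_previous_sibling("div").string)     # YES!
```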

Exercise

How many books on (the first page of) the Book Store

have a 1 star rating?

pro tip: use a search function

books.toscrape.com

Bonus:

can you get the count for each of the different ratings?

⬇️ see template ⬇️

same Template

'''A simple script that scrapes book ratings'''

# import modules
from urllib.request import urlopen
from bs4 import BeautifulSoup

# open webpage


# parse HTML into Beautiful Soup


# extract data from parsed soup

Bonus: using CSS Selectors

select by id

soup.select("#content")
soup.select("div#content")

select by class

soup.select(".byline")
soup.select("li.byline")

select beneath tag

soup.select("#content a")

select directly beneath tag

soup.select("#content > a")

Bonus: using CSS Selectors

check if attribute exists

soup.select('a[href]')

find by attribute value

soup.select('a[href="http://www.theguardian.com/profile/brianlogan"]')

attribute value starts with, ends with or contains

soup.select('a[href^="http://www.theguardian.com/"]')
>>> guardianSoup.select('a[href$="info"]')
[<a class="link-text" href="http://www.theguardian.com/info">About us,</a>, <a class="link-text" href="http://www.theguardian.com/info">About us</a>]
>>> guardianSoup.select('a[href*=".com/contact"]')
[<a class="rollover contact-link" href="http://www.theguardian.com/contactus/2120188" title="Displays contact data for guardian.co.uk"><img alt="" class="trail-icon" src="http://static.guim.co.uk/static/ac46d0fc9b2bab67a9a8a8dd51cd8efdbc836fbf/common/images/icon-email-us.png"/><span>Contact us</span></a>]
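
The selectors above can be tried on a small stand-in document. The `sample_html` below is invented for illustration and only loosely modelled on the Guardian markup.

```python
from bs4 import BeautifulSoup

sample_html = """
<div id="content">
  <a href="http://www.theguardian.com/info">About us</a>
  <ul>
    <li class="byline"><a href="http://www.theguardian.com/profile/brianlogan">Brian Logan</a></li>
    <li><a href="/contact">Contact</a></li>
  </ul>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# select() always returns a list of matching tags
print(len(sample_soup.select("#content a")))      # 3: every link inside #content
print(len(sample_soup.select("#content > a")))    # 1: only the direct child

# attribute value starts with ...
guardian_links = sample_soup.select('a[href^="http://www.theguardian.com/"]')
print([a["href"] for a in guardian_links])

# select_one() returns the first match, or None if there is none
print(sample_soup.select_one("li.byline a").string)   # Brian Logan
print(sample_soup.select_one("table"))                # None
```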

4. Process extracted data

we generally want to:

clean up

calculate

process

Cleaning up + processing

>>> answer = weekendSoup.div.string
>>> answer
'\nNO\n'
>>> cleaned = answer.strip()
>>> cleaned
'NO'
>>> isWeekendYet = cleaned == 'YES'
>>> isWeekendYet
False

many useful string methods: https://docs.python.org/3/library/stdtypes.html#string-methods
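
Scraped numbers usually arrive as text with whitespace, currency symbols or thousands separators around them. A minimal clean-up sketch using only string methods; `parse_price` is a hypothetical helper and the sample inputs are made up.

```python
def parse_price(raw_text):
    """Turn scraped text such as ' £51.77 ' into the float 51.77."""
    cleaned = raw_text.strip()           # drop surrounding whitespace
    cleaned = cleaned.lstrip("£$€")      # drop a leading currency symbol
    cleaned = cleaned.replace(",", "")   # drop thousands separators
    return float(cleaned)


print(parse_price("\n £51.77 "))    # 51.77
print(parse_price("1,234.50"))      # 1234.5
```

`float()` raises a `ValueError` if the cleaned text is still not a number, which is a useful signal that the page layout has changed.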

5. Do stuff with our extracted data

For example print to screen:

# print info to screen
print('Is it weekend yet? ', isWeekendYet)

or save to .csv file

import csv

with open('weekends.csv', 'w', newline='') as csvfile:
    weekendWriter = csv.writer(csvfile)
    # writerow() expects a sequence of values, one per column
    weekendWriter.writerow([isWeekendYet])

https://docs.python.org/3.3/library/csv.html
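
A slightly fuller sketch with a header row and several records, read back afterwards to check the result. The `rows` values and the column names are hypothetical.

```python
import csv

# hypothetical results collected over several runs of the scraper
rows = [
    ("2024-01-05", False),
    ("2024-01-06", True),
]

with open("weekends.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["date", "is_weekend_yet"])   # header row
    writer.writerows(rows)                        # one row per tuple

# read the file back; DictReader uses the header row as keys
with open("weekends.csv", newline="") as csvfile:
    for record in csv.DictReader(csvfile):
        print(record["date"], record["is_weekend_yet"])
```

Note that everything read back from a .csv file is a string, so `True` comes back as `'True'`.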

Mini Project

Let's LiteCoin!

Mini Project

Build a BitCoin to GBP converter

Get the current BitCoin/GBP exchange rate via

http://preev.com/btc/gbp

use the input() function to get a user's input

(let the slides help you)

⬇️ see template ⬇️

All together Now

fill in the blanks...

'''A simple script that ... '''

# import modules
from urllib.request import urlopen
from bs4 import BeautifulSoup

# open webpage
url = 
pageSource = 

# parse HTML into Beautiful Soup
mySoup = 

# extract data from soup

# clean up data

# process data

# action based on data

Questions?

Useful Resources

google; “python” + your problem / question

python.org/doc/; official python documentation, useful to find which functions are available

stackoverflow.com; huge gamified help forum with discussions on all sorts of programming questions, answers are ranked by community

codecademy.com/tracks/python; interactive exercises that teach you coding by doing

wiki.python.org/moin/BeginnersGuide/Programmers; tools, lessons and tutorials

Useful Modules etc

reading and writing files

reading and writing .csv

reading and writing .json

mechanize: scraping behind logins & forms

Maths & Matrices with Numpy

Data analysis with Pandas

Plotting graphs with MatPlotLib

Data Science:

Jupyter Notebook

Anaconda

Thank you

@phivk

Navigating to the weekend yet answer

>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> url = "http://isitweekendyet.com/"
>>> source = urlopen(url).read()
>>> soup = BeautifulSoup(source, "html.parser")

>>> soup.body.div.string
'\nNO\n'

# an alternative:
>>> list(soup.body.stripped_strings)[0]
'NO'

many routes possible...

Book Store

'''A simple script that scrapes info about Books'''

# import modules
from urllib.request import urlopen
from bs4 import BeautifulSoup

# open webpage
urlBooks = "http://books.toscrape.com/"
pageSourceBooks = urlopen(urlBooks).read()

# parse HTML into Beautiful Soup
booksSoup = BeautifulSoup(pageSourceBooks, "html.parser")

# extract data from parsed soup
booksOneStar = booksSoup.find_all('p', class_="star-rating One")
oneStarCount = len(booksOneStar)
print(oneStarCount)

#########
# Bonus #
#########

# simple approach

books1StarCount = len(booksSoup.find_all('p', class_="star-rating One"))
books2StarCount = len(booksSoup.find_all('p', class_="star-rating Two"))
books3StarCount = len(booksSoup.find_all('p', class_="star-rating Three"))
books4StarCount = len(booksSoup.find_all('p', class_="star-rating Four"))
books5StarCount = len(booksSoup.find_all('p', class_="star-rating Five"))

print("1 star: ", books1StarCount)
print("2 star: ", books2StarCount)
print("3 star: ", books3StarCount)
print("4 star: ", books4StarCount)
print("5 star: ", books5StarCount)

# more elegant approach

def getStarCount(booksSoup, starClass):
  booksWithStarClass = booksSoup.find_all('p', class_="star-rating "+starClass)
  starCount = len(booksWithStarClass)
  return starCount

starClasses = ["One", "Two", "Three", "Four", "Five"]

starCounts = {}
for starClass in starClasses:
  starCount = getStarCount(booksSoup, starClass)
  starCounts[starClass] = starCount

print(starCounts)
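
Another possible route, sketched here on stand-in markup rather than the live page: Beautiful Soup exposes a tag's classes as a list, so the ratings can be tallied in one pass with `collections.Counter`. This assumes the rating word is always the last class on the tag.

```python
from collections import Counter
from bs4 import BeautifulSoup

# stand-in for the rating tags on the book store page (an assumption)
sample_html = """
<p class="star-rating One"></p>
<p class="star-rating Three"></p>
<p class="star-rating One"></p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# tag["class"] is a list such as ['star-rating', 'One']
star_counts = Counter(
    tag["class"][-1] for tag in sample_soup.find_all("p", class_="star-rating")
)

print(star_counts["One"])     # 2
print(star_counts["Three"])   # 1
print(star_counts["Five"])    # 0 (a Counter returns 0 for missing keys)
```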

THE Bitcoin converter

'''A simple script that converts BTC to GBP based on live rate'''

# import modules
from urllib.request import urlopen
from bs4 import BeautifulSoup

# open webpage
url = "https://exchangerate.guru/btc/"
pageSource = urlopen(url).read()

# parse HTML into Beautiful Soup
btcSoup = BeautifulSoup(pageSource, "html.parser")

# extract data from parsed soup

gbpRateTag = btcSoup.find('a', href='/btc/gbp/1/')
gbpRate = float(gbpRateTag.string)

# get input from user

noBitcoinString = input('How many Bitcoin have you got? ')
noBitcoin = float(noBitcoinString)

# calculate and print answer
noGBP = noBitcoin * gbpRate

print("WOW! you have " + str(noGBP) + "£!!!")

A LiteCoin converter

'''
LiteCoin converter that tells us how much your LiteCoins are worth in EURO
NB: expects python3
'''

# import modules
from urllib.request import urlopen
from bs4 import BeautifulSoup

def clean_up_rate(rateString):
    '''Clean up raw rateString to form rate'''
    rateStringStripped = rateString.strip()
    rateNumber = rateStringStripped[1:10]
    return float(rateNumber)

# open webpage
url = "http://litecoinexchangerate.org/c/EUR"
pageSource = urlopen(url).read()

# turn html into beautiful soup
liteCoinSoup = BeautifulSoup(pageSource, "html.parser")

# extract info from soup
rateString = liteCoinSoup.find('b').string

# clean up data
rate = clean_up_rate(rateString)

# get user input
litecoinsString = input("How many litecoins have you got?\n>>> ")
litecoins = float(litecoinsString)

# print output
EURO = rate * litecoins
print("You have", round(EURO,2), "EURO!")

remember: readability counts!

A bitcoin converter

'''
Bitcoin converter that tells us how much your bitcoins are worth in GBP
NB: expects python3
'''

# import modules
from urllib.request import urlopen
from bs4 import BeautifulSoup

def clean_up_rate(rateString):
    '''Clean up raw rateString to form rate'''
    rateNumber = rateString[1:10]
    return float(rateNumber)

def main():
    '''Our main function that gets called when we run the program'''
    # open webpage
    url = "http://bitcoinexchangerate.org/c/GBP/1"
    webpage = urlopen(url).read()

    # turn html into beautiful soup
    bitcoinSoup = BeautifulSoup(webpage, "html.parser")

    # extract info from soup
    rateString = bitcoinSoup.find('b').string.strip()

    # clean up data
    rate = clean_up_rate(rateString)

    # get user input
    bitcoinsString = input("How many bitcoins have you got?\n>>> ")
    bitcoins = float(bitcoinsString)

    # print output
    GBP = rate * bitcoins
    print("You have", round(GBP,2), "GBP!")

# this kicks off our program & lets us both run and import the program
if __name__ == '__main__':
    main()

remember: readability counts!

