Web scraping
with Python

J rogel-salazar

@quantum_tunel / @DT_science

2020

General Assembly London

Recap

Variables

What Types ?

Collections
Lists
Dictionaries

Recap

Loops

for

while

Conditional Statements

if, elif, else

Recap

Functions

take input as arguments

return output

Modules

extending behaviour

Script vs. Interpreter

Homework

pip3 install beautifulsoup4

This workshop

Getting data from the web

Web Scraping
is it weekend yet?
GA Gallery

Mini Project
Bitcoin converter
Your project

these slides

http://bit.ly/pythonscraping

Data on the web

Structured in databases

structured and indexed
hidden on the server-side of a web platform
may be accessible via API (Application Programming Interface)

Semi-structured on web pages

different structure for every page
available in your browser
extractable by scraping

Web Scraping

Extracting data from a web page’s source
For example: http://isitweekendyet.com/

Pseudo code

(0. Determine what data we are looking for)

Read page HTML
Parse raw HTML string into nicer format
Extract what we’re looking for
Process our extracted data
Store/print/action based on data

Html?

HTML is a markup language for describing

web documents (web pages).

HTML stands for Hyper Text Markup Language

A markup language is a set of markup tags

HTML documents are described by HTML tags

Each HTML tag describes different document content

a small html doc

<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html>

0. Determine what you're looking for

& where it lives on the page

Pro Tip: use your browser's inspector

1. Read page HTML

# Python 3
from urllib.request import urlopen
url = 'http://isitweekendyet.com/'
pageSource = urlopen(url).read()

# Python 2
from urllib import urlopen
url = 'http://isitweekendyet.com/'
pageSource = urlopen(url).read()

2. Parse from Page to soup

from bs4 import BeautifulSoup
weekendSoup = BeautifulSoup(pageSource, 'lxml')

Beautiful Soup

You didn't write that awful page.

You're just trying to get some data out of it.

Beautiful Soup is here to help.

Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects.

http://www.crummy.com/software/BeautifulSoup/

Powered by Beautiful Soup

Moveable Type at the NY Times

http://www.nytimes.com/2007/10/25/arts/design/25vide.html

Powered by Beautiful Soup

Parse from Page to soup

>>> from bs4 import BeautifulSoup
>>> weekendSoup = BeautifulSoup(pageSource, 'lxml')

Now we have easy access to stuff like:

>>> weekendSoup.title
 <title>Is it weekend yet?</title>
>>> weekendSoup.title.string
 u'Is it weekend yet?'

The Shape of HTMl?

The Shape of HTMl

Objects in Soup

Our soup is a tree of different types of objects:

Tags

Strings

Comments

BeautifulSoup

Tag

an element in our HTML

>>> tag = weekendSoup.div
>>> tag
<div style="font-weight: bold; font-size: 120pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: black;">
YES!
</div>
>>> type(tag)
<class 'bs4.element.Tag'>

String

Text within a tag

>>> tag.string
u'\nYES!\n'
>>> type(tag.string)  <class 'bs4.element.NavigableString'>

Beautiful Soup

the parsed document

>>> type(weekendSoup)    
                <class 'bs4.BeautifulSoup'>
>>> weekendSoup.nameu'[document]'

Comment

A special type of navigableString

>>> markup = "<b><!--This is a very special message--></b>"
>>> cSoup = BeautifulSoup(markup)
>>> comment = cSoup.b.string            
>>> type(comment)
<class 'bs4.element.Comment'>
>>> print(cSoup.b.prettify())
<b>         
 <!--This is a very special message-->       
</b>

3. Extract what we’re looking for

Navigating the parsed HTML tree

Search functions

Bonus: CSS Selectors

Navigating the Soup Tree

Navigating the Soup Tree

.string
.strings & .stripped_strings
.contents & .children
.descendants

.parent & .parents

.next_sibling(s)
.previous_sibling(s)

Going Down the Tree

.contents & .children

return a tag's direct children as list or generator

>>> bodyTag = weekendSoup.body
>>> bodyTag.contents
[u'\n', <div class="answer text" id="answer" style="font-weight: bold; font-size: 120pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: black;">
YES!
</div>, u'\n', <div style="font-size: 5pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: gray;">
<a href="http://theweekendhaslanded.org">The weekend has landed!</a>
</div>, u'\n']

<div class="answer text" id="answer" style="font-weight: bold; font-size: 120pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: black;">
YES!
</div>


<div style="font-size: 5pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: gray;">
<a href="http://theweekendhaslanded.org">The weekend has landed!</a>
</div>

Going Down the Tree

.string
.strings & .stripped_strings

access the String(s) inside a tag

>>> weekendSoup.div.string
u'\nYES!\n'

>>> for ss in weekednSoup.div.stripped_strings: 
  print(ss)
... 
YES!

Going Down the Tree

.descendants

return all of the tag's children, grand-children (etc)

>>> for d in bodyTag.descendants: print d... 


<div class="answer text" id="answer" style="font-weight: bold; font-size: 120pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: black;">
YES!
</div>

YES!



<div style="font-size: 5pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: gray;">
<a href="http://theweekendhaslanded.org">The weekend has landed!</a>
</div>


<a href="http://theweekendhaslanded.org">The weekend has landed!</a>
The weekend has landed!

Going Up the Tree

.parent & .parents

access a tag's parent(s)

>>> weekendSoup.a.parent.name
u'div'
>>> for p in weekendSoup.a.parents: print p.name
... 
div
body
html
[document]

Going Sideways

.next_sibling(s) & .previous_sibling(s)

access a tag's brothers and sisters

>>> weekendSoup.div.next_sibling
u'\n'
>>> weekendSoup.div.next_sibling.next_sibling
<div style="font-size: 5pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: gray;">
<a href="http://theweekendhaslanded.org">The weekend has landed!</a>
</div>

Exercise

write a script:

Use Beautiful Soup to navigate to the answer to our question:

Is it weekend yet?

'''A simple script that tells us if it's weekend yet'''

# import modules
from urllib.request import urlopen
from bs4 import BeautifulSoup

# open webpage


# parse HTML into Beautiful Soup


# extract data from parsed soup

Searching the Soup-Tree

Using a filter in a search function

to zoom into a part of the soup

String Filter

most simple matching

>>> weekendSoup.find_all('div')
[<div style="font-weight: bold; font-size: 120pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: black;">
NO
</div>, <div style="font-size: 5pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: gray;">
don't worry, you'll get there!
</div>]

Bonus: more filters

besides using a String as argument in a search function, you can also use:

Regular Expression

List

True

Function

more details: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-filters

Regular Expression Filter

smart matching (slightly out of today's scope)

>>> import re
>>> for tag in weekendSoup.find_all(re.compile("^b")):
...     print(tag.name)
... 
body

RE how to:

https://docs.python.org/2/howto/regex.html

List Filter

String match any item in the list

>>> weekendSoup.find_all(["a", "li"])



[<a href="http://theweekendhaslanded.org">The weekend has landed!</a>]
[<a href="http://theweekendhaslanded.org">The weekend has landed!</a>]

True Filter

Anything below except strings

>>> for tag in weekendSoup.find_all(True):
...     print(tag.name)
... 
html
head
title
meta
body
div
div
a

Function Filter

bespoke matching

>>> def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

>>> weekendSoup.find_all(has_class_but_no_id)

[]

Search Functions

find_all()

find()

using Find_all()

find_all( name of tag )

find_all( attribute filter )

find_all( name of tag, attribute filter)

searching by attribute filters

careful to use "class_" when filtering based on class name

urlGA = 'https://gallery.generalassemb.ly/'
pageSourceGA = urlopen(urlGA).read()
GASoup = BeautifulSoup(pageSourceGA, 'lxml')

wdiLinks = GASoup.find_all('a', href='WD')
projects = GASoup.find_all('li', class_='project')

Find()

like find_all(), but limited to one result

the following are equivalent:

>>> soup.find_all('title', limit=1)
[<title> Monty Python's reunion is about nostalgia and heroes, not comedy | Stage | theguardian.com </title>]
>>> soup.find('title')
<title> Monty Python's reunion is about nostalgia and heroes, not comedy | Stage | theguardian.com </title>

find_all() returns a list

find returns a tag

Search All the directions!

They work like find_all() & find()

find_parents()

find_parent()

find_next_siblings()

find_next_sibling()

find_previous_siblings()

find_previous_sibling()

Exercise

How many projects on (the first page of) the GA Gallery are from San Francisco?

pro tip: use a search function

https://gallery.generalassemb.ly/

Bonus:

which are all the unique locations that projects have come from?

Bonus: using Css Selectors

select by id

soup.select("#content")soup.select("div#content")

select by class

soup.select(".byline")soup.select("li.byline")

select beneath tag

soup.select("#content a")

select directly beneath tag

soup.select("#content > a")

Bonus: using Css Selectors

check if attribute exists

soup.select('a[href]')

find by attribute value

soup.select('a[href="http://www.theguardian.com/profile/brianlogan"]')

attribute value starts with, ends with or contains

soup.select('a[href^="http://www.theguardian.com/"]')
soup.select('a[href$="info"]')
[<a class="link-text" href="http://www.theguardian.com/info">About us,</a>, <a class="link-text" href="http://www.theguardian.com/info">About us</a>]
>>> guardianSoup.select('a[href*=".com/contact"]')
[<a class="rollover contact-link" href="http://www.theguardian.com/contactus/2120188" title="Displays contact data for guardian.co.uk"><img alt="" class="trail-icon" src="http://static.guim.co.uk/static/ac46d0fc9b2bab67a9a8a8dd51cd8efdbc836fbf/common/images/icon-email-us.png"/><span>Contact us</span></a>]

4. Process extracted data

We generally want to:

clean up

calculate

process

Cleaning up + processing

>>> answer = soup.div.string
>>> answer
'\nNO\n'
>>> cleaned = answer.strip()
>>> cleaned
'NO'
>>> isWeekendYet = cleaned == 'YES'
>>> isWeekendYet
False

many useful string methods: https://docs.python.org/3/library/stdtypes.html#string-methods

5. Do stuff with our extracted data

For example print to screen:

# print info to screen
print('Is it weekend yet? ', isWeekendYet)

or save to .csv file

import csv 

with open('weekend.csv', 'w', newline='') as csvfile:
    weekendWriter = csv.writer(csvfile)
    if isWeekendYet:
        weekendWriter.writerow(['Yes'])
    else:
        weekendWriter.writerow(['No'])

https://docs.python.org/3.3/library/csv.html

Mini Project

Let's Bitcoin!

Mini Project

Build a Bitcoin to GBP converter

Get the current Bitcoin/GBP exchange rate via

http://bitcoinexchangerate.org/c/GBP/1

use the input() function to get a user's input (raw_input in Python2)

(let the slides help you)

All together Now

fill in the blanks...

'''A simple script that ... '''

# import modules
from urllib.request import urlopen
from bs4 import BeautifulSoup

# open webpage
url = 
pageSource = 

# parse HTML into Beautiful Soup
BitSoup = 

# extract data from soup

# clean up data

# process data

# action based on data

Questions?

Useful Resources

google; “python” + your problem / question

python.org/doc/; official python documentation, useful to find which functions are available

stackoverflow.com; huge gamified help forum with discussions on all sorts of programming questions, answers are ranked by community

codecademy.com/tracks/python; interactive exercises that teach you coding by doing

wiki.python.org/moin/BeginnersGuide/Programmers; tools, lessons and tutorials

Useful Modules etc

reading and writing files

reading and writing .csv

reading and writing .json

Maths & Matrices with Numpy

Data analysis with Pandas

Plotting graphs with MatPlotLib

Guide to web applications with Python

2 vs 3

Python Usage Survey 2014 visualised

http://www.randalolson.com/2015/01/30/python-usage-survey-2014/

Python 2 & 3 Key Differences

http://sebastianraschka.com/Articles/2014_python_2_3_key_diff.html

Thank you

J Rogel-Salazar

@quantum_tunnel / @dt_science

A bitcoin converter

'''
Bitcoin converter that tells us how much our bitcoins are worth in GBP
NB: expects python3
'''

# import modules
from urllib.request import urlopen
from bs4 import BeautifulSoup

def clean_up_rate(rateString):
    ''''Clean up raw rateString to form rate'''
    rateNumber = rateString[1:10]
    return float(rateNumber)

def main():
    '''Our main function that gets called when we run the program'''
    # open webpage
    url = "http://bitcoinexchangerate.org/c/GBP/1"
    webpage = urlopen(url).read()

    # turn html into beautiful soup
    bitcoinSoup = BeautifulSoup(webpage)

    # extract info from soup
    rateString = bitcoinSoup.find('b').string.strip()

    # clean up data
    rate = clean_up_rate(rateString)

    # get user input
    bitcoinsString = input("How many bitcoins have you got?\n>>> ")
    bitcoins = float(bitcoinsString)

    # print output
    GBP = rate * bitcoins
    print("You have", round(GBP,2), "GBP!")

# this kicks off our program & lets us both run and import the program
if __name__ == '__main__':
    main()

remember: readability counts!

Navigating to the weekend yet answer

>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> url = "http://isitweekendyet.com/"
>>> source = urlopen(url).read()
>>> soup = BeautifulSoup(source, 'lxml')

>>> soup.body.div.string
'\nNO\n'

# an alternative:
>>> list(soup.body.stripped_strings)[0]
'NO'

many routes possible...

Searching the GA Gallery

GAurl = "https://gallery.generalassemb.ly/"
GAsource = urlopen(GAurl).read()
GAsoup = BeautifulSoup(GAsource, 'lxml')

def countLondon():
    londonProjects = GAsoup.find_all('a', href='/?metro=london')
    londonCount = len(londonProjects)
    return londonCount

def getUniqueLocations():
    metros = GAsoup.find_all('a', class_='metro')
    uniqueLocations = []
    for metro in metros:
        location = metro.string
        if location not in uniqueLocations:
            uniqueLocations.append(location)
    return uniqueLocations

Bonus: count projects per location

def getLocationCounts():
    metros = GAsoup.find_all('a', class_='metro')
    locationCounts = {}
    for metro in metros:
        location = metro.string
        if location not in locationCounts:
            locationCounts[location] = 1
        else:
            locationCounts[location] = locationCounts[location] + 1
    return locationCounts   

# pro tip: pretty print the result
from pprint import pprint
locationCounts = getLocationCounts()
pprint(locationCounts)

Webscraping with Python

By J Rogel

Webscraping with Python

A practical introduction to webscraping with Python

1,837

J Rogel

Data scientist, physicist, numerical analyst, machine learner, human

Web scrapingwith Python

J rogel-salazar

@quantum_tunel / @DT_science

2020

General Assembly London

Recap

Recap

Recap

Homework

This workshop

these slides

Data on the web

Web Scraping

Pseudo code

Html?

a small html doc

0. Determine what you're looking for

& where it lives on the page

Pro Tip: use your browser's inspector

1. Read page HTML

2. Parse from Page to soup

Beautiful Soup

Powered by Beautiful Soup

Powered by Beautiful Soup

Parse from Page to soup

The Shape of HTMl?

The Shape of HTMl

Objects in Soup

Tag

String

Beautiful Soup

Comment

3. Extract what we’re looking for

Navigating the parsed HTML tree

Search functions

Bonus: CSS Selectors

Navigating the Soup Tree

Navigating the Soup Tree

Going Down the Tree

Going Down the Tree

Going Down the Tree

Going Up the Tree

Going Sideways

Exercise

Searching the Soup-Tree

String Filter

Bonus: more filters

Regular Expression Filter

List Filter

True Filter

Function Filter

Search Functions

using Find_all()

searching by attribute filters

Find()

Search All the directions!

Exercise

Bonus: using Css Selectors

Bonus: using Css Selectors

4. Process extracted data

Cleaning up + processing

5. Do stuff with our extracted data

or save to .csv file

Mini Project

Mini Project

All together Now

Questions?

Useful Resources

Useful Modules etc

2 vs 3

Thank you

A bitcoin converter

Navigating to the weekend yet answer

Searching the GA Gallery

Bonus: count projects per location

Webscraping with Python

More from J Rogel

Web scraping
with Python