Web scraping
with Python



Philo van Kemenade - @phivk

General Assembly London

Recap



Variables  
What Types ?

Collections
Lists
Dictionaries

Recap



Loops  
for
while

Conditional Statements
if, elif, else

Recap


Functions
take input as arguments
return output

Modules
 extending behaviour

Script vs. Interpreter

Homework




pip install beautifulsoup4

This workshop


Getting data from the web

Web Scraping
is it weekend yet?
Books to scrape

Mini Project
Crypto converter
Your project

these slides




bit.ly/scrapingwithpython

Data on the web

Structured in databases

  • structured and indexed
  • hidden on the server-side of a web platform
  • may be accessible via API (Application Programming Interface)

Semi-structured on web pages
  • different structure for every page
  • available in your browser
  • extractable by scraping

Web Scraping

Extracting data from a web page’s source
For example: http://isitweekendyet.com/


Pseudo code


(0. Determine what data we are looking for)

     
    1. Read page HTML
    2. Parse raw HTML string into nicer format
    3. Extract what we’re looking for
    4. Process our extracted data
    5. Store/print/action based on data 
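The five steps above can be sketched end to end. This is only a minimal sketch: `SAMPLE_HTML` is a made-up stand-in for the downloaded page (so step 1 is skipped and no network is needed), and the function name `is_weekend` is hypothetical.

```python
from bs4 import BeautifulSoup

# Assumption: a hard-coded page that imitates isitweekendyet.com,
# used here instead of a real download (step 1).
SAMPLE_HTML = """
<html><head><title>Is it weekend yet?</title></head>
<body>
<div class="answer">
YES!
</div>
</body></html>
"""

def is_weekend(html):
    soup = BeautifulSoup(html, "html.parser")   # 2. parse the raw HTML
    raw_answer = soup.find("div").get_text()    # 3. extract the text we want
    answer = raw_answer.strip()                 # 4. process: remove whitespace
    return answer.startswith("YES")

result = is_weekend(SAMPLE_HTML)
print("Is it weekend yet?", result)             # 5. act on the data
```

Keeping steps 2 to 4 inside one function makes it easy to swap the sample string for a real page source later.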

    0. Determine what you're looking for


    & where it lives on the page


    Pro Tip: use your browser's inspector


    1. Read page HTML


    from urllib.request import urlopen
    url = 'http://isitweekendyet.com/'
    pageSource = urlopen(url).read()
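Note that `read()` gives back bytes rather than text. The sketch below illustrates this under one loud assumption: a `data:` URL stands in for a real web address so the example runs offline. The variable names are illustrative only.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: a data: URL replaces a real address, so no network is needed.
html = "<html><head><title>Is it weekend yet?</title></head></html>"
url = "data:text/html;charset=utf-8," + quote(html)

with urlopen(url) as response:
    page_bytes = response.read()          # raw bytes, not str

page_text = page_bytes.decode("utf-8")    # decode before plain string handling
print(type(page_bytes).__name__, type(page_text).__name__)
```

Beautiful Soup accepts either form, so decoding by hand is only needed when you want to work on the page as an ordinary string.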
    

    2. Parse from Page to soup




    Beautiful Soup



    You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects.
    http://www.crummy.com/software/BeautifulSoup/

    Powered by Beautiful Soup

    Moveable Type at the NY Times (source)

    2. Parse from Page to soup


    from bs4 import BeautifulSoup
    weekendSoup = BeautifulSoup(pageSource, "html.parser")

    Now we have easy access to stuff like:
       
    >>> weekendSoup.title
     <title>Is it weekend yet?</title>                     




     The Shape of HTML?




     The Shape of HTML

    Objects in Soup

    Our soup is a tree of different types of objects:

    Tags

    Strings

    Tag

    an element in our HTML
    >>> tag = weekendSoup.div
    >>> tag
    <div style="font-weight: bold; font-size: 120pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: black;">
    YES!
    </div>
    >>> type(tag)
    <class 'bs4.element.Tag'> 

    String

    Text within a tag

    >>> tag.string
    '\nYES!\n'
    >>> type(tag.string)
    <class 'bs4.element.NavigableString'>
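A short sketch of what the two object types offer in practice. The inline snippet is an assumption standing in for the live page.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Assumption: a small inline snippet instead of the live page.
html = '<div id="answer" class="answer text">\nYES!\n</div>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.div
print(tag.name)          # the element name
print(tag["id"])         # attributes are read like dictionary keys
print(tag["class"])      # class can hold several values, so it is a list
print(tag.get("href"))   # .get() gives None for a missing attribute

text = tag.string
print(isinstance(tag, Tag), isinstance(text, NavigableString))
print(isinstance(text, str))   # a NavigableString is also a normal str
```

Because a NavigableString is a `str`, all the usual string methods such as `strip()` work on it directly.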









    3. Extract what we’re looking for



    Navigating the parsed HTML tree


    Search functions


    Bonus: CSS Selectors


    Navigating the Soup Tree

    .tagName
    .string
    .strings & .stripped_strings
    .contents & .children
    .descendants
     
    .parent & .parents 

    .next_sibling(s)
    .previous_sibling(s)
     

    Going Down the Tree

    .tagName

    access a tag by its name

    >>> weekendSoup.title
    <title>Is it weekend yet?</title>
    

    Going Down the Tree

    .string

    access the String inside a tag

    >>> weekendSoup.title.string
    'Is it weekend yet?'




    more below for reference...

    Going Down the Tree


    tag.strings
    tag.stripped_strings

    >>> for ss in weekendSoup.body.div.stripped_strings:
    ...     print(ss)
    ...
    YES!
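The difference between `.string`, `.strings` and `.stripped_strings` shows up once a tag has more than one child. A minimal sketch on a made-up snippet (an assumption, not taken from the page):

```python
from bs4 import BeautifulSoup

# Assumption: a made-up snippet with text and a nested tag.
html = '<div>Posted by <a href="/about">the team</a> today</div>'
soup = BeautifulSoup(html, "html.parser")

print(soup.a.string)                     # a single child string is returned
print(soup.div.string)                   # several children: None
print(list(soup.div.strings))            # every string, whitespace included
print(list(soup.div.stripped_strings))   # the same strings, stripped
print(soup.div.get_text())               # everything joined into one str
```

When `.string` unexpectedly gives `None`, switching to `.stripped_strings` or `get_text()` is usually the fix.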
    

    Going Down the Tree

    .contents & .children
    return a tag's direct children as list or generator
    >>> bodyTag = weekendSoup.body
    >>> bodyTag.contents
    ['\n', <div class="answer text" id="answer" style="font-weight: bold; font-size: 120pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: black;">
    YES!
    </div>, '\n', <div style="font-size: 5pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: gray;">
    <a href="http://theweekendhaslanded.org">The weekend has landed!</a>
    </div>, '\n']
    >>> for child in bodyTag.children: print(child)
    ...
    
    <div class="answer text" id="answer" style="font-weight: bold; font-size: 120pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: black;">
    YES!
    </div>
    
    
    <div style="font-size: 5pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: gray;">
    <a href="http://theweekendhaslanded.org">The weekend has landed!</a>
    </div>
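The newline strings between elements count as children too, which is easy to trip over. A hedged sketch of one way to keep only the tags, using simplified inline markup as an assumption:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: simplified markup modelled on the page body.
html = """<body>
<div id="answer">YES!</div>
<div><a href="http://theweekendhaslanded.org">The weekend has landed!</a></div>
</body>"""
soup = BeautifulSoup(html, "html.parser")
body = soup.body

print(len(body.contents))     # a list: supports len() and indexing

# .children is an iterator; keep only real tags, dropping the '\n' strings
child_tags = [c for c in body.children if isinstance(c, Tag)]
print([t.name for t in child_tags])
```

Here `body.contents` has five items (three newline strings and two tags), while `child_tags` holds just the two `div` elements.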

    Going Down the Tree

    .descendants
    return all of the tag's children, grand-children (etc)
    >>> for d in bodyTag.descendants: print(d)
    ...
    
    
    <div class="answer text" id="answer" style="font-weight: bold; font-size: 120pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: black;">
    YES!
    </div>
    
    YES!
    
    
    
    <div style="font-size: 5pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: gray;">
    <a href="http://theweekendhaslanded.org">The weekend has landed!</a>
    </div>
    
    
    <a href="http://theweekendhaslanded.org">The weekend has landed!</a>
    The weekend has landed!

    Going Up the Tree

    .parent & .parents
    access a tag's parent(s)
    >>> weekendSoup.a.parent.name
    'div'
    >>> for p in weekendSoup.a.parents: print(p.name)
    ...
    div
    body
    html
    [document]
    

    Going Sideways

    .next_sibling(s) & .previous_sibling(s)
    access a tag's brothers and sisters
    >>> weekendSoup.div.next_sibling
    '\n'
    >>> weekendSoup.div.next_sibling.next_sibling
    <div style="font-size: 5pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: gray;">
    <a href="http://theweekendhaslanded.org">The weekend has landed!</a>
    </div>
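Chaining `.next_sibling` twice works, but it depends on the exact whitespace in the page. As a sketch, `find_next_sibling()` skips the strings in between; the markup and ids below are assumptions made up for the example.

```python
from bs4 import BeautifulSoup

# Assumption: simplified markup with invented ids.
html = """<body>
<div id="answer">YES!</div>
<div id="footer">The weekend has landed!</div>
</body>"""
soup = BeautifulSoup(html, "html.parser")
first = soup.find("div", id="answer")

print(repr(first.next_sibling))           # the newline between the two divs
second = first.find_next_sibling("div")   # skips strings, returns the next <div>
print(second["id"])
print(second.find_previous_sibling("div")["id"])
```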
    

    Exercise

    write a script:

    Use Beautiful Soup to navigate to the answer to our question:

    Is it weekend yet?

    '''A simple script that tells us if it's weekend yet'''
    
    # import modules
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    
    # open webpage
    
    # parse HTML into Beautiful Soup
    
    # extract data from parsed soup
    
    # print answer
    
    

    Searching the Soup-Tree




    Using a filter in a search function

    to zoom into a part of the soup

    String Filter

    find by element name


    >>> weekendSoup.find_all('div')
    [<div style="font-weight: bold; font-size: 120pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: black;">
    NO
    </div>, <div style="font-size: 5pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: gray;">
    don't worry, you'll get there!
    </div>]

    Search Functions




    find_all()

    find()

    Using find_all()


    find_all( name of tag )

    find_all( attribute filter )

    find_all( name of tag, attribute filter )

    Tag name


    >>> weekendSoup.find_all('div')
    [<div style="font-weight: bold; font-size: 120pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: black;">
    NO
    </div>, <div style="font-size: 5pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: gray;">
    don't worry, you'll get there!
    </div>]

    attribute filters



    urlBooks = 'http://books.toscrape.com/'
    pageSourceBooks = urlopen(urlBooks).read()
    booksSoup = BeautifulSoup(pageSourceBooks, "html.parser")
    
    soumissionLinks = booksSoup.find_all(
      'a',
      href='catalogue/soumission_998/index.html'
    )
    books = booksSoup.find_all('article', class_='product_pod')


    be careful to use "class_" when filtering based on class name
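A few variations on attribute filters, as a sketch. The markup is hand-written to imitate the book store (an assumption, including the titles and links), so the counts apply to this sample only.

```python
from bs4 import BeautifulSoup

# Assumption: hand-written markup that imitates books.toscrape.com.
html = """
<article class="product_pod">
  <p class="star-rating One"></p>
  <h3><a href="catalogue/book-a/index.html" title="Book A">Book A</a></h3>
</article>
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="catalogue/book-b/index.html" title="Book B">Book B</a></h3>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all("article", class_="product_pod")
by_one_class = soup.find_all("p", class_="star-rating")   # one class of several
by_exact = soup.find_all("p", class_="star-rating One")   # exact attribute string
by_dict = soup.find_all("a", attrs={"title": "Book B"})   # dict form of a filter

print(len(by_class), len(by_one_class), len(by_exact), len(by_dict))
```

The trailing underscore in `class_` is needed because `class` is a reserved word in Python; the `attrs` dictionary form avoids the issue for any attribute name.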


    find()


    like find_all(), but limited to one result

    >>> booksSoup.find('title')
    <title>
    All products | Books to Scrape - Sandbox
    </title>

    find_all() returns a list of tags (empty if nothing matches)
    find() returns a single tag, or None if nothing matches
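The two return types behave differently when nothing matches, which matters as soon as a page changes. A minimal sketch on an inline page (an assumption):

```python
from bs4 import BeautifulSoup

# Assumption: a tiny inline page without any <h1>.
soup = BeautifulSoup("<html><title>Books</title></html>", "html.parser")

print(soup.find_all("h1"))    # no match: an empty list
print(soup.find("h1"))        # no match: None

heading = soup.find("h1")
# Guard before use: heading.string would raise AttributeError on None.
heading_text = heading.string if heading is not None else "(no heading)"
print(heading_text)
```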

    Search All the directions!

    They work like find_all() & find()


    find_parents()
    find_parent()

    find_next_siblings()
    find_next_sibling()

    find_previous_siblings()
    find_previous_sibling()
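As a sketch of how these combine: start from a tag you can find easily, climb to its container, then search down again. The markup is an assumption that imitates the book store.

```python
from bs4 import BeautifulSoup

# Assumption: hand-written markup that imitates books.toscrape.com.
html = """
<article class="product_pod">
  <p class="star-rating One"></p>
  <h3><a title="Book A">Book A</a></h3>
</article>
<article class="product_pod">
  <p class="star-rating Five"></p>
  <h3><a title="Book B">Book B</a></h3>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

rating = soup.find("p", class_="star-rating One")
book = rating.find_parent("article")        # climb to the enclosing book
title = book.find("a")["title"]             # then search down again
heading = rating.find_next_sibling("h3")    # or move sideways instead
print(title, heading.a.string)
```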

    Exercise


    How many books on (the first page of) the Book Store

    have a 1 star rating?

    pro tip: use a search function

    books.toscrape.com


    Bonus:

    can you get the count for each of the different ratings?


    ⬇️ see template ⬇️

    same Template


    '''A simple script that scrapes book ratings'''
    
    # import modules
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    
    # open webpage
    
    
    # parse HTML into Beautiful Soup
    
    
    # extract data from parsed soup

    Bonus: using CSS Selectors

    select by id
    soup.select("#content")
    soup.select("div#content")

    select by class
    soup.select(".byline")
    soup.select("li.byline")

    select beneath tag
    soup.select("#content a")
    select directly beneath tag
    soup.select("#content > a")

    Bonus: using CSS Selectors

    check if attribute exists
    soup.select('a[href]')
    
    find by attribute value
    soup.select('a[href="http://www.theguardian.com/profile/brianlogan"]')
    attribute value starts with, ends with or contains
    soup.select('a[href^="http://www.theguardian.com/"]')
    soup.select('a[href$="info"]')
    [<a class="link-text" href="http://www.theguardian.com/info">About us,</a>, <a class="link-text" href="http://www.theguardian.com/info">About us</a>]
    >>> guardianSoup.select('a[href*=".com/contact"]')
    [<a class="rollover contact-link" href="http://www.theguardian.com/contactus/2120188" title="Displays contact data for guardian.co.uk"><img alt="" class="trail-icon" src="http://static.guim.co.uk/static/ac46d0fc9b2bab67a9a8a8dd51cd8efdbc836fbf/common/images/icon-email-us.png"/><span>Contact us</span></a>]
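The selectors above can be tried on a small inline page. This is a sketch only: the markup and the example.com addresses are assumptions, not the news site used on the slides.

```python
from bs4 import BeautifulSoup

# Assumption: invented markup and addresses for illustration.
html = """
<div id="content">
  <a class="link-text" href="http://www.example.com/info">About us</a>
  <ul><li class="byline"><a href="http://www.example.com/profile/jane">Jane</a></li></ul>
</div>
<a href="http://other.example.org/contact">Contact</a>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("#content a")))            # any depth beneath #content
print(len(soup.select("#content > a")))          # direct children only
print(len(soup.select('a[href$="info"]')))       # href ends with "info"
print(len(soup.select('a[href*="/contact"]')))   # href contains "/contact"

first = soup.select_one("li.byline a")           # first match, or None
print(first.string)
```

`select()` always returns a list, while `select_one()` mirrors `find()` by returning one tag or `None`.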

    4. Process extracted data


    we generally want to:


    clean up


    calculate


    process

    Cleaning up + processing


    >>> answer = weekendSoup.div.string
    >>> answer
    '\nNO\n'
    >>> cleaned = answer.strip()
    >>> cleaned
    'NO'
    >>> isWeekendYet = cleaned == 'YES'
    >>> isWeekendYet
    False 

    many useful string methods: https://docs.python.org/3/library/stdtypes.html#string-methods
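A few of those string methods in action, as a sketch. The raw values are assumptions shaped like typical scraped text.

```python
# Assumption: raw strings shaped like typical scraped values.
raw_price = "  £51.77\n"
raw_answer = "\nNO\n"

price = float(raw_price.strip().lstrip("£"))   # drop whitespace, then the symbol
answer = raw_answer.strip().lower()            # normalise case before comparing
words = "In stock (22 available)".split()      # split on whitespace

print(price, answer == "yes", words[0])
```

Converting to `float` only after the cleanup avoids a `ValueError` from stray symbols or whitespace.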

    5. Do stuff with our extracted data



    For example print to screen:

    # print info to screen
    print('Is it weekend yet? ', isWeekendYet)
    

    or save to .csv file


    import csv
    
    with open('weekends.csv', 'w', newline='') as csvfile:
        weekendWriter = csv.writer(csvfile)
        weekendWriter.writerow([isWeekendYet])
    

    https://docs.python.org/3.3/library/csv.html
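A slightly fuller sketch of writing rows and reading them back. Two assumptions: an in-memory buffer stands in for the file opened with `open('weekends.csv', 'w', newline='')`, and the rows are made-up sample data.

```python
import csv
import io

# Assumption: made-up sample rows, one list per line of the file.
rows = [
    ["date", "isWeekendYet"],      # header row
    ["2024-01-05", False],
    ["2024-01-06", True],
]

# Assumption: an in-memory buffer replaces a real file on disk.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerows(rows)             # each row is a sequence, one item per cell

buffer.seek(0)
read_back = list(csv.reader(buffer))
print(read_back)
```

Everything comes back as text when reading, so `False` is returned as the string `'False'` and needs converting if you want a boolean again.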

    Mini Project

    Let's LiteCoin!



    Mini Project

    Build a BitCoin to GBP converter

    Get the current BitCoin/GBP exchange rate via


    use the input() function to get a user's input


    (let the slides help you)


    ⬇️ see template ⬇️

    All together Now

    fill in the blanks...
    '''A simple script that ... '''
    
    # import modules
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    
    # open webpage
    url = 
    pageSource = 
    
    # parse HTML into Beautiful Soup
    mySoup = 
    
    # extract data from soup
    
    # clean up data
    
    # process data
    
    # action based on data




    Questions?

    Useful Resources


    google; “python” + your problem / question

    python.org/doc/; official python documentation, useful to find which functions are available

    stackoverflow.com; huge gamified help forum with discussions on all sorts of programming questions, answers are ranked by community

    codecademy.com/tracks/python; interactive exercises that teach you coding by doing

    wiki.python.org/moin/BeginnersGuide/Programmers; tools, lessons and tutorials

    Useful Modules etc

    mechanize: scraping behind logins & forms

    Data Science:
    Maths & Matrices with NumPy
    Data analysis with pandas
    Plotting graphs with Matplotlib




     Thank you 





    Navigating to the weekend yet answer


    >>> from urllib.request import urlopen
    >>> from bs4 import BeautifulSoup
    >>> url = "http://isitweekendyet.com/"
    >>> source = urlopen(url).read()
    >>> soup = BeautifulSoup(source, "html.parser")
    
    >>> soup.body.div.string
    '\nNO\n'

    # an alternative:
    >>> list(soup.body.stripped_strings)[0]
    'NO'

    many routes possible...

    Book Store

    '''A simple script that scrapes info about Books'''

    # import modules
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # open webpage
    urlBooks = "http://books.toscrape.com/"
    pageSourceBooks = urlopen(urlBooks).read()

    # parse HTML into Beautiful Soup
    booksSoup = BeautifulSoup(pageSourceBooks, "html.parser")

    # extract data from parsed soup
    booksOneStar = booksSoup.find_all('p', class_="star-rating One")
    oneStarCount = len(booksOneStar)
    print(oneStarCount)

    #########
    # Bonus #
    #########

    # simple approach

    books1StarCount = len(booksSoup.find_all('p', class_="star-rating One"))
    books2StarCount = len(booksSoup.find_all('p', class_="star-rating Two"))
    books3StarCount = len(booksSoup.find_all('p', class_="star-rating Three"))
    books4StarCount = len(booksSoup.find_all('p', class_="star-rating Four"))
    books5StarCount = len(booksSoup.find_all('p', class_="star-rating Five"))

    print("1 star: ", books1StarCount)
    print("2 star: ", books2StarCount)
    print("3 star: ", books3StarCount)
    print("4 star: ", books4StarCount)
    print("5 star: ", books5StarCount)

    # more elegant approach

    def getStarCount(booksSoup, starClass):
        # count the <p> tags whose class attribute matches this rating
        booksWithStarClass = booksSoup.find_all('p', class_="star-rating " + starClass)
        starCount = len(booksWithStarClass)
        return starCount

    starClasses = ["One", "Two", "Three", "Four", "Five"]

    starCounts = {}
    for starClass in starClasses:
        starCount = getStarCount(booksSoup, starClass)
        starCounts[starClass] = starCount

    print(starCounts)
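Another possible approach, sketched here with `collections.Counter`, tallies all ratings in a single pass. It assumes the rating name is always the second class on the tag, and the markup is an invented sample rather than the real page.

```python
from collections import Counter
from bs4 import BeautifulSoup

# Assumption: an invented sample; the real page has many more tags.
html = """
<p class="star-rating One"></p>
<p class="star-rating Three"></p>
<p class="star-rating One"></p>
<p class="price_color">£51.77</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Each rating tag carries two classes, e.g. ['star-rating', 'One'];
# this assumes the second one always names the rating.
ratings = [p["class"][1] for p in soup.find_all("p", class_="star-rating")]
star_counts = Counter(ratings)
print(star_counts)
```

A Counter returns 0 for ratings that never occur, so `star_counts["Five"]` is safe to look up even on this sample.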


    THE Bitcoin converter


    '''A simple script that converts BTC to GBP based on live rate'''

    # import modules
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # open webpage
    url = "https://exchangerate.guru/btc/"
    pageSource = urlopen(url).read()

    # parse HTML into Beautiful Soup
    btcSoup = BeautifulSoup(pageSource, "html.parser")

    # extract data from parsed soup

    gbpRateTag = btcSoup.find('a', href='/btc/gbp/1/')
    gbpRate = float(gbpRateTag.string)

    # get input from user

    noBitcoinString = input('How many Bitcoin have you got? ')
    noBitcoin = float(noBitcoinString)

    # calculate and print answer
    noGBP = noBitcoin * gbpRate

    print("WOW! you have " + str(noGBP) + "£!!!")

    A LiteCoin converter

    '''
    LiteCoin converter that tells us how much your LiteCoins are worth in EURO
    NB: expects python3
    '''

    # import modules
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    def clean_up_rate(rateString):
        '''Clean up raw rateString to form rate'''
        rateStringStripped = rateString.strip()
        rateNumber = rateStringStripped[1:10]
        return float(rateNumber)

    # open webpage
    url = "http://litecoinexchangerate.org/c/EUR"
    pageSource = urlopen(url).read()

    # turn html into beautiful soup
    liteCoinSoup = BeautifulSoup(pageSource, "html.parser")

    # extract info from soup
    rateString = liteCoinSoup.find('b').string

    # clean up data
    rate = clean_up_rate(rateString)

    # get user input
    litecoinsString = input("How many litecoins have you got?\n>>> ")
    litecoins = float(litecoinsString)

    # print output
    EURO = rate * litecoins
    print("You have", round(EURO,2), "EURO!")
    remember: readability counts!

    A bitcoin converter

    '''
    Bitcoin converter that tells us how much your bitcoins are worth in GBP
    NB: expects python3
    '''
    
    # import modules
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    
    def clean_up_rate(rateString):
        '''Clean up raw rateString to form rate'''
        rateNumber = rateString[1:10]
        return float(rateNumber)
    
    def main():
        '''Our main function that gets called when we run the program'''
        # open webpage
        url = "http://bitcoinexchangerate.org/c/GBP/1"
        webpage = urlopen(url).read()
    
        # turn html into beautiful soup
        bitcoinSoup = BeautifulSoup(webpage, "html.parser")
    
        # extract info from soup
        rateString = bitcoinSoup.find('b').string.strip()
    
        # clean up data
        rate = clean_up_rate(rateString)
    
        # get user input
        bitcoinsString = input("How many bitcoins have you got?\n>>> ")
        bitcoins = float(bitcoinsString)
    
        # print output
        GBP = rate * bitcoins
        print("You have", round(GBP,2), "GBP!")
    
    # this kicks off our program & lets us both run and import the program
    if __name__ == '__main__':
        main() 
    remember: readability counts!

    Webscraping with Python

    By Philo van Kemenade

    A practical introduction to webscraping with Python
