Web Scraping & Data Analysis with Python


J Rogel-Salazar

  @quantum_tunel / @DT_science

 Jun 2017

General Assembly London

Recap



Variables  
What types?

Collections
Lists
Dictionaries

Recap



Loops  
for
while

Conditional Statements
if, elif, else

Recap


Functions
take input as arguments
return output

Modules
 extending behaviour

Script vs. Interpreter

Homework




pip3 install beautifulsoup4
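A quick sanity check that the install worked: this one-liner should print a version number.

    python3 -c "import bs4; print(bs4.__version__)"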

This workshop


Getting data from the web

Web Scraping
is it weekend yet?


Mini Project
Iris dataset
Your project

these slides




http://bit.ly/pythonscrapingdata

Data on the web

Structured in databases

  • structured and indexed
  • hidden on the server-side of a web platform
  • may be accessible via API (Application Programming Interface; see the sketch after this list)

Semi-structured on web pages
  • different structure for every page
  • available in your browser
  • extractable by scraping
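When an API is available it usually beats scraping, because the data arrives already structured. A minimal sketch, assuming a hypothetical JSON endpoint (the URL below is made up for illustration):

    from urllib.request import urlopen
    import json

    # hypothetical endpoint -- substitute a real API URL
    url = 'https://api.example.com/items.json'
    data = json.loads(urlopen(url).read().decode('utf-8'))
    print(data)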

Web Scraping

Extracting data from a web page’s source
For example: http://isitweekendyet.com/


Pseudo code


(0. Determine what data we are looking for)

     
    1. Read page HTML
    2. Parse raw HTML string into nicer format
    3. Extract what we’re looking for
    4. Process our extracted data
    5. Store/print/action based on data 
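Mapped onto Python, one minimal way to lay the plan out is a skeleton like this (each step is developed in the slides that follow):

    from urllib.request import urlopen      # for step 1
    from bs4 import BeautifulSoup           # for step 2

    pageSource = urlopen('http://isitweekendyet.com/').read()  # 1. read HTML
    soup = BeautifulSoup(pageSource, 'lxml')                   # 2. parse
    answer = soup.div.string.strip()                           # 3. extract
    isWeekendYet = answer == 'YES'                             # 4. process
    print('Is it weekend yet?', isWeekendYet)                  # 5. act on it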

    HTML?

    HTML is a markup language for describing 
    web documents (web pages). 

     HTML stands for Hyper Text Markup Language 

    A markup language is a set of markup tags 

    HTML documents are described by HTML tags 

    Each HTML tag describes different document content

    A small HTML doc

    <!DOCTYPE html>
    <html>
    <head>
    <title>Page Title</title>
    </head>
    <body>

    <h1>My First Heading</h1>
    <p>My first paragraph.</p>

    </body>
    </html>

    0. Determine what you're looking for


    & where it lives on the page


    Pro Tip: use your browser's inspector


    1. Read page HTML


    # Python 3
    from urllib.request import urlopen
    url = 'http://isitweekendyet.com/'
    pageSource = urlopen(url).read()
    
    # Python 2
    from urllib import urlopen
    url = 'http://isitweekendyet.com/'
    pageSource = urlopen(url).read()
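    If the request might fail, a hedged variant wraps the call (Python 3; urllib.error is in the standard library):

    # Python 3, with minimal error handling
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError

    url = 'http://isitweekendyet.com/'
    try:
        pageSource = urlopen(url).read()
    except (HTTPError, URLError) as err:
        print('Could not fetch', url, ':', err)
        raise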
    

    2. Parse from Page to soup



    from bs4 import BeautifulSoup
    weekendSoup = BeautifulSoup(pageSource, 'lxml')

    Beautiful Soup



    You didn't write that awful page. 
    You're just trying to get some data out of it. 
    Beautiful Soup is here to help. 
    Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects.
    http://www.crummy.com/software/BeautifulSoup/

    Powered by Beautiful Soup


    Moveable Type at the NY Times


    Parse from Page to soup


    >>> from bs4 import BeautifulSoup
    >>> weekendSoup = BeautifulSoup(pageSource, 'lxml')

    Now we have easy access to stuff like:
       
    >>> weekendSoup.title
     <title>Is it weekend yet?</title>
    >>> weekendSoup.title.string
     'Is it weekend yet?'




     The Shape of HTML
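    To see that tree shape for yourself, parse the small document from earlier and let prettify() indent each level of nesting:

    from bs4 import BeautifulSoup

    doc = """<!DOCTYPE html>
    <html>
    <head><title>Page Title</title></head>
    <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    </body>
    </html>"""

    soup = BeautifulSoup(doc, 'lxml')
    print(soup.prettify())  # the indentation mirrors the nesting of the tags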

    Objects in Soup

    Our soup is a tree of different types of objects:

      • Tags
      • Strings
      • Comments
      • BeautifulSoup

    Tag

    an element in our HTML
    >>> tag = weekendSoup.div
    >>> tag
    <div style="font-weight: bold; font-size: 120pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: black;">
    YES!
    </div>
    >>> type(tag)
    <class 'bs4.element.Tag'> 
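    Tags also carry a name and attributes; both are part of the standard Beautiful Soup API (outputs abbreviated here):

    >>> tag.name
    'div'
    >>> tag.attrs          # all attributes as a dict
    {'style': 'font-weight: bold; ...'}
    >>> tag['style']       # dictionary-style access to one attribute
    'font-weight: bold; ...'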

    String

    Text within a tag

    >>> tag.string
    '\nYES!\n'
    >>> type(tag.string)
    <class 'bs4.element.NavigableString'>









    Beautiful Soup

    the parsed document

    >>> type(weekendSoup)
    <class 'bs4.BeautifulSoup'>

    >>> weekendSoup.name
    '[document]'

    Comment

    A special type of NavigableString

    >>> markup = "<b><!--This is a very special message--></b>"
    >>> cSoup = BeautifulSoup(markup, 'lxml')
    >>> comment = cSoup.b.string
    >>> type(comment)
    <class 'bs4.element.Comment'>
    >>> print(cSoup.b.prettify())
    <b>
     <!--This is a very special message-->
    </b>


    3. Extract what we’re looking for



    Navigating the parsed HTML tree


    Search functions


    Bonus: CSS Selectors


    Navigating the Soup Tree

    .string
    .strings & .stripped_strings
    .contents & .children
    .descendants


    .parent & .parents 

    .next_sibling(s)
    .previous_sibling(s)

    Going Down the Tree

    .contents & .children
    return a tag's direct children as list or generator
    >>> bodyTag = weekendSoup.body
    >>> bodyTag.contents
    ['\n', <div class="answer text" id="answer" style="font-weight: bold; font-size: 120pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: black;">
    YES!
    </div>, '\n', <div style="font-size: 5pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: gray;">
    <a href="http://theweekendhaslanded.org">The weekend has landed!</a>
    </div>, '\n']
    
    <div class="answer text" id="answer" style="font-weight: bold; font-size: 120pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: black;">
    YES!
    </div>
    
    
    <div style="font-size: 5pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: gray;">
    <a href="http://theweekendhaslanded.org">The weekend has landed!</a>
    </div>

    Going Down the Tree

    .string
    .strings & .stripped_strings
    access the String(s) inside a tag
    >>> weekendSoup.div.string
    '\nYES!\n'

    >>> for ss in weekendSoup.div.stripped_strings:
    ...     print(ss)
    ...
    YES!






    Going Down the Tree

    .descendants
    return all of the tag's children, grand-children (etc.)

    >>> for d in bodyTag.descendants:
    ...     print(d)
    ...
    
    
    <div class="answer text" id="answer" style="font-weight: bold; font-size: 120pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: black;">
    YES!
    </div>
    
    YES!
    
    
    
    <div style="font-size: 5pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: gray;">
    <a href="http://theweekendhaslanded.org">The weekend has landed!</a>
    </div>
    
    
    <a href="http://theweekendhaslanded.org">The weekend has landed!</a>
    The weekend has landed!

    Going Up the Tree

    .parent & .parents
    access a tag's parent(s)
    >>> weekendSoup.a.parent.name
    'div'
    >>> for p in weekendSoup.a.parents:
    ...     print(p.name)
    ...
    div
    body
    html
    [document]
    

    Going Sideways

    .next_sibling(s) & .previous_sibling(s)
    access a tag's brothers and sisters
    >>> weekendSoup.div.next_sibling
    '\n'
    >>> weekendSoup.div.next_sibling.next_sibling
    <div style="font-size: 5pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: gray;">
    <a href="http://theweekendhaslanded.org">The weekend has landed!</a>
    </div>
    

    Exercise

    Write a script:

    Use Beautiful Soup to navigate to the answer to our question:

    Is it weekend yet?

    '''A simple script that tells us if it's weekend yet'''
    
    # import modules
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    
    # open webpage
    
    
    # parse HTML into Beautiful Soup
    
    
    # extract data from parsed soup
    
    
    

    Searching the Soup-Tree




    Using a filter in a search function

    to zoom into a part of the soup






    String Filter

    the simplest matching


    >>> weekendSoup.find_all('div')
    [<div style="font-weight: bold; font-size: 120pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: black;">
    NO
    </div>, <div style="font-size: 5pt; font-family: Helvetica Neue, Helvetica, Swis721 BT, Arial, sans-serif; text-decoration: none; color: gray;">
    don't worry, you'll get there!
    </div>]

    Bonus: more filters


    Besides a string, a search function also accepts these filters:

      • Regular Expression
      • List
      • True
      • Function

    more details: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-filters


    Regular Expression Filter

    smart matching (slightly out of today's scope)

    >>> import re
    >>> for tag in weekendSoup.find_all(re.compile("^b")):
    ...     print(tag.name)
    ...
    body

    List Filter

    String match any item in the list

    >>> weekendSoup.find_all(["a", "li"])


    [<a href="http://theweekendhaslanded.org">The weekend has landed!</a>]

    [<a href="http://theweekendhaslanded.org">The weekend has landed!</a>]

    True Filter

    matches all tags, but no text strings

    >>> for tag in weekendSoup.find_all(True):
    ...     print(tag.name)
    ...
    html
    head
    title
    meta
    body
    div
    div
    a


    Function Filter

    bespoke matching
    >>> def has_class_but_no_id(tag):
    ...     return tag.has_attr('class') and not tag.has_attr('id')
    ...
    >>> weekendSoup.find_all(has_class_but_no_id)
    []

    Search Functions




    find_all()

    find()

    Using find_all()


    find_all( name of tag )

    find_all( attribute filter )

    find_all( name of tag, attribute filter )

    searching by attribute filters


    careful: use "class_" when filtering on class name ("class" is a reserved word in Python)

    import re
    urlGA = 'https://gallery.generalassemb.ly/'
    pageSourceGA = urlopen(urlGA).read()
    GASoup = BeautifulSoup(pageSourceGA, 'lxml')

    # a plain string matches the attribute value exactly; use a regex to match
    # hrefs that merely contain 'WD'
    wdiLinks = GASoup.find_all('a', href=re.compile('WD'))
    projects = GASoup.find_all('li', class_='project')
    

    find()


    like find_all(), but limited to one result

    the following are equivalent (soup here is a parsed Guardian article page):
    >>> soup.find_all('title', limit=1)
    [<title> Monty Python's reunion is about nostalgia and heroes, not comedy | Stage | theguardian.com </title>]
    >>> soup.find('title')
    <title> Monty Python's reunion is about nostalgia and heroes, not comedy | Stage | theguardian.com </title>
    find_all() returns a list
    find() returns a tag

    Search in all directions!

    They work like find_all() & find()


    find_parents()
    find_parent()

    find_next_siblings()
    find_next_sibling()

    find_previous_siblings()
    find_previous_sibling()
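    For instance, on the weekend page (a quick sketch; outputs abbreviated):

    >>> weekendSoup.a.find_parent('div')['style']
    'font-size: 5pt; ...'
    >>> weekendSoup.div.find_next_sibling('div').a.string
    'The weekend has landed!'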

    Bonus: using CSS Selectors

    select by id
    soup.select("#content")
    soup.select("div#content")

    select by class
    soup.select(".byline")
    soup.select("li.byline")

    select beneath tag
    soup.select("#content a")

    select directly beneath tag
    soup.select("#content > a")

    Bonus: using CSS Selectors

    check if attribute exists
    soup.select('a[href]')
    
    find by attribute value
    soup.select('a[href="http://www.theguardian.com/profile/brianlogan"]')

    attribute value starts with, ends with or contains
    soup.select('a[href^="http://www.theguardian.com/"]')
    soup.select('a[href$="info"]')
    [<a class="link-text" href="http://www.theguardian.com/info">About us,</a>, <a class="link-text" href="http://www.theguardian.com/info">About us</a>]
    >>> guardianSoup.select('a[href*=".com/contact"]')
    [<a class="rollover contact-link" href="http://www.theguardian.com/contactus/2120188" title="Displays contact data for guardian.co.uk"><img alt="" class="trail-icon" src="http://static.guim.co.uk/static/ac46d0fc9b2bab67a9a8a8dd51cd8efdbc836fbf/common/images/icon-email-us.png"/><span>Contact us</span></a>]

    4. Process extracted data


    We generally want to:

      • clean up
      • calculate
      • process

    Cleaning up + processing


    >>> answer = soup.div.string
    >>> answer
    '\nNO\n'
    >>> cleaned = answer.strip()
    >>> cleaned
    'NO'
    >>> isWeekendYet = cleaned == 'YES'
    >>> isWeekendYet
    False 

    many useful string methods: https://docs.python.org/3/library/stdtypes.html#string-methods

    5. Do stuff with our extracted data



    For example print to screen:

    # print info to screen
    print('Is it weekend yet? ', isWeekendYet)
    

    or save to .csv file


    import csv 
    
    with open('weekend.csv', 'w', newline='') as csvfile:
        weekendWriter = csv.writer(csvfile)
        if isWeekendYet:
            weekendWriter.writerow(['Yes'])
        else:
            weekendWriter.writerow(['No'])
    

    https://docs.python.org/3.3/library/csv.html

    Mini Project

    Extract tabular data from the web


    Get the table of data from the Iris data set page on Wikipedia:
    https://en.wikipedia.org/wiki/Iris_flower_data_set

    Use the find_all function and look at the class for the table. 

    Also check the tags used to delimit header and rows in the table.

    (let the slides and code help you)

    Mini Project

    Some steps to help you 

    1. As before, open the website and load it into an object called "IrisSoup"
    2. Check the class that the table has in the page and use find() to get the table
    3. Iterate through each row (tr) and assign each element
      • (get_text() and strip()) to a variable and append it to a list
    4. Make a list of lists with the output of step 3 above
      • NOTE: Consider using a nested list comprehension to do steps 3 and 4 in one go!
    5. What would be useful to do with the headers (th)?

    (let the slides help you)

    Let us analyse the iris dataset


    We will use the pandas library to do this




    Questions?

    Useful Resources


    Google: "python" + your problem / question

    python.org/doc/: official Python documentation, useful to find which functions are available

    stackoverflow.com: huge gamified help forum with discussions on all sorts of programming questions; answers are ranked by the community

    codecademy.com/tracks/python: interactive exercises that teach you coding by doing

    wiki.python.org/moin/BeginnersGuide/Programmers: tools, lessons and tutorials

    Useful Modules etc


    Maths & Matrices with Numpy
    Data analysis with Pandas
    Plotting graphs with Matplotlib

    2 vs 3


    Python Usage Survey 2014 visualised

    http://www.randalolson.com/2015/01/30/python-usage-survey-2014/


    Python 2 & 3 Key Differences

    http://sebastianraschka.com/Articles/2014_python_2_3_key_diff.html




     Thank you 


    J Rogel-Salazar



    Extracting Iris Dataset     

    '''
    Extracting the Iris dataset table from Wikipedia
    '''
    # import modules
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    
    # open webpage
    url = 'https://en.wikipedia.org/wiki/Iris_flower_data_set'
    pageSource = urlopen(url).read()
    
    # parse HTML into Beautiful Soup
    IrisSoup = BeautifulSoup(pageSource, 'lxml')
    
    # Get the table
    right_table = IrisSoup.find('table', class_='wikitable sortable')

    # Extract rows
    tmp = right_table.find_all('tr')
    first = tmp[0]
    allRows = tmp[1:]

    # Construct headers
    headers = [header.get_text().strip() for header in first.find_all('th')]

    # Construct results (strip whitespace from each cell)
    results = [[data.get_text().strip() for data in row.find_all('td')]
               for row in allRows]

    remember: readability counts!

    Creating a pandas dataframe


    # import pandas to convert the list of lists to a data frame
    import pandas as pd
    df = pd.DataFrame(data=results, columns=headers)

    # replace non-breaking spaces (\xa0) in the species names
    df['Species'] = df['Species'].map(lambda x: x.replace('\xa0', ' '))
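
    The analysis below reads the data back from irisdata.csv, so save the scraped frame first; this bridges the two parts (numeric columns are re-inferred by pandas on read):

    # save the scraped table for the analysis section
    df.to_csv('irisdata.csv', index=False)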

    Pandas Analysis

    # in a Jupyter notebook
    %pylab inline
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    
    # load the Iris data (the table scraped above, saved as irisdata.csv)
    iris_data = pd.read_csv('irisdata.csv')
    
    iris_data.head()
    
    iris_data.shape
    
    iris_data[0:10][['Sepal width', 'Sepal length' ]]
    
    # Summarise the data 
    iris_data.describe()
    
    # Now let's group the data by the species 
    byspecies = iris_data.groupby('Species') 
    
    byspecies.describe()
    
    byspecies['Petal length'].mean()
    
    # Histograms
    iris_data.loc[iris_data['Species'] == 'I. setosa', 'Sepal width'].hist(bins=10)
    
    iris_data['Sepal width'].plot(kind="hist")
    
    
    
    
    
    
    
    
    
    
    
    

    Navigating to the weekend yet answer


    >>> from urllib.request import urlopen
    >>> from bs4 import BeautifulSoup
    >>> url = "http://isitweekendyet.com/"
    >>> source = urlopen(url).read()
    >>> soup = BeautifulSoup(source, 'lxml')
    
    >>> soup.body.div.string
    '\nNO\n'

    # an alternative:
    >>> list(soup.body.stripped_strings)[0]
    'NO'

    many routes possible...
