Web Scraping in Python 101


By Yasoob Khalid

Who am I?


I am Muhammad Yasoob Ullah Khalid

  • a programmer 
  • a high school student 
  • a blogger 
  • Pythonista 
  • and tea lover

My experience


  • Creator of freepythontips
  • Made a couple of open source programs
  • A  contributor to youtube-dl 
  • Teaching programming at my school to my friends
  • It's my first conference!

What is this talk about?

  • This talk is about Web Scraping
  • Which libraries are available for the job 
  • Which library is best for which job
  • An intro to Scrapy
  • When and when NOT to use Scrapy

What is Web Scraping?

Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding a fully-fledged web browser, such as Internet Explorer or Mozilla Firefox. - Wikipedia

In simple words

It is a method to extract data from a website that does not have an API, or to extract a LOT of data that an API will not give us because of rate limiting.

Through web scraping we can extract any data which we can see while browsing the web.

Usage of web scraping in real life


  • Extract product information
  • Extract job postings and internships
  • Extract offers and discounts from deal-of-the-day websites
  • Crawl forums and social websites
  • Extract data to make a search engine
  • Gathering weather data 
  • etc.

Advantages of Web Scraping over using an API


  • Web Scraping is not rate limited
  • Anonymously access the website and gather data
  • Some websites do not have an API
  • Some data is not accessible through an API
  • and many more !

Essential parts of Web Scraping

Web Scraping follows this workflow (a minimal sketch follows below):
  • Get the website - using an HTTP library
  • Parse the HTML document - using any parsing library
  • Store the results - in a database, a CSV file, a text file, etc.

We will focus more on parsing.
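
To make the three steps concrete, here is a minimal sketch of that workflow; the URL and the XPath are placeholders for whatever site you actually want to scrape:

    import csv

    import requests
    from lxml import html

    # 1. Get the website - using an HTTP library
    response = requests.get('http://example.com/')

    # 2. Parse the HTML document - using a parsing library
    tree = html.fromstring(response.text)
    headings = tree.xpath('//h1/text()')

    # 3. Store the results - here a simple CSV file
    with open('headings.csv', 'w') as f:
        writer = csv.writer(f)
        for heading in headings:
            writer.writerow([heading])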

Libraries available for the job 

Some of the most widely known libraries used for web scraping are:


  • BeautifulSoup 
  • lxml 
  • re (not really for web scraping, I will explain later)
  • Scrapy (a complete framework)

Some HTTP libraries for web scraping

    • Requests
    html = requests.get('https://www.google.com').text
    • urllib and urllib2 
    html = urllib2.urlopen('http://python.org/').read()
    
    • httplib and httplib2 
    h = httplib2.Http(".cache")
    (resp_headers, content) = h.request("http://example.org/", "GET") 

    Parsing libraries


    • BeautifulSoup
    tree = BeautifulSoup(html_doc)
    tree.title 
    • lxml
    tree = lxml.html.fromstring(html_doc)
    title = tree.xpath('//title/text()')
    • re
    title = re.findall('<title>(.*?)</title>', html_doc) 

    BeautifulSoup

    • A beautiful API
    soup = BeautifulSoup(html_doc)
    last_a_tag = soup.find("a", id="link3")
    all_b_tags = soup.find_all("b") 
    • very easy to use
    • can handle broken markup (see the sketch below)
    • purely in Python
    • slow :(
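
    A tiny example of how forgiving it is with broken markup (the HTML snippet below is made up for illustration):

    from bs4 import BeautifulSoup

    # a deliberately broken snippet: unclosed <li> and <b> tags
    broken = "<ul><li>first<li>second <b>bold"

    soup = BeautifulSoup(broken)
    print(soup.find_all("li")[1].text)  # "second bold" - the missing tags were closed for us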

     lxml

    The lxml XML toolkit provides Pythonic bindings for the C libraries libxml2 and libxslt without sacrificing speed. 

    • very fast
    • not purely in Python
    • If you have no "pure Python" requirement, use lxml
    • lxml works with all Python versions from 2.4 to 3.3
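
    A similarly small sketch with lxml (again a made-up snippet); lxml.html repairs the broken markup and gives you full XPath on the result:

    from lxml import html

    # the same kind of broken snippet
    broken = "<ul><li><a href='/a'>first<li><a href='/b'>second"

    tree = html.fromstring(broken)
    print(tree.xpath('//a/@href'))     # ['/a', '/b']
    print(tree.xpath('//li//text()'))  # ['first', 'second']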

    re

    re is the regular expression library for Python. It is used only to extract small amounts of text; parsing an entire HTML document with regular expressions is not practical. Its unpopularity is due to:
    • it requires you to learn its symbols, e.g.
    '.',*,$,^,\b,\w 
    • patterns can become complex

    However, it is:
    • baked right into Python
    • part of the standard library
    • very fast - I will show later
    • supported by every Python version
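
    For the kind of small, targeted extraction re is good at, a compiled pattern stays short (the HTML string below is made up for illustration):

    import re

    html_doc = '<a href="/jobs/1">Python dev</a> <a href="/jobs/2">Scraper</a>'

    # fine for tiny extractions, but not a substitute for a real HTML parser
    link_re = re.compile(r'<a href="(.*?)">(.*?)</a>')
    print(link_re.findall(html_doc))
    # [('/jobs/1', 'Python dev'), ('/jobs/2', 'Scraper')]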

    Comparison of lxml, re and BeautifulSoup

    import re
    import time
    import urllib2
    from bs4 import BeautifulSoup
    from lxml import html as lxmlhtml
    
    def timeit(fn, *args):
        t1 = time.time()
        for i in range(100):
            fn(*args)
        t2 = time.time()
        print '%s took %0.3f ms' % (fn.func_name, (t2-t1)*1000.0)
        
    def bs_test(html):
        soup = BeautifulSoup(html)
        return soup.html.head.title
        
    def lxml_test(html):
        tree = lxmlhtml.fromstring(html)
        return tree.xpath('//title')[0].text_content()
        
    def regex_test(html):
        return re.findall('<title>(.*?)</title>', html)[0]
        
    if __name__ == '__main__':
        url = 'http://freepythontips.wordpress.com/'
        html = urllib2.urlopen(url).read()
        for fn in (bs_test, lxml_test, regex_test):
            timeit(fn, html) 

    The result 

    yasoob@yasoob:~/Desktop$ python test.py
    bs_test took 1851.457 ms
    lxml_test took 232.942 ms
    regex_test took 7.186 ms 
    
    
    • lxml took about 32x more time than re
    • BeautifulSoup took about 258x more time than re!

    What to do when your scraping needs are very high?

    • You want to scrape millions of web pages every day.
    • You want to build a broad-scale web scraper.
    • You want to use something that is thoroughly tested.
    • Is there any solution?

    Yes, there is a solution!


    One word: Scrapy!

    • Scrapy is very fast

    • It's a full-blown, thoroughly tested framework

    • It's asynchronous

    • It's easy to use

    • It has everything you need to start scraping

    • It's made in Python 

    How does Scrapy compare to BeautifulSoup or lxml?

    BeautifulSoup and lxml are libraries for parsing HTML and XML. Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them. In other words, comparing BeautifulSoup (or lxml) to Scrapy is like comparing jinja2 to Django - scrapy docs
    Scrapy only supports Python 2.7 and NOT 3.x

    When to use Scrapy


    • When you have to scrape millions of pages
    • When you want asynchronous support out of the box
    • When you don't want to reinvent the wheel
    • When you are not afraid to learn something new

    If you are not willing to risk the unusual, you will have to settle for the ordinary – Jim Rohn

    Starting out with Scrapy


    The workflow in Scrapy:
    • Define a scraper
    • Define the items you are going to extract
    • Define the items pipeline (Optional)
    • Run the scraper

    Note: I will just demonstrate the basic building blocks of Scrapy. In Scrapy, a scraper is called a spider.

    Using the Scrapy command-line tool

    scrapy startproject tutorial 
    
    • This will create the following directory and files:
    tutorial
    ├── scrapy.cfg
    └── tutorial
        ├── __init__.py
        ├── items.py
        ├── pipelines.py
        ├── settings.py
        └── spiders
            └── __init__.py
     
    • A project can have multiple spiders (see the note below)
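
    Each spider lives in its own module under the spiders/ directory. Scrapy can also generate a skeleton spider for you with the genspider command (the name and domain below are placeholders):

    scrapy genspider example example.com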

    What is an Item?

    Items are containers that will be loaded with the scraped data. They work like simple Python dicts but provide additional protection against populating undeclared fields, which prevents typos (illustrated after the class definition below).

    • Declaring an Item class:
    #tutorial/tutorial/items.py
    
    import scrapy
    
    class DmozItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()
        description = scrapy.Field() 
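
    A quick illustration of that protection (assuming the DmozItem class above, inside the tutorial project):

    from tutorial.items import DmozItem

    item = DmozItem(title="Learning Python")
    item['link'] = 'http://example.com/'   # declared field: works like a dict
    item['descriptoin'] = 'oops'           # typo, undeclared field: raises KeyError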
    

    Extracting data

    • Use the scrapy shell to test scraping:
    scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/ 


    • Scrapy provides XPath, CSS selectors and regular expressions to extract data (CSS and regex are sketched below)
    • Extracting the title using XPath:
    title = sel.xpath('//title/text()').extract() 
    • That's it!
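
    The same title can also be pulled out with a CSS selector, or with a regex applied on top of a selector; a small sketch (sel is the selector object the shell gives you):

    # inside the scrapy shell - same data, different selector styles
    title_css = sel.css('title::text').extract()
    title_re = sel.xpath('//title/text()').re(r'\w+')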

    Writing the first scraper

    A spider is a class written by the user to scrape data from a website.  Writing a spider is easy. Just follow these steps:

    • Subclass scrapy.Spider
    • Define the start_urls list
    • Define the parse method in your spider

    Full Spider

    #tutorial/tutorial/spiders/spider.py
    
    import scrapy
    from tutorial.items import DmozItem
    
    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["dmoz.org"]
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
        ]
    
        def parse(self, response):
            for sel in response.xpath('//ul/li'):
                item = DmozItem()
                item['title'] = sel.xpath('a/text()').extract()
                item['link'] = sel.xpath('a/@href').extract()
                item['description'] = sel.xpath('text()').extract()
                yield item
    

    Unleash the Scrapy power!

    scrapy crawl dmoz 
    
    2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
    2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
    2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
    2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
    2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
    2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
    2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
    2014-01-23 18:13:07-0400 [dmoz] INFO: Spider opened
    2014-01-23 18:13:08-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
    2014-01-23 18:13:09-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
    2014-01-23 18:13:09-0400 [dmoz] INFO: Closing spider (finished) 

    Storing the scraped data

    You have two choices:
    • Use feed export
    • Define Item pipelines

    • Using feed export:
    scrapy crawl dmoz -o items.json 
    • Item pipelines are a separate topic and will be covered in the future; a minimal sketch follows below
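
    Just to give a flavour of the second option, a minimal sketch of a pipeline (assuming the tutorial project from above; the output file name is just a placeholder):

    # tutorial/tutorial/pipelines.py
    import json

    class JsonWriterPipeline(object):

        def __init__(self):
            # one JSON object per line
            self.file = open('items.jl', 'w')

        def process_item(self, item, spider):
            self.file.write(json.dumps(dict(item)) + "\n")
            return item

    Enable it in settings.py, for example with ITEM_PIPELINES = {'tutorial.pipelines.JsonWriterPipeline': 300}.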

    When not to use Scrapy

    • You are just making a throwaway script
    • You want to crawl a small number of pages
    • You want something simple
    • You want to reinvent the wheel and learn the basics

    So what should you use?

    • If you want to make a script that only has to extract a small amount of information, and you are not afraid of learning something new, then use re.
    • If you want to extract a lot of data and do not have a "pure Python" library requirement, then use lxml.
    • If you want to extract information from broken markup, then use BeautifulSoup.
    • If you want to scrape a lot of pages and want to use a mature scraping framework, then use Scrapy.

    What do I prefer?

    Seriously speaking, I prefer re and Scrapy. I started web scraping with BeautifulSoup as it was the easiest. Then I used lxml and soon found BeautifulSoup slow. Then I used re for some time and fell in love with it. I use Scrapy only to make large scrapers or when I need to get a lot of data. Once I used Scrapy to scrape 69,000 torrent links from a website.

    What is youtube-dl ?

    It is a Python script that allows you to download videos and music from various websites like:
    • Facebook
    • YouTube
    • Vimeo
    • Dailymotion
    • Metacafe
    • and almost 300 more!

    Well, that was it!

    I hope you learned something about web scraping. It was my first conference, so forgive me for any mistakes. If you want to talk to me later, meet me outside. If you want to ask something, don't hesitate and I will try to answer.


    Questions?


    Facebook: fb.me/m.yasoob
    Twitter: @yasoobkhalid
    Blog: https://freepythontips.wordpress.com/
    Email: yasoob.khld@gmail.com
    Slides: http://slides.com/myasoobkhalid/web-scraping
