Web scraping in Python 101
By Yasoob Khalid
Who am I?
I am Muhammad Yasoob Ullah Khalid
- a programmer
- a high school student
- a blogger
- a Pythonista
- and a tea lover
My experience
- Creator of freepythontips
- Made a couple of open source programs
- A contributor to youtube-dl
- Teaching programming to my friends at my school
- This is my first conference!
What is this talk about?
- This talk is about Web Scraping
- Which libraries are available for the job
- Which library is best for which job
- An intro to Scrapy
- When and when NOT to use Scrapy
What is Web Scraping?
Web scraping (web harvesting or web data extraction) is a
computer software technique of extracting information from
websites. Usually, such software programs simulate human
exploration of the World Wide Web by either implementing
low-level Hypertext Transfer Protocol (HTTP), or embedding
a fully-fledged web browser, such as Internet Explorer or
Mozilla Firefox. - Wikipedia
In simple words
It is a method to extract data from a website that does not have an API, or to extract so much data that an API's rate limiting makes it impractical.
Through web scraping we can extract any data that we can see while browsing the web.
Usage of web scraping in real life
- Extracting product information
- Extracting job postings and internships
- Extracting offers and discounts from deal-of-the-day websites
- Crawling forums and social websites
- Extracting data to build a search engine
- Gathering weather data
- etc.
Advantages of Web Scraping over using an API
- Web Scraping is not rate limited
- You can access the website anonymously and gather data
- Some websites do not have an API
- Some data is not accessible through an API
- and many more!
Essential parts of Web Scraping
Web Scraping follows this workflow:
- Get the website - using an HTTP library
- Parse the HTML document - using any parsing library
- Store the results - in a database, CSV file, text file, etc.
We will focus more on parsing.
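The three steps above can be sketched end to end with nothing but the standard library. A hardcoded page stands in for the "get the website" step, the parse step uses the stdlib html.parser, and the store step writes a CSV row; this is a toy illustration, not a recommendation over the libraries discussed next:

```python
import csv
import io
from html.parser import HTMLParser

# Step 1 (stand-in): in real code you would fetch this with an HTTP library.
html_doc = "<html><head><title>Example Page</title></head><body></body></html>"

class TitleParser(HTMLParser):
    """Step 2: pull the <title> text out of the document."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = TitleParser()
parser.feed(html_doc)

# Step 3: store the result as a CSV row (here in memory, normally a file).
buf = io.StringIO()
csv.writer(buf).writerow(["title", parser.title])
print(buf.getvalue().strip())  # title,Example Page
```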
Libraries available for the job
Some of the most widely known libraries used for web scraping are:
- BeautifulSoup
- lxml
- re (not really for web scraping, I will explain later)
- Scrapy (a complete framework)
Some HTTP libraries for web scraping
# requests
html = requests.get('https://www.google.com').text
# urllib2 (Python 2; urllib.request in Python 3)
html = urllib2.urlopen('http://python.org/').read()
# httplib2
h = httplib2.Http(".cache")
(resp_headers, content) = h.request("http://example.org/", "GET")
Parsing libraries
# BeautifulSoup
tree = BeautifulSoup(html_doc)
tree.title
# lxml
tree = lxml.html.fromstring(html_doc)
title = tree.xpath('//title/text()')
# re
title = re.findall('<title>(.*?)</title>', html_doc)
BeautifulSoup
soup = BeautifulSoup(html_doc)
last_a_tag = soup.find("a", id="link3")
all_b_tags = soup.find_all("b")
- very easy to use
- can handle broken markup
- written purely in Python
- slow :(
lxml
The lxml XML toolkit provides Pythonic bindings for the C libraries libxml2 and libxslt without sacrificing speed.
- very fast
- not written purely in Python
- if you have no "pure Python" requirement, use lxml
- works with all Python versions from 2.4 to 3.3
re
re is the regular expression library for Python. It is used only to extract minute amounts of text; parsing entire HTML documents is not possible with regular expressions. Its unpopularity is due to the fact that it:
- requires you to learn its symbols, e.g. '.*?', '\d', '[a-z]'
However, it is:
- baked purely into Python
- a part of the standard library
- very fast - I will show later
- available in every Python version
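A minimal sketch of re extraction in practice (Python 3; the page string here is made up):

```python
import re

html_doc = ("<html><head><title>My Blog</title></head>"
            "<body><a href='/a'>A</a></body></html>")

# The non-greedy group (.*?) captures just the text between the tags.
title = re.findall(r'<title>(.*?)</title>', html_doc)[0]

# The same idea extracts every href attribute value.
links = re.findall(r"href='(.*?)'", html_doc)

print(title)  # My Blog
print(links)  # ['/a']
```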
Comparison of lxml, re and BeautifulSoup
import re
import time
import urllib2
from bs4 import BeautifulSoup
from lxml import html as lxmlhtml
def timeit(fn, *args):
    t1 = time.time()
    for i in range(100):
        fn(*args)
    t2 = time.time()
    print '%s took %0.3f ms' % (fn.func_name, (t2-t1)*1000.0)

def bs_test(html):
    soup = BeautifulSoup(html)
    return soup.html.head.title

def lxml_test(html):
    tree = lxmlhtml.fromstring(html)
    return tree.xpath('//title')[0].text_content()

def regex_test(html):
    return re.findall('<title>(.*?)</title>', html)[0]

if __name__ == '__main__':
    url = 'http://freepythontips.wordpress.com/'
    html = urllib2.urlopen(url).read()
    for fn in (bs_test, lxml_test, regex_test):
        timeit(fn, html)
The result
yasoob@yasoob:~/Desktop$ python test.py
bs_test took 1851.457 ms
lxml_test took 232.942 ms
regex_test took 7.186 ms
- lxml took ~32x more time than re
- BeautifulSoup took ~258x more time than re
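For reference, a Python 3 version of the same timing harness, restricted to the stdlib regex test (bs4 and lxml omitted so the sketch stays dependency-free; the page string is made up):

```python
import re
import time

html_doc = "<html><head><title>Hello</title></head><body></body></html>"

def timeit(fn, *args):
    """Run fn 100 times and return the total elapsed milliseconds."""
    t1 = time.time()
    for _ in range(100):
        fn(*args)
    return (time.time() - t1) * 1000.0

def regex_test(html):
    return re.findall(r'<title>(.*?)</title>', html)[0]

elapsed = timeit(regex_test, html_doc)
print('regex_test took %0.3f ms' % elapsed)
```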
What to do when your scraping needs are very high?
- You want to scrape millions of web pages every day.
- You want to make a broad-scale web scraper.
- You want to use something that is thoroughly tested.
- Is there any solution?
Yes, there is a solution!
One word: Scrapy!
How does Scrapy compare to BeautifulSoup or lxml?
BeautifulSoup and lxml are libraries for parsing HTML and XML. Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them. In other words, comparing BeautifulSoup (or lxml) to Scrapy is like comparing jinja2 to Django - scrapy docs.
Scrapy only supports Python 2.7, NOT 3.x
When to use Scrapy
- When you have to scrape millions of pages
- When you want asynchronous support out of the box
- When you don't want to reinvent the wheel
- When you are not afraid to learn something new
If you are not willing to risk the unusual, you will have to settle for the ordinary – Jim Rohn
Starting out with Scrapy
The workflow in Scrapy:
- Define a scraper
- Define the items you are going to extract
- Define the item pipeline (optional)
- Run the scraper
Note: I will just demonstrate the basic building blocks of Scrapy. In Scrapy, a scraper is called a spider.
Using the Scrapy command-line tool
scrapy startproject tutorial
- This will create the following directory and files:
tutorial
├── scrapy.cfg
└── tutorial
├── __init__.py
├── items.py
├── pipelines.py
├── settings.py
└── spiders
└── __init__.py
- A project can have multiple spiders
What is an Item?
Items are containers that will be loaded with the scraped data. They work like simple Python dicts but provide additional protection against populating undeclared fields, to prevent typos.
# tutorial/tutorial/items.py
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    description = scrapy.Field()
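The "protection against undeclared fields" can be sketched in plain Python. This toy Item is a stand-in for illustration only, not scrapy's actual implementation:

```python
class Item(dict):
    """Toy stand-in for scrapy.Item: a dict that only accepts declared fields."""
    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%s does not support field: %s'
                           % (type(self).__name__, key))
        super().__setitem__(key, value)

class DmozItem(Item):
    fields = ('title', 'link', 'description')

item = DmozItem()
item['title'] = 'Book 1'       # declared field: fine
try:
    item['titel'] = 'oops'     # typo in the field name -> KeyError
except KeyError as e:
    print(e)
```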
Extracting data
- Use the Scrapy shell to test scraping:
scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
- Scrapy provides XPath, CSS selectors and regular expressions to extract data
- Extracting the title using XPath:
title = sel.xpath('//title/text()').extract()
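Outside the Scrapy shell, the same idea can be tried with the stdlib ElementTree, which supports a small XPath subset (the page string here is made up):

```python
import xml.etree.ElementTree as ET

page = ("<html><head><title>Python Books</title></head>"
        "<body><ul><li><a href='/b1'>Book 1</a></li></ul></body></html>")
root = ET.fromstring(page)

# ElementTree only supports a limited XPath subset: './/title' finds the
# node, and .text then gives the equivalent of Scrapy's '//title/text()'.
title = root.find('.//title').text
links = [a.get('href') for a in root.findall('.//a')]

print(title)  # Python Books
print(links)  # ['/b1']
```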
Writing the first scraper
A spider is a class written by the user to scrape data from a website. Writing a spider is easy. Just follow these steps:
- Define the start_urls list
- Define the parse method in your spider
Full Spider
# tutorial/tutorial/spiders/spider.py
import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['description'] = sel.xpath('text()').extract()
            yield item
Unleash the Scrapy power!
2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2014-01-23 18:13:07-0400 [dmoz] INFO: Spider opened
2014-01-23 18:13:08-0400 [dmoz] DEBUG: Crawled (200) (referer: None)
2014-01-23 18:13:09-0400 [dmoz] DEBUG: Crawled (200) (referer: None)
2014-01-23 18:13:09-0400 [dmoz] INFO: Closing spider (finished)
Storing the scraped data
You have two choices:
- Use feed exports:
scrapy crawl dmoz -o items.json
- Define item pipelines
Item pipelines are a separate topic and will be covered in the future.
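The feed-export idea can be mimicked in plain Python: serialize the scraped items to JSON, which is what "scrapy crawl dmoz -o items.json" does for you (the items below are made up, shaped like the DmozItem fields):

```python
import json

# Hypothetical scraped items.
items = [
    {"title": "Book 1", "link": "/b1", "description": "A Python book"},
    {"title": "Book 2", "link": "/b2", "description": "Another one"},
]

# Serialize them as a JSON feed; in Scrapy this would land in items.json.
feed = json.dumps(items, indent=2)
print(feed)
```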
When not to use Scrapy
- You are just making a throwaway script
- You want to crawl a small number of pages
- You want something simple
- You want to reinvent the wheel and learn the basics
So what should you use?
- If your script only has to extract a small amount of information and you are not afraid of learning something new, use re.
- If you want to extract a lot of data and do not have a "pure Python" library requirement, use lxml.
- If you need to extract information from broken markup, use BeautifulSoup.
- If you want to scrape a lot of pages with a mature scraping framework, use Scrapy.
What do I prefer?
Honestly speaking, I prefer re and Scrapy. I started web scraping with BeautifulSoup as it was the easiest, then moved to lxml once I found BeautifulSoup slow. Then I used re for some time and fell in love with it. I use Scrapy only to build large scrapers or when I need to get a lot of data. I once used Scrapy to scrape 69,000 torrent links from a website.
What is youtube-dl?
It is a Python script that allows you to download videos and music from various websites, like:
- Facebook
- YouTube
- Vimeo
- Dailymotion
- Metacafe
- and almost 300 more!
Well, that was it!
I hope you learned something about web scraping. This was my first conference, so forgive me for any mistakes. If you want to talk to me later, meet me outside. If you want to ask something, don't hesitate, and I will try to answer.
Questions?
Facebook: fb.me/m.yasoob
Twitter: @yasoobkhalid
Blog: https://freepythontips.wordpress.com/
Email: yasoob.khld@gmail.com
Slides: http://slides.com/myasoobkhalid/web-scraping