Web scraping in Python 101
By Yasoob Khalid
Who am I?
I am Muhammad Yasoob Ullah Khalid
- a programmer
- a high school student
- a blogger
- a Pythonista
- and a tea lover
My experience
- Creator of freepythontips
- Made a couple of open source programs
- A contributor to youtube-dl
- Teaching programming to my friends at my school
- This is my first conference!
What is this talk about?
- This talk is about Web Scraping
- Which libraries are available for the job
- Which library is best for which job
- An intro to Scrapy
- When and when NOT to use Scrapy
What is Web Scraping?
Web scraping (web harvesting or web data extraction) is a
computer software technique of extracting information from
websites. Usually, such software programs simulate human
exploration of the World Wide Web by either implementing
low-level Hypertext Transfer Protocol (HTTP), or embedding
a fully-fledged web browser, such as Internet Explorer or
Mozilla Firefox. - Wikipedia
In simple words
It is a method to extract data from a website that does not have an API, or to extract so much data that an API's rate limiting makes it impractical.
Through web scraping we can extract any data that we can see while browsing the web.
Usage of web scraping in real life
- Extracting product information
- Extracting job postings and internships
- Extracting offers and discounts from deal-of-the-day websites
- Crawling forums and social websites
- Extracting data to build a search engine
- Gathering weather data
- etc.
Advantages of Web Scraping over using an API
- Web Scraping is not rate limited
- You can access the website anonymously and gather data
- Some websites do not have an API
- Some data is not accessible through an API
- and many more!
Essential parts of Web Scraping
Web Scraping follows this workflow:
- Get the website - using an HTTP library
- Parse the HTML document - using any parsing library
- Store the results - in a database, CSV file, text file, etc.
We will focus more on parsing.
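The three steps above can be sketched end to end with nothing but the standard library. A hardcoded page stands in for the "get the website" step, the parse step uses the stdlib html.parser, and the store step writes a CSV row; this is a toy illustration, not a recommendation over the libraries discussed next:

```python
import csv
import io
from html.parser import HTMLParser

# Step 1 (stand-in): in real code you would fetch this with an HTTP library.
html_doc = "<html><head><title>Example Page</title></head><body></body></html>"

class TitleParser(HTMLParser):
    """Step 2: pull the <title> text out of the document."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = TitleParser()
parser.feed(html_doc)

# Step 3: store the result as a CSV row (here in memory, normally a file).
buf = io.StringIO()
csv.writer(buf).writerow(["title", parser.title])
print(buf.getvalue().strip())  # title,Example Page
```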
Libraries available for the job
Some of the most widely known libraries used for web scraping are:
- BeautifulSoup
- lxml
- re (not really for web scraping, I will explain later)
- Scrapy (a complete framework)
Some HTTP libraries for web scraping
# requests
html = requests.get('https://www.google.com').text
# urllib2 (Python 2; urllib.request in Python 3)
html = urllib2.urlopen('http://python.org/').read()
# httplib2
h = httplib2.Http(".cache")
(resp_headers, content) = h.request("http://example.org/", "GET")
Parsing libraries
# BeautifulSoup
tree = BeautifulSoup(html_doc)
tree.title
# lxml
tree = lxml.html.fromstring(html_doc)
title = tree.xpath('//title/text()')
# re
title = re.findall('<title>(.*?)</title>', html_doc)
BeautifulSoup
soup = BeautifulSoup(html_doc)
last_a_tag = soup.find("a", id="link3")
all_b_tags = soup.find_all("b")
- very easy to use
- can handle broken markup
- written purely in Python
- slow :(
lxml
The lxml XML toolkit provides Pythonic bindings for the C libraries libxml2 and libxslt without sacrificing speed.
- very fast
- not written purely in Python
- if you have no "pure Python" requirement, use lxml
- works with all Python versions from 2.4 to 3.3
re
re is the regular expression library for Python. It is used only to extract minute amounts of text; parsing entire HTML documents is not possible with regular expressions. Its unpopularity is due to the fact that it:
- requires you to learn its symbols, e.g. '.*?', '\d', '[a-z]'
However, it is:
- baked purely into Python
- a part of the standard library
- very fast - I will show later
- available in every Python version
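A minimal sketch of re extraction in practice (Python 3; the page string here is made up):

```python
import re

html_doc = ("<html><head><title>My Blog</title></head>"
            "<body><a href='/a'>A</a></body></html>")

# The non-greedy group (.*?) captures just the text between the tags.
title = re.findall(r'<title>(.*?)</title>', html_doc)[0]

# The same idea extracts every href attribute value.
links = re.findall(r"href='(.*?)'", html_doc)

print(title)  # My Blog
print(links)  # ['/a']
```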
Comparison of lxml, re and BeautifulSoup
import re
import time
import urllib2
from bs4 import BeautifulSoup
from lxml import html as lxmlhtml
def timeit(fn, *args):
    t1 = time.time()
    for i in range(100):
        fn(*args)
    t2 = time.time()
    print '%s took %0.3f ms' % (fn.func_name, (t2-t1)*1000.0)

def bs_test(html):
    soup = BeautifulSoup(html)
    return soup.html.head.title

def lxml_test(html):
    tree = lxmlhtml.fromstring(html)
    return tree.xpath('//title')[0].text_content()

def regex_test(html):
    return re.findall('<title>(.*?)</title>', html)[0]

if __name__ == '__main__':
    url = 'http://freepythontips.wordpress.com/'
    html = urllib2.urlopen(url).read()
    for fn in (bs_test, lxml_test, regex_test):
        timeit(fn, html)
The result
yasoob@yasoob:~/Desktop$ python test.py
bs_test took 1851.457 ms
lxml_test took 232.942 ms
regex_test took 7.186 ms
- lxml took ~32x more time than re
- BeautifulSoup took ~258x more time than re
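For reference, a Python 3 version of the same timing harness, restricted to the stdlib regex test (bs4 and lxml omitted so the sketch stays dependency-free; the page string is made up):

```python
import re
import time

html_doc = "<html><head><title>Hello</title></head><body></body></html>"

def timeit(fn, *args):
    """Run fn 100 times and return the total elapsed milliseconds."""
    t1 = time.time()
    for _ in range(100):
        fn(*args)
    return (time.time() - t1) * 1000.0

def regex_test(html):
    return re.findall(r'<title>(.*?)</title>', html)[0]

elapsed = timeit(regex_test, html_doc)
print('regex_test took %0.3f ms' % elapsed)
```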
What to do when your scraping needs are very high?
- You want to scrape millions of web pages every day.
- You want to make a broad-scale web scraper.
- You want to use something that is thoroughly tested.
- Is there any solution?
Yes, there is a solution!
One word: Scrapy!
How does Scrapy compare to BeautifulSoup or lxml?
BeautifulSoup and lxml are libraries for parsing HTML and XML. Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them. In other words, comparing BeautifulSoup (or lxml) to Scrapy is like comparing jinja2 to Django - scrapy docs.
Scrapy only supports Python 2.7, NOT 3.x
When to use Scrapy
- When you have to scrape millions of pages
- When you want asynchronous support out of the box
- When you don't want to reinvent the wheel
- When you are not afraid to learn something new
If you are not willing to risk the unusual, you will have to settle for the ordinary – Jim Rohn
Starting out with Scrapy
The workflow in Scrapy:
- Define a scraper
- Define the items you are going to extract
- Define the item pipeline (optional)
- Run the scraper
Note: I will just demonstrate the basic building blocks of Scrapy. In Scrapy, a scraper is called a spider.
Using the Scrapy command-line tool
scrapy startproject tutorial
- This will create the following directory and files:
tutorial
├── scrapy.cfg
└── tutorial
├── __init__.py
├── items.py
├── pipelines.py
├── settings.py
└── spiders
└── __init__.py
- A project can have multiple spiders
What is an Item?
Items are containers that will be loaded with the scraped data. They work like simple Python dicts but provide additional protection against populating undeclared fields, to prevent typos.
# tutorial/tutorial/items.py
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    description = scrapy.Field()
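The "protection against undeclared fields" can be sketched in plain Python. This toy Item is a stand-in for illustration only, not scrapy's actual implementation:

```python
class Item(dict):
    """Toy stand-in for scrapy.Item: a dict that only accepts declared fields."""
    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%s does not support field: %s'
                           % (type(self).__name__, key))
        super().__setitem__(key, value)

class DmozItem(Item):
    fields = ('title', 'link', 'description')

item = DmozItem()
item['title'] = 'Book 1'       # declared field: fine
try:
    item['titel'] = 'oops'     # typo in the field name -> KeyError
except KeyError as e:
    print(e)
```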
Extracting data
- Use the Scrapy shell to test scraping:
scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
- Scrapy provides XPath, CSS selectors and regular expressions to extract data
- Extracting the title using XPath:
title = sel.xpath('//title/text()').extract()
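Outside the Scrapy shell, the same idea can be tried with the stdlib ElementTree, which supports a small XPath subset (the page string here is made up):

```python
import xml.etree.ElementTree as ET

page = ("<html><head><title>Python Books</title></head>"
        "<body><ul><li><a href='/b1'>Book 1</a></li></ul></body></html>")
root = ET.fromstring(page)

# ElementTree only supports a limited XPath subset: './/title' finds the
# node, and .text then gives the equivalent of Scrapy's '//title/text()'.
title = root.find('.//title').text
links = [a.get('href') for a in root.findall('.//a')]

print(title)  # Python Books
print(links)  # ['/b1']
```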
Writing the first scraper
A spider is a class written by the user to scrape data from a website. Writing a spider is easy. Just follow these steps:
- Define the start_urls list
- Define the parse method in your spider
Full Spider
# tutorial/tutorial/spiders/spider.py
import scrapy
from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['description'] = sel.xpath('text()').extract()
            yield item
Unleash the Scrapy power!
2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2014-01-23 18:13:07-0400 [dmoz] INFO: Spider opened
2014-01-23 18:13:08-0400 [dmoz] DEBUG: Crawled (200) (referer: None)
2014-01-23 18:13:09-0400 [dmoz] DEBUG: Crawled (200) (referer: None)
2014-01-23 18:13:09-0400 [dmoz] INFO: Closing spider (finished)
Storing the scraped data
You have two choices:
- Use feed exports:
scrapy crawl dmoz -o items.json
- Define item pipelines
Item pipelines are a separate topic and will be covered in the future.
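The feed-export idea can be mimicked in plain Python: serialize the scraped items to JSON, which is what "scrapy crawl dmoz -o items.json" does for you (the items below are made up, shaped like the DmozItem fields):

```python
import json

# Hypothetical scraped items.
items = [
    {"title": "Book 1", "link": "/b1", "description": "A Python book"},
    {"title": "Book 2", "link": "/b2", "description": "Another one"},
]

# Serialize them as a JSON feed; in Scrapy this would land in items.json.
feed = json.dumps(items, indent=2)
print(feed)
```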
When not to use Scrapy
- You are just making a throwaway script
- You want to crawl a small number of pages
- You want something simple
- You want to reinvent the wheel and learn the basics
So what should you use?
- If your script only has to extract a small amount of information and you are not afraid of learning something new, use re.
- If you want to extract a lot of data and do not have a "pure Python" library requirement, use lxml.
- If you need to extract information from broken markup, use BeautifulSoup.
- If you want to scrape a lot of pages with a mature scraping framework, use Scrapy.
What do I prefer?
Honestly speaking, I prefer re and Scrapy. I started web scraping with BeautifulSoup as it was the easiest, then moved to lxml once I found BeautifulSoup slow. Then I used re for some time and fell in love with it. I use Scrapy only to build large scrapers or when I need to get a lot of data. I once used Scrapy to scrape 69,000 torrent links from a website.
What is youtube-dl?
It is a Python script that allows you to download videos and music from various websites, like:
- Facebook
- YouTube
- Vimeo
- Dailymotion
- Metacafe
- and almost 300 more!
Well, that was it!
I hope you learned something about web scraping. This was my first conference, so forgive me for any mistakes. If you want to talk to me later, meet me outside. If you want to ask something, don't hesitate, and I will try to answer.
Questions?
Facebook: fb.me/m.yasoob
Twitter: @yasoobkhalid
Blog: https://freepythontips.wordpress.com/
Email: yasoob.khld@gmail.com
Slides: http://slides.com/myasoobkhalid/web-scraping