Manoj Pandey
...
manojpandey96
What/Why Web Scraping
Scraping vs APIs
Useful libraries available
Which library to use for which job
What is Scrapy Framework
When and when not to use it!
Legalities ( ͡° ͜ʖ ͡°)
Web scraping is a technique for gathering data or information from web pages.
It is a way to extract data from websites that do not offer an API, or when we want to extract a LOT of data that an API's rate limits will not allow.
"If you can see it, you can have it as well"
Web scraping typically follows this workflow: fetch the page, parse the HTML, extract the data, store it.
We will focus more on parsing.
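Here is a minimal end-to-end sketch of that workflow using requests and BeautifulSoup (the URL and the output file name are just placeholders):

import csv
import requests
from bs4 import BeautifulSoup

# Fetch: download the raw HTML
html = requests.get('http://pydelhi.org').text

# Parse: build a tree out of the markup
soup = BeautifulSoup(html, 'html.parser')

# Extract: collect every link's text and href
rows = [(a.get_text(strip=True), a.get('href')) for a in soup.find_all('a')]

# Store: write the extracted data to a CSV file
with open('links.csv', 'w') as f:
    csv.writer(f).writerows(rows)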
# requests
r = requests.get('https://www.google.com')
html = r.text

# urllib2 (Python 2 standard library)
html = urllib2.urlopen('http://python.org/').read()

# httplib2, with a local cache directory
h = httplib2.Http(".cache")
(resp_headers, content) = h.request("http://python.org/", "GET")
tree = BeautifulSoup(html_doc, 'html.parser')
tree.title
tree = lxml.html.fromstring(html_doc)
title = tree.xpath('//title/text()')
title = re.findall('<title>(.*?)</title>', html_doc)
soup = BeautifulSoup(html_doc, 'html.parser')
last_a_tag = soup.find("a", id="link3")
all_b_tags = soup.find_all("b")
The lxml XML toolkit provides Pythonic bindings for the C libraries libxml2 and libxslt without sacrificing speed.
re is the built-in regular-expression library for Python.
It should be used only for extracting small amounts of text.
Common metacharacters: . * $ ^ \b \w
Example: extract http://.../ from:
<a class="mylink" href="http://.../" ... >
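A quick sketch of doing exactly that with re (the full tag below is a made-up stand-in):

import re

html_doc = '<a class="mylink" href="http://example.com/">a link</a>'
match = re.search(r'<a[^>]*href="([^"]*)"', html_doc)
if match:
    print(match.group(1))  # -> http://example.com/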
import re
import time
import urllib2
from bs4 import BeautifulSoup
from lxml import html as lxmlhtml

def timeit(fn, *args):
    # run the parser 100 times and report the total time in ms
    t1 = time.time()
    for i in range(100):
        fn(*args)
    t2 = time.time()
    print '%s took %0.3f ms' % (fn.func_name, (t2 - t1) * 1000.0)

def bs_test(html):
    soup = BeautifulSoup(html, 'html.parser')
    return soup.html.head.title

def lxml_test(html):
    tree = lxmlhtml.fromstring(html)
    return tree.xpath('//title')[0].text_content()

def regex_test(html):
    return re.findall('<title>(.*?)</title>', html)[0]

if __name__ == '__main__':
    url = 'http://pydelhi.org'
    html = urllib2.urlopen(url).read()
    for fn in (bs_test, lxml_test, regex_test):
        timeit(fn, html)
manoj@manoj:~/Desktop$ python test.py
bs_test took 1851.457 ms
lxml_test took 232.942 ms
regex_test took 7.186 ms
Speed: Very fast.
A full-blown, thoroughly tested framework.
Asynchronous.
Easy to use.
Customizable.
No need to reinvent the wheel.
Made in Python.
The workflow in Scrapy: create a project, define the items, write a spider, run the crawler, and store the scraped items.
$ scrapy startproject pycon
pycon
├── scrapy.cfg
└── pycon
├── __init__.py
├── items.py
├── pipelines.py
├── settings.py
└── spiders
└── __init__.py
Items are containers that will be loaded with the scraped data. They work like simple Python dicts but provide additional protection against populating undeclared fields, preventing typos.
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    description = scrapy.Field()
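Items behave like dicts, but assigning a field that was never declared fails loudly. A small sketch of that behaviour (the values are placeholders):

item = DmozItem(title='Some Python Book')
item['link'] = 'http://example.com/'
print(item['title'])   # 'Some Python Book'
item['url'] = 'oops'   # raises KeyError: DmozItem does not support field: url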
$ scrapy shell http://dmoztools.net/Computers/Programming/Languages/Python/Books/
title = sel.xpath('//title/text()').extract()
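Other expressions can be tried interactively in the same session (sel is the Selector the older Scrapy shell preloads; newer versions expose response instead):

links = sel.xpath('//ul/li/a/@href').extract()
descriptions = sel.xpath('//ul/li/text()').extract()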
A spider is a class written by the user to scrape data from a website. Writing one is easy: subclass scrapy.Spider, give it a name and some start URLs, and implement a parse() method.
import scrapy
from pycon.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoztools.net"]
    start_urls = [
        "http://dmoztools.net/Computers/Programming/Languages/Python/Books/",
        "http://dmoztools.net/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # each <li> under a <ul> holds one book/resource entry
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['description'] = sel.xpath('text()').extract()
            yield item
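parse() can also yield new requests alongside items, which is how a spider follows links. A sketch of pagination inside the same spider (the "next page" XPath here is hypothetical, not taken from this site):

    def parse(self, response):
        # ... yield items as above, then follow the next page, if any
        next_page = response.xpath('//a[@class="next"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)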
2016-02-28 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: pycon)
2016-02-28 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2016-02-28 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2016-02-28 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2016-02-28 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2016-02-28 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2016-02-28 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2016-02-28 18:13:07-0400 [dmoz] INFO: Spider opened
2016-02-28 18:13:08-0400 [dmoz] DEBUG: Crawled (200) <GET http://dmoztools.net/Computers/Programming/Languages/Python/Books/> (referer: None)
2016-02-28 18:13:09-0400 [dmoz] DEBUG: Crawled (200) <GET http://dmoztools.net/Computers/Programming/Languages/Python/Resources/> (referer: None)
2016-02-28 18:13:09-0400 [dmoz] INFO: Closing spider (finished)
You have two choices when running the spider: just crawl, or crawl and also export the items to a file:

$ scrapy crawl dmoz
$ scrapy crawl dmoz -o items.json
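For heavier post-processing than a simple feed export, items can also go through an item pipeline (pycon/pipelines.py). A minimal, hypothetical sketch that drops items with no title:

from scrapy.exceptions import DropItem

class RequireTitlePipeline(object):
    def process_item(self, item, spider):
        # drop any item whose 'title' field is missing or empty
        if not item.get('title'):
            raise DropItem('missing title')
        return item

# enable it in settings.py:
# ITEM_PIPELINES = {'pycon.pipelines.RequireTitlePipeline': 300}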
manojpandey96
manojpandey
manojpandey1996