Manoj Pandey (manojpandey96)
Web scraping is a technique for gathering data or information from web pages. You could revisit your favorite website every time it updates with new information.
Or you could write a web scraper to do it for you!
It is especially useful when a website does not offer an API, or when you need far more data than an API allows because of rate limiting.
Through web scraping we can extract any data that we can see while browsing the web.
Web scraping follows a simple workflow: fetch the page, parse the HTML, and extract the data you need.
We will focus mostly on parsing.
r = requests.get('https://www.google.com').text              # requests: the body is in .text, not .html
html = urllib2.urlopen('http://python.org/').read()          # urllib2 (Python 2 standard library)
h = httplib2.Http(".cache")                                   # httplib2, with a local cache directory
(resp_headers, content) = h.request("http://pydelhi.org/", "GET")
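The parsing snippets below all work on html_doc, a string holding the fetched page. For example, reusing requests from above (the URL is just an illustration):

html_doc = requests.get('http://pydelhi.org').text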
tree = BeautifulSoup(html_doc)
tree.title                                    # the <title> tag
tree = lxml.html.fromstring(html_doc)
title = tree.xpath('//title/text()')          # '//' searches the whole tree, not just the root
title = re.findall('<title>(.*?)</title>', html_doc)
soup = BeautifulSoup(html_doc)
last_a_tag = soup.find("a", id="link3")       # the first <a> tag whose id is "link3"
all_b_tags = soup.find_all("b")               # every <b> tag in the document
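If the page actually contains such tags, you can drill into the returned objects; a small sketch reusing the variables above:

last_a_tag.get_text()        # the text inside the tag
last_a_tag['href']           # an attribute value
last_a_tag.parent.name       # the name of the enclosing tag
for b in all_b_tags:
    print(b.get_text())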
The lxml XML toolkit provides Pythonic bindings for the C libraries libxml2 and libxslt without sacrificing speed.
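For example, the same kind of tree can pull out every link on the page; a small sketch (the queries are only illustrations):

from lxml import html as lxmlhtml

tree = lxmlhtml.fromstring(html_doc)
hrefs = tree.xpath('//a/@href')            # every link target on the page
title = tree.xpath('//title/text()')       # the same title query as before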
re is Python's regular expression library.
It is best suited for extracting small, well-delimited pieces of text.
Common metacharacters: .  *  $  ^  \b  \w
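For example, re.findall() with a non-greedy pattern can pull every link target out of a snippet of markup (the string here is made up):

import re

snippet = '<a href="http://pydelhi.org">PyDelhi</a> <a href="http://python.org">Python</a>'
links = re.findall(r'href="(.*?)"', snippet)
# ['http://pydelhi.org', 'http://python.org']

To see how the three approaches compare, the following script times each of them on the same page: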
import re
import time
import urllib2
from bs4 import BeautifulSoup
from lxml import html as lxmlhtml
def timeit(fn, *args):
    t1 = time.time()
    for i in range(100):
        fn(*args)
    t2 = time.time()
    print '%s took %0.3f ms' % (fn.func_name, (t2 - t1) * 1000.0)

def bs_test(html):
    soup = BeautifulSoup(html)
    return soup.html.head.title

def lxml_test(html):
    tree = lxmlhtml.fromstring(html)
    return tree.xpath('//title')[0].text_content()

def regex_test(html):
    return re.findall('<title>(.*?)</title>', html)[0]

if __name__ == '__main__':
    url = 'http://pydelhi.org'
    html = urllib2.urlopen(url).read()
    for fn in (bs_test, lxml_test, regex_test):
        timeit(fn, html)
manoj@manoj:~/Desktop$ python test.py
bs_test took 1851.457 ms
lxml_test took 232.942 ms
regex_test took 7.186 ms
Scrapy is very fast.
A full-blown, thoroughly tested framework.
Asynchronous.
Easy to use.
Has everything you need to start scraping.
Made in Python.
The workflow in Scrapy: create a project, define the Items you want to extract, write a Spider to crawl and parse pages, then run the crawl and store the items.
$ scrapy startproject pycon
pycon
├── scrapy.cfg
└── pycon
├── __init__.py
├── items.py
├── pipelines.py
├── settings.py
└── spiders
└── __init__.py
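The generated settings.py holds crawl-wide configuration; a minimal sketch of the kind of values you might touch (the values are illustrative):

# pycon/settings.py (illustrative values)
BOT_NAME = 'pycon'
SPIDER_MODULES = ['pycon.spiders']
NEWSPIDER_MODULE = 'pycon.spiders'
DOWNLOAD_DELAY = 1        # be polite: pause between requests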
Items are containers that will be loaded with the scraped data. They work like simple Python dicts but provide additional protection against populating undeclared fields, preventing typos.
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    description = scrapy.Field()
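Items behave like dicts, and assigning a field that was not declared fails immediately, which is the typo protection mentioned above. A quick sketch with made-up values:

item = DmozItem(title='Python Books', link='http://example.com')
item['description'] = 'An example description'
item['publisher'] = 'X'   # raises KeyError: DmozItem does not support field: publisher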
The Scrapy shell lets you try selectors interactively against a live page:
$ scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
title = sel.xpath('//title/text()').extract()   # sel is a Selector for the fetched page
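A couple more queries worth trying in the same shell session (these selectors are only examples):

links = sel.xpath('//a/@href').extract()        # every link on the page
title_css = sel.css('title::text').extract()    # the same title query with CSS selectors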
A spider is a class written by the user to scrape data from a website. Writing one is easy: subclass scrapy.Spider, give it a name, its allowed domains and start URLs, and implement a parse() method that extracts items from each response.
import scrapy
from pycon.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['description'] = sel.xpath('text()').extract()
            yield item

Running the spider produces log output like this:
2016-02-28 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: pycon)
2016-02-28 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2016-02-28 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2016-02-28 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2016-02-28 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2016-02-28 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2016-02-28 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2016-02-28 18:13:07-0400 [dmoz] INFO: Spider opened
2016-02-28 18:13:08-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2016-02-28 18:13:09-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2016-02-28 18:13:09-0400 [dmoz] INFO: Closing spider (finished)
For running the spider and collecting its output you have two choices: dump the scraped items to the console, or export them to a file such as JSON:
$ scrapy crawl dmoz
$ scrapy crawl dmoz -o items.json
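Besides exporting with -o, scraped items can also be post-processed in the pipelines.py generated above. A minimal sketch of a hypothetical pipeline that drops items with no title (it would be enabled via ITEM_PIPELINES in settings.py):

# pycon/pipelines.py -- hypothetical example
from scrapy.exceptions import DropItem

class RequireTitlePipeline(object):
    def process_item(self, item, spider):
        if not item.get('title'):
            raise DropItem("missing title")
        return item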