utkarsh2102
Web scraping is a technique for gathering data from web pages. You could revisit your favourite website every time it publishes new information.
Or you could write a web scraper to have it do it for you!
It is a way to extract data from a website that has no API, or to extract a LOT of data when an API's rate limits make that impractical.
Through web scraping we can extract any data that we can see while browsing the web.
Web Scraping follows this workflow:
r = requests.get('https://www.google.com').text
html = urllib.request.urlopen('http://python.org/').read()  # urllib2.urlopen in Python 2
h = httplib2.Http(".cache")
(resp_headers, content) = h.request("http://pydelhi.org/", "GET")
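The fetch step above can also be sketched with only the Python 3 standard library. This is a minimal sketch, not a recommended client: the function name `fetch_html`, the `User-Agent` value and the 10 second timeout are illustrative assumptions, not something from these slides.

```python
import urllib.request


def fetch_html(url, timeout=10.0):
    """Download `url` and return the response body decoded to text."""
    # Some sites reject the default Python User-Agent; this value is a
    # made-up placeholder, replace it with something that identifies you.
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-scraper/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html('http://python.org/')` would return the page source as a string, ready to hand to a parser.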
tree = BeautifulSoup(html_doc, 'html.parser')
tree.title
tree = lxml.html.fromstring(html_doc)
title = tree.xpath('//title/text()')
title = re.findall('<title>(.*?)</title>', html_doc)
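To make the regex approach concrete, here is a small runnable sketch. The `html_doc` string is a hypothetical sample page, and the extra flags are an assumption about what a real page might need (a title split over two lines, or upper-case tags).

```python
import re

# Hypothetical sample page, standing in for a downloaded document.
html_doc = """<html><head><title>PyDelhi
Meetup</title></head>
<body><p>Hello</p></body></html>"""

# re.DOTALL lets '.' match newlines, so a title split across lines still
# matches; re.IGNORECASE also accepts <TITLE>.
pattern = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

match = pattern.search(html_doc)
# Collapse internal whitespace so the multi-line title reads as one line.
title = " ".join(match.group(1).split()) if match else None
print(title)
```

A pattern like this works for one small, predictable piece of text, but it does not understand nesting or malformed markup, which is where a real parser helps.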
soup = BeautifulSoup(html_doc, 'html.parser')
last_a_tag = soup.find("a", id="link3")
all_b_tags = soup.find_all("b")
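For comparison, the same two lookups (one `a` tag by id, all `b` tags) can be sketched without any third-party library, using the standard library's `html.parser`. This is not BeautifulSoup's API; the `TagCollector` class and the sample `html_doc` are hypothetical, and the sketch mainly shows how much bookkeeping `find` and `find_all` save you.

```python
from html.parser import HTMLParser

# Hypothetical sample markup.
html_doc = (
    '<p><b>Bold one</b> <a id="link1" href="/a">A</a> '
    '<b>Bold two</b> <a id="link3" href="/c">C</a></p>'
)


class TagCollector(HTMLParser):
    """Collect the text of every <b> tag and the href of one <a> by id."""

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.bold_texts = []
        self.link_href = None
        self._in_b = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "b":
            # Start a new text buffer for this <b> element.
            self._in_b = True
            self.bold_texts.append("")
        elif tag == "a" and attrs.get("id") == self.wanted_id:
            self.link_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_b = False

    def handle_data(self, data):
        # Only keep text that appears inside a <b> element.
        if self._in_b:
            self.bold_texts[-1] += data


collector = TagCollector("link3")
collector.feed(html_doc)
collector.close()
print(collector.bold_texts, collector.link_href)
```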
PROS AND CONS!
The lxml XML toolkit provides Pythonic bindings for the C libraries libxml2 and libxslt without sacrificing speed.
PROS AND CONS!
re is the regex library for Python.
It is best used only to extract small amounts of text.
.  *  $  ^  \b  \w
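A quick sketch of what each of these metacharacters does, using `re.findall` on made-up sample strings (the variable names are only for illustration):

```python
import re

# '.' matches any single character except a newline.
dot = re.findall(r"c.t", "cat cot cut ct")

# '*' repeats the preceding item zero or more times.
star = re.findall(r"ab*", "a ab abbb")

# '^' anchors at the start of a line (every line with re.MULTILINE).
caret = re.findall(r"^\w+", "first one\nsecond two", re.MULTILINE)

# '$' anchors at the end of a line (every line with re.MULTILINE).
dollar = re.findall(r"\w+$", "first one\nsecond two", re.MULTILINE)

# '\b' matches a word boundary, so only the standalone word matches.
boundary = re.findall(r"\bcat\b", "cat concat cats cat")

# '\w' matches a letter, digit or underscore.
word = re.findall(r"\w+", "web_scraping 101!")

print(dot, star, caret, dollar, boundary, word)
```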
WHAT TO DO?
TO THE RESCUE!
Scrapy is very fast.
A full-blown, thoroughly tested framework.
Asynchronous.
Easy to use.
Has everything you need to start scraping.
Made in
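To illustrate what "asynchronous" buys a scraper, here is a rough sketch using the standard library's `asyncio`. It is not Scrapy's API and does not touch the network: `fake_fetch` is a hypothetical stand-in that just sleeps, so the point is only that the waits overlap instead of adding up.

```python
import asyncio


async def fake_fetch(url, delay):
    """Stand-in for a network request: wait `delay` seconds, return a fake body."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def crawl(urls):
    # gather() runs the coroutines concurrently and returns the results
    # in the same order as the input.
    return await asyncio.gather(*(fake_fetch(u, 0.05) for u in urls))


pages = asyncio.run(crawl(["page-1", "page-2", "page-3"]))
print(pages)
```

With three simulated pages the total wait is roughly one delay rather than three, which is the same idea a framework applies to real requests.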
WHEN TO USE?
WHEN NOT TO USE?
WHAT SHOULD YOU USE?
[utkarsh2102@karma ~]$ echo "Thank You! :D"
Thank You! :D