AN INTRODUCTION TO WEB SCRAPING USING PYTHON

utkarsh2102

AGENDA

  • What is Web Scraping?
  • Useful libraries available.
  • Which library to use for which job?
  • Intro to various frameworks!
  • When and when not to use scrapy?
  • Conclusion!

WEB SCRAPING

WHAT IS IT?

Web scraping is a technique for gathering data or information from web pages. You could revisit your favourite website every time it publishes new information.

 

Or you could write a web scraper to have it do it for you!

WEB SCRAPING

WHAT IS IT?

It is a method to extract data from a website that does not have an API, or to extract a LOT of data that an API will not give us due to rate limiting.

 

Through web scraping, we can extract any data that we can see while browsing the web.

USAGE

WEB SCRAPING IN REAL LIFE

  • Extract product information.
  • Extract job postings and internships.
  • Extract offers and discounts from deal-of-the-day websites.
  • Crawl forums and social websites.
  • Extract data to make a search engine.
  • Gather weather data.
  • Etcetera.

ADVANTAGES

WEB SCRAPING VS USING AN API

  • Web Scraping is not rate limited.
  • You can access the website anonymously and gather data.
  • Some websites do not have an API.
  • Some data is not accessible through an API.
  • And many more!

WORKFLOW

ESSENTIAL PARTS OF WEB SCRAPING

Web Scraping follows this workflow:

  • Get the website - using an HTTP library.
  • Parse the HTML document - using a parsing library.
  • Store the results - in a database, CSV, text file, etc.
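The three steps above can be sketched in a few lines. This is a minimal, hedged example using requests and BeautifulSoup; the helper names and the idea of extracting link text are assumptions for illustration, not a fixed recipe:

```python
import csv

import requests  # HTTP library for the "get" step
from bs4 import BeautifulSoup


def fetch(url):
    """Get step: download the page with an HTTP library."""
    return requests.get(url, timeout=10).text


def parse_links(html_text):
    """Parse step: extract the text of every <a> tag in the document."""
    soup = BeautifulSoup(html_text, "html.parser")
    return [a.get_text() for a in soup.find_all("a")]


def store_csv(rows, path):
    """Store step: write one extracted item per row to a CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow([row])
```

Usage would look like `store_csv(parse_links(fetch("https://example.com")), "links.csv")`.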

LIBRARIES

USEFUL LIBRARIES AVAILABLE

  • BeautifulSoup (bs4)
  • lxml
  • re
  • scrapy

HTTP LIBRARIES

USEFUL LIBRARIES AVAILABLE

  • Requests
     
  • urllib/urllib2 (merged into urllib.request in Python 3)
     
  • httplib/httplib2 (httplib is http.client in Python 3)

 

html = requests.get('https://www.google.com').text
html = urllib2.urlopen('http://python.org/').read()
h = httplib2.Http(".cache")
(resp_headers, content) = h.request("http://pydelhi.org/", "GET") 

PARSING LIBRARIES

USEFUL LIBRARIES AVAILABLE

  • BeautifulSoup

     
  • lxml

     
  • re

 

tree = BeautifulSoup(html_doc, 'html.parser')
tree.title
tree = lxml.html.fromstring(html_doc)
title = tree.xpath('//title/text()')
title = re.findall('<title>(.*?)</title>', html_doc)

BEAUTIFUL SOUP

PROS AND CONS!

  • A beautiful API.

     
  • Very easy to use.
  • Can handle broken markup.
  • Purely in Python.
  • Slow :(
soup = BeautifulSoup(html_doc, 'html.parser')
last_a_tag = soup.find("a", id="link3")
all_b_tags = soup.find_all("b")
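A slightly fuller sketch of those calls, with a made-up document for illustration:

```python
from bs4 import BeautifulSoup

# A small, made-up document to search through.
html_doc = """
<html><body>
  <a href="https://example.com/1" id="link1">One</a>
  <a href="https://example.com/3" id="link3">Three</a>
  <b>bold one</b>
  <b>bold two</b>
</body></html>
"""

# The second argument picks the parser; html.parser ships with Python.
soup = BeautifulSoup(html_doc, "html.parser")

last_a_tag = soup.find("a", id="link3")  # find() returns the first match
all_b_tags = soup.find_all("b")          # find_all() returns every <b> tag

# Attribute access works like a dict: last_a_tag["href"]
```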

LXML

PROS AND CONS!

The lxml XML toolkit provides Pythonic bindings for the C libraries libxml2 and libxslt without sacrificing speed.

 

  • Very fast.
  • Not purely in Python.
  • If you have no "pure Python" requirement, use lxml.
  • lxml works with all Python versions from 2.x to 3.x.
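A minimal XPath sketch with lxml; the document and query are made up for illustration:

```python
from lxml import html

html_doc = "<html><head><title>PyDelhi Talks</title></head><body><p>Hello</p></body></html>"

tree = html.fromstring(html_doc)

# In XPath, // searches the whole tree and text() selects the text node.
title = tree.xpath("//title/text()")
paragraphs = tree.xpath("//p/text()")
```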

RE

PROS AND CONS!

re is the regular expression library for Python.

It is best used to extract only small amounts of text.

  • Requires you to learn its symbols, e.g.:
    . * $ ^ \b \w
  • Can become complex.
  • Purely in Python.
  • A part of the standard library.
  • Very fast.
  • Supports every Python version.
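For example, pulling a page title out with a single pattern (the document is made up for illustration):

```python
import re

html_doc = "<html><head><title>Web Scraping 101</title></head><body></body></html>"

# (.*?) is a non-greedy capture group: it stops at the first </title>.
titles = re.findall(r"<title>(.*?)</title>", html_doc)
```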

MASSIVE SCRAPING

WHAT TO DO?

  • You want to scrape millions of web pages every day.
  • You want to make a broad-scale web scraper.
  • You want to use something that is thoroughly tested.
  • Is there any solution?

SCRAPY

TO THE RESCUE!

  • Scrapy is very fast.

  • A full-blown, thoroughly tested framework.

  • Asynchronous.

  • Easy to use.

  • Has everything you need to start scraping.

  • Made in Python.

SCRAPY

WHEN TO USE?

  • When you have to scrape millions of pages.


     
  • When you want asynchronous support out of the box.
  • When you don't want to reinvent the wheel.
  • When you are not afraid to learn something new.

SCRAPY

WHEN NOT TO USE?

  • You are just making a throwaway script.
  • You want to crawl a small number of pages.
  • You want something simple.
  • You want to reinvent the wheel and learn the basics.

CONFUSED?

WHAT SHOULD YOU USE?

  • If you want to make a script that does not have to extract a lot of information, and you are not afraid of learning something new, then use re.
  • If you want to extract a lot of data and do not have a "pure Python" library requirement, then use lxml.
  • If you want to extract information from broken markup, then use BeautifulSoup.
  • If you want to scrape a lot of pages with a mature, thoroughly tested framework, then use Scrapy.
[utkarsh2102@karma ~]$ echo "Thank You! :D"
Thank You! :D

THANK YOU! :D

Web Scraping Using Python

By utkarsh2102

This slide deck was made for NTCC.
