Scraping with Python 101

Elad Silberring

(USN future CTO)

• Companies using Scrapy:

No one worth mentioning

• Scrapy community:

• About Scrapy:

An open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way.

Maintained by Scrapinghub and many other contributors.

Why use a scraper?

• Deals website

• LinkedIn candidate search

• Data(base) enrichment

• Better SEO

• Know your competition

• Fetch data

Spiders

Spider what?

The job of a spider is to fetch specific content from URLs and send it on for processing.

spiders/some_spider.py

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        # Extract the title of each post on the page
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').get()}

        # Follow the "next posts" link and parse it with this same method
        for next_page in response.css('a.next-posts-link'):
            yield response.follow(next_page, self.parse)

 

$ scrapy runspider spiders/some_spider.py

 


Stand-alone spider

Scrapy Framework

  • Spiders - parse responses and extract data
  • Items - models that define where scraped data is stored
  • Loaders - pass data through processors before/after assignment to an item
  • Pipelines - decide what happens to each scraped item (validate, deduplicate, store)
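A minimal sketch of how these pieces fit together; the names PostItem, PostLoader, and DropEmptyTitlePipeline are illustrative, not part of Scrapy:

import scrapy
from scrapy.loader import ItemLoader
from scrapy.exceptions import DropItem
from itemloaders.processors import MapCompose, TakeFirst  # scrapy.loader.processors on older versions

# Item - the model where scraped data is stored
class PostItem(scrapy.Item):
    title = scrapy.Field()

# Loader - passes data through processors before/after assignment to the item
class PostLoader(ItemLoader):
    default_item_class = PostItem
    title_in = MapCompose(str.strip)   # strip whitespace on input
    title_out = TakeFirst()            # keep the first extracted value on output

# Pipeline - what to do with the data (enable it via the ITEM_PIPELINES setting)
class DropEmptyTitlePipeline:
    def process_item(self, item, spider):
        if not item.get('title'):
            raise DropItem('missing title')
        return item

Inside a spider's parse(), the loader then replaces the plain dict from the earlier example:

    loader = PostLoader(response=response)
    loader.add_css('title', '.post-header>h2 a ::text')
    yield loader.load_item()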

Scrapy Toolbox

scrapy shell - an interactive console for running Scrapy requests and inspecting their output in real time:

scrapy shell 'https://scrapy.org'
>>> response.xpath('//title/text()').get()
'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'
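The same lookup with a CSS selector, for comparison:

>>> response.css('title::text').get()
'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'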

export - using the -O flag we can write the scraped data to a JSON file (overwriting it if it already exists), like so:

$ scrapy crawl quotes -O quotes.json
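The same export can be configured once in settings.py through the FEEDS setting (Scrapy 2.1+), so every crawl writes the file without the command-line flag; the file name here is just an example:

FEEDS = {
    'quotes.json': {
        'format': 'json',
        'overwrite': True,
    },
}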

settings - a wide set of options, including:

 

• AWS integration

• Concurrency (CONCURRENT_REQUESTS)

• Logging

• robots.txt options, such as the user agent, the parser, and whether to obey the rules
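A sample settings.py excerpt touching each of these areas; the values shown are illustrative, not recommendations:

# Concurrency
CONCURRENT_REQUESTS = 16    # Scrapy's default

# Logging
LOG_LEVEL = 'INFO'
LOG_FILE = 'scrapy.log'

# robots.txt handling
USER_AGENT = 'mybot (+https://example.com)'                # hypothetical identifier
ROBOTSTXT_OBEY = True                                      # respect robots.txt rules
ROBOTSTXT_PARSER = 'scrapy.robotstxt.ProtegoRobotParser'   # the default parser

# AWS integration (e.g. exporting feeds to S3)
AWS_ACCESS_KEY_ID = '...'
AWS_SECRET_ACCESS_KEY = '...'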

Scraper tutorial

$ scrapy startproject tutorial
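startproject generates a project skeleton roughly like this (exact files vary slightly between Scrapy versions):

tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders live here
            __init__.py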
