Scraping with Python 101

Elad Silberring

(USN future CTO)

• companies using scrapy:

No one worth mentioning

• scrapy community:

- 36.3k stars, 8.4k forks and 1.8k watchers on GitHub
- 5.1k followers on Twitter
- 14.7k questions on StackOverflow

• about scrapy:

An open source and collaborative framework for extracting the data you need from websites.

In a fast, simple, yet extensible way.

Maintained by Scrapinghub and many other contributors

why use a scraper?

• Deals website

• LinkedIn candidate search

• Data(base) enrichment

• Better SEO

• Know your competition

• Fetch data

Spiders

spider What?

The job of a spider is to fetch specific content from URLs and pass it on for processing

spiders/some_spider.py

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').get()}

        for next_page in response.css('a.next-posts-link'):
            yield response.follow(next_page, self.parse)

$ scrapy runspider spiders/some_spider.py


Stand-alone spider

Scrapy Framework

Spiders - parse data
Items - the model in which scraped data is stored
Loaders - pass data through processors before/after assignment to an item
Pipelines - what to do with the data
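
To make the Items and Pipelines roles concrete, here is a minimal sketch that uses only the standard library. It assumes a dataclass as the item type (Scrapy accepts dataclass items alongside dicts and scrapy.Item) and the conventional process_item(item, spider) pipeline hook. PostItem, its fields and CleanTitlePipeline are hypothetical names, and loaders are left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PostItem:
    """Item: the model holding one scraped record (hypothetical fields)."""
    title: Optional[str] = None
    url: Optional[str] = None


class CleanTitlePipeline:
    """Pipeline: decides what happens to each item a spider yields."""

    def process_item(self, item, spider):
        # Scrapy is assumed to call this hook once per scraped item.
        if item.title is not None:
            # Collapse runs of whitespace and trim both ends.
            item.title = " ".join(item.title.split())
        return item
```

In a project, a pipeline like this would typically be switched on through the ITEM_PIPELINES setting, and the spider would yield PostItem(...) instead of a plain dict.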

Scrapy Toolbox

scrapy shell - an interactive console for running Scrapy requests and seeing their output in real time

$ scrapy shell 'https://scrapy.org'
>>> response.xpath('//title/text()').get()
'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'

export - the -O option writes the scraped data to a JSON file, overwriting the file if it already exists:

$ scrapy crawl quotes -O quotes.json
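
The same export can probably also be declared in the project settings rather than on the command line. This is a sketch of a settings.py fragment, assuming a Scrapy version whose FEEDS setting supports the overwrite option (2.4 or later, as far as I know); the quotes.json path simply mirrors the command above.

```python
# settings.py fragment (sketch): declare the feed export in the project.
FEEDS = {
    "quotes.json": {
        "format": "json",    # serialize the scraped items as a JSON array
        "overwrite": True,   # replace an existing file, like -O does
    },
}
```

With a setting like this in place, a plain `scrapy crawl quotes` would be expected to write the file without any extra flag.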

settings - a wide set of options which include:

• AWS integration

• CONCURRENT_REQUESTS

• Logging

• robots.txt options such as user agent, parser and obey rules
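
As an illustration of the options listed above, a settings.py fragment might look like the sketch below. The setting names are standard Scrapy settings; the values, the bot name and the contact URL are placeholders chosen for this example, and reading the AWS credentials from environment variables is one possible approach rather than a requirement.

```python
# settings.py fragment (sketch): illustrative values only.
import os

# Upper bound on requests performed in parallel (Scrapy's default is 16).
CONCURRENT_REQUESTS = 8

# Logging: only show messages at INFO level and above.
LOG_LEVEL = "INFO"

# robots.txt handling: obey rules, user agent and parser implementation.
ROBOTSTXT_OBEY = True
USER_AGENT = "tutorial-bot (+https://example.com/contact)"  # placeholder
ROBOTSTXT_USER_AGENT = "tutorial-bot"  # name matched against robots.txt rules
ROBOTSTXT_PARSER = "scrapy.robotstxt.ProtegoRobotParser"

# AWS integration: credentials used for S3-backed feeds and storage.
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY")
```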

Scraper tutorial

$ scrapy startproject tutorial
