An Introduction to Scrapy
Context: Cinebot
Problem: Welcome to the UK
- No single source for showtimes data
- Many different cinema companies
- No API to get their data
Solution: Scraping
LET'S GET THE DATA
ON THE GEMBA
Starting small with Curzon cinemas
Limited number of theaters
Decent movie listing
My flatmate can get me free tickets
Curzon "Now Showing" page
Data needed
{
  "place": {
    "name": "Le Cinéma des Cinéastes",
    "city": "Paris 17e arrondissement",
    "postalCode": "75017"
  },
  "movie": {
    "title": "Moonlight",
    "language": "Anglais",
    "vo": true,
    "posterUrl": "http://images.allocine.fr/pictures/17/01/26/09/46/162340.jpg",
    "pressRating": 4.18421,
    "userRating": 4.07561159,
    "url": "http://www.allocine.fr/film/fichefilm_gen_cfilm=242054.html",
    "is3D": false,
    "releaseDate": "2017-02-01",
    "trailerUrl": "http://www.allocine.fr/video/player_gen_cmedia=19565733&cfilm=242054.html"
  },
  "date": "2017-02-24",
  "times": {
    "13:15": "https://tickets.allocine.fr/paris-le-brady/reserver/F191274/D1488024900/VO",
    "17:30": "https://tickets.allocine.fr/paris-le-brady/reserver/F191274/D1488040200/VO",
    "19:45": "https://tickets.allocine.fr/paris-le-brady/reserver/F191274/D1488048300/VO",
    "22:00": "https://tickets.allocine.fr/paris-le-brady/reserver/F191274/D1488056400/VO"
  },
  "_geoloc": {
    "lat": 48.883658,
    "lng": 2.327202
  }
}
Scrapy: a Python framework for web-scraping
- Python
- Elegant data flow to write reusable code
- Asynchronous
Declaring Items
Creating a Pipeline to Algolia
# pipelines.py
import time

from algoliasearch import algoliasearch

TODAY = time.strftime('%Y-%m-%d')


class CurzonScraperPipeline(object):
    def open_spider(self, spider):
        # Connect to Algolia once, when the spider starts
        self.client = algoliasearch.Client('Th30d0', 'L34n1s4J0uRn3Y')
        self.index = self.client.init_index('pariscine_seances')

    def process_item(self, item, spider):
        # Only index today's showtimes; always pass the item along
        if item['date'] == TODAY:
            self.index.add_object(item)
        return item

# settings.py
ITEM_PIPELINES = {
    'curzon_scraper.pipelines.CurzonScraperPipeline': 300,
}
Multiple pipelines
# pipelines.py
from scrapy.exceptions import DropItem


class CurzonScraperDateFilterPipeline(object):
    def process_item(self, item, spider):
        # Keep today's showtimes, drop everything else
        if item.get('date') == TODAY:
            return item
        else:
            raise DropItem('Dropping showtime %s not for today' % item)


class CurzonScraperAlgoliaPipeline(object):
    def open_spider(self, spider):
        self.client = algoliasearch.Client('Th30d0', 'L34n1s4J0uRn3Y')
        self.index = self.client.init_index('pariscine_seances')

    def process_item(self, item, spider):
        self.index.add_object(item)
        return item

# settings.py (lower numbers run first, so the date filter runs before Algolia)
ITEM_PIPELINES = {
    'curzon_scraper.pipelines.CurzonScraperDateFilterPipeline': 300,
    'curzon_scraper.pipelines.CurzonScraperAlgoliaPipeline': 350,
}
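To see why the filter gets 300 and the Algolia pipeline 350, here is a simplified model of the chain. It is not Scrapy's actual implementation: DropItem is a local stand-in for scrapy.exceptions.DropItem, and run_pipelines and RecordingPipeline are hypothetical helpers that only illustrate the ordering.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (sketch only)."""


TODAY = '2017-02-24'  # fixed date so the demo is reproducible


class DateFilterPipeline:
    def process_item(self, item, spider):
        if item.get('date') == TODAY:
            return item
        raise DropItem('not for today')


class RecordingPipeline:
    """Stands in for the Algolia pipeline: collects items instead of indexing."""

    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        self.seen.append(item)
        return item


def run_pipelines(item, pipelines):
    """Pass an item through (priority, pipeline) pairs, lowest number first."""
    for _, pipeline in sorted(pipelines, key=lambda pair: pair[0]):
        try:
            item = pipeline.process_item(item, spider=None)
        except DropItem:
            return None  # a dropped item never reaches later pipelines
    return item


recorder = RecordingPipeline()
pipelines = [(350, recorder), (300, DateFilterPipeline())]
kept = run_pipelines({'date': TODAY, 'movie': 'Moonlight'}, pipelines)
dropped = run_pipelines({'date': '2017-02-25', 'movie': 'Moonlight'}, pipelines)
```

Only the item dated today reaches the recorder, even though the recorder is listed first: the priority number decides the order, not the position in the settings.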
Creating a Spider
import scrapy, json, time, re
from curzon_scraper.items import *


class CurzonSpider(scrapy.Spider):
    name = 'curzon'
    start_urls = ['https://www.curzoncinemas.com/bloomsbury/now-showing']

    def parse(self, response):
        yield {
            'date': '',
            'movie': {},
            '_geoloc': {},
            'place': {},
            'times': {},
        }
Selectors: CSS vs XPath
Goal | CSS 3 | XPath |
---|---|---|
All Elements | * | //* |
All div Elements | div | //div |
All child elements of a div | div > * | //div/* |
Element By ID | #foo | //*[@id='foo'] |
Element By Class | .foo | //*[contains(@class,'foo')] |
Element With Attribute | *[title] | //*[@title] |
All <div> with an <a> child | Not possible | //div[a] |
First Child of All <div> | div > *:first-child | //div/*[1] |
Next Element | div + * | //div/following-sibling::*[1] |
In a Nutshell
- CSS is faster
- Easier to guess from page-inspection
- More readable
CSS
LESS COOLEST
Live Scraping
Questions?
By Jeremy Gotteland