Python Web Scraping 102


Gaurav Dadhania
@GVRV

use requests

Dev-friendly, better API, 99% of the time works all the time

>>> import requests
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
u'{"type":"User"...'
>>> r.json()
{u'private_gists': 419, u'total_private_repos': 77, ...}

http://docs.python-requests.org/en/latest/

use Beautiful Soup 4 (bs4)

Parses HTML (and XML) documents, good API, excellent docs

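from bs4 import BeautifulSoup

# A small document so that every lookup below has something to find
html = """<html><head><title>Some blob of HTML</title></head>
<body><p class="intro">Hi</p>
<a class="download" href="/file.zip">Get it</a>
<div id="bank-details">...</div></body></html>"""
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

soup.title                # <title>Some blob of HTML</title>
soup.title.string         # 'Some blob of HTML'
soup.title.parent.name    # 'head'
soup.p['class']           # ['intro']
soup.find_all('a', attrs={'class': 'download'})  # every matching <a> tag
soup.find(id="bank-details")                     # first tag with that id
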
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
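
Putting those lookups to work, here is a minimal sketch that pulls every download link out of a page and makes the URLs absolute. The function name `download_links`, the `download` class, and the example URLs are assumptions for illustration, not taken from any real site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def download_links(html, base_url):
    """Return absolute URLs for every <a class="download"> in html."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", attrs={"class": "download"}):
        href = a.get("href")
        if href:  # skip anchors that have no href at all
            links.append(urljoin(base_url, href))
    return links


# Hypothetical page fragment: one download link, one ordinary link
page = '<a class="download" href="/files/a.zip">A</a><a href="/about">About</a>'
print(download_links(page, "http://example.com/"))
```

Resolving against the page URL with `urljoin` matters because scraped `href` values are often relative.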

Requests Sessions

aka "What do I do if I need to log in first?"

import requests

s = requests.Session()

# The cookie set by the first request is sent along with the second
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get("http://httpbin.org/cookies")

print(r.text)
# '{"cookies": {"sessioncookie": "123456789"}}'

http://docs.python-requests.org/en/latest/user/advanced/#session-objects
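
For the log-in case itself, a rough sketch: POST the credentials through a Session, and every later request on that Session carries the cookies the server set. The function name `log_in`, the URLs, and the form field names (`username`, `password`) are hypothetical; read the real ones off the site's login form.

```python
import requests


def log_in(login_url, username, password, session=None):
    """POST credentials through a session and return that session."""
    session = session or requests.Session()
    # Field names are an assumption; copy the real ones from the login form.
    payload = {"username": username, "password": password}
    response = session.post(login_url, data=payload, timeout=10)
    response.raise_for_status()  # fail loudly on a 4xx/5xx answer
    return session


# Usage (hypothetical URLs, left commented out so nothing hits the network):
# s = log_in("http://example.com/login", "user", "pass")
# r = s.get("http://example.com/account")  # login cookie is sent automatically
```

Note that a 200 response does not prove the login worked; many sites answer 200 with an error page, so it may be worth checking the returned HTML as well.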

USE CHROME DEVTOOLS

aka "Why isn't my scraper working?!"
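
One common culprit (an assumption here, and not the only one) is that the scraper sends different headers than the browser does. The sketch below prints what requests would send, without touching the network, so it can be compared against the Request Headers shown in the DevTools Network tab. The helper name, the URL, and the User-Agent string are placeholders.

```python
import requests


def headers_to_be_sent(session, url):
    """Return the headers the session would send for a GET, without sending it."""
    prepared = session.prepare_request(requests.Request("GET", url))
    return dict(prepared.headers)


s = requests.Session()
before = headers_to_be_sent(s, "http://example.com/")
print(before["User-Agent"])  # by default requests identifies itself here

# Copy a header across from DevTools (this value is only a placeholder)
s.headers.update({"User-Agent": "Mozilla/5.0 (placeholder copied from DevTools)"})
after = headers_to_be_sent(s, "http://example.com/")
print(after["User-Agent"])
```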


USE CHROME DEVTOOLS

aka "Where do I get the CSRF token from?!"
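
Often the token sits in a hidden form field, which the DevTools Elements tab will show. Here is a minimal sketch of pulling it out with BeautifulSoup; the function name, the field name `csrf_token`, and the sample form are assumptions, since every framework names the field differently.

```python
from bs4 import BeautifulSoup


def find_csrf_token(html, field_name="csrf_token"):
    """Return the value of the <input> called field_name, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    field = soup.find("input", attrs={"name": field_name})
    return field.get("value") if field is not None else None


# Hypothetical login page fragment
login_page = """
<form method="post" action="/login">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
</form>
"""
print(find_csrf_token(login_page))
```

The token then goes into the POST data next to the credentials, sent on the same Session that fetched the form.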


USE CHROME DEVTOOLS

aka "Oh, that click is actually a POST request?!"
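
Once the Network tab shows the click is a POST, the same request can be rebuilt in requests from the Form Data listed there. This sketch prepares such a request without sending it, to check that the body matches what DevTools shows; the URL and field names are invented for illustration.

```python
import requests

# Form fields as they might be copied from the Network tab (hypothetical)
form_data = {"item_id": "42", "action": "like"}

request = requests.Request("POST", "http://example.com/like", data=form_data)
prepared = request.prepare()

print(prepared.method)                   # POST
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
print(prepared.body)                     # item_id=42&action=like

# To actually send it: requests.Session().send(prepared)
```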


use requests streaming responses

aka "how do I grab that SFW video?!"

import requests

# sfw_video (a URL) and local_filename (a path) are defined elsewhere
r = requests.get(sfw_video, stream=True)
r.raise_for_status()
with open(local_filename, 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:  # filter out keep-alive chunks
            f.write(chunk)

http://docs.python-requests.org/en/latest/user/advanced/#streaming-requests
http://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow

...BUT

The golden rule

of web scraping

is...


Don't be a jerk!
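
One reading of the rule, sketched in code: honour robots.txt and pause between requests. This uses the standard library's `urllib.robotparser`; the robots.txt content, the user-agent name `my-scraper`, the helper name `polite_urls`, and the one-second delay are all assumptions.

```python
import time
from urllib.robotparser import RobotFileParser

# Example rules, parsed from a string so nothing hits the network
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(robots_txt.splitlines())


def polite_urls(urls, user_agent="my-scraper", delay=1.0):
    """Yield only the URLs robots.txt allows, pausing after each one."""
    for url in urls:
        if rules.can_fetch(user_agent, url):
            yield url
            time.sleep(delay)  # give the server a breather


candidates = ["http://example.com/index.html", "http://example.com/private/x"]
print(list(polite_urls(candidates, delay=0)))
```

Against a live site, `set_url()` followed by `read()` would load the real robots.txt instead of the inline string.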

