Python Web Scraping 102
Gaurav Dadhania
@GVRV
use requests
Dev-friendly, far nicer API than urllib2; 99% of the time, it works all the time
>>> import requests
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
u'{"type":"User"...'
>>> r.json()
{u'private_gists': 419, u'total_private_repos': 77, ...}
http://docs.python-requests.org/en/latest/
Use bs4
Parsing HTML (and XML) docs, good API, excellent docs
from bs4 import BeautifulSoup
soup = BeautifulSoup("<html>Some blob of HTML</html>", "html.parser")

soup.title                  # first <title> tag in the document
soup.title.string           # its text content
soup.title.parent.name      # name of the tag that contains it
soup.p['class']             # attribute access on a tag
soup.find_all('a', attrs={'class': 'download'})
soup.find(id="bank-details")
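Those calls are easier to see against a live page; a minimal sketch wiring requests into bs4 (the URL is just a placeholder):

import requests
from bs4 import BeautifulSoup

r = requests.get('https://example.com')
soup = BeautifulSoup(r.text, 'html.parser')

print(soup.title.string)            # the page's <title> text
for a in soup.find_all('a', href=True):
    print(a['href'])                # every hyperlink on the page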
Requests Sessions
aka "What do I do if I need to log in first?"
s = requests.Session()
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get("http://httpbin.org/cookies")
print(r.text)
# '{"cookies": {"sessioncookie": "123456789"}}'
http://docs.python-requests.org/en/latest/user/advanced/#session-objects
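The cookie example above generalises to logins: POST the credentials through a Session, and every later request carries the auth cookie automatically. A sketch, with a made-up endpoint and form field names:

import requests

s = requests.Session()
# Hypothetical login URL and input names -- read the real form's
# action and fields out of Chrome DevTools before copying these.
s.post('https://example.com/login',
       data={'username': 'user', 'password': 'pass'})

# The session holds on to the auth cookie, so this request is logged in.
r = s.get('https://example.com/dashboard')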
USE CHROME DEVTOOLS
aka "Why isn't my scraper working?!"
USE CHROME DEVTOOLS
aka "Where do I get the CSRF token from?!"
USE CHROME DEVTOOLS
aka "Oh, that click is actually a POST request?!"
use requests streaming responses
aka "how do I grab that SFW video?!"
r = requests.get(sfw_video, stream=True)
with open(local_filename, 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:  # filter out keep-alive new chunks
            f.write(chunk)
            f.flush()
http://docs.python-requests.org/en/latest/user/advanced/#streaming-requests
http://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow
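One caveat: if you stop reading early, the connection may not be released back to the pool. Newer versions of requests let the response itself act as a context manager, which handles that; a small variant of the block above (same placeholder names):

import requests

# Response-as-context-manager closes the connection even if we
# break out of the loop before the body is fully read.
with requests.get(sfw_video, stream=True) as r:
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            f.write(chunk)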
...BUT
The golden rule
of web scraping
is...
Don't be a jerk!
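Concretely, "don't be a jerk" usually means: identify yourself, honour robots.txt, and throttle your requests. A sketch using the standard library's robotparser (Python 3 spelling; the user agent, URLs, and delay are illustrative):

import time
import requests
from urllib.robotparser import RobotFileParser

AGENT = 'my-scraper/0.1 (contact: me@example.com)'   # identify yourself

rp = RobotFileParser('https://example.com/robots.txt')
rp.read()

for url in ['https://example.com/page/1', 'https://example.com/page/2']:
    if not rp.can_fetch(AGENT, url):     # honour robots.txt
        continue
    r = requests.get(url, headers={'User-Agent': AGENT})
    time.sleep(1.0)                      # throttle: ~1 request/second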