Python scraping 102
Gaurav Dadhania
@GVRV
use requests
Dev-friendly, better API, 99% of the time works all the time
>>> import requests
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
u'{"type":"User"...'
>>> r.json()
{u'private_gists': 419, u'total_private_repos': 77, ...}
http://docs.python-requests.org/en/latest/
use bs4
Parsing HTML (and XML) docs, good API, excellent docs
from bs4 import BeautifulSoup
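html = '<html><head><title>Some page</title></head><body><p class="intro">Some blob of HTML</p></body></html>'
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
soup.title                # <title>Some page</title>
soup.title.string         # 'Some page'
soup.title.parent.name    # 'head'
soup.p['class']           # ['intro']
soup.find_all('a', attrs={'class': 'download'})  # [] here; otherwise a list of matching <a> tags
soup.find(id="bank-details")  # None here; otherwise the first element with that id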
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
Requests Sessions
aka "What do I do if I need to log in first?"
s = requests.Session()
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get("http://httpbin.org/cookies")
print(r.text)
# '{"cookies": {"sessioncookie": "123456789"}}'
http://docs.python-requests.org/en/latest/user/advanced/#session-objects
USE CHROME DEVTOOLS
aka "Why isn't my scraper working?!"
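One common culprit is that the scraper sends different headers than the browser does. A minimal sketch, assuming a placeholder URL and made-up header values, of how to inspect what requests would send so it can be compared against the Network tab in DevTools. Nothing is sent over the network here.

```python
import requests

# Hypothetical target URL; this sketch never actually contacts it.
url = "http://example.com/listing"

session = requests.Session()

# Build the request the session *would* send, without sending it.
prepared = session.prepare_request(requests.Request("GET", url))
default_user_agent = prepared.headers["User-Agent"]
print(default_user_agent)  # something like "python-requests/<version>"

# Override headers to mirror the browser. The values below are placeholders;
# copy the real ones by hand from the DevTools Network tab.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (placeholder copied from DevTools)",
    "Referer": "http://example.com/",
})
prepared = session.prepare_request(requests.Request("GET", url))
print(dict(prepared.headers))
```

If the site behaves differently for the two header sets, that difference is a good first thing to rule out.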
USE CHROME DEVTOOLS
aka "Where do I get the CSRF token from?!"
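Often the token sits in a hidden form field that DevTools shows in the Elements panel. A rough sketch of pulling it out with bs4, assuming a made-up login form and a field named `csrf_token` (both hypothetical; the real name depends on the site).

```python
from bs4 import BeautifulSoup

# Made-up login page; a real one would come from session.get(login_url).text.
# The field name "csrf_token" is an assumption; check the actual form in DevTools.
login_page = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

def extract_csrf_token(html, field_name="csrf_token"):
    """Return the value of the input called field_name, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    field = soup.find("input", attrs={"name": field_name})
    return field.get("value") if field is not None else None

token = extract_csrf_token(login_page)
print(token)  # abc123
```

Some sites put the token in a cookie or a `<meta>` tag instead, so the lookup may need adjusting.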
USE CHROME DEVTOOLS
aka "Oh, that click is actually a POST request?!"
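Once the Network tab shows the click is a POST, the same request can usually be replayed with requests. A sketch under stated assumptions: the endpoint and form fields below are hypothetical stand-ins for whatever DevTools lists under the request's form data. The request is only prepared, not sent, so it can be checked first.

```python
import requests

# Hypothetical endpoint and form fields, standing in for the values
# DevTools shows for the real request.
url = "http://example.com/items/42/like"
form_data = {"item_id": "42", "csrf_token": "abc123"}

session = requests.Session()

# Prepare the POST without sending it, to compare against DevTools.
prepared = session.prepare_request(requests.Request("POST", url, data=form_data))
print(prepared.method)                   # POST
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
print(prepared.body)                     # item_id=42&csrf_token=abc123

# Once it matches, send it for real (commented out: the URL is a placeholder).
# response = session.send(prepared)
# response.raise_for_status()
```

For everyday use, `session.post(url, data=form_data)` does the prepare and send steps in one call.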
use requests streaming responses
aka "how do I grab that SFW video?!"
r = requests.get(sfw_video, stream=True)
with open(local_filename, 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:  # filter out keep-alive new chunks
            f.write(chunk)
            f.flush()
http://docs.python-requests.org/en/latest/user/advanced/#streaming-requests
http://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow
...BUT
The golden rule
of web scraping
is...