Levelling Up Your Web Scraping Game

July 29, 2021

Ian Littman / @iansltx

follow along at https://ian.im/scrape0721

What we'll cover

Key principles for scraping efficiently
How to scrape in PHP (with a demo!)
What to do when PHP isn't enough

Principle 1: It's all about the data

That data gets to a normal user over HTTPS (at least in most cases)

Step 1: What request(s) do I need to make to get responses including my data?
Step 2: How do I get to the point that I can make those requests?
- Cookies
- Local storage
- Calculated request values
- Form submission values (e.g. CSRF, viewstate)
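A minimal sketch of the "form submission values" item above, assuming a server-rendered form that carries its CSRF token and viewstate in hidden inputs. The markup, field names, and the `build_form_body` helper are hypothetical, and Python's standard library is used only so the example runs without dependencies (the talk's own demo is PHP with Guzzle).

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden_fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.hidden_fields[attr["name"]] = attr.get("value") or ""


def build_form_body(html, user_fields):
    """Merge the page's hidden fields with the values a user would type in."""
    parser = HiddenInputParser()
    parser.feed(html)
    parser.close()
    # User-supplied fields win if a name appears in both dicts.
    return urlencode({**parser.hidden_fields, **user_fields})


# Hypothetical login page markup, for illustration only.
SAMPLE_HTML = """
<form method="post" action="/login">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTA4">
  <input type="text" name="email">
</form>
"""

# Body to POST back, including the values the server expects to see echoed.
body = build_form_body(SAMPLE_HTML, {"email": "me@example.com"})
print(body)
```

In a real scraper the HTML would come from the GET response for the form page, and the resulting body would be sent in the follow-up POST along with whatever cookies that GET set.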

Principle 2: Choose runtime tradeoffs wisely

HTTP client/scraping library

Pros
- Easy to integrate
- Efficient
- Probably fast
Cons
- Painful to run JS in...
- ...unless you're using JS libs...
- ...so you'll need to reverse-engineer (RE) more

Headless Browser

Pros
- Runs JS
- Interact as you would with a real browser
- Plenty of tooling
Cons
- Very resource intensive
- May have a distinct signature
- Have to run it somewhere

Principle 3: Try to hit edge cases during dev

Tool #1: Firefox

Demo: Hippo

Goal: Pull my insurance policy information
Target: Single page app behind passwordless authentication
Tool: Guzzle (with cookie jar turned on)
See the code: github.com/iansltx/web-scraping-demos

...but you probably won't use an interactive CLI

Save/restore cookies between requests
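One way to persist a session between runs, sketched with Python's standard library rather than the Guzzle cookie jar used in the demo. The file path, the `make_opener` and `demo_cookie` helpers, and the cookie itself are hypothetical; a real scraper would get its cookies from the site's responses instead of constructing one by hand.

```python
import http.cookiejar
import os
import tempfile
import time
import urllib.request


def make_opener(cookie_path):
    """Build a urllib opener whose cookie jar is restored from cookie_path if it exists."""
    jar = http.cookiejar.LWPCookieJar(cookie_path)
    if os.path.exists(cookie_path):
        # ignore_discard=True also restores session cookies (those with no expiry).
        jar.load(ignore_discard=True)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def demo_cookie(name, value, domain):
    """Stand-in for a cookie the site would normally set via Set-Cookie."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=True, expires=int(time.time()) + 3600, discard=False,
        comment=None, comment_url=None, rest={},
    )


cookie_path = os.path.join(tempfile.mkdtemp(), "cookies.txt")

# First run: nothing saved yet, so the jar starts empty.
opener, jar = make_opener(cookie_path)
jar.set_cookie(demo_cookie("session", "s3cr3t", "example.com"))
jar.save(ignore_discard=True)  # write the jar to cookie_path

# Later run: the saved cookies are loaded before any request is made.
opener2, restored = make_opener(cookie_path)
restored_values = {c.name: c.value for c in restored}
print(restored_values)
```

Requests made through `opener2.open(...)` would then send the restored cookies to matching domains, so a login performed in one run can be reused in the next until the session expires.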

You may need to look at what gets stored in localStorage/sessionStorage.

If you need to use a real browser...

Puppeteer - driver for Headless Chrome, rather easy to Dockerize
Selenium - well-known framework that I've never needed to use
PhantomJS - abandoned; don't use this anymore

Sometimes sites don't want you to scrape

User agent detection
IP detection/greylisting/blacklisting
Browser fingerprinting
User activity fingerprinting
Visual CAPTCHAs
Obfuscated JS
- Use unminify from npm
- --safety wildly-unsafe is usually fine

ian.im/scrape0721 - these slides
ian.im/scrape0621 - a longer version of this talk
github.com/iansltx/web-scraping-demos - companion code
twitter.com/iansltx - me, online
covie.com - my company, online

Thanks! Questions?
