Levelling Up Your Web Scraping Game

July 29, 2021

Ian Littman / @iansltx

follow along at https://ian.im/scrape0721

What we'll cover

  • Key principles for scraping efficiently
  • How to scrape in PHP (with a demo!)
  • What to do when PHP isn't enough

Principle 1: It's all about the data

That data gets to a normal user over HTTPS (at least in most cases)

  • Step 1: What request(s) do I need to make to get responses including my data?
  • Step 2: How do I get to the point where I can make those requests?
    • Cookies
    • Local storage
    • Calculated request values
    • Form submission values (e.g. CSRF, viewstate)
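
As a rough illustration of the "form submission values" item above, here is a minimal Python sketch (standard library only) that collects hidden form inputs such as a CSRF token or viewstate so they can be sent back with the next request. The sample HTML, the field names, and the `hidden_fields` helper are all hypothetical; a real site will name and nest things differently.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collects name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        # Attribute values may be None when the attribute has no value.
        input_type = (attr.get("type") or "").lower()
        if input_type == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of hidden form fields found in the given HTML (hypothetical helper)."""
    parser = HiddenInputParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Made-up login page standing in for a real response body.
SAMPLE_LOGIN_PAGE = """
<form method="post" action="/login">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTA4">
  <input type="text" name="email">
</form>
"""

# Start from the hidden values, then add the fields a user would type in.
payload = hidden_fields(SAMPLE_LOGIN_PAGE)
payload["email"] = "me@example.com"
```

The resulting `payload` is what you would POST back, using the same session (and cookies) that fetched the form.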

Principle 2: Choose runtime tradeoffs wisely

HTTP client/scraping library

  • Pros
    • Easy to integrate
    • Efficient
    • Probably fast
  • Cons
    • Painful to run JS in...
    • ...unless you're using JS libs...
    • ...so you'll need to reverse-engineer more

Headless Browser

  • Pros
    • Runs JS
    • Interact as you would a real browser
    • Plenty of tooling
  • Cons
    • Very resource intensive
    • May have a distinct signature
    • Have to run it somewhere

Principle 3: Try to hit edge cases during dev

Tool #1: Firefox

Demo: Hippo

  • Goal: Pull my insurance policy information
  • Target: Single page app behind passwordless authentication
  • Tool: Guzzle (with cookie jar turned on)
  • See the code

...but you probably won't use an interactive CLI

Save/restore cookies between requests
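
The demo itself uses Guzzle's cookie jar; the sketch below shows the same save/restore idea in Python's standard library, purely as an illustration. The cookie name, value, domain, file location, and the `make_session_cookie` helper are hypothetical. In a real scraper the cookie would be set by a login response rather than built by hand.

```python
import http.cookiejar
import os
import tempfile


def make_session_cookie(name, value, domain):
    """Build a session cookie by hand (stand-in for one set by a real response)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=True, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


# Hypothetical location for the persisted jar.
jar_path = os.path.join(tempfile.mkdtemp(), "cookies.txt")

# First run: authenticate (simulated here), then write the jar to disk.
jar = http.cookiejar.LWPCookieJar(jar_path)
jar.set_cookie(make_session_cookie("session_id", "s3cr3t", "example.com"))
# ignore_discard=True keeps session cookies, which are skipped by default.
jar.save(ignore_discard=True)

# Later run: load the jar instead of logging in again.
restored = http.cookiejar.LWPCookieJar(jar_path)
restored.load(ignore_discard=True)
restored_values = {cookie.name: cookie.value for cookie in restored}
```

To send the restored cookies with requests, the jar can be attached to an opener via `urllib.request.build_opener(urllib.request.HTTPCookieProcessor(restored))`.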

You may need to look at what gets stored in localStorage/sessionStorage.

If you need to use a real browser...

  • Puppeteer - driver for Headless Chrome, rather easy to Dockerize
  • Selenium - well-known framework that I've never needed to use
  • PhantomJS - abandoned; don't use this anymore

Sometimes sites don't want you to scrape

  • User agent detection
  • IP detection/greylisting/blacklisting
  • Browser fingerprinting
  • User activity fingerprinting
  • Visual CAPTCHAs
  • Obfuscated JS
    • Use unminify from npm
    • --safety wildly-unsafe is usually fine
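
For the simplest item on the list above, user agent detection, a minimal sketch is to send the headers a real browser would send. The header values and URL below are only examples (copy current ones from your own browser's network tab), and `build_request` is a hypothetical helper. This does nothing against IP-based blocking, fingerprinting, or CAPTCHAs.

```python
import urllib.request

# Example values resembling a desktop Firefox request; treat them as placeholders.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url):
    """Return a Request carrying browser-like headers (nothing is sent yet)."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


# Hypothetical target URL; pass the result to urllib.request.urlopen() to send it.
request = build_request("https://example.com/policies")
```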

Thanks! Questions?