Levelling Up Your Web Scraping Game

NomadPHP US June 2021

Ian Littman / @iansltx

follow along at https://ian.im/scrape0621

What we'll cover

  • Key principles for scraping efficiently
  • How to scrape in PHP, with demos!
    • Server-side-rendered HTML
    • Single page/API driven apps
  • What to do when PHP isn't enough
  • What about other file types?

Principle 1: It's all about the data

That data gets to a normal user over HTTPS (at least in most cases)

  • Step 1: What request(s) do I need to make to get responses including my data?
  • Step 2: How do I get to the point where I can make those requests?
    • Cookies
    • Local storage
    • Calculated request values
    • Form submission values (e.g. CSRF, viewstate)
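As a rough illustration of the "form submission values" item, here is a minimal sketch that pulls every hidden input out of a page so the values can be sent back with the next request. It assumes PHP 8 with the built-in DOM extension; the form markup, field names, and e-mail address are made up for illustration and do not come from any site in the talk.

```php
<?php
declare(strict_types=1);

/**
 * Collect every hidden <input> in an HTML page as name => value.
 * Hidden fields are where CSRF tokens and ASP.NET __VIEWSTATE usually live.
 */
function extractHiddenFields(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//input[@type="hidden"][@name]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

// Hypothetical login page; the field names are invented for this example.
$html = '<form method="post" action="/login">
  <input type="hidden" name="_csrf" value="abc123">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTA4">
  <input type="text" name="email">
</form>';

// Hidden values first, then the fields a user would have typed in.
$payload = extractHiddenFields($html) + ['email' => 'me@example.com'];
print_r($payload);
```

The resulting array is what would be posted back as the form body. Values computed by JavaScript at submit time would not show up this way and need separate reverse engineering.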

Principle 2: Choose runtime tradeoffs wisely

PHP

  • Pros
    • Easy to integrate
    • Efficient
    • Probably fast
  • Cons
    • Painful to run JS in...
    • ...so you'll need to reverse-engineer more

Headless Browser

  • Pros
    • Runs JS
    • Interact as you would with a real browser
    • Plenty of tooling
  • Cons
    • Very resource intensive
    • May have a distinct signature
    • Have to run it somewhere

Principle 3: Try to hit edge cases during dev

Tool #1: Firefox

Demo #1: LocalCallingGuide

  • Goal: Pull prefix list for a city, accounting for pagination
  • Target: Server side rendered PHP app
  • Tool: Symfony HttpBrowser
  • See the code
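The pagination part of this goal can be sketched without any HTTP client at all. The example below is not the demo code: it uses PHP 8's built-in DOM extension rather than Symfony HttpBrowser, and it assumes the pager exposes an `<a rel="next">` link. The URLs, table markup, and prefix values are invented, and the canned pages stand in for real responses so the sketch runs offline.

```php
<?php
declare(strict_types=1);

/** Parse one listing page: return its data rows plus the "next" link, if any. */
function parseListingPage(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $xpath = new DOMXPath($doc);

    $rows = [];
    // Only rows containing <td> cells; header rows made of <th> are skipped.
    foreach ($xpath->query('//table//tr[td]') as $tr) {
        $cells = [];
        foreach ($xpath->query('./td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        $rows[] = $cells;
    }

    // Assumed markup: the pager exposes <a rel="next" href="...">.
    $next = $xpath->query('//a[@rel="next"]/@href')->item(0);

    return ['rows' => $rows, 'next' => $next?->nodeValue];
}

/** Follow "next" links until they run out; $fetch maps a URL to an HTML string. */
function scrapeAllPages(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $all = [];
    $url = $startUrl;
    // $maxPages guards against a pager that loops back on itself.
    for ($page = 0; $url !== null && $page < $maxPages; $page++) {
        $parsed = parseListingPage($fetch($url));
        $all = array_merge($all, $parsed['rows']);
        $url = $parsed['next'];
    }
    return $all;
}

// Canned pages (made-up data) stand in for HTTP responses.
$pages = [
    '/list?page=1' => '<table><tr><th>NPA</th><th>NXX</th></tr>
        <tr><td>512</td><td>555</td></tr></table><a rel="next" href="/list?page=2">Next</a>',
    '/list?page=2' => '<table><tr><td>737</td><td>200</td></tr></table>',
];
$prefixes = scrapeAllPages('/list?page=1', fn (string $url): string => $pages[$url]);
print_r($prefixes);
```

Passing the fetcher in as a callable keeps the parsing testable against saved HTML, which also helps with hitting edge cases (last page, empty page) during development. A real fetcher would also need to resolve relative `href` values against the current URL.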

Demo #2: Hippo

  • Goal: Pull my insurance policy information
  • Target: Single page app behind passwordless authentication
  • Tool: Guzzle (with cookie jar turned on)
  • See the code

...but you probably won't use an interactive CLI

Save/restore cookies between requests

You may need to look at what gets stored in localStorage/sessionStorage.
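When a login flow spans separate processes (request a code now, submit it later), the cookies have to survive in between. The sketch below shows the idea with a deliberately simplified, hypothetical `JsonCookieStore` class for PHP 8: it keeps only name/value pairs and ignores domain, path, expiry, and deletion. Guzzle's `FileCookieJar` and cURL's `CURLOPT_COOKIEJAR`/`CURLOPT_COOKIEFILE` options handle those details and are likely the better choice in real code.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: persists cookie name => value pairs in a JSON file. */
final class JsonCookieStore
{
    public function __construct(private string $path) {}

    /** Cookies saved by an earlier run, or an empty array. */
    public function load(): array
    {
        if (!is_file($this->path)) {
            return [];
        }
        return json_decode((string) file_get_contents($this->path), true) ?: [];
    }

    /** Merge cookies from raw Set-Cookie header values and write them back. */
    public function saveFromHeaders(array $setCookieValues): array
    {
        $cookies = $this->load();
        foreach ($setCookieValues as $line) {
            // "name=value; Path=/; HttpOnly" -> keep only the name=value pair.
            [$pair] = explode(';', $line, 2);
            [$name, $value] = array_pad(explode('=', $pair, 2), 2, '');
            $cookies[trim($name)] = trim($value);
        }
        file_put_contents($this->path, json_encode($cookies));
        return $cookies;
    }

    /** Build the Cookie request header for the next request. */
    public function toHeader(): string
    {
        $pairs = [];
        foreach ($this->load() as $name => $value) {
            $pairs[] = "$name=$value";
        }
        return 'Cookie: ' . implode('; ', $pairs);
    }
}

// First process: store what the server set (values invented for the example).
$path = tempnam(sys_get_temp_dir(), 'jar');
(new JsonCookieStore($path))->saveFromHeaders(['session=abc123; Path=/; HttpOnly', 'theme=dark']);

// Later process: a fresh object restores the same cookies from disk.
$restored = (new JsonCookieStore($path))->toHeader();
echo $restored, PHP_EOL;
```

The same file could also hold tokens copied out of localStorage/sessionStorage, since a non-browser client has to carry those by hand.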

If you need to use a real browser...

  • Puppeteer - driver for Headless Chrome, rather easy to Dockerize
  • Selenium - well-known framework that I've never needed to use
  • PhantomJS - abandoned; don't use this anymore

Sometimes sites don't want you to scrape

  • User agent detection
  • IP detection/greylisting/blacklisting
  • Browser fingerprinting
  • User activity fingerprinting
  • Visual CAPTCHAs
  • Obfuscated JS
    • Use unminify from npm
    • --safety wildly-unsafe is usually fine

Sometimes your data isn't in HTML/JS/XML

  • Excel
    • XLS/XLSX: PhpSpreadsheet
    • XLSB: Libraries exist in other languages
    • Encrypted: msoffcrypto Python module
  • PDF
    • pdftotext
    • qpdf --decrypt
    • pdfgrep
    • imagemagick + tesseract
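For the PDF tools, one way to call `pdftotext` from PHP is to shell out and read its output. This is a minimal sketch assuming PHP 8 on a Unix-like system with `pdftotext` on the PATH; the function names and `statement.pdf` are hypothetical. Treating empty output as "no text layer" is an assumption about how scanned PDFs typically behave.

```php
<?php
declare(strict_types=1);

/** Build the pdftotext command line; the trailing "-" sends the text to stdout. */
function pdfToTextCommand(string $pdfPath): string
{
    // -layout tries to preserve the physical layout, which helps with tables.
    return sprintf('pdftotext -layout %s -', escapeshellarg($pdfPath));
}

/** Run pdftotext; return the extracted text, or null if it fails or finds nothing. */
function extractPdfText(string $pdfPath): ?string
{
    exec(pdfToTextCommand($pdfPath) . ' 2>/dev/null', $lines, $exitCode);
    $text = trim(implode("\n", $lines));
    return ($exitCode === 0 && $text !== '') ? $text : null;
}

$text = extractPdfText('statement.pdf'); // hypothetical file name
echo $text ?? "No text extracted; OCR (imagemagick + tesseract) may be the next step.\n";
```

Keeping the command builder separate from the `exec()` call makes the quoting easy to check without needing a PDF on hand.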

CFP closes Wednesday night

Thanks! Questions?