Levelling Up Your Web Scraping Game

PHP UK Conference 2022

Ian Littman (CTO @ Covie) / @iansltx

follow along at https://ian.im/scrapeuk22

What we'll cover

  • Key principles for scraping efficiently
  • How to scrape in PHP, with demos!
    • Server-side-rendered HTML
    • Single page/API driven apps
  • What to do when PHP isn't enough
  • What about other file types?

Principle #1: It's all about the data

That data gets to a normal user over HTTPS (at least in most cases)

 

  • Step 1: What request(s) do I need to make to get responses including my data?
  • Step 2: How do I get to the point where I can make those requests?
    • Cookies
    • Local storage
    • Calculated (or random!) request values
    • Form submission values (e.g. CSRF, viewstate)
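
As a minimal sketch of Step 2 for form submission values: load the page that contains the form, collect every named input (hidden CSRF tokens included), then overwrite only the fields you care about before posting. The markup, field names, and the `extractFormFields` helper below are hypothetical, and only PHP's built-in DOM extension is assumed.

```php
<?php
declare(strict_types=1);

/**
 * Collect every named <input> in the first <form> of an HTML document,
 * so hidden values such as CSRF tokens ride along with the next request.
 */
function extractFormFields(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; keep parser warnings out of the output.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $fields = [];
    foreach ($xpath->query('(//form)[1]//input[@name]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

// Hypothetical login page markup, for illustration only.
$html = <<<'HTML'
<html><body>
<form method="post" action="/login">
  <input type="hidden" name="_csrf" value="abc123">
  <input type="text" name="email" value="">
</form>
</body></html>
HTML;

$fields = extractFormFields($html);
$fields['email'] = 'me@example.com';

// URL-encoded body for the follow-up POST, with the token carried over.
$postBody = http_build_query($fields);
```

The same idea extends to viewstate-style fields: anything the server put in the form goes back unchanged unless you have a reason to alter it.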

Principle #2: Choose runtime tradeoffs wisely

PHP

  • Pros
    • Easy to integrate
    • Efficient
    • Probably fast
  • Cons
    • Painful to run JS in...
    • ...so you'll need to reverse-engineer more

Headless Browser

  • Pros
    • Runs JS
    • Interact as you would with a real browser
    • Plenty of tooling
  • Cons
    • Very resource intensive
    • May have a distinct signature
    • Have to run it somewhere

Principle #3: Try to hit edge cases during dev

Tool: Firefox

Demo #1: LocalCallingGuide

  • Goal: Pull prefix list for a city, accounting for pagination
  • Target: Server side rendered PHP app
  • Tool: Symfony HttpBrowser
  • See the code
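
The pagination part of this demo can be sketched without any dependencies. This is not the demo code: it swaps Symfony HttpBrowser for PHP's built-in DOM extension and an injected fetch function so it runs standalone. The table layout, the `rel="next"` link, and the URLs are assumptions for illustration, not LocalCallingGuide's real markup.

```php
<?php
declare(strict_types=1);

/**
 * Walk a paginated listing: parse rows from each page, then follow the
 * "next" link until there isn't one.
 *
 * @param callable(string): string $fetch returns the HTML for a URL
 * @return string[] first-cell text from every table row, across all pages
 */
function scrapeAllPages(callable $fetch, string $startUrl, int $maxPages = 50): array
{
    $rows = [];
    $url = $startUrl;
    $seen = [];

    // Cap the page count and track visited URLs so a bad "next" link can't loop forever.
    while ($url !== null && !isset($seen[$url]) && count($seen) < $maxPages) {
        $seen[$url] = true;

        $doc = new DOMDocument();
        libxml_use_internal_errors(true);
        $doc->loadHTML($fetch($url));
        libxml_clear_errors();
        $xpath = new DOMXPath($doc);

        foreach ($xpath->query('//table//tr/td[1]') as $cell) {
            $rows[] = trim($cell->textContent);
        }

        // Real code would also resolve relative hrefs against the current URL.
        $next = $xpath->query('//a[@rel="next"]/@href')->item(0);
        $url = $next !== null ? $next->nodeValue : null;
    }
    return $rows;
}

// Stub fetcher with two hypothetical pages, standing in for real HTTP requests.
$pages = [
    '/prefixes?page=1' => '<table><tr><td>512-200</td></tr><tr><td>512-201</td></tr></table>'
        . '<a rel="next" href="/prefixes?page=2">Next</a>',
    '/prefixes?page=2' => '<table><tr><td>512-202</td></tr></table>',
];
$prefixes = scrapeAllPages(fn (string $url): string => $pages[$url], '/prefixes?page=1');
```

Injecting the fetcher keeps the parsing logic testable against saved pages, which also helps with Principle #3: edge cases found during development can be replayed offline.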

Demo #2: Hippo

  • Goal: Pull my insurance policy information
  • Target: Single page app behind passwordless authentication
  • Tool: Guzzle (with cookie jar turned on)
  • See the code

...but you probably won't use an interactive CLI...

...so you'll need to save/restore state between requests

  • Cookies
  • localStorage
  • sessionStorage
  • Maybe even other generated stored-in-JS values
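
One way to carry that state between separate PHP processes is to serialize it to a file and reload it on the next run. The sketch below is an assumption-laden illustration: the `ScrapeState` class, its fields, and the file path are made up here, and it needs PHP 8.0+ for constructor promotion. If only cookies matter, Guzzle's `FileCookieJar` covers that part on its own.

```php
<?php
declare(strict_types=1);

/**
 * Minimal container for the state a scraper must carry between runs.
 * The field names are illustrative, not taken from the talk's demo code.
 */
final class ScrapeState
{
    public function __construct(
        public array $cookies = [],        // name => value
        public array $localStorage = [],   // key => value
        public array $sessionStorage = [], // key => value
    ) {}

    public function save(string $path): void
    {
        $json = json_encode(get_object_vars($this), JSON_THROW_ON_ERROR);
        file_put_contents($path, $json, LOCK_EX);
        // The file holds live credentials; restrict it to the current user.
        chmod($path, 0600);
    }

    public static function load(string $path): self
    {
        if (!is_file($path)) {
            return new self(); // first run: start with empty state
        }
        $data = json_decode(file_get_contents($path), true, 512, JSON_THROW_ON_ERROR);
        return new self(
            $data['cookies'] ?? [],
            $data['localStorage'] ?? [],
            $data['sessionStorage'] ?? [],
        );
    }

    /** Render stored cookies as a Cookie request header value. */
    public function cookieHeader(): string
    {
        $pairs = [];
        foreach ($this->cookies as $name => $value) {
            $pairs[] = $name . '=' . $value;
        }
        return implode('; ', $pairs);
    }
}

// Usage: one request stores state; a later process restores it.
$path = sys_get_temp_dir() . '/scrape-state-example.json';
$state = new ScrapeState(['session' => 'abc123'], ['authToken' => 'tok_1']);
$state->save($path);

$restored = ScrapeState::load($path);
$header = $restored->cookieHeader();
unlink($path); // clean up the example file
```

In a real deployment the state would more likely live in a database row or cache entry keyed by account, but the shape of the data stays the same.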

If you need to use a real browser...

  • Puppeteer - driver for Headless Chrome, rather easy to Dockerize
  • Selenium - well-known framework that I've never needed to use
  • PhantomJS (incl. CasperJS) - abandoned; don't use this anymore

Sometimes sites don't want you to scrape

  • User agent detection
  • IP detection/greylisting/blacklisting
  • Browser fingerprinting
  • User activity fingerprinting
  • Visual CAPTCHAs
  • Obfuscated JS
    • Use Shape Security's unminify from npm
    • --safety wildly-unsafe is usually fine

Principle #4: Try the mobile app

Tool: Charles Proxy for iOS

Caveat: Certificate pinning

Sometimes your data isn't in HTML/JS/XML

  • Excel
    • XLS/XLSX: PhpSpreadsheet
    • XLSB: Libraries exist in other languages
    • Encrypted: msoffcrypto Python module
  • PDF
    • pdftotext
    • qpdf --decrypt
    • pdfgrep
    • ImageMagick + Tesseract (OCR for image-only PDFs)
    • Cloud equivalents: AWS Textract, Sensible
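
The command-line PDF tools above are easy to drive from PHP by shelling out. The following is a rough sketch, assuming `qpdf` and `pdftotext` are installed and on the PATH; the helper function names and file names are hypothetical, and every argument goes through `escapeshellarg` because scraped file names are untrusted input.

```php
<?php
declare(strict_types=1);

/** Build a qpdf command that writes a decrypted copy of a PDF. */
function qpdfDecryptCommand(string $in, string $out, string $password = ''): string
{
    return sprintf(
        'qpdf --password=%s --decrypt %s %s',
        escapeshellarg($password),
        escapeshellarg($in),
        escapeshellarg($out)
    );
}

/** Build a pdftotext command that writes layout-preserving text to stdout. */
function pdfToTextCommand(string $pdf): string
{
    // The trailing "-" tells pdftotext to print to stdout instead of a file.
    return sprintf('pdftotext -layout %s -', escapeshellarg($pdf));
}

/** Run a command and return its stdout, throwing on a non-zero exit code. */
function run(string $command): string
{
    exec($command, $lines, $exitCode);
    if ($exitCode !== 0) {
        throw new RuntimeException("Command failed ($exitCode): $command");
    }
    return implode("\n", $lines);
}

// Example (requires qpdf and pdftotext; file names are hypothetical):
// run(qpdfDecryptCommand('statement.pdf', 'statement-plain.pdf', 'secret'));
// $text = run(pdfToTextCommand('statement-plain.pdf'));
```

If `pdftotext` comes back empty, the PDF is probably image-only, which is where the OCR route or a cloud service comes in.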

Thanks! Questions?