Levelling Up Your Web Scraping Game
PHP UK Conference 2022
Ian Littman (CTO @ Covie) / @iansltx
follow along at https://ian.im/scrapeuk22
What we'll cover
- Key principles for scraping efficiently
- How to scrape in PHP, with demos!
  - Server-side-rendered HTML
  - Single page/API driven apps
- What to do when PHP isn't enough
- What about other file types?
Principle #1: It's all about the data
That data gets to a normal user over HTTPS (at least in most cases)
- Step 1: What request(s) do I need to make to get responses including my data?
- Step 2: How do I get to the point where I can make those requests?
  - Cookies
  - Local storage
  - Calculated (or random!) request values
  - Form submission values (e.g. CSRF, viewstate)
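For form submission values, a minimal sketch of Step 2 using only PHP's bundled DOM extension: pull the hidden inputs out of a fetched page so they can be posted back with your own values. The sample HTML, the field names (`_csrf`, `__VIEWSTATE`) and the helper name `hiddenFormFields` are made up for illustration; a real site will differ.

```php
<?php
declare(strict_types=1);

/**
 * Collect the hidden inputs of the first form in an HTML page, so they can be
 * sent back along with the user-supplied values.
 */
function hiddenFormFields(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('(//form)[1]//input[@type="hidden"]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

// Hypothetical login page; the field names and values are invented.
$html = <<<'HTML'
<html><body>
<form method="post" action="/login">
  <input type="hidden" name="_csrf" value="abc123">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTA4">
  <input type="text" name="email">
</form>
</body></html>
HTML;

// Hidden values first, then the values a user would have typed in.
$payload = hiddenFormFields($html) + ['email' => 'me@example.com'];
echo http_build_query($payload), "\n";
```

The resulting query string is what you would send as the POST body, alongside whatever cookies the first response set.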
Principle #2: Choose runtime tradeoffs wisely
PHP
- Pros
  - Easy to integrate
  - Efficient
  - Probably fast
- Cons
  - Painful to run JS in...
  - ...so you'll need to RE more
Headless Browser
- Pros
  - Runs JS
  - Interact as you would a real browser
  - Plenty of tooling
- Cons
  - Very resource intensive
  - May have a distinct signature
  - Have to run it somewhere
Principle #3: Try to hit edge cases during dev
Tool: Firefox
Demo #1: LocalCallingGuide
- Goal: Pull prefix list for a city, accounting for pagination
- Target: Server side rendered PHP app
- Tool: Symfony HttpBrowser
- See the code
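Not the demo code itself, but a rough sketch of the pagination idea: keep following a "next" link until there isn't one, collecting rows as you go. The markup (a table plus an `a rel="next"` link), the URLs and the function names are assumptions, and the fetcher is a canned array standing in for a real HTTP client.

```php
<?php
declare(strict_types=1);

/** Parse possibly-invalid HTML into an XPath-queryable document. */
function parseHtml(string $html): DOMXPath
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return new DOMXPath($doc);
}

/**
 * Follow rel="next" links from $startUrl, collecting the first cell of every
 * table row. $fetch maps a URL to an HTML string.
 */
function collectPrefixes(callable $fetch, string $startUrl, int $maxPages = 50): array
{
    $prefixes = [];
    $url = $startUrl;
    // Cap the page count so a pagination loop on the site cannot hang us.
    for ($page = 0; $url !== null && $page < $maxPages; $page++) {
        $xpath = parseHtml($fetch($url));
        foreach ($xpath->query('//table//tr/td[1]') as $cell) {
            $prefixes[] = trim($cell->textContent);
        }
        // No "next" link means this was the last page.
        $next = $xpath->query('//a[@rel="next"]/@href')->item(0);
        $url = $next !== null ? $next->nodeValue : null;
    }
    return $prefixes;
}

// Stand-in for an HTTP client: two canned pages keyed by URL (invented data).
$pages = [
    '/prefixes?page=1' => '<table><tr><td>512-200</td><td>Austin</td></tr></table>'
        . '<a rel="next" href="/prefixes?page=2">Next</a>',
    '/prefixes?page=2' => '<table><tr><td>512-201</td><td>Austin</td></tr></table>',
];
$fetch = fn (string $url): string => $pages[$url];

print_r(collectPrefixes($fetch, '/prefixes?page=1'));
```

Passing the fetcher in as a callable keeps the parsing testable against saved pages, which also helps with hitting edge cases during development.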
Demo #2: Hippo
- Goal: Pull my insurance policy information
- Target: Single page app behind passwordless authentication
- Tool: Guzzle (with cookie jar turned on)
- See the code
...but you probably won't use an interactive CLI...
...so you'll need to save/restore state between requests
- Cookies
- localStorage
- sessionStorage
- Maybe even other generated stored-in-JS values
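One way this could look, assuming PHP 8.0+ and a JSON file as the store: a small value object holding the four kinds of state listed above. The class name `ScrapeState`, its shape and the sample values are illustrative assumptions, not part of the demos.

```php
<?php
declare(strict_types=1);

/**
 * Everything a later request needs in order to pick up where an earlier one
 * left off. The shape is an assumption made for this sketch.
 */
final class ScrapeState
{
    public function __construct(
        public array $cookies = [],        // name => value
        public array $localStorage = [],   // key => value
        public array $sessionStorage = [], // key => value
        public array $extra = [],          // other values the site's JS generated
    ) {
    }

    public function save(string $path): void
    {
        $json = json_encode(get_object_vars($this), JSON_THROW_ON_ERROR);
        // Cookies and tokens are credentials: keep the file private.
        file_put_contents($path, $json, LOCK_EX);
        chmod($path, 0600);
    }

    public static function load(string $path): self
    {
        $data = json_decode(file_get_contents($path), true, 512, JSON_THROW_ON_ERROR);
        return new self(
            $data['cookies'] ?? [],
            $data['localStorage'] ?? [],
            $data['sessionStorage'] ?? [],
            $data['extra'] ?? [],
        );
    }
}

// First request: the login step stores what it learned (values are invented).
$path = tempnam(sys_get_temp_dir(), 'scrape');
$state = new ScrapeState(cookies: ['session' => 'abc123'], localStorage: ['token' => 'xyz']);
$state->save($path);

// Later request, possibly in another process: restore and carry on.
$restored = ScrapeState::load($path);
echo $restored->cookies['session'], "\n";
unlink($path);
```

In a web or queue context the file would more likely be a database row or cache entry keyed by user; the point is only that all of the state travels together.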
If you need to use a real browser...
- Puppeteer - driver for Headless Chrome, rather easy to Dockerize
- Selenium - well-known framework that I've never needed to use
- PhantomJS (incl. CasperJS) - abandoned; don't use this anymore
Sometimes sites don't want you to scrape
- User agent detection
- IP detection/greylisting/blacklisting
- Browser fingerprinting
- User activity fingerprinting
- Visual CAPTCHAs
- Obfuscated JS
  - Use Shape Security's unminify from npm
  - --safety wildly-unsafe is usually fine
Principle #4: Try the mobile app
Tool: Charles Proxy for iOS
Caveat: Certificate pinning
Sometimes your data isn't in HTML/JS/XML
- Excel
  - XLS/XLSX: PhpSpreadsheet
  - XLSB: Libraries exist in other languages
  - Encrypted: msoffcrypto Python module
- PDF
  - pdftotext
  - qpdf --decrypt
  - pdfgrep
  - imagemagick + tesseract
  - Cloud equivalents: AWS Textract, Sensible
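For the PDF tools, a small sketch of how the decrypt-then-extract steps might be assembled from PHP. It only builds the command lines; the file names, the password and the two helper function names are placeholders, and it assumes `qpdf` and `pdftotext` are installed on the machine doing the work.

```php
<?php
declare(strict_types=1);

/** Command that strips encryption from a PDF whose password is known. */
function qpdfDecryptCommand(string $in, string $out, string $password): string
{
    return sprintf(
        'qpdf --password=%s --decrypt %s %s',
        escapeshellarg($password),
        escapeshellarg($in),
        escapeshellarg($out)
    );
}

/** Command that prints a PDF's text to stdout, keeping the column layout. */
function pdfToTextCommand(string $in): string
{
    return sprintf('pdftotext -layout %s -', escapeshellarg($in));
}

// Placeholder file names and password.
$decrypt = qpdfDecryptCommand('statement.pdf', 'statement-plain.pdf', 's3cret');
$extract = pdfToTextCommand('statement-plain.pdf');
echo $decrypt, "\n", $extract, "\n";

// To actually run one: exec($decrypt, $output, $exitCode); then check $exitCode.
```

Escaping every argument matters here because scraped file names and passwords are untrusted input by the time they reach a shell.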
- ian.im/scrapeuk22 - these slides
- github.com/iansltx/web-scraping-demos - companion code
- twitter.com/iansltx - me, online
Thanks! Questions?