What we'll cover
- Key principles for scraping efficiently
- How to scrape in PHP (with a demo!)
- What to do when PHP isn't enough
Principle 1: It's all about the data
That data gets to a normal user over HTTPS (at least in most cases)
- Step 1: What request(s) do I need to make to get responses including my data?
- Step 2: How do I get to the point where I can make those requests?
- Local storage
- Calculated request values
- Form submission values (e.g. CSRF, viewstate)
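Form submission values such as CSRF tokens or ASP.NET viewstate usually sit in hidden inputs on the page that precedes the request you actually care about. Here is a minimal sketch of pulling them out with PHP's built-in DOM extension. The function name `extractHiddenFields`, the markup, and the field names are hypothetical, chosen only for illustration. Real sites may instead deliver these values via meta tags, cookies, or JavaScript.

```php
<?php
declare(strict_types=1);

/**
 * Collect the name => value pairs of every hidden <input> in an HTML page.
 * The resulting array can be merged into the form body of a later POST.
 */
function extractHiddenFields(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect libxml's parse warnings
    // internally instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    $xpath = new DOMXPath($doc);
    // Note: this XPath match is case-sensitive on the attribute value "hidden".
    foreach ($xpath->query('//input[@type="hidden"][@name]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }

    return $fields;
}

// Hypothetical login page markup, for illustration only.
$html = <<<'HTML'
<form method="post" action="/login">
  <input type="hidden" name="_csrf" value="abc123">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTA4">
  <input type="text" name="email">
</form>
HTML;

$hidden = extractHiddenFields($html);
```

The returned array can then be combined with the visible fields (here, `email`) when building the body of the follow-up request.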
Principle 2: Choose runtime tradeoffs wisely
HTTP client/scraping library
- Easy to integrate
- Probably fast
- Painful to run JS in...
- ...unless you're using JS libs...
- ...so you'll need to reverse-engineer more
Real browser
- Runs JS
- Interact as you would a real browser
- Plenty of tooling
- Very resource intensive
- May have a distinct signature
- Have to run it somewhere
Principle 3: Try to hit edge cases during dev
Tool #1: Firefox
- Goal: Pull my insurance policy information
- Target: Single page app behind passwordless authentication
- Tool: Guzzle (with cookie jar turned on)
- See the code
...but you probably won't use an interactive CLI
Save/restore cookies between requests
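One way to save and restore cookies between runs is Guzzle's `FileCookieJar`, which reads a JSON file when it is constructed and writes it back when the jar is destroyed. The sketch below assumes Guzzle is installed through Composer. The helper name `makeScraperClient`, the cookie file path, and the base URI are hypothetical and not taken from the demo code.

```php
<?php
declare(strict_types=1);

use GuzzleHttp\Client;
use GuzzleHttp\Cookie\FileCookieJar;

// Assumes Guzzle was installed with Composer (guzzlehttp/guzzle);
// load the autoloader if it sits next to this script.
if (is_file(__DIR__ . '/vendor/autoload.php')) {
    require_once __DIR__ . '/vendor/autoload.php';
}

/**
 * Build a client whose cookies are loaded from, and written back to, a JSON file.
 */
function makeScraperClient(string $cookieFile, string $baseUri): Client
{
    // Second argument true: also persist session cookies (those without an
    // expiry), which is where many login sessions live.
    $jar = new FileCookieJar($cookieFile, true);

    return new Client([
        'base_uri' => $baseUri,
        'cookies'  => $jar, // every request through this client shares the jar
    ]);
}
```

A call such as `makeScraperClient('/tmp/cookies.json', 'https://example.com')` would then return a client that picks up where the previous run left off, as long as the stored session is still valid on the server side.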
You may need to look at what gets stored in localStorage/sessionStorage.
If you need to use a real browser...
- Puppeteer - driver for Headless Chrome, rather easy to Dockerize
- Selenium - well-known framework that I've never needed to use
- PhantomJS - abandoned; don't use this anymore
Sometimes sites don't want you to scrape
- User agent detection
- IP detection/greylisting/blacklisting
- Browser fingerprinting
- User activity fingerprinting
- Visual CAPTCHAs
- Obfuscated JS
- Use unminify from npm
- --safety wildly-unsafe is usually fine
Leveling Up Your Web Scraping Game - July 2021
By Ian Littman