What we'll cover
- Key principles for scraping efficiently
- How to scrape in PHP (with a demo!)
- What to do when PHP isn't enough
Principle 1: It's all about the data
That data gets to a normal user over HTTPS (at least in most cases)
- Step 1: What request(s) do I need to make to get responses including my data?
- Step 2: How do I get to the point where I can make those requests?
  - Cookies
  - Local storage
  - Calculated request values
  - Form submission values (e.g. CSRF, viewstate)
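As a rough illustration of the "form submission values" item, here is a minimal sketch that pulls hidden inputs (such as a CSRF token or a viewstate blob) out of a fetched page so they can be sent back with the next request. It uses only PHP's bundled DOM extension; the function name `extractHiddenFields`, the sample HTML, and the field names are hypothetical and not taken from the talk's demo code.

```php
<?php
declare(strict_types=1);

/**
 * Collect name => value pairs for every hidden <input> that has a name.
 * Hypothetical helper for illustration only.
 */
function extractHiddenFields(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//input[@type="hidden"][@name]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }

    return $fields;
}

// Assumed example markup; a real page would come from an HTTP response body.
$html = <<<'HTML'
<form method="post" action="/login">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTA4">
  <input type="text" name="email">
</form>
HTML;

// Merge the scraped hidden values with the values a user would type in.
$fields = extractHiddenFields($html);
$fields['email'] = 'me@example.com';

// This string is what would be sent as the form-encoded POST body.
echo http_build_query($fields), PHP_EOL;
```

The point of the sketch is the order of operations: fetch the page that renders the form, copy whatever hidden values it issued, then submit them unchanged alongside your own input.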
Principle 2: Choose runtime tradeoffs wisely
HTTP client/scraping library
- Pros
  - Easy to integrate
  - Efficient
  - Probably fast
- Cons
  - Painful to run JS in...
  - ...unless you're using JS libs...
  - ...so you'll need to reverse-engineer more
Headless browser
- Pros
  - Runs JS
  - Interact as you would with a real browser
  - Plenty of tooling
- Cons
  - Very resource intensive
  - May have a distinct signature
  - Have to run it somewhere
Principle 3: Try to hit edge cases during dev
Tool #1: Firefox
Demo: Hippo
- Goal: Pull my insurance policy information
- Target: Single page app behind passwordless authentication
- Tool: Guzzle (with cookie jar turned on)
- See the code
...but you probably won't use an interactive CLI
Save/restore cookies between requests
You may need to look at what gets stored in localStorage/sessionStorage.
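When the scraper is not one long-running interactive process, cookies have to survive between runs. With Guzzle, the `FileCookieJar` class is one way to do that. The dependency-free sketch below shows the underlying idea instead: parse `Set-Cookie` values, write them to disk, and rebuild a `Cookie` header later. All function names, the file path, and the sample header values are hypothetical, and the sketch deliberately ignores expiry, domain, and path attributes, which a real cookie jar must respect.

```php
<?php
declare(strict_types=1);

/** Extract cookie name => value pairs from raw Set-Cookie header values. */
function cookiesFromHeaders(array $setCookieHeaders): array
{
    $cookies = [];
    foreach ($setCookieHeaders as $header) {
        // The first "name=value" segment is the cookie; the rest are attributes.
        $pair = explode(';', $header, 2)[0];
        if (strpos($pair, '=') === false) {
            continue;
        }
        [$name, $value] = explode('=', $pair, 2);
        $cookies[trim($name)] = trim($value);
    }
    return $cookies;
}

/** Persist cookies as JSON so a later run can pick them up. */
function saveCookies(string $path, array $cookies): void
{
    file_put_contents($path, json_encode($cookies, JSON_THROW_ON_ERROR));
}

/** Load previously saved cookies, or an empty set on the first run. */
function loadCookies(string $path): array
{
    if (!is_file($path)) {
        return [];
    }
    return json_decode(file_get_contents($path), true, 512, JSON_THROW_ON_ERROR);
}

/** Build the value of a Cookie request header. */
function cookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return implode('; ', $pairs);
}

// Run 1: merge cookies from a (made-up) response into the saved set.
$path = sys_get_temp_dir() . '/scraper-cookies.json';
$cookies = array_merge(loadCookies($path), cookiesFromHeaders([
    'session=abc123; Path=/; HttpOnly',
    'locale=en-US; Path=/',
]));
saveCookies($path, $cookies);

// Run 2 (possibly a separate process): restore and send them back.
echo 'Cookie: ', cookieHeader(loadCookies($path)), PHP_EOL;
```

Storing the jar per account or per target site, rather than in one shared file, keeps sessions from leaking into each other.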
If you need to use a real browser...
- Puppeteer - driver for Headless Chrome, rather easy to Dockerize
- Selenium - well-known framework that I've never needed to use
- PhantomJS - abandoned; don't use this anymore
Sometimes sites don't want you to scrape
- User agent detection
- IP detection/greylisting/blacklisting
- Browser fingerprinting
- User activity fingerprinting
- Visual CAPTCHAs
- Obfuscated JS
  - Use unminify from npm
  - --safety wildly-unsafe is usually fine
- ian.im/scrape0721 - these slides
- ian.im/scrape0621 - a longer version of this talk
- github.com/iansltx/web-scraping-demos - companion code
- twitter.com/iansltx - me, online
- covie.com - my company, online
Thanks! Questions?
Leveling Up Your Web Scraping Game - July 2021
By Ian Littman