We're Hiring!
What we'll cover
- Key principles for scraping efficiently
- How to scrape in PHP, with demos!
  - Server-side-rendered HTML
  - Single page/API driven apps
- What to do when PHP isn't enough
- What about other file types?
Principle 1: It's all about the data
That data gets to a normal user over HTTPS (at least in most cases)
- Step 1: What request(s) do I need to make to get responses including my data?
- Step 2: How do I get to the point where I can make those requests?
  - Cookies
  - Local storage
  - Calculated request values
  - Form submission values (e.g. CSRF, viewstate)
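As a rough illustration of the "form submission values" bullet: many server-rendered forms embed a CSRF token or viewstate in hidden inputs, and the follow-up POST has to send them back. The sketch below pulls those values out with PHP's built-in DOM extension. The markup, the field names (`_csrf`, `__VIEWSTATE`, `email`), and the `extractHiddenFields` helper are all made up for the example and are not taken from any real site.

```php
<?php
// Collect every hidden <input> in the first <form> so the values (CSRF token,
// viewstate, and so on) can be replayed in the follow-up POST.
function extractHiddenFields(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('(//form)[1]//input[@type="hidden"]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

// Hypothetical login page markup; the field names are assumptions.
$html = <<<HTML
<form method="post" action="/login">
  <input type="hidden" name="_csrf" value="abc123">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTA4">
  <input type="text" name="email">
</form>
HTML;

$fields = extractHiddenFields($html);

// Merge the hidden values with the fields a user would type, ready to POST.
$postBody = http_build_query($fields + ['email' => 'me@example.com']);
```

The same idea applies to "calculated request values", except that the value comes from reading the site's JS rather than its HTML.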
Principle 2: Choose runtime tradeoffs wisely
PHP
- Pros
  - Easy to integrate
  - Efficient
  - Probably fast
- Cons
  - Painful to run JS in...
  - ...so you'll need to reverse-engineer more
Headless Browser
- Pros
  - Runs JS
  - Interact as you would a real browser
  - Plenty of tooling
- Cons
  - Very resource intensive
  - May have a distinct signature
  - Have to run it somewhere
Principle 3: Try to hit edge cases during dev
Tool #1: Firefox
Demo #1: LocalCallingGuide
- Goal: Pull prefix list for a city, accounting for pagination
- Target: Server side rendered PHP app
- Tool: Symfony HttpBrowser
- See the code
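The companion repo has the real demo. Separately from that, here is a minimal sketch of the general "follow the Next link until it disappears" pattern, using only PHP's built-in DOM extension. The table layout, the link text "Next", the `/page1` URLs, and the helper names are assumptions for illustration and do not describe LocalCallingGuide's actual markup. The fetcher is passed in as a callable so the loop can be tried offline against canned pages.

```php
<?php
// Parse possibly-invalid HTML and return an XPath handle for querying it.
function loadXPath(string $html): DOMXPath
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return new DOMXPath($doc);
}

// Collect the first table column from every page, following "Next" links.
// $fetch is any callable mapping a URL to an HTML string. This sketch assumes
// the href can be handed to $fetch as-is; real code would resolve relative URLs.
function scrapeAllPages(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $rows = [];
    $url = $startUrl;
    // $maxPages guards against a pagination loop that never ends.
    for ($page = 0; $url !== null && $page < $maxPages; $page++) {
        $xpath = loadXPath($fetch($url));
        foreach ($xpath->query('//table//tr/td[1]') as $cell) {
            $rows[] = trim($cell->textContent);
        }
        $next = $xpath->query('//a[normalize-space(.)="Next"]/@href')->item(0);
        $url = $next !== null ? $next->nodeValue : null;
    }
    return $rows;
}

// Canned pages standing in for a live site (hypothetical data).
$pages = [
    '/page1' => '<table><tr><td>512-200</td><td>Austin</td></tr></table><a href="/page2">Next</a>',
    '/page2' => '<table><tr><td>512-201</td><td>Austin</td></tr></table>',
];

$prefixes = scrapeAllPages('/page1', fn (string $url) => $pages[$url]);
```

Against a live site, the callable would wrap whatever HTTP client you already use. Testing with a last page that has no "Next" link is one cheap way to hit a pagination edge case during development.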
Demo #2: Hippo
- Goal: Pull my insurance policy information
- Target: Single page app behind passwordless authentication
- Tool: Guzzle (with cookie jar turned on)
- See the code
...but you probably won't use an interactive CLI
Save/restore cookies between requests
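To make the idea concrete, here is a deliberately simplified sketch that keeps cookies in a JSON file between runs. It tracks only name and value and ignores domain, path, expiry, and flags, so treat it as an illustration of the save/restore cycle rather than a replacement for a real cookie jar. The function names and the sample headers are hypothetical. The header list has the same shape as the raw response header lines that PHP's HTTP stream wrapper exposes.

```php
<?php
// Fold any Set-Cookie response headers into a name => value map.
// Simplification: attributes such as Path, Expires, and HttpOnly are dropped.
function absorbSetCookieHeaders(array $cookies, array $responseHeaders): array
{
    foreach ($responseHeaders as $header) {
        if (stripos($header, 'Set-Cookie:') !== 0) {
            continue;
        }
        // Keep only the leading "name=value" pair.
        $pair = trim(explode(';', substr($header, strlen('Set-Cookie:')), 2)[0]);
        [$name, $value] = array_pad(explode('=', $pair, 2), 2, '');
        $cookies[$name] = $value;
    }
    return $cookies;
}

// Build the Cookie request header for the next request.
function cookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return 'Cookie: ' . implode('; ', $pairs);
}

function saveCookies(string $file, array $cookies): void
{
    file_put_contents($file, json_encode($cookies));
}

function loadCookies(string $file): array
{
    return is_file($file) ? (json_decode(file_get_contents($file), true) ?: []) : [];
}

// First run: a response sets cookies, and we persist them.
$file = tempnam(sys_get_temp_dir(), 'cookies');
$headers = [
    'HTTP/1.1 200 OK',
    'Set-Cookie: session=abc123; Path=/; HttpOnly',
    'Set-Cookie: theme=dark',
];
saveCookies($file, absorbSetCookieHeaders([], $headers));

// Later run: restore them and attach them to the next request.
$restored = loadCookies($file);
$headerLine = cookieHeader($restored);
```

If you are already on Guzzle, its `FileCookieJar` class covers the same cycle with proper attribute handling, so hand-rolling this is rarely necessary.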
You may need to look at what gets stored in localStorage/sessionStorage.
If you need to use a real browser...
- Puppeteer - driver for Headless Chrome, rather easy to Dockerize
- Selenium - well-known framework that I've never needed to use
- PhantomJS - abandoned; don't use this anymore
Sometimes sites don't want you to scrape them
- User agent detection
- IP detection/greylisting/blacklisting
- Browser fingerprinting
- User activity fingerprinting
- Visual CAPTCHAs
- Obfuscated JS
  - Use unminify from npm
  - --safety wildly-unsafe is usually fine
Sometimes your data isn't in HTML/JS/XML
- Excel
  - XLS/XLSX: PhpSpreadsheet
  - XLSB: Libraries exist in other languages
  - Encrypted: msoffcrypto Python module
- PDF
  - pdftotext
  - qpdf --decrypt
  - pdfgrep
  - imagemagick + tesseract
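The PDF tools above are command-line programs, so from PHP the usual route is to shell out to them. The sketch below only builds the command strings, with arguments escaped, so it runs without the tools installed. The helper names and file names are hypothetical, and it assumes `pdftotext` (from Poppler) and `qpdf` are on the PATH when you actually execute the commands.

```php
<?php
// Command to dump a PDF's text to stdout, keeping the page layout roughly intact.
// The trailing "-" tells pdftotext to write to stdout instead of a file.
function pdfToTextCommand(string $pdfPath): string
{
    return sprintf('pdftotext -layout %s -', escapeshellarg($pdfPath));
}

// Command to write a decrypted copy of a PDF. The password flag is only added
// when a password is supplied.
function qpdfDecryptCommand(string $inPath, string $outPath, string $password = ''): string
{
    $passwordFlag = $password !== '' ? '--password=' . escapeshellarg($password) . ' ' : '';
    return sprintf(
        'qpdf %s--decrypt %s %s',
        $passwordFlag,
        escapeshellarg($inPath),
        escapeshellarg($outPath)
    );
}

$decrypt = qpdfDecryptCommand('statement.pdf', 'statement-open.pdf');
$extract = pdfToTextCommand('statement-open.pdf');

// To actually run them (requires the tools to be installed):
// shell_exec($decrypt);
// $text = shell_exec($extract);
```

Escaping every path with `escapeshellarg` matters here because scraped file names are untrusted input.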
CFP closes Wednesday night
- ian.im/scrape0621 - these slides
- github.com/iansltx/web-scraping-demos - companion code
- twitter.com/iansltx - me, online
- ian@covie.io - inquire herein for employment
Thanks! Questions?
Levelling Up Your Web Scraping Game - NomadPHP June 2021
By Ian Littman