KIMONO
SCRAPING THE WEB
SCRAPING PRIMER
WEB CRAWLING
The process of iteratively finding and fetching web links, starting from a list of seed URLs.
WEB SCRAPING
The process of parsing a web document and extracting information from it.
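A minimal sketch of the two definitions above, using only the Python standard library. The class and function names, and the in-memory `FAKE_SITE` that stands in for real HTTP requests, are assumptions made for illustration; they are not part of Kimono.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Scraping: pull the <title> text and all <a href> targets out of one document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scrape(html):
    """Return (title, links) extracted from a single HTML document."""
    parser = LinkAndTitleParser()
    parser.feed(html)
    return parser.title.strip(), parser.links


def crawl(seed_urls, fetch, max_pages=100):
    """Crawling: breadth-first walk over links, starting from the seed URLs."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    titles = {}
    while queue and len(titles) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched, skip it
            continue
        title, links = scrape(html)
        titles[url] = title
        for href in links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return titles


# Hypothetical in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "http://example.com/": '<title>Home</title><a href="/a">A</a><a href="/b">B</a>',
    "http://example.com/a": '<title>Page A</title><a href="/">home</a>',
    "http://example.com/b": '<title>Page B</title><a href="/missing">gone</a>',
}

pages = crawl(["http://example.com/"], FAKE_SITE.get)
```

Here `scrape` handles one document, while `crawl` decides which documents to visit next; a real crawler would swap `FAKE_SITE.get` for a function that performs an HTTP request.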
ETHICS OF SCRAPING
"There is absolutely no technical difference between an automated computer viewing a website and a human-driven computer viewing a website."
ETHICAL CHALLENGES
- Affecting the experience of others by hitting the server too hard
- Certain uses of data may be copyright violations
- Breaking ToS is not illegal, but it may be considered a breach of contract
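One way to avoid hitting a server too hard is to honour its robots.txt and pause between requests. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `example-scraper` user agent, and the helper names are hypothetical, and a real scraper would download robots.txt from the target site instead of hard-coding it.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "example-scraper"  # hypothetical user agent name

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def allowed(url):
    """True if robots.txt permits this user agent to fetch the URL."""
    return robots.can_fetch(USER_AGENT, url)


def polite_delay(default=1.0):
    """Seconds to wait between requests: the site's Crawl-delay, else a default."""
    delay = robots.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


def fetch_all(urls, fetch, sleep=time.sleep):
    """Fetch permitted URLs one at a time, pausing between requests."""
    results = {}
    delay = polite_delay()
    for url in urls:
        if not allowed(url):
            continue  # respect Disallow rules
        if results:
            sleep(delay)  # wait before every request after the first
        results[url] = fetch(url)
    return results
```

The `fetch` and `sleep` arguments are injectable so the pacing logic can be tried out without touching a live server.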
CREATE A SCRAPER
STEP ONE
DON'T CREATE A SCRAPER
... if copy/paste is faster
DON'T CREATE A SCRAPER
... if there is an API
OK, CREATE A SCRAPER
with Kimono, a web-based scraping tool
... but only if
- The source you are scraping is reasonably well structured and cleanly coded
- The content is static, not generated dynamically with JavaScript or AJAX calls
- You don't mind your work being public
- You don't have complicated auth requirements
- You don't mind relying on a third-party service
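A rough way to test the "static content" condition above is to check whether the text you see in the browser is already present in the raw HTML the server sends, outside of any script. This is a heuristic sketch only, with hypothetical names and sample pages; it is not how Kimono itself decides.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text from the markup, ignoring <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def is_in_static_markup(raw_html, expected_text):
    """True if expected_text appears in the served HTML without running JavaScript."""
    parser = TextExtractor()
    parser.feed(raw_html)
    text = " ".join(" ".join(parser.chunks).split())  # normalise whitespace
    return expected_text in text


# Hypothetical pages: one served with its data, one that renders it client-side.
STATIC_PAGE = "<ul><li>Price: 42</li></ul>"
DYNAMIC_PAGE = '<div id="app"></div><script>render("Price: 42")</script>'
```

If the check fails for data you can see in the browser, the page is probably filled in by JavaScript after loading.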
#builtwithkimono
API SETUP
START PAGE
RECOGNISE SIMILAR DATA
STRUCTURED DATA
MANUALLY CORRECT IF NEEDED
PREVIEW API END POINT
SET A SCHEDULE
SET A CRAWL STRATEGY
USE YOUR API
ACCESS AS JSON / CSV / RSS
SYNC WITH A GOOGLE SHEET
ADD A WIDGET TO YOUR SITE
DISTRIBUTE AS AN APP
TRANSFORM THE DATA
GET SEED URLS FOR 2ND SCRAPER
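Once the API returns JSON, transforming it and pulling out seed URLs for a second scraper takes only a few lines. The sketch below assumes a payload with a `results` object holding a `collection1` list whose link properties carry `text` and `href`; that shape, the field names, and the sample data are assumptions for illustration, so adjust them to whatever your own endpoint returns.

```python
import csv
import io
import json

# Hypothetical API response, inlined so the sketch runs offline.
PAYLOAD = """
{
  "name": "example-api",
  "results": {
    "collection1": [
      {"title": {"text": "First post", "href": "http://example.com/posts/1"}, "score": "12"},
      {"title": {"text": "Second post", "href": "http://example.com/posts/2"}, "score": "7"},
      {"title": {"text": "First post", "href": "http://example.com/posts/1"}, "score": "12"}
    ]
  }
}
"""


def rows(payload):
    """Transform: flatten each result and convert the score to an integer."""
    data = json.loads(payload)
    for item in data["results"]["collection1"]:
        yield {
            "title": item["title"]["text"],
            "url": item["title"]["href"],
            "score": int(item["score"]),
        }


def seed_urls(payload):
    """Unique link targets, in order, to use as start pages for a second scraper."""
    urls = []
    for row in rows(payload):
        if row["url"] not in urls:
            urls.append(row["url"])
    return urls


def to_csv(payload):
    """Render the flattened rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "url", "score"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows(payload))
    return buffer.getvalue()
```

Calling `seed_urls(PAYLOAD)` gives the de-duplicated list of post URLs, which could then be fed to a second scraper as its start pages.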
DEMO
FREE TIER
- Crawl up to 10,000 with a single API
- Access your data in standard formats JSON/CSV/RSS
- Email alerts and webhooks
- Access to the past 30 days of historic data
- Integrations with Google Sheets and WordPress
BUSINESS TIER
- Probably not for you, but offers
- Private APIs
- Auth support (currently in Beta and also available for free)
- Change Detection
- Outsourced API creation and maintenance
PRICING
TALK BY
Mart van de Ven
m@type.hk
@tijptjik
Kimono - Scraping the Web
By Mart van de Ven
Guide to Kimono, a Visual Web Scraper