Guy Freeman, Data Guru Limited, 1st December 2018
When you visit a website, you actually download HTML code from a computer (usually called a server), which your browser converts into a web page.
If you want to collect data from a web site, you can program a computer to visit web pages automatically (that is, download their HTML) and extract the data directly from the HTML. That's it!
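To make the "extract the data directly from the HTML" step concrete, here is a minimal sketch using only Python's standard library. The HTML snippet, the `person` class name, and the helper names are all made up for illustration; a real page will have its own structure, and a real scraper would download the HTML first instead of using an inline string.

```python
from html.parser import HTMLParser

# Hypothetical HTML standing in for a downloaded page. A real scraper would
# fetch it first, e.g. urllib.request.urlopen(url).read().decode("utf-8").
SAMPLE_HTML = """
<html><body>
  <h1>Wanted</h1>
  <ul>
    <li class="person">Alice Example</li>
    <li class="person">Bob Example</li>
    <li class="note">Not a person</li>
  </ul>
</body></html>
"""


class PersonExtractor(HTMLParser):
    """Collect the text of every <li class="person"> element."""

    def __init__(self):
        super().__init__()
        self.in_person = False  # True while inside a matching <li>
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "li" and dict(attrs).get("class") == "person":
            self.in_person = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_person = False

    def handle_data(self, data):
        # Keep only non-blank text found inside a matching <li>
        if self.in_person and data.strip():
            self.names.append(data.strip())


def extract_names(html):
    """Return the list of names found in the given HTML string."""
    parser = PersonExtractor()
    parser.feed(html)
    return parser.names


print(extract_names(SAMPLE_HTML))
```

Everything that is not a `person` item (the heading, the note, the layout) is thrown away, and only a plain Python list of names is left.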
One benefit is that the endlessly varied structure of web pages is boiled down to just the data you are interested in, which you can then data science the s&!t out of.
Although Hong Kong is slowly making structured public data available via APIs, e.g. through data.gov.hk, much public data is still only available through web sites, or even through PDFs. Web scraping allows us to collect, analyse, and use the data for whatever purposes we desire.
Alas, due to Hollywood (maybe), the FBI's Ten Most Wanted Fugitives is far more famous than our ICAC's Wanted Person list. Let's fix that by scraping the list on ICAC's web site and making an action movie from it (maybe).
fbi.gov/wanted/topten: LAME
icac.org.hk/en/law/wanted/: AWESOME
Find list of Most Wanted
Scrape data from each Most Wanted
using SelectorGadget
We actually now know enough to scrape! I will now demonstrate LIVE how easy it is to write and execute a scraper using the Python library Scrapy to extract clean, structured information about ICAC's scary Most Wanted!