an indelicate and hopefully thoughtful story about serious procrastination while looking for a flat to rent
I want to find *nice* flats with data pipeline magic, so...
There are some potential issues:
The same information can be extracted from different sources, but each source has its ups and downs.
"flatsite": a website containing information about a flat
"flatpoint": a website containing a list of flats to buy or rent
Parseable data:
Price
Geolocation
Rooms
Surface
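Those four fields can be sketched as a single record type; the class and field names below are my own, not any real schema:

```python
from dataclasses import dataclass

@dataclass
class Flat:
    """One scraped flat listing; names and units are hypothetical."""
    price: float                       # monthly rent, in the site's currency
    geolocation: tuple[float, float]   # (latitude, longitude)
    rooms: int
    surface: float                     # square meters

flat = Flat(price=950.0, geolocation=(41.39, 2.17), rooms=3, surface=72.0)
print(flat.rooms)  # → 3
```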
Separating *what* to crawl from *how* to crawl it helps, because:
Websites protect themselves against intensive crawling.
Tor seems like an alternative for anonymizing oneself, but it is usually actively banned by website providers.
"Headless Chrome" might be a feasible alternative to programmatic APIs for simulating a browser.
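Staying under a site's anti-crawling radar mostly means not hammering it. A minimal sketch of a per-host throttle (the class and delay value are my own; a real crawler would also honor robots.txt):

```python
import time

class PoliteThrottle:
    """Enforce a minimum delay between requests to the same host."""
    def __init__(self, min_delay_seconds=2.0):
        self.min_delay = min_delay_seconds
        self.last_hit = {}  # host -> timestamp of the last request

    def wait(self, host):
        """Block until at least min_delay has passed since the last hit on host."""
        now = time.monotonic()
        elapsed = now - self.last_hit.get(host, float("-inf"))
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self.last_hit[host] = time.monotonic()

throttle = PoliteThrottle(min_delay_seconds=0.1)
throttle.wait("flatsite.example")  # first hit: returns immediately
throttle.wait("flatsite.example")  # second hit: sleeps ~0.1 s first
```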
It all begins with someone sending data over the wire...
Redis stores the already visited flatpoints and flatsites, so nothing is crawled twice.
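The dedup check is a plain set membership test. With redis-py it would be `redis_client.sadd("visited", url) == 1` (SADD returns 1 for a new member, 0 for a known one); below is an in-memory stand-in with the same check-and-mark semantics:

```python
class VisitedSet:
    """In-memory stand-in for the Redis set of visited URLs."""
    def __init__(self):
        self._seen = set()

    def mark_if_new(self, url):
        """Return True (and remember url) if it was never visited, else False."""
        if url in self._seen:
            return False
        self._seen.add(url)
        return True

visited = VisitedSet()
print(visited.mark_if_new("https://flatsite.example/flat/42"))  # → True
print(visited.mark_if_new("https://flatsite.example/flat/42"))  # → False
```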
then this raw data is sent to a hardened, back-pressured, *distributed* temporary storage...
where it waits to be consumed by a Storm topology that performs feature extraction on the raw data...
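Feature extraction here boils down to pulling structured fields out of raw markup. A sketch of what one such extraction step might look like (the HTML snippet and regexes are invented for illustration; a real flatsite's markup will differ):

```python
import re

# Hypothetical markup, not from any real site.
raw_html = '<div class="price">1.250 €/month</div><div class="rooms">3 rooms</div>'

def extract_features(html):
    """Pull price and room count out of raw HTML with regexes."""
    price_match = re.search(r'([\d.,]+)\s*€', html)
    rooms_match = re.search(r'(\d+)\s*rooms?', html)
    price = None
    if price_match:
        # normalize "1.250" (European thousands separator) to 1250.0
        price = float(price_match.group(1).replace(".", "").replace(",", "."))
    return {
        "price": price,
        "rooms": int(rooms_match.group(1)) if rooms_match else None,
    }

print(extract_features(raw_html))  # → {'price': 1250.0, 'rooms': 3}
```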
to be heavily munched by a Spark job that aggregates all the data by different dimensions and criteria...
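One of those aggregations, in miniature: mean price per number of rooms. The toy records are invented; the real job would express this with Spark's groupBy/agg over the full dataset rather than a dict:

```python
from collections import defaultdict

# Toy extracted records for illustration.
flats = [
    {"rooms": 2, "price": 800.0},
    {"rooms": 3, "price": 1200.0},
    {"rooms": 3, "price": 1000.0},
]

def average_price_by_rooms(records):
    """Aggregate: average price keyed by number of rooms."""
    totals = defaultdict(lambda: [0.0, 0])  # rooms -> [price sum, count]
    for r in records:
        totals[r["rooms"]][0] += r["price"]
        totals[r["rooms"]][1] += 1
    return {rooms: s / n for rooms, (s, n) in totals.items()}

print(average_price_by_rooms(flats))  # → {2: 800.0, 3: 1100.0}
```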
to finally visualize it in a nice Angular-based SPA,
backed by a MongoDB
All of this runs on a single physical machine, in some Vagrant-powered VMs provisioned with Structor.
After a catastrophic failure caused by running out of disk space, I had to rethink the whole architecture:
Change HBase to Elasticsearch
Change Storm to Flink
Change Structor to Construct