How to value millions of Hong Kong flats using AI

Guy Freeman, 10th July 2019

Before AI: get the data

While machine learning is all the rage, unfortunately real life isn't Kaggle

  • Most of the training of data scientists involves learning a multitude of models and how to run them, from linear models in R or Python notebooks to deep learning on GPUs in the cloud
  • That's all very well, but after many years of using data science in academia and industry to generate value, I want to reiterate the dirty secret: 90% of our time is spent on data management

In 2014, the New York Times already exposed the truth 😳

https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

So now what?

Where do we get data around home prices and other building characteristics?

  • Ideally, at least some of this data would be available as Open Data
  • The government's Land Registry does make available daily property transaction information that is registered with them (as all transactions must be)... for HK$19,100/month.
  • This is the same information available on large HK estate agents' websites

Open Data

From https://opendefinition.org/:

“Open data and content can be freely used, modified, and shared by anyone for any purpose”

From http://opendatahandbook.org/guide/en/what-is-open-data/:

  • Availability and Access: the data must be available as a whole and at no more than a reasonable reproduction cost
  • Re-use and Redistribution: the data must be provided under terms that permit re-use
  • Universal Participation: everyone must be able to use, re-use and redistribute

Hong Kong and Open Data

  • Hong Kong claims to want to be a Smart City; they even have a PDF about this: https://www.smartcity.gov.hk/doc/HongKongSmartCityBlueprint(EN).pdf

  • Hong Kong government does have an Open Data portal nowadays: https://data.gov.hk/

What data is available now, and where?

  • Property prices paid in actual transactions: Land Registry, via large HK estate agents
  • Hong Kong residential building details: scrape HK estate agent websites
  • How to search HK estate agent websites? (There is no list of buildings.) Search for every street! (see the sketch after this list)
  • HK street names? Open data!
  • List of residential buildings in Hong Kong? Not open data... and not very good anyway!
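
As a concrete illustration of "search for every street", here is a minimal sketch of reading street names from an open dataset and turning them into per-street search queries. The CSV layout and the estate agent search URL are hypothetical placeholders, not the actual files or endpoints used for this project.

    # Hypothetical sketch: street names from open data -> per-street search URLs
    import csv

    def load_street_names(path="hk_street_names.csv"):
        """Read unique English street names from an open-data CSV (assumed layout)."""
        with open(path, newline="", encoding="utf-8") as f:
            return sorted({row["street_name_en"].strip() for row in csv.DictReader(f)})

    def build_search_urls(streets):
        """Turn each street name into a search URL for a (hypothetical) agent site."""
        base = "https://www.example-estate-agent.hk/search?street={}"
        return [base.format(street.replace(" ", "+")) for street in streets]

    if __name__ == "__main__":
        for url in build_search_urls(load_street_names())[:5]:
            print(url)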

What is web scraping?

When you visit a website, you actually download HTML code from a computer (usually called a server), which your browser converts into a web page.
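
To make this concrete, here is a minimal scraping sketch using requests and BeautifulSoup. The URL structure and the CSS selector are assumptions; a real estate agent site needs its own selectors, polite rate limiting, and error handling.

    # Hypothetical sketch: download a search-results page and pull out building links
    import requests
    from bs4 import BeautifulSoup

    def scrape_buildings(search_url):
        """Fetch the HTML behind one search page and extract building names and links."""
        response = requests.get(search_url, timeout=30)
        response.raise_for_status()                    # treat HTTP errors as failures
        soup = BeautifulSoup(response.text, "html.parser")
        return [
            {"name": a.get_text(strip=True), "url": a["href"]}
            for a in soup.select("a.building-result")  # hypothetical CSS selector
        ]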

Web scraping resources

Data pipelines

  • Data can be usefully compared to water: it has a source, and from then on it flows
  • Just like with lakes and reservoirs and canals and aqueducts, we can and should control the flow to extract maximum value from the water data
  • Where the source of data is websites, the "well" might become polluted by changes in page structure, server errors, or input errors: we must monitor (see the sketch after this list)
  • Then the pipeline might get leaky
  • This is a surprisingly unsolved problem with open-source technology
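
Here is one possible shape for that monitoring step, run after every scrape; the field names and thresholds are assumptions for illustration only.

    # Hypothetical sketch: flag a "polluted well" before loading a scraped batch
    def validate_batch(records, expected_min=100):
        """Return human-readable problems found in one scraped batch."""
        problems = []
        if len(records) < expected_min:
            problems.append(f"only {len(records)} records; site structure may have changed")
        for i, rec in enumerate(records):
            if not rec.get("building_name"):
                problems.append(f"record {i}: missing building_name")
            price = rec.get("price_hkd")
            if price is not None and not 100_000 <= price <= 5_000_000_000:
                problems.append(f"record {i}: implausible price {price}")
        return problems  # a non-empty list should trigger an alert, not a silent load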

Data pipelines

  • Traditionally, this pipeline was called "ETL", for Extract, Transform, Load
  • The optimal ETL architecture will depend on the data use case: do we need real-time access? Is the source streaming?
  • In this case, I use Airflow to orchestrate mostly Python scripts that read the JSON files dumped into S3 by the scrapers and, after validation, update a "source of truth" PostgreSQL database (see the sketch after this list). I keep track of which JSON files have already been imported, so the database can be recreated easily.
  • I'm not suggesting this is the optimal way!
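
The sketch below shows the general shape of such an Airflow DAG: find new JSON files in S3, validate and load them into PostgreSQL, and record what was imported. The DAG id, bucket, table names, and helper logic are assumptions, and the real pipeline is not necessarily structured this way.

    # Hypothetical sketch of a daily S3 -> PostgreSQL ETL orchestrated by Airflow
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def extract_new_files():
        """List JSON files in the scrape bucket not yet recorded as imported."""
        pass  # e.g. boto3 list_objects_v2 + a lookup in an imported_files table

    def validate_and_load():
        """Validate each new file, upsert its rows into the source-of-truth tables,
        then mark the file as imported so the database can be rebuilt from scratch."""
        pass  # e.g. json.loads, checks like validate_batch(), INSERT ... ON CONFLICT

    default_args = {"owner": "data", "retries": 1, "retry_delay": timedelta(minutes=10)}

    with DAG(
        dag_id="property_transactions_etl",
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
        default_args=default_args,
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract_new_files", python_callable=extract_new_files)
        load = PythonOperator(task_id="validate_and_load", python_callable=validate_and_load)
        extract >> load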

After collecting, validating, transforming, and storing the data... we can now do some machine learning

We gathered the transaction amounts for almost 2 million transactions covering over 1.7 million "units" (flats or houses) in over 43,000 buildings, and used this data to build a statistical model for predicting the value of any given flat.

  • Currently I am using flat size, floor, age of building and location to estimate value.
  • Future features could include distance from nearest MTR station, property developer, facilities available...

Modelling results

The aim of this talk was to get data scientists to think of the end-to-end journey of data, from source to value creation and real-world decision-making, and not just the model... But I know you can't resist caring about the model! So here are the modelling details I can share:

  • Not a deep learning model: p is small, and n isn't that big either
  • Not a linear model either (see the illustrative sketch after this list)
  • I found I only needed a 10-15% sample of transactions to get a good result
  • I gave up on houses
  • I also gave up if I couldn't get lat/long of the building
  • No village houses in the dataset
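
The model itself is not disclosed, so as a stand-in the sketch below uses gradient-boosted trees (neither deep nor linear) on the features named earlier, trained on a ~15% sample; the column names and DataFrame layout are assumptions.

    # Illustrative stand-in only: the actual model used in this project is not disclosed
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    def train_valuation_model(transactions: pd.DataFrame):
        """Fit a regressor on a ~15% sample of transactions with assumed column names."""
        sample = transactions.sample(frac=0.15, random_state=42)
        features = ["saleable_area_sqft", "floor", "building_age_years", "lat", "lng"]
        X_train, X_test, y_train, y_test = train_test_split(
            sample[features], sample["price_hkd"], test_size=0.2, random_state=42
        )
        model = GradientBoostingRegressor().fit(X_train, y_train)
        return model, X_test, y_test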

Modelling results

  • Median error rates (see the sketch after this list):
    • Where construction date, saleable area and lat/long are available: 9%
    • Where saleable area and lat/long are available: 9%
    • Where only lat/long is available: 17%
  • 16,737 houses out of 43,359 buildings!?
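
One way to compute median error rates like those above is the median absolute percentage error, broken down by which inputs were available for each prediction; the column names here are assumptions.

    # Hypothetical sketch: median absolute percentage error by feature availability
    import numpy as np
    import pandas as pd

    def median_error_rate(actual: pd.Series, predicted: pd.Series) -> float:
        """Median absolute percentage error, e.g. 0.09 means a typical miss of 9%."""
        return float(np.median(np.abs(predicted - actual) / actual))

    def error_by_feature_availability(results: pd.DataFrame) -> pd.Series:
        """Break the median error down by which inputs each prediction had available."""
        groups = results.groupby(["has_construction_date", "has_saleable_area"])
        return groups.apply(lambda g: median_error_rate(g["actual"], g["predicted"]))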

Unfortunately, after months of scraping data, cleaning it, dealing with data quality, and building and deploying the model... my valuation app looked like this:

I clearly needed to admit this wasn't my specialty, and instead bring in the nearest Silicon Valley product manager I could find.
