How to value millions of Hong Kong flats using AI

Guy Freeman, 10th July 2019

Before AI: get the data

While machine learning is all the rage, unfortunately real life isn't Kaggle

  • Most of the training of data scientists involves learning a multitude of models and how to run them, from linear models in R or Python notebooks to deep learning on GPUs in the cloud
  • That's all very well, but after many years of using data science in academia and industry to generate value, I want to reiterate the dirty secret: 90% of our time is spent on data management

In 2014, the New York Times already exposed the truth 😳

https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

So now what?

Where do we get data around home prices and other building characteristics?

  • Ideally, at least some of this data would be available as Open Data
  • The government's Land Registry does make available daily property transaction information that is registered with them (as all transactions must be)... for HK$19,100/month.
  • This is the same information available on large HK estate agents' websites

Open Data

From https://opendefinition.org/:

“Open data and content can be freely used, modified, and shared by anyone for any purpose”

From http://opendatahandbook.org/guide/en/what-is-open-data/:

  • Availability and Access: the data must be available as a whole and at no more than a reasonable reproduction cost
  • Re-use and Redistribution: the data must be provided under terms that permit re-use
  • Universal Participation: everyone must be able to use, re-use and redistribute

Hong Kong and Open Data

  • Hong Kong claims to want to be a Smart City; they even have a PDF about this: https://www.smartcity.gov.hk/doc/HongKongSmartCityBlueprint(EN).pdf

  • Hong Kong government does have an Open Data portal nowadays: https://data.gov.hk/

What data is available now, and where?

  • Property prices paid in actual transactions: Land Registry, via large HK estate agents
  • Hong Kong residential building details: scrape HK estate agent websites
  • How to search HK estate agent websites? (There is no list of buildings.) Search for every street! (see the sketch after this list)
  • HK street names? Open data!
  • List of residential buildings in Hong Kong? Not open data... and not very good anyway!
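
As a concrete illustration of "search for every street", here is a minimal sketch of reading street names from an open dataset and turning them into per-street search queries. The CSV layout and the estate agent search URL are hypothetical placeholders, not the actual files or endpoints used for this project.

    # Hypothetical sketch: street names from open data -> per-street search URLs
    import csv

    def load_street_names(path="hk_street_names.csv"):
        """Read unique English street names from an open-data CSV (assumed layout)."""
        with open(path, newline="", encoding="utf-8") as f:
            return sorted({row["street_name_en"].strip() for row in csv.DictReader(f)})

    def build_search_urls(streets):
        """Turn each street name into a search URL for a (hypothetical) agent site."""
        base = "https://www.example-estate-agent.hk/search?street={}"
        return [base.format(street.replace(" ", "+")) for street in streets]

    if __name__ == "__main__":
        for url in build_search_urls(load_street_names())[:5]:
            print(url)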

What is web scraping?

When you visit a website, you actually download HTML code from a computer (usually called a server), which your browser converts into a web page.
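
To make this concrete, here is a minimal scraping sketch using requests and BeautifulSoup. The URL structure and the CSS selector are assumptions; a real estate agent site needs its own selectors, polite rate limiting, and error handling.

    # Hypothetical sketch: download a search-results page and pull out building links
    import requests
    from bs4 import BeautifulSoup

    def scrape_buildings(search_url):
        """Fetch the HTML behind one search page and extract building names and links."""
        response = requests.get(search_url, timeout=30)
        response.raise_for_status()                    # treat HTTP errors as failures
        soup = BeautifulSoup(response.text, "html.parser")
        return [
            {"name": a.get_text(strip=True), "url": a["href"]}
            for a in soup.select("a.building-result")  # hypothetical CSS selector
        ]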

Web scraping resources

Data pipelines

  • Data can be usefully compared to water: it has a source, and from then on it flows
  • Just like with lakes and reservoirs and canals and aqueducts, we can and should control the flow to extract maximum value from the water data
  • Where the source of data is websites, the "well" might become polluted by changes in page structure, server errors, or input errors: we must monitor (see the sketch after this list)
  • Then the pipeline might get leaky
  • This is a surprisingly unsolved problem with open-source technology
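
Here is one possible shape for that monitoring step, run after every scrape; the field names and thresholds are assumptions for illustration only.

    # Hypothetical sketch: flag a "polluted well" before loading a scraped batch
    def validate_batch(records, expected_min=100):
        """Return human-readable problems found in one scraped batch."""
        problems = []
        if len(records) < expected_min:
            problems.append(f"only {len(records)} records; site structure may have changed")
        for i, rec in enumerate(records):
            if not rec.get("building_name"):
                problems.append(f"record {i}: missing building_name")
            price = rec.get("price_hkd")
            if price is not None and not 100_000 <= price <= 5_000_000_000:
                problems.append(f"record {i}: implausible price {price}")
        return problems  # a non-empty list should trigger an alert, not a silent load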

Data pipelines

  • Traditionally, this pipeline was called "ETL", for Extract, Transform, Load
  • The optimal ETL architecture will depend on the data use case: do we need real-time access? Is the source streaming?
  • In this case, I use Airflow to orchestrate mostly Python scripts that read the JSON files dumped into S3 by the scrapers and, after validation, update a "source of truth" PostgreSQL database (see the sketch after this list). I keep track of which JSON files have already been imported, so the database can be recreated easily.
  • I'm not suggesting this is the optimal way!
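
The sketch below shows the general shape of such an Airflow DAG: find new JSON files in S3, validate and load them into PostgreSQL, and record what was imported. The DAG id, bucket, table names, and helper logic are assumptions, and the real pipeline is not necessarily structured this way.

    # Hypothetical sketch of a daily S3 -> PostgreSQL ETL orchestrated by Airflow
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def extract_new_files():
        """List JSON files in the scrape bucket not yet recorded as imported."""
        pass  # e.g. boto3 list_objects_v2 + a lookup in an imported_files table

    def validate_and_load():
        """Validate each new file, upsert its rows into the source-of-truth tables,
        then mark the file as imported so the database can be rebuilt from scratch."""
        pass  # e.g. json.loads, checks like validate_batch(), INSERT ... ON CONFLICT

    default_args = {"owner": "data", "retries": 1, "retry_delay": timedelta(minutes=10)}

    with DAG(
        dag_id="property_transactions_etl",
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
        default_args=default_args,
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract_new_files", python_callable=extract_new_files)
        load = PythonOperator(task_id="validate_and_load", python_callable=validate_and_load)
        extract >> load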

After collecting, validating, transforming, and storing the data... we can now do some machine learning

We gathered the transaction amounts for almost 2 million transactions covering over 1.7 million "units" (flats or houses) in over 43,000 buildings, and used this data to build a statistical model for predicting the value of any given flat.

  • Currently I am using flat size, floor, age of building and location to estimate value.
  • Future features could include distance from nearest MTR station, property developer, facilities available...

Modelling results

The aim of this talk was to get data scientists to think of the end-to-end journey of data, from source to value creation and real-world decision-making, and not just the model... But I know you can't resist caring about the model! So here are the modelling details I can share:

  • Not a deep learning model: p is small, and n isn't that big either
  • Not a linear model either (see the illustrative sketch after this list)
  • I found I only needed a 10-15% sample of transactions to get a good result
  • I gave up on houses
  • I also gave up if I couldn't get lat/long of the building
  • No village houses in the dataset
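
The model itself is not disclosed, so as a stand-in the sketch below uses gradient-boosted trees (neither deep nor linear) on the features named earlier, trained on a ~15% sample; the column names and DataFrame layout are assumptions.

    # Illustrative stand-in only: the actual model used in this project is not disclosed
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    def train_valuation_model(transactions: pd.DataFrame):
        """Fit a regressor on a ~15% sample of transactions with assumed column names."""
        sample = transactions.sample(frac=0.15, random_state=42)
        features = ["saleable_area_sqft", "floor", "building_age_years", "lat", "lng"]
        X_train, X_test, y_train, y_test = train_test_split(
            sample[features], sample["price_hkd"], test_size=0.2, random_state=42
        )
        model = GradientBoostingRegressor().fit(X_train, y_train)
        return model, X_test, y_test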

Modelling results

  • Median error rates (see the sketch after this list):
    • Where construction date, saleable area and lat/long are available: 9%
    • Where saleable area and lat/long are available: 9%
    • Where only lat/long is available: 17%
  • 16,737 houses out of 43,359 buildings!?
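
One way to compute median error rates like those above is the median absolute percentage error, broken down by which inputs were available for each prediction; the column names here are assumptions.

    # Hypothetical sketch: median absolute percentage error by feature availability
    import numpy as np
    import pandas as pd

    def median_error_rate(actual: pd.Series, predicted: pd.Series) -> float:
        """Median absolute percentage error, e.g. 0.09 means a typical miss of 9%."""
        return float(np.median(np.abs(predicted - actual) / actual))

    def error_by_feature_availability(results: pd.DataFrame) -> pd.Series:
        """Break the median error down by which inputs each prediction had available."""
        groups = results.groupby(["has_construction_date", "has_saleable_area"])
        return groups.apply(lambda g: median_error_rate(g["actual"], g["predicted"]))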

Unfortunately, after months of scraping data, cleaning it, dealing with data quality, and building and deploying the model... my valuation app looked like this:

I clearly needed to admit this wasn't my specialty, and instead bring in the nearest Silicon Valley product manager I could find.
