Opening Hong Kong public data to create new machine learning solutions

Guy Freeman, 20th February 2019

Before ML comes data

  • I get it, I'm a data scientist with a PhD in statistics: I just want to create whizz-bang artificial intelligence robots that trade the world's stock markets and win at every game ever devised by mankind.
  • But we all know what the dirty secret of data science / ML / AI is: we can't create the magic without data.
  • Obvious, right? But the dirtier secret is:

IN GENERAL, GETTING ENOUGH CLEAN DATA IS DAMN HARD

In 2014, the New York Times exposed the truth 😳

https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

Open Data

From https://opendefinition.org/:

“Open data and content can be freely used, modified, and shared by anyone for any purpose”

From http://opendatahandbook.org/guide/en/what-is-open-data/:

  • Availability and Access: the data must be available as a whole and at no more than a reasonable reproduction cost
  • Re-use and Redistribution: the data must be provided under terms that permit re-use
  • Universal Participation: everyone must be able to use, re-use and redistribute

Hong Kong and Open Data

  • Hong Kong claims to want to be a Smart City; they even have a PDF about this: https://www.smartcity.gov.hk/doc/HongKongSmartCityBlueprint(EN).pdf

  • The Hong Kong government does have an Open Data portal nowadays: https://data.gov.hk/

However, data.gov.hk is missing the juiciest datasets:

  • Companies Registry: details on which limited liability companies are registered, still alive, directors, accounts
  • Land Registry: who owns which piece of land or property, how much was paid, lease conditions

Even the data on data.gov.hk isn't as open as it could be: much of it is in Excel spreadsheets at best, or PDFs at worst!

Hong Kong and Opening Data

If open data isn't available via an easy method, we can go and create our own.

  • I started dataguru.hk as a public open data platform

Opening Data with dataguru.hk

Using web scraping, I have collected publicly available data and cleaned it up so that clients and the public can easily access it via an API:

  • Companies Registry
  • SFC licences
  • Disclosures of Interests (i.e. declared ownership stakes in listed companies)
  • Residential property transactions
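
As a rough illustration of the scrape-then-clean step, here is a minimal Python sketch using only the standard library. The sample HTML, the column names and the helper names (TableParser, table_to_records, clean_price) are all made up for this example. It is not dataguru.hk's actual code, and a real scraper would fetch pages over HTTP and cope with much messier markup.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every cell in every <tr> of an HTML page."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (layout whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace left over from the page layout.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Use the first row as field names and turn the other rows into dicts."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def clean_price(text):
    """Convert a string such as 'HK$ 5,200,000' to an integer.

    Assumes whole-dollar amounts; returns None if there are no digits.
    """
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits) if digits else None


# Made-up sample page, standing in for the body of an HTTP response.
SAMPLE = """
<table>
  <tr><th>Address</th><th>Price</th></tr>
  <tr><td> Flat A,  12/F </td><td>HK$ 5,200,000</td></tr>
  <tr><td>Flat B, 3/F</td><td>HK$ 4,750,000</td></tr>
</table>
"""

records = table_to_records(SAMPLE)
for record in records:
    record["Price"] = clean_price(record["Price"])
print(records)
```

Once the records are plain dictionaries with typed values, serving them as JSON from an API is the easy part; most of the work is in the cleaning.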

A similar project is webb-site.com, which focuses on HKEX matters and doesn't have an API.

Understanding Hong Kong with blog.dataguru.hk

Open Data is a democratic tool for understanding social phenomena. On Data Guru's blog I have revealed some truths that were otherwise hidden:

  • Flats containing the number 4 were cheaper
  • Who admits their stock market holdings as late as possible?
  • Why have some companies in Hong Kong changed their names up to 60 times?

Valuing Hong Kong homes with Open Data and some ML

Now that I've started to open some Hong Kong data, after analysing it, I can create ML solutions. I once had a horse race tipper, but I'll say no more about that... Today I will show my latest product, truehome.hk.

We gathered the amounts paid in 1.9 million transactions covering 1.7 million "units" (flats or houses) in over 43,000 buildings, and I am using this data to build statistical models that predict the value of any given flat.

  • Currently I am using flat size, floor, age of building and location to estimate value.
  • Future features could include distance from nearest MTR station, property developer, facilities available...
  • The website is already live: please test it now and give feedback!
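
To make the idea of estimating value from these features concrete, here is a minimal sketch of an ordinary least-squares regression in plain Python. It is not the model behind truehome.hk: the flats, district names, prices (in HK$ million) and coefficients below are all invented, and the prices are generated from the made-up coefficients so that the fit can be checked.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, targets):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


DISTRICTS = ["Kowloon", "Hong Kong Island"]  # the first one is the baseline


def features(size_sqft, floor, age_years, district):
    """Intercept, three numeric features, and a dummy per non-baseline district."""
    dummies = [1.0 if district == d else 0.0 for d in DISTRICTS[1:]]
    return [1.0, float(size_sqft), float(floor), float(age_years)] + dummies


# Invented flats: (size in sq ft, floor, building age in years, district).
flats = [
    (400, 5, 30, "Kowloon"),
    (550, 12, 10, "Kowloon"),
    (700, 20, 5, "Hong Kong Island"),
    (350, 3, 40, "Hong Kong Island"),
    (900, 30, 15, "Kowloon"),
    (600, 8, 25, "Hong Kong Island"),
    (480, 15, 2, "Kowloon"),
]

# Invented "true" coefficients, used only to generate toy prices (HK$ million).
TRUE = [1.0, 0.01, 0.02, -0.03, 2.0]

X = [features(*flat) for flat in flats]
prices = [sum(c * x for c, x in zip(TRUE, row)) for row in X]

beta = fit_ols(X, prices)

# Estimate the value of a flat that was not in the toy data.
new_flat = features(500, 10, 20, "Hong Kong Island")
estimate = sum(c * x for c, x in zip(beta, new_flat))
print(beta, estimate)
```

Because the toy prices are exactly linear in the features, the fit simply recovers the made-up coefficients. Real transaction data is noisy, and location in particular would need far richer treatment than a single district dummy.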

What next in Hong Kong for Open Data?

Open Data, by its very name, is not proprietary. Once it is collected and disseminated, it is a net win for everyone. Get in touch to:

  • suggest public data sources that are not totally Open yet
  • suggest new applications, especially ML applications, that can be built with Open Data... and not just signals for trading!
  • combine your private data with open data and sprinkle on some ML to get better business outcomes, e.g. more money/profit