A data scientist does some journalism (kinda)

Guy Freeman, 9th March 2019

A little bit about me

I'm not really a journalist in any sense

  • PhD in Statistics
  • Then epidemiologist postdoc at Hong Kong University
  • Then data scientist in startups and corporations
  • Including HK01 (on which more later)

But I love open data and other technical ways to help journalists do their jobs

Dataguru.hk:

Before AI / journalism, comes data

  • I get it, I'm a data scientist who just wants to create robots that trade the world's stock markets and win at every game ever devised by mankind, and journalists want to win the Pulitzer
  • But we all know what the dirty secret of both data science and investigative journalism is: we can't create the magic without data.
  • Obvious, right? But the dirtier secret is:

IN GENERAL, GETTING CLEAN DATA IS HARD

In 2014, the New York Times exposed the truth 😳

https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

The way forward: Open Data

From https://opendefinition.org/:

“Open data and content can be freely used, modified, and shared by anyone for any purpose”

From http://opendatahandbook.org/guide/en/what-is-open-data/:

  • Availability and Access: the data must be available as a whole and at no more than a reasonable reproduction cost
  • Re-use and Redistribution: the data must be provided under terms that permit re-use
  • Universal Participation: everyone must be able to use, re-use and redistribute

Hong Kong and Open Data

  • Hong Kong claims to want to be a Smart City; they even have a PDF about this: https://www.smartcity.gov.hk/doc/HongKongSmartCityBlueprint(EN).pdf

  • The Hong Kong government does have an Open Data portal nowadays: https://data.gov.hk/

However, data.gov.hk is missing the juiciest datasets:

  • Companies Registry: details on which limited liability companies are registered, still alive, directors, accounts
  • Land Registry: who owns which piece of land or property, how much was paid, lease conditions

Even the data on data.gov.hk isn't as open as it could be: much of it is in Excel spreadsheets at best, or PDFs at worst! These formats are very hard to use efficiently in computer code.
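
To make the "hard to use in code" point concrete, here is a minimal sketch of the janitor work a spreadsheet-style export tends to need. Everything in it is an assumption for illustration: the layout (title row, blank rows, footnote), the column names and the figures are made up and do not come from any real data.gov.hk file.

```python
import csv
import io

# Hypothetical spreadsheet export saved as CSV. The numbers are made up.
RAW_EXPORT = """Table 1: Example statistics (made-up),,
,,
District,Year,Population
Central and Western,2016,"240,000"
Wan Chai,2016,"180,000"
,,
Note: figures are illustrative only,,
"""


def tidy_rows(text, header_first_cell="District"):
    """Turn a human-formatted export into a list of typed records."""
    rows = list(csv.reader(io.StringIO(text)))
    # Skip the decorative title rows: the real header starts with a known cell.
    start = next(i for i, row in enumerate(rows) if row and row[0] == header_first_cell)
    header = [cell.strip().lower() for cell in rows[start]]
    records = []
    for row in rows[start + 1:]:
        if not any(cell.strip() for cell in row):
            break  # a blank row ends the data block; footnotes follow
        record = dict(zip(header, (cell.strip() for cell in row)))
        record["year"] = int(record["year"])
        # Thousands separators are for humans, not for code.
        record["population"] = int(record["population"].replace(",", ""))
        records.append(record)
    return records


records = tidy_rows(RAW_EXPORT)
```

With a clean CSV or JSON feed none of this would be necessary, which is the whole argument for properly open formats.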

Hong Kong and Opening Data

If open data isn't available via an easy method, we can go and create our own.

  • I started dataguru.hk as a public open data platform

Opening Data with dataguru.hk

Using web scraping, I have collected publicly available data and cleaned it up for clients and the public to easily access via API:

  • Companies Registry
  • SFC licences
  • Disclosures of Interests (i.e. declared ownership stakes in listed companies)
  • Residential property transactions
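
The scrape-then-serve idea above can be sketched roughly as follows. This is not dataguru.hk's actual code: the HTML snippet, the company names and the field names are all hypothetical, and a real scraper would also need to fetch pages, handle pagination and respect the source site's terms.

```python
import json
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Use the first table row as field names and the rest as records."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    keys = [name.lower().replace(" ", "_") for name in header]
    return [dict(zip(keys, row)) for row in body]


# Hypothetical page fragment; names and numbers are placeholders.
SAMPLE_HTML = """
<table>
  <tr><th>Company Name</th><th>Registry No</th><th>Status</th></tr>
  <tr><td> Example Trading Ltd </td><td>0000001</td><td>Live</td></tr>
  <tr><td>Sample Holdings Ltd</td><td>0000002</td><td>Dissolved</td></tr>
</table>
"""

records = table_to_records(SAMPLE_HTML)
# What an API endpoint could return for this page.
api_response = json.dumps(records, indent=2)
```

The point of the sketch is the shape of the pipeline: pages meant for human eyes go in, and structured records that any program can consume come out.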

 

A similar project is webb-site.com, which doesn't have an API, and is focused on HKEX matters.

Understanding Hong Kong with blog.dataguru.hk

Open Data is a democratic tool for understanding social phenomena. On Data Guru's blog I have revealed some truths that were otherwise hidden:

  • Flats containing the number 4 were cheaper
  • Who admits their stock market holdings as late as possible?
  • Why have some companies in Hong Kong changed their names up to 60 times?
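
As a rough illustration of the first question above, here is one way such a comparison could be set up. It is a sketch under assumptions, not the analysis from the blog: it assumes "containing the number 4" means a 4 in the floor or unit label, it ignores confounders such as building and date, and the four transactions are made-up toy values.

```python
from statistics import mean


def has_four(floor, unit):
    """True if the digit 4 appears in the floor number or unit label."""
    return "4" in f"{floor}{unit}"


def four_price_gap(transactions):
    """Relative gap in mean price per sq ft: '4' flats versus all other flats."""
    with_four = [t["price"] / t["sqft"] for t in transactions if has_four(t["floor"], t["unit"])]
    without_four = [t["price"] / t["sqft"] for t in transactions if not has_four(t["floor"], t["unit"])]
    return mean(with_four) / mean(without_four) - 1


# Toy transactions, invented purely to exercise the function.
TOY_TRANSACTIONS = [
    {"floor": 4, "unit": "A", "price": 5_700_000, "sqft": 500},
    {"floor": 12, "unit": "D", "price": 7_200_000, "sqft": 600},
    {"floor": 24, "unit": "B", "price": 4_560_000, "sqft": 400},
    {"floor": 15, "unit": "C", "price": 6_000_000, "sqft": 500},
]

gap = four_price_gap(TOY_TRANSACTIONS)  # negative means '4' flats are cheaper
```

A serious version would compare flats within the same building and period, so that the gap is not simply a proxy for location or age.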

 

This is about as journalistic as I can get.

Valuing Hong Kong homes with Open Data and some ML

Now that I've started to open some Hong Kong data, after analysing it, I can create ML solutions. I once had a horse race tipper, but I'll say no more about that... Today I will show my latest product, truehome.hk

We gathered the transaction amounts for 1.9 million transactions for 1.7 million "units" (flats or houses) from over 43,000 buildings, and I am building statistical models for predicting the value of any given flat with this data.

  • Currently I am using flat size, floor, age of building and location to estimate value.
  • Future features could include distance from nearest MTR station, property developer, facilities available...
  • Website is already live, please test now and give feedback!
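
The feature list above can be sketched as a small linear regression on size, floor, building age and a location flag. This is an assumption-laden toy, not the truehome.hk model: the coefficients, the "on Hong Kong Island" flag and the eight flats are synthetic, generated from a known formula so that the fit can be checked.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; cannot fit")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, prices):
    """Ordinary least squares via the normal equations (intercept added)."""
    xs = [[1.0, *row] for row in rows]
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, prices)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, row):
    """Estimated price for one flat given fitted coefficients."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Synthetic flats: (size in sq ft, floor, building age in years, on HK Island flag).
FLATS = [
    (400, 5, 10, 0), (550, 12, 25, 1), (700, 30, 3, 0), (320, 2, 40, 1),
    (900, 18, 15, 1), (480, 25, 8, 0), (650, 9, 30, 0), (1000, 40, 1, 1),
]
# Made-up "true" relationship, in HK$ million, used only to generate toy prices.
TRUE_COEFS = [1.0, 0.012, 0.02, -0.03, 2.0]
PRICES = [predict(TRUE_COEFS, flat) for flat in FLATS]

fitted = fit_ols(FLATS, PRICES)
estimate = predict(fitted, (600, 20, 12, 1))  # value a flat not in the data
```

Real transaction data is noisy and far from linear, so in practice one would reach for a statistics library and richer location features; the sketch only shows how the listed features feed a price estimate.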

What next in Hong Kong for Open Data?

Open Data, by its very name, is not proprietary. Once it is collected and disseminated, it is a net win for everyone. Get in touch to:

  • suggest public data sources that are not totally Open yet
  • suggest new applications, especially ML applications, that can be built with Open Data... and not just signals for trading!
  • combine your private data with open data and sprinkle on some ML to get better outcomes, e.g. more money/profit/insights

accessinfo.hk: Opening up Right to Information

From https://www.access-info.org/right-to-know:

In a democracy it is essential that people can access a wide range of information in order to participate in a real and effective way in the matters that affect them.

Public bodies are – or should be – acting as “servants of the people”. That’s why we all have the right of access to the information held by public bodies on our behalf.

International standards and jurisprudence have confirmed that this information belongs to the public.

Right to Information

How good is Hong Kong's RTI law?

We don't know yet, because Hong Kong doesn't yet have an RTI law -- but there might be one soon!

However, Hong Kong already has a framework implementing RTI

Hong Kong has "a formal framework for access to information held by government departments":

the Code on Access to Information

which is available at access.gov.hk

What can I request under the Code?

You can ask for any information from any Government department!

Except when it's exempt. Clear?

1.14: The Code does not oblige departments to -
  • acquire information not in their possession
  • create a record which does not exist
  • provide on request information which is already published, either free or at a charge, or
  • provide information available through an existing charged service.

So how do I make a request under this Code?

1.11 Written requests may be made by letter [...] and should be addressed to the Access to Information Officer of the department concerned.

That was a little too 19th-century for me.

My answer:

accessinfo.hk

Why should I use accessinfo.hk to make a request?

  • No need to find contact information of an "Access to Information Officer", just the name of the department
  • Electronic: access from anywhere
  • Public, so transparent/accountable
  • Public, so information available to everyone

How do I use accessinfo.hk to make a request?

More recent example: the second-most popular request on my site

https://accessinfo.hk/en/request/transport_planning_design_manual

Now what?

Help maintain and improve accessinfo.hk

  • Complete translation into Chinese(s)
  • Ensure custom text for Hong Kong is accurate, especially any legal text
  • Spread the word!
  • Volunteer to assist users

Let's keep using accessinfo.hk to open up the HK Government

  • This is only a "political" site in a narrow sense; it is only making it easier to make requests under the Government's own Code on Access to Information
  • It is our right to access the information!
  • If you feel the Government is unnecessarily opaque, use this site to get them to open up

New proposed law will have some effect on accessinfo.hk

  • Although the Code on Access to Information is not statutory, I have found that it mostly works well, or at least not very differently from, e.g., the UK situation
  • The devil is in the details when it comes to the new proposed law; the worst possible change would be charging a fee just to make a request, which could kill accessinfo.hk

  • The timetable for responding to requests is not clearly addressed in the consultation paper, but accessinfo.hk needs one in order to send reminders
  • Internal reviews can hopefully be done as now, by just asking for them!
  • Communication with Ombudsman should be electronic...!?
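
The reminder point above can be sketched as follows, to show why a clear response timetable matters to a site like accessinfo.hk. This is not the site's actual code: the 21-day figure is a placeholder assumption, not a value taken from the Code or the consultation paper, and the requests are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Placeholder deadline in calendar days; the real figure must come from the rules.
ASSUMED_RESPONSE_DAYS = 21


@dataclass
class InfoRequest:
    title: str
    sent_on: date
    answered_on: Optional[date] = None


def due_date(request, response_days=ASSUMED_RESPONSE_DAYS):
    """Date by which a response is expected."""
    return request.sent_on + timedelta(days=response_days)


def needs_reminder(request, today, response_days=ASSUMED_RESPONSE_DAYS):
    """True if the request is unanswered and past its due date."""
    return request.answered_on is None and today > due_date(request, response_days)


# Hypothetical requests, for illustration only.
requests = [
    InfoRequest("Hypothetical request A", date(2019, 1, 2)),
    InfoRequest("Hypothetical request B", date(2019, 2, 20)),
    InfoRequest("Hypothetical request C", date(2019, 1, 2), answered_on=date(2019, 1, 15)),
]

today = date(2019, 3, 9)
overdue = [r.title for r in requests if needs_reminder(r, today)]
```

Without a defined `response_days`, there is nothing to compare against, so the site cannot tell a slow department from a late one.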

Being a data scientist at HK01

For 6 months I was a data scientist at HK01

Some takeaways:

  • Technology is absolutely essential in all aspects of journalism; being digital natives, young journalists have that advantage over older ones
  • Technology is also an important driver of business changes, which are necessary to enable more journalism to be supported profitably
  • Data science can improve business process efficiency, increase app and website usage, and improve the reader experience via, e.g., content recommendation and trend understanding
