Collecting and Visualizing COVID-19 Case Count Data from Multiple Open Sources

Ryan Hafen

@hafenstats

Epidemic Intelligence from Open Sources (EIOS)

  • EIOS is a collaboration between various public health stakeholders around the globe, led by WHO
  • Its mission is the early detection, verification, and assessment of public health risks and threats using open source information
  • It aims to consolidate a wide array of endeavors and platforms to build a strong public health intelligence (PHI) community supported by robust, harmonized, and standardized PHI systems and frameworks across organizations and jurisdictions

COVID-19 Case Counts

  • Confirmed cases and deaths at different levels of geographic resolution, as provided by health departments, ministries of health, etc.
  • EIOS aims to provide analysts with the ability to:
    • Quickly understand trajectories of counts and related statistics at different levels of geography
    • Observe discrepancies between different data sources
  • Case count considerations
    • Methods for counting vary by health care system
    • Level of testing varies geographically and over time

Case Count Sources

Challenges of Data Standards in Open Data Communities

  • Often not much thought is given to standards
  • When it is, everyone has a different idea of "standard"
  • Often little incentive to adhere to someone else's standard

It's hard to expect strict adherence to a standard for a given type of data, but ideally we would all adhere to some best practices

Example of Bearable Practices - JHU

Example of Bearable Practices 

1. Wide format

Prefer tidy format

  • Each variable is a column
  • Each observation (or case) is a row

 

Why not wide format?

  • Not suitable for analysis
  • Not ideal for version control (every line changes every time, can't tell what changed, bloat)
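
To make the wide-versus-tidy distinction concrete, here is a minimal sketch that reshapes a wide table (one column per date) into tidy rows (one row per country and date). The column name `Country/Region`, the date headers, and the counts are illustrative assumptions, not an excerpt from any real file.

```python
import csv
import io

# Hypothetical wide-format excerpt: one column per date (values are made up).
WIDE_CSV = """Country/Region,1/22/20,1/23/20,1/24/20
Italy,0,0,2
Spain,0,1,1
"""


def wide_to_tidy(wide_text, id_col="Country/Region"):
    """Return one dict per (country, date) observation."""
    rows = []
    for record in csv.DictReader(io.StringIO(wide_text)):
        country = record[id_col]
        for column, value in record.items():
            if column == id_col:
                continue  # skip the identifier column itself
            rows.append({"country": country, "date": column, "cases": int(value)})
    return rows


tidy = wide_to_tidy(WIDE_CSV)
```

In the tidy form, a new day of data appends rows instead of adding a column to every existing line, which is what keeps version-control diffs small.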

Example of Bearable Practices 

2. Non-standard date format

Use ISO 8601
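
As a small illustration, a non-standard date string can be normalized to ISO 8601 with the Python standard library. The input pattern (month/day/two-digit year) is an assumption about what a non-standard format might look like; the parser has to be told the pattern explicitly, which is exactly the ambiguity ISO 8601 avoids.

```python
from datetime import datetime


def to_iso_date(text, fmt="%m/%d/%y"):
    """Parse a date string in the assumed format and return YYYY-MM-DD."""
    return datetime.strptime(text, fmt).date().isoformat()


iso = to_iso_date("1/22/20")  # "2020-01-22"
```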

Example of Bearable Practices 

3. Using country names as geographic identifiers

  • Makes it difficult to merge with other data
  • More prone to error (even when using the provided lookup table, since names can change)
  • Should use a country code standard such as ISO 3166-1 alpha-2
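
The sketch below shows why codes make merging easier: two sources that spell country names differently line up once both are mapped to ISO 3166-1 alpha-2 codes. The lookup table is a tiny hand-written excerpt, the source dictionaries and their counts are made up, and `to_iso2` is a hypothetical helper. Failing loudly on an unknown name is a deliberate choice here, so that a renamed country is noticed rather than silently dropped.

```python
# Tiny hand-written excerpt of a name-to-code lookup (not a complete table).
NAME_TO_ISO2 = {
    "Italy": "IT",
    "Korea, South": "KR",
    "Republic of Korea": "KR",
    "US": "US",
    "United States of America": "US",
}


def to_iso2(name):
    """Map a country name to its ISO 3166-1 alpha-2 code, or raise ValueError."""
    try:
        return NAME_TO_ISO2[name.strip()]
    except KeyError:
        raise ValueError(f"no ISO 3166-1 alpha-2 code known for {name!r}") from None


# Two hypothetical sources using different spellings (counts are made up).
source_a = {"Korea, South": 9000, "US": 50000}
source_b = {"Republic of Korea": 9037, "United States of America": 51914}

by_code_a = {to_iso2(name): count for name, count in source_a.items()}
by_code_b = {to_iso2(name): count for name, count in source_b.items()}

# With a shared key, discrepancies between sources are a simple subtraction.
differences = {
    code: by_code_b[code] - by_code_a[code]
    for code in by_code_a.keys() & by_code_b.keys()
}
```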

Example of Bearable Practices 

4. Mix of country and state/province data

Australia, Canada, and China are broken into provinces while everything else is country-level

 

  • Should be consistent and well-documented
  • Different files for different geographic levels

Example of Bearable Practices 

5. Three files for three variables (cases, deaths, recovered)

 

These need to be joined to get an analysis dataset

 

All variables would ideally be in one file, one column per variable - back to tidy data principles
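
A minimal sketch of the join that a three-file layout forces on every analyst. Each variable is assumed to be keyed by a country code and an ISO 8601 date; `join_variables` is a hypothetical helper and all counts are made up. Missing values are kept as `None` rather than dropped.

```python
def join_variables(**variables):
    """Combine per-variable tables into tidy rows, one column per variable."""
    # Collect every (country_code, date) key that appears in any table.
    keys = sorted(set().union(*(table.keys() for table in variables.values())))
    return [
        {
            "country_code": code,
            "date": date,
            **{name: table.get((code, date)) for name, table in variables.items()},
        }
        for code, date in keys
    ]


# Made-up values keyed by (country_code, ISO 8601 date).
cases = {("IT", "2020-03-01"): 100, ("IT", "2020-03-02"): 150}
deaths = {("IT", "2020-03-01"): 3, ("IT", "2020-03-02"): 5}
recovered = {("IT", "2020-03-02"): 10}

rows = join_variables(cases=cases, deaths=deaths, recovered=recovered)
```

Publishing the data in this joined shape in the first place would spare each consumer from repeating the step.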

Example of Bearable Practices 

6. Ambiguous terms of use and no standard open license

Example of Bearable Practices - JHU

  • Wide format - not tidy
  • Ambiguous / difficult to parse date format
  • Country names used for geographic identifiers
  • Mix of country and state data
  • Different file for each variable
  • Ambiguous terms of use and no standard open license

Example of Best Practices - New York Times

  • Tidy format
  • ISO 8601 date format
  • Standard geocodes for admin 1 and 2 data (FIPS)
  • State and county-level data are in separate files
  • License is co-extensive with the Creative Commons Attribution-NonCommercial 4.0 International license

Building a Tool for Case Count Data

  • Pull country-level case counts every 5 minutes from the following sources
    • WHO
    • JHU
    • ECDC
    • Worldometer
  • Roll up counts to WHO Region, continent, and global levels
  • Compute statistics of interest for each geographic entity
    • Day-to-day and week-to-week change in new cases / deaths
    • Case fatality rate (# deaths / # of cases)*
    • Attack rate (# cases / population)
    • Etc.

*Does not take the time from onset to death into account
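
The roll-up and statistics steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual implementation: the region lookup is a three-country excerpt, all counts and the population figure are made up, each series is assumed to be cumulative and aligned on the same dates, and "week-to-week change" is interpreted here as the last 7 days of new counts minus the 7 days before.

```python
# Tiny excerpt of a country-code to WHO-region lookup (not a complete table).
WHO_REGION = {"IT": "EURO", "ES": "EURO", "JP": "WPRO"}


def roll_up(cumulative_by_country, group_of):
    """Sum per-country cumulative series element-wise into one series per group."""
    totals = {}
    for code, series in cumulative_by_country.items():
        group = group_of[code]
        if group in totals:
            totals[group] = [a + b for a, b in zip(totals[group], series)]
        else:
            totals[group] = list(series)
    return totals


def new_counts(cumulative):
    """Day-to-day new counts derived from a cumulative series."""
    return [b - a for a, b in zip(cumulative, cumulative[1:])]


def week_over_week_change(new_daily):
    """Last 7 days of new counts minus the 7 days before; None if too short."""
    if len(new_daily) < 14:
        return None
    return sum(new_daily[-7:]) - sum(new_daily[-14:-7])


def case_fatality_rate(deaths, cases):
    """Naive ratio of deaths to cases; ignores the lag between onset and death."""
    return deaths / cases if cases else None


def attack_rate(cases, population):
    """Cases divided by population."""
    return cases / population


# Made-up cumulative case counts for three consecutive days.
cases = {"IT": [10, 15, 25], "ES": [5, 5, 10], "JP": [20, 22, 23]}
regional = roll_up(cases, WHO_REGION)
euro_new = new_counts(regional["EURO"])
```

Global totals follow from the same `roll_up` function with a lookup that maps every code to a single group.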

Provide a set of visualizations for each geographic entity for the user to interact with

With Trelliscope, these can be navigated interactively

COVID-19 Data Registry

Efforts exist for pulling multiple sources of COVID-19 data together, e.g.

We are working toward a set of data registry tools that enable the open data community to register datasets in a way that conforms to standards but doesn't require the original data provider to change the way they are publishing their data

Potential Future Work

  • Standard schemas and transformers for new data types
    • Mobility data
    • Administrative statistics (capacity, vulnerability, demographics, etc.)
    • Models (IHME, Imperial College, Amherst, etc.)
  • Augmenting interfaces to incorporate this information in insightful ways

Thank You

rhafen@gmail.com

@hafenstats
