Collecting and Visualizing COVID-19 Case Count Data from Multiple Open Sources

Ryan Hafen

@hafenstats

https://slides.com/hafen/covid19-casecounts

Epidemic Intelligence from Open Sources (EIOS)

EIOS is a collaboration between various public health stakeholders around the globe, led by WHO
Mission is early detection, verification and assessment of public health risks and threats using open source information
Aimed at consolidating a wide array of endeavors and platforms to build a strong public health intelligence (PHI) community supported by robust, harmonized and standardized PHI systems and frameworks across organizations and jurisdictions

https://www.who.int/eios

COVID-19 Case Counts

Confirmed cases and deaths at different levels of geographic resolution, as provided by health departments, ministries of health, etc.
EIOS aims to provide analysts with the ability to:
- Quickly understand trajectories of counts and related statistics at different levels of geography
- Observe discrepancies between different data sources
Case count considerations
- Methods for counting vary by health care system
- Level of testing varies geographically and over time

Case Count Sources

Global (country-level)
- Johns Hopkins CSSE
- European CDC
- WHO
- Worldometer
- Others
United States (state and county-level)
- Johns Hopkins CSSE
- New York Times
- USA FACTS
- Others

Challenges of Data Standards in Open Data Communities

Often not much thought is given to standards
When it is, everyone has a different idea of "standard"
Often little incentive to adhere to someone else's standard

It's hard to expect strict adherence to a standard for a given type of data, but ideally we would all adhere to some best practices

Example of Bearable Practices - JHU

Example of Bearable Practices

1. Wide format

Prefer tidy format

Each variable is a column
Each observation (or case) is a row

Why not wide format?

Not suitable for analysis
Not ideal for version control (every line changes every time, can't tell what changed, bloat)
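
To make the wide versus tidy distinction concrete, here is a minimal sketch that reshapes a wide table (one column per date) into tidy rows (one row per country and date). The sample data, column names, and the wide_to_tidy helper are made up for illustration and are not taken from any of the sources discussed here.

```python
import csv
import io

# Hypothetical wide-format extract: one row per country, one column per date.
WIDE_CSV = """country,2020-03-01,2020-03-02,2020-03-03
A,1,3,6
B,0,2,2
"""


def wide_to_tidy(wide_text, id_column="country"):
    """Return one dict per (country, date) observation."""
    reader = csv.DictReader(io.StringIO(wide_text))
    tidy = []
    for row in reader:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tidy row
            tidy.append(
                {id_column: row[id_column], "date": column, "cases": int(value)}
            )
    return tidy


tidy_rows = wide_to_tidy(WIDE_CSV)
for tidy_row in tidy_rows:
    print(tidy_row)
```

In the tidy layout a new day of data appends new rows instead of adding a column to every existing line, which is what keeps version control diffs small and readable.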

Example of Bearable Practices

2. Non-standard date format

Use ISO 8601
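
As a rough illustration of why ISO 8601 helps, the sketch below converts a month/day/two-digit-year string into an ISO 8601 date. The input format "%m/%d/%y" is an assumption chosen for the example; a real source may use a different layout, and to_iso8601 is a hypothetical helper name.

```python
from datetime import datetime


def to_iso8601(date_text, source_format="%m/%d/%y"):
    """Parse a date string in the assumed source format and return YYYY-MM-DD."""
    return datetime.strptime(date_text, source_format).date().isoformat()


iso_date = to_iso8601("1/22/20")
print(iso_date)

# ISO 8601 strings sort chronologically with a plain string sort,
# while month/day/year strings do not.
iso_sorted = sorted(["2020-10-01", "2020-02-15"])
raw_sorted = sorted(["10/1/20", "2/15/20"])
print(iso_sorted)
print(raw_sorted)
```

The second half shows a practical side effect: the ISO strings come out in date order, whereas the raw strings put October before February.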

Example of Bearable Practices

3. Using country names as geographic identifiers

Makes it difficult to merge with other data

More prone to error (even when using provided lookup table - things can change)

Should use a country code standard such as ISO 3166-1 alpha-2
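
A small sketch of the merging problem, using made-up counts from two hypothetical sources that spell the same countries differently. The lookup tables and the rekey helper are illustrative assumptions; only the ISO 3166-1 alpha-2 codes themselves (KR, US) are standard.

```python
# Hypothetical records from two sources that name the same countries differently.
source_a = {"South Korea": 100, "United States": 250}
source_b = {"Korea, Republic of": 98, "US": 245}

# Per-source lookup tables from each source's spelling to ISO 3166-1 alpha-2.
a_to_iso = {"South Korea": "KR", "United States": "US"}
b_to_iso = {"Korea, Republic of": "KR", "US": "US"}


def rekey(records, lookup):
    """Re-key records by country code; an unknown name raises KeyError."""
    return {lookup[name]: count for name, count in records.items()}


# Joining on names finds nothing in common.
name_matches = set(source_a) & set(source_b)

# Joining on codes lines the two sources up.
a_by_code = rekey(source_a, a_to_iso)
b_by_code = rekey(source_b, b_to_iso)
code_matches = sorted(set(a_by_code) & set(b_by_code))

print(name_matches)
print(code_matches)
```

Letting rekey fail loudly on an unknown name is a deliberate choice in this sketch: when a source changes a spelling, the break is visible instead of the country silently dropping out of the merge.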

Example of Bearable Practices

4. Mix of country and state/province data

Australia, Canada, and China are broken into provinces while everything else is country-level

- Should be consistent and well-documented

- Different files for different geographic levels

Example of Bearable Practices

5. Three files for three variables (cases, deaths, recovered)

These need to be joined to get an analysis dataset

All variables would ideally be in one file, one column per variable - back to tidy data principles
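
The join that an analyst has to do can be sketched as follows. The three tables stand in for three separate files and hold made-up numbers; keying by (country code, date) and the join_variables helper are assumptions for this example.

```python
# Hypothetical contents of three separate files, keyed by (country code, date).
cases = {("KR", "2020-03-01"): 100, ("KR", "2020-03-02"): 120}
deaths = {("KR", "2020-03-01"): 2, ("KR", "2020-03-02"): 3}
recovered = {("KR", "2020-03-01"): 10}  # one day is missing in this file


def join_variables(**tables):
    """Combine per-variable tables into one row per (country code, date)."""
    keys = sorted(set().union(*(table.keys() for table in tables.values())))
    rows = []
    for code, date in keys:
        row = {"country_code": code, "date": date}
        for name, table in tables.items():
            row[name] = table.get((code, date))  # None when a file lacks the key
        rows.append(row)
    return rows


joined = join_variables(cases=cases, deaths=deaths, recovered=recovered)
for joined_row in joined:
    print(joined_row)
```

Publishing a single tidy file with one column per variable would make this step unnecessary and would make gaps such as the missing recovered value visible at the source.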

Example of Bearable Practices

6. Ambiguous terms of use and no standard open license

Non-standard and too-restrictive terms can impede the progress of science
Ideally, open data should use a standard license such as a Creative Commons 4.0 International license

Example of Bearable Practices - JHU

Wide format - not tidy
Ambiguous / difficult to parse date format
Country names used for geographic identifiers
Mix of country and state data
Different file for each variable

Example of Best Practices - New York Times

Tidy format
ISO 8601 date format
Standard geocodes for admin 1 and 2 data (FIPS)
State and county-level data are in separate files
License is co-extensive with the Creative Commons Attribution-NonCommercial 4.0 International license

Building a Tool for Case Count Data

Pull country-level case counts every 5 minutes from the following sources
- WHO
- JHU
- ECDC
- Worldometer
Roll up counts to WHO Region, continent, and global levels
Compute statistics of interest for each geographic entity
- Day-to-day and week-to-week change in new cases / deaths
- Case fatality rate (# deaths / # of cases)*
- Attack rate (# cases / population)
- Etc.

*Does not take time to onset of death into account
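
The statistics listed above could be computed along these lines. This is a sketch with made-up cumulative counts and population; the function names and the exact definition of week-to-week change (last 7 days of new counts minus the 7 days before) are assumptions and may differ from what the tool actually does.

```python
def new_counts(cumulative):
    """Day-over-day increases in a cumulative series (one value per day after the first)."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


def case_fatality_rate(deaths, cases):
    """Deaths divided by cases; ignores the lag between onset and death."""
    return deaths / cases if cases else None


def attack_rate(cases, population):
    """Cases divided by population."""
    return cases / population


def week_over_week_change(new_daily):
    """New counts in the last 7 days minus the 7 days before; None if under 14 days."""
    if len(new_daily) < 14:
        return None
    return sum(new_daily[-7:]) - sum(new_daily[-14:-7])


# Hypothetical cumulative series for one geographic entity.
cumulative_cases = [100, 120, 150, 200]
cumulative_deaths = [2, 3, 3, 5]
population = 1_000_000

new_cases = new_counts(cumulative_cases)
day_to_day_change = new_cases[-1] - new_cases[-2]
cfr = case_fatality_rate(cumulative_deaths[-1], cumulative_cases[-1])
rate = attack_rate(cumulative_cases[-1], population)

print(new_cases, day_to_day_change, cfr, rate)
```

Because the inputs are plain cumulative series, the same functions apply unchanged to counts rolled up to WHO Region, continent, or global level.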

Provide a set of visualizations for each geographic entity for the user to interact with

With Trelliscope these can be navigated interactively

https://covid19-us-casecounts.netlify.app

COVID-19 Data Registry

Efforts exist for pulling multiple sources of COVID-19 data together, e.g.

We are working toward a set of data registry tools that enable the open data community to register datasets in a way that conforms to standards but doesn't require the original data provider to change the way they are publishing their data
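
One way to picture such a registry is a table of per-source transformers that map each provider's native layout onto a shared schema. Everything below is hypothetical: the field names, the "source_a" layout, and the register and standardize helpers are illustrative assumptions and not the registry tools being built.

```python
from datetime import datetime

# Assumed standard schema for country-level case counts.
STANDARD_FIELDS = ("country_code", "date", "cases", "deaths")

# Maps a source name to the function that standardizes one of its rows.
REGISTRY = {}


def register(source_name):
    """Decorator that records a transformer for a source."""
    def decorator(transform):
        REGISTRY[source_name] = transform
        return transform
    return decorator


@register("source_a")
def transform_source_a(raw_row):
    # Hypothetical provider layout: own column names and M/D/YY dates.
    return {
        "country_code": raw_row["iso2"],
        "date": datetime.strptime(raw_row["Date"], "%m/%d/%y").date().isoformat(),
        "cases": int(raw_row["Confirmed"]),
        "deaths": int(raw_row["Deaths"]),
    }


def standardize(source_name, raw_rows):
    """Apply the registered transformer and check the output against the schema."""
    transform = REGISTRY[source_name]
    rows = [transform(raw_row) for raw_row in raw_rows]
    for row in rows:
        if tuple(row) != STANDARD_FIELDS:
            raise ValueError(f"{source_name} produced unexpected fields: {list(row)}")
    return rows


standardized = standardize(
    "source_a",
    [{"iso2": "KR", "Date": "3/1/20", "Confirmed": "100", "Deaths": "2"}],
)
print(standardized)
```

The provider keeps publishing in its own format; only the transformer registered by the community needs to change when that format changes.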

Potential Future Work

Standard schemas and transformers for new data types
- Mobility data
- Administrative statistics (capacity, vulnerability, demographics, etc.)
- Models (IHME, Imperial College, Amherst, etc.)
Augmenting interfaces to incorporate this information in insightful ways

Thank You

rhafen@gmail.com

@hafenstats

https://slides.com/hafen/covid19-casecounts
