Ryan Hafen
It's hard to expect strict adherence to a standard for a given type of data, but ideally we would all adhere to some best practices
1. Wide format
Prefer tidy format
Why not wide format?
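A minimal sketch of what moving from wide to tidy format can look like, using only the Python standard library. The column names, place names, and counts below are made up for illustration and are not taken from the actual files.

```python
import csv
import io

# Hypothetical wide-format extract: one column per date (values are invented).
WIDE_CSV = """country,2020-01-22,2020-01-23,2020-01-24
Aland,1,2,4
Borduria,0,0,3
"""

def wide_to_tidy(wide_text, id_column="country", value_name="cases"):
    """Turn one-column-per-date rows into one row per (id, date) pair."""
    reader = csv.DictReader(io.StringIO(wide_text))
    tidy = []
    for row in reader:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tidy row instead
            tidy.append({id_column: row[id_column], "date": column, value_name: int(value)})
    return tidy

tidy_rows = wide_to_tidy(WIDE_CSV)
```

In the tidy result a new day adds rows rather than a new column, so the schema stays fixed as the data grows.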
2. Non-standard date format
Use ISO 8601
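As a rough illustration, a date string can be normalized to ISO 8601 with the standard library. The month/day/two-digit-year input format used here is an assumption about what a non-standard source might look like, and `to_iso_8601` is a hypothetical helper name.

```python
from datetime import datetime

def to_iso_8601(date_text, source_format="%m/%d/%y"):
    """Parse a date in the assumed source format and return YYYY-MM-DD."""
    return datetime.strptime(date_text, source_format).date().isoformat()

iso_date = to_iso_8601("1/22/20")  # assumed input style: month/day/2-digit year
```

ISO 8601 dates also sort correctly as plain strings, which non-standard formats generally do not.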
3. Using country names as geographic identifiers
Makes it difficult to merge with other data
More prone to error (even when using a provided lookup table, since names can change)
Should use a country code standard such as ISO 3166-1 alpha-2
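A small sketch of why names are fragile join keys. The two datasets, their name spellings, and all values are invented for illustration; the codes `KR` and `US` are the ISO 3166-1 alpha-2 codes for South Korea and the United States.

```python
# Two hypothetical sources that spell the same countries differently.
cases_by_name = {"Korea, South": 100, "US": 250}
region_by_name = {"Republic of Korea": "Asia", "United States": "Americas"}

# Joining on names finds no keys in common.
matched_on_name = sorted(set(cases_by_name) & set(region_by_name))

# The same data keyed by ISO 3166-1 alpha-2 codes joins cleanly.
cases_by_code = {"KR": 100, "US": 250}
region_by_code = {"KR": "Asia", "US": "Americas"}
merged = {
    code: (cases_by_code[code], region_by_code[code])
    for code in cases_by_code
    if code in region_by_code
}
```

With names, every pair of sources needs its own lookup table; with a shared code standard, the join key is the same everywhere.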
4. Mix of country and state/province data
Australia, Canada, and China are broken into provinces while everything else is country-level
Should be consistent and well documented
Different files for different geographic levels
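One possible way to separate the geographic levels, sketched under the assumption that province-level rows carry a non-empty `province` field and country-level rows leave it blank. The field names and counts are hypothetical.

```python
from collections import defaultdict

# Hypothetical mixed-level rows (counts are invented).
rows = [
    {"country": "Canada", "province": "Ontario", "cases": 3},
    {"country": "Canada", "province": "Quebec", "cases": 2},
    {"country": "Italy", "province": "", "cases": 7},
]

def split_by_level(mixed_rows):
    """Separate rows into a country-level list and a province-level list."""
    country_level = [r for r in mixed_rows if not r["province"]]
    province_level = [r for r in mixed_rows if r["province"]]
    return country_level, province_level

def roll_up(province_rows):
    """Sum province-level counts to get country-level totals."""
    totals = defaultdict(int)
    for r in province_rows:
        totals[r["country"]] += r["cases"]
    return dict(totals)

country_level, province_level = split_by_level(rows)
country_totals = roll_up(province_level)
```

Each list could then be written to its own file, with the rolled-up totals filling in the country-level file for countries that are only reported by province.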
5. Three files for three variables (cases, deaths, recovered)
These need to be joined to get an analysis dataset
All variables would ideally be in one file, one column per variable - back to tidy data principles
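A minimal sketch of the join a user currently has to do, assuming each of the three files has already been reshaped into a mapping keyed by (country code, ISO date). The function name `join_variables` and all values are illustrative only.

```python
# Hypothetical per-variable tables keyed by (country code, date); values are invented.
cases = {("KR", "2020-03-01"): 10, ("KR", "2020-03-02"): 15}
deaths = {("KR", "2020-03-01"): 1, ("KR", "2020-03-02"): 2}
recovered = {("KR", "2020-03-01"): 0}

def join_variables(**tables):
    """Full outer join on (country, date): one row per key, one column per variable."""
    keys = sorted(set().union(*(t.keys() for t in tables.values())))
    return [
        {"country": c, "date": d, **{name: t.get((c, d)) for name, t in tables.items()}}
        for c, d in keys
    ]

combined = join_variables(cases=cases, deaths=deaths, recovered=recovered)
```

Missing observations show up as `None` rather than silently dropping the row, which is one reason to prefer an outer join when the files may not cover exactly the same dates.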
6. Ambiguous terms of use and no standard open license
*Does not take time to onset of death into account
Provide a set of visualizations for each geographic entity for the user to interact with
With Trelliscope these can be navigated interactively
Efforts exist for pulling multiple sources of COVID-19 data together, e.g.
We are working toward a set of data registry tools that enable the open data community to register datasets in a way that conforms to standards but doesn't require the original data provider to change the way they are publishing their data
rhafen@gmail.com