The University of Manchester
Pollen Data:
50%+ missing (measurements are seasonal)
Air Quality Data:
5-12% missing (per pollutant)
Meteorology:
1.5-5.5% missing (per variable)
Use Machine Learning to iteratively fill gaps in time-series data
Requirements:
Use the Python-based scikit-learn iterative imputation library (not as complete as the R MICE library, but easier to integrate into our Python workflow)
Use the BayesianRidge estimator (default imputation method in scikit-learn)
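A minimal sketch of iterative imputation with scikit-learn, to show the shape of the call. The synthetic three-column array and the 10% gap rate are hypothetical stand-ins for real station data, not the project's own inputs. When no estimator is passed, IterativeImputer falls back to BayesianRidge.

```python
import numpy as np
# IterativeImputer is still flagged experimental, so this import must come first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Hypothetical stand-in for station data: three correlated columns.
n = 500
base = rng.normal(size=n)
data = np.column_stack([
    base + 0.1 * rng.normal(size=n),
    2.0 * base + 0.1 * rng.normal(size=n),
    -base + 0.1 * rng.normal(size=n),
])

# Knock out roughly 10% of the values to mimic gaps in the record.
gappy = data.copy()
mask = rng.random(data.shape) < 0.10
gappy[mask] = np.nan

# No estimator given, so the default (BayesianRidge) is used.
imputer = IterativeImputer(max_iter=10, random_state=0)
filled = imputer.fit_transform(gappy)
```

Each column with gaps is modelled in turn from the other columns, and the cycle repeats up to max_iter times; observed values are passed through unchanged.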
Assumptions:
Meteorological Data
Pollution Data
Raw Probability Distributions are not necessarily normal
e.g. temperature / RH for Kirkwall (Orkney Islands)
Solution: QuantileTransformer(output_distribution='normal')
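A sketch of how the transform could wrap the imputation step: map a skewed variable to an approximately normal distribution, impute in that space, then map back. The gamma-distributed sample is a hypothetical stand-in for a real variable, and n_quantiles=1000 is an illustrative choice rather than a setting taken from the toolkit.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(1)

# Hypothetical skewed variable (one column, 1000 samples).
skewed = rng.gamma(shape=2.0, scale=3.0, size=(1000, 1))

qt = QuantileTransformer(
    output_distribution='normal',
    n_quantiles=1000,      # illustrative; must not exceed the number of samples
    random_state=0,
)
gaussian_like = qt.fit_transform(skewed)

# ... imputation would run here, on the transformed values ...

# Map back to the original units afterwards.
restored = qt.inverse_transform(gaussian_like)
```

Because the mapping is rank-based, it can reshape any continuous unimodal distribution, but it cannot remove a strong bi-modal structure in the relationship between variables.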
Extreme bi-modal distribution impervious to transformation
Solution: Don't impute RH data - impute Temperature and Dew-point Temperature, then derive RH from these
(metpy.calc.relative_humidity_from_dewpoint)
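To illustrate the derivation step without requiring metpy, the sketch below computes RH as the ratio of vapour pressure at the dew point to saturation vapour pressure at the air temperature. It uses the Bolton (1980) approximation as a stand-in; the function names are hypothetical, and in the workflow itself the metpy function named above would take their place.

```python
import math


def saturation_vapour_pressure_hpa(temp_c):
    """Bolton (1980) approximation over water; temperature in degC, result in hPa."""
    return 6.112 * math.exp(17.67 * temp_c / (temp_c + 243.5))


def relative_humidity_percent(temp_c, dewpoint_c):
    """RH (%) from air temperature and dew-point temperature, both in degC."""
    actual = saturation_vapour_pressure_hpa(dewpoint_c)    # vapour pressure at dew point
    saturated = saturation_vapour_pressure_hpa(temp_c)     # saturation value at air temperature
    return 100.0 * actual / saturated


rh_saturated = relative_humidity_percent(10.0, 10.0)   # dew point equals temperature
rh_drier = relative_humidity_percent(20.0, 10.0)       # dew point 10 degC below temperature
```

Since temperature and dew point are imputed independently, an imputed dew point may occasionally exceed the imputed temperature and give RH above 100%; capping the result at 100% is one possible safeguard, though that is an assumption rather than something stated here.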
Assumptions:
Tools are provided with the processing toolkit for users to carry out these tests for themselves
50% data removal (2016 & 2017)
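A data-removal test of this kind could look like the sketch below: hide half of the known values in one variable, impute them, and compare against the withheld truth. The synthetic data, the random choice of which values to hide, and the RMSE and correlation metrics are all assumptions for illustration; the toolkit's own tests may select and score data differently.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(42)

# Hypothetical complete record: three variables sharing a common signal.
n = 2000
signal = rng.normal(size=n)
truth = np.column_stack([
    signal + 0.2 * rng.normal(size=n),
    0.5 * signal + 0.2 * rng.normal(size=n),
    -1.5 * signal + 0.2 * rng.normal(size=n),
])

# Hide about 50% of the first variable, chosen at random.
hidden = rng.random(n) < 0.5
test_data = truth.copy()
test_data[hidden, 0] = np.nan

imputed = IterativeImputer(random_state=0).fit_transform(test_data)

# Score only the values that were withheld.
errors = imputed[hidden, 0] - truth[hidden, 0]
rmse = float(np.sqrt(np.mean(errors ** 2)))
corr = float(np.corrcoef(imputed[hidden, 0], truth[hidden, 0])[0, 1])
```

Scoring only the withheld values keeps the untouched observations, which pass through the imputer unchanged, from flattering the result.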