AIACE Project
Data Preparation
Who am I?
Shoichi Yip
Physics BSc @ UniTrento
Currently a research assistant for Prof. Luca Tubiana
Interested in Stat Mech, Complex Systems, Data Analysis, ML
What am I doing?
The goal of this project is the development and application of statistical physics and information theory methods to optimize data collection and forecast epidemics.
In particular I take care of the data retrieval, cleaning, preparation and exploratory data analysis part.
What data do we need?
Epidemics on complex networks can be simulated given data-informed models and parameters.
In our case, we would mainly like to take into account human mobility factors: how people move, whom they spend their time with, and what the purpose of their movements is.
We also need data about the epidemic itself: cases, recoveries and deaths.
What data do we need?
The quality of the data is also crucial: since we aim to build metapopulation models and evaluate different coarse-graining scales with proper metrics, we need data spanning a wide range of scales to make meaningful comparisons.
What data do we have?
The public (Italy) data we have:
- COVID-19 cases, recoveries and deaths at NUTS3 level (CEEDS-DEMM);
- daily deaths baseline and crisis counts at LAU level (ISTAT);
- touristicity index at LAU level (ISTAT);
- NUTS3 and LAU unit shapefiles (OpenPolis);
- OD matrix, gyration radius and average degree distribution at NUTS3 level (ISI / Cuebiq);
- COVID-19 cases, recoveries and deaths at LAU level:
  - Bolzano/Bozen;
  - Umbria;
  - Marche;
  - Friuli-Venezia Giulia.
What data do we have?
The restricted access data (Italy) we have:
- Colocation data at NUTS3 level (Facebook DfG);
- Movement data at NUTS3 level (Facebook DfG);
- Population data at NUTS3 level (Facebook DfG);
- Movement data at Bing Tiles 16 level (Facebook DfG);
- Population data at Bing Tiles 16 level (Facebook DfG);
- COVID-19 cases, recoveries and deaths at LAU level:
  - Toscana;
  - Molise;
  - Veneto.
🌍 Data Availability
📆 Data Availability
Retrieval
[Retrieval pipeline diagram: sources are the ISTAT datasets, the ISI Cuebiq dataset, the CEEDS-DEMM dataset, the Facebook datasets and the OpenPolis shapefiles (arriving as .csv files, folders of zips, or a topojson file); zipped sources are unzipped into folders of csvs and cleaned with Bash scripts into clean csvs; the clean csvs, together with a column dictionary, are converted with Dask into parquet files.]
Why parquet files?
- Column-oriented data storage
- Fast retrieval for columnar operations
- Smaller storage space
Preprocessing
- Ensure utf8 encoding
- Reduce file size and correctly select only useful data (WIP)
- Ensure that columns are read correctly
- Ensure consistency across datasets for common features (aka admin units)
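For the encoding step, a minimal stand-alone sketch; the latin-1 fallback is an assumption about legacy exports, not a documented property of any specific dataset here:

```python
def to_utf8(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode raw file bytes as utf-8, falling back to a legacy encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Hypothetical fallback: many older Italian exports are latin-1.
        return raw.decode(fallback)

# 'Forlì' encoded as latin-1 is not valid utf-8, so the fallback kicks in.
fixed = to_utf8("Forlì".encode("latin-1"))
```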
Encoding
Text Wrangling
Consistency
Admin units
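One concrete consistency issue worth sketching: ISTAT municipality codes are zero-padded digit strings, so a csv parser that reads them as integers silently drops the leading zeros and breaks joins across datasets. A minimal sketch (the column name is hypothetical):

```python
import pandas as pd

# Hypothetical frame where LAU codes were parsed as integers,
# losing leading zeros (1001 should be the 6-digit code "001001").
df = pd.DataFrame({"pro_com": [1001, 27042, 58091]})

# Normalize back to 6-character zero-padded strings so that
# merges on admin units line up across datasets.
df["pro_com"] = df["pro_com"].astype(str).str.zfill(6)
```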
Preparation
- Transformations and scaling
- Tabular data preparation
- Writing functions for graph preparation
- Preparing dataviz applications (WIP)
- Preparing pipeline for automation (WIP)
- Documenting datasets and scripts (WIP)
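On the transformations-and-scaling point: the network code below filters on a log-scaled link count (nlink_log). A minimal sketch of such a transformation, assuming a raw nlink column of colocation counts (column names and the log10 base are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical colocation counts between pairs of admin units.
coloc = pd.DataFrame({
    "adm1": ["A", "A"],
    "adm2": ["B", "C"],
    "nlink": [10, 1000],
})

# Link counts are heavy-tailed; a log10 scale keeps cutoffs interpretable.
coloc["nlink_log"] = np.log10(coloc["nlink"])
```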
import pandas as pd
import networkx as nx

def create_network(coloc_df, nodecolor, datetime, cutoff):
    """Build a colocation network for a given date from (dask) mobility dataframes."""
    dt_datetime = pd.to_datetime(datetime)
    # Keep only the requested date and the links above the log-scaled cutoff.
    df = coloc_df[(coloc_df.datetime == dt_datetime) & (coloc_df.nlink_log > cutoff)].compute()
    print(f"Dataframe is {len(df)} elements long")
    # One edge per pair of admin units, weighted by the log link count.
    G = nx.from_pandas_edgelist(df, 'adm1', 'adm2', edge_attr='nlink_log')
    # Self-colocation of each unit on that date becomes a node attribute
    # (used downstream for node coloring).
    nodecolordict_fordate = nodecolor.loc[dt_datetime].set_index('adm1').compute().to_dict()['nlink_log']
    nx.set_node_attributes(G, nodecolordict_fordate, 'self_nlink')
    return G
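A usage sketch of the networkx construction at the heart of create_network, with an in-memory pandas edge list standing in for the dask frame (the real pipeline needs the .compute() calls; a toy frame does not, and the unit names and weights here are made up):

```python
import networkx as nx
import pandas as pd

# Hypothetical single-day edge list after the date/cutoff filtering.
edges = pd.DataFrame({
    "adm1": ["Trento", "Trento", "Bolzano"],
    "adm2": ["Bolzano", "Verona", "Verona"],
    "nlink_log": [3.2, 1.1, 2.4],
})

# One weighted edge per admin-unit pair, as in create_network.
G = nx.from_pandas_edgelist(edges, "adm1", "adm2", edge_attr="nlink_log")
```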
Stack
Thanks!
Aiace Data Prep
By Shoichi Yip