Data Preparation
Shoichi Yip
Physics BSc @ UniTrento
Currently research assistant for Prof. Luca Tubiana
Interested in Stat Mech, Complex Systems, Data Analysis, ML
The goal of this project is the development and application of statistical physics and information theory methods to optimize data collection and forecast epidemics.
In particular, I handle data retrieval, cleaning, preparation and exploratory data analysis.
Epidemics on complex networks can be simulated given data-informed models and parameters.
In our case, we mainly want to account for human mobility: how people move, whom they spend their time with, and what the purpose of their movements is.
We also need data about the epidemic itself: cases, recoveries and deaths.
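One reason all three series are needed: given cumulative counts of cases, recoveries and deaths, the number of currently infected people follows by subtraction, and that is the quantity a compartmental model tracks. A minimal sketch, assuming the three series are cumulative and aligned day by day; the function name and the numbers are illustrative and not taken from our datasets.

```python
def active_infections(cases, recoveries, deaths):
    """Currently infected = cumulative cases - cumulative recoveries - cumulative deaths."""
    if not (len(cases) == len(recoveries) == len(deaths)):
        raise ValueError("all series must have the same length")
    return [c - r - d for c, r, d in zip(cases, recoveries, deaths)]


# Made-up cumulative counts for three consecutive days (illustration only).
cases = [100, 150, 210]
recoveries = [10, 30, 60]
deaths = [1, 3, 6]
active = active_infections(cases, recoveries, deaths)
```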
Data quality is also crucial: since we aim to build metapopulation models and to evaluate different coarse-graining scales with proper metrics, we need data at a wide range of scales to make meaningful comparisons.
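To make the idea of coarse-graining concrete, here is a minimal sketch in plain Python of how fine-scale mobility flows could be aggregated to a coarser scale. It assumes the weights are additive (raw counts, not logarithms) and that a lookup from each fine unit to its coarse unit is available; the names `coarse_grain`, `flows` and `parent` and the toy values are hypothetical.

```python
from collections import defaultdict


def coarse_grain(flows, parent):
    """Aggregate fine-scale flows {(origin, destination): weight} to a coarser scale.

    `parent` maps each fine-scale unit to the coarse unit containing it.
    Flows between two units with the same parent become a self-loop.
    """
    coarse = defaultdict(float)
    for (origin, destination), weight in flows.items():
        coarse[(parent[origin], parent[destination])] += weight
    return dict(coarse)


# Hypothetical toy example: four municipalities grouped into two provinces.
parent = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
flows = {("a1", "a2"): 5.0, ("a1", "b1"): 2.0, ("a2", "b2"): 1.0, ("b1", "b2"): 4.0}
coarse_flows = coarse_grain(flows, parent)
```

Within-unit movement ends up on the self-loops `("A", "A")` and `("B", "B")`, so total flow is conserved across scales; this makes it possible to compare the same quantity at each level of aggregation.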
The public (Italy) data we have:
The restricted access data (Italy) we have:
[Figure: data preparation pipeline. Inputs: ISTAT datasets, ISI Cuebiq dataset, CEEDS-DEMM dataset, Facebook datasets, OpenPolis shapefiles. Processing steps: UNZIP, BASH, DASK. Intermediate and output artifacts: .csv files, folders of zips, folders of csvs, topojson file, clean csvs, dictionary, parquet files.]
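The pipeline labels (UNZIP, folders of zips, UNZIP, folders of csvs) suggest archives nested inside archives. A minimal sketch of that two-pass extraction using only the Python standard library; the nested layout is an assumption inferred from the labels, and `extract_nested` is a hypothetical helper, not the project's actual script.

```python
import zipfile
from pathlib import Path


def extract_nested(archive, out_dir):
    """Extract `archive` into `out_dir`, then extract every .zip found inside it.

    Handles one level of nesting and returns the sorted list of .csv paths.
    `archive` is assumed to live outside `out_dir`.
    """
    out_dir = Path(out_dir)
    with zipfile.ZipFile(archive) as outer:
        outer.extractall(out_dir)
    # Second pass: unpack each inner archive into a folder named after it.
    for inner_path in sorted(out_dir.rglob("*.zip")):
        with zipfile.ZipFile(inner_path) as inner:
            inner.extractall(inner_path.with_suffix(""))
    return sorted(out_dir.rglob("*.csv"))
```

The returned list of csv paths can then be passed to the cleaning step.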
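import networkx as nx
import pandas as pd


def create_network(coloc_df, nodecolor, datetime, cutoff):
    """
    Create network from mobility data.
    """
    dt_datetime = pd.to_datetime(datetime)
    # Keep only rows at the requested timestamp whose nlink_log exceeds the cutoff
    df = coloc_df[(coloc_df.datetime == dt_datetime) & (coloc_df.nlink_log > cutoff)].compute()
    print(f"Dataframe is {len(df)} elements long")
    # Build the graph with adm1/adm2 as endpoints and nlink_log as edge attribute
    G = nx.from_pandas_edgelist(df, 'adm1', 'adm2', edge_attr='nlink_log')
    # Map each adm1 node to its nlink_log on this date and attach it as 'self_nlink'
    nodecolordict_fordate = nodecolor.loc[dt_datetime].set_index('adm1').compute().to_dict()['nlink_log']
    nx.set_node_attributes(G, nodecolordict_fordate, 'self_nlink')
    return G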