Turning Pandas DataFrames to Semantic Knowledge Graph

Cheuk Ting Ho

Cheukting

@cheukting_ho

https://cheuk.dev

https://www.twitch.tv/cheukting_ho

Grab the slides: slides.com/cheukting_ho/pandas-to-graph

Pyjamas Conf 2021

4th December 2021, worldwide, 24 hours streaming, FREE

https://pyjamas.live

Why do we love pandas?

(other than that they are cute 💕)

a powerful tool for managing tabular data
easy import and export to CSV
automatic type conversion (and/or datetime parsing)

But it does have limitations...

(source: https://medium.com/swlh/converting-nested-json-structures-to-pandas-dataframes-e8106c59976e)

It's just not the best at handling nested data
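To make the limitation concrete, here is a minimal sketch with made-up records (the names and fields are illustrative assumptions, not data from the talk). It shows what pandas does when a record contains a nested object.

```python
import pandas as pd

# Hypothetical nested records, e.g. parsed from a JSON document.
records = [
    {"name": "Ada", "address": {"city": "London", "postcode": "N1"}},
    {"name": "Bob", "address": {"city": "Leeds", "postcode": "LS1"}},
]

df = pd.DataFrame(records)

# The nested object is stored as-is: the "address" column has dtype
# "object" and every cell holds a plain Python dict.
address_dtype = str(df["address"].dtype)
first_cell_type = type(df.loc[0, "address"])
```

Because each cell is an opaque dict, the usual column-wise operations (filtering by city, exporting a tidy CSV) need extra work first.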

How about we put nested data in graph format?

Semantic Knowledge Graph

graph-structured data model
store interlinked descriptions of entities
describe data with properties
fully customizable schema
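As a rough illustration of these four points, the sketch below models a tiny graph with plain Python dicts. The class names, property names and id format are all hypothetical, and this is not the storage format of any particular graph database.

```python
# Schema: each class lists its properties and their types.
# A property whose type is another class name is a link between entities.
schema = {
    "Person": {"name": str, "works_for": "Company"},
    "Company": {"name": str},
}

# Data: entities keyed by id, each described by its properties.
entities = {
    "Company/acme": {"@type": "Company", "name": "Acme"},
    "Person/ada": {"@type": "Person", "name": "Ada", "works_for": "Company/acme"},
}


def follow(entity_id, prop):
    """Follow a link property from one entity to the entity it points at."""
    target_id = entities[entity_id][prop]
    return entities[target_id]


employer = follow("Person/ada", "works_for")
```

The nested "person has an employer" relation stays a first-class link instead of being squeezed into one table cell.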

Developing a tool to convert pandas DataFrame to Knowledge Graph

1) What will the schema be like

flat structure
what are the data types
any linked data

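import builtins
import datetime as dt
import numpy as np

# np.typeDict was removed in NumPy 1.24; np.sctypeDict is the same mapping
np_to_buildin = {
    v: getattr(builtins, k)
    for k, v in np.sctypeDict.items()
    if k in vars(builtins)
}
np_to_buildin[np.datetime64] = dt.datetime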
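One way such a mapping could be used to draft a flat schema from a DataFrame is sketched below. The helper `infer_schema`, the fallback to `str` for unknown or generic object columns, and the sample columns are assumptions for illustration. The mapping is rebuilt here with `np.sctypeDict`, since `np.typeDict` is no longer available in recent NumPy releases.

```python
import builtins
import datetime as dt

import numpy as np
import pandas as pd

# NumPy scalar type -> Python built-in type of the same name.
np_to_builtin = {
    v: getattr(builtins, k)
    for k, v in np.sctypeDict.items()
    if isinstance(k, str) and k in vars(builtins)
}
np_to_builtin[np.datetime64] = dt.datetime


def infer_schema(df):
    """Return {column name: Python type} for a flat DataFrame."""
    schema = {}
    for column, dtype in df.dtypes.items():
        # Assumption: anything we cannot map is treated as text.
        py_type = np_to_builtin.get(dtype.type, str)
        if py_type is object:
            py_type = str
        schema[column] = py_type
    return schema


# Hypothetical sample data.
df = pd.DataFrame(
    {
        "name": ["Ada", "Bob"],
        "height": [1.70, 1.82],
        "member": [True, False],
        "joined": pd.to_datetime(["2021-12-04", "2021-12-05"]),
    }
)
schema = infer_schema(df)
```

Each entry of `schema` can then become one property of the class that represents a row.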

2) How to load in data

size limitation
load in chunks -> datatype mismatch
export in records (records are objects)

with pd.read_csv(csv_file, sep=sep, chunksize=chunksize) as reader:
    for df in reader:
        # each chunk is a DataFrame; export its rows as a list of dicts
        obj_list = df.to_dict(orient="records")
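The "load in chunks -> datatype mismatch" point can be reproduced with a few lines. In this sketch the CSV content and the helper `chunk_dtypes` are made up for illustration. pandas infers dtypes per chunk, so the same column can come out as integers in one chunk and floats in the next unless the dtype is pinned.

```python
import io

import pandas as pd

# Hypothetical CSV: "score" looks like integers at first, floats later.
csv_text = "id,score\n1,10\n2,20\n3,30.5\n4,40.5\n"


def chunk_dtypes(**read_csv_kwargs):
    """Read the CSV two rows at a time and report the dtype of "score" per chunk."""
    with pd.read_csv(io.StringIO(csv_text), chunksize=2, **read_csv_kwargs) as reader:
        return [str(chunk["score"].dtype) for chunk in reader]


# Without a hint, each chunk is inferred on its own.
inferred = chunk_dtypes()

# Pinning the dtype up front keeps every chunk consistent.
pinned = chunk_dtypes(dtype={"score": "float64"})
```

Deciding the schema before loading, then passing it to every chunk, is one way to avoid the mismatch.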

3) NAs, what about them?

skip the record altogether
make it optional
data cleaning

if any(df.isna().any()) and na == "error":
    raise RuntimeError(
        f"{df}\nThere are NAs in the data, so it cannot be loaded automatically. Use the --na option to remove all records with NAs or to make properties optional so that missing data is accepted."
    )
elif na == "skip":
    df.dropna(inplace=True)

if na == "optional":
    # collect the keys first: a dict cannot change size while being iterated
    bad_key = []
    for key, value in item.items():
        if pd.isna(value):
            bad_key.append(key)
    for key in bad_key:
        item.pop(key)
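Put together, the three NA strategies could look like the sketch below. The function name `records_from_frame` and the sample data are assumptions for illustration; the snippets above are fragments of the real tool, and this is only a self-contained approximation that assumes every cell is a scalar.

```python
import pandas as pd


def records_from_frame(df, na="error"):
    """Turn a DataFrame into a list of records, handling NAs as requested."""
    if na not in {"error", "skip", "optional"}:
        raise ValueError(f"unknown na option: {na!r}")
    if na == "error" and df.isna().any().any():
        raise RuntimeError("There are NAs in the data.")
    if na == "skip":
        # Drop every row that has at least one missing value.
        df = df.dropna()
    records = df.to_dict(orient="records")
    if na == "optional":
        # Keep the row, but leave out the properties that are missing.
        records = [
            {key: value for key, value in item.items() if not pd.isna(value)}
            for item in records
        ]
    return records


# Hypothetical data: Bob's age is missing.
df = pd.DataFrame({"name": ["Ada", "Bob"], "age": [36.0, None]})
skipped = records_from_frame(df, na="skip")
optional = records_from_frame(df, na="optional")
```

With "optional", the schema has to mark `age` as an optional property so that Bob's record is still valid.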

How about the other way round?

Exporting knowledge graph to CSV

The flattening procedure

load the records as a DataFrame
expand (de-nest) it
add embedded objects

df = pd.DataFrame.from_records(list(all_records))

expanded = pd.json_normalize(df[col])
expanded.columns = list(map(lambda x: col + "." + x, expanded))
df.drop(columns=col, inplace=True)
df = df.join(expanded)
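A worked version of the flattening step, as a minimal sketch: the records, the column name and the helper `expand_column` are hypothetical, and the helper assumes every cell in the nested column is a dict (no missing values).

```python
import pandas as pd

# Hypothetical records exported from the graph, with one embedded object each.
all_records = [
    {"name": "Ada", "address": {"city": "London", "postcode": "N1"}},
    {"name": "Bob", "address": {"city": "Leeds", "postcode": "LS1"}},
]

df = pd.DataFrame.from_records(all_records)


def expand_column(df, col):
    """Replace a column of dicts with one "col.key" column per key."""
    expanded = pd.json_normalize(df[col].tolist())
    expanded.columns = [f"{col}.{name}" for name in expanded.columns]
    # Align on the original index so the join matches row for row.
    expanded.index = df.index
    return df.drop(columns=col).join(expanded)


flat = expand_column(df, "address")
csv_lines = flat.to_csv(index=False).splitlines()
```

The dotted column names keep track of where each value came from, so the CSV can be mapped back to the nested structure later.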