Turning Pandas DataFrames to Semantic Knowledge Graph

Grab the slides: slides.com/cheukting_ho/pandas-to-graph


Pyjamas Conf 2021


4th December 2021, worldwide, 24 hours streaming, FREE


Why do we love pandas?

(other than that they are cute 💕)

  • a powerful tool for managing tabular data
  • easy import and export to CSV
  • automatic type conversion (and/or datetime parsing)

But it does have limitations...

(source: https://medium.com/swlh/converting-nested-json-structures-to-pandas-dataframes-e8106c59976e)

It's just not the best at handling nested data
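To make the limitation concrete, here is a minimal sketch using made-up records (the names and the `address` field are hypothetical, not taken from the talk). It shows that a nested object lands in a single column of generic Python objects:

```python
import pandas as pd

# Hypothetical records: each person carries a nested "address" object.
records = [
    {"name": "Alice", "address": {"city": "London", "postcode": "N1"}},
    {"name": "Bob", "address": {"city": "Paris", "postcode": "75001"}},
]

df = pd.DataFrame(records)

# The nested dicts are stored as opaque Python objects in one column,
# so "city" is not a column that pandas can filter or group on directly.
address_dtype = df["address"].dtype
first_address = df.loc[0, "address"]

print(df)
print(address_dtype)
print(type(first_address))
```

The `address` column has dtype `object`, so the usual column-wise tools stop at the dict boundary.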

How about we put nested data in graph format?

Semantic Knowledge Graph

  • graph-structured data model
  • store interlinked descriptions of entities
  • describe data with properties
  • fully customizable schema
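As a rough illustration of these four points, the sketch below models a tiny graph with plain Python dicts. The class names, ids and the `follow` helper are all hypothetical; it is not the API of any particular graph database.

```python
# Hypothetical schema: each class lists its properties and their types.
# A property whose type is another class name is a link between entities.
schema = {
    "Person": {"name": str, "lives_in": "City"},
    "City": {"name": str},
}

# Entities are separate, interlinked descriptions keyed by an id.
graph = {
    "City/london": {"type": "City", "name": "London"},
    "Person/alice": {"type": "Person", "name": "Alice", "lives_in": "City/london"},
}


def follow(graph, entity_id, prop):
    """Return the entity that property `prop` of `entity_id` links to."""
    return graph[graph[entity_id][prop]]


alice_city = follow(graph, "Person/alice", "lives_in")
print(alice_city["name"])
```

Nested data becomes a link to another entity instead of a dict squeezed into a cell.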

Developing a tool to convert a pandas DataFrame to a Knowledge Graph

1) What will the schema look like?

  • flat structure
  • what are the data types
  • any linked data
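import builtins
import datetime as dt
import numpy as np
# np.typeDict on older NumPy; newer releases only provide np.sctypeDict
np_to_buildin = {
    v: getattr(builtins, k)
    for k, v in np.sctypeDict.items()
    if k in vars(builtins)
}
np_to_buildin[np.datetime64] = dt.datetime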
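One way such a lookup could be used is to derive a flat schema from `df.dtypes`. The sketch below is an assumption about how that step might look, not the tool's actual code. It uses a small hand-written table keyed on NumPy dtype kinds (`KIND_TO_BUILTIN` and `infer_schema` are hypothetical names) and treats anything unrecognised as a string.

```python
import datetime as dt

import pandas as pd

# Hand-written mapping from NumPy dtype "kind" codes to Python built-ins.
# This is an illustrative stand-in for a NumPy-to-builtin lookup table.
KIND_TO_BUILTIN = {
    "b": bool,         # boolean
    "i": int,          # signed integer
    "u": int,          # unsigned integer
    "f": float,        # floating point
    "M": dt.datetime,  # datetime64
}


def infer_schema(df):
    """Return {column name: Python type} for a flat DataFrame.

    Assumption: any column with an unrecognised kind is treated as str.
    """
    return {
        col: KIND_TO_BUILTIN.get(dtype.kind, str)
        for col, dtype in df.dtypes.items()
    }


df = pd.DataFrame(
    {
        "name": ["Alice", "Bob"],
        "age": [30, 25],
        "height": [1.65, 1.80],
        "joined": pd.to_datetime(["2021-01-01", "2021-06-15"]),
    }
)
schema = infer_schema(df)
print(schema)
```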

2) How to load in data

  • size limitation
  • load in chunks -> datatype mismatch
  • export in records (records are objects)
with pd.read_csv(csv_file, sep=sep, chunksize=chunksize) as reader:
    for df in reader:  # assumed loop: each chunk arrives as its own DataFrame
        obj_list = df.to_dict(orient="records")
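The datatype mismatch can be reproduced with a tiny in-memory CSV. The data below is made up for illustration. pandas infers dtypes per chunk, so a column that is all integers in one chunk and has a gap in the next comes back with two different dtypes. Passing `dtype=` explicitly is one possible remedy; whether the tool does this is not stated in the slides.

```python
import io

import pandas as pd

# Hypothetical CSV: "score" is all integers in the first two rows,
# but has a missing value in the third row.
csv_text = "name,score\nAlice,1\nBob,2\nCarol,\nDan,4\n"

chunk_dtypes = []
records = []
with pd.read_csv(io.StringIO(csv_text), chunksize=2) as reader:
    for chunk in reader:
        # The dtype is inferred separately for every chunk.
        chunk_dtypes.append(str(chunk["score"].dtype))
        records.extend(chunk.to_dict(orient="records"))

# Declaring the dtype up front keeps all chunks consistent.
with pd.read_csv(
    io.StringIO(csv_text), chunksize=2, dtype={"score": "float64"}
) as reader:
    fixed_dtypes = [str(chunk["score"].dtype) for chunk in reader]

print(chunk_dtypes)
print(fixed_dtypes)
```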

3) NAs, what about them?

  • skip the record altogether
  • make it optional
  • data cleaning
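if any(df.isna().any()) and na == "error":
    raise RuntimeError(
        f"{df}\nThere are NAs in the data, so it cannot be loaded automatically. "
        "Use the --na option to remove all records with NAs, or make properties optional to accept missing data."
    )
elif na == "skip":
    df = df.dropna()  # assumed body: drop every record that contains an NA
if na == "optional":
    # assumed context: item is one record (dict) from obj_list
    bad_key = []
    for key, value in item.items():
        if pd.isna(value):
            bad_key.append(key)  # assumed body: remember keys with missing values
    for key in bad_key:
        item.pop(key)  # assumed body: drop them so the property is simply absent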
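The three NA policies can be pulled together into one self-contained function. This is a sketch under assumptions: the function name `to_records` and the shortened error message are hypothetical, and the policy names mirror the `na` values used above.

```python
import pandas as pd


def to_records(df, na="error"):
    """Convert df to a list of dicts under one of three NA policies.

    "error":    refuse to load if any value is missing
    "skip":     drop every record that contains a missing value
    "optional": keep the record but leave the missing property out
    """
    if na == "error" and df.isna().any().any():
        raise RuntimeError("There are NAs in the data")
    if na == "skip":
        df = df.dropna()
    records = df.to_dict(orient="records")
    if na == "optional":
        records = [
            {key: value for key, value in rec.items() if not pd.isna(value)}
            for rec in records
        ]
    return records


df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, None]})
skipped = to_records(df, na="skip")
optional = to_records(df, na="optional")
print(skipped)
print(optional)
```

With "optional", Bob is still loaded, just without an `age` property, which only works if the schema marks `age` as optional.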

How about the other way round?

Exporting knowledge graph to CSV

The flattening procedure

  1. Load the records as a DataFrame
  2. Expand (de-nest) the nested columns
  3. Add the embedded objects
df = pd.DataFrame.from_records(list(all_records))
# col is the name of a column that holds nested objects
expanded = pd.json_normalize(df[col])
expanded.columns = [f"{col}.{name}" for name in expanded.columns]
df = df.drop(columns=col).join(expanded)
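The flattening procedure can be tried end to end on made-up records. In this sketch the sample data and the `flatten` helper are hypothetical, and it assumes that a column is either entirely nested objects or entirely plain values.

```python
import pandas as pd

# Hypothetical records, shaped like objects exported from a graph.
all_records = [
    {"name": "Alice", "address": {"city": "London", "postcode": "N1"}},
    {"name": "Bob", "address": {"city": "Paris", "postcode": "75001"}},
]


def flatten(records):
    """Turn records with embedded objects into a flat DataFrame."""
    df = pd.DataFrame.from_records(list(records))
    for col in list(df.columns):
        # Assumption: a column is expanded only if every value is a dict.
        if df[col].map(lambda value: isinstance(value, dict)).all():
            expanded = pd.json_normalize(df[col].tolist())
            # Prefix the new columns with the name of the original column.
            expanded.columns = [f"{col}.{name}" for name in expanded.columns]
            df = df.drop(columns=col).join(expanded)
    return df


flat = flatten(all_records)
csv_text = flat.to_csv(index=False)
print(csv_text)
```

Dotted column names such as `address.city` keep the link back to the embedded object visible in the CSV header.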

If you like knowledge graphs

Small Demo