Better Data Governance with Revision Control Workflows

Do you handle data at work?

Show of hands 🤚🏻

We all know the struggle...

Sharing data and managing who has what access to it should not be painful

I believe...

We should solve this problem

Engineers are very good at solving problems...

Let's see what software engineers do

Have you heard about this thing called... git?

With git, you can...

Branches - work on your own version

Rollback - go back if something went wrong

Diff - compare changes

Merge - combine changes

Platform for multiuser access to projects (repos)

Open-source software made it into the mainstream thanks to GitHub and similar platforms that make use of git's features.

Can we use git for datasets?

Not ideal......

Database + git ?

What about ML models and pipelines?

Two ways to do this:

  • Version control tools to be used with databases (e.g. Liquibase)
     
  • A version control database (e.g. DVC)

VC Tool

  • Plug in and use
     
  • Easy migration
     
  • Most work with SQL DBs
     
  • Monitor pipelines

VC Database

  • Optimise for the DB
     
  • Choose a DB that suits your data
     
  • No extra components in the pipeline

Many tools are made for SQL databases

Not so much for the graph database folks...

Why I think TerminusDB is special

  • It's a graph database
     
  • It's open-source
     
  • Can be used with other databases
     
  • Can store your ML models

TerminusDB is an open-source graph database

With a graph structure, some things just come naturally...

What are they?

Rollbacks

In TerminusDB commits are logged

Makes backing up more organized and manageable
(no more final_final versions)

You can roll back to any commit
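
To make the idea concrete, here is a minimal sketch of a commit log with rollback. `ToyStore` and `Commit` are hypothetical names made up for this illustration. They are not the TerminusDB API, and a real database would not need to copy the whole dataset on every commit.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Commit:
    id: int
    message: str
    snapshot: dict  # full copy of the data at the time of the commit


@dataclass
class ToyStore:
    """Hypothetical in-memory store for illustration; not the TerminusDB API."""
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def commit(self, message: str) -> int:
        # Record a snapshot of the current data and return its commit id.
        commit_id = len(self.log)
        self.log.append(Commit(commit_id, message, copy.deepcopy(self.data)))
        return commit_id

    def rollback(self, commit_id: int) -> None:
        # Restore the data exactly as it was at the given commit.
        self.data = copy.deepcopy(self.log[commit_id].snapshot)


store = ToyStore()
store.data["alice"] = {"role": "analyst"}
first = store.commit("add alice")

store.data["alice"]["role"] = "admin"
store.commit("promote alice")

store.rollback(first)
print(store.data)  # {'alice': {'role': 'analyst'}}
```

Because every commit stays in the log, rolling back does not lose the later commit. It only changes what the current data looks like.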

Branches

Branches are built into TerminusDB

You can create branches at any point

(from any commit)

Thus having multiple versions
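
One way to picture this: a branch is just a name pointing at a commit, and each commit remembers its parent. The sketch below uses hypothetical `ToyRepo` and `Commit` classes invented for this example. It shows the concept only and says nothing about how TerminusDB implements branches internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Commit:
    data: dict
    parent: Optional["Commit"] = None


class ToyRepo:
    """Hypothetical illustration of branches; not the TerminusDB API."""

    def __init__(self) -> None:
        # Every repo starts with an empty root commit on "main".
        self.branches = {"main": Commit(data={})}

    def commit(self, branch: str, changes: dict) -> Commit:
        # New commit = parent's data plus the changes; the branch name moves forward.
        parent = self.branches[branch]
        new = Commit(data={**parent.data, **changes}, parent=parent)
        self.branches[branch] = new
        return new

    def branch(self, name: str, start: Commit) -> None:
        # Creating a branch only adds a new name for an existing commit.
        self.branches[name] = start


repo = ToyRepo()
v1 = repo.commit("main", {"price": 10})
repo.commit("main", {"price": 12})

# Branch off from an older commit, not just from the latest one.
repo.branch("experiment", v1)
repo.commit("experiment", {"price": 8, "discount": True})

print(repo.branches["main"].data)        # {'price': 12}
print(repo.branches["experiment"].data)  # {'price': 8, 'discount': True}
```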

Diff (Patch)

Compare differences between documents (data objects)

Preview and approve changes before applying them

Works on any JSON object
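
As a rough sketch of what diff and patch mean for JSON objects, the example below compares two documents, produces a patch that can be reviewed, and then applies it. The `diff` and `apply_patch` functions and the `{"set": ..., "remove": ...}` patch shape are assumptions made up for this illustration. They only handle top-level keys and do not reproduce TerminusDB's own patch format.

```python
import json


def diff(before: dict, after: dict) -> dict:
    """Describe how to turn `before` into `after` (top-level keys only)."""
    patch = {"set": {}, "remove": []}
    for key, value in after.items():
        # Key is new, or its value changed.
        if key not in before or before[key] != value:
            patch["set"][key] = value
    for key in before:
        # Key was deleted.
        if key not in after:
            patch["remove"].append(key)
    return patch


def apply_patch(doc: dict, patch: dict) -> dict:
    """Return a new document with the patch applied; `doc` is left untouched."""
    result = {k: v for k, v in doc.items() if k not in patch["remove"]}
    result.update(patch["set"])
    return result


before = json.loads('{"name": "Ada", "team": "data", "level": 2}')
after = json.loads('{"name": "Ada", "team": "platform", "admin": true}')

patch = diff(before, after)
print(json.dumps(patch))  # review this before applying it

patched = apply_patch(before, patch)
print(patched == after)   # True
```

Since the inputs are plain JSON objects, they could come from any source, which is what makes mixing this workflow with other databases possible.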

Try mixing it with other databases

Bonus: flexible data formats

You can store any data types

Which means your ML model can be stored in the database as well

Keep data and ML models in the same place
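
A minimal sketch of the idea, assuming the model is small enough to be written out as plain parameters: the training data and the fitted model live in one JSON document, so they can be versioned together. The `fit_line` helper and the `record` layout are hypothetical and chosen only for this example. Larger models would need a different serialization.

```python
import json


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares (toy model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


training_data = {"x": [1, 2, 3, 4], "y": [3, 5, 7, 9]}
parameters = fit_line(training_data["x"], training_data["y"])

# One document holds both the data and the model trained on it.
record = {
    "type": "TrainedModel",  # hypothetical document type
    "training_data": training_data,
    "parameters": parameters,
}

serialized = json.dumps(record)    # what would be stored
restored = json.loads(serialized)  # what would be read back
print(restored["parameters"])      # {'slope': 2.0, 'intercept': 1.0}
```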

So when should I version control my data?

You will never know until you try...
