How to version my spatial database?

 

Felix Kunde - @flxku

Research Assistant

MAGDa

https://projekt.beuth-hochschule.de/magda/

10 yrs time series from sensors

Floating Car Data from 500 taxis

Goal1: Crowdsourced Traffic Plattform

Goal2: Mobility Services

Goal3: Spatial Data Streaming / Mining

Goal4: Forecasting / Anomaly Detection

Methods: Statistical & Machine Learning (Python), Linear Referencing (PostGIS), Routing (Graphhopper, Java)

http://excell-mobility.com/

Words

aka What are we talking about?

Provenance*

Keep History of Your data

Lineage*

Store History about Your data

Audit trail / Commit Log

Store History about transformations to Your data (Change Data Capture)

Bi-Temporal*

Lineage + audit Trail

Event SourCing*

Derive State from immutable list of Events

VersioN Control*

Keep Multiple versions of Your dataBASE

MVCC*

Keep Multiple versions of Your data For Concurrent Workflows

Evolution*

Define semantic versions of you data schema when it evolves

Actions

What to do with it?

Query

Browse through Data History

Restore

Recreate a previous version

RePair

Undo certain transformations

RePlay

Debug your log

(Or recover from it)

Long Running Transactions

What you get when Using GIS but with conflict handling

Merge

Union between two versions (Branches)

Migration

Change the version while running In ProductioN mode

v1

v2

v3

Tools

What Can be used

Internal

Versioning inside your database

DIY

Additional Columns, History Tables, Database Trigger

Temporal_Tables extension*

Additional Columns, Relational History Tables, Database Trigger

See also: table_log*, table_version*, time_travel* etc.

Pros

Simple setup

Writes are still fast

Easy access of past data

 

Cons

No support for schema changes

No repair functionality

No branching

Data grows and grows and grows (well, general provenance problem)

pgMemento* "extension"

Generic auditing (JSONB), Transactions instead timestamps, (Event) Trigger

See also: Audit_trigger_91plus*, CyanAudit*

Pros

Can deal with schema changes

JSONBs logs are queryable

Inspect past transformations

Rollback certain transactions covering many tables

Cons

Some performance overhead

Restore tuples is complex

Yet no branching

No audit for all DB objects (Constraints, Indexes, Grants, Extensions etc.)

QGIS Versioning*

QGIS Plugin, concurrent editing, Conflict resolution

See also: pgVersion*

Pros

Easy to setup

QGIS as frontend

PostGIS <> SpatiaLite architecture possible

Partial checkouts

Compare revisions

Cons

Some limitations for table designs

QGIS as frontend

PG-Revisions as database views

Only for GIS-like database schemas

No merging of branches

Doesn'T PostGres have a transaction Log?

Why noT scan It instead of using triggers?

Logical Decoding*

Use decoded replication streams, Output to whatever you want

See also: decoderbufs*, JSONCDC*, logical_decoding (JSON) *, bottled water (Avro > Kafka)*

Pros

No effect to writes

Robust order of commits

Many output formats possible

Auditing in a distributed system (logical replication)

Cons

Lack of available tools

Harder to implement than simply writing triggers

No automatic DDL support (but hey, there's pgAudit*)

Mind replication slots

 

EXTernal

Versioning Outside of your database

DIY

Put Database files under Version control (Git etc.)

Geogig*

Edit your Geometry in a Git Way

Pros

Hey, it works like Git!

branch, merge, revert etc.

QGIS and GeoServer plugins

Works with complex DBs

Good documentation

Active development

Cons

How to query previous states via SQL?

Cache limits for PG imports?

FlyWay*

Schema Migration

See also: Liquibase*, Pyrseas*, PUM*, RoundhouseE* etc.

Pros

Simple API, easy config

Defines changes in SQL

Your DB has a version

Idempotence paradigm

Cons

Do not change a migration

No rollback, better redo everything

No diff between revisions

Decide

What you Need

WHEN

Audit Trail (CDC)

How

When you need to know what happened.

Good for short running transactions.

Not optimal if there are many data updates.

Version Control

Concurrent editing with long running transactions which requires conflict resolution

Migration

When your database runs in production (continious integration)

Lineage (Metadata)

Always

Thank YOU! QUESTIONS?

slides.com/fxku/foss4g17_dbversion

There's more

Appendix

A detailed Comparison between tools

Tool Method Log type Revision
timetravel Audit Trail Extra Columns Timestamps
temporal_tables Audit Trail Shadow Tables Timestamps
table_version
Audit Trail Shadow Tables UD revision
table_log
Audit Trail Shadow Tables Trigger seq
audit_trigger Audit Trail Generic (hstore) Transactions
pgMemento Audit Trail Generic (jsonb) Transactions
CyanAudit Audit Trail Generic (pivot) Transactions
pgVersion Version Ctl Extra Columns UD revision
QGIS Versioning Version Ctl Extra Columns UD revision
GeoGig Version Ctl External (binary) UD revision
Flyway Migration External (SQL) UD revision
Liquibase Migration External (XML) UD revision
Tool Tracing Var What is logged?
timetravel Upper bound + PK rows
temporal_tables Revision ID + PK rows (ignore columns)
table_version Revision ID + PK rows
table_log Trigger seq + PK rows
audit_trigger - rows (ignore columns)
pgMemento Synth key changed columns
CyanAudit PK array rows (toggle columns)
pgVersion Revision ID + PK rows
QGIS Versioning Revision ID + PK rows
GeoGig PK or user-defined changesets
Flyway
Liquibase
Tool Restore Revert
timetravel SQL -
temporal_tables SQL -
table_version SQL -
table_log SQL + API API (simple DBs)
audit_trigger SQL -
pgMemento SQL + API API (complex DBs)
CyanAudit SQL + API API (simple DBs)
pgVersion -
QGIS Versioning SQL -
GeoGig API API
Flyway API -
Liquibase API API
Tool DDL changes LRT / Branches / Merging
timetravel - -
temporal_tables - -
table_version - -
table_log - -
audit_trigger - -
pgMemento tables, columns -
CyanAudit columns -
pgVersion - yes / no / no
QGIS Versioning - yes / yes /no
GeoGig tables, columns yes / yes / yes
Flyway DB objects
Liquibase DB objects

Audit Trail Benchmarking

Kunde F., Sauer P. (2017) pgMemento – A Generic Transaction-Based Audit Trail for Spatial Databases. In: Gertz M. et al. (eds) Advances in Spatial and Temporal Databases. SSTD 2017. Lecture Notes in Computer Science, vol 10411. Springer, Cham

Made with Slides.com