Old school auditing

with the new JSONB stuff

{ "name": "Felix Kunde" , "email": "felix-kunde@gmx.de"}

Slides: http://slides.com/fxku/pgmemento_foss4gna16


I'm Felix.

I'm from Berlin.

Working at a university.

Mining on traffic sensor data.

Fell in love with PostGIS 5yrs ago.

Used it for 3d city models.

Missed versioning.

Did it myself.


Who got me here?


Related projects


Initial Idea:

Version control for 3D city models

Version control inside the DBMS



(Lohse 2007)

Linear versioning usecases

A link to the past


Hierarchical versioning

City planning scenarios


3D City Database

  • Open source project to store CityGML-based city models in PostGIS (github.com/3dcitydb)

Existent solution for 3DCityDB

  • Version control of a proprietary DBMS

  • Linear versioning with timestamps

  • Working in different "workspaces" (hierachical versioning)

  • Merging of workspaces is possible

  • Great in theory, but working not so well in the field ...


  • Clear separation between production and archive state

  • Logs should be easy to access and editable

  • Performance of queries shall not decrease

How to do it?

  • Use triggers to catch change events

  • Do something with the OLD version

  • Provide queryable versions for each row

  • Going back to a previous DB state would be nice

Everything in one place

  • Create a validity range for each row

  • OLD and NEW together in one table

  • Versioning is easy, but tables get messy

  • Include time condition for each query

  • How to delete logs?

QGIS DB Manager Versioning works like this

Everything in another place

  • Putting OLD into another table or schema

  • Validity range for each row in history table

  • Include time condition only for versioning queries

  • Dynamic time travel through views

  • How to delete logs?

Some examples for history tables

tablelog by Andreas Scherbaum

temporal_tables extension by Vlad Arkhipov

Tardis approach by Magnus Hagander

table_version extension by Jeremy Palmer

Generic auditing

Why have a history table for each audited table,

when you can just have one?

OLD can be converted into into a semi-structured data type:


Don't need to care for DDL changes!


Craig Ringer does this with audit trigger 91plus (hstore)

And pgMemento does it as well.

Oh nice, but ...

  • Versioning gets more complicated then

  • Querying these logs is different (API changes)

  • Querying relational history tables is probably faster

  • DDL changes could be managed by event triggers

Disk space

(C) virualcitySYSTEMS 2016

Geometry, rasters and BLOBs

  • Logging complex objects even when they have not changed?

  • Calculate delta between OLD and NEW

  • Log only changed columns (need comparison)

  • Creating deltas produces unstructured logs

  • These could be handled nicely with hstore, JONSB etc.

  • Trigger overhead grows further

Pivoting changes

  • For each changed column generate a new row in a log table

  • column_name | pk | txid | tz | op | old_value | new_value

  • Cyan Audit by Moshe Jacobsen does it this way

  • Enabling / disabling auditing on column-level

PostgreSQL Logical Decoding

  • New API since PG 9.4 to access replication streams

  • Decoding the WAL logs

  • Write changes to replication slots

  • Auditing and versioning are both possible

  • Output plugins: WAL-Format >> wished format

  • More Infos: DocuFOSS4G NA, Postgres Open, FOSDEM, PgConfRu 

    Auditing using Logical Decoding

    Auditing transaction meta data (no data log):

    • pgaudit by Abhijit Menon-Sen & Ian Barwick (2nd Quadrant)
    • pgaudit fork by David Steele (Crunchy Data)

    Audit log to JSONB, to restore in a (PostGIS) database:

    (there might a talk about it @foss4g in Bonn)

    Hierarchical version control

    of spatial data

    Use case: Multiple users working in their own branches

    How it all started with pgMemento

    Table row >> row_to_json(OLD) >> JSON

    JSON >> json_populate_record(null::test_table, JSON) >> Table row

    Generic tracking of changes

    Log tables

    DML log tables

    • Transaction_log: Meta data about transactions

    • Table_event_log: Meta data about table events related to txids

    • Row_log: Table for data log referenced to events

    DDL log tables

    • Audit_table_log: Shows which tables exist for txid range

    • Audit_column_log: Shows which columns exist for txid range

    Statement-level triggers

    • BEFORE DML changes log infos about current txid (UNIQUE)

    • BEFORE DML changes log infos about table event

    • BEFORE TRUNCATE statement log all the removed data

    Row-level triggers

    • AFTER insert log inserted rows without the data (use first event ID for current txid)

    • AFTER update log deltas of updated row (use first event ID for current txid)

    • AFTER delete log data of deleted rows (use last event ID for current txid)

    Event triggers

    • Tag 'CREATE TABLE': Enable auditing for new table and fill DDL logs

    • Tag 'ALTER TABLE': Update the audit_column_log

    • Tag 'DROP TABLE': Update the DDL logs

    • Drop-Events can be narrowed down to single tables (PG 9.4) pg_event_trigger_dropped_objects()

    • Postgres 9.5 got us pg_event_trigger_ddl_commands()

    Additional audit id for each table

    • A unique ID is necessary to trace a row in the logs

    • Primary key could be used but it can consist of multiple columns

    • pgMemento adds an additional audit_id column to a table

    • There is only one global sequence for all audit_ids

    INSERT case

    ... and how it affects the log tables

    UPDATE case

    UPDATE building SET measured_height = 12.8 WHERE id = 1;

    Creating the deltas

      (SELECT ('{' 
               || string_agg(to_json(key) || ':' || value, ',') 
               || '}') 
    FROM jsonb_each(row_to_json(OLD)::jsonb)
    WHERE NOT ('{' || to_json(key) || ':' || value || '}')::jsonb <@ row_to_json(NEW)::jsonb
    ), '{}')::jsonb AS delta


    DELETE case

    Fast queries thanks to GIN index

    CREATE INDEX row_log_changes_idx ON pgmemento.row_log USING GIN (changes); 

    JSON query: Which transactions has changed a certain column?

    SELECT t.txid 
    FROM pgmemento.transaction_log t
    JOIN pgmemento.table_event_log e ON t.txid = e.transaction_id
    JOIN pgmemento.row_log r ON r.event_id = e.id
      r.audit_id = 4 
      (r.changes ? 'measured_height'); 

    ->  Index Scan using row_log_audit_idx on row_log r
          Index Cond: (audit_id = 4)
          Filter: (changes ? 'measured_height'::text) 

    JSON query 2: Searching for a specific key-value combination

    SELECT DISTINCT audit_id FROM pgmemento.row_log
      changes @> '{"measured_height": 0.0}'::jsonb; 

    -> Bitmap Index Scan on row_log_changes_idx
         Index Cond: (changes @> '{"measured_height": 0.0}'::jsonb)

    Restore a former table

    or database state

    SELECT pgmemento.restore_schema_state(
     1,                     -- Start of time slice
     1002,                  -- End of time slice
     'citydb',              -- Source schema
     'state_before_delete', -- Target schema
     'TABLE',               -- Relation type
     0                      -- Update restored state (views only)

    First check: Which tables existed for the upper txid boundary
    (audit_table_log table is queried)

    What happens internally?

    • Get valid rows for the given time slice
    • Fetch Audit_IDs before Tx (using DISTINCT)
    • Exclude Audit_IDs of already deleted entries

    Get columns of a table for given txid

    • Therefore, query the audit_column_log table

    SELECT column_name
      FROM pgmemento.audit_column_log 
      WHERE table_relid = 123456
        AND txid_range @> 1002::numeric

    Create temporary table

    • For the last step we need a table as a template

    • Template should be historically correct

    • Create a temporary table from the audit_column_log

    • If restoring as tables these templates will be dropped on commit

    • If restoring as views these templates need to be preserved

    Find old values

    • For each audit id loop over found columns

    • For each column search for the first appearance inside the data log

    • If nothing was found, query the recent state of the table

    SELECT 'my_column' AS key, COALESCE(
    (SELECT (r.changes -> 'my_column')
    FROM pgmemento.row_log r
    JOIN pgmemento.table_event_log e ON r.event_id = e.id
    JOIN pgmemento.transaction_log t ON t.txid = e.transaction_id
    WHERE t.txid >= 1002
    AND r.audit_id = 40
    AND (r.changes ? 'my_column')
    ORDER BY r.id LIMIT 1

    (SELECT COALESCE(to_json(my_column), NULL)::jsonb
    FROM my_schema.my_table
    WHERE audit_id = 40
    ) AS value

    Find old values

    Connect JSONB fragments

    ) -- end of WITH block, that has collected valid audit_ids 
    SELECT v.log_entry FROM valid_ids f 
    JOIN LATERAL ( -- has to be applied for each audit_id
      SELECT json_build_object( 
        q1.key, q1.value, 
        q2.key, q2.value,
        ...)::jsonb AS log_entry 
      FROM ( -- search for old values (see previous slide)
        SELECT q.key, q.value FROM ( 
          SELECT 'id' AS key, r.changes -> 'id' 
    FROM pgmemento.row_log r ...
    )q ) q1,
    SELECT q.key, q.value FROM (
    SELECT 'name' AS key, r.changes -> 'name'
    FROM pgmemento.row_log r ...
    )q ) q2, ...

      Restoring the table

      ) -- previous steps are encapsulated in a WITH block
      SELECT p.* FROM restore rq
        JOIN LATERAL (
          SELECT * FROM jsonb_populate_record(
            null::my_temp_table,   -- template
            rq.log_entry           -- concatenated JSONB
        ) p ON (true) 

      Revert transactions

      Get all affected rows of a given table event

      Case 1: If deleted >> INSERT
      Case 2: If updated >> UPDATE
      Case 3: If inserted >> DELETE

      Query to prepare the revert

       SELECT r.audit_order, r.audit_id, r.changes, 
      a.schema_name, a.table_name, e.op_id
      FROM pgmemento.table_event_log e
      JOIN pgmemento.audit_table_log a ON a.relid = e.table_relid
      JOIN pgmemento.transaction_log t ON t.txid = e.transaction_id
      CASE WHEN e.op_id > 2 THEN
      rank() OVER (ORDER BY audit_id ASC)
      rank() OVER (ORDER BY audit_id DESC)

      END AS audit_order,
      audit_id, changes
      FROM pgmemento.row_log
      WHERE event_id = e.id
      ) r ON (true)
      WHERE upper(a.txid_range) IS NULL
      AND t.txid = tid
      ORDER BY e.id DESC, audit_order ASC

      Revert transactions

      • Reverts are processed in a loop to guarantee a correct order

      • Foreign key violations are avoided (unless no cascade stuff is done)

      • Can also be processed for a range of transactions

      • If there are no foreign keys only the oldest event per row should be reverted, because it's faster


      • Of course, it slows down DDL and DML operations

      • No auditing for constraints, indexes and other DB objects

      • Restored views disappear when session ends

      • No merging of schemas

      • No partition strategy for data log table

      • More thoughts on restrictions and protection for log tables

      • No GUI, just SQL coding at the moment

      • Only supports Postgres 9.4 or higher

      Additional notes

      • Probably not the right option for write-extensive databases (even though only deltas are stored)

      • Manipulation of log tables is very easy at the moment

      • pgMemento is no substitution for backup and recovery tools



      Any questions ?


      Felix Kunde


      +4930 4504 3824

      See you in Bonn!


      By fxku


      • 3,791