Normalizing this part of the data into tables and formalizing its structure with a schema is not useful, nor worth the effort. Just put it in a document-like column.
Then, for whatever reason, some of that data becomes relevant (e.g. a new feature).
But it's still not a first-class entity.
So it's still simpler to keep it as documents.
Need query capabilities on the documents!
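A minimal sketch of the kind of schema assumed by the examples that follow: a regular `devices` table with a document-like `logs` column (the column types here are an assumption, not from the original slides):

```sql
create table devices (
    id   serial primary key,
    logs json  -- jsonb behaves the same way for the queries shown here
);

-- Sample row matching the data shown below
insert into devices (logs) values
  ('[{"level": "INFO",  "text": "config changed", "timestamp": 1425029952107},
     {"level": "ERROR", "text": "zot",            "timestamp": 1425029952108}]');
```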
Data
-[ RECORD 1 ]----------------------------------------------------------------
id | 1
logs | [{"level": "INFO", "text":"config changed","timestamp":1425029952107},
{"level": "ERROR", "text":"zot","timestamp":1425029952108},
{"level": "WARN", "text":"warnever","timestamp":1425029952110}]
-[ RECORD 2 ]----------------------------------------------------------------
id | 2
logs | [{"level": "INFO", "text":"config changed","timestamp":1425029952107},
{"level": "WARN", "text":"warnever","timestamp":1425029952110}]
-[ RECORD 3 ]----------------------------------------------------------------
id | 3
logs | [{"level": "INFO", "text":"config changed","timestamp":1425029952107},
{"level": "WARN", "text":"warnever","timestamp":1425029952110}]
"All devices with log lines at error level"
sample=# select distinct on (d.id)
sample-# d.*
sample-# from devices d,
sample-# json_array_elements(d.logs) l
sample-# where l->>'level' = 'ERROR';
 id |                                    logs
----+-----------------------------------------------------------------------------
1 | [{"level": "INFO", "text":"config changed","timestamp":1425029952107},{"l...
Query plan
QUERY PLAN
-----------------------------------------------------------------------------------------
Unique (cost=1832.25..1838.05 rows=200 width=40) (actual time=0.090..0.091 rows=1 loops
-> Sort (cost=1832.25..1835.15 rows=1160 width=40) (actual time=0.090..0.090 rows=1
Sort Key: d.id
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=0.00..1773.20 rows=1160 width=40) (actual time=0.046..0.0
-> Seq Scan on devices d (cost=0.00..21.60 rows=1160 width=40) (actual t
-> Function Scan on json_array_elements l (cost=0.00..1.50 rows=1 width=
Filter: ((value ->> 'level'::text) = 'ERROR'::text)
Rows Removed by Filter: 2
Planning time: 0.112 ms
Execution time: 0.134 ms
(11 rows)
(The interesting part is the Nested Loop; the Sort and Unique nodes are there for the DISTINCT clause.)
Complex queries that span multiple documents and/or deeply nested structures can be very slow.
"All devices with log lines at error level"
sample=# select distinct on (d.id)
sample-# d.*
sample-# from devices d,
sample-# jsonb_array_elements(d.logs) l
sample-# where l->>'level' = 'ERROR';
 id |                                    logs
----+-----------------------------------------------------------------------------
1 | [{"level": "INFO", "text":"config changed","timestamp":1425029952107},{"l...
Query plan
QUERY PLAN
-----------------------------------------------------------------------------------------
Unique (cost=1832.25..1838.05 rows=200 width=40) (actual time=0.090..0.091 rows=1 loops
-> Sort (cost=1832.25..1835.15 rows=1160 width=40) (actual time=0.090..0.090 rows=1
Sort Key: d.id
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=0.00..1773.20 rows=1160 width=40) (actual time=0.046..0.0
-> Seq Scan on devices d (cost=0.00..21.60 rows=1160 width=40) (actual t
-> Function Scan on jsonb_array_elements l (cost=0.00..1.50 rows=1 width=
Filter: ((value ->> 'level'::text) = 'ERROR'::text)
Rows Removed by Filter: 2
Planning time: 0.083 ms
Execution time: 0.069 ms
(11 rows)
(The interesting part is the Nested Loop; the Sort and Unique nodes are there for the DISTINCT clause.)
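With jsonb, the same question can also be asked with the containment operator `@>`, which avoids expanding every array and, unlike the function scan above, can be supported by a GIN index. A sketch (assuming the `logs` column is jsonb; the index name is made up):

```sql
-- Hypothetical GIN index on the jsonb column
create index devices_logs_gin on devices using gin (logs);

-- "All devices with log lines at error level", via containment;
-- no DISTINCT needed, since nothing is multiplied by array expansion
select d.*
from devices d
where d.logs @> '[{"level": "ERROR"}]';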
Filter on a document value at three nesting levels:
resulted in client request timeouts (30-second XHR timeout)
Planning time: 0.389 ms
Execution time: 1682.812 ms
Planning time: 0.860 ms
Execution time: 237.822 ms
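For reference, a three-level filter has this shape (names as in the larger query below): one chained `jsonb_each` call per nesting level, and each level multiplies the number of rows the filter has to scan.

```sql
select
    ahw.h,
    b.key as bi,
    c.key as ci
from
    a_h ahw,
    jsonb_each(ahw.hw->'boards') b,   -- level 1
    jsonb_each(b.value->'cards') c,   -- level 2
    jsonb_each(c.value->'ports') p    -- level 3
where
    ahw.hw->>'m' = 'BLA-X2'
    and jsonb_array_length(p.value->'s') > 0;
```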
select
ad.*,
ai.rings as rn,
m.l,
m.a
from
(select
coalesce(l.h, atg.h) as h,
coalesce(l.lCount, 0) as l,
coalesce(atg.ac, 0) as ac
from
(select
tgb.h,
count(*) as ac
from
(select distinct
d.h,
baci[1] as bi,
baci[2] as ci
from
a_d d,
jsonb_array_elements(d.dia->'cd') crc,
regexp_matches(crc->>'in','something(\d)/(\d)' ) baci
where
d.dia->>'m' = 'BLA-X2'
and crc->>'in' like 'something%'
and crc->>'in' like '%(10)%') as tgb,
(select
ahw.h as h,
b.key as bi,
c.key as ci,
count(*)
from
a_h ahw,
jsonb_each(ahw.hw->'boards') b,
jsonb_each(b.value->'cards') c,
jsonb_each(c.value->'ports') p
where
ahw.hw->>'m' = 'BLA-X2'
and jsonb_array_length(p.value->'s') > 0
group by h, bi, ci) as ac
where
tgb.h = ac.h
and tgb.bi = ac.bi
and tgb.ci = ac.ci
group by
tgb.h) as atg
full outer join
(select
d.h,
sum(cast(i->>'value' as integer)) as lCount
from a_d d, jsonb_array_elements(d.dia->'ld'->'items') i
where
d.dia->>'m' = 'BLA-X2'
and i->>'in' = 'LSOMETHING'
and i->>'value' SIMILAR TO '[0-9]+'
group by
d.h) as l
on
l.h = atg.h
where
l.lCount is null
or atg.ac is null
or l.lCount != atg.ac) as m
inner join a_d ad on ad.h = m.h
inner join a a on m.h = a.h
inner join (
select
string_agg(ring,', ') as rings,
a as h
from a_i_r ai
group by a) as ai on m.h = ai.h;
WITH somethings_count_by_h
as (select ...
from
a_d d,
jsonb_array_elements(d.dia->'cd') crc,
regexp_matches(crc->>'in','something(\d)/(\d)' ) baci
where
...)
SELECT ...
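As a concrete sketch, here is how the first document-expanding subselect of the query above becomes a named CTE, so the expansion is written (and read) once; only this branch is shown, and the final SELECT is left as in the outline:

```sql
with somethings_count_by_h as (
    select
        tgb.h,
        count(*) as ac
    from (
        select distinct
            d.h,
            baci[1] as bi,
            baci[2] as ci
        from
            a_d d,
            jsonb_array_elements(d.dia->'cd') crc,
            regexp_matches(crc->>'in', 'something(\d)/(\d)') baci
        where
            d.dia->>'m' = 'BLA-X2'
            and crc->>'in' like 'something%'
            and crc->>'in' like '%(10)%'
    ) as tgb
    group by tgb.h
)
select * from somethings_count_by_h;
```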