JSON IN POSTGRESQL

CONTEXT

  • Java + PostgreSQL stack
  • data persisted as a mix of structured & unstructured

WHY MIX?

  • need transaction/relational capabilities for business entities
  • some data only meant for visualization
  • and/or not relevant for business logic
  • and/or unstructured/semi-structured
  • and/or not yet clear how to use it (spike?)

Normalizing this data into tables and formalizing its structure with a schema is not useful, and not worth the effort. Just shove it into a document-like column.
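
A minimal sketch of such a table (names borrowed from the sample data shown later; the json type requires PostgreSQL >= 9.2):

create table devices (
    id   serial primary key,
    logs json  -- document-like column, no schema imposed on its contents
);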

THEN

For whatever reason, some of that data becomes relevant (e.g: new feature).

But it's still not a first-class entity.

REFACTOR SCHEMAS?

  • might be just part of a query criterion
    (e.g: all devices with log lines at error level)
  • still no clear structure
  • arbitrary fields, deep nested structures...
  • no direct manipulation (read/append only)
  • read & modify something else in same transaction

So it's still simpler to keep it as documents.

DOWNSIDES?

  • have to load the data in memory and deserialize it
  • may be a lot of documents (full table scan?)
  • may cause a lot of GC
  • a lot of data moved from the DB to the application without being used

Need query capabilities on the documents!

POSTGRES 9.3 TO THE RESCUE

JSON QUERY 9.3 

Data

-[ RECORD 1 ]----------------------------------------------------------------
id   | 1
logs | [{"level": "INFO", "text":"config changed","timestamp":1425029952107},
        {"level": "ERROR", "text":"zot","timestamp":1425029952108},
        {"level": "WARN", "text":"warnever","timestamp":1425029952110}]
-[ RECORD 2 ]----------------------------------------------------------------
id   | 2
logs | [{"level": "INFO", "text":"config changed","timestamp":1425029952107},
        {"level": "WARN", "text":"warnever","timestamp":1425029952110}]
-[ RECORD 3 ]----------------------------------------------------------------
id   | 3
logs | [{"level": "INFO", "text":"config changed","timestamp":1425029952107},
        {"level": "WARN", "text":"warnever","timestamp":1425029952110}]

JSON QUERY 9.3 

"All devices with log lines at error level"

sample=# select distinct on (d.id)
sample-#   d.*
sample-# from devices d,
sample-#      json_array_elements(d.logs) l
sample-# where l->>'level' = 'ERROR';
 id |                                                                             
----+-----------------------------------------------------------------------------
  1 | [{"level": "INFO", "text":"config changed","timestamp":1425029952107},{"l...

JSON QUERY 9.3 

Query plan

                            QUERY PLAN                              
-----------------------------------------------------------------------------------------
 Unique  (cost=1832.25..1838.05 rows=200 width=40) (actual time=0.090..0.091 rows=1 loops
   ->  Sort  (cost=1832.25..1835.15 rows=1160 width=40) (actual time=0.090..0.090 rows=1 
         Sort Key: d.id
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop  (cost=0.00..1773.20 rows=1160 width=40) (actual time=0.046..0.0
               ->  Seq Scan on devices d  (cost=0.00..21.60 rows=1160 width=40) (actual t
               ->  Function Scan on json_array_elements l  (cost=0.00..1.50 rows=1 width=
                     Filter: ((value ->> 'level'::text) = 'ERROR'::text)
                     Rows Removed by Filter: 2
 Planning time: 0.112 ms
 Execution time: 0.134 ms
(11 rows)

(The interesting part is the Nested Loop; the Sort and Unique steps are there for the DISTINCT clause)

YAY! PARTY? NO.

Complex queries that span multiple documents and/or deeply nested structures might be very slow:

  • JSON is actually a text+check type:
    pg has to reparse it on every access (see the snippet below)
  • Have to full-table scan anyway
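
A quick psql sketch of the "text+check" point: a json value is stored and returned verbatim, whitespace and duplicate keys included, so every access pays the parsing cost again:

sample=# select '{"a": 1,   "a": 2}'::json;
        json        
--------------------
 {"a": 1,   "a": 2}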

POSTGRES 9.4 TO THE RESCUE

  • JSONB type
    (binary representation, indexable; see the sketch below)
  • performance improvements
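
A sketch of the migration and of the kind of index jsonb enables (column names from the earlier sample; the GIN index speeds up containment queries with @>):

alter table devices alter column logs type jsonb using logs::jsonb;
create index devices_logs_gin on devices using gin (logs);

-- containment query that can use the GIN index
select * from devices where logs @> '[{"level": "ERROR"}]';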

JSON QUERY 9.4 

"All devices with log lines at error level"

sample=# select distinct on (d.id)
sample-#   d.*
sample-# from devices d,
sample-#      jsonb_array_elements(d.logs) l
sample-# where l->>'level' = 'ERROR';
 id |                                                                             
----+-----------------------------------------------------------------------------
  1 | [{"level": "INFO", "text":"config changed","timestamp":1425029952107},{"l...

JSON QUERY 9.4 

Query plan

                            QUERY PLAN                              
-----------------------------------------------------------------------------------------
 Unique  (cost=1832.25..1838.05 rows=200 width=40) (actual time=0.090..0.091 rows=1 loops
   ->  Sort  (cost=1832.25..1835.15 rows=1160 width=40) (actual time=0.090..0.090 rows=1 
         Sort Key: d.id
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop  (cost=0.00..1773.20 rows=1160 width=40) (actual time=0.046..0.0
               ->  Seq Scan on devices d  (cost=0.00..21.60 rows=1160 width=40) (actual t
               ->  Function Scan on jsonb_array_elements l  (cost=0.00..1.50 rows=1 width=
                     Filter: ((value ->> 'level'::text) = 'ERROR'::text)
                     Rows Removed by Filter: 2
 Planning time: 0.083 ms
 Execution time: 0.069 ms
(11 rows)

(The interesting part is the Nested Loop; the Sort and Unique steps are there for the DISTINCT clause)

QUERY DIFFERENCES

Filter on a document value at 3 nesting levels (a hypothetical example of such a filter follows the results):

  • Application-side
    client request timeouts (30 second XHR timeout)

  • JSON
    Planning time: 0.389 ms
    Execution time: 1682.812 ms

  • JSONB
    Planning time: 0.860 ms
    Execution time: 237.822 ms
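
A hypothetical example of such a 3-level filter (the column and field names are made up, not from the real dataset):

select d.*
from devices d
where d.config->'network'->'dns'->>'primary' = '8.8.8.8';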

DEADLY QUERIES

select
    ad.*,
    ai.rings as rn,
    m.l,
    m.ac
from
    (select
        coalesce(l.h, atg.h) as h,
        coalesce(l.lCount, 0) as l,
        coalesce(atg.ac, 0) as ac
    from
        (select
            tgb.h,
            count(*) as ac
        from
            (select distinct
                 d.h,
                 baci[1] as bi,
                 baci[2] as ci
            from
                a_d d,
                jsonb_array_elements(d.dia->'cd') crc,
                regexp_matches(crc->>'in','something(\d)/(\d)' ) baci
            where
                d.dia->>'m' = 'BLA-X2'
                and crc->>'in' like 'something%'
                and crc->>'in' like '%(10)%') as tgb,
            (select
                ahw.h as h,
                b.key as bi,
                c.key as ci,
                count(*)
            from
                a_h ahw,
                jsonb_each(ahw.hw->'boards') b,
                jsonb_each(b.value->'cards') c,
                jsonb_each(c.value->'ports') p
            where
                ahw.hw->>'m' = 'BLA-X2'
                and jsonb_array_length(p.value->'s') > 0
            group by h, bi, ci) as ac
        where
            tgb.h = ac.h
            and tgb.bi = ac.bi
            and tgb.ci = ac.ci
        group by
            tgb.h) as atg
    full outer join
        (select
            d.h,
            sum(cast(i->>'value' as integer)) as lCount
        from a_d d, jsonb_array_elements(d.dia->'ld'->'items') i
        where
            d.dia->>'m' = 'BLA-X2'
            and i->>'in' = 'LSOMETHING'
            and i->>'value' SIMILAR TO '[0-9]+'
        group by
            d.h) as l

    on
        l.h = atg.h
    where
        l.lCount is null
        or atg.ac is null
        or l.lCount != atg.ac) as m
inner join a_d ad on ad.h = m.h
inner join a a on m.h = a.h
inner join (
    select
        string_agg(ring, ', ') as rings,
        a as h
    from a_i_r ai
    group by a) as ai on m.h = ai.h;

UGLY CODE?

  • Use common sense, don't abuse
  • Hide/abstract complexity in the persistence layer, expose meaningful functions
    (e.g: deadly query is FooRepository::listActiveFoosOverLimit())
  • use WITH to improve readability, e.g:

WITH somethings_count_by_h
     as (select ...
            from
                a_d d,
                jsonb_array_elements(d.dia->'cd') crc,
                regexp_matches(crc->>'in', 'something(\d)/(\d)') baci
            where
                ...)
SELECT ...

  • refactor to schema when it becomes reasonable to do so!!

THANKS

By Tsukihara Caligin