JSON IN POSTGRESQL
CONTEXT
- Java + PostgreSQL stack
- data persisted as a mix of structured & unstructured
WHY MIX?
- need transaction/relational capabilities for business entities
- some data only meant for visualization
- and/or not relevant for business logic
- and/or unsutructured/semistructured
- and/or not yet clear how to use it (spike?)
Normalizing in tables and formalizing structure with a schema is not useful for this portion of data, and not worth the effort. Just shove it in a document-like column
THEN
For whatever reason, some of that data becomes relevant (e.g: new feature).
But it's still not a first-class entity.
REFACTOR SCHEMAS?
- might be just part of a query criterion
(e.g:all devices with log lines at error level) - still no clear structure
- arbitrary fields, deep nested structures...
- no direct manipulation (read/append only)
- read & modify something else in same transaction
So it's still simpler to keep it as documents.
DOWNSIDES?
- have to load in memory and deserialize the data
- may be lot of documents (full table scan?)
- may require lots of gc
- lot of data moved from db to application without being used
Need query capabilities on the documents!
POSTGRES 9.3 TO THE RESCUE
-
JSON type
(text field + automatic validity check) - ad-hoc operators and functions
(objects/array navigation, expansion)
JSON QUERY 9.3
Data
-[ RECORD 1 ]----------------------------------------------------------------
id | 1
logs | [{"level": "INFO", "text":"config changed","timestamp":1425029952107},
{"level": "ERROR", "text":"zot","timestamp":1425029952108},
{"level": "WARN", "text":"warnever","timestamp":1425029952110}]
-[ RECORD 2 ]----------------------------------------------------------------
id | 2
logs | [{"level": "INFO", "text":"config changed","timestamp":1425029952107},
{"level": "WARN", "text":"warnever","timestamp":1425029952110}]
-[ RECORD 3 ]----------------------------------------------------------------
id | 3
logs | [{"level": "INFO", "text":"config changed","timestamp":1425029952107},
{"level": "WARN", "text":"warnever","timestamp":1425029952110}]
JSON QUERY 9.3
"All devices with log lines at error level"
sample=# select distinct on (d.id)
sample-# d.*
sample-# from devices d,
sample-# json_array_elements(d.logs) l
sample-# where l->>'level' = 'ERROR';
id |
----+-----------------------------------------------------------------------------
1 | [{"level": "INFO", "text":"config changed","timestamp":1425029952107},{"l...
JSON QUERY 9.3
Query plan
QUERY PLAN
-----------------------------------------------------------------------------------------
Unique (cost=1832.25..1838.05 rows=200 width=40) (actual time=0.090..0.091 rows=1 loops
-> Sort (cost=1832.25..1835.15 rows=1160 width=40) (actual time=0.090..0.090 rows=1
Sort Key: d.id
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=0.00..1773.20 rows=1160 width=40) (actual time=0.046..0.0
-> Seq Scan on devices d (cost=0.00..21.60 rows=1160 width=40) (actual t
-> Function Scan on json_array_elements l (cost=0.00..1.50 rows=1 width=
Filter: ((value ->> 'level'::text) = 'ERROR'::text)
Rows Removed by Filter: 2
Planning time: 0.112 ms
Execution time: 0.134 ms
(11 rows)
(Interesting part is the Nested Loop, sorting and unique is for the DISTINCT clause)
YAY! PARTY? NO.
Complex queries that spans multiple documents and/or deep nested structures might be very slow
- JSON is actually a text+check type
pg has to reparse it everytime to access - Have to full-table scan anyway
POSTGRES 9.4 TO THE RESCUE
-
JSONB type
(binary representation, indexable) - performance improvements
JSON QUERY 9.4
"All devices with log lines at error level"
sample=# select distinct on (d.id)
sample-# d.*
sample-# from devices d,
sample-# jsonb_array_elements(d.logs) l
sample-# where l->>'level' = 'ERROR';
id |
----+-----------------------------------------------------------------------------
1 | [{"level": "INFO", "text":"config changed","timestamp":1425029952107},{"l...
JSON QUERY 9.4
Query plan
QUERY PLAN
-----------------------------------------------------------------------------------------
Unique (cost=1832.25..1838.05 rows=200 width=40) (actual time=0.090..0.091 rows=1 loops
-> Sort (cost=1832.25..1835.15 rows=1160 width=40) (actual time=0.090..0.090 rows=1
Sort Key: d.id
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=0.00..1773.20 rows=1160 width=40) (actual time=0.046..0.0
-> Seq Scan on devices d (cost=0.00..21.60 rows=1160 width=40) (actual t
-> Function Scan on jsonb_array_elements l (cost=0.00..1.50 rows=1 width=
Filter: ((value ->> 'level'::text) = 'ERROR'::text)
Rows Removed by Filter: 2
Planning time: 0.083 ms
Execution time: 0.069 ms
(11 rows)
(Interesting part is the Nested Loop, sorting and unique is for the DISTINCT clause)
QUERY DIFFERENCES
Filter on a document value at 3 nesting levels:
client request timeouts (30 seconds xhr timeout)
- Application-side
- JSON
- JSONB
Planning time: 0.389 ms
Execution time: 1682.812 ms
Planning time: 0.860 ms
Execution time: 237.822 ms
DEADLY QUERIES
select
ad.*,
ai.r as rn,
m.l,
m.a
from
(select
coalesce(l.h, atg.h) as h,
coalesce(l.lCount, 0) as l,
coalesce(atg.ac, 0) as ac
from
(select
tgb.h,
count(*) as ac
from
(select distinct
d.h,
baci[1] as bi,
baci[2] as ci
from
a_d d,
jsonb_array_elements(d.dia->'cd') crc,
regexp_matches(crc->>'in','something(\d)/(\d)' ) baci
where
d.dia->>'m' = 'BLA-X2'
and crc->>'in' like 'something%'
and crc->>'in' like '%(10)%') as tgb,
(select
ahw.h as h,
b.key as bi,
c.key as ci,
count(*)
from
a_h ahw,
jsonb_each(ahw.hw->'boards') b,
jsonb_each(b.value->'cards') c,
jsonb_each(c.value->'ports') p
where
ahw.hw->>'m' = 'BLA-X2'
and jsonb_array_length(p.value->'s') > 0
group by h, bi, ci) as ac
where
tgb.h = ac.h
and tgb.bi = ac.bi
and tgb.ci = ac.ci
group by
tgb.h) as atg
full outer join
(select
d.h,
sum(cast(i->>'value' as integer)) as lCount
from a_d d, jsonb_array_elements(d.dia->'ld'->'items') i
where
d.dia->>'m' = 'BLA-X2'
and i->>'in' = 'LSOMETHING'
and i->>'value' SIMILAR TO '[0-9]+'
group by
d.h) as l
on
l.h = atg.h
where
l.lCount is null
or atg.ac is null
or l.lCount != atg.ac) as m
inner join a_d ad on ad.h = m.h
inner join a a on m.h = a.h
inner join (
select
string_agg(ring,', ') as rings,
a as h
from a_i_r ai
group by a) as ai on m.h = ai.h;
UGLY CODE?
- Use common sense, don't abuse
- Hide/abstract complexity in the persistence layer, expose meaningful functions
(e.g: deadly query is FooRepository::listActiveFoosOverLimit()) - use WITH to improve readability
- refactor to schema when it becomes reasonable to do so!!
WITH somethings_count_by_h
as (select ...
from
a_d d,
jsonb_array_elements(d.dia->'cd') crc,
regexp_matches(crc->>'in','something(\d)/(\d)' ) baci
where
...),
SELECT ...
THANKS
json in postgresql
By Tsukihara Caligin
json in postgresql
- 2,068