Gülçin Yıldırım Jelínek
Staff Database Engineer @Xata, Main Organizer @Prague PostgreSQL Meetup, MSc, Computer and Systems Engineering @ Tallinn University of Technology, BSc, Applied Mathematics @Yildiz Technical University
May 2018, PHPKonf Istanbul
Lead of Cloud Development @ 2ndQuadrant
Board Member @ PostgreSQL Europe
MSc Comp. & Systems Eng. @ Tallinn University of Technology
Organizer @ Prague PostgreSQL Meetup
Writes on 2ndQuadrant blog
From Turkey
Lives in Prague
Github: gulcin
Design choices of PostgreSQL
Arrays, Enum, JSON
JSONB and GIN
Full Text Search in PostgreSQL
tsvector, tsquery
Ranking
Fuzzy search / Misspelling
Accent support
Language support
Why PostgreSQL?
Table "public.film"
Column | Type | Collation | Nullable | Default
----------------------+--------------------------+-----------+----------+---------------------------------------
film_id | integer | | not null | nextval('film_film_id_seq'::regclass)
title | text | | not null |
description | text | | |
release_year | year | | |
language_id | smallint | | not null |
original_language_id | smallint | | |
rental_duration | smallint | | not null | 3
rental_rate | numeric(4,2) | | not null | 4.99
length | smallint | | |
replacement_cost | numeric(5,2) | | not null | 19.99
rating | mpaa_rating | | | 'G'::mpaa_rating
last_update | timestamp with time zone | | not null | now()
special_features | text[] | | |
fulltext | tsvector | | not null |
fts_demo=> Select film_id, special_features from film
where special_features @> array['Deleted Scenes'] limit 15;
film_id | special_features
---------+-----------------------------------------------------
1 | {"Deleted Scenes","Behind the Scenes"}
2 | {Trailers,"Deleted Scenes"}
3 | {Trailers,"Deleted Scenes"}
5 | {"Deleted Scenes"}
6 | {"Deleted Scenes"}
7 | {Trailers,"Deleted Scenes"}
9 | {Trailers,"Deleted Scenes"}
10 | {Trailers,"Deleted Scenes"}
12 | {Commentaries,"Deleted Scenes"}
13 | {"Deleted Scenes","Behind the Scenes"}
14 | {Trailers,"Deleted Scenes","Behind the Scenes"}
19 | {Commentaries,"Deleted Scenes","Behind the Scenes"}
20 | {Commentaries,"Deleted Scenes","Behind the Scenes"}
23 | {Trailers,"Deleted Scenes"}
26 | {Commentaries,"Deleted Scenes"}
(15 rows)
fts_demo=> CREATE INDEX idx_sp_features ON film USING GIN(special_features);
CREATE INDEX
fts_demo=> Explain analyze (Select * from film
where special_features @> array['Deleted Scenes']);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on film (cost=11.90..73.19 rows=503 width=386) (actual time=0.058..0.187 rows=503 loops=1)
Recheck Cond: (special_features @> '{"Deleted Scenes"}'::text[])
Heap Blocks: exact=55
-> Bitmap Index Scan on idx_sp_features (cost=0.00..11.77 rows=503 width=0) (actual time=0.046..0.046 rows=503 loops=1)
Index Cond: (special_features @> '{"Deleted Scenes"}'::text[])
Planning time: 0.512 ms
Execution time: 0.267 ms
(7 rows)
create type status as enum('backlog', 'in-progress', 'done', 'delivered');
create table issues
(
id bigint primary key,
description text,
state status
);
insert into issues(id, description, state)
values (1, 'Implement Job for Switching DNS API Call', 'backlog'),
(2, 'Report an issue mechanism for customers', 'in-progress'),
(3, 'Cost reports', 'done'),
(4, 'Scheduled Jobs Mechanism', 'delivered');
fts_demo=> Select * from issues where state = 'in-progress';
id | description | state
----+-----------------------------------------+-------------
2 | Report an issue mechanism for customers | in-progress
(1 row)
fts_demo=> set enable_seqscan = off;
SET
fts_demo=> create index idx_state on issues(state);
CREATE INDEX
fts_demo=> Explain analyze (Select * from issues where state = 'in-progress');
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Index Scan using idx_state on issues (cost=0.13..8.15 rows=1 width=44) (actual time=0.007..0.008 rows=1 loops=1)
Index Cond: (state = 'in-progress'::status)
Planning time: 0.054 ms
Execution time: 0.023 ms
(4 rows)
create table js(id serial primary key, extra json);
insert into js(extra)
values ('[1, 2, 3, 4]'),
('[2, 3, 5, 8]'),
('{"key": "value"}');
fts_demo=> select * from js where extra @> '2';
ERROR: operator does not exist: json @> unknown
LINE 1: select * from js where extra @> '2';
^
HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts
alter table js alter column extra type jsonb;
fts_demo=> select * from js where extra @> '2';
id | extra
----+--------------
1 | [1, 2, 3, 4]
2 | [2, 3, 5, 8]
(2 rows)
JSONB is already stored in (internal binary format) interpreted form. This means:
storing take a little while longer (more CPU process)
but processing (retrieval) faster
The main thing is all JSON document can be indexed with a single GIN index.
fts_demo=> create index on js using gin (extra jsonb_path_ops);
CREATE INDEX
fts_demo=> explain analyze (select * from js where extra @> '2');
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on js (cost=8.00..12.01 rows=1 width=36) (actual time=0.011..0.012 rows=2 loops=1)
Recheck Cond: (extra @> '2'::jsonb)
Heap Blocks: exact=1
-> Bitmap Index Scan on js_extra_idx (cost=0.00..8.00 rows=1 width=0) (actual time=0.006..0.006 rows=2 loops=1)
Index Cond: (extra @> '2'::jsonb)
Planning time: 0.054 ms
Execution time: 0.031 ms
(7 rows)
fts_demo=> explain analyze (select * from js where extra @> '[2,3]');
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on js (cost=12.00..16.01 rows=1 width=36) (actual time=0.012..0.013 rows=2 loops=1)
Recheck Cond: (extra @> '[2, 3]'::jsonb)
Heap Blocks: exact=1
-> Bitmap Index Scan on js_extra_idx (cost=0.00..12.00 rows=1 width=0) (actual time=0.007..0.007 rows=2 loops=1)
Index Cond: (extra @> '[2, 3]'::jsonb)
Planning time: 0.053 ms
Execution time: 0.032 ms
(7 rows)
Interpreted format is different than what you sent originally, it goes through normalisation:
keys are sorted
duplicated keys are removed and only first ones are saved
whitespaces removed etc.
Fits into JSON standard (JSONB is Postgres' JSON)
schemaless PostgreSQL
heterogeneous set of documents all in a single relation
semi-structured data model
Generalised Inverted Index
Why?
forward indexes
list of documents and which words appear in them
there is almost no duplication
backward (inverted) indexes
list of words and in which documents they appeared
it is efficient
duplicate data in values
the more duplication the more efficient indices
ID | Document |
---|---|
1 | PostgreSQL is awesome |
2 | Awesome things happen |
3 | PHP loves PostgreSQL |
4 | Istanbul is awesome too |
5 | Thanks PHP |
Term | Document ID |
---|---|
awesome | 1, 2, 4 |
happen | 2 |
is | 1, 4 |
istanbul | 4 |
loves | 3 |
php | 3, 5 |
postgresql | 1, 3 |
thanks | 5 |
things | 2 |
too | 4 |
inverted index simplified
posting list
key
( , )
( , )
GIN has more compact way of storing duplicate values (keys) than B Tree.
Which you get a lot in JSONB because you usually have similar structure in documents. Similar to full text engine’s indexing.
tsvector which is a type suited to full-text search
fts_demo=# SELECT to_tsvector('Happiness is an allegory, unhappiness a story.');
to_tsvector
----------------------------------------------
'allegori':4 'happi':1 'stori':7 'unhappi':5
(1 row)
fts_demo=# SELECT to_tsvector('Happiness is an allegory, unhappiness a story.')
@@ 'happiness';
?column?
----------
f
(1 row)
fts_demo=# SELECT to_tsvector('Happiness is an allegory, unhappiness a story.')
@@ to_tsquery('happiness');
?column?
----------
t
(1 row)
fts_demo=# SELECT to_tsvector('Happiness is an allegory, unhappiness a story.')
@@ to_tsquery('happiness & unhappiness');
?column?
----------
t
(1 row)
tsquery stores lexemes that are to be searched for
Select title, description
from
(select title, description, to_tsvector(title) ||
to_tsvector(description) as searchterm
from film) as q
where q.searchterm @@ to_tsquery('Human & Database')
limit 5;
title | description
-----------------+----------------------------------------------------------------------------------------------------------------------------
ANONYMOUS HUMAN | A Amazing Reflection of a Database Administrator And a Astronaut who must Outrace a Database Administrator in A Shark Tank
HUMAN GRAFFITI | A Beautiful Reflection of a Womanizer And a Sumo Wrestler who must Chase a Database Administrator in The Gulf of Mexico
(2 rows)
Select title, ts_rank(q.searchterm, to_tsquery('DINOSAUR | Feminist')) as searchrank, description
from
(select title, description, setweight(to_tsvector(title), 'A') ||
setweight(to_tsvector(description), 'B') as searchterm
from film) as q
where q.searchterm @@ to_tsquery('DINOSAUR | Feminist')
order by searchrank desc
limit 5;
title | searchrank | description
--------------------+------------+----------------------------------------------------------------------------------------------------
ACADEMY DINOSAUR | 0.425549 | A Epic Drama of a Feminist And a Mad Scientist who must Battle a Teacher in The Canadian Rockies
DINOSAUR SECRETARY | 0.425549 | A Action-Packed Drama of a Feminist And a Girl who must Reach a Robot in The Canadian Rockies
CENTER DINOSAUR | 0.303964 | A Beautiful Character Study of a Sumo Wrestler And a Dentist who must Find a Dog in California
SPY MILE | 0.165491 | A Thrilling Documentary of a Feminist And a Feminist who must Confront a Feminist in A Baloon
BUNCH MINDS | 0.151982 | A Emotional Story of a Feminist And a Feminist who must Escape a Pastry Chef in A MySQL Convention
(5 rows)
1x
1x
0
3x
2x
Trigram?
"h"
"he"
"hel"
"ell"
"llo"
"lo"
"o"
hello
hallo
"h"
"ha"
"hal"
"all"
"llo"
"lo"
"o"
fts_demo=# Create extension pg_trgm;
CREATE EXTENSION
fts_demo=# select similarity('hello','hallo');
similarity
------------
0.333333
(1 row)
%,<%, <->
fts_demo=# explain analyze select description from film
where description %> 'Feminist';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Seq Scan on film (cost=10000000000.00..10000000067.50 rows=1 width=94) (actual time=0.031..14.900 rows=84 loops=1)
Filter: (description %> 'Feminist'::text)
Rows Removed by Filter: 916
Planning time: 0.046 ms
Execution time: 14.919 ms
fts_demo=# CREATE INDEX trgm_idx ON film USING GIN (description gin_trgm_ops);
CREATE INDEX
fts_demo=# explain analyze select description from film
where description %> 'Feminist';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on film (cost=76.01..80.02 rows=1 width=94) (actual time=0.113..1.945 rows=84 loops=1)
Recheck Cond: (description %> 'Feminist'::text)
Rows Removed by Index Recheck: 29
Heap Blocks: exact=49
-> Bitmap Index Scan on trgm_idx (cost=0.00..76.01 rows=1 width=0) (actual time=0.085..0.085 rows=113 loops=1)
Index Cond: (description %> 'Feminist'::text)
Planning time: 0.132 ms
Execution time: 1.970 ms
LIKE, ILIKE, ~, ~*
fts_demo=# Explain analyze select description from film
where description like '%Feminist%';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on film (cost=52.63..111.30 rows=81 width=94) (actual time=0.052..0.112 rows=84 loops=1)
Recheck Cond: (description ~~ '%Feminist%'::text)
Heap Blocks: exact=42
-> Bitmap Index Scan on trgm_idx (cost=0.00..52.61 rows=81 width=0) (actual time=0.044..0.044 rows=84 loops=1)
Index Cond: (description ~~ '%Feminist%'::text)
Planning time: 0.108 ms
Execution time: 0.135 ms
(7 rows)
fts_demo=# CREATE TABLE unique_lexeme AS
SELECT word FROM ts_stat(
'SELECT to_tsvector(''simple'', first_name) ||
to_tsvector(''simple'', last_name)
FROM actor
GROUP BY actor_id');
fts_demo=# CREATE INDEX lexeme_idx ON unique_lexeme USING GIN (word gin_trgm_ops);
CREATE INDEX
fts_demo=# SELECT word from unique_lexeme
WHERE similarity(word, 'sinatro') > 0.5
ORDER BY word <-> 'sinatro'
LIMIT 10;
word
---------
sinatra
(1 row)
Built-in text search for Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
CREATE EXTENSION unaccent;
SELECT unaccent('Gülçin Yıldırım Jelínek');
unaccent
-------------------------
Gulcin Yildirim Jelinek
(1 row)
fts_demo=# CREATE TEXT SEARCH CONFIGURATION tr ( COPY = turkish );
CREATE TEXT SEARCH CONFIGURATION
fts_demo=# ALTER TEXT SEARCH CONFIGURATION tr
ALTER MAPPING FOR hword, hword_part, word WITH unaccent, turkish_stem;
ALTER TEXT SEARCH CONFIGURATION
fts_demo=# SELECT to_tsvector('tr', 'Gülçin') @@ to_tsquery('tr', 'gulcin') as result;
result
--------
t
(1 row)
fts_demo=# set default_text_search_config to 'tr';
SET
fts_demo=# SELECT to_tsvector('Gülçin') @@ to_tsquery('gulcin') as result;
result
--------
t
(1 row)
Geospatial search in PostgreSQL? Yes, ofc!
Advantages of PostgreSQL over using a search engine:
JSONB
fts_demo=# Select first_name, last_name, education from staff;
-[ RECORD 1 ]-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
first_name | Mike
last_name | Hillyer
education | {"properties": {"university": {"type": "oxford"}, "high school": {"name": "harvard business school"}}, "dependencies": {"graduation-date": ["2017-11-10"]}}
-[ RECORD 2 ]-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
first_name | Jon
last_name | Stephens
education | {"properties": {"university": {"type": "tallinn university of technology"}, "high school": {"name": "business school"}}, "dependencies": {"graduation-date": ["2017-10-23"]}}
MySQL full-text search features are very limited. By default, there is no support for stemming nor any language support. I came across a stemming function which can be installed, but MYSQL doesn't support function based indexes. (http://rachbelaid.com/postgres-full-text-search-is-good-enough/)
By Gülçin Yıldırım Jelínek
This presentation is prepared for PHPKonf 2018 in Istanbul.
Staff Database Engineer @Xata, Main Organizer @Prague PostgreSQL Meetup, MSc, Computer and Systems Engineering @ Tallinn University of Technology, BSc, Applied Mathematics @Yildiz Technical University