PostgreSQL

is NOT your traditional

SQL database

Gülçin Yıldırım

May 2018, PHPKonf Istanbul

select * from me;

Lead of Cloud Development @ 2ndQuadrant

Board Member @ PostgreSQL Europe

MSc Comp. & Systems Eng. @ Tallinn University of Technology

Organizer @ Prague PostgreSQL Meetup

Writes on 2ndQuadrant blog

@kadinyazilimci

From Turkey

Lives in Prague

Agenda

  • Design choices of PostgreSQL

  • Arrays, Enum, JSON

  • JSONB and GIN 

  • Full Text Search in PostgreSQL

    • tsvector, tsquery

    • Ranking

    • Fuzzy search / Misspelling

    • Accent support

    • Language support

  • Why PostgreSQL?

Design Choices of PostgreSQL

  • Conventional Relational PostgreSQL
    • Tables, Columns, Rows, Query Processing
  • Object Relational PostgreSQL
    • Extensibility 
      • Rich type system
      • Wide variety of index types
  • Power of combining all
    • Following SQL standards
    • ACID properties

Arrays

  • Standard arrays
  • Array operators (@>, <@, &&, =, <> etc)
  • Search in the array
  • Process array elements from SQL directly  
  • Index them with GIN
    • This index access method allows PostgreSQL to index the contents of the arrays, rather than each array as an opaque value.

Arrays

                                              Table "public.film"
        Column        |           Type           | Collation | Nullable |                Default
----------------------+--------------------------+-----------+----------+---------------------------------------
 film_id              | integer                  |           | not null | nextval('film_film_id_seq'::regclass)
 title                | text                     |           | not null |
 description          | text                     |           |          |
 release_year         | year                     |           |          |
 language_id          | smallint                 |           | not null |
 original_language_id | smallint                 |           |          |
 rental_duration      | smallint                 |           | not null | 3
 rental_rate          | numeric(4,2)             |           | not null | 4.99
 length               | smallint                 |           |          |
 replacement_cost     | numeric(5,2)             |           | not null | 19.99
 rating               | mpaa_rating              |           |          | 'G'::mpaa_rating
 last_update          | timestamp with time zone |           | not null | now()
 special_features     | text[]                   |           |          |
 fulltext             | tsvector                 |           | not null |

Arrays

fts_demo=> Select film_id, special_features from film 
           where special_features @> array['Deleted Scenes'] limit 15;
 film_id |                  special_features
---------+-----------------------------------------------------
       1 | {"Deleted Scenes","Behind the Scenes"}
       2 | {Trailers,"Deleted Scenes"}
       3 | {Trailers,"Deleted Scenes"}
       5 | {"Deleted Scenes"}
       6 | {"Deleted Scenes"}
       7 | {Trailers,"Deleted Scenes"}
       9 | {Trailers,"Deleted Scenes"}
      10 | {Trailers,"Deleted Scenes"}
      12 | {Commentaries,"Deleted Scenes"}
      13 | {"Deleted Scenes","Behind the Scenes"}
      14 | {Trailers,"Deleted Scenes","Behind the Scenes"}
      19 | {Commentaries,"Deleted Scenes","Behind the Scenes"}
      20 | {Commentaries,"Deleted Scenes","Behind the Scenes"}
      23 | {Trailers,"Deleted Scenes"}
      26 | {Commentaries,"Deleted Scenes"}
(15 rows)

Arrays

fts_demo=> CREATE INDEX idx_sp_features ON film USING GIN(special_features);
CREATE INDEX

fts_demo=> Explain analyze (Select * from film 
                            where special_features @> array['Deleted Scenes']);
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on film  (cost=11.90..73.19 rows=503 width=386) (actual time=0.058..0.187 rows=503 loops=1)
   Recheck Cond: (special_features @> '{"Deleted Scenes"}'::text[])
   Heap Blocks: exact=55
   ->  Bitmap Index Scan on idx_sp_features  (cost=0.00..11.77 rows=503 width=0) (actual time=0.046..0.046 rows=503 loops=1)
         Index Cond: (special_features @> '{"Deleted Scenes"}'::text[])
 Planning time: 0.512 ms
 Execution time: 0.267 ms
(7 rows)

Enum

  • Lookup table
  • Stores integer instead of whole value in table
  • Denormalized, you don't need a separate table
  • Faster reads
  • Intended for static sets of values
  • Takes very little space, four bytes on disk
  • All of this is indexable! \o/

Enum

create type status as enum('backlog', 'in-progress', 'done', 'delivered');

create table issues
 (
   id bigint primary key,
   description   text,
   state status 
 );

insert into issues(id, description, state)
     values (1, 'Implement Job for Switching DNS API Call', 'backlog'),
            (2, 'Report an issue mechanism for customers', 'in-progress'),
            (3, 'Cost reports', 'done'),
            (4, 'Scheduled Jobs Mechanism', 'delivered');

fts_demo=> Select * from issues where state = 'in-progress';
 id |               description               |    state
----+-----------------------------------------+-------------
  2 | Report an issue mechanism for customers | in-progress
(1 row)

Enum

fts_demo=> set enable_seqscan = off;
SET

fts_demo=> create index idx_state on issues(state);
CREATE INDEX

fts_demo=> Explain analyze (Select * from issues where state = 'in-progress');
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_state on issues  (cost=0.13..8.15 rows=1 width=44) (actual time=0.007..0.008 rows=1 loops=1)
   Index Cond: (state = 'in-progress'::status)
 Planning time: 0.054 ms
 Execution time: 0.023 ms
(4 rows)

JSON

  • Validated as correct JSON
  • Stores as text
  • Keeps the same format as it sent
  • Useful if;
    • you want to store bunch of JSON (fast)
    • you don't need to search in JSON itself
  • Fast to write
    • you don't transform but only validate
  • More intensive to search
    • you obviously interpret it every time you access it

JSON

create table js(id serial primary key, extra json);
insert into js(extra)
     values ('[1, 2, 3, 4]'),
            ('[2, 3, 5, 8]'),
            ('{"key": "value"}');

fts_demo=> select * from js where extra @> '2';
ERROR:  operator does not exist: json @> unknown
LINE 1: select * from js where extra @> '2';
                                     ^
HINT:  No operator matches the given name and argument type(s). You might need to add explicit type casts

alter table js alter column extra type jsonb;

fts_demo=> select * from js where extra @> '2';
 id |    extra
----+--------------
  1 | [1, 2, 3, 4]
  2 | [2, 3, 5, 8]
(2 rows)

JSONB

  • JSONB is already stored in (internal binary format) interpreted form. This means:

    • storing take a little while longer (more CPU process)

    • but processing (retrieval) faster

  • The main thing is all JSON document can be indexed with a single GIN index.

fts_demo=> create index on js using gin (extra jsonb_path_ops);
CREATE INDEX

JSONB

fts_demo=> explain analyze (select * from js where extra @> '2');
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on js  (cost=8.00..12.01 rows=1 width=36) (actual time=0.011..0.012 rows=2 loops=1)
   Recheck Cond: (extra @> '2'::jsonb)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on js_extra_idx  (cost=0.00..8.00 rows=1 width=0) (actual time=0.006..0.006 rows=2 loops=1)
         Index Cond: (extra @> '2'::jsonb)
 Planning time: 0.054 ms
 Execution time: 0.031 ms
(7 rows)

fts_demo=> explain analyze (select * from js where extra @> '[2,3]');
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on js  (cost=12.00..16.01 rows=1 width=36) (actual time=0.012..0.013 rows=2 loops=1)
   Recheck Cond: (extra @> '[2, 3]'::jsonb)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on js_extra_idx  (cost=0.00..12.00 rows=1 width=0) (actual time=0.007..0.007 rows=2 loops=1)
         Index Cond: (extra @> '[2, 3]'::jsonb)
 Planning time: 0.053 ms
 Execution time: 0.032 ms
(7 rows)

JSONB

  • Interpreted format is different than what you sent originally, it goes through normalisation:

    • keys are sorted

    • duplicated keys are removed and only first ones are saved

    • whitespaces removed etc.

  • Fits into JSON standard (JSONB is Postgres' JSON)

    • schemaless PostgreSQL 

    • heterogeneous set of documents all in a single relation

    • semi-structured data model

GIN

Generalised Inverted Index

Why?

forward indexes

list of documents and which words appear in them

  • there is almost no duplication 

backward (inverted) indexes

list of words and in which documents they appeared

  • it is efficient

  • duplicate data in values

  • the more duplication the more efficient indices

GIN

ID Document
1 PostgreSQL is awesome
2 Awesome things happen
3 PHP loves PostgreSQL
4 Istanbul is awesome too
5 Thanks PHP
Term Document ID
awesome 1, 2, 4
happen 2
is 1, 4
istanbul 4
loves 3
php 3, 5
postgresql 1, 3
thanks 5
things 2
too 4

inverted index simplified

posting list

key

(               ,           )

(               ,       )

GIN

  • GIN is an index that allows indexing of complex data types
    • Postgres data types extract keys and positions of them
    • Key is data type specific
    • In the case of JSON it can store of the paths of JSONB documents. This is its key.​
  • GIN is very efficient in duplicate keys (GIN keys)
    • Keys of JSON != Keys of GIN
  • GIN has more compact way of storing duplicate values (keys) than B Tree.

    • Which you get a lot in JSONB because you usually have similar structure in documents. Similar to full text engine’s indexing.

FTS in PostgreSQL

FTS in PostgreSQL

  • FTS is implemented in a similar fashion like JSONB type:
    • there are types like ts_vector which get text input and parses into lexemes
  • Difference between JSONB:
    • ts_vector only stores info that is useful for FTS while JSONB stores the actual document as well
    • that has affect on how it is used afterwards:
      • JSONB is used as column type while ts_vector is mostly used for creating indexes as index definition or compound values (indexing multiple columns at the same time)

tsvector

tsvector which is a type suited to full-text search

fts_demo=# SELECT to_tsvector('Happiness is an allegory, unhappiness a story.');
                 to_tsvector
----------------------------------------------
 'allegori':4 'happi':1 'stori':7 'unhappi':5
(1 row)

fts_demo=# SELECT to_tsvector('Happiness is an allegory, unhappiness a story.') 
           @@ 'happiness';
 ?column?
----------
 f
(1 row)

tsquery


fts_demo=# SELECT to_tsvector('Happiness is an allegory, unhappiness a story.') 
           @@ to_tsquery('happiness');
 ?column?
----------
 t
(1 row)

fts_demo=# SELECT to_tsvector('Happiness is an allegory, unhappiness a story.') 
           @@ to_tsquery('happiness & unhappiness');
 ?column?
----------
 t
(1 row)

 tsquery stores lexemes that are to be searched for

Querying

Select title, description
from
    (select title, description, to_tsvector(title) || 
            to_tsvector(description) as searchterm
    from film) as q
where q.searchterm @@ to_tsquery('Human & Database')
limit 5;

      title      |                                                        description
-----------------+----------------------------------------------------------------------------------------------------------------------------
 ANONYMOUS HUMAN | A Amazing Reflection of a Database Administrator And a Astronaut who must Outrace a Database Administrator in A Shark Tank
 HUMAN GRAFFITI  | A Beautiful Reflection of a Womanizer And a Sumo Wrestler who must Chase a Database Administrator in The Gulf of Mexico
(2 rows)
Select title, ts_rank(q.searchterm, to_tsquery('DINOSAUR | Feminist')) as searchrank, description
from
    (select title, description, setweight(to_tsvector(title), 'A') || 
            setweight(to_tsvector(description), 'B') as searchterm
    from film) as q
where q.searchterm @@ to_tsquery('DINOSAUR | Feminist')
order by searchrank desc
limit 5;


       title        | searchrank |                                            description
--------------------+------------+----------------------------------------------------------------------------------------------------
 ACADEMY DINOSAUR   |   0.425549 | A Epic Drama of a Feminist And a Mad Scientist who must Battle a Teacher in The Canadian Rockies
 DINOSAUR SECRETARY |   0.425549 | A Action-Packed Drama of a Feminist And a Girl who must Reach a Robot in The Canadian Rockies
 CENTER DINOSAUR    |   0.303964 | A Beautiful Character Study of a Sumo Wrestler And a Dentist who must Find a Dog in California
 SPY MILE           |   0.165491 | A Thrilling Documentary of a Feminist And a Feminist who must Confront a Feminist in A Baloon
 BUNCH MINDS        |   0.151982 | A Emotional Story of a Feminist And a Feminist who must Escape a Pastry Chef in A MySQL Convention
(5 rows)

Ranking

1x

1x

0

3x

2x

Fuzzy Search Using Trigrams

Trigram?

"h"

"he"

"hel"

"ell"

"llo"

"lo"

"o"

hello

hallo

"h"

"ha"

 "hal"

"all"

"llo"

"lo"

"o"

fts_demo=# Create extension pg_trgm;
CREATE EXTENSION

fts_demo=# select similarity('hello','hallo');
 similarity
------------
   0.333333
(1 row)

Similarity and Distance

%,<%, <->
fts_demo=# explain analyze select description from film 
           where description %> 'Feminist';
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Seq Scan on film  (cost=10000000000.00..10000000067.50 rows=1 width=94) (actual time=0.031..14.900 rows=84 loops=1)
   Filter: (description %> 'Feminist'::text)
   Rows Removed by Filter: 916
 Planning time: 0.046 ms
 Execution time: 14.919 ms

fts_demo=# CREATE INDEX trgm_idx ON film USING GIN (description gin_trgm_ops);
CREATE INDEX

fts_demo=# explain analyze select description from film 
           where description %> 'Feminist';
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on film  (cost=76.01..80.02 rows=1 width=94) (actual time=0.113..1.945 rows=84 loops=1)
   Recheck Cond: (description %> 'Feminist'::text)
   Rows Removed by Index Recheck: 29
   Heap Blocks: exact=49
   ->  Bitmap Index Scan on trgm_idx  (cost=0.00..76.01 rows=1 width=0) (actual time=0.085..0.085 rows=113 loops=1)
         Index Cond: (description %> 'Feminist'::text)
 Planning time: 0.132 ms
 Execution time: 1.970 ms

Like Queries

LIKE, ILIKE, ~, ~*
fts_demo=# Explain analyze select description from film  
           where description like '%Feminist%';
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on film  (cost=52.63..111.30 rows=81 width=94) (actual time=0.052..0.112 rows=84 loops=1)
   Recheck Cond: (description ~~ '%Feminist%'::text)
   Heap Blocks: exact=42
   ->  Bitmap Index Scan on trgm_idx  (cost=0.00..52.61 rows=81 width=0) (actual time=0.044..0.044 rows=84 loops=1)
         Index Cond: (description ~~ '%Feminist%'::text)
 Planning time: 0.108 ms
 Execution time: 0.135 ms
(7 rows)

Misspelling

fts_demo=# CREATE TABLE unique_lexeme AS 
           SELECT word FROM ts_stat(
           'SELECT to_tsvector(''simple'', first_name) || 
               to_tsvector(''simple'', last_name)
           FROM actor 
           GROUP BY actor_id');

fts_demo=# CREATE INDEX lexeme_idx ON unique_lexeme USING GIN (word gin_trgm_ops);
CREATE INDEX

fts_demo=# SELECT word from unique_lexeme
           WHERE similarity(word, 'sinatro') > 0.5
           ORDER BY word <-> 'sinatro'
           LIMIT 10;
  word
---------
 sinatra
(1 row)

Multilingual PostgreSQL

Built-in text search for Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.

Accent Support

CREATE EXTENSION unaccent;

SELECT unaccent('Gülçin Yıldırım Jelínek');
        unaccent         
-------------------------
 Gulcin Yildirim Jelinek
(1 row)

fts_demo=# CREATE TEXT SEARCH CONFIGURATION tr ( COPY = turkish );
CREATE TEXT SEARCH CONFIGURATION
fts_demo=# ALTER TEXT SEARCH CONFIGURATION tr 
           ALTER MAPPING FOR hword, hword_part, word WITH unaccent, turkish_stem;
ALTER TEXT SEARCH CONFIGURATION

fts_demo=# SELECT to_tsvector('tr', 'Gülçin') @@ to_tsquery('tr', 'gulcin') as result;
 result 
--------
 t
(1 row)

fts_demo=# set default_text_search_config to 'tr';
SET
fts_demo=# SELECT to_tsvector('Gülçin') @@ to_tsquery('gulcin') as result;
 result 
--------
 t
(1 row)

PostGIS

Geospatial search in PostgreSQL? Yes, ofc!

Why PostgreSQL?

Advantages of PostgreSQL over using a search engine:

  • You can use the existing relations
  • You can query related information (joins)
  • You can do all in one query (transactional)
  • When you update (insert, delete) your document, indexes are updated automatically
    • Rebuilding indexes are not a concern
    • FTS is always up-to-date (no 404)
  • Same ACID properties
  • You don’t need to maintain two techs (two dataset)

Why PostgreSQL?

JSONB

  • Stable schema and flexibly evolving data in the same database
  • Denormalisation without the downsides
    • No unnecesary tables
    • No unnecessary joins
fts_demo=# Select first_name, last_name, education from staff;
-[ RECORD 1 ]-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
first_name | Mike
last_name  | Hillyer
education  | {"properties": {"university": {"type": "oxford"}, "high school": {"name": "harvard business school"}}, "dependencies": {"graduation-date": ["2017-11-10"]}}
-[ RECORD 2 ]-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
first_name | Jon
last_name  | Stephens
education  | {"properties": {"university": {"type": "tallinn university of technology"}, "high school": {"name": "business school"}}, "dependencies": {"graduation-date": ["2017-10-23"]}}

References

Thank you! Questions?

MySQL

MySQL full-text search features are very limited. By default, there is no support for stemming nor any language support. I came across a stemming function which can be installed, but MYSQL doesn't support function based indexes. (http://rachbelaid.com/postgres-full-text-search-is-good-enough/)

 

  • Migrate to PostgreSQL
  • Investigate a search engine like SOLR and Elasticsearch
Made with Slides.com