Full Text Search using PostgreSQL

Full Text Search

Full Text Searching (or just text search) provides the capability to identify natural-language documents that satisfy a query, and optionally to sort them by relevance to the query. The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query.

Cari yang mana?

SELECT *
FROM jkt48 t
WHERE t.name LIKE '%you%'

Number of records : 0

{
    "is_jomblo"  : TRUE,
    "message"    : "JOMBLO DETECTED!",
    "code"       : 911
}

response

  • ~ (tilde) 
  • ~* (tilde-asterisk)
  • LIKE
  • ILIKE

PostgreSQL has

Don't use this for Huge Data

Document

A document is the unit of searching in a full text search system; for example, a magazine article or email message. The text search engine must be able to parse documents and store associations of lexemes (key words) with their parent document. Later, these associations are used to search for documents that contain query words.

What Is a Document?

Example Query

SELECT title
FROM tweets
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');

Above query will also find related words such as friends and friendly, since all these are reduced to the same normalized lexeme.

SELECT title
FROM tweets
WHERE to_tsvector(body) @@ to_tsquery('friend');

Example Query

SELECT title
FROM tweets
WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create & table')
ORDER BY last_mod_date DESC
LIMIT 10;

Although these queries will work without an index, most applications will find this approach too slow, except perhaps for occasional ad-hoc searches. Practical use of text searching usually requires creating an index.

Index

PostgreSQL Index

  • B-Tree
  • Generalized Inverted Index (GIN)
  • Generalized Inverted Seach Tree (GiST)
  • Space partitioned GiST (SP-GiST)
  • Block Range Indexes (BRIN)
  • Hash

GIN

GIN is designed for handling cases where the items to be indexed are composite values

Example data types that suitable using GIN : hStore, Arrays,  JSONB

CREATE INDEX tweets_idx 
ON tweets 
USING GIN (to_tsvector(body));

Need Faster ?

Store your ts_vector value into new table or new column

ALTER TABLE tweets ADD COLUMN content_tokens tsvector;
UPDATE tweets 
SET content_tokens = to_tsvector(coalesce(title,'') || ' ' || coalesce(body,''));
CREATE INDEX textsearch_idx ON pgweb USING GIN (content_tokens);

Now, we are ready

SELECT title
FROM tweets
WHERE content_tokens @@ to_tsquery('create & table')
ORDER BY last_mod_date DESC
LIMIT 10;

Then we create a GIN index to speed up the search:

Limitation

  • The length of each lexeme must be less than 2K bytes

  • The length of a tsvector (lexemes + positions) must be less than 1 megabyte

  • The number of lexemes must be less than 264

  • Position values in tsvector must be greater than 0 and no more than 16,383

  • The match distance in a <N> (FOLLOWED BY) tsquery operator cannot be more than 16,384

  • No more than 256 positions per lexeme

  • The number of nodes (lexemes + operators) in a tsquery must be less than 32,768

Demo!

Fin

Made with Slides.com