Un moteur de recherche avec PostgreSQL

Pierre-Louis Le Portz

Développeur chez BAM

github: pleportz // twitter: @leportzp

Un champ texte
Un ou plusieurs mots clés recherchés
Classement par pertinence

"tarte"

SELECT ...

recettes

Recherche textuelle

1. Moteur de recherche en PostgreSQL

2. Performance

3. Solutions alternatives

Approche naïve avec ILIKE

'Délicieuse tarte' ILIKE '%TARTE%'

Ce n'est pas robuste si l'utilisateur emploie :

le pluriel
une forme dérivée du mot

De plus c'est LENT !

'Tarte à la tomate' ILIKE '%tarte%'

'Délicieuse tarte' ILIKE '%tartes%'

'Tarte préparée avec amour' ILIKE '%amoureusement%'

SELECT * FROM recipe WHERE title ILIKE '%Tarte%';

true

false

true

false

to_tsvector

to_tsquery

@@

ts_rank

Full-Text Search de PostgreSQL

to_tsquery

'Tarte & à & la & tomate'

'tart' & 'tomat'

to_tsquery('french', X)

'Tarte | à | la | tomate'

'tart' | 'tomat'

Transforme une chaîne de caractère en une text search query.

On s'affranchit :

de la casse,
des mots inutiles,
des pluriels

Plus généralement, on peut matcher 2 mots qui ont la même racine.

'Tarte à la tomate'

.replace(/ /g, ' & ')

to_tsvector

'Amoureusement'

Transforme une chaîne de caractère en un text search vector.

Même comportement que to_tsquery concernant la casse, la racine des mots et les mots inutiles.

'Tarte à la tomate'

'tart':1 'tomat':4

'amour':1

'avec amour'

'amour':2

to_tsvector('french', 'Tarte à la tomate')

@@

l'opérateur de correspondance

to_tsquery('french', 'Tarte & poireau') @@ to_tsvector('french', 'Tarte à la tomate')

'tart':1 'tomat':4

to_tsquery('french', 'Tarte') @@ to_tsvector('french', 'Tarte à la tomate')

Permet de savoir si une text search query correspond à un text search vector
On peut utiliser des opérateurs comme & (AND), | (OR) et ! (NOT) dans la query

true

false

true

to_tsquery('french, 'Tarte | poireau') @@ to_tsvector('french', 'Tarte à la tomate')

'tart'

to_tsvector

to_tsquery

@@

SELECT title FROM recipe 
WHERE to_tsquery('french', 'Tarte & à & la & tomate')
      @@
      to_tsvector('french', title);

ts_query @@ ts_vector

ts_rank

Compare une query et un vecteur pour renvoyer un indice de pertinence
Dépend :
- de la fréquence de la query dans le vector
- des poids (A, B, C, ou D) attribués aux racines dans le vecteur

SELECT
    ts_rank(
        to_tsvector('french', 'Tarte à la tomate'),
        to_tsquery('french', 'Tarte')
    );

= 0.061

SELECT
    ts_rank(
        to_tsvector('french', 'Tarte à la tomate'),
        to_tsquery('french', 'Tarte & à & la & tomate')
    );

= 0.097

SELECT
    ts_rank(
        setweight(to_tsvector('french', 'Tarte à la tomate'), 'A'),
        to_tsquery('french', 'Tarte & à & la & tomate')
    );

= 0.934

En assemblant tout ...

SELECT * FROM recipe

WHERE
    to_tsquery('french', 'Tarte & à & la & tomate') @@ to_tsvector('french', title)

ORDER BY
    ts_rank(to_tsvector('french', title), to_tsquery('french', 'Tarte & à & la & tomate')) DESC;

to_tsvector

to_tsquery

@@

ts_rank

Performance

Table "movie"

title ~3 mots
overview ~ 40 mots

10.000 films

Exemple

title : "The Lion King"
overview : "A young lion cub named Simba can't wait to be king. But his uncle craves the title for himself and will stop at nothing to get it."

On veut la liste des films dont le titre ou le résumé contient "eat"

Performance du ILIKE

SELECT title FROM movie

WHERE title ILIKE '%eat%' OR overview ILIKE '%eat%';

Plein de résultats qui n'ont rien à voir avec "manger" et de plus :

Cost 629
Execution time : 118 ms

Performance du FTS - étape 1

SELECT title FROM movie

WHERE
  to_tsquery('english', 'eat')
  @@
  (to_tsvector('english', title) || to_tsvector('english', overview))

ORDER BY
  ts_rank(
      setweight(to_tsvector('english', title), 'A')
      || setweight(to_tsvector('english', overview), 'B'),
      to_tsquery('english', 'eat')
  ) DESC;

Des résultats pertinents et ordonnés mais

Cost 6562
Execution time 542 ms

SELECT title FROM movie

WHERE text_search_vector @@ to_tsquery('english', 'eat')

ORDER BY ts_rank(text_search_vector, to_tsquery('english', 'eat')) DESC;

Performance du FTS - étape 2

Cost 131
Execution time 0.2 ms

UPDATE movie
SET text_search_vector =
  setweight(to_tsvector('english', coalesce(title, '')), 'A')
  || setweight(to_tsvector('english', coalesce(overview, '')), 'B');

ALTER TABLE movie ADD COLUMN text_search_vector TSVECTOR;

Solutions alternatives

Algolia et Elastic Search :

Pour aller plus loin sans réinventer la roue

Conclusion

Le full-text search de PostgreSQL, c'est :

performant
rapide à mettre en place
$ gratuit $

Vous n'avez pas forcément besoin de sortir l'artillerie lourde.

Avez-vous des questions ?

Ressources utiles :

- Mon repo avec toutes les expérimentations : https://github.com/pleportz/postgresql-sandbox

- Doc PostgreSQL sur les index à utiliser : https://www.postgresql.org/docs/9.1/textsearch-indexes.html

- "Postgres full-text search is Good Enough!" : http://rachbelaid.com/postgres-full-text-search-is-good-enough/

- Pour aller plus loin :

- les MATERIALIZED VIEWS : https://www.postgresql.org/docs/current/rules-materializedviews.html

- zombodb - Making Postgres and Elasticsearch work together : https://github.com/zombodb/zombodb

Pierre-Louis Le Portz

Développeur chez BAM

github: pleportz // twitter: @leportzp

Un moteur de recherche avec PostgreSQL

Recherche textuelle

1. Moteur de recherche en PostgreSQL

2. Performance

3. Solutions alternatives

Approche naïve avec ILIKE

to_tsvector

to_tsquery

@@

ts_rank

Full-Text Search de PostgreSQL

to_tsquery

to_tsvector

@@

l'opérateur de correspondance

to_tsvector

to_tsquery

@@

ts_query @@ ts_vector

ts_rank

En assemblant tout ...

to_tsvector

to_tsquery

@@

ts_rank

Performance

10.000 films

Performance du ILIKE

Performance du FTS - étape 1

Performance du FTS - étape 2

Algolia et Elastic Search :

Pour aller plus loin sans réinventer la roue

Conclusion

Un moteur de recherche textuel avec PostgreSQL

Un moteur de recherche textuel avec PostgreSQL

Pierre-Louis Le Portz

Un moteur de recherche avec PostgreSQL

Recherche textuelle

1. Moteur de recherche en PostgreSQL

2. Performance

3. Solutions alternatives

Approche naïve avec ILIKE

to_tsvector

to_tsquery

@@

ts_rank

Full-Text Search de PostgreSQL

to_tsquery

to_tsvector

@@

l'opérateur de correspondance

to_tsvector

to_tsquery

@@

ts_query @@ ts_vector

ts_rank

En assemblant tout ...

to_tsvector

to_tsquery

@@

ts_rank

Performance

10.000 films

Performance du ILIKE

Performance du FTS - étape 1

Performance du FTS - étape 2

Algolia et Elastic Search :

Pour aller plus loin sans réinventer la roue

Conclusion

Un moteur de recherche textuel avec PostgreSQL

More from Pierre-Louis Le Portz