Full Text Search (PostgreSQL)

Parsing Documents

PostgreSQL provides the function to_tsvector for converting a document to the tsvector data type. A ts_vector is a sorted list of distinct lexemes:

  <database>#= SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
                      tsvector                      
  ----------------------------------------------------
  'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'
  <database>#= SELECT to_tsvector('a fat cat sat on a mat and ate a fat rat');
                                  to_tsvector                                  
  -------------------------------------------------------------------------------
  'a':1,6,10 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'on':5 'rat':12 'sat':4
  <database>#= SELECT 'The quick brown fox jumps over the lazy dog'::tsvector;
                           tsvector                            
  ---------------------------------------------------------------
  'The' 'brown' 'dog' 'fox' 'jumps' 'lazy' 'over' 'quick' 'the'
  <database>#= SELECT to_tsvector('The quick brown fox jumps over the lazy dog');
                                to_tsvector                                
  ---------------------------------------------------------------------------
  'brown':3 'dog':9 'fox':4 'jumps':5 'lazy':8 'over':6 'quick':2 'the':1,7

A ts_query can be used to build lexemes and apply then to certain operators.

The match operator @@ returns true if a document satisfies the query and false otherwise. It can be used in the WHERE clause of a SELECT statement or anywhere else where a Boolean expression is expected.

(Dombrovskaya, Novikov, and Bailliekova 2021, 324)

  <database>#= SELECT 'fat & rat'::tsquery;
      tsquery    
  ---------------
   'fat' & 'rat'
  <database>#= SELECT to_tsvector('a fat cat sat on a mat and ate a fat rat') @@ to_tsquery('fat | cat') as result;
   result 
  --------
   t
  <database>#= SELECT to_tsvector('a fat cat sat on a mat and ate a fat rat') @@ to_tsquery('bat | hat') as result;
   result 
  --------
   f

you can leverage generated columns when creating tables with searchable content.

  CREATE TABLE user_comment (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    content TEXT NOT NULL,
    search_vector_en TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
  );

The following query picks all the movies whose titles contain "star" immediately followed by either "wars" or "trek".

  SELECT title
  FROM movies
  WHERE to_tsvector(title) @@ to_tsquery('star <-> (wars | trek)');

Highlights

To present search results it is ideal to show a part of each document and how it is related to the query. Usually, search engines show fragments of the document with marked search terms. PostgreSQL provides a function ts_headline that implements this functionality.

Websearch

  SELECT
    plainto_tsquery('star trek') as plain,
    phraseto_tsquery('star trek') as phrase,
    websearch_to_tsquery('star trek -khan') as websearch;
  -- Outputs
        plain      |      phrase       |    websearch    
  -----------------+-------------------+-----------------
   'star' & 'trek' | 'star' <-> 'trek' | 'star' & 'trek' & !'khan'

Ranking

  SELECT
    *,
    ts_rank(
      -- base
      to_tsvector(title), 
      -- query
      to_tsquery('star & (wars | trek)'),
      -- weight (optional)
      -- 0 makes it so the length is not taken into account
      -- 1 favors shorter matches
      1 
    ) AS rank
  FROM bookmarks
  ORDER BY rank DESC;

Setting weights

setweight

Indexing

References:

Dombrovskaya, Henrietta, Boris Novikov, and Anna Bailliekova. 2021. Postgresql Query Optimization. Springer.

Backlinks: