Text Search in PostgreSQL

In this article, I'm documenting some of my findings when weighing different ways to implement text search in Postgres. We will also compare how these DB methods hold against external search systems like elasticsearch/Solr.

Available Options

Postgres offers several tools for pattern matching and searching text. There's:
  • LIKE, SIMILAR TO, Regex
  • Trigram index
  • Full Text Search
Each of these searching options can be implemented with multiple choices of index. Postgres provides several index types: B-tree, Hash, GiST, GIN & BRIN. Each index type uses a different algorithm that is best suited to different types of queries.

Database setup

LIKE & Regex for pattern search

To implement a search we can think of a simple pattern search where we are filtering the rows based on a exact pattern i.e LIKE pattern or ~ pattern
-- LIKE
-- "%" means wildcard character denoting any string
-- search users with username's starting with alice i.e pattern = 'alice%'
SELECT * FROM users WHERE username LIKE 'alice%';
-- search users with username's containing alice i.e pattern = '%alice%'
SELECT * FROM users WHERE username LIKE '%alice%';

-- Regex (~) & SIMILAR TO
-- ".*" denotes any string
-- SIMILAR TO expressions are rewritten into regular expressions internally
-- search users with username's starting with alice i.e pattern = '^alice.*'
SELECT * FROM users WHERE username ~ '^alice.*';
Searching on a million rows will require you to build an index on the searched column (or multi-columns).
EXPLAIN (COSTS OFF) SELECT * FROM users WHERE username LIKE 'alice%';
-- Seq Scan on users
The default index in Postgres is the B-tree which fits in most situations.
B-tree index only provide index scan on access predicates (= operator) and range queries. We need to create the index with a special operator class to support indexing of pattern-matching that too only left-anchored i.e 'pattern%' or 'patt%ern'. This prefix pattern search makes sense once you understand how a B-tree (B+tree) index works.
CREATE INDEX col_search ON table (col text_pattern_ops);
EXPLAIN (COSTS OFF) SELECT * FROM users WHERE username LIKE 'alice%';
-- Index Scan using col_search_idx on users

Trigram index

Another way of implementing a search instead of a direct pattern match can be a fuzzy search where the searches for a text that matches a term closely instead of a exact match like username matching "Ailce" search should also return the username "Alice" too.
To build this fuzzy search lets see the concept of Trigram. A trigram is a group of three consecutive characters taken from a string. We can measure the similarity of two strings by counting the number of trigrams they share. This simple idea turns out to be very effective for measuring the similarity of words in many natural languages. This similarity function allows for narrowing our filter similar to what levenshtein algorithm does.
By using the operator classes provided by the module pg_trgm provides index support for any LIKE/ ILIKE pattern using a GIN or GiST index, also supporting the word_similarity operators. Effectiveness of trigram index varies with the exact data set. For example, the trigram for "alice" would be:
-- install module CREATE EXTENSION pg_trgm;
select show_trgm('alice');
-- output - {" a"," al",ali,"ce ",ice,lic}
We can create index on these trigram's using a GIN or a GiST index operator classes provided by thepg_trgm.
GIN index (generalised inverted index) an inverted index maps entities to its position in a table. An inverted index is an index structure storing a set of (key, posting list) pairs, where 'posting list' is a set of documents in which the key occurs. GIN works well on composite data types (consisting of multiple elements e.g array, json, etc.)
Note: GIN only supports bitmap queries, as index tree only maintains the heap pointers.
GiST index (generalized search tree) a balanced search tree more optimized towards handling geodata, text documents, images, etc. In case of text a trie like structure is built which spread across multiple page index.
Note: GiST and GIN both index types can be used to speed up text searches. But, creating and maintaining these index carries a cost & there are substantial performance differences between the two of them. You can read further here to make a decision which index operator class fits your use case.
As a rule of thumb, a GIN index is faster to search than a GiST index, but slower to build or update. So, GIN is better suited for static data and GiST for often-updated data. I will be using GIN operator class index for our trigram exploration.
CREATE INDEX col_gin_trgm_idx ON table USING gin (col gin_trgm_ops);
SELECT col, similarity(col, 'word') AS sml FROM table WHERE col % 'word' ORDER BY sml DESC, col;
SELECT * FROM table WHERE col LIKE '%foo%bar';
-- The index search works by extracting trigrams from the search string and then looking these up in the index. The more trigrams in the search string, the more effective the index search is.
-- Unlike B-tree based searches, the search string need not be left-anchored.

Full Text Search

Postgres provides a advance full-text search out of box since PostgreSQL 8.3. It is based on TF-IDF concept. It involves multiple pre-index steps lemmatize, stemming, ignoring stop words, weight and rank search matches. This gives us the ability to index and search in large amounts of text data.
Imagine, we have a blog table having a million rows each containing documents that are not just meta-data but rather an abstract for an article, or full-text articles themselves, and you want to find out if certain words are present or not in them.
An effective way to approach this problem is by getting a semantic vector [vectorization] for all of the words contained in a document. So, when you search for a word like "jump", you will match all instances of the word and its tenses, even if you searched for "jumped" or "jumping". Additionally, you won't be searching the full document itself, but the vector [querying].
Postgres offers several native functions to implement this full-fledged text search, here are few to get started -
  • to_tsvector [vectorization]- creates a list of tokens of tsvector (text search vector) data type. Every token is a lexeme (unit of lexical meaning) with pointers (the positions in the document), and here words that carry little meaning, such as articles (the) and conjunctions (and, or) are conveniently omitted.
  • to_tsquery [querying] - querying the vector occurrences of a particular word/prhase
-- VECTORIZATION
SELECT to_tsvector('The quick brown fox jumped over the lazy dog.');
-- 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2

-- QUERYING
SELECT to_tsvector('The quick brown fox jumped over the lazy dog') @@ to_tsquery('fox');
-- Try FTS without Index
SELECT * FROM table WHERE to_tsvector('english', body) @@ to_tsquery('english', 'hello');

-- Lets try building an Index
CREATE INDEX v_search_idx ON table USING GIN (to_tsvector('english', body));


-- left anchored
SELECT * FROM users WHERE ts_username @@ to_tsquery('simple', 'alice:*');
-- search
select * from users where ts_username @@ to_tsquery('simple','alice');
PostgresSQL full text search works best when the text vectors are stored in physical columns with an index. This means you’d have to set up a stored procedure or database trigger to keep these text vector columns updated with newest value from the raw data.
Note: To implement multi-column text search using above options a new column maintaining these tokenized values including amalgam of these multi-column indexes will be required, with stored procedure or database trigger to keep them in sync.

External Search Engine -

Elasticsearch - A dedicated search system, maintaining a document with GIN index like manner while preserving the original document. ElasticSearch showcases searches & log analytics in near-real-time and a scalable manner. Elasticsearch (& other dedicated search systems) are far better-known for text search but here are few benefits of Postgres over a dedicated search system:
  • avoid operating additional infrastructure (e.g., spinning up an Elasticsearch cluster)
  • avoid syncing data between systems (e.g., every CRUD operation needs to be relayed from Postgres to Elasticsearch)
If search is your core offering, or you can’t afford to fit your searchable text in Postgres shared memory buffer, use Elasticsearch. Otherwise, Postgres is probably good enough. At the very least, you’ll end up with a competent baseline for further improvement.