Full text search with Hasura
Hasura is an open source GraphQL layer on top of PostgreSQL that makes it easy to manage your data, and automatically exposes it through a GraphQL endpoint.
Having now tinkered with it on a few smaller personal projects, and two professional projects, I've started wanting to do things that don't neceessarily come naturally on Hasura. Full text search is one of those.
Scenario
Let's use a simple table as an example:
CREATE TABLE "articles" ( "id" uuid NOT NULL DEFAULT gen_random_uuid(), "title" text, "content" text, "tags" text )
Our goal is to run full text search over a document comprising title
content
tags
.
The simple way
The easiest way to integrate a form of search, would be to use some basic PostgreSQL pattern matching - Which Hasura exposes in it's searches.
query Search($searchValue: String! = "%some search string%") { articles( where: { _or: [ { title: { _ilike: $searchValue } } { content: { _ilike: $searchValue } } { tags: { _ilike: $searchValue } } ] } ) { id title content tags } }
Simple. No frills. But there are some cons:
- No fuzzy search
- No ranking of results.
- No way of weighting the results based on whether the match is in the title, content, or tags.
- No dictionary - if you search for "amazingly" it won't match "amazing" or "amazed".
- It's super verbose, clunky to write, and prone to human programmer error!
So this is fine for very simple, and restrictive search scenarios. But we can do better.
Fuzzy Search via PostgreSQL Function
This example is taken directly from the Hasura documentation:
Open up the Hasura console, navigate to to Data > SQL
, then do the following:
CREATE EXTENSION pg_trgm;
We need this extension for creating the GIN index bellow
CREATE INDEX articles_gin_idx ON articles USING GIN ((title || ' ' || content || ' ' || tags) gin_trgm_ops);
As a non PostgreSQL expert, this felt like some hand wavey voodoo... What on earth is a GIN index? What on earth is a gin_trgm_ops
?
Lets start with the trgm
part:
A trigram is a group of three consecutive characters taken from a string. We can measure the similarity of two strings by counting the number of trigrams they share. This simple idea turns out to be very effective for measuring the similarity of words in many natural languages.
So this function would break the word test
into t
te
tes
est
st
t
GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items.
So in this case, we're creating a composite item from title
,content
andtags
, turning it into a trigram and, for performance reasons, we want our PostgreSQL database to index this for faster retrieval.
Now create the actual function (make sure you check the box: "track this function")
CREATE FUNCTION search_articles(search text) RETURNS SETOF articles AS $$ SELECT * FROM articles WHERE search <% (title || ' ' || content || ' ' || tags) ORDER BY similarity(search, (title || ' ' || content || ' ' || tags)) DESC LIMIT 5; $$ LANGUAGE sql STABLE;
The key thing here is the similarity(arg1, arg2)
function. It's a function from the pg_trgm
module that splits both arguments into trigrams, and compares them. This means our search is quite forgiving on typos and spelling mistakes!
Not bad, but we can do better.
Full Text Search using a Dictionary
The issue with fuzzy search, is that it doesn't perform well on derivative words. eg: "adventure" versus "adventuring" or "adventured". What if the person running the search doesn't actually know the exact words to look for? Fuzzy search is forgiving of typos, but not of inprecise search terms.
Enter PostgreSQL Dictionaries.
Dictionaries allow fine-grained control over how tokens are normalized. With appropriate dictionaries, you can:
- Define stop words that should not be indexed.
- Map synonyms to a single word using Ispell.
- Map phrases to a single word using a thesaurus.
- Map different variations of a word to a canonical form using an Ispell dictionary.
- Map different variations of a word to a canonical form using Snowball stemmer rules.
Nice. REALLY nice. Now let's make it happen.
First, let's create a function that can take all the columns we're interested in, and turn them into a document, and save that document into a new column called ts_vector
:
CREATE OR REPLACE FUNCTION public.create_ts_vector() RETURNS trigger LANGUAGE plpgsql AS $function$ BEGIN NEW.ts_vector = setweight(to_tsvector('simple', coalesce(NEW.title, '')), 'A') || setweight(to_tsvector('simple', coalesce(NEW.content, '')), 'B') || setweight(to_tsvector(coalesce(NEW.tags '')), 'C') RETURN NEW; END; $function$;
Quite a few things happening here.
- the
coalesce(arg1, arg2)
function. This is a precaution, it returns the first non-null argument. We need to guarantee a string, so we use this. - the
to_tsvector(arg1)
function. According to the PostgreSQL docs:A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word... Sorting and duplicate-elimination are done automatically during input.
- the
setweight(arg1, arg2)
function. Which ascribes a weight (denoted by capitalised letters: A, B, C etc in descending order of importance), to that specific vector. This is useful for ordering or prioritising parts of our document. In this case, we've decided that thetitle
should be given the most weight when matched against search terms.
Ok, next step: a trigger
that runs anytime a row is inserted or updated, so that this document column can be created/updated:
CREATE TRIGGER "add_trigger" BEFORE INSERT OR UPDATE ON "public"."articles" FOR EACH ROW EXECUTE FUNCTION create_ts_vector();
Pretty self explanatory.
Final Step: create an actual search function and expose it to hasura.
CREATE OR REPLACE FUNCTION public.search_articles(search text) RETURNS SETOF articles LANGUAGE sql STABLE AS $function$ SELECT * FROM articles WHERE articles.ts_vector @@ plainto_tsquery(search) ORDER BY ts_rank(articles.ts_vector, plainto_tsquery(search)) DESC $function$;
Make sure to track this function from the Hasura console.
So, what's happening here?
- This function takes in a search term as text.
- It returns a set of articles.
- It takes the search term and uses it as an argument in the
plainto_tsquery()
function, which does basically the same thing as theto_tsvector()
function, but for the search query. - Compares and tries to match the vector to the query using the "@@" operator.
- Returns the articles ordered and ranked (remember how we set a weight to our search terms?) by
ts_rank()
Now, we have a new hasura graphql query type which we can run using:
query search($searchTerm: String) { search_articles($searchTerm) { id title content tags } }