The Hidden Superpowers of PostgreSQL: Fuzzy Search

PostgreSQL isn't just a database. It's a Swiss Army knife hiding in plain sight.

Our client's application at Dennenboom was growing faster than expected. What started as a simple CRUD app was now handling thousands of records with users demanding more sophisticated search capabilities.

The data was already in PostgreSQL. Everything was working smoothly... except users wanted to find records even when their spelling wasn't perfect.

The Search Dilemma

When users started requesting the ability to find records despite typos or spelling variations, my first instinct was to reach for specialized search tools. The standard approach would be:

Export data to Elasticsearch or similar search engine
Create indexing pipelines
Implement synchronization between PostgreSQL and the search service
Add new infrastructure to our stack

This meant:

Data duplication
Sync issues
More infrastructure to maintain
Higher costs
More potential points of failure

I could already feel the complexity creeping in.

The PostgreSQL Alternative

Instead of reaching for another tool, I decided to dig deeper into PostgreSQL's capabilities.

"What if PostgreSQL already has what we need?"

That's when I discovered the pg_trgm extension - a powerful but often overlooked feature that enables fuzzy search capabilities directly within PostgreSQL.

Diving Deep: How pg_trgm Works

The pg_trgm extension (trigram) works on a surprisingly simple yet powerful concept: it breaks text into three-character sequences and then compares these sequences between different strings.

For example, the word "hello" would be broken down into:

" h" (two spaces and h)
" he" (space, h, and e)
"hel"
"ell"
"llo"
"lo " (l, o, and space)
"o " (o and two spaces)

When comparing two strings, PostgreSQL calculates how many of these trigrams they share in common. The more they share, the more similar they are.

To enable this functionality, we first install the extension:

CREATE EXTENSION pg_trgm;

At this point, we can already start using fuzzy search functionality. For larger datasets, you might want to add an index to improve performance, though it's optional for achieving the functionality:

-- Optional: Add an index for better performance on large datasets

CREATE INDEX idx_product_name_trigram ON products USING gin (name gin_trgm_ops);

If you do create an index, here's what it would do:

USING gin: Creates a Generalized Inverted Index (GIN), which can speed up trigram searches.
gin_trgm_ops: This operator class optimizes the index for trigram matching operations.

But remember, the index is optional - you'll get the fuzzy search functionality either way.

Similarity Operators in Action

The pg_trgm extension provides several operators:

% - The similarity operator, returns true if the similarity is greater than the current threshold
<-> - Returns the "distance" between strings (lower means more similar)
similarity() - Function that returns a value between 0 (completely different) and 1 (identical)

Let's look at a practical example:

SELECT name, similarity(name, 'accomodation')

FROM products

WHERE name % 'accomodation'

ORDER BY similarity(name, 'accomodation') DESC

LIMIT 5;

This query finds product names similar to "accomodation" (note the misspelling - it's missing a 'm'). The results were impressive:

"accommodation" (0.82 similarity)
"accommodations" (0.76 similarity)
"accommodation services" (0.65 similarity)
"luxury accommodation" (0.61 similarity)
"temporary accommodation" (0.58 similarity)

The database correctly identified the properly spelled word, despite the typo in the search term!

Fine-Tuning: Adjusting Sensitivity

One of the most powerful aspects of pg_trgm is the ability to adjust how strict or lenient your fuzzy matches are by setting the similarity threshold:

-- Make matching more strict (default is 0.3)

SET pg_trgm.similarity_threshold = 0.5;

-- Make matching more lenient

SET pg_trgm.similarity_threshold = 0.2;

This allows you to balance between:

Too many false positives (irrelevant matches)
Too many false negatives (missing relevant matches)

Performance Impact

The big question: what's the performance cost?

We ran benchmarks against our dataset of 50,000 records:

Exact match query: 5ms
Fuzzy match query: 15ms

A mere 10ms difference for an exponentially more useful feature.

The GIN index ensures that PostgreSQL doesn't need to perform a full table scan, making fuzzy searches almost as fast as exact matches.

Beyond Simple Misspellings

The pg_trgm extension isn't just for catching typos. It's remarkably versatile:

Implementing "Sounds Like" Search

Combined with the fuzzystrmatch extension, you can create phonetic search capabilities:

CREATE EXTENSION fuzzystrmatch;

SELECT name

FROM products

WHERE metaphone(name, 10) = metaphone('accommodation', 10)

OR name % 'accommodation';

This would catch both similar spellings and words that sound similar when pronounced.

Multiple Column Search

You can extend fuzzy search across multiple columns:

CREATE INDEX idx_product_full_text ON products

USING gin ((name || ' ' || description) gin_trgm_ops);

SELECT name, description

FROM products

WHERE (name || ' ' || description) % 'ergonomic chair';

This would find products where either the name or description (or a combination) matches the search term.

Real-World Impact

For our client, implementing fuzzy search directly in PostgreSQL meant:

Zero data synchronization issues
No additional infrastructure
Simpler codebase
Lower costs
Fewer things to monitor and maintain

All while providing the advanced search capabilities users needed.

We added autocomplete, handled typos, and even implemented "did you mean?" suggestions - all without leaving PostgreSQL.

The Deeper Question

This experience raised a more profound question: What other superpowers are hiding in the tools we already use? How often do we reach for new technologies when our existing stack could handle the job?

PostgreSQL isn't just a relational database - it's an extensible platform with surprising capabilities if you know where to look.

Implementation Tips

If you want to implement fuzzy search in your PostgreSQL database, here are some practical tips:

Start simple: You can get fuzzy search working without any indices first. Only add indices if you notice performance issues with larger datasets.
Consider index types: If you do need an index, GIN indices typically provide better performance for fuzzy search, but they require more storage space.
Materialized views: For complex fuzzy searches across multiple tables, consider creating a materialized view with pre-computed data.
Benchmark with real data: Fuzzy search performance depends heavily on your specific dataset and query patterns.
Combine with full-text search: For more advanced scenarios, you can combine trigram similarity with PostgreSQL's full-text search capabilities.

Conclusion

Next time you find yourself reaching for a specialized search engine to handle fuzzy queries, consider whether PostgreSQL's built-in capabilities might already give you what you need.

Sometimes the best solution isn't a new tool, but a deeper understanding of the ones already in your toolbox.

At Dennenboom, we believe in making the most of what we have before adding complexity. This is the first in our series exploring PostgreSQL's hidden superpowers.

Ayoub EZZINI

I'm a Software Engineer!

Categorieën

PostgreSQL English

Terug naar blog overzicht

The Hidden Superpowers of PostgreSQL: Fuzzy Search

The Search Dilemma

The PostgreSQL Alternative

Diving Deep: How pg_trgm Works

Similarity Operators in Action

Fine-Tuning: Adjusting Sensitivity

Performance Impact

Beyond Simple Misspellings

Implementing "Sounds Like" Search

Multiple Column Search

Real-World Impact

The Deeper Question

Implementation Tips

Conclusion

Ayoub EZZINI

Categorieën

Do you want to be kept up to date?