8

I have a data table and I would like to filter the columns. For example, to search for a user by his fullname.

However, I would like to allow the user to enter the search phrase case-independent and accents-independent.

So I have checked these (and more) sources and questions:

I thought the nondeterministic collations might finally be the right way how to achieve that, but unfortunately I don't know how:

  • combine case_insensitive and ignore_accents into one collation
  • how to allow searching only by substring in such a WHERE (e.g., find "Jóhn Doe" only by the string "joh") as nondeterministic collations do not support LIKE or regular expressions
  • which index to use

I would be very grateful for any advice on how to finally deal with this type of problem.

Thanks!

1 Answer 1

9

Creating case and accent insensitive ICU collations is pretty simple:

CREATE COLLATION english_ci_ai (
   PROVIDER = icu,
   DETERMINISTIC = FALSE,
   LOCALE = "en-US-u-ks-level1"
);

Or, equivalently (that syntax also works wil old ICU versions:

CREATE COLLATION english_ci_ai (
   PROVIDER = icu,
   DETERMINISTIC = FALSE,
   LOCALE = "en-US@colStrength=primary"
);

See the ICU documentation for details and my article for a detailed discussion.

But your problem is that you want substring search. So you should create a trigram index:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS unaccent;

CREATE INDEX ON tab USING gin (unaccent(doc) gin_trgm_ops);

Then you can search like this:

SELECT * FROM tab
WHERE unaccent(doc) ILIKE unaccent('%joh%');

Note that you have to force a minimal length of 4 or so on the search string if you want that to be efficient.

Sign up to request clarification or add additional context in comments.

4 Comments

Thank you, your article helped me to understand how creating a custom ICU collation works. However, unfortunately my problem with not being able to search with a non-deterministic ICU according to a portion of the string persists.
Thanks for the update, I thought I'd have to stick with this unaccent solution. I just have to create an IMMUTABLE version of unaccent as mentioned in my first link. Thank you!
Laurenz, should a trigram index enable ILIKE queries for a case-insensitive collation? I created a trigram index on a column with a case insensitive collation, but I still receive the error ERROR: nondeterministic collations are not supported for ILIKE when running an ILIKE query on that column (note no accents to worry about in my situation).
@michael Yes, that is a restriction that might be lifted with PostgreSQL v18.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.