
I quite often need to filter records in a specific table on the basis of the existence of a substring in a text column. Specifically I need to exclude records that contain a /.

I currently use a WHERE clause such as:

WHERE table_name.text_col NOT LIKE "%/%"

My hunch is that searching the strings of every record for this substring takes a long time (relatively) and could be improved by indexing in some way. I could create a new binary indexed column and populate this based on whether the text column contains / but I was wondering if there is a neater solution for this?

I found this question which refers to a LEFT() style solution but I didn't understand the syntax and I'm looking for something that can cope with the substring being anywhere in the string.


3 Answers

1

You could create a computed column that persistently stores the information:

alter table table_name 
    add column text_has_slash tinyint
    generated always as (text_col like '%/%')
    stored
;

Or, if you want to treat null values as negatives:

    generated always as (coalesce(text_col like '%/%', 0))

The column value is computed and stored in the table (it is automatically updated by the database when the value changes).

Now you can use that column in your query:

select * from table_name where not text_has_slash;
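To see the automatic update in action, here is a minimal sketch on a throwaway table. The table name slash_demo and its rows are made up for illustration, and it assumes MySQL 5.7 or later, where stored generated columns are available:

```sql
-- Hypothetical demo table; names and data are illustrative only.
create table slash_demo (
    id int primary key,
    text_col varchar(100),
    -- 1 when text_col contains '/', 0 otherwise (NULL is treated as 0)
    text_has_slash tinyint
        generated always as (coalesce(text_col like '%/%', 0))
        stored
);

insert into slash_demo (id, text_col)
values (1, 'a/b'), (2, 'abc'), (3, null);

-- text_has_slash is 1 for id 1 and 0 for ids 2 and 3
select id, text_col, text_has_slash from slash_demo;

-- Changing text_col recomputes the flag; no trigger is needed
update slash_demo set text_col = 'abc/' where id = 2;

-- Only id 3 remains without a slash
select id from slash_demo where text_has_slash = 0;
```

Note that the generated column is never written to directly: the insert and update statements only touch text_col.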


Filtering on a pre-computed value should already improve performance.

Creating an index on a boolean column does not necessarily help, because there are only three possible values (0, 1, null). Unless the values are very unevenly distributed, it is often faster for the database to perform a full scan. On the other hand, you might want to include this column in a multi-column index, if you have more search criteria than those you have shown.
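As a rough sketch of the multi-column idea, suppose the query also filters on a status column. That column, its value, and the index name are assumptions for illustration; nothing in the question says they exist:

```sql
-- Hypothetical: assumes table_name has a `status` column that is also filtered on.
-- The more selective column goes first; text_has_slash narrows the range further.
create index idx_status_slash on table_name (status, text_has_slash);

-- Comparing the flag with = 0 gives the optimizer a plain equality
-- it can match against the second index column.
select *
from table_name
where status = 'active'
  and text_has_slash = 0;

-- Check whether the optimizer actually picks the index.
explain
select *
from table_name
where status = 'active'
  and text_has_slash = 0;
```

Whether the index is chosen still depends on the data distribution, so the EXPLAIN output is worth checking on the real table.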


1

The real issue is whether the entire table needs to be checked versus whether there is some way to limit the number of rows via an index.

First, let's decide whether such an index will even get used. As a Rule of Thumb, if more than 20% of the table is matched by some index, that index won't be used. (The "20" varies depending on the phase of the moon.) The logic is that bouncing between the index's BTree and the data's BTree costs something. Such bouncing is worth the effort if there aren't many rows -- that is when the index is "selective".

So, if more than 20% of the rows have "/", none of the suggestions will be efficient. LOCATE may be more efficient than LIKE; a REGEXP is probably slower than either. Still, the main cost in the query will be having to look at every row.

If, on the other hand, very few rows have "/", then any of the pre-computed indexes will be beneficial.
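One way to find out which case applies is to measure the fraction directly. This is only a sketch using the table and column names from the question. It relies on MySQL evaluating LIKE to 1 or 0, so SUM counts the matches and AVG gives their share of the non-NULL rows:

```sql
-- One full scan, run once, to see how selective the "/" test is.
select
    count(*)                 as total_rows,
    sum(text_col like '%/%') as rows_with_slash,
    avg(text_col like '%/%') as fraction_with_slash  -- NULL text_col rows are ignored
from table_name;

-- Shows the access path the optimizer chooses for the original filter.
explain
select *
from table_name
where text_col not like '%/%';
```

Since the question excludes rows containing "/", the share that matters for an index on the flag is the rows without a slash, that is, 1 minus fraction_with_slash.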

If the real test is WHERE x LIKE '%/%' AND ..., then we need to look at the secondary part of the test. It may be that even a non-selective test for "/" can be effectively combined with the other part of the test.

Bottom line: Give us the complete picture, plus some statistics.

1 Comment

I would place the usefulness of an index way below 20%, probably at 5%, but that's just me.
0

Maybe LOCATE can help you.

WHERE LOCATE('/', table_name.text_col) = 0 

When LOCATE returns 0, it means the substring was not found in the string. More info at https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_locate.
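A few literal calls illustrate the return values. The strings are made up for the example:

```sql
select locate('/', 'a/b');   -- 2: position of the first '/', counted from 1
select locate('/', 'abc');   -- 0: substring not present
select locate('/', null);    -- NULL: any NULL argument yields NULL
```

Because of the last case, rows where text_col is NULL are not returned by LOCATE(...) = 0, which matches how NOT LIKE treats them.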

You mentioned LEFT(), which isn't what you're looking for: it returns a substring from the beginning of the string. The syntax is simple: https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_left
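For reference, a small sketch of what LEFT() does and why it cannot find a slash in the middle of a string. The literals are illustrative only:

```sql
select left('abc/def', 3);        -- 'abc': the first 3 characters
select left('/abc', 1) = '/';     -- 1: the string starts with '/'
select left('abc/def', 1) = '/';  -- 0: the '/' further along is never inspected
```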

4 Comments

This will also need to examine all records (a full table scan) to get the results.
Thanks. Will this be any faster than my current WHERE statement? Btw - I wasn't considering the LEFT() solution as I need something that "can cope with the substring being anywhere in the string."
@Luuk - thanks, you answered my question :) Do I have any indexing options here?
If you really need to query this, and you cannot change the data inside that column, then your own solution (creating a 'binary indexed column') should be the best.
