Query to show the limitations of B-Tree Index in mysql database

Question

I want to show some limitations on certain index types in mysql. I've read that using B-Tree index on columns that holds Boolean data types would be ineffective because any search query with this index type either for a True or False outcome would have to perform a full table scan. How can I show this in a sql query? I've tried the following query but it can't prove that the claim above is true, the issue is if I change the gender in my query to 'F', there's no advantage of using the index that I created. Please can someone show me a query that can prove that B-Tree index is ineffective on Boolean columns? I'm using the popular Employees database. Please note (the emp_no column holds unique values but I've added the gender column through the WHERE clause). Thanks for any help.

use employees;

CREATE INDEX indx_emp on employees(emp_no);

SELECT 
    *
FROM
    employees USE INDEX (indx_emp)
WHERE
    emp_no BETWEEN 10980 AND 100000
        AND gender = 'M'
ORDER BY birth_date;

Please provide SHOW CREATE TABLE. And tell up what percentage of gender is 'F'. — Rick James
– Rick James, Commented Apr 5, 2021 at 6:27
This 'composite' index would be the optimal index for that query: INDEX(gender, emp_no). — Rick James
– Rick James, Commented Apr 5, 2021 at 6:29
For what it's worth, literally hundreds of programmer years have gone into optimizing the query planning function (the stuff that decides what indexes to use) in MySQL. And those are highly skilled programmers. You're asking for a way to prove they missed a particular use case -- a common use case. That's a big ask, because (a) they probably didn't miss it entirely, and (b) things get better in every release of the database server. Proving a negative about something as complex as a query planner is difficult unless you read its source code. — O. Jones
– O. Jones, Commented Apr 6, 2021 at 16:08

Bill Karwin · Accepted Answer · 2021-04-05 03:16:03Z

3

Using an index on boolean columns is just as effective as an index on columns of any other data type.

If you hear someone claim that an index on a boolean is not effective, they're making the assumption that the true and false values each occur on close to 50% of the rows.

In fact, if you have a column that is 90% true and 10% false, then searching an index for the rows with false will benefit from the index. But if you search the index for rows with true, then reading the index is needless overhead. You would have been better off just doing a table-scan.

This works the same way for any other data type. For example, if you search for some integer value (or range of values) that matches more than 20% of the rows in the table, the optimizer is likely to skip the index and just do a table-scan. The extra work it would take to read the index, dereference the pointers to rows, and then read the rows, is considered more costly than just scanning all the rows in the table and throwing out those that don't match your condition.

If you search an integer column that has only two distinct values, and each of the values are found on about 50% of the rows in table, this is equivalent to searching for booleans where true and false occur in about 50% of the table, respectively. The index wouldn't be very selective in either case, so it's likely to be skipped by the optimizer. The same is true for varchar or datetime, or any other indexed column type.

I'm not providing a specific query as you requested, because the query you showed already could demonstrate it. Whether the index is used depends on the data and how selective the values you are searching for in that index.

You should analyze the query with EXPLAIN and compare the optimizer's choice and how many rows it estimates it will read.

The 20% figure is not an official threshold. It's not documented, it's just something I have observed. It might vary based on what data types you are searching, or the implementation might change in some other version of the software.

edited Apr 5, 2021 at 3:16

answered Apr 5, 2021 at 3:08

Bill Karwin

567k87 gold badges709 silver badges869 bronze badges

Sign up to request clarification or add additional context in comments.

10 Comments

Tunde Over a year ago

Thanks for your response, I get your point, it's quite clear but PLEASE can you share any query that can show any limitation of any index type on this popular employees database? Thanks for your time.

Rick James Over a year ago

@Tunde - We need to know the distribution of the data in each column of your table.

Tunde Over a year ago

@RickJames the database that I'm using can be found here - dev.mysql.com/doc/employee/en/sakila-structure.html The structure is there and it has 300,000 records. Thanks a lot.

Solarflare Over a year ago

@Tunde You are looking for a query that proves your claim is true, but since it isn't, noone can give you one. As Bill tried to explain to you, it depends on your data distribution if an index is used for a specific query. Try e.g. explain select * from customer where last_name >= 'A' and last_name <= 'M' and explain select * from customer where last_name >= 'A' and last_name <= 'B' on sakila. One uses the index, one doesn't, depending on how many rows fit your search condition. The same works for indexes on boolean, just find (or add) a column that isn't 50/50.

Bill Karwin Over a year ago

@O.Jones You say "of course" as though anyone would know that this is how indexes work! :-) In fact not every SQL implementation supports covering indexes.

|

Collectives™ on Stack Overflow

Query to show the limitations of B-Tree Index in mysql database

1 Answer 1

10 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

10 Comments

Your Answer

Sign up or log in

Post as a guest

Related