
I have a couple of tables that look like this:

CREATE TABLE Entities (
   id INT NOT NULL AUTO_INCREMENT,
   name VARCHAR(45) NOT NULL,
   client_id INT NOT NULL,
   display_name VARCHAR(45),
   PRIMARY KEY (id)
)

CREATE TABLE Statuses (
   id INT NOT NULL AUTO_INCREMENT,
   name VARCHAR(45) NOT NULL,
   PRIMARY KEY (id)
)

CREATE TABLE EventTypes (
   id INT NOT NULL AUTO_INCREMENT,
   name VARCHAR(45) NOT NULL,
   PRIMARY KEY (id)
)

CREATE TABLE Events (
   id INT NOT NULL AUTO_INCREMENT,
   entity_id INT NOT NULL,
   date DATE NOT NULL,
   event_type_id INT NOT NULL,
   status_id INT NOT NULL,
   PRIMARY KEY (id)
)

Events is large: > 100,000,000 rows.

Entities, Statuses, and EventTypes are small: < 300 rows apiece.

I have several indexes on Events, but the ones that come into play are:

idx_events_date_ent_status_type (date, entity_id, status_id, event_type_id)
idx_events_ent_status_type (entity_id, status_id, event_type_id)
idx_events_date_ent_type (date, entity_id, event_type_id)

I have a large, complicated query, but I'm getting the same slow results with a simpler one like the one below (note: in the real queries, I don't use evt.*):

SELECT evt.*, ent.name AS ent_name, s.name AS stat_name, et.name AS type_name
FROM `Events` evt
   JOIN `Entities` ent ON evt.entity_id = ent.id
   JOIN `EventTypes` et ON evt.event_type_id = et.id
   JOIN `Statuses` s ON evt.status_id = s.id
WHERE
   evt.date BETWEEN @start_date AND @end_date AND
   evt.entity_id IN ( 19 ) AND -- this in clause is built by code
   evt.event_type_id = @type_id

For some reason, MySQL keeps choosing the index that doesn't cover Events.date, and the query takes 15 seconds or more to return a couple thousand rows. If I change the query to:

SELECT evt.*, ent.name AS ent_name, s.name AS stat_name, et.name AS type_name
FROM `Events` evt FORCE INDEX (idx_events_date_ent_status_type)
   JOIN `Entities` ent ON evt.entity_id = ent.id
   JOIN `EventTypes` et ON evt.event_type_id = et.id
   JOIN `Statuses` s ON evt.status_id = s.id
WHERE
   evt.date BETWEEN @start_date AND @end_date AND
   evt.entity_id IN ( 19 ) AND -- this in clause is built by code
   evt.event_type_id = @type_id

The query takes 0.014 seconds.

Since this query is built by code, I would much rather not force the index, but mostly I want to know why it chooses one index over the other. Is it because of the joins?

To give some stats: there are ~2,500 distinct dates and ~200 entities in the Events table. So I suppose that might be why it chooses the index with all of the low-cardinality columns.

Do you think it would help to add date to the end of idx_events_ent_status_type? Since this is a large table, it takes a long time to add indexes.
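
If it helps, the change I have in mind would look something like this (just a sketch; as far as I know MySQL can't append a column to an existing index, so it's a drop-and-recreate, and the new index name is a placeholder):

ALTER TABLE Events
   DROP INDEX idx_events_ent_status_type,
   ADD INDEX idx_events_ent_status_type_date (entity_id, status_id, event_type_id, date),
   ALGORITHM=INPLACE, LOCK=NONE; -- InnoDB can usually build secondary indexes online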

I tried adding an additional index, ix_events_ent_date_status_et (entity_id, date, status_id, event_type_id), and it actually made the queries slower.

I will experiment a bit more, but I feel like I'm not sure how the optimizer makes its decisions.

Additional Info:

I tried removing the join to the Statuses table, and MySQL switches to idx_events_date_ent_type; the query runs in 0.045 seconds.

I can't wrap my head around why removing a join to a table that is not part of the filter impacts the choice of index.

  • Please do "experiment a bit more", or start reading the chapter on Optimization, or find any of the answers given on Stack Overflow that have to do with this subject. Commented Dec 30, 2022 at 16:03
  • "For some reason, MySQL keeps choosing the index that doesn't cover Events.date" => how many records are there between start_date and end_date? If that is "a lot", then MySQL will decide that the index is not to be used. When selecting just one day (start_date = end_date), or a couple of days, MySQL might decide to use the index after all. Commented Dec 30, 2022 at 16:10
  • Also, status_id is in the index which you force to be used, but no filtering is done on that field. This is also a reason for NOT selecting that index. Commented Dec 30, 2022 at 16:13
  • @Luuk - I have been experimenting and reading about index optimization. The number of records between start and end date is much smaller than the total number of events, especially when taken with entity_id. Note that status_id is in both indexes. I do have some additional information, though: it appears that the join with the Statuses table is what causes the index without date to be chosen. This is what confuses me. Since I'm not filtering by status_id, why wouldn't the optimizer pick an index that is more covering (date, entity_id, status_id, event_type_id)? Commented Dec 30, 2022 at 16:22
  • The field status_id has been selected (evt.*) and needs to be fetched anyway; there's no real reason for using an index on that field. Commented Dec 30, 2022 at 16:33

3 Answers


I would add this index:

ALTER TABLE Events ADD INDEX (event_type_id, entity_id, date);

The order of columns is important. Put all column(s) used in equality conditions first. This is event_type_id in this case.

The optimizer can use multiple columns to optimize equalities, if the columns are left-most and consecutive.

Then the optimizer can use one more column to optimize a range condition. A range condition is anything other than = or IS NULL. So range conditions include >, !=, BETWEEN, IN(), LIKE (with no leading wildcard), IS NOT NULL, and so on.

The condition on entity_id is also an equality condition if the IN() list has one element. MySQL's optimizer can treat a list of one value as an equality condition. But if the list has more than one value, it becomes a range condition. So if the example you showed of IN (19) is typical, then all three columns of the index will be used for filtering.

It's still worth putting date in the index: with index condition pushdown (ICP), the server can at least tell the InnoDB storage engine to filter rows on date before returning them (see https://dev.mysql.com/doc/refman/8.0/en/index-condition-pushdown-optimization.html). It's not quite as good as a real index lookup, but it's worthwhile.
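
As a quick check (a sketch only; the literal values are made up), ICP shows up as "Using index condition" in the Extra column of EXPLAIN:

EXPLAIN
SELECT evt.*
FROM Events evt
WHERE evt.event_type_id = 7                           -- equality on the first index column
  AND evt.entity_id IN (19, 20)                       -- a multi-value IN() becomes a range
  AND evt.date BETWEEN '2022-01-01' AND '2022-01-31'; -- candidate for ICP when not used for the lookup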

I would also suggest creating a smaller table to test with. Doing experiments on a 100 million row table is time-consuming. But you do need a table with a non-trivial amount of data, because if you test on an empty table, the optimizer behaves differently.
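
Something along these lines would do (the date range is arbitrary; CREATE TABLE ... LIKE copies the indexes as well):

CREATE TABLE Events_sample LIKE Events;

INSERT INTO Events_sample
SELECT * FROM Events
WHERE date BETWEEN '2022-01-01' AND '2022-03-31';

ANALYZE TABLE Events_sample; -- refresh index statistics so the optimizer sees realistic numbers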



Rearrange your indexes to have columns in this order:

  1. Any column(s) that will be tested with = or IS NULL.
  2. Column(s) tested with IN -- If there is a single value, this will be further optimized to = for you.
  3. One "range" column, such as your date.

Note that no index column after the "range" column will be used for filtering by the WHERE clause.

(There are exceptions, but most are not relevant here.)
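
Applied to the query in the question, that recipe gives something like this (the index name is illustrative):

ALTER TABLE Events ADD INDEX idx_type_ent_date (event_type_id, entity_id, date);

event_type_id is tested with =, entity_id with IN, and date is the one "range" column.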

  • More discussion: Index Cookbook

  • Since the tables smell like Data Warehousing, I suggest looking into Summary Tables. In some cases, long queries on Events can be moved to the summary table(s), where they run much faster. Also, this may eliminate the need for some (or maybe even all) secondary indexes.

  • Since Events is rather large, I suggest using smaller numbers where practical. INT takes 4 bytes. Speed will improve slightly if you shrink those where appropriate (see the sketch after this list).

  • When you have INDEX(a,b,c), that index will handle cases that need INDEX(a,b) and INDEX(a). Keep the longer one. (Sometimes the Optimizer picks the shorter index 'erroneously'.)
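
A sketch of the datatype shrink for one of the lookup tables (assuming no foreign key constraints complicate the change; note that rewriting status_id on a 100M-row table is itself a slow operation):

ALTER TABLE Statuses MODIFY id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT; -- 2 bytes instead of 4
ALTER TABLE Events MODIFY status_id SMALLINT UNSIGNED NOT NULL;           -- referencing column must match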

Comments

What I don't understand, given what you are saying, is why the first index, with date in the first column, improves the speed of the query by two orders of magnitude. It seems like the range column is way more important.
Summary tables won't work here, as I'm returning individual records.
@bpeikes - By first filtering on those two '=' tests, there is even less for the "range" to do.
But why is the index which is date, entity_id, … 100 times faster? The range is first in that index, so the other columns in the index can't be used, except to filter rows.
Was the date range quite narrow? If so, it would be quite good (100x?) by itself. But if the date range is wider, the gain is likely to be less than 100x, and my suggestion is likely to be better than it.

To most effectively use a composite index on multiple values of two different fields, you need to specify the values with joins instead of simple WHERE conditions. So, assuming you are selecting dates from 2022-12-01 to 2022-12-03 and entity_id IN (1,2,3), do:

select ...
from (select date('2022-12-01') date union all select date('2022-12-02') union all select date('2022-12-03')) dates
join Entities on Entities.id in (1,2,3)
join Events on Events.entity_id=Entities.id and Events.date=dates.date

If you pre-create a dates table with all dates from 0000-01-01 to 9999-12-31, then you can do:

select ...
from dates
join Entities on Entities.id in (1,2,3)
join Events on Events.entity_id=Entities.id and Events.date=dates.date
where dates.date between @start_date and @end_date
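
Populating such a dates table is a one-time job. A minimal sketch using MySQL 8.0's recursive CTEs (the 2000-2099 range is arbitrary; widen or narrow it as needed):

CREATE TABLE dates (date DATE NOT NULL PRIMARY KEY);

SET SESSION cte_max_recursion_depth = 100000; -- the default of 1000 is too low for decades of days

INSERT INTO dates (date)
WITH RECURSIVE d AS (
   SELECT DATE '2000-01-01' AS date
   UNION ALL
   SELECT date + INTERVAL 1 DAY FROM d WHERE date < '2099-12-31'
)
SELECT date FROM d;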

Comments

MariaDB lets you create a 'table' on the fly with its seq_1_to_100 type of table.
@RickJames only if that engine is enabled. I don't find it worth the trouble to use, especially for dates.
Example: SELECT '2019-01-01' + INTERVAL seq-1 DAY FROM seq_1_to_31;
Why would this "effectively" use a composite index? The documentation seems to say an index can optimize only one range. The join is the same as an IN clause, which is a range.
Right, you have to avoid using a range. Joining from tables providing the entity ids and dates makes it look up each entity id and date combination separately instead of using a range. Obviously there are cases where there will be too many combinations and that will be worse, but where it is not, it behaves reliably instead of getting worse as data for more dates or entities is added.
