
I have a table with ~250 columns and 10m rows in it. I am selecting 3 columns with the where clause on an indexed column with an IN query. The number of ids in the IN clause is 2500 and the output is limited by 1000 rows, here's the rough query:

select col1, col2, col3 from table1 where col4 in (1, 2, 3, 4, etc) limit 1000;

This query takes much longer than I expected, ~1s. On an indexed integer column with only 2500 items to match, it seems like this should go faster? Maybe my assumption there is incorrect. Here is the explain:

http://explain.depesz.com/s/HpL9

I did not paste all 2500 ids into the EXPLAIN just for simplicity so ignore the fact that there are only 3 in that. Anything I am missing here?

  • 1
    I recommend storing the values in a temporary table with one column - the primary key - and then using INNER JOIN or WHERE EXISTS. If you have an index on table1's col1, col2, col3 and col4, a table seek/scan will not be needed thanks to the covering index. Commented May 28, 2015 at 3:58
  • 1
    From where are you getting the 2500 values? Is it from a query? If so, please show it. Commented May 28, 2015 at 4:33
  • @zedfoxus has a valid point. Just one note: the index should be on col4 first, then other columns. Commented May 28, 2015 at 5:20
  • @VladimirBaranov there is only one index and it is on col4, the rest are just being selected. Commented May 28, 2015 at 6:25
  • @Bohemian - The 2500 ids are in memory at this point in our code, although they originated from another table on a hash field; that's somewhat irrelevant at this point, right? Commented May 28, 2015 at 6:27

2 Answers


It looks like you're pushing the limits of select x where y IN (...) type queries. You basically have a very large table with a large set of conditions to search on.

I'm guessing you have a B+Tree index, and this kind of query is inefficient for it. These types of indexes do well with general-purpose range matching and inserts, while performing worse on single-value lookups. Your query is doing ~2500 single-value lookups on this index.

You have a few options to deal with this...

  • Use Hash indexes (these perform much better on single value lookups)
  • Help out the query optimizer by adding a few range-based constraints: take the 2500 values, find the min and max, and append that to the query, i.e. append where x_id > min_val and x_id < max_val
  • Run the query in parallel if you have multiple db backends: break the 2500 constraints into, say, 100 groups, run all the queries at once, and collect the results. It will work better if you group the constraints by value

The first option is certainly easier, but it will come at a price of making your inserts/deletes slower.
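As a sketch of the first option (assuming Postgres and the table/column names from the question), a hash index on col4 would be created like this. Note that before Postgres 10, hash indexes were not WAL-logged and were generally discouraged, so test on your version:

```sql
-- Replace (or supplement) the existing B-Tree index on col4 with a hash
-- index; hash indexes only support equality lookups, which is exactly
-- what 2500 "col4 = ?" probes need.
CREATE INDEX table1_col4_hash ON table1 USING hash (col4);
```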

The second does not suffer from this, and you don't even need to limit it to one min max group. You could create N groups with N min and max constraints. Test it out with different groupings and see what works.
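For example, with two min/max groups the rewritten query might look like the following (the ranges and ids are purely illustrative). The range predicates give the planner bounds it can use for index range scans instead of 2500 independent point lookups:

```sql
-- Hypothetical: the 2500 ids happen to cluster into two ranges,
-- 100..8000 and 500000..520000.
select col1, col2, col3
from table1
where (col4 between 100    and 8000   and col4 in (/* ids in group 1 */))
   or (col4 between 500000 and 520000 and col4 in (/* ids in group 2 */))
limit 1000;
```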

The last option is by far the best performing, of course.


2 Comments

So unfortunately the hash index didn't help as much as I had hoped but I will give the range constraints a shot. Might be a bit difficult because the id ranges are really large and randomly distributed for most of the ids we'll be querying on.
In this case perhaps you should look at what @Bohemian hinted at. Depending on how you get these 2500 values, the solution follows from that. If you have a column or set of columns that groups all these 2500 ids, your query will be much simpler and faster. If no such column(s) exist, you can take advantage of pre-computation: have a process that calculates such a value at insertion time, or a cron script that runs through the table, does the computation, and stores the result in the table.

Your query is equivalent to:

select col1, col2, col3 
from table1 
where 
    col4 = 1
    OR col4 = 2
    OR col4 = 3
    OR col4 = 4
    ... repeat 2500 times ...

which is equivalent to:

select col1, col2, col3 
from table1 
where col4 = 1

UNION

select col1, col2, col3 
from table1 
where col4 = 2

UNION

select col1, col2, col3 
from table1 
where col4 = 3

... repeat 2500 times ...

Basically, it means that the index on a table with 10M rows is searched 2500 times. On top of that, if col4 is not unique, then each search is a range scan that may return many rows. Then 2500 intermediate result sets are combined.

The server doesn't know that the 2500 IDs listed in the IN clause do not repeat. It doesn't know that they are already sorted. So, it has little choice but to do 2500 independent index seeks, remember the intermediate results somewhere (such as an implicit temp table), and then combine them.

If you had a separate table table_with_ids with the list of 2500 IDs, which had a primary or unique key on ID, then the server would know that they are unique and they are sorted.

Your query would be something like this:

select col1, col2, col3 
from 
    table_with_ids 
    inner join table1 on table_with_ids.id = table1.col4

The server may be able to perform such a join more efficiently.
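In Postgres the same idea also works without a separate table, by joining against a VALUES list. This is a sketch; whether the planner actually handles it better than IN for your data is worth checking with EXPLAIN:

```sql
-- The derived table ids(id) plays the role of table_with_ids.
select col1, col2, col3
from (values (1), (2), (3) /* ... all 2500 ids ... */) as ids(id)
join table1 on table1.col4 = ids.id
limit 1000;
```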

I would test the performance using pre-populated (temp) table of 2500 IDs and compare it to the original. If the difference is significant, you can investigate further.
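A minimal version of that test, assuming a session-scoped temp table, could look like this:

```sql
-- Primary key tells the planner the ids are unique (and gives sorted access).
create temporary table table_with_ids (id integer primary key);
insert into table_with_ids values (1), (2), (3); -- ... all 2500 ids ...
analyze table_with_ids;  -- give the planner accurate row-count statistics

select col1, col2, col3
from table_with_ids
join table1 on table1.col4 = table_with_ids.id
limit 1000;
```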

Actually, I'd start with running this simple query:

select col1, col2, col3 
from table1 
where 
    col4 = 1

and measure the time it takes to run. You can't get better than this. So, you'll have a lower bound and a clear indication of what you can and can't achieve. Then, maybe change it to where col4 in (1,2) and see how things change.
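To measure it, EXPLAIN ANALYZE reports actual execution timings rather than estimates, and the BUFFERS option additionally shows whether the rows came from cache or disk:

```sql
explain (analyze, buffers)
select col1, col2, col3
from table1
where col4 = 1;
```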

One more way to somewhat improve performance is to have an index not just on col4, but on col4, col1, col2, col3. It would still be one index, but on several columns. (In SQL Server I would have columns col1, col2, col3 "included" in the index on col4, rather than part of the index itself, to make it smaller, but I don't think Postgres has such a feature.) In this case the server should be able to retrieve all the data it needs from the index itself, without doing additional look-ups in the main table. This makes it a so-called "covering" index.
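A sketch of such a multicolumn index (the index name is illustrative; note that Postgres 11 and later did add an INCLUDE clause for exactly this kind of non-key payload column):

```sql
-- col4 leads so the index still satisfies the WHERE clause; col1..col3
-- ride along so the query can be answered from the index alone
-- (an index-only scan, provided the visibility map is up to date).
create index table1_col4_covering on table1 (col4, col1, col2, col3);

-- On Postgres 11+, the non-searched columns can instead be INCLUDEd,
-- keeping the index keys themselves smaller:
-- create index table1_col4_incl on table1 (col4) include (col1, col2, col3);
```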

2 Comments

So I tried putting the ids in a different table with the unique clause and then joining for the query and that didn't help. I'll consider next looking into what @Bohemian said and see if there's a better way to get those 2500 values into this query so the db can optimize the query more internally. Any other thoughts? Is this just something that shouldn't be done in SQL?
A SQL server is usually good at dealing with a lot of data, and good at searching and filtering it. In your case it reads 10M rows, so it will take some time. In essence, there are two options: 1) scan all 10M rows once and filter out the 2500 needed rows during the scan; 2) do 2500 lookups by the index, which may be faster or may be slower.
