
I have an idempotent background-processing task that takes a row of information, does some cleanup, and inserts it into a database. My problem is that the same information may be processed more than once.

To deal with this, I created a key (hashed) from information I have about each row of data and created a unique constraint on an index to prevent duplicates.

The problem: I check if the data already exists in the DB by doing:

SELECT key FROM items WHERE key IN (key, key, key, key);

I found the following query to be a bit faster, but I still see some slow responses:

SELECT key FROM items WHERE key = ANY (VALUES (key), (key));

I then do an intersection of the keys returned and the keys I expect and only process the data that does not already exist.

This worked well until the table passed 100 million rows. I can be checking for 100+ keys at a time, which causes a fair amount of IO, scanning and retrieving each row.

My question: Is there a more efficient way to check for existence using the unique constraint and the index? Perhaps something that does not actually go to each row?

Or, is there a different approach that might work? Would simply attempting to insert and catching the unique constraint violation actually be faster?
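For reference, a sketch of what the insert-and-ignore variant could look like. `ON CONFLICT` requires PostgreSQL 9.5+; on older versions the same effect needs a plpgsql exception handler or a per-row savepoint. The values shown are placeholders:

```sql
-- Insert rows, silently skipping any whose key already exists.
-- Relies on the existing unique index on key; PostgreSQL 9.5+ only.
INSERT INTO items (created_at, updated_at, key, item_name)
VALUES (now(), now(), 'the_hash', 'example item')
ON CONFLICT (key) DO NOTHING;
```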

Simplified table definition:

         Column         |            Type             |                     Modifiers                      | Storage  | Description
------------------------+-----------------------------+----------------------------------------------------+----------+-------------
 id                     | integer                     | not null default nextval('items_id_seq'::regclass) | plain    |
 created_at             | timestamp without time zone | not null                                           | plain    |
 updated_at             | timestamp without time zone | not null                                           | plain    |
 key                    | character varying(255)      |                                                    | extended |
 item_attributes        | hstore                      |                                                    | extended |
 item_name              | character varying(255)      |                                                    | plain    |
Indexes:
    "items_pkey" PRIMARY KEY, btree (id)
    "index_items_on_key" UNIQUE, btree (key)

And a query plan:

Nested Loop  (cost=0.10..108.25 rows=25 width=41) (actual time=0.315..2.169 rows=25 loops=1)
   ->  HashAggregate  (cost=0.10..0.17 rows=25 width=32) (actual time=0.071..0.097 rows=25 loops=1)
         ->  Values Scan on "*VALUES*"  (cost=0.00..0.09 rows=25 width=32) (actual time=0.009..0.033 rows=25 loops=1)
   ->  Index Scan using index_items_on_key on items  (cost=0.00..4.32 rows=1 width=41) (actual time=0.076..0.077 rows=1 loops=25)
         Index Cond: ((key)::text = "*VALUES*".column1)
 Total runtime: 2.406 ms
3 Comments
  • Include your table definition including indexes and ideally the explain plan for the query you are executing. Commented Mar 11, 2014 at 2:29
  • Attempting the insert and ignoring constraint violations will probably be cleaner. It probably won't be much faster, because the IO needed to check the constraint shadows the IO needed to do the insert, and so has to be done either way. Commented Mar 11, 2014 at 2:33
  • @JustKim I added a table definition with the indexes. Commented Mar 11, 2014 at 7:06

1 Answer


You don't say where the data comes from or how it is processed. This is the generic approach:

with to_be_inserted (id, key) as (
    values (1, 'the_hash'), (2, 'another_hash')
)
insert into items (id, key)
select tbi.id, f(tbi.key)  -- f() stands for whatever processing produces the final value
from to_be_inserted tbi
where not exists (
    select 1
    from items i
    where i.key = tbi.key
);

There is potential for a significant performance gain if you store the hash as bytea instead of text: it is half the size, which also halves the index. And use the smaller md5 hash.
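A sketch of that change, with hypothetical column and index names; `decode(md5(...), 'hex')` turns the 32-character hex digest into a 16-byte value:

```sql
-- hypothetical: store the key as a 16-byte md5 digest in a bytea column
alter table items add column key_md5 bytea;
create unique index index_items_on_key_md5 on items (key_md5);

-- compute the digest from whatever source fields currently feed the hash
insert into items (created_at, updated_at, key_md5)
values (now(), now(), decode(md5('source fields for this row'), 'hex'));
```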

If the processing can't be done in SQL, this key seek might be faster:

with might_be_inserted (key) as (
    values ('hash1'), ('hash2')
)
select key
from might_be_inserted mbi
where not exists (
    select 1
    from items
    where key = mbi.key
);

3 Comments

Thanks for the idea. In this case it does not really matter where the data comes from or how it is processed, other than to say I'd rather not process data that has already been inserted into the DB, i.e. where the key exists in the unique index. There can be some heavy lifting per item, so not doing the processing again is my goal. I believe the ideal situation would be an index-only lookup of the keys that exist, and then only processing the data whose keys do not already exist.
@Ross That is what my code does. It will only process if the key does not exist. I changed it to select a processing function to make it explicit. And that is why I asked how the data is processed. It does matter where it comes from and how it is processed; only with that information is it possible to build specific code around it.
Ah, I see. In this case, the data will be preprocessed in Ruby to determine what specifically to insert into the database. I only need to do that preprocessing if the key does not already exist in the database. So I would not have the to_be_inserted data unless the key does not exist and I perform the preprocessing.
