
I have an idempotent background-processing task that takes a row of information, does some cleanup, and inserts it into a database. My problem is that the same information may be processed more than once.

To deal with this, I created a key (hashed) from information I have about each row of data and created a unique constraint on an index to prevent duplicates.

The problem: I check if the data already exists in the DB by doing:

SELECT key FROM items WHERE key IN (key, key, key, key);

I found the following query to be a bit faster, but I still see some slow responses:

SELECT key FROM items WHERE key = ANY (VALUES (key), (key));

I then do an intersection of the keys returned and the keys I expect and only process the data that does not already exist.

This worked well until the table passed 100 million rows. I can be checking for 100+ keys at a time, which causes a fair amount of IO, scanning and retrieving each row.

My question: Is there a more efficient way to check for existence using the unique constraint and the index? Perhaps something that does not actually go to each row?

Or, is there a different approach that might work? Would simply attempting to insert and catching the unique constraint violation actually be faster?
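For reference, a sketch of what the insert-and-ignore variant could look like. `ON CONFLICT` requires PostgreSQL 9.5+; on older versions the same effect needs a plpgsql exception handler or a per-row savepoint. The values shown are placeholders:

```sql
-- Insert rows, silently skipping any whose key already exists.
-- Relies on the existing unique index on key; PostgreSQL 9.5+ only.
INSERT INTO items (created_at, updated_at, key, item_name)
VALUES (now(), now(), 'the_hash', 'example item')
ON CONFLICT (key) DO NOTHING;
```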

Simplified table definition:

         Column         |            Type             |                     Modifiers                      | Storage  | Description
------------------------+-----------------------------+----------------------------------------------------+----------+-------------
 id                     | integer                     | not null default nextval('items_id_seq'::regclass) | plain    |
 created_at             | timestamp without time zone | not null                                           | plain    |
 updated_at             | timestamp without time zone | not null                                           | plain    |
 key                    | character varying(255)      |                                                    | extended |
 item_attributes        | hstore                      |                                                    | extended |
 item_name              | character varying(255)      |                                                    | plain    |
Indexes:
    "items_pkey" PRIMARY KEY, btree (id)
    "index_items_on_key" UNIQUE, btree (key)

And a query plan:

Nested Loop  (cost=0.10..108.25 rows=25 width=41) (actual time=0.315..2.169 rows=25 loops=1)
   ->  HashAggregate  (cost=0.10..0.17 rows=25 width=32) (actual time=0.071..0.097 rows=25 loops=1)
         ->  Values Scan on "*VALUES*"  (cost=0.00..0.09 rows=25 width=32) (actual time=0.009..0.033 rows=25 loops=1)
   ->  Index Scan using index_items_on_key on items  (cost=0.00..4.32 rows=1 width=41) (actual time=0.076..0.077 rows=1 loops=25)
         Index Cond: ((key)::text = "*VALUES*".column1)
 Total runtime: 2.406 ms
3 Comments
  • Include your table definition including indexes and ideally the explain plan for the query you are executing. Commented Mar 11, 2014 at 2:29
  • Attempting the insert and ignoring constraint violations will probably be cleaner. It probably won't be much faster, because the IO needed to check the constraint shadows the IO needed to do the insert, and so has to be done either way. Commented Mar 11, 2014 at 2:33
  • @JustKim I added a table definition with the indexes. Commented Mar 11, 2014 at 7:06

1 Answer


You don't say where the data comes from or how it is processed. This is the generic approach:

with to_be_inserted (id, key) as (
    values (1, 'the_hash'), (2, 'another_hash')
)
insert into items (id, key)
select tbi.id, f(tbi.key)  -- f() stands for whatever processing produces the final value
from to_be_inserted tbi
where not exists (
    select 1
    from items i
    where i.key = tbi.key
);

There is potential for a significant performance gain if you store the hash as bytea instead of text: it is half the size, which also halves the index. And use the smaller md5 hash.
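A sketch of that change, with hypothetical column and index names; `decode(md5(...), 'hex')` turns the 32-character hex digest into a 16-byte value:

```sql
-- hypothetical: store the key as a 16-byte md5 digest in a bytea column
alter table items add column key_md5 bytea;
create unique index index_items_on_key_md5 on items (key_md5);

-- compute the digest from whatever source fields currently feed the hash
insert into items (created_at, updated_at, key_md5)
values (now(), now(), decode(md5('source fields for this row'), 'hex'));
```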

If the processing can't be done in SQL, this key seek might be faster:

with might_be_inserted (key) as (
    values ('hash1'), ('hash2')
)
select key
from might_be_inserted mbi
where not exists (
    select 1
    from items
    where key = mbi.key
);

3 Comments

Thanks for the idea. In this case it does not really matter where the data comes from or how it is processed, other than to say I'd rather not process data that has already been inserted into the DB, i.e. where the key exists in the unique index. There can be some heavy lifting per item, so not doing the processing again is my goal. I believe the ideal situation would be an index-only lookup of the keys that exist, and then only processing the data whose keys do not already exist.
@Ross That is what my code does. It will only process if the key does not exist. I changed it to select a processing function to make it explicit. And that is why I asked how the data is processed. It does matter where it comes from and how it is processed; only with that information is it possible to build specific code around it.
Ah, I see. In this case, the data will be preprocessed in Ruby to determine what specifically to insert into the database. I only need to do that preprocessing if the key does not already exist in the database. So I would not have the to_be_inserted data unless the key does not exist and I perform the preprocessing.
