I have two tables: old_data and new_data.

Both tables have columns: ID, date, value

I want to delete any rows in "old_data" which are not in "new_data", but only between selected dates.

This works in psql:

DELETE FROM old_data
WHERE (id, date) NOT IN (SELECT id, date FROM new_data) AND
    id = my_id  AND  date >= 'my_start_date'  AND  date <= 'my_end_date';

The start/end dates differ for each id, so I have to run the DELETE separately for each distinct id. There are about 1000 distinct ids in "new_data".

The problem is that it is very slow: it takes an hour when "old_data" has 15 million rows and "new_data" has 100,000 rows.

Is there a more efficient way to do this?

  • Show the complete table definitions, including all constraints, checks and indexes, for both tables. Commented May 27, 2015 at 2:37
  • Put the WHERE constraints into the subquery, and skip selecting date since you won't need it. That should help at least partially for a query with this structure. Commented May 27, 2015 at 2:45
  • @vol7ron: why would it make it slower? Do you take into account that without indexes it's a full scan over 100k x 15M rows (1.5e12 rows) Commented May 27, 2015 at 2:59
  • @zerkms that was a mistype, it should have read, it will be slower without indexes, or faster with — I'll delete the comment anyhow. The absence of an index will only make deletions/insertions faster when a where clause is not involved. You've already pointed out that he should list the table definition, which should bring some clarity. I'm also curious if he has the hardware to support the operations. Commented May 27, 2015 at 3:02
  • Please show the output of EXPLAIN your_query. Commented May 27, 2015 at 4:00

2 Answers

Create these indexes before running the query:

create index old_data_id_index
on old_data
using btree (id);

create index old_data_date_index
on old_data
using btree (date);

create index new_data_id_index
on new_data
using btree (id);

create index new_data_date_index
on new_data
using btree (date);
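
Since the DELETE looks rows up by the (id, date) pair, a composite index on each table may serve it better than separate single-column indexes. The following is only a sketch: the index names are made up for illustration, and it is worth confirming with EXPLAIN that the planner actually uses them.

```sql
-- Hypothetical index names; the composite key (id, date) matches
-- the pair the DELETE compares between the two tables.
CREATE INDEX old_data_id_date_index ON old_data USING btree (id, date);
CREATE INDEX new_data_id_date_index ON new_data USING btree (id, date);

-- Refresh planner statistics after building the indexes.
ANALYZE old_data;
ANALYZE new_data;
```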

You can try:

delete from old_data removed
using
    (select od.id, od.date
    from old_data od
    left join new_data nd on nd.id=od.id and nd.date=od.date
    where nd.id is null) as to_remove
where to_remove.id=removed.id and to_remove.date=removed.date and
-- rest of your conditions:
removed.id = my_id  AND  removed.date >= 'my_start_date'  AND  removed.date <= 'my_end_date';

This should avoid scanning the new_data table multiple times.
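
Another option, sketched here under assumptions, is to run a single DELETE for all ids instead of one per id. It assumes the per-id date windows can be loaded into a helper table; id_ranges and its columns are hypothetical names, and the column types are guesses since the table definitions were not posted. It also uses NOT EXISTS rather than NOT IN, which is not affected by NULLs in the subquery and which PostgreSQL can usually plan as an anti-join.

```sql
-- Hypothetical helper table: one row per id with its date window.
-- Column types are assumptions; match them to old_data.
CREATE TEMP TABLE id_ranges (
    id         integer PRIMARY KEY,
    start_date date NOT NULL,
    end_date   date NOT NULL
);

-- Fill id_ranges with each id's start/end dates before running the DELETE.

-- One statement for all ids: remove old_data rows inside each id's window
-- that have no matching (id, date) row in new_data.
DELETE FROM old_data od
USING id_ranges r
WHERE od.id = r.id
  AND od.date >= r.start_date
  AND od.date <= r.end_date
  AND NOT EXISTS (
        SELECT 1
        FROM new_data nd
        WHERE nd.id = od.id
          AND nd.date = od.date
      );
```

Running it inside a transaction first, and checking the reported row count before committing, is a cheap way to verify it deletes what the per-id version would.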
