I have 61 million non-unique emails with statuses. These emails need to be deduplicated using logic based on the status.

I wrote a stored procedure, but it takes too long to run.

How can I optimize the execution time of this procedure?

CREATE OR REPLACE FUNCTION public.load_oxy_emails() RETURNS boolean AS $$

DECLARE
        row record;
        rec record;
        new_id int;
BEGIN
        FOR row IN SELECT * FROM oxy_email ORDER BY id LOOP
                SELECT * INTO rec FROM oxy_emails_clean WHERE email = row.email;
                IF rec IS NOT NULL THEN
                        IF row.status = 3 THEN
                                UPDATE oxy_emails_clean SET status = 3 WHERE id = rec.id;
                        END IF;
                ELSE
                        INSERT INTO oxy_emails_clean(id, email, status) VALUES(nextval('oxy_emails_clean_id_seq'), row.email, row.status);
                        SELECT currval('oxy_emails_clean_id_seq') INTO new_id;
                        INSERT INTO oxy_emails_clean_websites_relation(oxy_emails_clean_id, website_id) VALUES(new_id, row.website_id);
                END IF;
        END LOOP;
        RETURN true;
END;
$$
LANGUAGE 'plpgsql';

"How I can optimize execution time of this procedure?" By not using a procedure with a cursor/loop. Instead, you can use two separate SQL statements (maybe glued together by a chained CTE). Commented Feb 8, 2017 at 15:25

1 Answer

How can I optimize the execution time of this procedure?

Don't do it with a loop.

Row-by-row processing (also known as "slow-by-slow") is almost always a lot slower than a bulk change, where a single statement processes many rows in one go.

The status change can easily be done with a single statement:

update oxy_emails_clean oec
    set status = 3
from oxy_email oe
where oe.email = oec.email --<< match on email, as the loop does; the ids of the two tables are unrelated
  and oe.status = 3;

The copying of the rows can be done using a chain of CTEs:

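with to_copy as (
  select distinct on (email) * --<< one row per email, lowest id first
  from oxy_email 
  where status <> 3 --<< all those that have a different status
  order by email, id
), clean_inserted as (
  insert into oxy_emails_clean (id, email, status) 
  select nextval('oxy_emails_clean_id_seq'), email, status
  from to_copy
  returning id, email --<< email is needed to match the new ids back to to_copy
) 
insert into oxy_emails_clean_websites_relation (oxy_emails_clean_id, website_id)
select ci.id, tc.website_id
from clean_inserted ci
  join to_copy tc on tc.email = ci.email; --<< the new ids come from the sequence, so join on email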
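
For comparison, the whole loop from the question can also be written as one statement that deduplicates by email directly. This is only a minimal sketch, not a drop-in replacement. It assumes that oxy_emails_clean starts out empty, that email is never NULL, and that the intended rule is the one the loop implements: keep the row with the lowest id per email, but set the status to 3 if any of its duplicates has status 3. The CTE names first_rows and inserted are made up for illustration.

```sql
-- Sketch only: assumes oxy_emails_clean is empty and oxy_email.email is NOT NULL.
WITH first_rows AS (
    -- One row per email: the row with the lowest id wins.
    -- The window function is evaluated before DISTINCT ON, so it still
    -- sees every duplicate and can tell whether any of them has status 3.
    SELECT DISTINCT ON (email)
           email,
           CASE WHEN bool_or(status = 3) OVER (PARTITION BY email)
                THEN 3
                ELSE status
           END AS status,
           website_id
    FROM oxy_email
    ORDER BY email, id
), inserted AS (
    -- Insert the deduplicated rows and hand back the generated ids.
    INSERT INTO oxy_emails_clean (id, email, status)
    SELECT nextval('oxy_emails_clean_id_seq'), email, status
    FROM first_rows
    RETURNING id, email
)
-- Link each new clean row to the website of its first occurrence.
INSERT INTO oxy_emails_clean_websites_relation (oxy_emails_clean_id, website_id)
SELECT i.id, fr.website_id
FROM inserted i
  JOIN first_rows fr ON fr.email = i.email;
```

With 61 million rows the DISTINCT ON step needs a large sort, so it is probably worth looking at the plan with EXPLAIN first. An index on oxy_email (email, id) or a larger work_mem may help, but that depends on the data and the server.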