
I need to update one column in a table with a huge set of data in a PostgreSQL DB.

Since the job might run continuously for 1 or 2 days due to the large data set, I need to do this batch-wise and commit each batch separately, so that I can track progress, log any batches that fail, and re-run them manually later by providing the failed offset and limit.

One method I tried is the following PL/pgSQL block, which failed because row_number() cannot be used in a WHERE clause.

DO LANGUAGE plpgsql $$
DECLARE

    row_count_  integer;

    offset_     integer := 0;
    batch_size_ integer := 100000;
    limit_      integer;

    total_rows_ integer;

BEGIN

    -- total_rows_ was never assigned, so the loop condition was NULL
    -- and the loop never ran; count the rows up front instead.
    SELECT count(*) INTO total_rows_ FROM table1;

    WHILE offset_ < total_rows_ LOOP
        limit_ := offset_ + batch_size_;

        -- This is the part that fails: window functions such as
        -- row_number() are not allowed in a WHERE clause.
        UPDATE table1
            SET column1 = 'Value'
            WHERE row_number() OVER () >= offset_
              AND row_number() OVER () < limit_;
        GET DIAGNOSTICS row_count_ = row_count;
        RAISE INFO '% rows updated from % to %', row_count_, offset_, limit_;

        offset_ := offset_ + batch_size_;
    END LOOP;

EXCEPTION WHEN OTHERS THEN

    RAISE NOTICE 'Transaction is rolling back, % : %', SQLSTATE, SQLERRM;
    -- Also not allowed: ROLLBACK cannot be issued inside an exception
    -- handler of a DO block; the implicit subtransaction rolls back anyway.
    ROLLBACK;

END $$;

I'm even OK with doing this using a Python script, but I need the fastest way possible. I went through many articles that use a SELECT subquery, which in my opinion is too expensive due to the join.
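
For reference, the pattern those articles typically show selects one batch of keys in a subquery and updates only those rows. A sketch of that pattern, assuming table1 has a primary key named id (hypothetical) and that only rows not already holding the value should be touched:

UPDATE table1
SET column1 = 'Value'
WHERE id IN (SELECT id
             FROM table1
             WHERE column1 IS DISTINCT FROM 'Value'
             ORDER BY id
             LIMIT 100000);  -- repeat until zero rows are affected

This is the subquery cost referred to above: every batch re-scans the table for candidate rows.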

Could someone please help me with a better way to achieve this?

  • If every row is updated with the same value, you could just add the column with that as the default. This will be nearly instant in v11 or up. (And if you are before 11, upgrade to at least 11.) Commented Oct 28, 2021 at 12:41
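
A sketch of what that comment suggests, assuming the target column can be created fresh (column1 stands in for it here) and every row gets the same constant:

-- In PostgreSQL 11+, adding a column with a non-volatile DEFAULT is a
-- metadata-only change: the default is stored once in the catalog rather
-- than written into every row, so it is nearly instant on any table size.
ALTER TABLE table1 ADD COLUMN column1 text NOT NULL DEFAULT 'Value';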

1 Answer


If the activity lasts for several days, doing the UPDATE in batches makes sense. You may want to run an explicit VACUUM on the table between batches to avoid table bloat.

About your core problem, I would say that the simplest solution would be to batch by primary key values, that is, run statements like:

UPDATE tab
SET col = newval
WHERE id <= 100000
  AND /* additional criteria*/;

VACUUM tab;

UPDATE tab
SET col = newval
WHERE id > 100000 AND id <= 200000
  AND /* additional criteria*/;

...

Keep repeating that until you reach the maximum id.
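
If you would rather script that than write each range by hand, one option (a sketch, assuming an integer primary key id and psql as the client) is to generate the statements and let psql's \gexec execute them in order. Each generated statement runs in its own transaction, so a failed batch can be re-run later by its id range:

SELECT format('UPDATE tab SET col = newval WHERE id > %s AND id <= %s',
              lo, lo + 100000),
       'VACUUM tab'  -- generated as a separate statement, since VACUUM
                     -- cannot run inside a transaction block
FROM generate_series(0, (SELECT max(id) FROM tab), 100000) AS lo
\gexec

psql prints each statement's command tag (e.g. UPDATE 100000), which doubles as the per-batch progress log.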


2 Comments

Is it possible to update the batches (non-overlapping) concurrently?
@leo.b. That is possible, but then you will get bloat. Serializing the batches is deliberate here: see the VACUUM between the steps.
