I need to update one column of a table with a huge amount of data in a PostgreSQL database.
Since the job might run continuously for one or two days due to the size of the data set, I need to do this batch-wise and commit each batch separately, so that I can keep track of the progress, log any batches that fail, and re-run them manually later by providing the failed offset and limit.
One method I tried is the following PL/pgSQL block, which failed because row_number() cannot be used in a WHERE clause.
DO LANGUAGE plpgsql $$
DECLARE
    row_count_  integer;
    offset_     integer := 0;
    batch_size_ integer := 100000;
    limit_      integer;
    total_rows_ integer;
BEGIN
    -- total number of rows to process
    SELECT count(*) INTO total_rows_ FROM table1;

    WHILE offset_ < total_rows_ LOOP
        limit_ := offset_ + batch_size_;

        -- this is the part that fails: row_number() is not allowed in a WHERE clause
        UPDATE table1
        SET column1 = 'Value'
        WHERE row_number() OVER () >= offset_
          AND row_number() OVER () < limit_;

        GET DIAGNOSTICS row_count_ = ROW_COUNT;
        RAISE INFO '% rows updated from % to %', row_count_, offset_, limit_;

        offset_ := offset_ + batch_size_;
    END LOOP;
EXCEPTION WHEN OTHERS THEN
    RAISE NOTICE 'Transaction is rolling back, % : %', SQLSTATE, SQLERRM;
    ROLLBACK;
END $$;
I'm even OK with doing this using a Python script, but I need the fastest way possible. I went through many articles that use a SELECT subquery, which in my opinion is too expensive because of the join.
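For reference, this is roughly the shape of the Python driver I have in mind, using the subquery pattern from those articles as the per-batch UPDATE. It is only a sketch: psycopg2, the connection string, and an id primary key on table1 are my assumptions here, and the inner SELECT is exactly the part I suspect is too expensive.

# Rough sketch of the batched driver, not a final solution.
# Assumptions: psycopg2 is installed, table1 has a primary key "id",
# and the connection string below is a placeholder for the real database.
import psycopg2

BATCH_SIZE = 100_000

# Per-batch UPDATE using the subquery pattern from the articles I mentioned;
# the extra scan/join on the same table is what I suspect is too slow.
UPDATE_SQL = """
    UPDATE table1
       SET column1 = 'Value'
     WHERE id IN (
           SELECT id
             FROM table1
            ORDER BY id
           OFFSET %(offset)s
            LIMIT %(limit)s
     )
"""

def run_batches(total_rows, batch_size=BATCH_SIZE):
    failed = []  # (offset, limit) pairs to re-run manually later
    conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder DSN
    try:
        for offset in range(0, total_rows, batch_size):
            try:
                with conn.cursor() as cur:
                    cur.execute(UPDATE_SQL, {"offset": offset, "limit": batch_size})
                    conn.commit()  # commit each batch so progress survives a failure
                    print(f"{cur.rowcount} rows updated from {offset} to {offset + batch_size}")
            except Exception as exc:
                conn.rollback()  # roll back only the failed batch and keep going
                failed.append((offset, batch_size))
                print(f"batch {offset}-{offset + batch_size} failed: {exc}")
    finally:
        conn.close()
    return failed

A failed batch could then be re-run by hand with the same statement and the logged offset and limit.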
Could someone please help me with a better way to achieve this?