
I have a very large table with 100M rows in which I want to update a column with a value based on another column. An example query showing what I want to do is given below:

UPDATE mytable SET col2 = 'ABCD'
WHERE col1 is not null

This is a master DB in a live environment with multiple slaves, and I want to update it without locking the table or affecting the performance of the live environment. What would be the most effective way to do it? I'm thinking of making a procedure that updates rows in batches of 1000 or 10000 using something like LIMIT, but I'm not quite sure how to do it, as I'm not that familiar with Postgres and its pitfalls. Oh, and neither column has an index, but the table has other columns that do.

I would appreciate a sample procedure code.

Thanks.

4 Comments
  • What does col2 start out as? Are other processes also updating it at the same time? Is it really 'ABCD', or something row-specific? Are you worried about table-level locks or row-level locks? Have you already tried the simple solution and found it untenable, or are you worrying preemptively? Commented Dec 11, 2019 at 14:41
  • Are you aware that even if that update locks many rows (not the table!), those locks will not prevent SELECT or INSERT statements on that table. Commented Dec 11, 2019 at 14:49
  • I haven't tried it yet, and I'm worried preemptively because it's quite crucial for the function of the application. col2 is null, and I should only set it if it's null. Other processes won't be updating it, only inserting rows with some values. Commented Dec 11, 2019 at 16:10
  • @DashingBoy: then you have no problems. The UPDATE will put some I/O load on the system, but the locks won't bother you. Commented Dec 11, 2019 at 22:37

2 Answers


There is no update without locking, but you can strive to keep the row locks few and short.

You could simply run batches of this:

UPDATE mytable
SET col2 = 'ABCD'
FROM (SELECT id
      FROM mytable
      WHERE col1 IS NOT NULL
        AND col2 IS DISTINCT FROM 'ABCD'
      LIMIT 10000) AS part
WHERE mytable.id = part.id;

Just keep repeating that statement until it modifies fewer than 10000 rows; then you are done.
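A note on the IS DISTINCT FROM condition: unlike <>, it treats NULL as a regular value, so rows where col2 is still NULL are picked up, while rows already set to 'ABCD' are not updated twice:

SELECT NULL <> 'ABCD';                 -- NULL: a plain <> test would skip these rows
SELECT NULL IS DISTINCT FROM 'ABCD';   -- true: the row qualifies for the update
SELECT 'ABCD' IS DISTINCT FROM 'ABCD'; -- false: already done, skipped on the next pass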

Note that mass updates don't lock the table, but of course they lock the updated rows, and the more of them you update, the longer the transaction, and the greater the risk of a deadlock.

To make that performant, an index like this would help:

CREATE INDEX ON mytable (col2) WHERE col1 IS NOT NULL;
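To run the batches unattended, here is a minimal sketch of a driver loop, assuming PostgreSQL 11 or later (COMMIT is allowed inside a DO block there) and that mytable's primary key is the id column used in the query above:

DO $$
DECLARE
    batch bigint;
BEGIN
    LOOP
        UPDATE mytable
        SET col2 = 'ABCD'
        FROM (SELECT id
              FROM mytable
              WHERE col1 IS NOT NULL
                AND col2 IS DISTINCT FROM 'ABCD'
              LIMIT 10000) AS part
        WHERE mytable.id = part.id;

        GET DIAGNOSTICS batch = ROW_COUNT;  -- rows updated in this pass
        COMMIT;                             -- keep each batch in its own short transaction
        EXIT WHEN batch < 10000;            -- a partial batch means we are done
    END LOOP;
END $$;

Each pass locks at most 10000 rows and releases them at the COMMIT, so locks stay few and short.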

8 Comments

It should probably be AND col2 is distinct from 'ABCD' in the update WHERE. I think I would also add that to the index WHERE as well.
In the WHERE clause it is only good if the constant is - well - constant.
One thing I didn't understand is, will this query be enough? No loop? Should I encapsulate it inside a procedure with a loop? I think I didn't quite get the distinct from clause.
I have added usage instructions to make it clearer. col2 IS DISTINCT FROM 'ABCD' is the opposite of col2 = 'ABCD', so that no row is updated twice.
I think I didn't really elaborate. How can I make sure the batches are complete if I run it in a loop? How do I write an exit condition, and will it all be in a single transaction, or should I split each batch into a different transaction? Sorry if it's all very obvious and I'm not grasping it.

Just an off-the-wall, out-of-the-box idea. Requiring both col1 and col2 to be null to qualify precludes using an index, so perhaps building a pseudo-index might be an option. This 'index' would of course be a regular table, but it would only exist for a short period. Additionally, this relieves the lock-time worry.

create table indexer (mytable_id integer  primary key);

insert into indexer(mytable_id)
select mytable_id
  from mytable
 where col1 is null
   and col2 is null;

The above creates our 'index' that contains only the qualifying rows. Now wrap an update/delete statement into an SQL function. This function updates the main table, deletes the updated rows from the 'index', and returns the number of rows remaining.

create or replace function set_mytable_col2(rows_to_process_in integer)
returns bigint
language sql
as $$
    -- update one batch of rows and remove them from the 'index'
    with idx as
       ( update mytable
            set col2 = 'ABCD'
          where col2 is null
            and mytable_id in (select mytable_id
                                 from indexer
                                limit rows_to_process_in
                               )
         returning mytable_id
       )
    delete from indexer
     where mytable_id in (select mytable_id from idx);

    -- report how many rows remain to be processed
    select count(*) from indexer;
$$;

When the function returns 0, all rows initially selected have been processed. At this point, repeat the entire process to pick up any rows that were added or updated but that the initial selection didn't identify. That should be a small number, and the process is still available if needed later.
Like I said, just an off-the-wall idea.
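For that second pass, one option (a sketch, assuming you keep the indexer table around) is to re-run the INSERT with ON CONFLICT DO NOTHING, so any ids still queued are not duplicated; rows already set no longer match the predicate and are skipped:

insert into indexer(mytable_id)
select mytable_id
  from mytable
 where col1 is null    -- adjust the predicate to your actual condition, per the edit below
   and col2 is null
on conflict (mytable_id) do nothing;

Then call the function again until it returns 0.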

Edited: Must have read into it something that wasn't there concerning col1. However, the idea remains the same; just change the INSERT statement for 'indexer' to meet your requirements. As for setting it in the 'index': no, the 'index' contains a single column, the primary key of the big table (and of itself).
Yes, you would need to run it multiple times unless you give it the total number of rows to process as the parameter. Below is a DO block that would satisfy your condition. It processes 200,000 rows on each pass; change that to fit your needs.

Do $$
declare
    rows_remaining bigint;
begin
loop
    rows_remaining := set_mytable_col2(200000);
    commit;  -- requires PostgreSQL 11+; gives each batch its own transaction
    exit when rows_remaining = 0;
end loop;
end; $$;

1 Comment

I need to run the function over and over again, right? Is there a way I can do it in one function call with multiple commits using a loop, so I don't have to manually run it again? Also, col1 is not null based on my condition; I'm assuming you setting it as null for the index table is a typo?
