
I have a very large table with 100M rows in which I want to update a column with a value based on another column. An example query showing what I want to do is given below:

UPDATE mytable SET col2 = 'ABCD'
WHERE col1 is not null

This is a master DB in a live environment with multiple slaves, and I want to update it without locking the table or affecting the performance of the live environment. What would be the most effective way to do it? I'm thinking of making a procedure that updates rows in batches of 1000 or 10000 using something like LIMIT, but I'm not quite sure how to do it, as I'm not that familiar with Postgres and its pitfalls. Oh, and neither column has an index, but the table has other columns that do.

I would appreciate a sample procedure code.

Thanks.

4 Comments
  • What does col2 start out as? Are other processes also updating it at the same time? Is it really 'ABCD', or something row-specific? Are you worried about table-level locks or row-level locks? Have you already tried the simple solution and found it untenable, or are you worrying preemptively? Commented Dec 11, 2019 at 14:41
  • Are you aware that even if that update locks many rows (not the table!), those locks will not prevent SELECT or INSERT statements on that table. Commented Dec 11, 2019 at 14:49
  • I haven't tried it yet, and I'm worried preemptively because it's quite crucial for the function of the application. col2 is null, and I should only set it if it's null. Other processes won't be updating it, only inserting rows with some values. Commented Dec 11, 2019 at 16:10
  • @DashingBoy: then you have no problems. The UPDATE will put some I/O load on the system, but the locks won't bother you. Commented Dec 11, 2019 at 22:37

2 Answers


There is no update without locking, but you can strive to keep the row locks few and short.

You could simply run batches of this:

UPDATE mytable
SET col2 = 'ABCD'
FROM (SELECT id
      FROM mytable
      WHERE col1 IS NOT NULL
        AND col2 IS DISTINCT FROM 'ABCD'
      LIMIT 10000) AS part
WHERE mytable.id = part.id;

Just keep repeating that statement until it modifies fewer than 10000 rows; then you are done.
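A note on the IS DISTINCT FROM condition: unlike <>, it treats NULL as a regular value, so rows where col2 is still NULL are picked up, while rows already set to 'ABCD' are not updated twice:

SELECT NULL <> 'ABCD';                 -- NULL: a plain <> test would skip these rows
SELECT NULL IS DISTINCT FROM 'ABCD';   -- true: the row qualifies for the update
SELECT 'ABCD' IS DISTINCT FROM 'ABCD'; -- false: already done, skipped on the next pass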

Note that mass updates don't lock the table, but of course they lock the updated rows, and the more of them you update, the longer the transaction, and the greater the risk of a deadlock.

To make that performant, an index like this would help:

CREATE INDEX ON mytable (col2) WHERE col1 IS NOT NULL;
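To run the batches unattended, here is a minimal sketch of a driver loop, assuming PostgreSQL 11 or later (COMMIT is allowed inside a DO block there) and that mytable's primary key is the id column used in the query above:

DO $$
DECLARE
    batch bigint;
BEGIN
    LOOP
        UPDATE mytable
        SET col2 = 'ABCD'
        FROM (SELECT id
              FROM mytable
              WHERE col1 IS NOT NULL
                AND col2 IS DISTINCT FROM 'ABCD'
              LIMIT 10000) AS part
        WHERE mytable.id = part.id;

        GET DIAGNOSTICS batch = ROW_COUNT;  -- rows updated in this pass
        COMMIT;                             -- keep each batch in its own short transaction
        EXIT WHEN batch < 10000;            -- a partial batch means we are done
    END LOOP;
END $$;

Each pass locks at most 10000 rows and releases them at the COMMIT, so locks stay few and short.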

8 Comments

It should probably be AND col2 is distinct from 'ABCD' in the update WHERE. I think I would also add that to the index WHERE as well.
In the WHERE clause it is only good if the constant is - well - constant.
One thing I didn't understand is, will this query be enough? No loop? Should I encapsulate it inside a procedure with a loop? I think I didn't quite get the distinct from clause.
I have added usage instructions to make it clearer. col2 IS DISTINCT FROM 'ABCD' is the opposite of col2 = 'ABCD', so that no row is updated twice.
I think I didn't really elaborate. How can I make sure the batches are complete if I run it in a loop? How do I write an exit condition, and will it all be in a single transaction, or should I split each batch into a different transaction? Sorry if it's all very obvious and I'm not grasping it.

Just an off-the-wall, out-of-the-box idea. Requiring both col1 and col2 to be null to qualify precludes using an index, so perhaps building a pseudo-index might be an option. This 'index' would of course be a regular table, but it would only exist for a short period. Additionally, this relieves the lock-time worry.

create table indexer (mytable_id integer  primary key);

insert into indexer(mytable_id)
select mytable_id
  from mytable
 where col1 is null
   and col2 is null;

The above creates our 'index' that contains only the qualifying rows. Now wrap an update/delete statement into an SQL function. This function updates the main table, deletes the updated rows from the 'index', and returns the number of rows remaining.

create or replace function set_mytable_col2(rows_to_process_in integer)
returns bigint
language sql
as $$
    -- update one batch of rows and remove them from the 'index'
    with idx as
       ( update mytable
            set col2 = 'ABCD'
          where col2 is null
            and mytable_id in (select mytable_id
                                 from indexer
                                limit rows_to_process_in
                               )
         returning mytable_id
       )
    delete from indexer
     where mytable_id in (select mytable_id from idx);

    -- report how many rows remain to be processed
    select count(*) from indexer;
$$;

When the function returns 0, all rows initially selected have been processed. At this point, repeat the entire process to pick up any rows that were added or updated but that the initial selection didn't identify. That should be a small number, and the process is still available if needed later.
Like I said, just an off-the-wall idea.
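For that second pass, one option (a sketch, assuming you keep the indexer table around) is to re-run the INSERT with ON CONFLICT DO NOTHING, so any ids still queued are not duplicated; rows already set no longer match the predicate and are skipped:

insert into indexer(mytable_id)
select mytable_id
  from mytable
 where col1 is null    -- adjust the predicate to your actual condition, per the edit below
   and col2 is null
on conflict (mytable_id) do nothing;

Then call the function again until it returns 0.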

Edited: Must have read into it something that wasn't there concerning col1. However, the idea remains the same; just change the INSERT statement for 'indexer' to meet your requirements. As for setting it in the 'index': no, the 'index' contains a single column, the primary key of the big table (and of itself).
Yes, you would need to run it multiple times unless you give it the total number of rows to process as the parameter. Below is a DO block that would satisfy your condition. It processes 200,000 rows on each pass; change that to fit your needs.

Do $$
declare
    rows_remaining bigint;
begin
loop
    rows_remaining := set_mytable_col2(200000);
    commit;  -- requires PostgreSQL 11+; gives each batch its own transaction
    exit when rows_remaining = 0;
end loop;
end; $$;

1 Comment

I need to run the function over and over again, right? Is there a way I can do it in one function call with multiple commits using a loop, so I don't have to manually run it again? Also, col1 is not null based on my condition; I'm assuming you setting it as null for the index table is a typo?
