
I have the following postgres query that's meant to update all message entries:

update message m set user_id = u.id from "user" u where u.user_pgid = m.user_pgid

The trouble is that I have ~300,000,000 message records, and postgres seems to be using only one core for this massive update, causing an IO timeout before the operation can complete.

How do I speed up this simple update command to make it as fast as possible?

The type in the first clause is UUID, and in the second it is a 64-bit integer (i64).

  • Any other column that identifies each row? Are they sequentially generated? If yes, you can update in "chunks" using that PK, like updating rows with id 1 to 50, then rows 51 to 100, etc. You can run all of those queries in parallel. Oh, and don't forget indexes. Commented Dec 22, 2023 at 1:19
  • @BagusTesa the _pgid fields are technically bigserial type, so I suppose I could script & chunk it. Commented Dec 22, 2023 at 1:40
  • If you have to update every row, you might be better off creating a new fresh table. Commented Dec 22, 2023 at 2:47
  • @FrankHeikens how is this? Commented Dec 22, 2023 at 3:44

1 Answer


Parallel query plans only for SELECT

Postgres only uses parallel plans for SELECT (and some commands based on a SELECT, like CREATE TABLE AS). Queries modifying any rows cannot be parallelized. The manual:

Even when it is in general possible for parallel query plans to be generated, the planner will not generate them for a given query if any of the following are true:

  • The query writes any data or locks any database rows. If a query contains a data-modifying operation either at the top level or within a CTE, no parallel plans for that query will be generated.
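You can see this with EXPLAIN. A minimal sketch, assuming the tables from the question: the plain SELECT may get a parallel plan with a Gather node, while the UPDATE never will.

EXPLAIN SELECT count(*) FROM public.message;   -- may show: Gather ... Workers Planned: 2
EXPLAIN UPDATE public.message m
        SET    user_id = u.id
        FROM   public."user" u
        WHERE  u.user_pgid = m.user_pgid;      -- no parallel workers in this plan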

Manual parallelization

To address what you asked: split the table into N non-overlapping slices along user_pgid, roughly equal in size, where N is the number of processes to employ in parallel. N should not exceed the number of CPU cores available, and in total you should stay below the I/O capacity of your DB server.

Create a PROCEDURE like:

CREATE OR REPLACE PROCEDURE public.f_upd_message(
     _lower_incl bigint
   , _upper_excl bigint
   , _step int = 50000
   )
  LANGUAGE plpgsql AS
$proc$
DECLARE
   _low bigint;
   _upd_ct int;
BEGIN
   IF _upper_excl <= _lower_incl OR _step < 1 THEN
      RAISE EXCEPTION '_upper_excl must be > _lower_incl & _step > 0! Was: _lower_incl: %, _upper_excl: %, _step: %'
                     , _lower_incl, _upper_excl, _step;
   END IF;
   
   FOR _low IN _lower_incl .. _upper_excl - 1 BY _step
   LOOP
      RAISE NOTICE 'user_pgid >= % AND user_pgid < %'
                  , _low, LEAST(_upper_excl, _low + _step);  -- optional

      UPDATE public.message m
      SET    user_id = u.id
      FROM   public."user" u
      WHERE  m.user_pgid >= _low
      AND    m.user_pgid <  _low + _step
      AND    m.user_pgid <  _upper_excl  -- enforce upper bound
      AND    u.user_pgid = m.user_pgid
      AND    m.user_id <> u.id;          -- ① suppress empty updates
      
      GET DIAGNOSTICS _upd_ct = ROW_COUNT;  -- optional

      COMMIT;

      RAISE NOTICE 'Updated % rows', _upd_ct;  -- optional
   END LOOP;
END
$proc$;

Call:

CALL public.f_upd_message(20000000, 30000000);

Or:

CALL public.f_upd_message(20000000, 30000000, 100000);

① Avoid empty updates (where the column value wouldn't change). If user_id can be null, use null-safe comparison with IS DISTINCT FROM.
This also prevents repeated updates in case you have to start over or mess up slices.

Base your slices on actual min and max user_pgid:

SELECT min(user_pgid) AS _lower_incl, max(user_pgid) + 1 AS _upper_excl
FROM   public.message;
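If you prefer to compute the slice boundaries in SQL, here is a sketch that divides the overall range arithmetically into 4 slices (the slice count is just an assumption; gaps in the sequence only make some slices smaller):

SELECT slice
     , _min + (_max + 1 - _min) * (slice - 1) / 4 AS _lower_incl
     , _min + (_max + 1 - _min) *  slice      / 4 AS _upper_excl
FROM  (SELECT min(user_pgid) AS _min, max(user_pgid) AS _max FROM public.message) b
CROSS  JOIN generate_series(1, 4) AS slice
ORDER  BY slice;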

Adjust the _step size to your system. I added a default of 50000.

Then run N separate sessions, each processing one slice. For example, start N psql instances (manually or in a shell script).
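For illustration, three parallel sessions might cover a hypothetical ID range of 1 .. 30,000,000 like this, with each CALL issued from its own psql client:

CALL public.f_upd_message(       1, 10000001);   -- session 1
CALL public.f_upd_message(10000001, 20000001);   -- session 2
CALL public.f_upd_message(20000001, 30000001);   -- session 3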

Each step is committed. If you still run into a timeout (or any other problem), committed work is not rolled back.


Side effects, notes

The table will grow up to twice its size because every update leaves a dead tuple behind. You may want to run VACUUM FULL afterwards, if you can afford to do so. Alternatively, issue VACUUM at reasonable intervals to make the space occupied by dead tuples available for reuse.
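For instance, once all slices are done and if you can afford to lock the table exclusively for a while:

VACUUM FULL ANALYZE public.message;   -- rewrites the table, removing bloat left by dead tuples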

Various other optimizations are possible, like dropping and later recreating FK constraints and indexes. But you absolutely need an index on message(user_pgid) for this!
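A minimal sketch, with a made-up index name (skip it if a suitable index already exists; "user".user_pgid should be covered too, which it is if that column is the PK or has a UNIQUE constraint):

CREATE INDEX IF NOT EXISTS message_user_pgid_idx ON public.message (user_pgid);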

If you are at liberty to do so, create an updated (sorted?) copy of the table instead of updating all rows, like Frank already suggested. That gives you a pristine (clustered) table without bloat.
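A rough sketch of that alternative, with made-up column names standing in for the real table definition (constraints, indexes, privileges and the final rename are left out):

CREATE TABLE public.message_new AS
SELECT m.message_pgid, m.user_pgid, u.id AS user_id   -- ... plus all other message columns
FROM   public.message m
JOIN   public."user" u USING (user_pgid)
ORDER  BY m.user_pgid;   -- optional: physically cluster rows

-- Being based on a SELECT, this can even use a parallel plan (see above).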


4 Comments

This is a lot of new stuff to me :) Thank you for the detailed answer!
I don't believe you would actually do a FOR ... LOOP if you intend to run this function in separate sessions. The FOR ... LOOP does not parallelize at all, but rather runs each range in series.
@TomGrushka: Seems like you stopped reading halfway into the answer. The loop is only for committing chunks within the same session. You run multiple sessions in parallel ...
Seems like an over-complication of the original question. The question was how to update many rows in parallel. The first 2/3 of your answer addresses an entirely different question. It makes it difficult and confusing for people when a different question is answered before the original question.
