
I have the following postgres query that's meant to update all message entries:

update message m set user_id = u.id from "user" u where u.user_pgid = m.user_pgid

The trouble is that I have ~300,000,000 message records, and postgres seems to be using only one core for this massive update, causing an IO timeout before the operation can complete.

How do I speed up this simple update command to make it as fast as possible?

The type in the first clause is UUID, and in the second it is a 64-bit integer (i64).

  • Any other column that identifies each row? Are they sequentially generated? If yes, you can update in "chunks" using that PK, like updating rows with id 1 to 50, then rows 51 to 100, etc. You can run all of those queries in parallel. Oh, and don't forget indexes. Commented Dec 22, 2023 at 1:19
  • @BagusTesa the _pgid fields are technically bigserial type, so I suppose I could script & chunk it. Commented Dec 22, 2023 at 1:40
  • If you have to update every row, you might be better off creating a new fresh table. Commented Dec 22, 2023 at 2:47
  • @FrankHeikens how is this? Commented Dec 22, 2023 at 3:44

1 Answer


Parallel query plans only for SELECT

Postgres only uses parallel plans for SELECT (and some commands based on a SELECT, like CREATE TABLE AS). Queries modifying any rows cannot be parallelized. The manual:

Even when it is in general possible for parallel query plans to be generated, the planner will not generate them for a given query if any of the following are true:

  • The query writes any data or locks any database rows. If a query contains a data-modifying operation either at the top level or within a CTE, no parallel plans for that query will be generated.
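You can see this with EXPLAIN. A minimal sketch, assuming the tables from the question: the plain SELECT may get a parallel plan with a Gather node, while the UPDATE never will.

EXPLAIN SELECT count(*) FROM public.message;   -- may show: Gather ... Workers Planned: 2
EXPLAIN UPDATE public.message m
        SET    user_id = u.id
        FROM   public."user" u
        WHERE  u.user_pgid = m.user_pgid;      -- no parallel workers in this plan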

Manual parallelization

To address what you asked: split the table into N non-overlapping slices along user_pgid, roughly equal in size, where N is the number of processes to employ in parallel. N should not exceed the number of CPU cores available, and in total you should stay below the I/O capacity of your DB server.

Create a PROCEDURE like:

CREATE OR REPLACE PROCEDURE public.f_upd_message(
     _lower_incl bigint
   , _upper_excl bigint
   , _step int = 50000
   )
  LANGUAGE plpgsql AS
$proc$
DECLARE
   _low bigint;
   _upd_ct int;
BEGIN
   IF _upper_excl <= _lower_incl OR _step < 1 THEN
      RAISE EXCEPTION '_upper_excl must be > _lower_incl & _step > 0! Was: _lower_incl: %, _upper_excl: %, _step: %'
                     , _lower_incl, _upper_excl, _step;
   END IF;
   
   FOR _low IN _lower_incl .. _upper_excl - 1 BY _step
   LOOP
      RAISE NOTICE 'user_pgid >= % AND user_pgid < %'
                  , _low, LEAST(_upper_excl, _low + _step);  -- optional

      UPDATE public.message m
      SET    user_id = u.id
      FROM   public."user" u
      WHERE  m.user_pgid >= _low
      AND    m.user_pgid <  _low + _step
      AND    m.user_pgid <  _upper_excl  -- enforce upper bound
      AND    u.user_pgid = m.user_pgid
      AND    m.user_id <> u.id;          -- ① suppress empty updates
      
      GET DIAGNOSTICS _upd_ct = ROW_COUNT;  -- optional

      COMMIT;

      RAISE NOTICE 'Updated % rows', _upd_ct;  -- optional
   END LOOP;
END
$proc$;

Call:

CALL public.f_upd_message(20000000, 30000000);

Or:

CALL public.f_upd_message(20000000, 30000000, 100000);

① Avoid empty updates (where the column value wouldn't change). If user_id can be null, use null-safe comparison with IS DISTINCT FROM.
This also prevents repeated updates in case you have to start over or mess up slices.

Base your slices on actual min and max user_pgid:

SELECT min(user_pgid) AS _lower_incl, max(user_pgid) + 1 AS _upper_excl
FROM   public.message;
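If you prefer to compute the slice boundaries in SQL, here is a sketch that divides the overall range arithmetically into 4 slices (the slice count is just an assumption; gaps in the sequence only make some slices smaller):

SELECT slice
     , _min + (_max + 1 - _min) * (slice - 1) / 4 AS _lower_incl
     , _min + (_max + 1 - _min) *  slice      / 4 AS _upper_excl
FROM  (SELECT min(user_pgid) AS _min, max(user_pgid) AS _max FROM public.message) b
CROSS  JOIN generate_series(1, 4) AS slice
ORDER  BY slice;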

Adjust the _step size to your system. I added a default of 50000.

Then run N separate sessions, each processing one slice. For example, start N psql instances (manually or in a shell script).
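For illustration, three parallel sessions might cover a hypothetical ID range of 1 .. 30,000,000 like this, with each CALL issued from its own psql client:

CALL public.f_upd_message(       1, 10000001);   -- session 1
CALL public.f_upd_message(10000001, 20000001);   -- session 2
CALL public.f_upd_message(20000001, 30000001);   -- session 3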

Each step is committed. If you still run into a timeout (or any other problem), committed work is not rolled back.


Side effects, notes

The table will grow up to twice its size because every update leaves a dead tuple behind. You may want to run VACUUM FULL afterwards, if you can afford to do so. Alternatively, issue VACUUM at reasonable intervals to make the space occupied by dead tuples available for reuse.
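For instance, once all slices are done and if you can afford to lock the table exclusively for a while:

VACUUM FULL ANALYZE public.message;   -- rewrites the table, removing bloat left by dead tuples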

Various other optimizations are possible, like dropping and later recreating FK constraints and indexes. But you absolutely need an index on message(user_pgid) for this!
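A minimal sketch, with a made-up index name (skip it if a suitable index already exists; "user".user_pgid should be covered too, which it is if that column is the PK or has a UNIQUE constraint):

CREATE INDEX IF NOT EXISTS message_user_pgid_idx ON public.message (user_pgid);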

If you are at liberty to do so, create an updated (sorted?) copy of the table instead of updating all rows, like Frank already suggested. That gives you a pristine (clustered) table without bloat.
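A rough sketch of that alternative, with made-up column names standing in for the real table definition (constraints, indexes, privileges and the final rename are left out):

CREATE TABLE public.message_new AS
SELECT m.message_pgid, m.user_pgid, u.id AS user_id   -- ... plus all other message columns
FROM   public.message m
JOIN   public."user" u USING (user_pgid)
ORDER  BY m.user_pgid;   -- optional: physically cluster rows

-- Being based on a SELECT, this can even use a parallel plan (see above).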


4 Comments

This is a lot of new stuff to me :) Thank you for the detailed answer!
I don't believe you would actually do a FOR ... LOOP if you intend to run this function in separate sessions. The FOR ... LOOP does not parallelize at all, but rather runs each range in series.
@TomGrushka: Seems like you stopped reading halfway into the answer. The loop is only for committing chunks within the same session. You run multiple sessions in parallel ...
Seems like an over-complication of the original question. The question was how to update many rows in parallel. The first 2/3 of your answer addresses an entirely different question. It makes it difficult and confusing for people when a different question is answered before the original question.
