Parallel query plans only for SELECT
Postgres only uses parallel plans for SELECT (and some commands based on a SELECT, like CREATE TABLE AS). Queries that modify any rows cannot be parallelized. The manual:
Even when it is in general possible for parallel query plans to be
generated, the planner will not generate them for a given query if any
of the following are true:
- The query writes any data or locks any database rows. If a query contains a data-modifying operation either at the top level or within
a CTE, no parallel plans for that query will be generated.
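To see the difference, compare plans. A minimal sketch (the planner only chooses a parallel plan for a sufficiently large table, with parallelism enabled):

EXPLAIN SELECT count(*) FROM public.message;          -- can show "Gather ... Workers Planned"
EXPLAIN UPDATE public.message SET user_id = user_id;  -- never gets a parallel plan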
Manual parallelization
To address what you asked: split the table into N non-overlapping slices along user_pgid, roughly equal in size, where N is the number of processes to employ in parallel. N should not exceed the number of CPU cores available, and in total you want to stay below the I/O capacity of your DB server.
Create a PROCEDURE like:
CREATE OR REPLACE PROCEDURE public.f_upd_message(
   _lower_incl bigint
 , _upper_excl bigint
 , _step int = 50000
)
LANGUAGE plpgsql AS
$proc$
DECLARE
   _low bigint;
   _upd_ct int;
BEGIN
   IF _upper_excl <= _lower_incl OR _step < 1 THEN
      RAISE EXCEPTION '_upper_excl must be > _lower_incl & _step > 0! Was: _lower_incl: %, _upper_excl: %, _step: %'
                    , _lower_incl, _upper_excl, _step;
   END IF;

   FOR _low IN _lower_incl .. _upper_excl - 1 BY _step
   LOOP
      RAISE NOTICE 'user_pgid >= % AND user_pgid < %'
                 , _low, LEAST(_upper_excl, _low + _step);  -- optional

      UPDATE public.message m
      SET    user_id = u.id
      FROM   public."user" u
      WHERE  m.user_pgid >= _low
      AND    m.user_pgid <  _low + _step
      AND    m.user_pgid <  _upper_excl   -- enforce upper bound
      AND    u.user_pgid =  m.user_pgid
      AND    m.user_id   <> u.id;         -- ① suppress empty updates

      GET DIAGNOSTICS _upd_ct = ROW_COUNT;     -- optional
      COMMIT;
      RAISE NOTICE 'Updated % rows', _upd_ct;  -- optional
   END LOOP;
END
$proc$;
Call:
CALL public.f_upd_message(20000000, 30000000);
Or:
CALL public.f_upd_message(20000000, 30000000, 100000);
① Avoid empty updates (where the column value wouldn't change). If user_id can be null, use null-safe comparison with IS DISTINCT FROM.
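With a nullable user_id, the last predicate of the UPDATE above becomes:

AND m.user_id IS DISTINCT FROM u.id  -- ① null-safe variant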
Clause ① also prevents repeated updates in case you have to start over or mess up slices.
Base your slices on actual min and max user_pgid:
SELECT min(user_pgid) AS _lower_incl, max(user_pgid) + 1 AS _upper_excl
FROM public.message;
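To derive all N slice boundaries in one go, a sketch assuming N = 4 sessions (integer division keeps the slices non-overlapping and gap-free):

WITH bounds AS (
   SELECT min(user_pgid) AS lo, max(user_pgid) + 1 AS hi
   FROM   public.message
   )
SELECT lo + (hi - lo) *  i      / 4 AS _lower_incl
     , lo + (hi - lo) * (i + 1) / 4 AS _upper_excl
FROM   bounds, generate_series(0, 3) i;

With a mostly gap-less bigserial, equal ID ranges translate to roughly equal row counts.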
Adjust the _step size to your system. I added a default of 50000.
Then run N separate sessions, each processing one slice. For example, start N psql instances (manually or in a shell script).
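A minimal shell sketch for 4 slices; database name, connection options, and boundaries are placeholders. Each -c runs a single CALL at top level, so the COMMIT inside the procedure is allowed:

psql -d mydb -c 'CALL public.f_upd_message(       1,  25000001)' &
psql -d mydb -c 'CALL public.f_upd_message(25000001,  50000001)' &
psql -d mydb -c 'CALL public.f_upd_message(50000001,  75000001)' &
psql -d mydb -c 'CALL public.f_upd_message(75000001, 100000001)' &
wait  # returns once all four sessions have finished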
Each step is committed. If you still run into a timeout (or any other problem), committed work is not rolled back.
Side effects, notes
The table will grow up to twice its size because every update leaves a dead tuple behind. You may want to run VACUUM FULL afterwards - if you can afford to do so. Alternatively, issue plain VACUUM at reasonable intervals to make space from dead tuples available for reuse. (VACUUM cannot run inside a function or procedure, so issue it from a separate session.)
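For example (plain VACUUM makes dead tuples reusable without blocking concurrent access; VACUUM FULL rewrites the table and needs an exclusive lock):

VACUUM ANALYZE public.message;  -- between / after slice runs; non-blocking
VACUUM FULL public.message;     -- afterwards, only if you can afford the lock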
Various other optimizations are possible, like dropping and later recreating FK constraints, indexes, ... But you absolutely need an index on message(user_pgid) for this!
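If that index isn't there yet (the index name is an arbitrary choice):

CREATE INDEX message_user_pgid_idx ON public.message (user_pgid);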
If you are at liberty to do so, create an updated (sorted?) copy of the table instead of updating all rows, like Frank already suggested. That gives you a pristine (clustered) table without bloat.
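A sketch of that route; the column list and new table name are assumptions, and indexes, constraints, defaults, and privileges have to be recreated on the copy before swapping names:

CREATE TABLE public.message_new AS
SELECT m.user_pgid, u.id AS user_id  -- add all remaining columns of message
FROM   public.message m
JOIN   public."user" u USING (user_pgid)
ORDER  BY m.user_pgid;               -- optional: write the copy in clustered order

As a bonus, the SELECT feeding CREATE TABLE AS can use a parallel plan, unlike the UPDATE.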