
We're running an AWS Aurora PostgreSQL version 13 database. The other day, I was trying to set up a backup job to run from the read replica, and I ended up encountering the error "User was holding a relation lock for too long." I worked around it using the answer in this question.

My question is theoretical, and forgive me if it sounds stupid, but I'm missing something here. If Postgres has MVCC instead of locking, why would pg_dump care about a user "locking" another relation? Shouldn't it just be reading the last version of the row?

1 Answer


This is about table locks and streaming replication conflicts. My answer is about PostgreSQL and will apply to Amazon Aurora only to the (unknown) extent that it behaves like PostgreSQL.

pg_dump has to read the tables, and reading a table requires an ACCESS SHARE lock on the table. Such a lock conflicts only with activities like DROP TABLE, TRUNCATE, CLUSTER, VACUUM (FULL) and certain variants of ALTER TABLE. An ACCESS SHARE lock does not block writers; it only prevents concurrent sessions from deleting the data file you are currently reading.
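As a sketch of this behavior on plain PostgreSQL (the table name here is a placeholder, and you would run the sessions in separate psql connections):

```sql
-- Session 1: start a transaction that reads the table.
-- This takes an ACCESS SHARE lock that is held until commit.
BEGIN;
SELECT count(*) FROM my_table;

-- Session 2: ordinary writers are NOT blocked by ACCESS SHARE...
UPDATE my_table SET id = id;     -- proceeds
-- ...but TRUNCATE needs ACCESS EXCLUSIVE, so it must wait:
TRUNCATE my_table;               -- blocks until session 1 commits

-- Session 3: inspect the lock queue.
SELECT locktype, mode, granted
FROM pg_locks
WHERE relation = 'my_table'::regclass;
```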

Now if you have a long running query like pg_dump on the standby and somebody TRUNCATEs a table on the primary, PostgreSQL will try to replay the statement and the associated ACCESS EXCLUSIVE lock on the standby. This will conflict with the long running query, and if pg_dump is not done after max_standby_streaming_delay has passed, the query is canceled and pg_dump terminates with an error.
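On a stock PostgreSQL standby you can check the relevant setting, and the cancellation shows up as the error from the question:

```sql
-- On the standby: how long WAL replay will wait for conflicting
-- queries before canceling them (the default is 30s):
SHOW max_standby_streaming_delay;

-- When the delay is exceeded, the query on the standby fails with:
-- ERROR:  canceling statement due to conflict with recovery
-- DETAIL:  User was holding a relation lock for too long.
```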

Note that the conflicts need not be with one of the above statements: if autovacuum processes a table on the primary, and the last couple of pages in the table become empty, VACUUM will try to remove these pages, which also requires a brief ACCESS EXCLUSIVE lock on the table. This does not disrupt processing on the primary, but may lead to queries being canceled on the standby.
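If the vacuum truncation is what keeps canceling your queries, one targeted mitigation on stock PostgreSQL (version 12 and later) is to disable truncation for the affected table; the table name is a placeholder:

```sql
-- Tell (auto)vacuum not to truncate empty pages at the end of the
-- table, so it never needs the brief ACCESS EXCLUSIVE lock:
ALTER TABLE my_table SET (vacuum_truncate = off);
```

The trade-off is that the table's empty trailing pages are not returned to the operating system.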

Set max_standby_streaming_delay to -1 on the standby server to avoid the problem.
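On plain PostgreSQL that is a change to the standby's configuration (Aurora exposes parameters through its parameter groups instead, so the mechanics there may differ):

```
# postgresql.conf on the standby
max_standby_streaming_delay = -1   # wait forever instead of canceling queries
```

Note the trade-off: with -1, WAL replay waits as long as the conflicting query runs, so the standby can fall arbitrarily far behind the primary.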

Here is an article that deals with the problem in more detail.

  • How is max_standby_streaming_delay implemented internally? Is there some kind of fair queue where new read-only queries wait for apply, or may we get a situation where apply fails to promote itself due to a large stream of read-only queries? Commented Oct 23, 2023 at 1:56
  • @AndreyB.Panfilov Interesting question. I just tested it, and locks will queue as usual: a query that starts running on the standby after the startup process has started to wait for the lock to apply a TRUNCATE gets blocked. So a steady stream of queries won't block replication forever, only until all queries that started before the lock are done. Commented Oct 23, 2023 at 2:44
