Two weeks after migrating to Postgres 17.2, we started getting an error on a query that had worked flawlessly for years on Postgres 14. I suspected this could be related to the database configuration parameters or to the major version, but as far as we know we are using the default values that come with both versions. We are running a Postgres RDS instance with 64 GB of RAM, and our data flow joins tables with tens of millions of records. We managed to isolate the issue to a query that does two regular outer joins on indexed columns.

SELECT *
FROM company
LEFT OUTER JOIN company_s
  ON company.domain = company_s.domain
LEFT OUTER JOIN socials
  ON company_s.raw_domain = socials.domain

This query normally returns in under 4 minutes. Now it runs for 5 to 11 minutes and then fails with the error

invalid DSA memory alloc request size 1811939328

Oddly, the flow ran without issue for the first two weeks and then stopped working on the same data.

This shows up in our identical staging and production environments. In both cases the instance abnormally uses all of its memory (usually more than 10 GB is free while the flows run). In staging it tends to fail after 11 minutes, sometimes accompanied by SSL SYSCALL error: EOF detected. In production it fails after 5 minutes with: invalid DSA memory alloc request size 1811939328

What we tried

  1. Trimming oversized entries
  2. Rebuilding all indexes
  3. Running ANALYZE on all the involved tables
  4. Running on a reduced sample of only 1 million entries, which works but is not a real solution to our problem
  5. Increasing shared_buffers to 32GB
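
Since the failure appears to depend on memory and parallelism settings, it can help to confirm what the instance is actually running with rather than assuming defaults. A minimal sketch using the standard pg_settings view; the list of parameter names is only a guess at what is relevant to this error:

```sql
-- Show the effective value and origin of the settings most likely to matter.
-- "source" indicates whether a value is the built-in default or was
-- overridden somewhere (configuration file, per-role or per-database setting).
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN (
    'work_mem',
    'hash_mem_multiplier',
    'shared_buffers',
    'max_parallel_workers_per_gather',
    'max_parallel_workers'
)
ORDER BY name;
```

Running the same statement on the old and new instances, if the old one is still available, would show whether any effective value changed with the upgrade.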

There is a bug report for PG 17 with the same error, but it contains no solution: https://www.postgresql.org/message-id/18349-83d33dd3d0c855c3%40postgresql.org

There are a few questions about the same problem, mostly without answers or with answers that do not apply to our case.

  • I was able to fix this error by running ANALYZE, postgres 16.3 Commented Mar 14 at 5:22

1 Answer

Following the lead of the possible bug report on PG17, we ran the queries with

SET max_parallel_workers_per_gather = 0

and that made the query return in 5 minutes without errors. Digging deeper, we reviewed work_mem, which RDS had set to 4MB by default. We raised it to 64MB in the configuration, based loosely on this parameter guide, and the query started working smoothly, returning in around 3 minutes.

SET work_mem TO '64MB';
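
Both settings can be tried for a single transaction before touching the instance configuration. The following is a sketch under that assumption, not necessarily the exact commands used above: SET LOCAL reverts at the end of the transaction, and plain EXPLAIN (without ANALYZE) only plans the query, so the plan can be checked for Gather or Parallel Hash nodes without waiting for it to run.

```sql
BEGIN;

-- Transaction-scoped overrides: both revert at COMMIT or ROLLBACK.
-- Comment out one of them to test each remedy separately.
SET LOCAL max_parallel_workers_per_gather = 0;  -- disables parallel query
SET LOCAL work_mem = '64MB';                    -- limit per sort/hash operation

-- Plan only, no execution.
EXPLAIN
SELECT *
FROM company
LEFT OUTER JOIN company_s
  ON company.domain = company_s.domain
LEFT OUTER JOIN socials
  ON company_s.raw_domain = socials.domain;

ROLLBACK;
```

Note that work_mem applies to each sort or hash operation, so a query with several such nodes, or several concurrent sessions, can use a multiple of the configured value.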

The working explain analyze looks like this

[image: EXPLAIN ANALYZE output of the working query]
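
For a plan that can be pasted as text instead of an image, something like the following could be used. It is a sketch: ANALYZE actually executes the query, BUFFERS adds buffer usage, and SETTINGS lists the planner-relevant parameters that differ from their built-in defaults, which may help when comparing the 4MB and 64MB runs.

```sql
-- Executes the query and prints the plan as plain text.
EXPLAIN (ANALYZE, BUFFERS, SETTINGS, FORMAT TEXT)
SELECT *
FROM company
LEFT OUTER JOIN company_s
  ON company.domain = company_s.domain
LEFT OUTER JOIN socials
  ON company_s.raw_domain = socials.domain;
```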

3 Comments

The first remedy is clear: "Dynamic shared memory" segments are used to exchange data in parallel query, so if you disable the feature the error goes away. The second remedy probably works because PostgreSQL chooses a different plan that avoids parallel query (use EXPLAIN to check).
In the second case, the query also runs in parallel. I don't have a good explanation as to why the small work_mem triggered the alloc request error. I updated the answer with the EXPLAIN ANALYZE output we currently get with work_mem at 64MB.
Hard to say, because you don't show the original, failing plan. By the way, formatted text works much better than an image: you can copy and paste the text. As it is, I have no good explanation.
