We are using the Debezium connector for PostgreSQL (Debezium 1.3, PostgreSQL 10.2) with the pgoutput plugin. We are facing a peculiar problem: when we run a SELECT on PostgreSQL that returns a lot of data from tables for which replication is not enabled, and the query takes a long time, the retained WAL size keeps growing significantly. Once the SELECT returns its data, the WAL size decreases and returns to normal after some time.
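For reference, this is roughly how the retained WAL size can be measured per replication slot. It is only a sketch: the SQL uses the PostgreSQL 10+ function names (`pg_current_wal_lsn`, `pg_wal_lsn_diff`) against the `pg_replication_slots` view, and the helper names `lsn_to_int` and `retained_bytes`, as well as the sample LSN values, are made up for illustration.

```python
# Sketch: measure how much WAL each replication slot is holding back.
# Run RETAINED_WAL_SQL with any PostgreSQL client (psql, a driver, etc.).
RETAINED_WAL_SQL = """
SELECT slot_name,
       active,
       restart_lsn,
       confirmed_flush_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;
"""


def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a byte position.

    An LSN is printed as two hexadecimal numbers: high 32 bits / low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Client-side equivalent of pg_wal_lsn_diff(current_lsn, restart_lsn)."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)


if __name__ == "__main__":
    # Hypothetical LSN values, for illustration only.
    print(retained_bytes("16/B3750000", "16/B3740000"))
```

Sampling `restart_lsn` and `confirmed_flush_lsn` while the long SELECT is running should show which of the two stops advancing.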
We already have a heartbeat query and a heartbeat topic configured in Debezium, but that is not helping much. When we check the Debezium logs while the SELECT is running, we see messages saying that offsets are being committed, so I think Debezium is still committing the LSNs it has already consumed.
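For context, the heartbeat part of the connector configuration looks roughly like the sketch below. The property names (`heartbeat.interval.ms`, `heartbeat.topics.prefix`, `heartbeat.action.query`) are the Debezium PostgreSQL connector's; the slot name, the interval, and the `debezium_heartbeat` table are hypothetical placeholders, not our real values. The heartbeat table is assumed to be one the connector actually captures.

```python
import json

# Hypothetical connector settings; only the heartbeat-related keys matter here.
connector_config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot",  # hypothetical slot name
    "heartbeat.interval.ms": "10000",  # hypothetical interval (10 s)
    "heartbeat.topics.prefix": "__debezium-heartbeat",
    # Hypothetical table; the statement runs on every heartbeat interval.
    "heartbeat.action.query": "INSERT INTO debezium_heartbeat (ts) VALUES (now())",
}


def heartbeat_settings(config: dict) -> dict:
    """Return only the heartbeat-related properties of a connector config."""
    return {k: v for k, v in config.items() if k.startswith("heartbeat.")}


if __name__ == "__main__":
    print(json.dumps(heartbeat_settings(connector_config), indent=2))
```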
Today we tried to reproduce the issue and saw that, while the SELECT query was running, Debezium lost connectivity with one of the Kafka brokers. We are not sure whether that was a coincidence or related, but Debezium kept trying to connect to the broker for a long time and only succeeded after 10+ minutes. Until it was able to connect to the Kafka broker, the LSN was not being committed and the WAL size kept increasing.
We have tried everything available in the Debezium documentation and in linked articles by others who reported similar issues with Debezium, but we still cannot find the root cause of the WAL size increase.
Because of this issue, we cannot run any query in the DB that takes more than 5-10 minutes. Sometimes it hangs in such a way that it never recovers, and we have to drop the replication slot and recreate it. In a few scenarios we could not even drop the replication slot and had to restart the DB to kill the process (we are on AWS RDS, so we cannot log in to the box and kill the process).
CDC is heavily used in the project. Upgrading to the latest versions of Debezium (2.2) and PostgreSQL (14) is in our future plans, but we are looking for solutions we can apply now.
Questions:
What is the link between the SELECT query and the WAL size increasing?
When Debezium lost connectivity with one of the brokers, shouldn't it ignore that broker after a few seconds, rebalance, and continue? Why did it keep retrying (it took almost 10 minutes to reconnect, and during that time it tried to reconnect again and again)?
We used a heartbeat topic and a heartbeat query, but they didn't help much.