
I am successfully using logical replication between two PostgreSQL 11 cloud VMs for the latest data. But when I tried to also publish some older tables to transfer data between the databases, I got a strange error about a missing WAL segment.
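For reference, this is roughly what I ran (the publication name below is just an example, not the real one; the subscription and partition names are the ones from the logs):

-- on the master
ALTER PUBLICATION mypublication ADD TABLE mytable_20190115, mytable_20190116;

-- on the logical replica
ALTER SUBSCRIPTION mysubscription REFRESH PUBLICATION;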

These older partitions contain data that is 5-6 days old. I successfully published them on the master and refreshed the subscription on the logical replica. But now I am getting these strange error messages on the logical replica:

2019-01-21 15:03:14.713 UTC [17203] LOG:  logical replication table synchronization worker for subscription "mysubscription", table "mytable_20190115" has finished
2019-01-21 15:03:19.768 UTC [18877] LOG:  logical replication apply worker for subscription "mysubscription" has started
2019-01-21 15:03:19.797 UTC [18877] ERROR:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000098E000000CB has already been removed
2019-01-21 15:03:19.799 UTC [29534] LOG:  background worker "logical replication worker" (PID 18877) exited with exit code 1
2019-01-21 15:03:24.806 UTC [18910] LOG:  logical replication apply worker for subscription "mysubscription" has started
2019-01-21 15:03:24.824 UTC [18911] LOG:  logical replication table synchronization worker for subscription "mysubscription", table "mytable_20190116" has started
2019-01-21 15:03:24.831 UTC [18910] ERROR:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000098E000000CB has already been removed
2019-01-21 15:03:24.834 UTC [29534] LOG:  background worker "logical replication worker" (PID 18910) exited with exit code 1

This is confusing to me. I tried to find some information but did not find anything about logical replication depending on WAL segments.

There is no streaming replication running on that particular master; the master and the replica are connected only by logical replication, and I see these error messages on both of them.

Am I doing something wrong? Is there some special way to publish older data? For new and latest data everything works without problems.

Of course, since I published about 20 tables, it took some time for the replica to process them all - it always processes 2 at a time. But I still do not understand why this should depend on WAL segments... Thank you very much.
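If it helps with diagnosing this, the replication slots and walsender processes can be inspected on the master with the standard catalog views (nothing custom here):

-- replication slots created for the subscription and its table-sync workers
SELECT slot_name, plugin, slot_type, active, restart_lsn FROM pg_replication_slots;

-- currently connected walsender processes
SELECT pid, application_name, state, sent_lsn, replay_lsn FROM pg_stat_replication;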

UPDATE: I tried to unpublish and unsubscribe these older tables and then publish and subscribe them again, but I am still getting the same error message for exactly the same WAL segment number.

UPDATE 2: I unpublished and unsubscribed those problematic tables and the error messages stopped, so they were definitely related to logical replication. Could they be caused by the snapshot?

UPDATE 3: I just had another strange experience with WAL segment errors - my logical replica had only a fairly small disk, and during all that fiddling I forgot to check disk usage, so PostgreSQL on the logical replica crashed due to a full disk. Since I use GCE, I simply resized the root disk and had more space after restarting the instance. But I also got the missing WAL segment errors back in connection with logical replication. My PostgreSQL log on the replica is now full of repetitions of these 3 lines:

2019-01-22 09:47:14.408 UTC [1946] LOG:  logical replication apply worker for subscription "mysubscription" has started
2019-01-22 09:47:14.429 UTC [1946] ERROR:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000099D0000007A has already been removed
2019-01-22 09:47:14.431 UTC [737] LOG:  background worker "logical replication worker" (PID 1946) exited with exit code 1

Why does logical replication depend on WAL segments?

1 Answer

So I found out what was wrong, thanks to the clever people on the pgsql-general mailing list.

  1. Logical replication really does depend on WAL segments - https://www.postgresql.org/docs/11/logical-replication-architecture.html - changes are decoded from and distributed via the WAL - this is why the parameter "wal_level" must be set to "logical" on the master.

  2. My problem with WAL segments was a combination of these circumstances:

    • I tried to publish and subscribe all our huge tables at once - for context, we have about 500 million records daily; the biggest table has a daily partition of ~30 GB, the others 1-5 GB.
    • In such a case PostgreSQL creates a snapshot and, after the subscription is activated, starts to transfer the data from the snapshot to the replica. Only after the whole snapshot has been transferred does the walsender start to send WAL for the latest changes.
    • Since I published about 200 GB of data covering several days at once, you can imagine the transfer took a very long time - for the transfer, 2 new logical replication slots are created and data are transferred to the replica by 2 walsender processes.
    • This would generally work well, but we have an emergency cronjob that deletes WAL files that are too old, because in the past we had problems with an almost full disk. And this was the problem I ran into - the emergency cronjob deleted WAL segments that had not yet been transferred to the replica. So in general you need enough disk space to temporarily keep much more WAL than usual, which we previously did not have - but I have changed that. (The query after this list shows how to check how much WAL each replication slot is still holding back.)
  3. Jeremy Finzel from pgsql-general suggested that I should actually replicate the data from the master in a different way - publish and subscribe only one table at a time and give the replica time to sync its data. I did that, and now logical replication works like a charm... (A sketch of that table-by-table approach follows below.)
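To see how much WAL the master has to keep around for each replication slot (and therefore how exposed you are if something removes old WAL files), a query like this against the standard catalog views works on PG 10+; run it on the master:

SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

And here is a sketch of the table-by-table approach from point 3 (the publication name is just an example): add one partition, refresh the subscription, and wait until the table's sync state in pg_subscription_rel on the replica reaches 'r' (ready) before adding the next one:

-- on the master
ALTER PUBLICATION mypublication ADD TABLE mytable_20190115;

-- on the logical replica
ALTER SUBSCRIPTION mysubscription REFRESH PUBLICATION;

-- on the logical replica: per-table sync state
-- ('i' = init, 'd' = copying data, 's' = synchronized, 'r' = ready)
SELECT srrelid::regclass AS table_name, srsubstate FROM pg_subscription_rel;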
