Data inconsistency (missing records) during node restart despite synchronous_commit = 'on' / remote_write

Description
I am running a two-node active-active setup using pgactive on PostgreSQL 18. When one node is restarted while the other is under load, the two nodes drift apart (rows go missing on the restarted node), even though synchronous replication is configured.
Specifically, when I stop the target node (Node B) with systemctl stop, commits on the primary node (Node A) block as expected. After Node B is restarted, however, I consistently find 1 or 2 records that exist on Node A but are missing on Node B, with synchronous_commit set to either 'on' or 'remote_write'.
The drift disappears only when I switch to synchronous_commit = 'remote_apply'. I would have expected 'on' (remote flush) or 'remote_write' to be enough to prevent data loss or drift, since the primary does not return from COMMIT until it has received an ACK from Node B.

Environment
 * PostgreSQL Version: 18.1
 * Extension: pgactive 2.1.7
 * OS: RHEL / Linux
 * Setup: 2 Nodes (Bidirectional Replication)

Configuration (Node A)
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (*)';
ALTER SYSTEM SET synchronous_commit = 'on'; -- Issue also persists with 'remote_write'
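
For comparison, the only configuration under which the drift did not appear (per the description above) can be sketched as follows. The pg_reload_conf() call is an assumption about how the change is activated; a restart would pick it up as well.

```sql
-- Workaround observed in testing: wait until Node B has applied the commit,
-- not merely received or flushed it.
ALTER SYSTEM SET synchronous_commit = 'remote_apply';

-- Assumption: activate via reload; synchronous_commit does not require a restart.
SELECT pg_reload_conf();

-- Check the active values (run in a new session once the reload has been processed).
SELECT name, setting
FROM pg_settings
WHERE name IN ('synchronous_commit', 'synchronous_standby_names');
```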

Steps to Reproduce

 * Set up bidirectional replication between Node A and Node B.
 * Enable synchronous replication as shown above.
 * Start a workload on Node A using pgbench (inserts).
   pgbench -h ... -c 50 -j 4 -T 15000 -R 15000 ...
 * While the workload is running, perform a graceful shutdown on Node B:
   systemctl stop postgresql-18
 * Node A correctly blocks/hangs (waiting for sync standby).
 * Start Node B again (systemctl start ...).
 * Compare row counts between Node A and Node B.
   * Result: Node A count > Node B count (usually 1 or 2 records difference).
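
The comparison in the last step can be as simple as the query below, run on each node. The table name pgbench_history is an assumption (it is the table that pgbench's built-in scripts insert into); substitute whatever table the workload actually writes to.

```sql
-- Run on Node A and on Node B after Node B has restarted and caught up,
-- then compare the two results.
SELECT count(*) AS row_count
FROM pgbench_history;  -- assumed table name, see note above
```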
   
Logs & Analysis

It appears that the apply worker on Node B is terminated during shutdown before it commits the in-flight transaction locally, even though Node B has already reported the corresponding WAL position to Node A as flushed (ACKed).
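
One way to observe the gap this analysis relies on is to look at the positions Node B reports back to Node A. The sketch below uses the standard pg_stat_replication view; whether the pgactive apply worker reports its flush and replay positions separately is an assumption that would need to be checked.

```sql
-- Run on Node A while the workload is active.
-- synchronous_commit = 'on' waits for flush_lsn of the synchronous standby;
-- 'remote_apply' waits for replay_lsn.
SELECT application_name,
       state,
       sync_state,
       write_lsn,
       flush_lsn,
       replay_lsn,
       pg_wal_lsn_diff(flush_lsn, replay_lsn) AS flushed_but_not_applied_bytes
FROM pg_stat_replication;
```

A non-zero value in the last column at the moment of shutdown would be consistent with a transaction that was acknowledged but not yet applied.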

Logs from Node B (Shutdown time):

LOG:  received fast shutdown request
LOG:  aborting any active transactions
FATAL:  terminating background worker "pgactive apply worker" due to administrator command
...
LOG:  background worker "pgactive apply worker" ... exited with exit code 1

Logs from Node B (Startup time):

The node recovers, but it seems to skip the uncommitted (aborted) transaction that Node A believes was successful:
LOG:  B/9DAFF248 has been already streamed, forwarding to B/9F3F43F0
DETAIL:  Streaming transactions committing after B/9F3F43F0...
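
To quantify the jump in the log lines above and to compare the positions each side has recorded, something like the following could be run. The two LSN literals are copied from the log. Reading the first as the position the downstream asked to resume from and the second as the slot's confirmed flush position is an interpretation of the message, not something the log states outright.

```sql
-- Distance in bytes between the requested and the forwarded start position.
SELECT pg_wal_lsn_diff('B/9F3F43F0', 'B/9DAFF248') AS skipped_bytes;

-- On the sending node: what each logical slot considers confirmed.
SELECT slot_name, active, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_type = 'logical';

-- On the receiving node: how far each replication origin has been applied.
SELECT external_id, remote_lsn, local_lsn
FROM pg_replication_origin_status;
```

If confirmed_flush_lsn on the sender is ahead of remote_lsn on the receiver after the restart, the changes in between are the candidates for the missing rows.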

Question / Expected Behavior

Is this expected behavior for logical replication, i.e. does synchronous_commit = 'on' only guarantee that the subscriber has flushed what it received, and not that it has applied the data?
In a physical replication scenario, 'on' is usually sufficient. In this pgactive logical setup, the transaction seems to be rolled back on the subscriber during shutdown, leaving the nodes diverged: the primary considers the transaction committed, but the subscriber never applied it.
Does pgactive require synchronous_commit = 'remote_apply' strictly for zero-data-loss during restarts, or is there a way to handle the "Apply Worker" shutdown more gracefully to prevent this drift?


