Data inconsistency (missing records) during node restart despite synchronous_commit = 'on' / remote_write #309

@umutoguz

Description

I am running a two-node active-active setup using pgactive on PostgreSQL 18. When one of the nodes is restarted under load, the nodes drift apart (records go missing on the restarted node), even though synchronous replication is configured.
Specifically, when I stop the target node (Node B) with systemctl stop, Node A blocks as expected while it waits for the synchronous standby. After Node B is restarted, however, I consistently find 1 or 2 records that exist on Node A but are missing on Node B, with synchronous_commit set to either 'on' or 'remote_write'.
The drift disappears only when I switch to synchronous_commit = 'remote_apply'. I would expect 'on' (remote flush) or 'remote_write' to prevent data loss and drift as well, since Node A receives an ACK before the commit returns.

Environment

  • PostgreSQL Version: 18.1
  • Extension: pgactive 2.1.7
  • OS: RHEL / Linux
  • Setup: 2 Nodes (Bidirectional Replication)

Configuration (Node A)
ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (*)';
ALTER SYSTEM SET synchronous_commit = 'on'; -- Issue also persists with 'remote_write'

Steps to Reproduce

  • Set up bidirectional replication between Node A and Node B.
  • Enable synchronous replication as shown above.
  • Start a workload on Node A using pgbench (inserts).
    pgbench -h ... -c 50 -j 4 -T 15000 -R 15000 ...
  • While the workload is running, perform a graceful shutdown on Node B:
    systemctl stop postgresql-18
  • Node A correctly blocks/hangs (waiting for sync standby).
  • Start Node B again (systemctl start ...).
  • Compare row counts between Node A and Node B.
    • Result: Node A count > Node B count (usually 1 or 2 records difference).
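
Row counts show that the nodes differ but not which rows are missing. A minimal sketch for pinpointing them, assuming each node's key values have been exported to a text file, one per line. The host names, the table name `my_table`, and the key column `id` in the comments are hypothetical placeholders, not taken from this setup:

```shell
#!/bin/sh
# Hypothetical export step (hosts, table, and key column are placeholders):
#   psql -h node-a -At -c "SELECT id FROM my_table" > node_a_ids.txt
#   psql -h node-b -At -c "SELECT id FROM my_table" > node_b_ids.txt

# Print keys that appear in the first file but not in the second.
missing_on_b() {
    tmp_a=$(mktemp) || return 1
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 1; }
    # comm needs both inputs sorted under the same collation, so pin the locale.
    LC_ALL=C sort -u "$1" > "$tmp_a"
    LC_ALL=C sort -u "$2" > "$tmp_b"
    # -2 drops lines unique to the second file, -3 drops lines common to both.
    LC_ALL=C comm -23 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}
```

Running `missing_on_b node_a_ids.txt node_b_ids.txt` after the restart would list the keys of the rows that exist only on Node A.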

Logs & Analysis

It appears that the apply worker on Node B is terminated during shutdown before it commits the in-flight transaction locally, even though the corresponding WAL has already been flushed and ACKed to Node A.

Logs from Node B (Shutdown time):

LOG: received fast shutdown request
LOG: aborting any active transactions
FATAL: terminating background worker "pgactive apply worker" due to administrator command
...
LOG: background worker "pgactive apply worker" ... exited with exit code 1

Logs from Node B (Startup time):

The node recovers, but it seems to skip the transaction that was aborted at shutdown, which Node A considers committed:
LOG: B/9DAFF248 has been already streamed, forwarding to B/9F3F43F0
DETAIL: Streaming transactions committing after B/9F3F43F0...
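
To gauge how much WAL lies between the two positions in that message, the LSNs can be converted to byte offsets and subtracted. A small sketch; the helper names `lsn_to_bytes` and `lsn_gap` are made up for illustration, and the same result is available in SQL by subtracting two `pg_lsn` values:

```shell
#!/bin/sh
# Convert an LSN of the form HI/LO (two hex numbers) to a byte offset.
lsn_to_bytes() {
    hi=${1%%/*}
    lo=${1##*/}
    echo $(( (0x$hi << 32) + 0x$lo ))
}

# Bytes of WAL between two LSNs (second minus first).
lsn_gap() {
    echo $(( $(lsn_to_bytes "$2") - $(lsn_to_bytes "$1") ))
}

lsn_gap B/9DAFF248 B/9F3F43F0   # prints 26169768 (about 25 MiB)
```

In stock PostgreSQL this message is emitted when the requested start position is older than the replication slot's confirmed flush position; that pgactive behaves the same way here is an assumption. A gap of this size under a heavy insert load covers many transactions, so the message alone does not identify which ones were skipped.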

Question / Expected Behavior

Is this expected behavior for logical replication, where synchronous_commit = 'on' guarantees only that the change has been flushed on the subscriber side, not that it has been applied?
With physical replication, 'on' is usually sufficient, because a standby replays any flushed WAL after a restart. In this pgactive logical setup, the transaction seems to be rolled back on the subscriber during shutdown, so the nodes diverge: Node A considers the transaction committed, but Node B has lost it.
Does pgactive strictly require synchronous_commit = 'remote_apply' to avoid data loss across restarts, or can the apply worker's shutdown be handled more gracefully to prevent this drift?

Labels: bug
