Conversation

@bnestere bnestere commented Oct 15, 2025

The binary log could be corrupted when committing a large transaction
(i.e. one whose data exceeds the binlog_cache_size limit and spills
into a tmp file) in binlog_format=row if the server's --tmp-dir is
full. The resulting corruption is that only the GTID of the errored
transaction is written into the binary log, without any body or
finalizing events. This happens because the content of the transaction
was not flushed at the proper time, so the transaction's binlog cache
data was not durable when the server tried to copy the content from
the binlog cache file into the binary log itself. While switching
the tmp file from a WRITE_CACHE to a READ_CACHE, the server would see
there is still data to flush in the cache, and first try to flush it.
This is not a valid time to flush that data to the temporary file
though, as:

  1. The GTID event has already been written directly to the binary
    log. So if this flushing fails, it leaves the binary log in a
    corrupted state.

  2. This is done during group commit, and will slow down other
    concurrent transactions, which are otherwise ready to commit.

This patch fixes these issues by ensuring all transaction data is
fully flushed to its temporary file (if used) before starting any
critical paths, i.e. in binlog_flush_cache(). Note that if the binlog
cache is solely in-memory, this flush-to-temporary-file is skipped.
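
To make the ordering concrete, here is a minimal illustrative skeleton of
the idea (not the patch itself; the function name is made up, while
IO_CACHE, my_b_get_pos_in_file() and flush_io_cache() are the existing
mysys helpers from my_sys.h):

/*
  Illustrative skeleton only; the real change lives in binlog_flush_cache().
  Step 1 can fail safely: the error surfaces before anything has reached the
  binary log, so the transaction rolls back and the binlog stays intact.
  Once step 2 has happened, any failure leaves a lone GTID in the binlog.
*/
static int binlog_flush_cache_sketch(IO_CACHE *cache)
{
  /* 1. New with this fix: push the spilled cache's in-memory tail to its
        temporary file (skipped when the cache never left memory).          */
  if (my_b_get_pos_in_file(cache) && flush_io_cache(cache))
    return 1;                     /* e.g. ENOSPC because --tmp-dir is full  */

  /* 2. Group-commit critical path: write the GTID event to the binary log. */
  /* 3. Re-open the cache as a READ_CACHE and copy its content across.      */
  /* 4. Write the finalizing event(s).                                      */
  return 0;
}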

This PR is organized as follows:
Commit 1: A regression test that reproduces the bug report
Commit 2: The fix
Any further commits will address review feedback.

The flush itself is expected to happen in
THD::binlog_flush_pending_rows_event(). However, if there is no pending
event, the flush is skipped.
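
With that flush skipped, the buffered tail of a spilled cache only reaches
the temporary file later, during the WRITE_CACHE to READ_CACHE switch
described above. Roughly (an illustrative fragment, not the server source;
reinit_io_cache() is the mysys routine that performs the switch and flushes
any pending write-buffer bytes first):

/* Runs after the GTID event is already in the binary log: re-open the
   binlog cache for reading. reinit_io_cache() first flushes the bytes
   still buffered in memory, and on a full --tmp-dir that flush fails,
   leaving only the lone GTID in the binlog. */
if (reinit_io_cache(cache, READ_CACHE, 0L, 0, 0))
  ... /* too late to fail cleanly */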

@andrelkin andrelkin left a comment

The work looks good. Special thanks for a solid analysis!
I have only a few cosmetic notes.

Returns TRUE on success, FALSE on error.
*/
static my_bool binlog_cache_reconcile_data_in_storage(IO_CACHE *info)

The new function's purpose is clear, but introducing it as a static function feels like overkill...

ready-to-commit (concurrent) transactions could be stalled
*/
if (using_stmt && !thd->binlog_flush_pending_rows_event(TRUE, FALSE) &&
binlog_cache_reconcile_data_in_storage(
@andrelkin andrelkin Nov 13, 2025

... so instead, my_b_get_pos_in_file(info) && flush_io_cache(info) would be fair enough at this point, as a new inline function; why not flush_pending_bytes(info)? Having "flush" in the name reads naturally and keeps it consistent with the names of the caller and its sibling.
The suggested name also calls for (much) shorter comments than your static function's header has.
Indeed, the whole point of the caller function is to get everything to disk; as this bug shows, that just happens at a few levels.
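
For concreteness, a minimal sketch of the inline helper this suggestion
describes (the name flush_pending_bytes comes from the comment above; the
body and return convention are assumptions matching the replaced function's
"TRUE on success" contract, not the merged code):

/*
  Sketch only. Flush any bytes still buffered in memory to the binlog
  cache's temporary file, but only if the cache has already spilled to
  disk. Returns TRUE on success, FALSE on error; note that flush_io_cache()
  itself returns 0 on success.
*/
static inline my_bool flush_pending_bytes(IO_CACHE *info)
{
  if (!my_b_get_pos_in_file(info))
    return TRUE;                     /* purely in-memory cache: nothing to do */
  return flush_io_cache(info) == 0;  /* 0 from flush_io_cache() means success */
}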
