From 5fb8db6887a3089deec911a0c1078801f3acc74b Mon Sep 17 00:00:00 2001
From: Kamil Holubicki
Date: Fri, 15 May 2026 10:16:08 +0200
Subject: [PATCH] PS-11137 fix: binlog-server fails to reconnect in
 position-based replication mode when network timeout interrupts a
 transaction

https://perconadev.atlassian.net/browse/PS-11137

Always trim the in-memory event buffer to the last completed transaction
boundary on connection termination, regardless of the replication mode
(GTID-based or position-based).

Previously, 'storage::discard_incomplete_transaction_events()' was invoked
only when 'storage::is_in_gtid_replication_mode()' returned true, based on
the assumption that in position-based mode it is safe to resume streaming
from an arbitrary mid-transaction byte offset.

This assumption turned out to be incorrect: on reconnect, 'reader_context'
always expects the first logical event delivered after the pseudo-preamble
(FDE + optional rotate) to be one of 'anonymous_gtid_log' / 'gtid_log' /
'gtid_tagged_log' (see
'reader_context::process_event_in_gtid_log_expected_state').

If a 'mysql_binlog_fetch()' timeout / network error fires after the source
delivered 'anonymous_gtid_log' but before delivering the corresponding
'BEGIN' query event, the persisted stream offset would point at 'BEGIN',
and the next 'mysql_binlog_open()' attempt would resume the stream
mid-transaction, immediately triggering the "expected gtid_log_event-like
event" assertion in 'reader_context' and terminating PBS.

Calling 'discard_incomplete_transaction_events()' unconditionally rewinds
the in-memory buffer (and therefore the offset persisted to the metadata
file) back to the previous transaction boundary, so reconnection in
position-based mode always restarts the stream from a position where the
next event is a GTID-style event, matching the reader state machine
expectations.
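
The failure scenario and the effect of the unconditional trim can be
illustrated with a minimal, self-contained sketch. It is not the PBS
implementation: 'toy_event_buffer', 'event_kind', 'buffered_event',
'make_interrupted_buffer()' and all offsets are hypothetical names and values
chosen for illustration, and the sketch assumes a transaction counts as
complete once its commit event has been buffered.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified event kinds (not the real binlog event codes).
enum class event_kind { gtid_like, query_begin, rows, xid_commit };

struct buffered_event {
  event_kind kind;
  std::uint64_t end_offset;  // stream offset immediately after this event
};

// Toy stand-in for the in-memory event buffer (assumption: the real buffer
// lives inside 'storage' and is more involved).
class toy_event_buffer {
public:
  explicit toy_event_buffer(std::uint64_t start_offset)
      : start_offset_(start_offset) {}

  void push(event_kind kind, std::uint64_t end_offset) {
    assert(end_offset > resume_offset());  // offsets only grow
    events_.push_back({kind, end_offset});
    // Assumption: a commit event closes the current transaction.
    if (kind == event_kind::xid_commit) {
      complete_count_ = events_.size();
    }
  }

  // Drop every event that follows the last completed transaction.
  void discard_incomplete_transaction_events() {
    events_.erase(
        events_.begin() + static_cast<std::ptrdiff_t>(complete_count_),
        events_.end());
  }

  // Offset that would be persisted and used for the next reconnect.
  std::uint64_t resume_offset() const {
    return events_.empty() ? start_offset_ : events_.back().end_offset;
  }

  std::size_t size() const { return events_.size(); }

private:
  std::uint64_t start_offset_;
  std::vector<buffered_event> events_;
  std::size_t complete_count_ = 0;
};

// Replays the scenario from this commit message: one full transaction, then
// a timeout right after the next transaction's GTID-like event.
inline toy_event_buffer make_interrupted_buffer() {
  toy_event_buffer buffer(4);  // arbitrary illustrative start offset
  buffer.push(event_kind::gtid_like, 100);
  buffer.push(event_kind::query_begin, 180);
  buffer.push(event_kind::rows, 260);
  buffer.push(event_kind::xid_commit, 291);
  buffer.push(event_kind::gtid_like, 370);  // connection drops after this
  return buffer;
}
```

Without the trim, the toy buffer would persist offset 370, where the next
event on the wire is 'BEGIN'. After
'discard_incomplete_transaction_events()' it persists 291, where the next
event is the GTID-like event the reader expects.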
---
 src/app.cpp | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/src/app.cpp b/src/app.cpp
index c47bffc..7815ffc 100644
--- a/src/app.cpp
+++ b/src/app.cpp
@@ -908,16 +908,11 @@ void receive_binlog_events(
         "fetch operation did not reach EOF reading binlog events");
   }
 
-  // in GTID-based replication mode we also need to discard some data in the
-  // transaction event buffer to make sure that upon reconnection we will
-  // continue operation from the transaction boundary
-
-  // in position-based replication mode this is not needed as it is not a
-  // problem to resume streaming from a position that does not correspond to
-  // transaction boundary
-  if (storage.is_in_gtid_replication_mode()) {
-    storage.discard_incomplete_transaction_events();
-  }
+  // Truncate the in-memory event buffer to the last completed transaction so
+  // the persisted stream offset matches a transaction boundary. On reconnect,
+  // reader_context always expects the first logical event after the
+  // pseudo-preamble to be anonymous_gtid_log / gtid_log / gtid_tagged_log
+  storage.discard_incomplete_transaction_events();
 
   // connection termination is a good place to flush any remaining data
   // in the event buffer - this can be considered the third kind of