Frazer Clement authored
The previous debugging push gave enough context to understand the intermittent failures with this testcase.

The testcase restores a Master backup on a Slave, then uses the restore point epoch to resume replication. It checks that the backup restore point and the binlog content following this point are mutually exclusive, based on the number of inserted rows restored by each.

However, the logic to set the binlog position from the restore epoch assumed that there would always be a following epoch in the master's ndb_binlog_index file which could be used to set the position. When this was not the case, the query returned no results, the previous settings of the file and position variables were used, and the outcome was non-deterministic. Presumably this occurs when the backup runs slowly on slow hosts, so that its restore point falls after all of the changes (which cause epochs to be recorded) have finished.

This problem is fixed by improving the logic to set the binlog position to handle three scenarios (see the SQL sketch below):

  a) An epoch record exists for the restore point
     (probably never hit for a restore, but common for channel cutover)
  b) An epoch record exists after the restore point
     (common when there is ongoing activity on the master)
  c) No epoch record exists after the restore point, but one does
     exist before it (common when the master is quiet)

So if the backup completion point is after the end of the changes, we go to case c) and choose a binlog position after the backup.

To handle the case where there is nothing to restore from the log, we map a max row value of NULL from the log to 200.

Approved by: Priyanka Sangam <priyanka.sangam@oracle.com>
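A minimal sketch of the improved position selection described above, as SQL against mysql.ndb_binlog_index. It assumes the standard File/Position/epoch columns plus the next_file/next_position columns present in newer versions of the table; @restore_epoch and the other variable names are illustrative, and cases a) and b) collapse into a single ">=" lookup:

    -- @restore_epoch: the restore point epoch reported by ndb_restore
    SET @next_file = NULL, @next_pos = NULL,
        @prev_file = NULL, @prev_pos = NULL;

    -- Cases a) and b): first epoch record at or after the restore point.
    SELECT SUBSTRING_INDEX(File, '/', -1), Position
      INTO @next_file, @next_pos
      FROM mysql.ndb_binlog_index
     WHERE epoch >= @restore_epoch
     ORDER BY epoch ASC LIMIT 1;

    -- Case c): last epoch record before the restore point; start from the
    -- position just past it (next_file / next_position).
    SELECT SUBSTRING_INDEX(next_file, '/', -1), next_position
      INTO @prev_file, @prev_pos
      FROM mysql.ndb_binlog_index
     WHERE epoch < @restore_epoch
     ORDER BY epoch DESC LIMIT 1;

    -- Prefer a record at or after the restore point; otherwise fall back.
    SET @binlog_file = IFNULL(@next_file, @prev_file),
        @binlog_pos  = IFNULL(@next_pos,  @prev_pos);

The selected file and position would then be substituted into CHANGE MASTER TO, e.g. via eval in a mysqltest script, since CHANGE MASTER requires literal values.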