sql/ndb_conflict.h · mysql-cluster-7.2.32 · Rasoul Jahanshahi / Mysql Server

Jun 04, 2014

Bug #18465747 NDB REPLICATION : MONITOR FOR UNEXPECTED EPOCH SKIPPING · 567801c7

Frazer Clement authored Jun 04, 2014

This patch modifies the Ndb replication slave layer code to monitor for
epoch skipping behaviour.

Specifically, there have been issues where the generic replication layer
slave retry-on-temp-error code has not functioned correctly, so that an
epoch transaction which encountered a temporary error was skipped
entirely instead of being retried.

Retry-on-temp-error is critical to replication correctness, and is
explicitly relied upon to implement multi-pass apply when using
transactional conflict detection.

To avoid this situation recurring in future, the Ndb slave code is
modified here to check that every epoch which is started (identified
by an ndb_apply_status write_row event) is completed before a 
new epoch is started.

The exception to the rule occurs when the Slave SQL thread is stopped 
and restarted (and hence a CHANGE MASTER could occur).
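The rule and its exception can be sketched as a small state machine.
This is only an illustration, not the code in this patch: EpochMonitor,
EpochCheck and all member names are hypothetical, and it assumes that a
retry of a failed epoch is seen as the same epoch number starting again.

```cpp
#include <cstdint>

// Hypothetical sketch; none of these names come from ndb_conflict.h.
enum class EpochCheck { Ok, UnfinishedEpochSkipped };

class EpochMonitor {
public:
  // Called when an ndb_apply_status write_row event marks an epoch start.
  EpochCheck on_epoch_start(std::uint64_t epoch) {
    // A different epoch starting while the previous one has not
    // committed means the previous epoch was skipped, not retried.
    if (have_epoch_ && !committed_ && epoch != current_epoch_) {
      return EpochCheck::UnfinishedEpochSkipped;
    }
    // First epoch, a retry of the same epoch, or a new epoch after commit.
    have_epoch_ = true;
    committed_ = false;
    current_epoch_ = epoch;
    return EpochCheck::Ok;
  }

  // Called when the epoch transaction has been committed on the slave.
  void on_epoch_commit() { committed_ = true; }

  // Called when the Slave SQL thread is stopped and restarted; a
  // CHANGE MASTER may have happened, so earlier state is discarded.
  void on_slave_restart() {
    have_epoch_ = false;
    committed_ = false;
  }

private:
  bool have_epoch_ = false;
  bool committed_ = false;
  std::uint64_t current_epoch_ = 0;
};
```

In this sketch a failed check leaves the state unchanged, on the
assumption that the caller stops the slave with an error at that point.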

This gives some protection against replication layer errors, and helps
avoid data corruption and other symptoms which would otherwise only
appear later, further downstream, where they are harder to debug.

This can be considered an extension of the existing check for epoch
decline.
