    Bug #18465747 NDB REPLICATION : MONITOR FOR UNEXPECTED EPOCH SKIPPING · 567801c7
    Frazer Clement authored
    This patch modifies the Ndb replication slave layer code to monitor for
    epoch skipping behaviour.
    
    Specifically, there have been cases where the generic replication
    layer's slave retry-on-temp-error code did not function correctly, so
    that an epoch transaction which hit a temporary error was skipped
    entirely instead of being retried.
    
    Retry-on-temp-error is critical to replication correctness, and is
    explicitly relied upon to implement multi-pass apply when using
    transactional conflict detection.
    
    To avoid this situation recurring in future, the Ndb slave code is
    modified here to check that every epoch which is started (identified
    by an ndb_apply_status write_row event) is completed before a 
    new epoch is started.
    
    The exception to the rule occurs when the Slave SQL thread is stopped 
    and restarted (and hence a CHANGE MASTER could occur).
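The check described above can be illustrated with a minimal sketch. This is not the code in this patch: the class, enum, and method names are hypothetical, and it assumes that a retried epoch shows up as a second start event carrying the same epoch number, which is accepted rather than flagged.

```cpp
#include <cstdint>

// Hypothetical illustration only; none of these names come from the Ndb slave code.
enum class EpochCheck { Ok, SkippedEpoch };

class EpochMonitor {
 public:
  // Called when an ndb_apply_status write_row event marks the start of an epoch.
  EpochCheck on_epoch_start(std::uint64_t epoch) {
    // Assumption: a retry re-applies the same epoch, so seeing the in-flight
    // epoch start again is not an error.
    if (in_flight_ && current_epoch_ != epoch) {
      // The previous epoch was started but never completed: it was skipped.
      return EpochCheck::SkippedEpoch;
    }
    in_flight_ = true;
    current_epoch_ = epoch;
    return EpochCheck::Ok;
  }

  // Called when the epoch transaction has been completed.
  void on_epoch_commit() { in_flight_ = false; }

  // Called when the Slave SQL thread is stopped and restarted. A CHANGE MASTER
  // could have occurred, so the previously started epoch is forgotten.
  void on_sql_thread_restart() { in_flight_ = false; }

 private:
  bool in_flight_ = false;
  std::uint64_t current_epoch_ = 0;
};
```

In this sketch, starting epoch 11 while epoch 10 is still in flight returns SkippedEpoch, whereas the same sequence with a commit or an SQL thread restart in between returns Ok.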
    
    This gives some protection against replication layer errors, and
    avoids data corruption and later, harder-to-debug downstream symptoms.
    
    This can be considered an extension of the existing check for epoch
    decline.