    62049898
    Bug #31008713 NDBD SPORADICALLY LOSES CONNECTION TO OTHER NODES IN AUTOTEST · 62049898
    Frazer Clement authored
    For cluster shutdown, or ALL STOP / ALL RESTART management actions,
    it is possible for different nodes to attempt to stop on different
    GCI boundaries.
    
    If they succeed in stopping on different GCIs, then a following
    System Restart (SR) will be slower, as the nodes with an earlier stop
    GCI must undergo a 'takeover' process as part of SR.
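
As a rough illustration of the point above, the sketch below picks out which nodes would need takeover given each node's stop GCI. The function name, the dictionary layout, and the GCI values are hypothetical and are not taken from the NDB source; the sketch only encodes the statement that nodes with an earlier stop GCI must undergo takeover.

```python
def nodes_needing_takeover(stop_gci_by_node):
    """Return the ids of nodes whose stop GCI is older than the newest stop GCI.

    stop_gci_by_node: dict mapping a node id to the GCI that node stopped in
    (a hypothetical representation, used here for illustration only).
    """
    if not stop_gci_by_node:
        return []
    newest = max(stop_gci_by_node.values())
    # Nodes that stopped before the newest GCI are behind the others.
    return sorted(node for node, gci in stop_gci_by_node.items() if gci < newest)


# Hypothetical example: nodes 1 and 2 stopped one GCI earlier than nodes 3 and 4.
example = {1: 100, 2: 100, 3: 101, 4: 101}
print(nodes_needing_takeover(example))  # prints [1, 2]
```

If every node stops in the same GCI, the returned list is empty, which is the outcome the extra shutdown synchronisation aims to make more likely.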
    
    Alternatively, if the set of nodes failing on the first GCI boundary
    makes the surviving nodes non-viable, then the surviving nodes suffer
    an arbitration failure.
    
    This arbitration failure has the positive effect of causing them to
    'stop' in the correct GCI, but the negative effect of appearing to
    be a bug / testcase failure / being ugly.
    
    A testcase is added to testSystemRestart which delays shutdown to
    show the problem.
    
    Extra synchronisation is added to 'graceful' shutdown to reduce
    the chance that different data nodes attempt to shut down in
    different GCIs.
    
    This should avoid spurious arbitration errors during shutdowns/
    restarts (perhaps more common with larger clusters) and also
    potentially reduce the use of Takeover during SR.
    
    Approved by : Maitrayi Sabaratnam <maitrayi.sabaratnam@oracle.com>