storage/ndb/test/run-test/daily-devel--07-tests.txt · a7ea56eadbd58091cfd509e98b735f5c48aace8e · Rasoul Jahanshahi / Mysql Server

May 15, 2020

Bug #31008713 NDBD SPORADICALLY LOSES CONNECTION TO OTHER NODES IN AUTOTEST · 62049898

Frazer Clement authored May 15, 2020

For cluster shutdown, or ALL STOP / ALL RESTART management actions,
it is possible for different nodes to attempt to stop on different
GCI boundaries.

If they succeed in stopping on different GCIs, then a following
System Restart (SR) will be slower, as the nodes with an earlier stop
GCI must undergo a 'takeover' process as part of SR.
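
As a rough illustration of the point above, the sketch below takes the GCI
at which each data node stopped and lists the nodes that stopped earlier
than the rest. The function name, the node ids and the GCI values are all
hypothetical, and treating the newest stop GCI as the reference point is an
assumption made for the example; this is not NDB code.

```python
def nodes_needing_takeover(stop_gci_by_node):
    """Return (newest_stop_gci, nodes that stopped on an earlier GCI).

    stop_gci_by_node maps a (hypothetical) node id to the GCI at which
    that node stopped. Assumption: nodes behind the newest stop GCI are
    the ones that would need 'takeover' during System Restart.
    """
    newest = max(stop_gci_by_node.values())
    lagging = sorted(
        node for node, gci in stop_gci_by_node.items() if gci < newest
    )
    return newest, lagging


# Made-up example: node 3 stopped one GCI boundary earlier than the others.
newest, lagging = nodes_needing_takeover({1: 1042, 2: 1042, 3: 1041, 4: 1042})
print(newest, lagging)
```

If every node stops on the same GCI, the returned list is empty, which
corresponds to the faster restart case.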

Alternatively, if the set of nodes failing on the first GCI boundary
makes the surviving nodes non-viable, then the surviving nodes suffer
an arbitration failure.

This arbitration failure has the positive effect of causing them to
'stop' in the correct GCI, but the negative effect of appearing to
be a bug / testcase failure / being ugly.

A testcase is added to testSystemRestart which delays shutdown to
show the problem.

Extra synchronisation is added to 'graceful' shutdown to reduce
the chance that different data nodes attempt to shut down in
different GCIs.
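
The message does not describe how the extra synchronisation works. Purely
as a conceptual sketch, the code below contrasts nodes that each stop on
the GCI boundary they happen to see locally with nodes that first agree on
a single stop GCI. Picking the highest GCI seen by any node is an
assumption made for the example, and all names and values are
hypothetical; the actual change in the NDB code may work quite differently.

```python
def unsynchronised_stop(gci_seen_by_node):
    """Each node stops on whatever GCI boundary it sees locally."""
    return dict(gci_seen_by_node)


def synchronised_stop(gci_seen_by_node):
    """All nodes stop on one agreed GCI.

    Assumption for illustration only: the agreed GCI is the highest one
    any node has seen, so that no node is asked to stop in its past.
    """
    agreed = max(gci_seen_by_node.values())
    return {node: agreed for node in gci_seen_by_node}


# Made-up example: node 2 receives the stop request slightly later.
seen = {1: 500, 2: 501, 3: 500}
print(unsynchronised_stop(seen))  # nodes end up on different GCIs
print(synchronised_stop(seen))    # every node ends up on the same GCI
```

With a single agreed stop GCI, no node is left behind on an earlier
boundary, which is the situation the extra synchronisation aims to make
more likely.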

This should avoid spurious arbitration errors during shutdowns/
restarts (perhaps more common with larger clusters) and also
potentially reduce the use of Takeover during SR.

Approved by : Maitrayi Sabaratnam <maitrayi.sabaratnam@oracle.com>
