Frazer Clement authored
The LCP Scan Frag watchdog and GCP Monitor can both decide to exclude a node if it is too slow when participating in these protocols.

Currently the exclusion is implemented by asking the failing node to shut down. This allows it to first log some debugging information and shut down with a clear failure cause. However, in some situations it may be slow to shut down, prolonging the GCP/LCP stall for the other, unaffected nodes.

To minimise this time, this fix adds an isolation mechanism which causes the other live nodes to forcibly disconnect the failing node after some delay. This gives the failing node the chance to shut down with debugging info and a good message if possible, but limits the time the others must wait for this to occur. Once the live nodes have processed the disconnection of the failing node, they can commence failure handling and restart the protocol(s). Even if the failed node takes a long time to shut down, the others can proceed with processing.

The GcpMonitor and the Lcp Scan Fragment watchdog are enhanced to make use of this mechanism.

Three new testcases are added:
  1. GcpStop
     Testing of GcpStop handling in normal cases
  2. GcpStopIsolation
     Testing of GcpStop self-shutdown failure, so that isolation is required
  3. LcpScanFragWatchdogIsolation
     Testing of the Lcp Scan Fragment watchdog where isolation is required

These are added to the daily-devel test suite.

Additionally:

Bug #20128256 NDB : GCP STOP MONITOR HAS ONLY ONE BULLET

This bug was discovered while testing (GcpStop testcase). The GCP Monitor did not continue operation after detecting a GCP stop. This is fixed so that it does continue operation after detecting a GCP stop, and this is tested by both the GcpStop and GcpStopIsolation testcases (where the Master node is not a victim and must detect and handle multiple separate GCP stop events).
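The following is a minimal, self-contained C++ sketch of the delayed-isolation idea described above: the live nodes first ask the failing node to shut itself down, then forcibly disconnect it if it is still connected after a grace period. It is not the NDB implementation; all identifiers (Node, requestShutdown, isolateAfterDelay) and the 2-second delay are hypothetical illustrations.

  // Hypothetical sketch of "request shutdown, then isolate after a delay".
  // Names and the delay value are illustrative only, not NDB-internal.
  #include <chrono>
  #include <cstdio>
  #include <thread>

  struct Node {
    int id;
    bool connected = true;
    bool shutdownRequested = false;
  };

  // Step 1: ask the failing node to shut itself down, so it can log
  // debugging information and exit with a clear failure cause.
  void requestShutdown(Node &failing) {
    failing.shutdownRequested = true;
    std::printf("Node %d asked to shut down (too slow in GCP/LCP protocol)\n",
                failing.id);
  }

  // Step 2: if the failing node is still connected once the grace period has
  // passed, the live nodes forcibly disconnect (isolate) it, so that failure
  // handling can begin and the stalled protocol can restart without waiting
  // indefinitely for the slow shutdown to complete.
  void isolateAfterDelay(Node &failing, std::chrono::seconds gracePeriod) {
    std::this_thread::sleep_for(gracePeriod);
    if (failing.connected) {
      failing.connected = false;
      std::printf("Node %d forcibly disconnected after %lld s (isolation)\n",
                  failing.id, static_cast<long long>(gracePeriod.count()));
    }
  }

  int main() {
    Node slowNode{7};                 // a node deemed too slow by a watchdog
    requestShutdown(slowNode);
    isolateAfterDelay(slowNode, std::chrono::seconds(2)); // illustrative delay
    std::printf("Live nodes commence failure handling for node %d\n",
                slowNode.id);
    return 0;
  }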