Frazer Clement authored
The LCP Scan Frag watchdog and GCP Monitor can both decide to exclude a node if it is too slow when participating in these protocols.

Currently the exclusion is implemented by asking the failing node to shut down. This allows it to first log some debugging information and shut down with a clear failure cause. However, in some situations it may be slow to shut down, prolonging the GCP/LCP stall for the other, unaffected nodes.

To minimise this time, this fix adds an isolation mechanism which causes the other live nodes to forcibly disconnect the failing node after some delay. This gives the failing node the chance to shut down with debugging info and a good message if possible, but limits the time the others must wait for this to occur. Once the live nodes have processed the disconnection of the failing node, they can commence failure handling and restart the protocol(s). Even if the failed node takes a long time to shut down, the others can proceed with processing.

The GcpMonitor and the Lcp Scan Fragment watchdog are enhanced to make use of this mechanism.

Three new testcases are added:
  1. GcpStop
     Testing of GcpStop handling in normal cases
  2. GcpStopIsolation
     Testing of GcpStop self-shutdown failure, so that isolation is required
  3. LcpScanFragWatchdogIsolation
     Testing of the Lcp Scan Fragment watchdog where isolation is required

These are added to the daily-devel test suite.

Additionally:

Bug #20128256 NDB : GCP STOP MONITOR HAS ONLY ONE BULLET

This bug was discovered while testing (GcpStop testcase). The GCP Monitor did not continue operation after detecting a GCP stop. This is fixed so that it does continue operation after detecting a GCP stop, and this is tested by both the GcpStop and GcpStopIsolation testcases (where the Master node is not a victim and must detect and handle multiple separate GCP stop events).
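The following is a minimal, self-contained C++ sketch of the delayed-isolation idea described above: the live nodes first ask the failing node to shut itself down, then forcibly disconnect it if it is still connected after a grace period. It is not the NDB implementation; all identifiers (Node, requestShutdown, isolateAfterDelay) and the 2-second delay are hypothetical illustrations.

  // Hypothetical sketch of "request shutdown, then isolate after a delay".
  // Names and the delay value are illustrative only, not NDB-internal.
  #include <chrono>
  #include <cstdio>
  #include <thread>

  struct Node {
    int id;
    bool connected = true;
    bool shutdownRequested = false;
  };

  // Step 1: ask the failing node to shut itself down, so it can log
  // debugging information and exit with a clear failure cause.
  void requestShutdown(Node &failing) {
    failing.shutdownRequested = true;
    std::printf("Node %d asked to shut down (too slow in GCP/LCP protocol)\n",
                failing.id);
  }

  // Step 2: if the failing node is still connected once the grace period has
  // passed, the live nodes forcibly disconnect (isolate) it, so that failure
  // handling can begin and the stalled protocol can restart without waiting
  // indefinitely for the slow shutdown to complete.
  void isolateAfterDelay(Node &failing, std::chrono::seconds gracePeriod) {
    std::this_thread::sleep_for(gracePeriod);
    if (failing.connected) {
      failing.connected = false;
      std::printf("Node %d forcibly disconnected after %lld s (isolation)\n",
                  failing.id, static_cast<long long>(gracePeriod.count()));
    }
  }

  int main() {
    Node slowNode{7};                 // a node deemed too slow by a watchdog
    requestShutdown(slowNode);
    isolateAfterDelay(slowNode, std::chrono::seconds(2)); // illustrative delay
    std::printf("Live nodes commence failure handling for node %d\n",
                slowNode.id);
    return 0;
  }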