    Bug #19858151 MAKING DATA NODE SELF-EXCLUDE MECHANISM MORE ROBUST · 2f11308e
    Frazer Clement authored
    The LCP Scan Frag watchdog and GCP Monitor can both decide to 
    exclude a node if it is too slow when participating in these
    protocols.
    
    Currently the exclusion is implemented by asking the failing node
    to shut down.
    
    This allows it to first log some debugging information, and 
    shut down with a clear failure cause.
    
    However, in some situations it may be slow to shut down, prolonging
    the GCP/LCP stall for the other, unaffected nodes.
    
    To minimise this time, this fix adds an isolation mechanism which
    causes the other live nodes to forcibly disconnect the failing
    node after some delay.
    
    This gives the failing node the chance to shut down with debugging
    info and a good message if possible, but limits the time the others
    must wait for this to occur.
    
    Once the live nodes have processed the disconnection of the failing
    node, they can commence failure handling and restart the protocol(s).
    
    Even if the failed node takes a long time to shut down, the others
    can proceed with processing.
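
    The timing argument above can be sketched as a small model. This is
    an illustration only, not NDB code: exclude_node, ExclusionOutcome and
    both millisecond parameters are hypothetical names, and the model
    assumes the stall ends as soon as the failing node is gone, either by
    its own shutdown or by forced disconnection.

```python
from dataclasses import dataclass


@dataclass
class ExclusionOutcome:
    stall_ms: int    # how long the surviving nodes stayed stalled
    isolated: bool   # True if the survivors had to force a disconnect


def exclude_node(self_shutdown_ms: int, isolation_delay_ms: int) -> ExclusionOutcome:
    """Model one exclusion of a slow node (hypothetical helper).

    self_shutdown_ms:   time the failing node needs to shut down by itself
    isolation_delay_ms: delay after which the live nodes disconnect it anyway
    """
    if self_shutdown_ms <= isolation_delay_ms:
        # The node shut down cleanly within the grace period.
        return ExclusionOutcome(stall_ms=self_shutdown_ms, isolated=False)
    # The node was too slow; the stall is capped by the isolation delay.
    return ExclusionOutcome(stall_ms=isolation_delay_ms, isolated=True)
```

    In this model the stall seen by the live nodes is never longer than
    the isolation delay, however slow the failing node's own shutdown is.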
    
    The GcpMonitor and the Lcp Scan Fragment watchdog are enhanced to 
    make use of this mechanism.
    
    Three new testcases are added:
     1.  GcpStop
         Testing of GcpStop handling in normal cases
     2.  GcpStopIsolation
         Testing of GcpStop self-shutdown failure so that Isolation is 
         required
     3.  LcpScanFragWatchdogIsolation
         Testing of Lcp Scan Fragment Watchdog where Isolation is 
         required.
    
    These are added to the daily-devel test suite. 
    
    Additionally:
    
    Bug #20128256 NDB : GCP STOP MONITOR HAS ONLY ONE BULLET
    
    This bug was discovered while testing (GcpStop testcase).
    
    The Gcp Monitor did not continue operation after detecting a Gcp stop.
    
    This is fixed so that it does continue operation after detecting 
    a Gcp stop, and this is tested by both the GcpStop and GcpStopIsolation
    testcases (where the Master node is not a victim and must detect and handle
    multiple separate GCP stop events).
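
    The "one bullet" behaviour and its fix can be illustrated with a toy
    monitor. This is a sketch under assumptions, not the actual Gcp
    Monitor: the class name, the rearm flag and the millisecond timings
    are all hypothetical, and re-arming is modelled simply as restarting
    the lag timer after a detection.

```python
class GcpMonitorSketch:
    """Toy stall monitor (hypothetical, not the NDB implementation)."""

    def __init__(self, max_lag_ms: int, rearm: bool = True):
        self.max_lag_ms = max_lag_ms
        self.rearm = rearm            # False models the "one bullet" bug
        self.last_progress_ms = 0
        self.active = True
        self.detected = []            # times at which a stop was detected

    def gcp_completed(self, now_ms: int) -> None:
        # Record that the protocol made progress.
        self.last_progress_ms = now_ms

    def tick(self, now_ms: int) -> bool:
        """Return True if a stop is detected at this tick."""
        if not self.active:
            return False
        if now_ms - self.last_progress_ms <= self.max_lag_ms:
            return False
        self.detected.append(now_ms)
        if self.rearm:
            # Restart the timer so a later stop can be detected too.
            self.last_progress_ms = now_ms
        else:
            # Buggy behaviour: the monitor stops watching after one detection.
            self.active = False
        return True


def run_two_stalls(rearm: bool) -> list:
    # Drive the monitor through two separate stalls, with progress between.
    monitor = GcpMonitorSketch(max_lag_ms=100, rearm=rearm)
    monitor.tick(50)
    monitor.tick(150)           # first stall detected here
    monitor.gcp_completed(200)  # protocol resumes
    monitor.tick(250)
    monitor.tick(350)           # second stall, seen only if re-armed
    return monitor.detected
```

    With rearm=False only the first of the two stalls is reported; with
    rearm=True both are, which is the behaviour the GcpStop and
    GcpStopIsolation testcases depend on.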