storage/ndb/src/kernel/vm/SimulatedBlock.hpp · 91a17cedb1ee880fe7915fb14cfd74c04e8d6588 · Rasoul Jahanshahi / Mysql Server

Jun 11, 2019

Bug#29832974 WL#12680:CLUSTER CRASHED DURING NODE SHUTDOWN WITH ERROR "DBDIH (LINE: 21882) 0X · d64dcd43

Mauritz Sundell authored Jun 11, 2019



Problem
=======
Messages from a failed node may be processed after the node failure has
been handled, because signals were already queued up when the transporter
turned node communication off.

When the transporter (TRPMAN) for the failing node processes CLOSE_COMREQ,
there may be more signals from that node already queued up to all threads,
including the transporter itself.

After CLOSE_COMREQ is processed, no new signals from that node will be
queued up.

If there is only one transporter block instance, running in the main
thread, there should be no problem, since QMGR and NDBCNTR are in the same
thread and NODE_FAILREP is sent as a B-level signal.

Solution
========
Ensure all messages queued up before the connection to the node was closed
are processed on all block threads before proceeding with node failure
handling.

To wait until all already queued up messages are processed,
synchronize_threads_for_blocks() is used to have the transporter send a
B-level signal to all threads, including itself. When all threads have
responded (on A-level), no thread will have any pending signals from the
failed node, and it is safe to proceed with sending NODE_FAILREP to all
blocks that need it.
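
The idea can be illustrated with a minimal single-threaded sketch. It is
not the NDB implementation: BlockThread, Kind, drain_then_report_failure()
and stale_after_failrep() are hypothetical names invented for this example,
and the model ignores real threads and the A/B priority levels. It only
shows why a marker appended to each FIFO job queue is executed after every
signal that was queued before it.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical signal kinds for this sketch; not the real NDB signal set.
enum class Kind { FromFailedNode, SyncMarker };

// One block thread, modelled as a FIFO job queue plus an execution log.
struct BlockThread {
  std::deque<Kind> queue;
  std::vector<std::string> log;
};

// Returns true if the thread executed a stale signal after NODE_FAILREP.
bool stale_after_failrep(const BlockThread& t) {
  bool seen_failrep = false;
  for (const std::string& entry : t.log) {
    if (entry == "NODE_FAILREP") seen_failrep = true;
    else if (entry == "stale" && seen_failrep) return true;
  }
  return false;
}

// Assumes communication with the failed node is already closed, so no new
// signals from it are enqueued. Appends a sync marker to every queue,
// drains the queues in FIFO order, and delivers NODE_FAILREP only once
// every thread has acknowledged its marker.
bool drain_then_report_failure(std::vector<BlockThread>& threads) {
  for (BlockThread& t : threads) t.queue.push_back(Kind::SyncMarker);

  std::size_t acks = 0;
  for (BlockThread& t : threads) {
    while (!t.queue.empty()) {
      Kind k = t.queue.front();
      t.queue.pop_front();
      if (k == Kind::FromFailedNode) {
        t.log.push_back("stale");  // queued before the connection closed
      } else {
        t.log.push_back("sync");   // everything queued earlier is now done
        ++acks;
      }
    }
  }
  if (acks != threads.size()) return false;  // not all threads responded

  for (BlockThread& t : threads) {
    t.log.push_back("NODE_FAILREP");
    assert(!stale_after_failrep(t));
  }
  return true;
}
```

Calling drain_then_report_failure() on a set of threads where some queues
still hold FromFailedNode entries leaves each log ending in "sync" followed
by "NODE_FAILREP", with every "stale" entry before them.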

In case of multiple node failures one needs to ensure that the order of
the NODE_FAILREP signals is kept, so all transporters must synchronize
with all threads, even the transporters that are not involved in the
current node failure.

Reviewed-by: Mikael Ronström <mikael.ronstrom@oracle.com>
