Mauritz Sundell authored
Problem
=======

Messages from a failed node may be processed after the node failure has
been handled, because signals were already queued up when the transporter
turned node communication off.

When the transporter (TRPMAN) for the failing node processes CLOSE_COMREQ,
there may be more signals from that node already queued up to all threads,
including the transporter itself. After CLOSE_COMREQ is processed, no new
signals from that node will be queued up.

If there is only one transporter block instance, running in the main
thread, there should be no problem, since QMGR and NDBCNTR are in the same
thread and NODE_FAILREP is sent as a B-level signal.

Solution
========

Ensure that all messages queued up before the connection to the node was
closed are processed on all block threads before proceeding with node
failure handling.

To wait until all already queued up messages are processed,
synchronize_threads_for_blocks() is used to have the transporter send a
B-level signal to all threads, including itself. When all threads have
responded (on A-level), no thread has any pending signals from the failed
node, and it is safe to proceed with sending NODE_FAILREP to all blocks
that need it.

In case of multiple node failures, the order of the NODE_FAILREP signals
must be kept, so all transporters must synchronize with all threads, even
the transporters that are not involved in the current node failure.

Reviewed-by:
Mikael Ronström <mikael.ronstrom@oracle.com>
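
The ordering argument in the solution can be illustrated with a small model. The sketch below is not NDB code: `BlockThread`, `Kind`, `handle_node_failure()` and `no_stale_signals()` are hypothetical names, the "threads" are simulated sequentially, and signal priority levels are not modelled. It only assumes that each thread's queue is FIFO, so a marker queued after communication is closed is executed after every signal from the failed node that was already pending.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical model, not NDB code: each block thread owns one FIFO job queue.
enum class Kind { FromFailedNode, SyncMarker, NodeFailRep };

struct BlockThread {
  std::deque<Kind> queue;       // pending signals, in arrival order
  std::vector<Kind> processed;  // signals in the order they were executed
};

// Simulates the fix: synchronize with every thread, then report the failure.
// Assumes communication with the failed node is already closed, so nothing
// new from that node is queued while this runs.
void handle_node_failure(std::vector<BlockThread>& threads) {
  // Queue one marker on every thread, behind whatever is already pending.
  for (auto& t : threads) t.queue.push_back(Kind::SyncMarker);

  // Each thread executes its queue in order and acknowledges at the marker.
  std::size_t acks = 0;
  for (auto& t : threads) {
    while (!t.queue.empty()) {
      Kind k = t.queue.front();
      t.queue.pop_front();
      t.processed.push_back(k);
      if (k == Kind::SyncMarker) { ++acks; break; }
    }
  }
  assert(acks == threads.size());  // every thread has responded

  // Only now is the failure report delivered and executed.
  for (auto& t : threads) {
    t.queue.push_back(Kind::NodeFailRep);
    t.processed.push_back(t.queue.front());
    t.queue.pop_front();
  }
}

// True if no signal from the failed node was executed after the failure report.
bool no_stale_signals(const BlockThread& t) {
  bool seen_failrep = false;
  for (Kind k : t.processed) {
    if (k == Kind::NodeFailRep) seen_failrep = true;
    else if (k == Kind::FromFailedNode && seen_failrep) return false;
  }
  return true;
}
```

Calling `handle_node_failure()` on a vector of threads whose queues hold some `Kind::FromFailedNode` entries leaves every queue empty, with the failure report last in each thread's `processed` list.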