    d64dcd43
    Bug#29832974 WL#12680:CLUSTER CRASHED DURING NODE SHUTDOWN WITH ERROR "DBDIH (LINE: 21882) 0X · d64dcd43
    Mauritz Sundell authored
    
    
    Problem
    =======
    Messages from a failed node may be processed after the node failure has
    been handled, because signals were already queued up when the transporter
    turned node communication off.
    
    When the transporter (TRPMAN) for the failing node processes CLOSE_COMREQ,
    there may be more signals from that node already queued up to all threads,
    including the transporter itself.
    
    After CLOSE_COMREQ is processed, no new signals from that node will be
    queued up.
    
    If there is only one transporter block instance, running in the main
    thread, there should be no problem, since QMGR and NDBCNTR are in the same
    thread and NODE_FAILREP is sent as a B-level signal.
    
    Solution
    ========
    Ensure all messages queued up before the connection to the node was closed
    are processed on all block threads before proceeding with node failure
    handling.
    
    To wait until all already queued up messages are processed,
    synchronize_threads_for_blocks() is used to have the transporter send a
    B-signal to all threads, including itself. When all threads have
    responded (on A-level), no thread will have any pending signals from the
    failed node, and it is safe to proceed with sending NODE_FAILREP to all
    blocks that need it.
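
    The drain-then-report idea above can be sketched as a small
    single-threaded simulation. This is only an illustration under stated
    assumptions: Cluster, BlockThread, Signal, close_com_and_sync(), step()
    and run_demo() are hypothetical names, not the NDB kernel API, and the
    A-level reply is modelled simply as an immediate acknowledgement counter.
    What it shows is that a marker appended to each FIFO queue after
    communication is closed can only be reached once every earlier signal
    from the failed node on that queue has been executed.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical signal kinds; illustrative only, not the NDB definitions.
enum class Kind { Data, SyncMarker };

struct Signal {
  Kind kind;
  int from_node;  // sender node id (for SyncMarker: the node being drained)
};

// One FIFO job queue per block thread.
struct BlockThread {
  std::deque<Signal> queue;
  int stale_processed = 0;  // signals from the failed node executed so far
};

struct Cluster {
  std::vector<BlockThread> threads;
  int failed_node = -1;
  std::size_t acks = 0;
  bool node_failrep_sent = false;
  int stale_after_failrep = 0;  // must stay 0 if the scheme works

  explicit Cluster(std::size_t n) : threads(n) {}

  // Communication is closed, so nothing more from `node` can be queued.
  // A marker appended now sits behind every stale signal on each queue.
  void close_com_and_sync(int node) {
    failed_node = node;
    for (auto& t : threads) t.queue.push_back({Kind::SyncMarker, node});
  }

  // Execute one queued signal on thread i; returns false if the queue is empty.
  bool step(std::size_t i) {
    BlockThread& t = threads[i];
    if (t.queue.empty()) return false;
    Signal s = t.queue.front();
    t.queue.pop_front();
    if (s.kind == Kind::SyncMarker) {
      // Last acknowledgement: only now is the failure report released.
      if (++acks == threads.size()) node_failrep_sent = true;
    } else if (s.from_node == failed_node) {
      ++t.stale_processed;
      if (node_failrep_sent) ++stale_after_failrep;
    }
    return true;
  }
};

// Builds a small scenario (node 2 fails) and drains the queues unevenly.
Cluster run_demo() {
  Cluster c(3);
  c.threads[0].queue.push_back({Kind::Data, 2});
  c.threads[0].queue.push_back({Kind::Data, 5});
  c.threads[1].queue.push_back({Kind::Data, 2});
  c.threads[2].queue.push_back({Kind::Data, 7});
  c.threads[2].queue.push_back({Kind::Data, 2});
  c.threads[2].queue.push_back({Kind::Data, 2});
  c.close_com_and_sync(2);

  while (c.step(1)) {}           // thread 1 runs ahead
  while (c.step(0)) {}
  assert(!c.node_failrep_sent);  // thread 2 has not acknowledged yet
  while (c.step(2)) {}           // the lagging thread drains last
  return c;
}
```

    In this model the failure report is held back until the slowest thread
    has reached its marker, so no signal from the failed node is executed
    after the report, regardless of how the threads interleave.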
    
    In case of multiple node failures, one needs to ensure that the order of
    the NODE_FAILREP signals is kept, so all transporters must synchronize
    with all threads, even the transporters that are not involved in the
    current node failure.
    
    Reviewed-by: Mikael Ronström <mikael.ronstrom@oracle.com>