    02cd1db6
    WL#12564 Modify node bitmasks and dependent protocols to allow more nodes · 02cd1db6
    Sanjana DS authored
    
    Bug#29211078 DANGLING REFERENCE TO BUFFER OUT OF SCOPE IN QMGR::FAILREPORTLAB
    Bug#29443052 : WL#12680: CLUSTER CRASHED DURING NODE RESTART WITH AN ERROR
    Bug#29458648 : WL#12680: NODES CRASHED DURING SHUTDOWN OF CLUSTER WITH DIFFERENT ERRORS ..
    Bug#29814551 WL#12680: "INITIAL START, WAITING 15 FOR " LIST UNUSED NODE ID
    
    This patch changes the way a NodeBitMask object is sent across in signals.
    The node bitmask should now be sent in a signal section.
    Also, only the words up to and including the last non-zero word need to be sent,
    which is made possible by adding a function, getPackedLengthInWords(), to struct BitmaskPOD.
    This eliminates the need to send unnecessary data and makes the signals shorter.
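The packing rule described above can be sketched as follows. This is only an illustration of the idea: packedLengthInWords() is a hypothetical free function, not the real BitmaskPOD member, and it assumes the receiver treats any words that were not sent as zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not the real BitmaskPOD::getPackedLengthInWords()):
// returns how many leading 32-bit words must be sent so that no set bit
// is lost. Trailing all-zero words are dropped.
inline std::size_t packedLengthInWords(const std::uint32_t* words,
                                       std::size_t size)
{
  assert(words != nullptr || size == 0);
  std::size_t len = size;
  // Walk back from the last word until a non-zero word is found.
  while (len > 0 && words[len - 1] == 0) {
    --len;
  }
  return len;
}
```

In this sketch a bitmask whose highest set bit lives in word 2 is sent as three words, and an empty bitmask is sent as zero words.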
    
    Also add a function in ndb_version.h.in to check whether the cluster version supports
    sending and receiving node bitmasks in a signal section.
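A version check of this kind might look like the sketch below. The function name, the constant name and the threshold value are all placeholders chosen for illustration; the actual names and version number are defined by the patch in ndb_version.h.in and are not reproduced here. The major/minor/build encoding is likewise an assumption of the sketch.

```cpp
#include <cstdint>

// Assumed encoding for the sketch: major, minor and build packed into one word.
constexpr std::uint32_t makeVersion(std::uint32_t major,
                                    std::uint32_t minor,
                                    std::uint32_t build)
{
  return (major << 16) | (minor << 8) | build;
}

// Placeholder threshold, NOT taken from the patch.
constexpr std::uint32_t kBitmaskInSectionVersion = makeVersion(8, 0, 18);

// Hypothetical predicate: true if a peer running `peerVersion` understands
// node bitmasks carried in a signal section.
constexpr bool supportsNodeBitmaskInSection(std::uint32_t peerVersion)
{
  return peerVersion >= kBitmaskInSectionVersion;
}
```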
    
    Reviewed-by: Mauritz Sundell <mauritz.sundell@oracle.com>
    Reviewed-by: Mikael Ronström <mikael.ronstrom@oracle.com>
    Reviewed-by: Sanjana DS <sanjana.ds@oracle.com>
    
    Squashed comments
    =================
    
    WL #12564 : Adapt CM_REGREQ to deal with the node bitmask changes.
    
    Send a dummy node bitmask with all zeroes in GSN_CM_REGREQ since it's not used
    at the receiving end.
    This is for backward compatibility.
    
    WL #12564 : Modify CM_REGCONF protocol to deal with the node bitmask changes
    
    Send node bitmask info in a long section of the signal since it won't
    fit in a normal signal if we increase the node bitmask size.
    Also add relevant upgrade/downgrade code to cope with these changes.
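The upgrade/downgrade handling on the sending side can be pictured roughly as below. OutgoingSignal and packNodeBitmask() are hypothetical names, and the sketch assumes that an old peer expects a fixed two-word bitmask inline (the size that NdbNodeBitmask48::Size stands for elsewhere in this message), while a new peer reads a packed bitmask from a section.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: old peers expect a fixed two-word (48-node) bitmask inline.
constexpr std::size_t kLegacyBitmaskWords = 2;

// Hypothetical stand-in for a signal with a fixed part and one long section.
struct OutgoingSignal {
  std::vector<std::uint32_t> inlineWords;  // bitmask words in the fixed part
  std::vector<std::uint32_t> section;      // bitmask words in a long section
};

inline OutgoingSignal packNodeBitmask(const std::vector<std::uint32_t>& mask,
                                      bool peerSupportsSection)
{
  OutgoingSignal sig;
  if (peerSupportsSection) {
    // New peer: send only the words up to the last non-zero word.
    std::size_t len = mask.size();
    while (len > 0 && mask[len - 1] == 0) {
      --len;
    }
    sig.section.assign(mask.begin(),
                       mask.begin() + static_cast<std::ptrdiff_t>(len));
  } else {
    // Old peer: keep the legacy fixed-size layout inside the signal itself.
    sig.inlineWords.assign(kLegacyBitmaskWords, 0);
    std::copy_n(mask.begin(),
                std::min(mask.size(), kLegacyBitmaskWords),
                sig.inlineWords.begin());
  }
  return sig;
}
```

The old-peer branch in this sketch simply drops words beyond the legacy size, on the assumption that a cluster containing old nodes has no node ids outside the legacy range.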
    
    WL #12564 : Modify CM_REGREF to allow more nodes
    
    Adapt CM_REGREF to deal with the node bitmask changes and add relevant
    upgrade/downgrade code.
    
    WL #12564 : Modify FAIL_REP to allow more nodes
    
    Adapt FAIL_REP to deal with the node bitmask changes and add relevant
    upgrade/downgrade code.
    
    WL #12564 : Modify MASTER_GCPCONF to allow more nodes
    
    Adapt MASTER_GCPCONF to deal with the node bitmask changes and add relevant
    upgrade/downgrade code.
    
    WL #12564 : Modify PREP_FAILREQ and PREP_FAILREF to allow more nodes
    
    Adapt PREP_FAILREQ and PREP_FAILREF to deal with the node bitmask changes
    and add relevant upgrade/downgrade code.
    
    WL #12564 : Modify START_LCP_REQ to allow more nodes
    
    Adapt START_LCP_REQ to deal with the node bitmask changes and add
    relevant upgrade/downgrade code.
    
    WL #12564 : Modify CNTR_START_CONF to allow more nodes
    
    Adapt CNTR_START_CONF to deal with the node bitmask changes and
    add relevant upgrade/downgrade code.
    
    WL#12564: Remove HOT_SPAREREP signal entirely
    
    WL#12564: Remove CM_INIT completely
    
    WL#12564: Remove CNTR_MASTER* signals entirely
    
    WL#12564: BACKUP_CONF, BACKUP_COMPLETE_REP
    
    Removed Ndb node bitmasks from both BACKUP_CONF and BACKUP_COMPLETE_REP
    signals. They were sent but not used at reception of the signal.
    
    WL#12564: CHECKNODEGROUPSREQ/CONF
    
    CHECKNODEGROUPSREQ and CHECKNODEGROUPSCONF each have an Ndb node bitmask
    in the signal. The signal is only sent locally in a node, so no upgrade
    logic is required. In most cases the signal is sent as EXECUTE_DIRECT, and
    in those cases no special treatment is required. SUMA sends the signal and
    receives it. SUMA only requires the node bitmask to be received, but to
    ensure that new uses of this signal will be possible without changing DIH
    we add support to receive the node bitmask in a section. In this case we
    send it as cleared from SUMA.
    
    WL#12564: CLOSE_COMREQ/CONF
    
    CLOSE_COMREQ and CLOSE_COMCONF use a bitmask to communicate between QMGR
    and TRPMAN. A bitmask is actually only needed in CLOSE_COMREQ, since the
    bitmask returned in CLOSE_COMCONF is always the same bitmask as was sent
    in CLOSE_COMREQ. For some reason we even use it to assign the same
    variable that the bitmask was taken from.

    The signal was therefore changed to carry a single node id for cases where
    no bitmask is needed. In the case where a bitmask is needed it is only a
    bitmask for data nodes, and this is transported as usual in a section.
    Some extra code was required in TrpmanProxy to ensure that sections could
    be carried in the CLOSE_COMREQ signal.
    
    WL#12564: ENABLE_COMREQ/CONF
    
    ENABLE_COMREQ sends a bitmask of the nodes to enable communication with.
    In most cases only a single node is enabled at a time; a bitmask is needed
    in only one case. This is solved in the same fashion as for
    CLOSE_COMREQ/CONF. ENABLE_COMCONF only needs to send the single node
    back, so no bitmask is sent back in ENABLE_COMCONF.

    ENABLE_COMREQ/CONF is only sent locally in a node, so there is no need to
    handle upgrade situations.
    
    WL#12564: DIH_RESTARTREF/CONF
    
    A bitmask is sent in response to DIH_RESTARTREQ. DIH_RESTART* are local
    signals, so there is no need to handle upgrade situations. DIH_RESTARTREQ
    can send a bitmask and an array of GCIs, but this is only done in
    EXECUTE_DIRECT. NDBCNTR receives DIH_RESTARTREF/CONF as well, but
    does not use the bitmasks, so it is enough to release the section.
    
    WL#12564: READ_NODESCONF
    
    READ_NODESCONF contains 5 data node bitmasks. These are put into a section
    in unpacked format. All these signals are only sent during startup and are
    local to the node, so there is no reason for upgrade code.
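One way to picture five bitmasks stored in unpacked format in a single section is shown below. The word count per bitmask, the type names and the helper functions are assumptions made for the sketch; the point is only that every bitmask occupies a fixed number of words, so the receiver can find bitmask i at a fixed offset.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kMaskWords = 5;  // hypothetical words per bitmask
constexpr std::size_t kMaskCount = 5;  // five bitmasks, as in READ_NODESCONF

using Mask = std::array<std::uint32_t, kMaskWords>;

// Concatenate the bitmasks, each at its full (unpacked) size.
inline std::vector<std::uint32_t>
buildSection(const std::array<Mask, kMaskCount>& masks)
{
  std::vector<std::uint32_t> section;
  section.reserve(kMaskWords * kMaskCount);
  for (const Mask& m : masks) {
    section.insert(section.end(), m.begin(), m.end());
  }
  return section;
}

// Read bitmask `index` back from its fixed offset in the section.
inline Mask readMask(const std::vector<std::uint32_t>& section,
                     std::size_t index)
{
  assert(index < kMaskCount);
  assert(section.size() == kMaskWords * kMaskCount);
  Mask m{};
  const std::ptrdiff_t offset =
      static_cast<std::ptrdiff_t>(index * kMaskWords);
  std::copy_n(section.begin() + offset, kMaskWords, m.begin());
  return m;
}
```

Because the layout is fixed, no per-bitmask length needs to be transmitted, at the cost of sending trailing zero words.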
    
    WL #12564 : Modify STOP_REQ to allow more nodes
    
    Adapt STOP_REQ to deal with the node bitmask changes and
    add relevant upgrade/downgrade code.
    
    WL#12564: Remove EMPTY_LCP protocol support
    
    Requires 7.4.3 to support upgrades now.
    
    WL#12564: DIH_RESTARTCONF
    
    Sent bitmask in a section from DIH as it was expected to be received
    by both NDBCNTR and QMGR.
    
    WL#12564: DIH_RESTARTCONF
    
    On reception of DIH_RESTARTCONF in QMGR we copied the bitmask into the
    wrong signal. This overwrote the GCI value sent from DIH, which then
    became 0, and as a result the President in QMGR was not the same node
    as the Master in NDBCNTR.
    
    Added more documentation of QMGR behaviour.
    
    Added a bit more printouts to be able to debug problems with allocation
    of node ids that sometimes cause issues in NDB.
    
    WL#12564: ISOLATE_ORD
    
    Ensure that ISOLATE_ORD can handle longer data node bitmasks.
    
    WL #12564 : Modify DEFINE_BACKUP_REQ to allow more nodes
    
    Adapt DEFINE_BACKUP_REQ to deal with the node bitmask changes and
    add relevant upgrade/downgrade code.
    
    WL #12564 : Modify START_RECREQ to allow more nodes
    
    Adapt START_RECREQ to deal with the node bitmask changes and add relevant
    upgrade/downgrade code.
    
    WL#12564: Support for EVENT_REP
    
    Handled StartReport with 5 bitmasks, ConnectCheckStarted with 2 bitmasks,
    InfoEvent extended to support lengths up to 4091 bytes and same for
    WarningEvent. Limit of data in EVENT_REP signal section set to
    1024 words. Limited code to handle MGM server in lower version than
    data node.
    
    Long info events must have type repeated in section.
    
    In SavedEventBuffer::scan()
    -  assert(data_len <= 25);
    +  require(data_len <= MAX_EVENT_REP_SIZE_WORDS);
    
    WL #12564 : Modify NODE_FAILREP to allow more nodes
    
    Adapt NODE_FAILREP to deal with the node bitmask changes and add relevant
    upgrade/downgrade code.
    
    WL# 12564: FAIL_REP fine tuning
    
    WL #12564 : Fix MASTER_GCPCONF
    
    WL#12564: ISOLATE_ORD
    
    WL# 12564: FAIL_REP fine tuning
    
    WL# 12564: START_LCP_REQ bitmask transmission improvements
    
    WL# 12564: PREP_FAILREQ and PREP_FAILREF cosmetic changes
    
    WL#12564: EVENT_REP problem in warningEvent
    
    WL# 12564: STOP_REQ improvements
    
    WL# 12564: CM_REGREQ, CM_REGCONF, CM_REGREF, DEFINE_BACKUP_REQ code improvements
    
    Use NdbNodeBitmask48::Size instead of 2.
    
    WL# 12564: START_RECREQ improvements
    
    WL# 12564: NODE_FAILREP cosmetic changes
    
    WL#12564: Support for EVENT_REP
    
    Handled StartReport with 5 bitmasks, ConnectCheckStarted with 2 bitmasks,
    InfoEvent extended to support lengths up to 4091 bytes and same for
    WarningEvent. Limit of data in EVENT_REP signal section set to
    1024 words. Limited code to handle MGM server in lower version than
    data node.
    
    WL#12564 Pass node bitmask in section for CNTR_WAITREP:ZWAITPOINT_4_2
    
    Continue to send it in the signal to old nodes.
    
    Bug#29211078 DANGLING REFERENCE TO BUFFER OUT OF SCOPE IN QMGR::FAILREPORTLAB
    
    Remove dangling reference to buffer out of scope by moving definition of
    buffer extra to same scope as pointer msg.
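The shape of this fix can be illustrated with a small stand-alone example. describeFailure() and its message strings are invented for the sketch and are not the QMGR code; the buffer and pointer are named after the ones mentioned above.

```cpp
#include <cstdio>
#include <string>

// Buggy shape (for comparison only, do not use):
//   const char* msg = "node failed";
//   if (haveDetail) {
//     char extra[64];                  // dies at the closing brace
//     std::snprintf(extra, sizeof(extra), ...);
//     msg = extra;                     // msg now dangles
//   }
//   use(msg);
//
// Fixed shape: `extra` is declared in the same scope as `msg`, so it is
// still alive whenever `msg` is read.
inline std::string describeFailure(int failedNode, bool haveDetail)
{
  const char* msg = "node failed";
  char extra[64];
  if (haveDetail) {
    std::snprintf(extra, sizeof(extra), "node %d failed", failedNode);
    msg = extra;
  }
  return std::string(msg);
}
```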
    
    WL#12564 Use TextLength constant for bitmask text buffer sizes.
    
    General replacement of the size expressions for character buffers used in
    calls to getText() to produce a hexdump of node bitmasks.
    
    Typically changing
    
      char buf[100];
    
    to
    
      char buf[NdbNodeBitmask::TextLength + 1];
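Why a constant tied to the bitmask size is safer than a literal such as 100 can be seen in the sketch below. SketchBitmask is a hypothetical stand-in, not NdbNodeBitmask; it assumes the hexdump uses eight hex digits per 32-bit word, written from the most significant word down, so the text length grows with the number of words.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical stand-in for a node bitmask of `Words` 32-bit words.
template <std::size_t Words>
struct SketchBitmask {
  // Eight hex digits per word; the caller adds one byte for the terminator.
  static constexpr std::size_t TextLength = Words * 8;

  std::uint32_t data[Words];

  // Writes the hexdump into `buf`, which must hold TextLength + 1 bytes.
  char* getText(char* buf) const
  {
    buf[0] = '\0';
    char* p = buf;
    for (std::size_t i = Words; i > 0; --i) {
      std::snprintf(p, 9, "%08x", static_cast<unsigned>(data[i - 1]));
      p += 8;
    }
    return buf;
  }
};
```

With a buffer declared as `char buf[SketchBitmask<N>::TextLength + 1]`, increasing N enlarges the buffer automatically, whereas a fixed `char buf[100]` would overflow once the hexdump exceeds 99 characters.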
    
    WL# 12564: Send and receive bitmask in NDB_STARTCONF through signal section
    
    WL# 12564: Node bitmask in CONTINUEB in DBSPJ block through signal section
    
    WL# 12564: Zero the node bitmask in GSN_EVENT_REP for ignorance
    
    The node bitmask sent by the BACKUP block is not used at the receiver (mgm client).
    Hence, zero those bits, since they are ignored.