    a8598789
    Bug #19875710 NDB : COMMIT ACK MARKERS LEAK @ LQH IN 7.4 · a8598789
    Frazer Clement authored
          
    Fix TC ref counting of Commit Ack markers in LQH so as not
    to leak markers at LQH.
          
    TC has one TC-Commit-Ack-Marker record per transaction which
    is used to track which nodes and LDM instances hold 
    LQH-Commit-Ack-Marker records.
          
    This is used when receiving TC_COMMIT_ACK to know which nodes
    and LDM instances should be sent a REMOVE_MARKER_ORD signal.
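
    The tracking described above can be sketched as follows.  This is
    a minimal illustration only: TcCommitAckMarker, noteMarkerAt and
    onTcCommitAck are hypothetical names, and a std::set is assumed
    for brevity; the real TC record layout is not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// (node id, LDM instance) pair identifying where an LQH marker lives.
using MarkerLocation = std::pair<std::uint32_t, std::uint32_t>;

// Hypothetical stand-in for the per-transaction TC-side marker record.
struct TcCommitAckMarker {
  std::set<MarkerLocation> holders;

  // Remember that this node/LDM instance holds an LQH marker.
  void noteMarkerAt(std::uint32_t node, std::uint32_t ldm) {
    holders.insert({node, ldm});
  }

  // On TC_COMMIT_ACK: return every location that should be sent
  // REMOVE_MARKER_ORD, and forget them on the TC side.
  std::vector<MarkerLocation> onTcCommitAck() {
    std::vector<MarkerLocation> targets(holders.begin(), holders.end());
    holders.clear();
    return targets;
  }
};
```

    Using a set means a location reported twice is only sent one
    REMOVE_MARKER_ORD in this sketch.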
          
    TC only needs one operation per transaction to have 
    LQH-Commit-Ack-marker records (in each live node in one
    nodegroup), so the approach taken is to request them 
    for all write operations until one of the write 
    operations succeeds (and keeps its marker at LQH).  
    After this, subsequent write operations needn't allocate 
    markers at LQH.
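
    As a rough sketch of that approach (MarkerRequestPolicy and its
    members are hypothetical names, not identifiers from the NDB
    source), the decision could look like this:

```cpp
#include <cassert>

// Hypothetical per-transaction policy: keep asking LQH for a marker
// on each write until one write succeeds and so keeps its marker.
struct MarkerRequestPolicy {
  bool markerKept = false;

  // Should the next write operation request an LQH marker?
  bool shouldRequestMarker() const { return !markerKept; }

  // Record the outcome of a write operation.  Only a successful write
  // that requested a marker leaves one behind at LQH.
  void onWriteResult(bool requestedMarker, bool succeeded) {
    if (requestedMarker && succeeded) {
      markerKept = true;
    }
  }
};
```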
          
    Write operations that don't succeed and don't immediately
    cause a transaction abort (e.g. those defined with
    IgnoreError that find no row, or find that the row already
    exists) are aborted (and discarded at LQH), and so they
    leave no LQH-Commit-Ack marker.
          
    Where a transaction prepares write operations that all fail at
    LQH, there will be no LQH-Commit-Ack markers, and so no need
    for a TC-Commit-Ack marker.  This is handled using a reference
    count of how many LQH-Commit-Ack markers have been requested
    *or acknowledged*.  If this becomes == 0 then there's no need
    for a TC-Commit-Ack marker.
          
    TC uses a per-transaction state and a per-transaction reference
    counter to manage this.
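
    A minimal sketch of the intended counting rule, assuming a single
    counter per transaction.  TxnMarkerCount and its member names are
    illustrative only:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-transaction counter of LQH markers that have been
// requested *or acknowledged*.
struct TxnMarkerCount {
  std::uint32_t refs = 0;

  // A write operation was sent to LQH with a marker request.
  void onMarkerRequested() { ++refs; }

  // The operation failed at LQH and was discarded there, so the
  // marker it asked for does not exist.
  void onOperationDiscarded() {
    assert(refs > 0);
    --refs;
  }

  // The TC-side marker is needed only while some LQH marker may exist.
  bool tcMarkerNeeded() const { return refs > 0; }
};
```

    When every prepared write has been discarded, refs returns to 0
    and tcMarkerNeeded() is false.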
          
    The bug is that the reference count was only covering the
    outstanding requests, and not the LQH-Commit-Ack markers that
    were acknowledged.  In other words, the reference count was
    decremented in execLQHKEYCONF, even though LQHKEYCONF signifies
    that an LQH-Commit-Ack marker was allocated on that LQH instance.
          
    In certain situations this resulted in the allocated LQH-Commit-Ack
    markers being leaked, eventually causing the cluster to become
    read only, as new write operations could not allocate
    LQH-Commit-Ack markers.
          
    The bug seems to have been introduced as part of
      Bug #19451060 BUG#73339 IN MYSQL BUG SYSTEM, NDBREQUIRE INCORRECT
          
    Fix is to *not* decrement the reference count in execLQHKEYCONF.
    
    However, the current implementation 'forgets' that an operation resulted in
    marker allocation (and reference count increment) after LQHKEYCONF is
    processed.
    
    To solve this, TC is modified to record which operations caused 
    LQH-Commit-Ack markers to be allocated, so that during the 
    per-operation phase of transaction ABORT or COMMIT, the 
    reference count can be decremented and so re-checked for 
    consistency.
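
    The shape of the fix can be sketched like this.  It is not the
    actual patch: Op, Txn, holdsMarkerRef and the method names are
    hypothetical, and signal handling is reduced to plain calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical operation record: remembers whether this operation
// caused a marker request (and so a reference count increment).
struct Op {
  bool holdsMarkerRef = false;
};

struct Txn {
  std::uint32_t markerRefs = 0;
  std::vector<Op> ops;

  // Prepare a write, optionally requesting an LQH marker.
  std::size_t prepareWrite(bool requestMarker) {
    Op op;
    if (requestMarker) {
      op.holdsMarkerRef = true;
      ++markerRefs;
    }
    ops.push_back(op);
    return ops.size() - 1;
  }

  // After the fix: a confirmation leaves the reference count alone,
  // because the marker now exists at LQH.
  void onLqhKeyConf(std::size_t /*opIndex*/) {}

  // Per-operation phase of ABORT or COMMIT: drop the reference that
  // this operation took, if any.
  void releaseOp(std::size_t opIndex) {
    Op &op = ops[opIndex];
    if (op.holdsMarkerRef) {
      assert(markerRefs > 0);
      --markerRefs;
      op.holdsMarkerRef = false;
    }
  }

  // Release every operation; the count should end at zero.
  void completeAll() {
    for (std::size_t i = 0; i < ops.size(); ++i) {
      releaseOp(i);
    }
  }
};
```

    Clearing holdsMarkerRef on release keeps a second release of the
    same operation from decrementing the count twice.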
    
          
    Some additional jam()s and comments are added.
          
    A new ndbinfo.ndb$pools pool is added - LQH Commit Ack Markers.  
    This is used in the testcase to ensure that all LQH Commit Ack 
    markers are released, and may be useful for problem diagnosis
    in future.
          
    Replication is used in the test to get batching of write
    operations and to have the NdbApi AO_IgnoreError flag set.
          
    Some basic transaction abort testcases are added which showed problems
    with a partial fix.