Skip to content
  • Sven Sandberg's avatar
    95a156f6
    BUG#19579811: SET GTID_PURGED DOES NOT STOP WAIT_FOR_EXECUTED_GTID_SET · 95a156f6
    Sven Sandberg authored
    Background:
    
    WAIT_FOR_EXECUTED_GTID_SET waits until a specified set of GTIDs is
    included in GTID_EXECUTED. SET GTID_PURGED adds GTIDs to
    GTID_EXECUTED. RESET MASTER clears GTID_EXECUTED.
    
    There were multiple issues:
    
     1. Problem:
    
        The change in GTID_EXECUTED implied by SET GTID_PURGED did
        not cause WAIT_FOR_EXECUTED_GTID_SET to stop waiting.
    
        Analysis:
    
        WAIT_FOR_EXECUTED_GTID_SET waits for a signal to be sent.
        But SET GTID_PURGED never sent the signal.
    
        Fix:
    
        Make GTID_PURGED send the signal.
    
        Changes:
        - sql/rpl_gtid_state.cc:Gtid_state::add_lost_gtids
        - sql/rpl_gtid_state.cc: removal of #ifdef HAVE_GTID_NEXT_LIST
        - sql/rpl_gtid.h: removal of #ifdef HAVE_GTID_NEXT_LIST
    
     2. Problem:
    
        There was a race condition where WAIT_FOR_EXECUTED_GTID_SET
        could miss the signal from a commit and go into an infinite
        wait even if GTID_EXECUTED contains all the waited-for GTIDs.
    
        Analysis:
    
        In the bug, WAIT_FOR_EXECUTED_GTID_SET took a lock while
        taking a copy of the global state. Then it released the lock,
        analyzed the copy of the global state, and decided whether it
        should wait.  But if the GTID to wait for was committed after
        the lock was released, WAIT_FOR_EXECUTED_GTID_SET would miss
        the signal and go to an infinite wait even if GTID_EXECUTED
        contains all the waited-for GTIDs.
    
        Fix:
    
        Refactor the code so that it holds the lock all the way from
        before it reads the global state until it goes to the wait.
    
        Changes:
    
        - sql/rpl_gtid_state.cc:Gtid_state::wait_for_gtid_set:
          Most of the changes in this function are to fix this bug.
    
        Note:
    
        When the bug existed, it was possible to create a test case
        for this by placing a debug sync point in the section where
        it does not hold the lock.  However, after the bug has been
        fixed this section does not exist, so there is no way to test
        it deterministically.  The bug would also cause the test to
        fail rarely, so a way to test this is to run the test case
        1000 times.
    
     3. Problem:
    
        The function would take global_sid_lock.wrlock every time it has
        to wait, and while holding it takes a copy of the entire
        gtid_executed (which implies allocating memory).  This is not very
        optimal: it may process the entire set each time it waits, and it
        may wait once for each member of the set, so in the worst case it
        is O(N^2) where N is the size of the set.
    
        Fix:
    
        This is fixed by the same refactoring that fixes problem #2.  In
        particular, it does not re-process the entire Gtid_set for each
        committed transaction. It only removes all intervals of
        gtid_executed for the current sidno from the remainder of the
        wait-for-set.
    
        Changes:
        - sql/rpl_gtid_set.cc: Add function remove_intervals_for_sidno.
        - sql/rpl_gtid_state.cc: Use remove_intervals_for_sidno and remove
          only intervals for the current sidno. Remove intervals
          incrementally in the innermost while loop, rather than recompute
          the entire set each iteration.
    
     4. Problem:
    
        If the client that executes WAIT_FOR_EXECUTED_GTID_SET owns a
        GTID that is included in the set, then there is no chance for
        another thread to commit it, so it will wait forever.  In
        effect, it deadlocks with itself.
    
        Fix:
    
        Detect the situation and generate an error.
    
        Changes:
        - sql/share/errmsg-utf8.txt: new error code
          ER_CANT_WAIT_FOR_EXECUTED_GTID_SET_WHILE_OWNING_A_GTID
        - sql/item_func.cc: check the condition and generate the new error
    
     5. Various simplfications.
    
        - sql/item_func.cc:Item_wait_for_executed_gtid_set::val_int:
          - Pointless to set null_value when generating an error.
          - add DBUG_ENTER
          - Improve the prototype for Gtid_state::wait_for_gtid_set so
            that it takes a Gtid_set instead of a string, and also so that
            it requires global_sid_lock.
        - sql/rpl_gtid.h:Mutex_cond_array
          - combine wait functions into one and make it return bool
          - improve some comments
        - sql/rpl_gtid_set.cc:Gtid_set::remove_gno_intervals:
          - Optimize so that it returns early if this set becomes empty
    
    @mysql-test/extra/rpl_tests/rpl_wait_for_executed_gtid_set.inc
    - Move all wait_for_executed_gtid_set tests into
      mysql-test/suite/rpl/t/rpl_wait_for_executed_gtid_set.test
    
    @mysql-test/include/kill_wait_for_executed_gtid_set.inc
    @mysql-test/include/wait_for_wait_for_executed_gtid_set.inc
    - New auxiliary scripts.
    
    @mysql-test/include/rpl_init.inc
    - Document undocumented side effect.
    
    @mysql-test/suite/rpl/r/rpl_wait_for_executed_gtid_set.result
    - Update result file.
    
    @mysql-test/suite/rpl/t/rpl_wait_for_executed_gtid_set.test
    - Rewrote the test to improve coverage and cover all parts of this bug.
    
    @sql/item_func.cc
    - Add DBUG_ENTER
    - No point in setting null_value when generating an error.
    - Do the decoding from text to Gtid_set here rather than in Gtid_state.
    - Check for the new error
      ER_CANT_WAIT_FOR_EXECUTED_GTID_SET_WHILE_OWNING_A_GTID
    
    @sql/rpl_gtid.h
    - Simplify the Mutex_cond_array::wait functions in the following ways:
      - Make them one function since they share most code. This also allows
        calling the three-argument function with NULL as the last
        parameter, which simplifies the caller.
      - Make it return bool rather than 0/ETIME/ETIMEOUT, to make it more
        easy to use.
    - Make is_thd_killed private.
    - Add prototype for new Gtid_set::remove_intervals_for_sidno.
    - Add prototype for Gtid_state::wait_for_sidno.
    - Un-ifdef-out lock_sidnos/unlock_sidnos/broadcast_sidnos since we now
      need them.
    - Make wait_for_gtid_set return bool.
    
    @sql/rpl_gtid_mutex_cond_array.cc
    - Remove the now unused check_thd_killed.
    
    @sql/rpl_gtid_set.cc
    - Optimize Gtid_set::remove_gno_intervals, so that it returns early
      if the Interval list becomes empty.
    - Add Gtid_set::remove_intervals_for_sidno. This is just a wrapper
      around the already existing private member function
      Gtid_set::remove_gno_intervals.
    
    @sql/rpl_gtid_state.cc
    - Rewrite wait_for_gtid_set to fix problems 2 and 3. See code
      comment for details.
    - Factor out wait_for_sidno from wait_for_gtid.
    - Enable broadcast_sidnos/lock_sidnos/unlock_sidnos, which were ifdef'ed out.
    - Call broadcast_sidnos after updating the state, to fix issue #1.
    
    @sql/share/errmsg-utf8.txt
    - Add error message used to fix issue #4.
    95a156f6
    BUG#19579811: SET GTID_PURGED DOES NOT STOP WAIT_FOR_EXECUTED_GTID_SET
    Sven Sandberg authored
    Background:
    
    WAIT_FOR_EXECUTED_GTID_SET waits until a specified set of GTIDs is
    included in GTID_EXECUTED. SET GTID_PURGED adds GTIDs to
    GTID_EXECUTED. RESET MASTER clears GTID_EXECUTED.
    
    There were multiple issues:
    
     1. Problem:
    
        The change in GTID_EXECUTED implied by SET GTID_PURGED did
        not cause WAIT_FOR_EXECUTED_GTID_SET to stop waiting.
    
        Analysis:
    
        WAIT_FOR_EXECUTED_GTID_SET waits for a signal to be sent.
        But SET GTID_PURGED never sent the signal.
    
        Fix:
    
        Make GTID_PURGED send the signal.
    
        Changes:
        - sql/rpl_gtid_state.cc:Gtid_state::add_lost_gtids
        - sql/rpl_gtid_state.cc: removal of #ifdef HAVE_GTID_NEXT_LIST
        - sql/rpl_gtid.h: removal of #ifdef HAVE_GTID_NEXT_LIST
    
     2. Problem:
    
        There was a race condition where WAIT_FOR_EXECUTED_GTID_SET
        could miss the signal from a commit and go into an infinite
        wait even if GTID_EXECUTED contains all the waited-for GTIDs.
    
        Analysis:
    
        In the bug, WAIT_FOR_EXECUTED_GTID_SET took a lock while
        taking a copy of the global state. Then it released the lock,
        analyzed the copy of the global state, and decided whether it
        should wait.  But if the GTID to wait for was committed after
        the lock was released, WAIT_FOR_EXECUTED_GTID_SET would miss
        the signal and go to an infinite wait even if GTID_EXECUTED
        contains all the waited-for GTIDs.
    
        Fix:
    
        Refactor the code so that it holds the lock all the way from
        before it reads the global state until it goes to the wait.
    
        Changes:
    
        - sql/rpl_gtid_state.cc:Gtid_state::wait_for_gtid_set:
          Most of the changes in this function are to fix this bug.
    
        Note:
    
        When the bug existed, it was possible to create a test case
        for this by placing a debug sync point in the section where
        it does not hold the lock.  However, after the bug has been
        fixed this section does not exist, so there is no way to test
        it deterministically.  The bug would also cause the test to
        fail rarely, so a way to test this is to run the test case
        1000 times.
    
     3. Problem:
    
        The function would take global_sid_lock.wrlock every time it has
        to wait, and while holding it takes a copy of the entire
        gtid_executed (which implies allocating memory).  This is not very
        optimal: it may process the entire set each time it waits, and it
        may wait once for each member of the set, so in the worst case it
        is O(N^2) where N is the size of the set.
    
        Fix:
    
        This is fixed by the same refactoring that fixes problem #2.  In
        particular, it does not re-process the entire Gtid_set for each
        committed transaction. It only removes all intervals of
        gtid_executed for the current sidno from the remainder of the
        wait-for-set.
    
        Changes:
        - sql/rpl_gtid_set.cc: Add function remove_intervals_for_sidno.
        - sql/rpl_gtid_state.cc: Use remove_intervals_for_sidno and remove
          only intervals for the current sidno. Remove intervals
          incrementally in the innermost while loop, rather than recompute
          the entire set each iteration.
    
     4. Problem:
    
        If the client that executes WAIT_FOR_EXECUTED_GTID_SET owns a
        GTID that is included in the set, then there is no chance for
        another thread to commit it, so it will wait forever.  In
        effect, it deadlocks with itself.
    
        Fix:
    
        Detect the situation and generate an error.
    
        Changes:
        - sql/share/errmsg-utf8.txt: new error code
          ER_CANT_WAIT_FOR_EXECUTED_GTID_SET_WHILE_OWNING_A_GTID
        - sql/item_func.cc: check the condition and generate the new error
    
     5. Various simplfications.
    
        - sql/item_func.cc:Item_wait_for_executed_gtid_set::val_int:
          - Pointless to set null_value when generating an error.
          - add DBUG_ENTER
          - Improve the prototype for Gtid_state::wait_for_gtid_set so
            that it takes a Gtid_set instead of a string, and also so that
            it requires global_sid_lock.
        - sql/rpl_gtid.h:Mutex_cond_array
          - combine wait functions into one and make it return bool
          - improve some comments
        - sql/rpl_gtid_set.cc:Gtid_set::remove_gno_intervals:
          - Optimize so that it returns early if this set becomes empty
    
    @mysql-test/extra/rpl_tests/rpl_wait_for_executed_gtid_set.inc
    - Move all wait_for_executed_gtid_set tests into
      mysql-test/suite/rpl/t/rpl_wait_for_executed_gtid_set.test
    
    @mysql-test/include/kill_wait_for_executed_gtid_set.inc
    @mysql-test/include/wait_for_wait_for_executed_gtid_set.inc
    - New auxiliary scripts.
    
    @mysql-test/include/rpl_init.inc
    - Document undocumented side effect.
    
    @mysql-test/suite/rpl/r/rpl_wait_for_executed_gtid_set.result
    - Update result file.
    
    @mysql-test/suite/rpl/t/rpl_wait_for_executed_gtid_set.test
    - Rewrote the test to improve coverage and cover all parts of this bug.
    
    @sql/item_func.cc
    - Add DBUG_ENTER
    - No point in setting null_value when generating an error.
    - Do the decoding from text to Gtid_set here rather than in Gtid_state.
    - Check for the new error
      ER_CANT_WAIT_FOR_EXECUTED_GTID_SET_WHILE_OWNING_A_GTID
    
    @sql/rpl_gtid.h
    - Simplify the Mutex_cond_array::wait functions in the following ways:
      - Make them one function since they share most code. This also allows
        calling the three-argument function with NULL as the last
        parameter, which simplifies the caller.
      - Make it return bool rather than 0/ETIME/ETIMEOUT, to make it more
        easy to use.
    - Make is_thd_killed private.
    - Add prototype for new Gtid_set::remove_intervals_for_sidno.
    - Add prototype for Gtid_state::wait_for_sidno.
    - Un-ifdef-out lock_sidnos/unlock_sidnos/broadcast_sidnos since we now
      need them.
    - Make wait_for_gtid_set return bool.
    
    @sql/rpl_gtid_mutex_cond_array.cc
    - Remove the now unused check_thd_killed.
    
    @sql/rpl_gtid_set.cc
    - Optimize Gtid_set::remove_gno_intervals, so that it returns early
      if the Interval list becomes empty.
    - Add Gtid_set::remove_intervals_for_sidno. This is just a wrapper
      around the already existing private member function
      Gtid_set::remove_gno_intervals.
    
    @sql/rpl_gtid_state.cc
    - Rewrite wait_for_gtid_set to fix problems 2 and 3. See code
      comment for details.
    - Factor out wait_for_sidno from wait_for_gtid.
    - Enable broadcast_sidnos/lock_sidnos/unlock_sidnos, which were ifdef'ed out.
    - Call broadcast_sidnos after updating the state, to fix issue #1.
    
    @sql/share/errmsg-utf8.txt
    - Add error message used to fix issue #4.
Loading