    Bug#13893363 - MTS IS MISSING THE ABILITY TO STOP A SLAVE AFTER PROCESSING GAPS · d6397485
    Andrei Elkin authored
    BUG#13893310 checkpoint_group size wrong at recovery after cold restart
    
    This is a combined patch for 3 issues.
    
    Bug#13893363.
    
    The new UNTIL condition is an important feature because
    --relay-log-recovery=1 and CHANGE MASTER can *not* run in the
    presence of gaps.
    The user has to execute
      START SLAVE SQL_THREAD UNTIL SQL_AFTER_MTS_GAPS
    to switch from the parallel to the sequential execution mode
    after the slave SQL thread or Worker threads errored out in the parallel mode.
    Thus UNTIL SQL_AFTER_MTS_GAPS lets the user reach the exact
    after-gaps position automatically instead of having to work
    it out from the relay logs and info repositories by hand.
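
The notion of an "after gaps" position can be illustrated with a minimal
sketch. It assumes a simplified model that is not taken from the patch: the
groups scheduled since the last checkpoint are numbered from 0, and a
hypothetical vector `executed` records which of them a Worker has committed.
The function name `after_gaps_index` is likewise made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model: executed[i] is true when the i-th group after the
// last checkpoint has been committed by some Worker.
// Returns the index one past the highest executed group when at least one
// earlier group is still unexecuted (a gap), or 0 when there are no gaps.
std::size_t after_gaps_index(const std::vector<bool>& executed) {
  std::size_t end = executed.size();
  // Skip the unexecuted tail: it is ordinary pending work, not a gap.
  while (end > 0 && !executed[end - 1]) --end;
  // Any unexecuted group below the highest executed one is a gap.
  for (std::size_t i = 0; i < end; ++i) {
    if (!executed[i]) return end;
  }
  return 0;  // executed groups form a contiguous prefix
}
```

In this model a sequential applier that stops at the returned index has
filled every gap and has not executed anything beyond them.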
    
    Also, a separate issue is fixed: DEADLOCK/WAIT_FOR_LOCK errors were
    incorrectly demoted to warnings, although the slave does not retry
    at Worker execution.  There is also a todo to relocate the checking of
    SQL_AFTER_MTS_GAPS and other post-exec/schedule UNTIL options to
    the end of the read-execute loop (instead of having it
    right after the read phase, which can lead to unnecessary hanging when
    a condition is actually met).
    
    
    Bug#13893310.
    
    The issue with checkpoint_group at MTS recovery is that after a
    server restart the MTS recovery gap-collecting algorithm initialized
    the recovery bitmap with the default size of 512 rather than with a
    correct one, that is, a size not less than the Worker group_executed
    of the last slave session.
    
    That is corrected. The max possible size is used in the gap
    collecting.  The update step of opt_mts_checkpoint_group is set to
    8 (bits).  Some refactoring is done in rpl_info*, rpl_rli_pdb;
    MTS recovery gap collecting is deployed on an execution path common to
    START-SLAVE and --skip-slave-start=0.
    A few small bugs found along the way are fixed, incl. the
    demotion of Worker DEADLOCK/WAIT_FOR_LOCK errors into warnings.
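
The sizing constraint described above can be sketched as follows. This is an
illustration under stated assumptions, not the patch's code: the helper names
`round_up_to_step` and `recovery_bitmap_bits` and the input vector are
hypothetical, and the sketch sizes the bitmap from the previous session's
counts, whereas the patch itself simply uses the max possible size. Only the
default of 512 and the 8-bit step come from the commit message.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kDefaultCheckpointGroup = 512;  // default named in the commit message
constexpr std::size_t kStepBits = 8;                  // update step named in the commit message

// Round a bit count up to a whole number of 8-bit steps.
constexpr std::size_t round_up_to_step(std::size_t bits) {
  return (bits + kStepBits - 1) / kStepBits * kStepBits;
}

// Hypothetical helper: choose a recovery bitmap size (in bits) that is not
// smaller than the default and not smaller than any per-Worker group count
// recorded in the previous slave session.
std::size_t recovery_bitmap_bits(const std::vector<std::size_t>& worker_groups) {
  std::size_t needed = kDefaultCheckpointGroup;
  for (std::size_t n : worker_groups) needed = std::max(needed, n);
  return round_up_to_step(needed);
}
```

Keeping the size a multiple of 8 means the bitmap always occupies a whole
number of bytes, which is presumably why the option's update step was aligned
to 8.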
    
    
    A valgrind issue associated with the bug is fixed, see the stack
    below, via deploying handler->end_info() in the error branch of
    RLI::init_info().
    
    
    http://pb2.no.oracle.com/?template=mysql_show_test_failure&test_failure_id=4266844
    
    rpl.rpl_parallel_change_master           w5 [ fail ]  Found warnings/errors in server log file!
            Test ended at 2012-04-02 05:18:16
    line
    ==15911== 8,192 bytes in 1 blocks are definitely lost in loss record 130 of 179
    ==15911==    at 0x4C216FB: malloc (vg_replace_malloc.c:236)
    ==15911==    by 0xA29FE2: my_malloc (my_malloc.c:38)
    ==15911==    by 0xA09EFD: init_io_cache (mf_iocache.c:232)
    ==15911==    by 0x9FD1E3: Rpl_info_file::do_init_info(unsigned long const*, unsigned int) (rpl_info_file.cc:95)
    ==15911==    by 0x9EC97B: Rpl_info_handler::init_info(unsigned long const*, unsigned int) (rpl_info_handler.h:45)
    ==15911==    by 0x9F153E: Relay_log_info::init_info() (rpl_rli.cc:1676)
    ==15911==    by 0x9E688B: init_info(Master_info*, bool, int) (rpl_slave.cc:468)
    ==15911==    by 0x9E9AB1: init_slave() (rpl_slave.cc:316)
    ==15911==    by 0x5CFE18: mysqld_main(int, char**) (mysqld.cc:5083)
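
The shape of the leak fix can be sketched in isolation. The types below are
hypothetical stand-ins, not the server's classes: `InfoHandler` plays the role
of the handler whose init allocates a cache buffer, and `rli_init_info` plays
the role of RLI::init_info(). As an assumption matching the convention used in
the server code, a `true` return value means failure.

```cpp
#include <cstdlib>

// Hypothetical stand-in for the info handler: init allocates a cache
// buffer that must be released by end_info().
struct InfoHandler {
  void* cache = nullptr;

  // Returns true on error. The buffer may already be allocated when the
  // error is reported, which is the situation behind the leak.
  bool init_info(bool fail_after_alloc) {
    cache = std::malloc(8192);
    if (cache == nullptr) return true;
    return fail_after_alloc;
  }

  void end_info() {
    std::free(cache);  // free(nullptr) is a no-op
    cache = nullptr;
  }
};

// Hypothetical stand-in for RLI::init_info(): the error branch releases
// whatever the handler allocated before returning the failure.
bool rli_init_info(InfoHandler& handler, bool simulate_failure) {
  if (handler.init_info(simulate_failure)) {
    handler.end_info();
    return true;
  }
  return false;
}
```

Without the end_info() call in the error branch, the buffer allocated during
init would stay reachable only through a handler that nobody cleans up, which
is the pattern the valgrind report points at.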