    54b635b4
    Bug#30049349 - SEGFAULT IN GROUP_REPLICATION.S: GET_GROUP_MEMBER_STATS
    Venkatesh Venugopal authored
    Bug#30791583 - CRASH IN CERTIFIER::~CERTIFIER() ON STOP
                   GROUP_REPLICATION COMMAND
    
    Problem & Analysis
    ==================
    Querying performance_schema.replication_group_member_stats can
    sometimes crash the server, for the following reasons.
    
    1. The Group Replication perfschema code uses the following pattern:
    
       callbacks.set_last_conflict_free_transaction(
            callbacks.context,
            *pipeline_stats->get_transaction_last_conflict_free().c_str(),
            pipeline_stats->get_transaction_last_conflict_free().length());
    
    Here, the Pipeline_member_stats members
    m_transaction_last_conflict_free and
    m_transactions_committed_all_members are not protected by any lock.
    Access to them is therefore not thread-safe, and the behavior is
    undefined if the value is updated between the calls to c_str() and
    length().
    
    2. The GR perfschema code is not fully thread-safe. As a result:
    
    2.1. applier_module may be deleted by a STOP GROUP_REPLICATION query
         while another thread is executing a P_S query. The P_S query
         then hits a segmentation fault when it accesses applier_module.
    
    2.2. The group membership can change while the P_S query is in
         progress. When this happens in a debug build, the thread hits an
         assertion failure in table_replication_group_member_stats.cc
    
            DBUG_ASSERT(m_pos.m_index < get_row_count());
    
         while fetching the row by position.
    
    Fix
    ===
    1. Instead of returning the internal memory buffer, the getters now
    fill a local memory buffer supplied by the caller, and that buffer's
    value and length are passed to the callback function.
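
    A minimal sketch of the idea behind fix 1, not the actual patch. The
    names MemberStats, Captured and report are hypothetical, and the
    std::mutex is an assumption made only so the example is
    self-contained. The point is that the pointer and the length handed
    to the callback come from one local snapshot.

```cpp
#include <cstddef>
#include <mutex>
#include <string>

// Hypothetical stand-in for Pipeline_member_stats (not the real class).
class MemberStats {
 public:
  void set_last_conflict_free(const std::string &gtid) {
    std::lock_guard<std::mutex> guard(m_lock);
    m_last_conflict_free = gtid;
  }

  // Fills a caller-owned buffer, so no reference to internal storage escapes.
  void get_last_conflict_free(std::string &out) const {
    std::lock_guard<std::mutex> guard(m_lock);
    out.assign(m_last_conflict_free);
  }

 private:
  mutable std::mutex m_lock;  // assumption: any lock that serializes access
  std::string m_last_conflict_free;
};

// Hypothetical sink for the callback below.
struct Captured {
  std::string value;
};

// Hypothetical callback with the same shape as the call in the commit message:
// a reference to the first character plus an explicit length.
inline void set_last_conflict_free_transaction(void *context, const char &value,
                                               std::size_t length) {
  static_cast<Captured *>(context)->value.assign(&value, length);
}

inline void report(const MemberStats &stats, Captured &sink) {
  std::string local;
  stats.get_last_conflict_free(local);
  // Both arguments are derived from the same local copy, so a concurrent
  // update cannot change the value between c_str() and length().
  set_last_conflict_free_transaction(&sink, *local.c_str(), local.length());
}
```

    Calling report(stats, sink) after set_last_conflict_free("uuid:1-5")
    leaves sink.value equal to "uuid:1-5", even if another thread updates
    the member concurrently.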
    
    2.1. To fix the issues with the STOP GROUP_REPLICATION query:
    
    - A new read-write lock has been added to protect access to the
      std::map<std::string, Pipeline_member_stats> Flow_control_module_info
      against any updates received from the process_notification_thread.
    
      This is required to make sure that the iterator used for fetching the
      Pipeline_member_stats remains valid until the value has been copied.
    
    - The P_S query now takes the applier thread's run_lock for a short
      duration while fetching the local member stats, so that the applier
      module cannot be deleted by the STOP GROUP_REPLICATION query in the
      meantime.
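
    The first bullet above can be sketched as follows. This is an
    illustration under stated assumptions, not the code from the patch:
    FlowControlInfo and Stats are hypothetical names, and std::shared_mutex
    (C++17) stands in for whatever read-write lock the server uses. The
    reader copies the value while it still holds the read lock, so the
    iterator is never used after a writer could have invalidated it.

```cpp
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

// Hypothetical, reduced stand-in for Pipeline_member_stats.
struct Stats {
  long transactions_certified = 0;
};

// Hypothetical wrapper around the member-stats map.
class FlowControlInfo {
 public:
  // Writer side: updates arriving from another thread take the lock exclusively.
  void update(const std::string &member_id, const Stats &stats) {
    std::unique_lock<std::shared_mutex> write_guard(m_lock);
    m_info[member_id] = stats;
  }

  void remove(const std::string &member_id) {
    std::unique_lock<std::shared_mutex> write_guard(m_lock);
    m_info.erase(member_id);
  }

  // Reader side: many readers may hold the shared lock at once.
  // Returns false when the member is not present.
  bool snapshot(const std::string &member_id, Stats &out) const {
    std::shared_lock<std::shared_mutex> read_guard(m_lock);
    auto it = m_info.find(member_id);
    if (it == m_info.end()) return false;
    out = it->second;  // copied while the read lock is still held
    return true;
  }

 private:
  mutable std::shared_mutex m_lock;
  std::map<std::string, Stats> m_info;
};
```

    Returning a copy rather than a reference or iterator is the design
    choice that matters here: once snapshot() returns, the caller holds
    nothing that a later update() or remove() can invalidate.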
    
    2.2. Since we cannot block concurrent actions that come from group
    communication, the assert has been converted into the error
    HA_ERR_RECORD_DELETED.
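
    A sketch of what replacing the assertion with an error return can look
    like. MemberStatsTable, read_row_at and shrink_to are hypothetical, and
    kErrRecordDeleted is a placeholder constant that only plays the role of
    HA_ERR_RECORD_DELETED; its value is not the real one.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

constexpr int kOk = 0;
// Placeholder standing in for HA_ERR_RECORD_DELETED (value is illustrative).
constexpr int kErrRecordDeleted = 1;

// Hypothetical table whose row count can shrink between two reads.
class MemberStatsTable {
 public:
  explicit MemberStatsTable(std::vector<std::string> rows)
      : m_rows(std::move(rows)) {}

  // Simulates a member leaving the group while a query is in progress.
  void shrink_to(std::size_t count) {
    if (count < m_rows.size()) m_rows.resize(count);
  }

  // A stale position is reported as an error instead of tripping an assert.
  int read_row_at(std::size_t index, std::string &out) const {
    if (index >= m_rows.size()) return kErrRecordDeleted;
    out = m_rows[index];
    return kOk;
  }

 private:
  std::vector<std::string> m_rows;
};
```

    With two rows, read_row_at(1, row) returns kOk; after shrink_to(1) the
    same call returns kErrRecordDeleted, which the caller can treat as
    "row no longer exists" rather than as a programming error.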
    
    RB: 23924
    Reviewed by: Nuno Carvalho <nuno.carvalho@oracle.com>
    Reviewed by: Jaideep Karande <jaideep.karande@oracle.com>