plugin/group_replication/src/ps_information.cc · 7d10c82196c8e45554f27c00681474a9fb86d137 · Rasoul Jahanshahi / Mysql Server

Mar 01, 2020

Bug#30049349 - SEGFAULT IN GROUP_REPLICATION.S: GET_GROUP_MEMBER_STATS · 54b635b4

Venkatesh Venugopal authored Mar 01, 2020

Bug#30791583 - CRASH IN CERTIFIER::~CERTIFIER() ON STOP
               GROUP_REPLICATION COMMAND

Problem & Analysis
==================
Querying performance_schema.replication_group_member_stats can
sometimes crash the server for the following reasons.

1. The Group Replication perfschema code uses the following pattern:

   callbacks.set_last_conflict_free_transaction(
        callbacks.context,
        *pipeline_stats->get_transaction_last_conflict_free().c_str(),
        pipeline_stats->get_transaction_last_conflict_free().length());

Here, the Pipeline_member_stats members m_transaction_last_conflict_free
and m_transactions_committed_all_members are not protected by any lock,
so they are not thread-safe. If the value is updated between the c_str()
and length() calls, the pointer and the length may no longer describe
the same string, which results in undefined behavior.

2. The GR perfschema code is not fully thread-safe. As a result:

2.1. There is a chance that applier_module is deleted by a STOP
     GROUP_REPLICATION query while another thread is executing a P_S
     query. The P_S query then hits a segmentation fault when it
     accesses applier_module.

2.2. There is a chance that the group membership changes while the P_S
     query is in progress. When this happens, in a debug build, the
     thread hits an assertion failure in
     table_replication_group_member_stats.cc

        DBUG_ASSERT(m_pos.m_index < get_row_count());

     while fetching the row by position.

Fix
===
1. Instead of returning the internal memory buffer, the getter now fills
a local memory buffer supplied by the caller, and the caller passes that
buffer's value and length to the function.
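
A minimal sketch of this copy-out pattern, not the actual patch. The
names Member_stats_sketch, Captured_sketch and report_last_conflict_free
are hypothetical stand-ins, and guarding the member with a std::mutex is
an assumption made only to keep the sketch self-contained:

```cpp
#include <cstddef>
#include <mutex>
#include <string>

// Hypothetical stand-in for Pipeline_member_stats (not the real class).
class Member_stats_sketch {
 public:
  void set_last_conflict_free(const std::string &value) {
    std::lock_guard<std::mutex> guard(m_lock);
    m_last_conflict_free = value;
  }

  // Fills a caller-owned buffer instead of exposing the internal string.
  void get_last_conflict_free(std::string &out) const {
    std::lock_guard<std::mutex> guard(m_lock);
    out.assign(m_last_conflict_free);
  }

 private:
  mutable std::mutex m_lock;  // assumption: illustrative guard only
  std::string m_last_conflict_free;
};

// Hypothetical stand-in for the P_S callback context.
struct Captured_sketch {
  std::string value;
};

// Hypothetical callback taking the first character by reference plus a
// length, mirroring the shape of the call shown in the analysis.
void set_last_conflict_free_transaction(Captured_sketch *context,
                                        const char &value,
                                        std::size_t length) {
  context->value.assign(&value, length);
}

void report_last_conflict_free(const Member_stats_sketch &stats,
                               Captured_sketch *context) {
  // Pointer and length now come from the same local snapshot, so a
  // concurrent update cannot make them disagree.
  std::string local;
  stats.get_last_conflict_free(local);
  set_last_conflict_free_transaction(context, *local.c_str(),
                                     local.length());
}
```

Because c_str() and length() are both taken from the local copy, the
callback always sees a consistent pair, even if the member is updated
right after the copy is made.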

2.1. To fix the issues with the STOP GROUP_REPLICATION query:

- A new read-write lock has been added to protect access to the
  std::map<std::string, Pipeline_member_stats> Flow_control_module_info
  against any updates received from the process_notification_thread.

  This is required to make sure that the iterator used for fetching the
  Pipeline_member_stats remains valid until the value is copied.

- The P_S query now takes the applier thread's run_lock for a short
  duration while fetching the local member stats, so that applier_module
  cannot be deleted by the STOP GROUP_REPLICATION query in the meantime.
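
The read-write lock around the stats map can be sketched as follows.
This is an illustration under assumptions, not the patch itself:
std::shared_mutex (C++17) is swapped in for whatever lock type the
server uses, and Stats_sketch and Flow_control_info_sketch are
hypothetical names:

```cpp
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

// Hypothetical, reduced stand-in for Pipeline_member_stats.
struct Stats_sketch {
  long transactions_certified = 0;
};

// Hypothetical stand-in for the map of per-member stats.
class Flow_control_info_sketch {
 public:
  // Writer side: updates arriving from another thread take the lock
  // exclusively.
  void update(const std::string &member_id, const Stats_sketch &stats) {
    std::unique_lock<std::shared_mutex> guard(m_lock);
    m_info[member_id] = stats;
  }

  void remove(const std::string &member_id) {
    std::unique_lock<std::shared_mutex> guard(m_lock);
    m_info.erase(member_id);
  }

  // Reader side: the iterator is only used while the shared lock is
  // held, and the value is copied out before the lock is released.
  bool get_copy(const std::string &member_id, Stats_sketch &out) const {
    std::shared_lock<std::shared_mutex> guard(m_lock);
    auto it = m_info.find(member_id);
    if (it == m_info.end()) return false;
    out = it->second;
    return true;
  }

 private:
  mutable std::shared_mutex m_lock;
  std::map<std::string, Stats_sketch> m_info;
};
```

Returning a copy rather than a reference or iterator means the caller
never touches map storage after the lock is dropped, so a concurrent
erase or update cannot invalidate what the caller holds.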

2.2. Since we cannot block concurrent actions that come from group
communication, the assert has been converted into the error
HA_ERR_RECORD_DELETED.
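
A minimal sketch of turning the position check into a returned error.
Table_sketch, fetch_by_position and the two constants are hypothetical
stand-ins; the real code would use the server's own
HA_ERR_RECORD_DELETED definition rather than a local constant:

```cpp
#include <cstddef>

// Local stand-ins for the handler return codes (assumed values, for
// illustration only).
constexpr int k_ok_sketch = 0;
constexpr int k_err_record_deleted_sketch = 134;

// Hypothetical, reduced stand-in for the P_S table cursor.
struct Table_sketch {
  std::size_t row_count;

  // Previously an out-of-range index tripped a debug assertion. Here a
  // row that disappeared because the group shrank is reported as
  // deleted, and the caller decides how to proceed.
  int fetch_by_position(std::size_t index) const {
    if (index >= row_count) return k_err_record_deleted_sketch;
    // ... the row at `index` would be materialized here ...
    return k_ok_sketch;
  }
};
```

With this shape, a membership change between saving a position and
fetching by it yields an ordinary error code instead of aborting a debug
build.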

RB: 23924
Reviewed by: Nuno Carvalho <nuno.carvalho@oracle.com>
Reviewed by: Jaideep Karande <jaideep.karande@oracle.com>
