Olav Sandstaa authored
INDEX STATISTICS IS MISSING

The cost of a loose index scan is calculated in cost_group_min_max().
To estimate the size of the groups to read, index statistics from
rec_per_key are used. The code works correctly when index statistics
are available, but when they are not, a bug in the code will in some
cases cause wrong cost estimates to be produced.

For a query like this:

  SELECT DISTINCT a FROM t1 WHERE b=4;

with an index on (a,b,c), cost_group_min_max() estimates the "keys per
group" both for the index prefix (a) and for the index prefix (a,b).
The first of these is estimated by looking up rec_per_key[0] and using
the following code to adjust the estimate if there is no rec_per_key
information:

  /* Compute the number of keys in a group. */
  keys_per_group= index_info->rec_per_key[group_key_parts - 1];
  if (keys_per_group == 0) /* If there is no statistics try to guess */
    /* each group contains 10% of all records */
    keys_per_group= (uint)(table_records / 10) + 1;

For estimating the second "keys per group", rec_per_key[1] is used and
the code looks like this:

  keys_per_subgroup= index_info->rec_per_key[used_key_parts - 1];

Here there is no code that handles the case where index statistics are
missing, so keys_per_subgroup gets the value 0. Using 0 as an estimate
in the code that follows causes wrong cost estimates to be produced.

To ensure that keys_per_subgroup always has a sensible value, this
patch adds code that handles missing index statistics. A new function,
guess_rec_per_key(), is added. It is based on existing code in
find_best_ref() that assumes, for an index with multiple key parts,
that the first key part will match 1% of the records and that the
whole key will match 10 records for a non-unique index and 1 record
for a unique index. For other numbers of used key parts, a formula
computes an estimated rec_per_key that lies between the rec_per_key
value for the first key part and the rec_per_key value for the whole
key. This new function is used in both cases in cost_group_min_max().

Replacing the current handling of missing index statistics changes the
explain output in multiple test cases. The main reason for these
changes is that the current code assumes the index has ten distinct
values, while the new guess_rec_per_key() function assumes that each
value in a non-unique index has ten entries. For small tables, this
results in different rec_per_key estimates.