    WL#6986: Make switching of index due to small limit cost based · 1d05b689
    Chaithra Gopalareddy authored
    This worklog makes the optimization of queries of the form
    "SELECT .. ORDER BY LIMIT"  cost based.
    
    Problem:
    The decision to use an index that gives ordered results is not
    cost based in all cases when optimizing "order by limit" queries.
    This can result in a bad query plan for some queries.
    Analysis:
Consider the following example query from Bug#16522053:
    SELECT patientId, time
    FROM t1
    WHERE patientId > 41 AND patientId < 43 AND time > 0 AND
    illness="Headache" ORDER BY time LIMIT 1;
    
1. During the initial access method estimation, 5.6 selects
the same index and access strategy as 5.5 as the best:
index merge on (patientId, primary key). Note that this
strategy will require a file sort of the final result.
    
    2. During join optimization this access method is kept as
    the best access method.
    
3. After join optimization, the optimizer checks whether there
are further opportunities for optimization, and in this case
there is one: the query contains
"ORDER BY time LIMIT 1".
    
    Since there is a "limit 1", we check if there is an opportunity
    to avoid having to do both the complete join execution and
    the file sort. And in this case there is one such possibility:
    by using the KEY time (time).
    
    we "only" need to read entries from this index until the join
    has produced the first result record. So the optimizer decides
    to use this index for the first table of the join.
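
Conceptually, the plan the optimizer switches to behaves as if the
ordering index had been forced by hand (illustration only; the switch
happens internally and no hint is needed):

-- Conceptual equivalent of the switched plan: scan the time index
-- in order and stop as soon as one row satisfies the WHERE clause.
SELECT patientId, time
FROM t1 FORCE INDEX (time)
WHERE patientId > 41 AND patientId < 43 AND time > 0 AND
illness="Headache" ORDER BY time LIMIT 1;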
    
In most cases this is a very good optimization, since we avoid
both the complete join and the file sort of the result.
Unfortunately, in this case the query turns out to produce
zero results. As a consequence we read the entire time index
to its end, and for each index entry we have to look
up the record in the clustered index.
The difference between 5.5 and 5.6 is that 5.5, in many/most
cases, failed to select an index that could be used to avoid
the sorting. This bug has been fixed in 5.6, and in a case
like this one, that fix leads to the bad decision.
    
Solution:
test_if_cheaper_ordering() uses a cost-based model to choose
the index for ordering when a limit is specified.
Use the same model in make_join_select() instead of the
heuristic-based decision that is present right now.
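
As a rough sketch of the decision (a simplified model, not the exact
formula used in test_if_cheaper_ordering()), switching to the ordering
index only pays off when reading enough of it to produce the limit rows
is cheaper than the current plan plus the sort:

\[ \frac{\text{limit}}{p} \cdot c_{\text{index row}} \;<\; \text{cost}_{\text{current plan}} + \text{cost}_{\text{filesort}} \]

where p is the estimated fraction of rows that satisfy the rest of the
WHERE clause. When p is close to zero, as in the example above, the
left-hand side approaches the cost of scanning the whole index and the
switch becomes a bad decision.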
    
The test file changes for myisam_explain_non_select_none etc.
are needed because those tests require that the same plan is chosen
for DELETE and SELECT. With the previous queries, this is what
happens for the SELECT queries:
the optimizer initially thinks that a table scan is much cheaper
than a range scan. Later, when the plan is reconsidered because of
the small limit, even with the changes made to
test_if_cheaper_ordering() the cost is still higher than the cost
of the table scan. As a result, no range scan is created.
Earlier, because the cost-based decision was not in the picture,
the optimizer would create a range scan because the key gives the
required sort order.

As a result, the limit value had to be changed to get the range
scan picked over the table scan.
The same applies to single_delete_update.test.
    
This worklog only implements the cost-based handling for SELECT queries.
Most single-table INSERT/DELETE/UPDATE statements take a simpler path
through the optimizer. For these queries the choice of using an
index that supports "order by limit" is still not cost-based in some cases.
    
Existing issues that are not solved by this worklog:
1. The cost calculation in test_if_cheaper_ordering() does not take into
consideration the rows_evaluate_cost after the table scan/index scan
is done. This makes a difference when the key also has a WHERE
condition that can filter out many rows.
2. Along with this, we also need to consider the cost of the file sort,
which is currently not taken into consideration.
3. The current cost model in test_if_cheaper_ordering() does not
take into consideration that, when the key that gives the order
also has a range condition, the total number of records that
need to be scanned is the number of rows that satisfy the
range condition, not the number of records present in the table.
This can be taken into account during cost calculation, since we are
eventually going to create a range scan in this case.
4. select_limit is not reset after checking for an index. This can
lead to bugs when there are multiple candidate indexes for ORDER BY.
5. The check for a covering index is not complete: in the case of InnoDB,
a secondary index always has access to the primary key fields
(see the example below).
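
To illustrate issue 5 (a hypothetical example, not part of the worklog):
in InnoDB every secondary index entry also stores the primary key
columns, so a query that reads only the indexed column and the primary
key can be resolved from the secondary index alone.

-- Hypothetical table: idx_b implicitly contains column a (the PK),
-- so the query below can be answered from idx_b without touching
-- the clustered index, i.e. idx_b is effectively covering.
CREATE TABLE t2 (
  a INT PRIMARY KEY,
  b INT,
  KEY idx_b (b)
) ENGINE=InnoDB;

SELECT a, b FROM t2 ORDER BY b LIMIT 10;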