Chaithra Gopalareddy authored
This worklog makes the optimization of queries of the form "SELECT .. ORDER BY LIMIT" cost based.

Problem:
The decision to use an index that gives ordered results is not cost based in all cases when optimizing "ORDER BY ... LIMIT" queries. This can result in a bad query plan for some queries.

Analysis:
Using the following example query from Bug#16522053:

  SELECT patientId, time FROM t1
  WHERE patientId > 41 AND patientId < 43 AND time > 0 AND illness="Headache"
  ORDER BY time LIMIT 1;

1. During the initial access method estimation, 5.6 selects the same index and access strategy as 5.5 as the best: index merge on (patientId, primary key). Note that this strategy requires a file sort of the final result.

2. During join optimization this access method is kept as the best access method.

3. After join optimization, the optimizer checks whether there are further opportunities for optimization, and in this case there is one. The query contains "ORDER BY time LIMIT 1". Since there is a "LIMIT 1", we check whether we can avoid both the complete join execution and the file sort. Here there is such a possibility: by using KEY time (time), we "only" need to read entries from this index until the join has produced the first result record. So the optimizer decides to use this index for the first table of the join.

In most cases this is a very good optimization, since we avoid both the complete join and the file sort of the result. Unfortunately, in this case it seems the query produces zero results. As a consequence we read the entire time index to its end, and for each index entry we have to look up the record in the clustered index.

The difference between 5.5 and 5.6 is that 5.5 in many/most cases failed to select an index that could be used to avoid the sorting. This has been fixed in 5.6, and in a case like this one the fix leads to the bad decision.

Solution:
test_if_cheaper_ordering() uses a cost based model to choose the index for ordering when a limit is specified. Use the same in make_join_select() instead of the heuristic based decision that is present right now.

The test file changes for myisam_explain_non_select_none etc. are needed because those tests require that the same plan is chosen for DELETE and SELECT. With the previous queries, this is what happens for the SELECT queries: initially the optimizer considers a table scan much cheaper than a range scan. Later, when the plan is reconsidered because of the small limit, the cost is still higher than the table scan cost even with the changes made to test_if_cheaper_ordering(), so no range scan is created. Earlier, because the decision was not cost based, the optimizer created a range scan simply because the key gives the required sort order. As a result, the limit value had to be changed to get the range scan picked over the table scan. The same applies to single_delete_update.test.

This worklog only implements the cost based handling for SELECT queries. Most single-table INSERT/DELETE/UPDATE statements take a simpler path through the optimizer. For these queries the choice of using an index that supports "ORDER BY ... LIMIT" is still not cost based in some cases.
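The following is a minimal, self-contained C++ sketch of the kind of cost comparison described above. It is not the server code: every name in it (IndexStats, ordered_index_cost, the numeric estimates) is invented for illustration. It only shows why the switch to an ordering index has to be guarded by a cost comparison instead of being made unconditionally.

  /*
    Simplified illustration only -- not the actual server code.  All
    names (IndexStats, ordered_index_cost, ...) are invented.
  */
  #include <algorithm>
  #include <cstdio>

  struct IndexStats {
    double rows_in_table;   // estimated number of rows in the table
    double rows_per_io;     // rows read per I/O when scanning this index
  };

  /*
    Cost of reading an index that already returns rows in the required
    order.  We expect to stop once the join/filter has produced 'limit'
    rows; if the WHERE clause filters heavily, far more index entries
    than 'limit' have to be read first.
  */
  static double ordered_index_cost(const IndexStats &idx, double limit,
                                   double where_selectivity) {
    double rows_to_read =
        std::min(idx.rows_in_table,
                 limit / std::max(where_selectivity, 1e-9));
    return rows_to_read / idx.rows_per_io;
  }

  /* Cost of the current best plan: scan/join everything, then sort. */
  static double scan_plus_filesort_cost(double scan_cost, double sort_cost) {
    return scan_cost + sort_cost;
  }

  int main() {
    IndexStats time_idx = {1000000.0, 100.0};   // made-up statistics

    /*
      Bug#16522053-like scenario: the WHERE clause matches (almost) no
      rows, so the "cheap" ordered scan degenerates into reading the
      whole index, with a clustered-index lookup per entry.
    */
    double ordered_selective = ordered_index_cost(time_idx, 1.0, 0.5);
    double ordered_empty     = ordered_index_cost(time_idx, 1.0, 1e-7);
    double best_plan         = scan_plus_filesort_cost(300.0, 50.0);

    std::printf("ordered scan, selective WHERE: %.2f\n", ordered_selective);
    std::printf("ordered scan, empty result:    %.2f\n", ordered_empty);
    std::printf("best plan + filesort:          %.2f\n", best_plan);

    /* Cost-based choice: switch only if it is estimated to be cheaper. */
    std::printf("switch to ordering index: %s\n",
                ordered_empty < best_plan ? "yes" : "no");
    return 0;
  }

In the Bug#16522053 scenario the WHERE clause matches essentially no rows, so the estimated number of index entries to read explodes and the ordered scan loses the comparison, which is the behavior the cost based decision is meant to produce.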
Existing issues that are not solved by this worklog (a sketch of how items 1-3 might enter the cost estimate follows the list):

1. The cost calculation in test_if_cheaper_ordering() does not take into account the rows_evaluate_cost incurred after the table scan/index scan is done. This makes a difference when the key also has a WHERE condition that can filter out many rows.

2. Along with this, we also need to consider the cost of the file sort, which is currently not taken into account.

3. The current cost model in test_if_cheaper_ordering() does not take into account that, when the key that gives the order also has a range condition, the total number of records that need to be scanned is the number of rows that satisfy the range condition, not the number of records in the table. This can be considered during cost calculation, since we are eventually going to create a range scan in that case.

4. select_limit is not reset after checking an index. This can lead to bugs when there are multiple indexes usable for the ORDER BY.

5. The check for a covering index is not complete. In the case of InnoDB, a secondary index always has access to the primary key fields.
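The sketch below is hypothetical: row_evaluate_cost, range_rows and the constants are placeholders, not the server's cost model API, and none of it is implemented by this worklog. It only illustrates one way items 1-3 could be folded into the ordering-index cost estimate.

  /*
    Hypothetical adjustment of the ordering-index cost estimate covering
    items 1-3 above.  Names and constants (row_evaluate_cost, range_rows,
    the numbers in main) are placeholders, not the server's cost model.
  */
  #include <algorithm>
  #include <cstdio>

  struct OrderingKeyEstimate {
    double table_rows;    // total rows in the table
    double range_rows;    // rows matching the range condition on this key
    double rows_per_io;   // rows read per I/O when scanning this key
  };

  static double adjusted_ordering_cost(const OrderingKeyEstimate &k,
                                       double row_evaluate_cost) {
    /*
      Item 3: if the ordering key also has a range condition, only the
      rows inside the range need to be scanned, not the whole table.
    */
    double rows_to_scan = std::min(k.table_rows, k.range_rows);

    double io_cost = rows_to_scan / k.rows_per_io;

    /*
      Item 1: account for evaluating the remaining WHERE condition on
      every row read from the index.
    */
    double eval_cost = rows_to_scan * row_evaluate_cost;

    return io_cost + eval_cost;
  }

  /*
    Item 2: the competing plan's cost should include the file sort it
    needs, which the current model ignores.
  */
  static double competing_plan_cost(double scan_cost, double filesort_cost) {
    return scan_cost + filesort_cost;
  }

  int main() {
    OrderingKeyEstimate k = {1000000.0, 2000.0, 100.0};   // made-up numbers
    double ordered = adjusted_ordering_cost(k, 0.1);      // placeholder constant
    double other   = competing_plan_cost(5000.0, 800.0);
    std::printf("ordered-key cost: %.1f, alternative: %.1f -> use %s\n",
                ordered, other,
                ordered < other ? "ordering key" : "alternative");
    return 0;
  }

With a selective range condition on the ordering key, the estimate scans only the range-qualified rows, so the ordering index can win cases it would otherwise lose when costed against the full table row count.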