    Bug#20701585 SEMI-JOIN DUPLICATE ELIMINATION GIVES LOWER COST THAN NO DUPLICATE ELIMINATION
    Oystein Grovlen authored
    After the introduction of filtering estimation for conditions
    (WL#6635), the expected number of rows selected from a table may be
    between 0 and 1.  Some of our algorithms are not prepared for this.
    One observed effect is that semi-join DuplicateWeedout plans include
    tables in the duplicate-eliminating range that are not relevant to the
    semi-join (so-called nt tables).  This happens because extending the
    range with a table whose estimated number of rows is less than one
    makes the cost go down.
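
    To illustrate (with made-up numbers and a deliberately simplified cost
    formula, not the server's actual cost model), the sketch below treats
    the work done inside the duplicate-eliminating range as proportional
    to the product of the filtered row estimates of the tables in the
    range.  Pulling in an unrelated table whose filtered estimate is 0.5
    rows multiplies that product by 0.5, so the extended range looks
    cheaper:

      // Illustration only: toy cost comparison, not MySQL code.
      #include <cstdio>

      int main() {
        const double sj_inner_rows = 10.0;  // filtered rows from the semi-join table
        const double nt_rows       = 100.0; // rows in the unrelated (nt) table
        const double nt_filter     = 0.005; // combined filter => 0.5 expected rows

        // Range covers only the semi-join table: ~10 temp-table writes.
        const double weedout_rows_without_nt = sj_inner_rows;
        // Range extended with the nt table: 10 * 100 * 0.005 = 5 writes,
        // i.e. the estimate drops because 100 * 0.005 < 1.
        const double weedout_rows_with_nt = sj_inner_rows * nt_rows * nt_filter;

        std::printf("without nt table: %.1f\n", weedout_rows_without_nt);
        std::printf("with nt table:    %.1f\n", weedout_rows_with_nt);
        return 0;
      }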
    
    This patch solves this issue by ensuring that filtering estimates
    always give at least one row.  I think there are several reasons to
    do this instead of trying to adapt our code to work with values below
    1:
    
     - WL#6635 already put such a lower limit on the filter effect of a
       single condition.  Hence, values lower than 1 can only happen when
       there are multiple conditions on a table.  However, I think the
       same reasoning applies in the case of multiple conditions.  That
       is, filtering estimates should assume at least 1 row regardless of
       the number of conditions.
    
     - The assumption when computing the filtering effect of multiple
       conditions is that the conditions are non-correlated, so their
       selectivities are simply multiplied (see the sketch after this
       list).  In actual databases, this is normally not the case.
       Hence, the filtering estimates of multiple conditions will often
       be lower than what is actually the case.
    
     - I think it is more important to get an optimized query plan for the
       cases where the result is non-empty than for the case when no
       records are returned.  With very low numbers, one risks that the
       join order has very little impact on the total cost of the query.
       This means that for the cases when there is actually a match, the
       join order may not be optimal.
    
     - It is not trivial to adjust the optimizer code to cope with values
       lower than 1.
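
    As a worked illustration of the second point (made-up numbers, a
    hypothetical standalone calculation, not server code): two
    independent filters of 0.1 each respect the per-condition 1-row
    limit on a 50-row table, but their product does not:

      // Illustration only: how multiplying independent condition filters
      // can push the expected row count for a table below one row.
      #include <cstdio>

      int main() {
        const double table_rows = 50.0;
        const double filter_c1  = 0.1;   // alone: 50 * 0.1 = 5 rows (>= 1)
        const double filter_c2  = 0.1;   // alone: 50 * 0.1 = 5 rows (>= 1)

        const double combined = filter_c1 * filter_c2;  // 0.01
        const double est_rows = table_rows * combined;  // 0.5 rows (< 1)

        std::printf("combined filter %.3f => %.1f expected rows\n",
                    combined, est_rows);
        return 0;
      }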
    
    Note: The existing limitation on filter, "filter*fanout >= 0.05", in
    calculate_condition_filter() is kept.  I am not convinced this
    limitation makes sense, but dropping it had negative effects on DBT-3
    query 8.
    
    This patch does not contain a new test case.  The "How-to-repeat" part
    of the bug report consists of an existing test case in subquery.inc,
    and subquery_sj_dupsweed.test shows that an nt table is no longer
    included in the duplicate-producing range.  (See first hunk of
    subquery_sj_dupsweedout.result).
    
    File comments:
    (In addition, there are result files where all changes are changes in
    filter and row estimates.  That is, there are no plan changes.)
    
    sql/sql_planner.cc
      Make sure calculate_condition_filter() never returns a number that
      corresponds to less than 1 row in the given table.
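
      A minimal sketch of that lower bound (hypothetical helper, not the
      actual calculate_condition_filter() code, which also keeps the
      pre-existing "filter*fanout >= 0.05" limit mentioned above):

        #include <algorithm>

        // 'filter' is the combined condition filter in (0, 1]; 'rows' is
        // the estimated number of rows read from the table before
        // filtering is applied.
        static double apply_one_row_lower_bound(double filter, double rows) {
          if (rows < 1.0) return filter;  // degenerate estimate, leave as-is
          // rows * filter >= 1  <=>  filter >= 1 / rows
          return std::min(1.0, std::max(filter, 1.0 / rows));
        }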
    
    mysql-test/include/explain_json.inc
      Added some rows to a table of a test case to get a plan which uses
      materialization.  According to a comment in the test, this is what
      was intended when the test case was added.
    
    mysql-test/include/subquery_sj.inc
      Added some rows to a table of a test case to get the same plan as before.
    
    mysql-test/r/derived.result
      The join order has changed, but this should not be relevant to the
      objective of the test.
    
    mysql-test/r/explain_json_all.result
      More records in the table lead to different cost estimates.
      The test case now uses materialization, which was the original intention.
    
    mysql-test/r/explain_json_none.result
      More records in the table lead to different cost estimates, but no plan change.
    
    mysql-test/r/greedy_optimizer.result
      Query plans show a different join order.  The new join order is
      expected since tables with the lowest filter estimate come first.
      (Previously, the filter estimates were equal for the tables in question.)
    
    mysql-test/r/greedy_search.result
      One test case has a slight increase in the number of partial query
      plans evaluated.
    
    mysql-test/r/partition_explicit_prune.result
      Changes in the number of handler accesses.  All changed numbers are
      lower than previously.  (Neither the values before nor after match
      what the comments in the test say is expected.)
    
    mysql-test/r/subquery_sj_all.result
    mysql-test/r/subquery_sj_all_bka.result
    mysql-test/r/subquery_sj_all_bkaunique.result
      @@ -3406,12 +3406,13 @@
        Changes from DupsWeedout => MatScan.  According to the comment, this
        test case is a regression test case for MatScan.
      @@ -4067,12 +4068,12 @@
        According to a comment in the test, this test case is meant to
        reproduce an issue with MatLookup.  A MatLookup plan is still used in
        subquery_sj_mat.test.  I doubt that it is possible to get MatLookup by
        default as long as condition filtering is used.
      @@ -4383,9 +4388,9 @@
        Added some rows to one table to get the same plan as before.
      @@ -6215,11 +6220,11 @@
        New plan is the same plan as when the test case was pushed
      @@ -9675,11 +9680,11 @@
        Different join order from before, but neither the old nor the new
        results have plans that are similar to the original plan.
    
    mysql-test/r/subquery_sj_all_bka_nixbnl.result
      Changes are similar to subquery_sj_all.result except:
      @@ -4298,11 +4299,12 @@
      @@ -4434,11 +4440,12 @@
      @@ -4542,11 +4549,12 @@
      @@ -4632,11 +4640,12 @@
      @@ -4740,11 +4749,12 @@
        DupsWeedout => MatScan.  According to the test comment, MatLookup is
        what must be avoided.
      @@ -6224,10 +6234,10 @@
        No longer a DupsWeedout that covers both semi-join nests, but
        different algorithms for the two.  This is a result of DupsWeedout
        no longer including more tables than necessary.  (Note that the
        tests where BNL is allowed have FirstMatch for both.)
    
    mysql-test/r/subquery_sj_dupsweed.result
    mysql-test/r/subquery_sj_dupsweed_bka.result
    mysql-test/r/subquery_sj_dupsweed_bkaunique.result
      Differences from subquery_sj_all.result:
      @@ -3378,12 +3378,12 @@
        This is the result change that proves the point of this patch.  The
        new plan no longer includes the nt table in the temporary table range.
      @@ -6190,10 +6194,10 @@
        Different join order, but subquery_sj_all.test covers the original plan
    
    mysql-test/r/subquery_sj_dupsweed_bka_nixbnl.result
      The changes that differ from those in subquery_sj_dupsweed_bka.result
      are changes to use the same plan as subquery_sj_dupsweed_bka.result.
    
    mysql-test/r/subquery_sj_firstmatch.result
    mysql-test/r/subquery_sj_firstmatch_bka.result
    mysql-test/r/subquery_sj_firstmatch_bka_nixbnl.result
    mysql-test/r/subquery_sj_firstmatch_bkaunique.result
      @@ -7066,11 +7070,11 @@
      @@ -7112,32 +7116,6 @@
        New join order, but subquery_sj_all.test covers the old plan.
      @@ -6173,8 +6177,8 @@
        The new plan is the same as the query plan when the test case was pushed.
      @@ -11311,17 +11324,46 @@
        The table with the increased filter estimate comes later in the join order.
    
    mysql-test/r/subquery_sj_loosescan.result
    mysql-test/r/subquery_sj_loosescan_bka.result
    mysql-test/r/subquery_sj_loosescan_bka_nixbnl.result
    mysql-test/r/subquery_sj_loosescan_bkaunique.result
    mysql-test/r/subquery_sj_mat.result
    mysql-test/r/subquery_sj_mat_bka.result
    mysql-test/r/subquery_sj_mat_bka_nixbnl.result
    mysql-test/r/subquery_sj_mat_bkaunique.result
    mysql-test/r/subquery_sj_mat_nosj.result
      Result differences are only in tests that use DupsWeedout because
      LooseScan or Materialization is not applicable.  Plans may be
      different from subquery_sj_dupsweedout.test since these tests have
      turned off DuplicateWeedout, and that will affect plan pruning.
    
    mysql-test/r/view.result
      Tests get a different join order due to different filter estimates
      for one table.  This should be OK since this does not seem to be a
      regression test case.