    WL #12788: Iterator executor analytics queries [7/7, duplicate removal] · f71b0b3b
    Steinar H. Gunderson authored
    Implement final duplicate removal by means of filesort.
    
    Normally, deduplication (SELECT DISTINCT) happens by adding a unique
    index to the final temporary table. However, when we aggregate directly
    into a temporary table, we cannot use such an index: the aggregate
    values are updated in place as rows arrive, so rows change during query
    execution and cannot be deduplicated on-the-fly. (E.g., consider
    SELECT DISTINCT COUNT(*) FROM t1 GROUP BY f1, where two groups may end
    up with the same count only after all rows have been seen.) The old
    executor solves this with a post-pass that deletes rows from the
    temporary table; for small tables, it deduplicates using a hash table,
    but for larger ones, it falls back to an O(n^2) algorithm based on
    pairwise comparison, which is extremely slow. Neither fits well into an
    iterator design, so we replace this step with filesort, which is
    consistently O(n log n) with a small constant factor. Filesort needs to
    be extended with support for deduplicating rows, which is done as part
    of this work.
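
    To illustrate the idea, here is a minimal sketch of sort-based
    deduplication (not the actual filesort code; rows are simplified to
    strings): sort on the full row, then drop adjacent duplicates in a
    single pass.

        #include <algorithm>
        #include <string>
        #include <vector>

        // Sort rows by their full contents, then remove adjacent
        // duplicates: O(n log n) for the sort, O(n) for the dedup pass.
        std::vector<std::string> DeduplicateBySort(
            std::vector<std::string> rows) {
          std::sort(rows.begin(), rows.end());
          rows.erase(std::unique(rows.begin(), rows.end()), rows.end());
          return rows;
        }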
    
    Note that this removes a determinism-by-accident that was in filesort
    earlier (if the algorithm decided to sort by row ID, the row ID would
    become part of the sort key, so rows with equal keys were returned in a
    stable order). This also requires new functionality in the test
    framework, so that we can test for partially ordered results
    (--partial_sorted_result).
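
    As a sketch of the property such a test has to verify (the Row type and
    function below are illustrative, not the test framework's actual code):
    a result is partially ordered if it is non-decreasing on the sorted key
    prefix, while rows with equal prefixes may appear in any order.

        #include <string>
        #include <utility>
        #include <vector>

        // A row split into the part covered by the sort key ("prefix")
        // and the rest of the columns.
        using Row = std::pair<std::string, std::string>;

        // True if the result never regresses on the key prefix; the order
        // of rows with equal prefixes is deliberately left unchecked.
        bool IsPartiallySorted(const std::vector<Row> &rows) {
          for (size_t i = 1; i < rows.size(); ++i) {
            if (rows[i].first < rows[i - 1].first) return false;
          }
          return true;
        }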
    
    Change-Id: I985f6d1f30630d6e8e20767c67ba8b4382144df6