    WL #12788: Iterator executor analytics queries [7/7, duplicate removal] · f71b0b3b
    Steinar H. Gunderson authored
    Implement final duplicate removal by means of filesort.
    
    Normally, deduplication (SELECT DISTINCT) happens by adding a unique
    index to the final temporary table. However, when we aggregate directly
    into a temporary table, we cannot use such an index: the aggregate
    values are updated in place as rows arrive, so rows change during query
    execution and cannot be deduplicated on-the-fly. (E.g., consider
    SELECT DISTINCT COUNT(*) FROM t1 GROUP BY f1, where two groups may end
    up with the same count only after all rows have been seen.) The old
    executor solves this with a post-pass that deletes rows from the
    temporary table; for small tables, it deduplicates using a hash table,
    but for larger ones, it falls back to an O(n^2) algorithm based on
    pairwise comparison, which is extremely slow. Neither fits well into an
    iterator design, so we replace this step with filesort, which is
    consistently O(n log n) with a small constant factor. Filesort needs to
    be extended with support for deduplicating rows, which is done as part
    of this work.
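
    To illustrate the idea, here is a minimal sketch of sort-based
    deduplication (not the actual filesort code; rows are simplified to
    strings): sort on the full row, then drop adjacent duplicates in a
    single pass.

        #include <algorithm>
        #include <string>
        #include <vector>

        // Sort rows by their full contents, then remove adjacent
        // duplicates: O(n log n) for the sort, O(n) for the dedup pass.
        std::vector<std::string> DeduplicateBySort(
            std::vector<std::string> rows) {
          std::sort(rows.begin(), rows.end());
          rows.erase(std::unique(rows.begin(), rows.end()), rows.end());
          return rows;
        }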
    
    Note that this removes a determinism-by-accident that was in filesort
    earlier (if the algorithm decided to sort by row ID, the row ID would
    become part of the sort key, so rows with equal keys were returned in a
    stable order). This also requires new functionality in the test
    framework, so that we can test for partially ordered results
    (--partial_sorted_result).
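
    As a sketch of the property such a test has to verify (the Row type and
    function below are illustrative, not the test framework's actual code):
    a result is partially ordered if it is non-decreasing on the sorted key
    prefix, while rows with equal prefixes may appear in any order.

        #include <string>
        #include <utility>
        #include <vector>

        // A row split into the part covered by the sort key ("prefix")
        // and the rest of the columns.
        using Row = std::pair<std::string, std::string>;

        // True if the result never regresses on the key prefix; the order
        // of rows with equal prefixes is deliberately left unchecked.
        bool IsPartiallySorted(const std::vector<Row> &rows) {
          for (size_t i = 1; i < rows.size(); ++i) {
            if (rows[i].first < rows[i - 1].first) return false;
          }
          return true;
        }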
    
    Change-Id: I985f6d1f30630d6e8e20767c67ba8b4382144df6