Steinar H. Gunderson authored
Implement final duplicate removal by means of filesort.

Normally, deduplication (SELECT DISTINCT) happens by adding a unique index to the final temporary table. However, in the cases where we aggregate directly into a temporary table, we cannot use such an index, since rows change during query execution and thus cannot be deduplicated on the fly. (E.g., consider SELECT DISTINCT COUNT(*) FROM t1 GROUP BY f1.)

The old executor solves this by adding a post-pass that actually deletes rows from the temporary table; for small tables, it uses a hash table to deduplicate, but for larger ones, it uses an O(n^2) algorithm based on pairwise comparison, which is extremely slow. Neither fits very well in an iterator design, so we replace this step with filesort, which is consistently O(n log n) with a small constant factor. Filesort needs to be extended with support for deduplicating rows, which is done as part of this work.

Note that this removes an accidental determinism that filesort had earlier (if the algorithm decided to sort by row ID, the row ID would be part of the key).

This also requires new functionality in the test framework, so that we can test for partially ordered results (--partial_sorted_result).

Change-Id: I985f6d1f30630d6e8e20767c67ba8b4382144df6
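To see why an insert-time unique index cannot deduplicate under aggregation, consider a toy standalone simulation of the quoted query. This is not MySQL code; the data and all names are hypothetical, and std::map merely stands in for the temporary table keyed by the GROUP BY column:

    #include <iostream>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    int main() {
      // Hypothetical input for SELECT DISTINCT COUNT(*) FROM t1 GROUP BY f1.
      std::vector<std::string> f1_values = {"a", "b", "b", "c"};

      // One row per group. The COUNT(*) value of a row keeps changing as
      // input rows stream in, so a unique index over COUNT(*) checked at
      // insert time would test values that are not yet final.
      std::map<std::string, int> temp_table;
      for (const std::string &v : f1_values) ++temp_table[v];

      // Only once aggregation has finished can duplicates be removed: the
      // counts are {a:1, b:2, c:1}, and DISTINCT must collapse the two 1s.
      std::set<int> distinct_counts;
      for (const auto &[group, count] : temp_table)
        distinct_counts.insert(count);
      for (int c : distinct_counts) std::cout << c << '\n';  // prints 1, 2
    }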
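As a rough sketch of the replacement strategy (not the actual filesort implementation; names are illustrative): after an O(n log n) sort, equal keys are adjacent, so a single linear readback pass that skips any key equal to the previously emitted one removes all duplicates:

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    // Sort, then drop adjacent equal keys: O(n log n) plus one O(n) pass,
    // versus the old executor's O(n^2) pairwise comparison.
    std::vector<std::string> sort_and_dedup(std::vector<std::string> keys) {
      std::sort(keys.begin(), keys.end());
      std::vector<std::string> out;
      for (const std::string &key : keys) {
        // Equal keys are adjacent after sorting, so comparing against the
        // previously emitted key suffices to detect duplicates.
        if (out.empty() || out.back() != key) out.push_back(key);
      }
      return out;
    }

    int main() {
      for (const auto &k : sort_and_dedup({"2", "1", "2", "1", "3"}))
        std::cout << k << '\n';  // prints 1, 2, 3: duplicates removed
    }

Note that if the sort key did not cover the whole row (e.g., the accidental row-ID component mentioned above), which duplicate survives would be unspecified, which is exactly why the test framework needs --partial_sorted_result.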