Author: Erik Froseth
If the hash table in a hash join becomes full during a query, the hash
join writes both inputs out to temporary files on disk in order to
process them in smaller chunks. This means that both inputs are read to
the end before the join starts producing output rows. This can be
sub-optimal if the query has a small LIMIT; the output rows could be
produced a lot earlier if we did not read both inputs to the end first.
The test case provided in the bug report is a query with 13 levels of
nested hash join, where each hash join is a Cartesian product. As the
hash joins start writing both inputs to disk, the query ends up taking
a long time and consuming a lot of disk space.

To mitigate this problem, the hash join is not allowed to spill to disk
if there is a LIMIT in the query. When not spilling to disk, the hash
join works like this:

1. Read as many rows as possible from the build input into an
   in-memory hash table.
2. When the hash table is full (we have reached the limit set by the
   system variable join_buffer_size), start reading from the beginning
   of the probe input, probing for matches in the hash table. Output a
   row for each match found.
3. When the probe input is empty, see if there are any remaining rows
   in the build input. If so, clear the in-memory hash table and go to
   step 1, continuing from the build input where we stopped the last
   time. If not, the join is done.

This lets the hash join potentially produce rows a lot earlier, which
dramatically reduces execution time in certain cases (a sketch of the
strategy follows below).

There are, however, two cases where we always allow spilling to disk
even if the query contains a LIMIT: when the query has either grouping
or sorting. In those cases, the iterator above us will most likely
consume the entire result set anyway.

Change-Id: I481208c0fe2f5b2f437a1647ec90af6650c5514f
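To make the chunked in-memory strategy concrete, here is a minimal,
self-contained C++ sketch. It is not the actual MySQL implementation;
every name in it (Row, Input, InMemoryHashJoin, max_rows_in_hash_table)
is a hypothetical stand-in, with max_rows_in_hash_table playing the
role of join_buffer_size.

#include <cstddef>
#include <cstdio>
#include <optional>
#include <unordered_map>
#include <vector>

struct Row {
  long long key;          // join key
  long long payload = 0;  // stand-in for the remaining columns
};

// Hypothetical input abstraction: Read() returns rows until the input
// is exhausted; Rewind() restarts reading from the beginning (needed
// for the probe input, which is scanned once per build chunk).
struct Input {
  std::vector<Row> rows;
  std::size_t pos = 0;
  std::optional<Row> Read() {
    if (pos >= rows.size()) return std::nullopt;
    return rows[pos++];
  }
  void Rewind() { pos = 0; }
  bool Exhausted() const { return pos >= rows.size(); }
};

// Joins `build` and `probe` on Row::key without ever spilling to disk.
template <typename EmitFn>
void InMemoryHashJoin(Input &build, Input &probe,
                      std::size_t max_rows_in_hash_table, EmitFn emit) {
  while (!build.Exhausted()) {
    // Step 1: fill the in-memory hash table, continuing from where we
    // stopped in the build input the last time around.
    std::unordered_multimap<long long, Row> hash_table;
    do {
      std::optional<Row> row = build.Read();
      if (!row) break;
      hash_table.emplace(row->key, *row);
    } while (hash_table.size() < max_rows_in_hash_table);

    // Step 2: scan the probe input from the beginning and emit a
    // joined row for every match. Rows reach the caller (and any
    // LIMIT) here, possibly long before both inputs are fully read.
    probe.Rewind();
    for (std::optional<Row> row = probe.Read(); row; row = probe.Read()) {
      auto range = hash_table.equal_range(row->key);
      for (auto it = range.first; it != range.second; ++it) {
        emit(it->second, *row);
      }
    }
    // Step 3: the hash table is dropped at the end of this scope, and
    // we loop back to step 1 for the next build chunk, if any remains.
  }
}

// Usage: a hash table capped at 2 rows forces two build chunks, so the
// probe input is scanned twice.
int main() {
  Input build{{{1, 10}, {2, 20}, {1, 30}}};
  Input probe{{{1, 100}, {3, 300}, {2, 200}}};
  InMemoryHashJoin(build, probe, /*max_rows_in_hash_table=*/2,
                   [](const Row &b, const Row &p) {
                     std::printf("match: key=%lld build=%lld probe=%lld\n",
                                 b.key, b.payload, p.payload);
                   });
}

Note how the probe input is rewound once per build chunk: with N build
chunks the probe input is scanned N times, which is the price paid for
never touching disk. With a small LIMIT, the first chunk typically
produces enough rows that the extra scans never happen.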