storage/innobase/arch/arch0log.cc · e4924f36486f971f8a04252e01c803457a2c72f7 · Rasoul Jahanshahi / Mysql Server

Jun 10, 2018

BUG#28072385 INNODB CRASHES FROM LONG LOCK WAIT - LOG WRITER WAITING FOR CHECKPOINTER · 270d1836

Paweł Olchawa authored Jun 10, 2018

1. A margin was missing between the boundary up to which log_writer
may write and the boundary up to which log_free_check() calls do
not have to wait (after accounting for the concurrency margin).
This margin has been restored and is called "extra_margin".

2. The safety margin, which is never supposed to be used, was
needlessly large: 10% of the true redo capacity. Its size is
reduced to:
  min(10% of true redo capacity, 1024 * UNIV_PAGE_SIZE).

3. The concurrency margin is now increased by 5% of redo capacity
(after the safety_margin is subtracted from the redo capacity).
The initial value of concurrency_margin is therefore greater by an
amount that does not depend on the number of concurrent threads.
This gives extra protection in two cases:
  - when innodb_thread_concurrency = 0,
  - when we miss a call to log_free_check() in some flow (bug).

4. In addition, the concurrency_margin is now increased adaptively
when log_writer enters the extra_margin. Entering it means that
either some flow is missing a call to log_free_check(), or thread
concurrency is unlimited (innodb_thread_concurrency = 0).
Subsequent log_free_check() calls are then stopped even earlier,
while the log_writer thread proceeds further (it still has the
extra_margin, 5% of redo capacity, in which to operate). The hope
is that this resolves the problem: log_writer writing further allows
the oldest dirty pages to be unlocked and flushed, and the checkpoint
to advance. Next time, users are stopped at log_free_check() calls
earlier (this point might be moved up to even 50% of redo).
When log_writer exits the extra_margin (because the checkpoint
advanced), it notes that, and the next time it re-enters the
extra_margin it increases the concurrency_margin again. Each
increase is by 20%.

5. Do not kill MySQL explicitly when log_writer runs out of space
in redo. Keep waiting and emit an error every 5s. In case of a real
deadlock, it is the error monitor thread that kills MySQL (after 10s,
so the error is already logged and the reason can be recognized).

6. Abort the redo archiver when log_writer starts to overwrite
a not-yet-archived region of the redo log.

7. Decrease the sleep time in the loop that is executed when log_writer
has run out of free space in the redo log (or has reached a
non-archived region).

8. Do not emit warnings that redo is about to run out of space (we emit
only an error, only when we hit the final limit, and only once per 5s).

9. Emit the warning about overwriting a non-archived region of redo
only once per provided period.

10. Introduce new monitor counters for monitoring scenarios
in which we run out of free space in redo.

11. Reclaim space in the log recent_written buffer while log_writer is
waiting for free space in the redo log or for the archiver's progress.
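
The margin arithmetic from points 1-4 can be sketched roughly as below.
This is an illustration only, not the InnoDB code: every name in it
(kPageSize, RedoMargins, initial_margins, on_writer_position) is
hypothetical, a 16 KiB page size is assumed for UNIV_PAGE_SIZE, the
extra_margin is assumed to be 5% of the capacity left after the
safety margin, and the "up to 50% of redo" limit is read as a cap on
the concurrency margin itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical names throughout; this is a sketch, not the InnoDB source.
// Assumed value of UNIV_PAGE_SIZE (16 KiB).
constexpr std::uint64_t kPageSize = 16 * 1024;

struct RedoMargins {
  std::uint64_t safety_margin;       // point 2: never supposed to be used
  std::uint64_t extra_margin;        // point 1: writer-only slack
  std::uint64_t concurrency_margin;  // points 3-4: grows adaptively
  bool in_extra_margin;              // was the writer inside extra_margin?
};

// Point 2: min(10% of true redo capacity, 1024 * UNIV_PAGE_SIZE).
inline std::uint64_t safety_margin(std::uint64_t true_capacity) {
  return std::min<std::uint64_t>(true_capacity / 10, 1024 * kPageSize);
}

// Capacity left once the safety margin has been subtracted.
inline std::uint64_t usable_capacity(std::uint64_t true_capacity) {
  return true_capacity - safety_margin(true_capacity);
}

// Points 1 and 3: the concurrency margin starts as the thread-dependent
// part plus a fixed 5% of the usable capacity. Taking extra_margin as the
// same 5% is an assumption of this sketch.
inline RedoMargins initial_margins(std::uint64_t true_capacity,
                                   std::uint64_t per_thread_margin) {
  assert(true_capacity > 0);
  const std::uint64_t five_percent = usable_capacity(true_capacity) / 20;
  return RedoMargins{safety_margin(true_capacity), five_percent,
                     per_thread_margin + five_percent, false};
}

// Point 4: on each transition into extra_margin, grow the concurrency
// margin by 20%. The cap at 50% of usable capacity is an assumed reading
// of "up to even 50% of redo".
inline void on_writer_position(RedoMargins &m, bool writer_in_extra_margin,
                               std::uint64_t usable) {
  if (writer_in_extra_margin && !m.in_extra_margin) {
    const std::uint64_t grown = m.concurrency_margin + m.concurrency_margin / 5;
    m.concurrency_margin = std::min(grown, usable / 2);
  }
  // Remember the state so the next increase happens only on re-entry.
  m.in_extra_margin = writer_in_extra_margin;
}
```

Tracking in_extra_margin makes the increase fire once per entry rather
than on every check, which matches the "exits, then re-enters" wording
in point 4.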
