    270d1836
    BUG#28072385 INNODB CRASHES FROM LONG LOCK WAIT - LOG WRITER
                 WAITING FOR CHECKPOINTER
    Paweł Olchawa authored
    
    1. A margin was missing between the boundary up to which the log
    writer may write and the boundary up to which log_free_check()
    calls do not have to wait (after they account for the concurrency
    margin). This margin has been restored and is called "extra_margin".
    
    2. There was a completely useless safety margin equal to 10% of
    the true redo capacity. The safety margin is never supposed to be
    used, so we decrease its size to:
      min(10% of true redo capacity, 1024 * UNIV_PAGE_SIZE).
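
The new safety margin from item 2 can be sketched as below. This is a minimal illustration, not the InnoDB code: `kPageSize` is a hypothetical stand-in for UNIV_PAGE_SIZE (assumed here to be 16 KiB), `safety_margin` is a name made up for this sketch, and 10% is computed with integer division.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for UNIV_PAGE_SIZE (assumption: 16 KiB pages).
constexpr std::uint64_t kPageSize = 16 * 1024;

// Safety margin as described in item 2:
//   min(10% of true redo capacity, 1024 * page size).
// Small redo logs keep the 10% rule; large ones are capped at 1024 pages.
constexpr std::uint64_t safety_margin(std::uint64_t true_capacity) {
  return std::min(true_capacity / 10, 1024 * kPageSize);
}

// Compile-time check: the margin never exceeds the 1024-page cap.
static_assert(safety_margin(UINT64_C(1) << 40) == 1024 * kPageSize,
              "large capacities are capped at 1024 pages");
```

With these assumptions a 100 MiB redo log keeps a 10 MiB safety margin, while a 1 GiB redo log is capped at 16 MiB instead of roughly 102 MiB.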
    
    3. The concurrency margin has been increased by 5% of the redo
    capacity (measured after the safety_margin is subtracted from the
    redo capacity). The initial value of concurrency_margin is therefore
    larger by an amount that does not depend on the number of concurrent
    threads. This is extra protection for two cases:
      - when innodb_thread_concurrency = 0,
      - when we miss a call to log_free_check() in some flow (bug).
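
One way to picture how the margins from items 1 to 3 stack up is the sketch below. It is an interpretation, not the actual InnoDB code: `RedoLimits`, `compute_limits` and `per_thread_margin` are hypothetical names, and the sketch assumes that both the extra_margin and the fixed part of the concurrency_margin are 5% of the capacity that remains after the safety margin is subtracted.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of the boundaries discussed in items 1-3.
struct RedoLimits {
  std::uint64_t writer_limit;        // log writer may write up to here
  std::uint64_t extra_margin;        // restored margin from item 1
  std::uint64_t concurrency_margin;  // thread-dependent part + fixed 5%
  std::uint64_t free_check_limit;    // log_free_check() callers wait past this
};

// per_thread_margin stands for the part of the concurrency margin that
// depends on the number of concurrent threads (computed elsewhere).
inline RedoLimits compute_limits(std::uint64_t true_capacity,
                                 std::uint64_t safety_margin,
                                 std::uint64_t per_thread_margin) {
  assert(safety_margin <= true_capacity);
  // Capacity after the safety margin is subtracted.
  const std::uint64_t usable = true_capacity - safety_margin;
  // Assumption: 5% of the usable capacity.
  const std::uint64_t extra = usable / 20;
  // Item 3: the initial concurrency margin grows by a fixed 5%,
  // independent of the number of threads.
  const std::uint64_t concurrency = per_thread_margin + usable / 20;
  assert(extra + concurrency <= usable);
  return {usable, extra, concurrency, usable - extra - concurrency};
}
```

For example, with a capacity of 1000, a safety margin of 100 and a thread-dependent part of 50, the writer may use 900 units, while log_free_check() callers start waiting at 760.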
    
    4. In addition, the concurrency_margin is now increased adaptively
    when log_writer enters the extra_margin. If that happens, we must
    have either missed a call to log_free_check() or be running with
    unlimited thread concurrency (innodb_thread_concurrency = 0).
    Subsequent log_free_check() calls are then stopped even earlier,
    while the log_writer thread proceeds further (it can still operate
    within the extra_margin, which is 5% of redo capacity). We hope this
    resolves the problem, because log_writer writing further allows the
    oldest dirty pages to be unlocked and flushed and the checkpoint to
    advance, and next time we will stop users at log_free_check() calls
    earlier (this point might be moved to as far as 50% of redo).
    When log_writer exits the extra_margin (because the checkpoint
    advanced), it notes that fact, and the next time it re-enters the
    extra_margin it increases the concurrency_margin again. Each
    increase of the concurrency_margin is by 20%.
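
The enter/exit behaviour described in item 4 could look roughly like the sketch below. It is illustrative only: `AdaptiveMargin` and its members are hypothetical names, "by 20%" is read as 20% of the current margin, and the 50% figure is modelled as a cap on the concurrency_margin relative to the writer's capacity. Both readings are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical model of the adaptive concurrency margin from item 4.
class AdaptiveMargin {
 public:
  AdaptiveMargin(std::uint64_t capacity, std::uint64_t extra_margin,
                 std::uint64_t concurrency_margin)
      : capacity_(capacity),
        extra_margin_(extra_margin),
        concurrency_margin_(concurrency_margin) {
    assert(extra_margin_ + concurrency_margin_ <= capacity_);
  }

  // Called by the writer with the amount of redo currently in use
  // (current lsn minus checkpoint lsn).
  void on_write(std::uint64_t used) {
    const bool inside = used > capacity_ - extra_margin_;
    if (inside && !inside_extra_margin_) {
      // Entering the extra margin: grow the margin by 20% (assumed to be
      // relative to its current value), capped at 50% of the capacity.
      const std::uint64_t grown =
          concurrency_margin_ + concurrency_margin_ / 5;
      concurrency_margin_ = std::min(grown, capacity_ / 2);
    }
    // Remember the state so that only a re-entry (after the checkpoint
    // advanced and the writer left the margin) triggers another increase.
    inside_extra_margin_ = inside;
  }

  std::uint64_t concurrency_margin() const { return concurrency_margin_; }

  // Boundary past which log_free_check() callers would have to wait.
  std::uint64_t free_check_limit() const {
    return capacity_ - extra_margin_ - concurrency_margin_;
  }

 private:
  std::uint64_t capacity_;
  std::uint64_t extra_margin_;
  std::uint64_t concurrency_margin_;
  bool inside_extra_margin_ = false;
};
```

Tracking the inside/outside state is what makes the increase happen once per entry rather than on every write issued while the writer stays within the extra_margin.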
    
    5. Do not kill MySQL explicitly when log_writer runs out of space
    in redo. Keep waiting and emit an error every 5s. In case of a real
    deadlock, it is the error monitor thread that kills MySQL (after
    10s, so the error will already have been emitted and we can
    recognize the reason).
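
The "keep waiting and emit an error every 5s" behaviour from item 5 amounts to rate-limited reporting inside a wait loop. A minimal sketch of such a rate limiter follows; `PeriodicReporter` is a hypothetical helper, not an InnoDB class, and the caller passes the current time in so that the logic can be exercised without sleeping.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: decides whether a message may be emitted now,
// allowing at most one message per `period`.
class PeriodicReporter {
 public:
  using Clock = std::chrono::steady_clock;

  explicit PeriodicReporter(Clock::duration period) : period_(period) {}

  // Returns true if a message should be emitted at time `now`.
  // The first call always reports; later calls report only once a
  // full period has elapsed since the last reported message.
  bool should_report(Clock::time_point now) {
    if (reported_once_ && now - last_ < period_) {
      return false;
    }
    last_ = now;
    reported_once_ = true;
    return true;
  }

 private:
  Clock::duration period_;
  Clock::time_point last_{};
  bool reported_once_ = false;
};
```

A waiting loop would call `should_report(PeriodicReporter::Clock::now())` on each iteration and log the out-of-space error only when it returns true, then sleep briefly and re-check for free space.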
    
    6. Abort the redo archiver when log_writer starts to overwrite
    a not-yet-archived region of the redo log.
    
    7. Decrease the sleep time in the loop that is executed when
    log_writer has run out of free space in the redo log (or has
    reached a non-archived region).
    
    8. Do not emit warnings about running out of space in redo soon (we
    emit only an error, only when we hit the final limit, and only once
    per 5s).
    
    9. Emit the warning about overwriting a non-archived region of redo
    only once per provided period.
    
    10. Introduce new monitor counters that allow monitoring scenarios
    in which we run out of free space in redo.
    
    11. Reclaim space in the log recent_written buffer when log_writer
    is waiting for free space in the redo log or for the archiver's
    progress.