storage/innobase/dict/dict0stats_bg.cc · mysql-8.0.11 · Rasoul Jahanshahi / Mysql Server

Feb 14, 2018

WL#10310 Redo log optimization: dedicated threads and concurrent log buffer. · 6be2fa0b

Paweł Olchawa authored Feb 14, 2018

0. Log buffer became a ring buffer; data inside is no longer shifted.

1. User threads are able to write concurrently to log buffer.

2. Relaxed order of dirty pages in flush lists - no need to synchronize
   the order in which dirty pages are added to flush lists.

3. Concurrent MTR commits can interleave at different stages of commit.

4. Introduced dedicated log threads which keep writing log buffer:
    * log_writer: writes log buffer to system buffers,
    * log_flusher: flushes system buffers to disk.
   As soon as they finish writing (flushing) and there is new data to
   write (flush), they start the next write (flush).

5. User threads no longer write / flush log buffer to disk, they only
   wait by spinning or on event for notification. They do not have to
   compete for the responsibility of writing / flushing.

6. Introduced a ring buffer of events (one per log-block) which are used
   by user threads to wait for written/flushed redo log to avoid:
    * contention on single event
    * false wake-ups of all waiting threads whenever some write/flush
      has finished (we can wake-up only those waiting in related blocks)

7. Introduced dedicated notifier threads, so that notifying user threads
   does not delay the next writes/fsyncs:
    * log_write_notifier: notifies user threads about written redo,
    * log_flush_notifier: notifies user threads about flushed redo.

8. Master thread no longer has to flush log buffer.

9. Introduced a dedicated log thread which is responsible for writing checkpoints.
   Concurrent user threads no longer need to compete for this responsibility.

10. Master thread no longer has to take care of periodic checkpoints.
    Log checkpointer thread writes a checkpoint at least once per second
    (previously it was once per 7 seconds).

11. The following exposed system variables can now be changed at runtime:
    * innodb_log_buffer_size,
    * innodb_log_write_ahead_size.

12. Master thread measures average global CPU usage in OS.

13. Introduced new exposed system variables:
    * innodb_log_wait_for_flush_spin_hwm,
    * innodb_log_spin_cpu_abs_lwm,
    * innodb_log_spin_cpu_pct_hwm.
    They control when we need to use spinning for the best performance,
    to reduce latency which would otherwise come from communication
    between log threads and user threads. The first one is based on
    average flush time, the other two are based on CPU usage.

14. Introduced new CMake option: ENABLE_EXPERIMENT_SYSVARS=0/1. System variables
    can be marked as hidden unless the experiment mode is turned on.

15. There is a list of hidden new system variables for experiments with redo log.
    We skip listing them here.

16. Created dedicated tester for redo log alone (as gtest).

17. Created doxygen documentation for the new redo log.

18. The dict_persist margin is updated whenever the number of dirty pages
    changes, instead of being calculated on demand.

19. Mechanism used to copy last incomplete block for Clone has been changed,
    because log buffer is concurrent now.

20. Added more useful MONITOR counters for redo, including average LSN rate.

21. Introduced sharded rw-lock to have an option to stop the world in redo,
    because log_mutex is removed.

22. Invented and implemented a concurrent data structure which tracks progress
    of concurrent operations and can answer up to which point they all have been
    finished (when there is some order defined but they are allowed to be executed
    out of the order). This structure is used for concurrent writes to log buffer
    and re-used for concurrent additions to flush lists.

23. Introduced a universal mechanism to wait on event, which starts with
    a provided number of spin delays, then falls back to waits on the event,
    starting with a small timeout and increasing the timeout every few waits.
    This mechanism is used in communication between user and log threads,
    and in communication between different log threads.

24. We slow down the redo log writer when there is no space in redo, allowing
    checkpoints to progress and rescue the state of redo.

25. Log buffer can be resized at runtime - the size can also be decreased.

26. Simplified shutdown procedure to avoid possible returns in logic
    to previous phases.

27. Removed concept of multiple log groups.

28. Relaxed conditions required for checkpoint_lsn. It can now point to
    any data byte within redo (does not need to point to a records group
    beginning).

29. Windows: always use buffered IO for redo log.

30. MySQL test runner received a new feature (thanks to Marcin):
    --exec_in_background.
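
Illustration for item 22: a minimal C++ sketch of a structure that tracks
out-of-order completion of ordered ranges and reports the point up to which
everything has finished. This is not the code from this commit. The class
name ProgressTracker and its methods are hypothetical, and it assumes a
single thread advances the tail while many threads report completed ranges.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified tracker; names are illustrative and are not
// taken from the InnoDB sources.
class ProgressTracker {
 public:
  explicit ProgressTracker(std::size_t capacity)
      : m_links(capacity), m_tail(0) {
    assert(capacity > 0);
    for (auto &link : m_links) link.store(0, std::memory_order_relaxed);
  }

  // Report that the operation covering [from, to) has finished.
  // Assumption: from - tail() < capacity, so the slot is free for reuse;
  // a real caller would have to wait until that holds.
  void add_link(std::uint64_t from, std::uint64_t to) {
    assert(from < to);
    m_links[from % m_links.size()].store(to - from,
                                         std::memory_order_release);
  }

  // Follow finished ranges starting at the current tail and stop at the
  // first gap. Returns the position up to which all ranges are finished.
  // Assumption: only one thread calls this at a time.
  std::uint64_t advance_tail() {
    std::uint64_t pos = m_tail.load(std::memory_order_acquire);
    for (;;) {
      auto &slot = m_links[pos % m_links.size()];
      const std::uint64_t len = slot.load(std::memory_order_acquire);
      if (len == 0) break;  // range starting at pos is not finished yet
      slot.store(0, std::memory_order_relaxed);  // free the slot
      pos += len;
    }
    m_tail.store(pos, std::memory_order_release);
    return pos;
  }

  std::uint64_t tail() const {
    return m_tail.load(std::memory_order_acquire);
  }

 private:
  // Slot (from % capacity) holds the length of a finished range that
  // starts at from, or 0 when no such range has been reported.
  std::vector<std::atomic<std::uint64_t>> m_links;
  std::atomic<std::uint64_t> m_tail;
};
```

In this sketch, reporting the range [3, 5) before [0, 3) leaves the tail at 0;
once [0, 3) is reported, the next advance_tail() call moves the tail to 5.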
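
Illustration for item 23: a minimal C++ sketch of waiting that spins first and
then falls back to timed waits with a growing timeout. This is not the code
from this commit. The function spin_then_wait, the WaitStats struct, the
doubling of the timeout, and the waits_per_step parameter are all assumptions
made for the example. It uses std::condition_variable in place of an InnoDB
event.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>

// Hypothetical result type: how many spins and timed waits were used.
struct WaitStats {
  std::uint64_t spins;
  std::uint64_t waits;
};

// Hypothetical helper. `done` must be safe to call without holding `mtx`
// (for example, it reads an atomic), because the spin phase does not lock.
inline WaitStats spin_then_wait(std::mutex &mtx, std::condition_variable &cv,
                                const std::function<bool()> &done,
                                std::uint64_t max_spins,
                                std::chrono::microseconds initial_timeout,
                                std::uint64_t waits_per_step = 4) {
  assert(waits_per_step > 0);
  WaitStats stats{0, 0};

  // Phase 1: spin up to max_spins times, checking the condition each time.
  for (; stats.spins < max_spins; ++stats.spins) {
    if (done()) return stats;
  }

  // Phase 2: timed waits, starting with a small timeout that is doubled
  // after every waits_per_step waits (the doubling is an assumption).
  auto timeout = initial_timeout;
  std::unique_lock<std::mutex> lock(mtx);
  while (!done()) {
    cv.wait_for(lock, timeout);
    ++stats.waits;
    if (stats.waits % waits_per_step == 0) timeout *= 2;
  }
  return stats;
}
```

A notifying thread would set the condition and call cv.notify_all(). The
timeout only bounds how long a waiter sleeps if a notification is missed.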

Review: RB#15134

Reviewers:
    - Marcin Babij <marcin.babij@oracle.com>,
    - Debarun Banerjee <debarun.banerjee@oracle.com>.

Performance tests:
    - Dimitri Kravtchuk <dimitri.kravtchuk@oracle.com>,
    - Daniel Blanchard <daniel.blanchard@oracle.com>,
    - Amrendra Kumar <amrendra.x.kumar@oracle.com>.

QA and MTR tests:
    - Vinay Fisrekar <vinay.fisrekar@oracle.com>.
