storage/perfschema/table_helper.cc · 7ed30a748964c009d4909cb8b4b22036ebdef239 · Rasoul Jahanshahi / Mysql Server

Sep 21, 2020

BUG#22246001 UNDOCUMENTED NEGATIVE LOW ESTIMATES IN P_S.MEMORY_SUMMARY_GLOBAL_BY_EVENT_NAME · af2431ff

Marc Alff authored Sep 21, 2020

Problem:
========

When monitoring the content of table
performance_schema.memory_summary_global_by_event_name,
with a production workload:
- low water marks (LOW_COUNT_USED, LOW_NUMBER_OF_BYTES_USED)
  can have negative values
- high watermarks (HIGH_COUNT_USED, HIGH_NUMBER_OF_BYTES_USED)
  have forever growing values, even when the server
  does not really keep increasing memory usage.

Context:
========

The memory instrumentation maintains statistics as described below.

When an ALLOC is instrumented:
- the current count / size is increased
- the 'alloc count / size capacity',
  which represents the distance between the current level
  and the high water mark, is (conditionally) reduced.
- the 'free count / size capacity',
  which represents the distance between the current level
  and the low water mark, is (always) increased.

Likewise, when a FREE is instrumented:
- the current count / size is decreased
- the 'alloc count / size capacity'
  is (always) increased.
- the 'free count / size capacity'
  is (conditionally) reduced.

When an ALLOC followed by a corresponding FREE is instrumented,
for the *same* stat buffer:
- the current count / size is unchanged
- the high watermark may be increased to reflect the top memory usage
- the low watermark is unchanged.

When repeating N balanced ALLOC + FREE operations as part of a workload,
still for the same stat buffer:
- the current count / size is unchanged,
- the high watermark reflects the top usage, and is stable,
- the low watermark reflects the bottom usage, and is stable.
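
The bookkeeping described above can be sketched with a small model.
This is a simplified illustration, not the code in pfs_stat.h:
the struct MemStat, its members, and the clamp-to-zero behavior are
assumptions made for the example, and only sizes (not counts) are tracked.
On FREE, the free capacity is treated as conditionally reduced,
by symmetry with the alloc capacity on ALLOC.

```cpp
#include <cstdint>

// Hypothetical, simplified model of one stat buffer (sizes only).
struct MemStat {
  std::int64_t current = 0;         // current size in use
  std::int64_t alloc_capacity = 0;  // distance from current up to the high watermark
  std::int64_t free_capacity = 0;   // distance from current down to the low watermark

  void on_alloc(std::int64_t size) {
    current += size;
    free_capacity += size;  // always increased
    // conditionally reduced: clamped so it never drops below zero
    alloc_capacity = (alloc_capacity >= size) ? alloc_capacity - size : 0;
  }

  void on_free(std::int64_t size) {
    current -= size;
    alloc_capacity += size;  // always increased
    // conditionally reduced: clamped so it never drops below zero
    free_capacity = (free_capacity >= size) ? free_capacity - size : 0;
  }

  // Watermarks are derived from the current level and the capacities.
  std::int64_t high() const { return current + alloc_capacity; }
  std::int64_t low() const { return current - free_capacity; }
};
```

In this model, repeating on_alloc(64) followed by on_free(64) any number
of times on the same buffer leaves the current size at 0,
the high watermark at 64, and the low watermark at 0.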

Root cause:
===========

For some memory instruments however, the pattern seen at runtime is
different.

For memory instruments which are not flagged 'global',
statistics are maintained per thread, then account / user / host,
and then finally in a global buffer.

For workloads where one thread acts as a producer of data,
and *another* thread acts as a consumer of that data,
and if the piece of memory transferred between the producer
and the consumer is:
- never UN CLAIMED (as in pfs_memory_claim_vc(false)) by the producer
- never CLAIMED (as in pfs_memory_claim_vc(true)) by the consumer
then the instrumentation behaves as follows.

For the producer thread (here, the workload):
- the current count / size continuously grows,
  because the workload is not balanced
- the 'alloc count / size capacity' is reduced to 0 as the current level grows,
  which reflects that this thread is and stays at the high water mark
- the 'free count / size capacity' is continuously growing,
  which reflects that the current level is further and further away
  from the low water mark

For the consumer thread (typically a system thread like innodb cleaners):
- the current count / size continuously decreases, becomes negative,
  and continues to decrease without limit,
- the 'free count / size capacity' is reduced to 0 as the current level
  decreases,
  which reflects that this thread is and stays at the low water mark
- the 'alloc count / size capacity' is continuously growing,
  which reflects that the current level is further and further away
  from the high water mark

When aggregating each part into table memory_summary_global_by_event_name:
- the current count / size balances are resolved, and the result is accurate
- the high watermark diverges to plus infinity,
  leading to forever growing high water marks
- the low watermark diverges to negative infinity,
  leading to negative and forever decreasing low water marks
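
The divergence can be reproduced with a toy model.
The sketch below is an assumption-laden illustration, not the server code:
MemStat and naive_sum() are hypothetical names, only sizes are tracked,
and the aggregation is assumed to be a plain field-wise sum.

```cpp
#include <cstdint>

// Hypothetical, simplified model of one stat buffer (sizes only).
struct MemStat {
  std::int64_t current = 0;
  std::int64_t alloc_capacity = 0;  // distance up to the high watermark
  std::int64_t free_capacity = 0;   // distance down to the low watermark

  void on_alloc(std::int64_t size) {
    current += size;
    free_capacity += size;
    alloc_capacity = (alloc_capacity >= size) ? alloc_capacity - size : 0;
  }

  void on_free(std::int64_t size) {
    current -= size;
    alloc_capacity += size;
    free_capacity = (free_capacity >= size) ? free_capacity - size : 0;
  }

  std::int64_t high() const { return current + alloc_capacity; }
  std::int64_t low() const { return current - free_capacity; }
};

// Assumed naive aggregation: sum every field of both buffers.
MemStat naive_sum(const MemStat &a, const MemStat &b) {
  MemStat out;
  out.current = a.current + b.current;
  out.alloc_capacity = a.alloc_capacity + b.alloc_capacity;
  out.free_capacity = a.free_capacity + b.free_capacity;
  return out;
}
```

With a producer buffer that records N allocations of 64 bytes and a
consumer buffer that records the N matching frees, the summed current
size is 0, but the summed high watermark is 64 * N and the summed
low watermark is -64 * N, so both grow in magnitude with N.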

The fundamental root cause is that the system fails to reconcile
ALLOC and FREE statistics counted in different buckets,
when a producer thread ALLOCs memory and hands it over
(without telling the instrumentation) to be FREEd by a consumer thread.

Fix:
====

A)

A preliminary fix is a refactoring,
to move complex inline code from pfs_stat.h
to a new implementation file pfs_stat.cc.

The functional fix itself is composed of several parts.

B)

Without explicit instrumentation (UN CLAIM),
it is impossible to know if allocated memory will be
later freed by the workload, or contributed by a producer thread,
so alloc stats are unchanged when a thread runs.

Upon thread termination however,
when a thread's statistics are aggregated to a parent
account / user / host,
the instrumentation:
- inspects the thread balance
- detects if the thread contributes net memory
- aggregates the net memory contributed directly to
  the global buffer, instead of the parent.

This is to avoid having intermediate buffers
(account / user / host) hold statistics for unbalanced contributions.

C)

Given that the memory instrumentation marks an allocated block
with the owner, it is possible upon FREE to detect that
a consumer thread releases memory it did not allocate in the first place.

In this case, in pfs_memory_free_vc(),
the free operation is counted against the global bucket directly.

The intent of B) and C) is to force the system to use the *same*
statistics bucket in the producer / consumer scenario.
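
The routing rules of B) and C) can be sketched as follows.
This is a minimal illustration under stated assumptions, not the
implementation in pfs_memory_free_vc(): Bucket, Block, Instrumentation,
allocate(), release() and thread_exit() are hypothetical names,
only the current size is tracked, and exactly two threads share one parent.

```cpp
#include <array>
#include <cstdint>

// Hypothetical statistics bucket: only the current size is modeled.
struct Bucket {
  std::int64_t current = 0;
};

// Hypothetical allocated block, marked with the thread that owns it.
struct Block {
  int owner;
  std::int64_t size;
};

struct Instrumentation {
  std::array<Bucket, 2> per_thread{};  // two threads, for illustration
  Bucket parent;                       // stands in for account / user / host
  Bucket global;

  Block allocate(int thread, std::int64_t size) {
    per_thread[thread].current += size;
    return Block{thread, size};
  }

  // C) A free of memory owned by another thread goes to the global bucket.
  void release(int thread, const Block &block) {
    Bucket &target = (block.owner == thread) ? per_thread[thread] : global;
    target.current -= block.size;
  }

  // B) On thread termination, a net contribution goes to the global bucket
  // instead of the parent.
  void thread_exit(int thread) {
    Bucket &t = per_thread[thread];
    Bucket &target = (t.current > 0) ? global : parent;
    target.current += t.current;
    t.current = 0;
  }
};
```

In this sketch, if thread 0 allocates two blocks and thread 1 releases them,
the bucket of thread 1 never goes negative: the frees land in the global
bucket, and the net gain of thread 0 joins them there when it terminates.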

D)

For all memory summary tables:
- memory_summary_by_thread_by_event_name
- memory_summary_by_account_by_event_name
- memory_summary_by_user_by_event_name
- memory_summary_by_host_by_event_name
- memory_summary_global_by_event_name
the aggregation code is now more complex,
and inspects intermediate statistics more closely
while performing sums.

In particular, when a given bucket contains a net loss
(current count / size is negative),
we know that the FREE operation counted did not
consume 'free count / size capacity' because there was none
present from a previous ALLOC,
so the capacity that should have been consumed is counted in
members m_missing_free_count_capacity (respectively
m_missing_free_size_capacity).

Once the aggregation is complete,
which means that the other bucket containing the matching net gain
was also aggregated, a final normalization step occurs,
to reconcile the low watermarks collected with the missing capacity collected.

This helps to balance the low watermarks.
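
One possible reading of this aggregation and normalization step is
sketched below. It is an interpretation, not the code in pfs_stat.cc:
MemStat, Aggregate, add() and normalize() are hypothetical names,
the member missing_free_capacity is only loosely modeled on
m_missing_free_size_capacity, and the exact rule used by the server
may differ.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical snapshot of one bucket (sizes only).
struct MemStat {
  std::int64_t current;
  std::int64_t alloc_capacity;
  std::int64_t free_capacity;
};

// Hypothetical aggregate that also tracks missing free capacity.
struct Aggregate {
  std::int64_t current = 0;
  std::int64_t alloc_capacity = 0;
  std::int64_t free_capacity = 0;
  std::int64_t missing_free_capacity = 0;

  void add(const MemStat &s) {
    current += s.current;
    alloc_capacity += s.alloc_capacity;
    free_capacity += s.free_capacity;
    if (s.current < 0) {
      // Net loss: these frees found no free capacity to consume,
      // so remember how much should have been consumed.
      missing_free_capacity += -s.current;
    }
  }

  // Assumed normalization: consume the summed free capacity with the
  // missing capacity, once all buckets have been added.
  void normalize() {
    const std::int64_t consumed =
        std::min(missing_free_capacity, free_capacity);
    free_capacity -= consumed;
    missing_free_capacity -= consumed;
  }

  std::int64_t low() const { return current - free_capacity; }
};
```

For a producer bucket with a net gain of 6400 bytes and a consumer bucket
with the matching net loss, the low watermark in this sketch is -6400
before normalize() and 0 after it.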

E)

In table memory_summary_global_by_event_name,
low watermarks are by definition positive or zero,
so the code adjusts them if the low estimates have diverged.

F)

Finally, a performance improvement fix is implemented.

When extending watermarks with the former ::apply_delta() method,
the same code was used to adjust both high and low watermarks.

This is unnecessary, causing too much code to be executed,
as only one watermark is moved at a time: the instrumentation
records either an ALLOC or a FREE, but never both.

As a result, apply_delta() is split into apply_alloc_delta()
and apply_free_delta(), and all the calling code is adjusted.

A side effect is that realloc is no longer instrumented as a realloc
operation.

It is now an ALLOC(new size) followed by a FREE(old size),
which is actually more correct for the high water marks,
because the total memory usage can reach old size + new size.
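
The effect on the high watermark can be checked with a toy model.
The sketch below is illustrative only: MemStat, on_alloc(), on_free()
and on_realloc() are hypothetical names, not the apply_alloc_delta() /
apply_free_delta() code, and only sizes are tracked.

```cpp
#include <cstdint>

// Hypothetical, simplified model of one stat buffer (sizes only).
struct MemStat {
  std::int64_t current = 0;
  std::int64_t alloc_capacity = 0;  // distance up to the high watermark
  std::int64_t free_capacity = 0;   // distance down to the low watermark

  void on_alloc(std::int64_t size) {
    current += size;
    free_capacity += size;
    alloc_capacity = (alloc_capacity >= size) ? alloc_capacity - size : 0;
  }

  void on_free(std::int64_t size) {
    current -= size;
    alloc_capacity += size;
    free_capacity = (free_capacity >= size) ? free_capacity - size : 0;
  }

  // A realloc is recorded as ALLOC(new size) followed by FREE(old size).
  void on_realloc(std::int64_t old_size, std::int64_t new_size) {
    on_alloc(new_size);
    on_free(old_size);
  }

  std::int64_t high() const { return current + alloc_capacity; }
  std::int64_t low() const { return current - free_capacity; }
};
```

Starting from a single 100 byte block, on_realloc(100, 150) ends with a
current size of 150 and a high watermark of 250, which is old size + new size.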

Approved by: Chris Powers <chris.powers@oracle.com>
