    af2431ff
    BUG#22246001 UNDOCUMENTED NEGATIVE LOW ESTIMATES IN P_S.MEMORY_SUMMARY_GLOBAL_BY_EVENT_NAME · af2431ff
    Marc Alff authored
    Problem:
    ========
    
    When monitoring the content of table
    performance_schema.memory_summary_global_by_event_name,
    with a production workload:
    - low water marks (LOW_COUNT_USED, LOW_NUMBER_OF_BYTES_USED)
      can have negative values
    - high watermarks (HIGH_COUNT_USED, HIGH_NUMBER_OF_BYTES_USED)
      have forever growing values, even when the server
      does not really keep increasing memory usage.
    
    Context:
    ========
    
    The memory instrumentation maintains statistics as described below.
    
    When an ALLOC is instrumented:
    - the current count / size is increased
    - the 'alloc count / size capacity',
      which represents the distance between the current level
      and the high water mark, is (conditionally) reduced.
    - the 'free count / size capacity',
      which represents the distance between the current level
      and the low water mark, is (always) increased.
    
    Likewise, when a FREE is instrumented:
    - the current count / size is decreased
    - the 'alloc count / size capacity'
      is (always) increased.
    - the 'free count / size capacity'
      is (conditionally) reduced.
    
    When an ALLOC followed by a corresponding FREE is instrumented,
    for the *same* stat buffer:
    - the current count / size is unchanged
    - the high watermark may be increased to reflect the top memory usage
    - the low watermark is unchanged.
    
    When repeating N balanced ALLOC + FREE operations as part of a workload,
    still for the same stat buffer:
    - the current count / size is unchanged,
    - the high watermark reflects the top usage, and is stable,
    - the low watermark reflects the bottom usage, and is stable.
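The bookkeeping described above can be sketched as follows. This is a minimal model, not the server code: the type StatBuffer and its members are hypothetical names, only sizes are tracked (counts would follow the same pattern), and "conditionally reduced" is assumed to mean that a capacity is clamped at zero, with the matching watermark extended by the remainder.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of one statistics buffer (sizes only).
struct StatBuffer {
  std::int64_t current = 0;         // bytes currently allocated
  std::int64_t alloc_capacity = 0;  // distance from current up to the high watermark
  std::int64_t free_capacity = 0;   // distance from current down to the low watermark

  void on_alloc(std::int64_t size) {
    assert(size >= 0);
    current += size;
    // Conditionally reduced: once the capacity is exhausted it stays at 0,
    // which implicitly raises the high watermark.
    alloc_capacity = (alloc_capacity >= size) ? alloc_capacity - size : 0;
    free_capacity += size;  // always increased
  }

  void on_free(std::int64_t size) {
    assert(size >= 0);
    current -= size;
    alloc_capacity += size;  // always increased
    // Conditionally reduced: once exhausted, the low watermark is lowered.
    free_capacity = (free_capacity >= size) ? free_capacity - size : 0;
  }

  std::int64_t high() const { return current + alloc_capacity; }
  std::int64_t low() const { return current - free_capacity; }
};
```

With this model, repeating balanced on_alloc(100) / on_free(100) pairs on the same buffer leaves current at 0 and keeps both watermarks stable.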
    
    Root cause:
    ===========
    
    For some memory instruments however, the pattern seen at runtime is
    different.
    
    For memory instruments which are not flagged 'global',
    statistics are maintained per thread, then account / user / host,
    and then finally in a global buffer.
    
    For workloads where one thread acts as a producer of data,
    and *another* thread acts as a consumer of that data,
    and if the piece of memory transferred between the producer
    and the consumer is:
    - never UNCLAIMED (as in pfs_memory_claim_vc(false)) by the producer
    - never CLAIMED (as in pfs_memory_claim_vc(true)) by the consumer
    then the instrumentation behaves as follows.
    
    For the producer thread (here, the workload):
    - the current count / size continuously grows,
      because the workload is not balanced
    - the 'alloc count / size capacity' is reduced to 0 as the current level grows,
      which reflects that this thread is and stays at the high water mark
    - the 'free count / size capacity' is continuously growing,
      which reflects that the current level is further and further away
      from the low water mark
    
    For the consumer thread (typically a system thread like innodb cleaners):
    - the current count / size continuously decreases, becomes negative,
      and continues to decrease without limit,
    - the 'free count / size capacity' is reduced to 0 as the current level
      decreases,
      which reflects that this thread is and stays at the low water mark
    - the 'alloc count / size capacity' is continuously growing,
      which reflects that the current level is further and further away
      from the high water mark
    
    When aggregating each part to table memory_summary_global_by_event_name:
    - the current count / size balances are resolved, and the result is accurate
    - the high watermark diverges to plus infinity,
      leading to forever growing high water marks
    - the low watermark diverges to negative infinity,
      leading to negative and forever decreasing low water marks
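The divergence can be reproduced with a small model. This is a sketch under stated assumptions: HandoffBucket, naive_sum() and run_handoff() are hypothetical names, only sizes are tracked, and the aggregation is assumed to be a plain member-wise sum of the two thread buckets.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-thread bucket, tracking sizes only.
struct HandoffBucket {
  std::int64_t current = 0;
  std::int64_t alloc_capacity = 0;  // distance up to the high watermark
  std::int64_t free_capacity = 0;   // distance down to the low watermark

  void on_alloc(std::int64_t size) {
    assert(size >= 0);
    current += size;
    alloc_capacity = (alloc_capacity >= size) ? alloc_capacity - size : 0;
    free_capacity += size;
  }

  void on_free(std::int64_t size) {
    assert(size >= 0);
    current -= size;
    alloc_capacity += size;
    free_capacity = (free_capacity >= size) ? free_capacity - size : 0;
  }

  std::int64_t high() const { return current + alloc_capacity; }
  std::int64_t low() const { return current - free_capacity; }
};

// Assumed naive aggregation: sum every member of both buckets.
HandoffBucket naive_sum(const HandoffBucket &a, const HandoffBucket &b) {
  HandoffBucket total;
  total.current = a.current + b.current;
  total.alloc_capacity = a.alloc_capacity + b.alloc_capacity;
  total.free_capacity = a.free_capacity + b.free_capacity;
  return total;
}

// The producer only allocates, the consumer only frees, n times each.
HandoffBucket run_handoff(int n, std::int64_t size) {
  HandoffBucket producer;
  HandoffBucket consumer;
  for (int i = 0; i < n; ++i) {
    producer.on_alloc(size);
    consumer.on_free(size);
  }
  return naive_sum(producer, consumer);
}
```

In this model, run_handoff(n, size) always yields a current size of 0, while high() and low() grow in magnitude with n, matching the symptoms listed in the problem statement.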
    
    The fundamental root cause is that the system fails to reconcile
    ALLOC and FREE statistics counted in different buckets,
    when a producer thread ALLOCs memory and hands it over
    (without telling the instrumentation) to a consumer thread,
    which then FREEs it.
    
    Fix:
    ====
    
    A)
    
    A preliminary fix is a refactoring,
    to move complex inline code from pfs_stat.h
    to a new implementation file pfs_stat.cc.
    
    The functional fix itself is composed of several parts.
    
    B)
    
    Without explicit instrumentation (UNCLAIM),
    it is impossible to know if allocated memory will be
    later freed by the workload, or contributed by a producer thread,
    so alloc stats are unchanged when a thread runs.
    
    Upon thread termination however,
    when a thread's statistics are aggregated to a parent
    account / user / host,
    the instrumentation:
    - inspects the thread balance
    - detects if the thread contributes net memory
    - aggregates the net memory contributed directly to
      the global buffer, instead of the parent.
    
    This is to avoid having intermediate buffers
    (account / user / host) hold statistics for unbalanced contributions.
    
    C)
    
    Given that the memory instrumentation marks an allocated block
    with the owner, it is possible upon FREE to detect that
    a consumer thread releases memory it did not allocate in the first place.
    
    In this case, in pfs_memory_free_vc(),
    the free operation is counted against the global bucket directly.
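The routing decision in C) can be illustrated as below. This is only a sketch: TaggedBlock, FlowStats and OwnerAwareInstrumentation are hypothetical types, thread identities are plain integers, and the real pfs_memory_free_vc() has a different signature and does more work.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical block header: remembers which thread allocated the block.
struct TaggedBlock {
  int owner_thread;
  std::int64_t size;
};

// Hypothetical per-bucket totals (sizes only).
struct FlowStats {
  std::int64_t allocated = 0;
  std::int64_t freed = 0;
};

struct OwnerAwareInstrumentation {
  std::map<int, FlowStats> per_thread;
  FlowStats global;

  TaggedBlock on_alloc(int thread_id, std::int64_t size) {
    assert(size >= 0);
    per_thread[thread_id].allocated += size;
    TaggedBlock block;
    block.owner_thread = thread_id;
    block.size = size;
    return block;
  }

  void on_free(int thread_id, const TaggedBlock &block) {
    if (block.owner_thread == thread_id) {
      // Balanced case: the thread frees what it allocated itself.
      per_thread[thread_id].freed += block.size;
    } else {
      // Consumer case: the block came from another thread, so the FREE
      // is counted against the global bucket, not the consumer's bucket.
      global.freed += block.size;
    }
  }
};
```

With this routing, a consumer thread never accumulates a negative balance in its own bucket for memory it did not allocate.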
    
    The intent of B) and C) is to force the system to use the *same*
    statistics bucket in the producer / consumer scenario.
    
    D)
    
    For every memory summary table:
    - memory_summary_by_thread_by_event_name
    - memory_summary_by_account_by_event_name
    - memory_summary_by_user_by_event_name
    - memory_summary_by_host_by_event_name
    - memory_summary_global_by_event_name
    the aggregation code is now more complex,
    and inspects intermediate statistics more closely
    while performing sums.
    
    In particular, when a given bucket contains a net loss
    (current count / size is negative),
    we know that the FREE operation counted did not
    consume 'free count / size capacity' because there was none
    present from a previous ALLOC,
    so the capacity that should have been consumed is counted in
    members m_missing_free_count_capacity (respectively
    m_missing_free_size_capacity).
    
    Once the aggregation is complete,
    which means that the other bucket containing the matching net gain
    was also aggregated, a final normalization step occurs,
    to offset the low watermarks collected against the missing capacity collected.
    
    This helps to balance the low watermarks.
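One possible reading of this aggregation and normalization is sketched below. It is an assumption-laden model, not the actual code: BucketStat, Summary and summarize_low() are hypothetical, only sizes are tracked, the member missing_free_capacity loosely stands in for m_missing_free_size_capacity, and the missing capacity of a bucket is assumed to equal its net loss.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-bucket input: current size and free capacity.
struct BucketStat {
  std::int64_t current;
  std::int64_t free_capacity;
};

struct Summary {
  std::int64_t current = 0;
  std::int64_t free_capacity = 0;
  std::int64_t missing_free_capacity = 0;

  void add(const BucketStat &b) {
    assert(b.free_capacity >= 0);
    current += b.current;
    free_capacity += b.free_capacity;
    if (b.current < 0) {
      // Net loss: these FREEs had no capacity to consume in their own bucket.
      missing_free_capacity += -b.current;
    }
  }

  // Final step: consume the summed free capacity with the missing capacity,
  // then derive the low watermark from what remains.
  std::int64_t normalized_low() const {
    std::int64_t remaining =
        free_capacity - std::min(free_capacity, missing_free_capacity);
    return current - remaining;
  }
};

std::int64_t summarize_low(const std::vector<BucketStat> &buckets) {
  Summary s;
  for (const BucketStat &b : buckets) s.add(b);
  return s.normalized_low();
}
```

For a producer bucket {1000, 1000} and a consumer bucket {-1000, 0}, a plain sum would give a low watermark of -1000, whereas this normalization gives 0.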
    
    E)
    
    In table memory_summary_global_by_event_name,
    low watermarks are by definition positive or zero,
    so the code adjusts them if low estimates diverged.
    
    F)
    
    Finally, a performance improvement fix is implemented.
    
    When extending watermarks with the former ::apply_delta() method,
    the same code was used to adjust both high and low watermarks.
    
    This is unnecessary, causing too much code to be executed,
    as only one watermark is moved at a time: the instrumentation
    records either an ALLOC or a FREE, but never both.
    
    As a result, apply_delta() is split into apply_alloc_delta()
    and apply_free_delta(), and all the calling code is adjusted.
    
    A side effect is that realloc is no longer instrumented as a realloc
    operation.
    
    It is now an ALLOC(new size) followed by a FREE(old size),
    which is actually more correct for the high water marks,
    because the total memory usage can reach old size + new size.
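The effect on the high watermark can be checked with a small model. As before, this is a sketch: ReallocStat and on_realloc() are hypothetical names, only sizes are tracked, and capacities are assumed to be clamped at zero.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical statistics buffer, tracking sizes only.
struct ReallocStat {
  std::int64_t current = 0;
  std::int64_t alloc_capacity = 0;  // distance up to the high watermark
  std::int64_t free_capacity = 0;   // distance down to the low watermark

  void on_alloc(std::int64_t size) {
    assert(size >= 0);
    current += size;
    alloc_capacity = (alloc_capacity >= size) ? alloc_capacity - size : 0;
    free_capacity += size;
  }

  void on_free(std::int64_t size) {
    assert(size >= 0);
    current -= size;
    alloc_capacity += size;
    free_capacity = (free_capacity >= size) ? free_capacity - size : 0;
  }

  // A realloc is recorded as ALLOC(new size) followed by FREE(old size).
  void on_realloc(std::int64_t old_size, std::int64_t new_size) {
    on_alloc(new_size);
    on_free(old_size);
  }

  std::int64_t high() const { return current + alloc_capacity; }
};
```

Starting from a single 100 byte block, on_realloc(100, 150) leaves current at 150 but raises high() to 250, which accounts for both blocks coexisting during the copy.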
    
    Approved by: Chris Powers <chris.powers@oracle.com>