    70274e05
    Bug #25750527: STRNXFRMLEN IS TOO CONSERVATIVE FOR UNICODE 9.0.0 COLLATIONS · 70274e05
    Steinar H. Gunderson authored
    Under utf8mb4_0900_ai_ci, strnxfrmlen() for a VARCHAR(1) (taking four bytes)
    returns 32. The rationale for this choice (strnxfrm_multiply=8) is not
    documented. However, no character has more than eight weights, which take
    16 bytes, so the sort keys we create are twice as long as they need to be.
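
The arithmetic above can be sketched as a compile-time check. This is only an illustration of the numbers quoted in the message: the constant and function names are hypothetical and are not taken from the MySQL source, and the two-bytes-per-weight figure is inferred from "eight weights, giving 16 bytes".

```cpp
#include <cstddef>

// Figures quoted in the commit message; all names here are illustrative.
constexpr std::size_t kMaxBytesPerChar = 4;    // a utf8mb4 VARCHAR(1) takes four bytes
constexpr std::size_t kStrnxfrmMultiply = 8;   // strnxfrm_multiply=8
constexpr std::size_t kMaxWeightsPerChar = 8;  // no character has more weights
constexpr std::size_t kBytesPerWeight = 2;     // assumed: 8 weights give 16 bytes

// Bound as described for the old behaviour: input bytes times the multiplier.
constexpr std::size_t old_bound(std::size_t num_chars) {
  return num_chars * kMaxBytesPerChar * kStrnxfrmMultiply;
}

// Bound actually needed when only the per-character weights are counted.
constexpr std::size_t tight_bound(std::size_t num_chars) {
  return num_chars * kMaxWeightsPerChar * kBytesPerWeight;
}

static_assert(old_bound(1) == 32, "VARCHAR(1) was given 32 bytes");
static_assert(tight_bound(1) == 16, "16 bytes are enough");
static_assert(old_bound(1) == 2 * tight_bound(1), "2x bloat factor");
```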
    
    For as_cs, this is more complicated; we again have a 2x bloat factor, but we
    need four static bytes for the weight separators. Thankfully, these two
    effects go against each other, so the bloat absorbs the weight separators in
    all cases. Similarly, for utf8mb4_ja_0900_as_cs, we add extra weights for
    some characters on the primary level, but again, the extra bloat happens to
    save us.
    
    We make the bounds tighter by changing the Unicode 9.0.0 collations from
    using strnxfrmlen_simple to a custom-built function that takes into account
    weight separators, reordering and the like, and set tight bounds for these
    collations. We document the implicit assumptions in strnxfrmlen() about what
    the input parameter actually means, and add a unit test that runs every
    character (by itself) through every collation to verify that the bound holds.
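
The shape of such an exhaustive test might look like the sketch below. It does not use the real MySQL collation API: `ToyCollation`, `make_toy` and `first_violation` are hypothetical stand-ins, and the toy weight function is invented purely so the example is self-contained.

```cpp
#include <cstddef>
#include <functional>
#include <optional>

// Hypothetical stand-in for a collation; NOT the MySQL CHARSET_INFO interface.
struct ToyCollation {
  // Claimed upper bound on sort key bytes for num_chars characters.
  std::function<std::size_t(std::size_t)> max_sort_key_len;
  // Sort key length actually produced for one code point.
  std::function<std::size_t(char32_t)> sort_key_len;
};

// Invented toy: each code point yields 1..8 two-byte weights.
ToyCollation make_toy(std::size_t bound_per_char) {
  return ToyCollation{
      [bound_per_char](std::size_t n) { return n * bound_per_char; },
      [](char32_t cp) { return 2 * (1 + static_cast<std::size_t>(cp % 8)); }};
}

// Runs every code point (by itself) through the collation and returns the
// first one whose sort key is longer than the claimed bound, if any.
std::optional<char32_t> first_violation(const ToyCollation &cs) {
  const std::size_t bound = cs.max_sort_key_len(1);
  for (char32_t cp = 0; cp <= 0x10FFFF; ++cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF) continue;  // skip surrogate code points
    if (cs.sort_key_len(cp) > bound) return cp;
  }
  return std::nullopt;
}
```

With the toy weights above, a 16-byte bound holds for every code point, while a 14-byte bound is first violated by code point 7. A real test would repeat the loop for every collation rather than for a single toy.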
    
    Change-Id: I8f37a890bff146e4b1db39050b6cc274ecae0c59