Steinar H. Gunderson authored
Under utf8mb4_0900_ai_ci, strnxfrmlen() for a VARCHAR(1) (taking four bytes) returns 32. The rationale for this choice (strnxfrm_multiply=8) is not documented. However, no character has more than eight weights, which gives 16 bytes, so we are creating sort keys twice as long as we need.

For as_cs, this is more complicated; we again have a 2x bloat factor, but we need four static bytes for the weight separators. Thankfully, these two effects work against each other, so the bloat absorbs the weight separators in all cases. Similarly, for utf8mb4_ja_0900_as_cs, we add extra weights for some characters on the primary level, but again, the extra bloat happens to save us.

We tighten the bounds by changing the Unicode 9.0.0 collations from strnxfrmlen_simple to a custom-built function that takes into account weight separators, reordering and the like, and set tight bounds for these collations.

We document the implicit assumptions in strnxfrmlen() about what the input parameter actually means, and add a unit test that runs every character (by itself) through every collation to verify that the property holds.

Change-Id: I8f37a890bff146e4b1db39050b6cc274ecae0c59
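The arithmetic above can be sketched roughly as follows. This is only an illustration, not the code in the patch: simple_bound() and tight_bound() are hypothetical names, and the sketch assumes two bytes per weight, at most eight weights per character on each level, and one two-byte separator between adjacent levels. It ignores the extra primary-level weights of utf8mb4_ja_0900_as_cs and any reordering.

```cpp
#include <cstddef>

// Hypothetical helpers for illustration; not the functions in the MySQL source.

// Assumption: every weight in the sort key occupies two bytes.
constexpr std::size_t kBytesPerWeight = 2;

// Old-style bound: input length in bytes times a fixed per-collation multiplier
// (the strnxfrm_multiply idea described in the commit message).
constexpr std::size_t simple_bound(std::size_t input_bytes,
                                   std::size_t multiply) {
  return input_bytes * multiply;
}

// Tighter bound, counted in characters rather than bytes.
// Assumes max_weights_per_char weights on each of `levels` levels (levels >= 1)
// and one two-byte separator between adjacent levels.
constexpr std::size_t tight_bound(std::size_t num_chars,
                                  std::size_t max_weights_per_char,
                                  std::size_t levels) {
  return num_chars * max_weights_per_char * kBytesPerWeight * levels +
         (levels - 1) * kBytesPerWeight;
}

// VARCHAR(1) in utf8mb4 is four bytes; a multiplier of 8 yields 32 bytes.
static_assert(simple_bound(4, 8) == 32, "old bound for VARCHAR(1)");

// One character, at most eight weights, one level (ai_ci): 16 bytes.
static_assert(tight_bound(1, 8, 1) == 16, "tight bound for ai_ci");

// Three levels (as_cs) add two separators, i.e. four static bytes.
static_assert(tight_bound(1, 8, 3) - 3 * tight_bound(1, 8, 1) == 4,
              "separator overhead for as_cs");
```

Under these assumptions, the single-level bound for one character is half of what the byte-based multiplier reserves, and the only length-independent cost of the three-level case is the four separator bytes.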