mysql-test/suite/innodb/t/innodb-2byte-collation.test · mysql-8.0.3 · Rasoul Jahanshahi / Mysql Server

Mar 20, 2017

Bug #25750527: STRNXFRMLEN IS TOO CONSERVATIVE FOR UNICODE 9.0.0 COLLATIONS · 70274e05

Steinar H. Gunderson authored Mar 20, 2017

Under utf8mb4_0900_ai_ci, strnxfrmlen() for a VARCHAR(1) (taking four bytes)
returns 32. The rationale for this choice (strnxfrm_multiply=8) is not
documented. However, no character has more than eight weights, which at two
bytes per weight gives 16 bytes, so we are creating sort keys twice as long
as we need to.
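
The arithmetic above can be sketched as follows. This is only an
illustration of the numbers quoted in this message, not the server code;
simple_bound() and tight_bound() are hypothetical names, and the constants
are the ones stated in the text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative constants taken from the description above.
constexpr std::size_t kMaxBytesPerChar = 4;    // utf8mb4: up to four bytes per character
constexpr std::size_t kStrnxfrmMultiply = 8;   // the old, undocumented multiplier
constexpr std::size_t kMaxWeightsPerChar = 8;  // no character has more weights than this

// Old-style bound (hypothetical helper): input length in bytes times a
// fixed multiplier.
constexpr std::size_t simple_bound(std::size_t input_bytes) {
  return input_bytes * kStrnxfrmMultiply;
}

// Tighter single-level bound (hypothetical helper): each character yields at
// most kMaxWeightsPerChar weights of 16 bits each.
constexpr std::size_t tight_bound(std::size_t num_chars) {
  return num_chars * kMaxWeightsPerChar * sizeof(std::uint16_t);
}

// VARCHAR(1): the old bound is exactly twice the tight one.
static_assert(simple_bound(1 * kMaxBytesPerChar) == 32, "old bound");
static_assert(tight_bound(1) == 16, "tight bound");
```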

For as_cs, this is more complicated; we again have a 2x bloat factor, but we
need four static bytes for the weight separators. Thankfully, these two
effects go against each other, so the bloat absorbs the weight separators in
all cases. Similarly, for utf8mb4_ja_0900_as_cs, we add extra weights for
some characters on the primary level, but again, the extra bloat happens to
save us.
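
A rough sketch of the multi-level case, under stated assumptions: three
levels for as_cs (inferred from the four separator bytes), 16-bit weights and
separators, and the same maximum of eight weights per character on each
level. All helper names are hypothetical, and the extra primary-level weights
of utf8mb4_ja_0900_as_cs are deliberately not modeled.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kBytesPerWeight = 2;          // assumption: 16-bit weights and separators
constexpr std::size_t kWeightsPerCharPerLevel = 8;  // assumption: same maximum on every level

// Bytes taken by the weights themselves, for num_chars characters compared
// on num_levels levels.
constexpr std::size_t weight_payload(std::size_t num_chars,
                                     std::size_t num_levels) {
  return num_chars * num_levels * kWeightsPerCharPerLevel * kBytesPerWeight;
}

// One separator between each pair of adjacent levels; this part does not
// grow with the string length. Requires num_levels >= 1.
constexpr std::size_t separator_bytes(std::size_t num_levels) {
  return (num_levels - 1) * kBytesPerWeight;
}

// Weights plus the static separator overhead.
constexpr std::size_t multi_level_bound(std::size_t num_chars,
                                        std::size_t num_levels) {
  return weight_payload(num_chars, num_levels) + separator_bytes(num_levels);
}

// Three levels give the four static bytes mentioned above.
static_assert(separator_bytes(3) == 4, "two separators of two bytes each");
```

Under these assumptions, a bound that is twice the weight payload always
exceeds the payload plus separators, which matches the observation that the
bloat absorbs the separators.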

We tighten the bounds by switching the Unicode 9.0.0 collations from
strnxfrmlen_simple to a custom-built function that takes weight separators,
reordering and the like into account. We document the implicit assumptions
in strnxfrmlen() about what
the input parameter actually means, and add a unit test to run every character
(by itself) through every collation to verify that the property holds.
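
The shape of such a per-character check might look like the sketch below. It
is not the unit test added by this change: SortKeyLenFn and BoundFn are
hypothetical stand-ins for a collation's transform and length functions, and
the bound is queried with four bytes per character, following the VARCHAR(1)
example above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Encode one Unicode scalar value as UTF-8 (one to four bytes).
std::string encode_utf8(char32_t cp) {
  std::string out;
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}

// Hypothetical stand-ins for one collation:
// actual sort key length for an input string, and the bound for a given
// input size in bytes.
using SortKeyLenFn = std::function<std::size_t(const std::string &)>;
using BoundFn = std::function<std::size_t(std::size_t)>;

constexpr std::size_t kColumnBytesPerChar = 4;  // assumption: utf8mb4 column

// Runs every code point, by itself, through the collation and counts how
// many produce a sort key longer than the bound allows.
std::size_t count_bound_violations(const SortKeyLenFn &key_len,
                                   const BoundFn &bound) {
  const std::size_t limit = bound(kColumnBytesPerChar);
  std::size_t violations = 0;
  for (char32_t cp = 0; cp <= 0x10FFFF; ++cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF) continue;  // surrogates are not characters
    if (key_len(encode_utf8(cp)) > limit) ++violations;
  }
  return violations;
}
```

A real test would repeat this for every collation and expect zero
violations in each.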

Change-Id: I8f37a890bff146e4b1db39050b6cc274ecae0c59
