strings/lang_data/zh_hans.txt · 3e90d07c3578e4da39dc1bce73559bbdf655c28c · Rasoul Jahanshahi / Mysql Server

Dec 13, 2018

WL#11825: Add Chinese collation for utf8mb4 · 09f92de3

Xing Zhang authored Dec 13, 2018

We need to reorder characters to implement this Chinese collation because
of the CLDR rule [reorder Han]. We divide all Unicode characters into five
parts:
1. The core group (spaces and symbols). We don't change the weight of the
   characters in this group. They sort before all other characters as in
   the DUCET.
2. 41336 Han characters whose sorting order has been defined by CLDR. These
   characters sort after the characters of part 1.
3. All other Han characters. These characters sort after the Han characters
   of part 2.
4. Character groups which are between the core group and the Han group in
   the DUCET. We need to give them larger weights than all Han characters,
   so they sort after the characters of part 3.
5. All other characters.
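
The five-part ordering above can be illustrated with a minimal sketch. It is
not the code from this commit: the real collation works on UCA weight tables,
while this toy sorts single characters by (part, rank). The names
CLDR_HAN_ORDER, is_han, part and sort_key are hypothetical. The two-character
CLDR list is a stand-in for the 41336 characters from zh.xml. Treating
punctuation as part of the core group, and treating unassigned code points as
"all other characters", are assumptions made for illustration only.

```python
import unicodedata

# Hypothetical stand-in for the 41336 Han characters ordered by CLDR's zh.xml.
CLDR_HAN_ORDER = ["大", "小"]
CLDR_HAN_RANK = {ch: i for i, ch in enumerate(CLDR_HAN_ORDER)}


def is_han(ch: str) -> bool:
    """Approximate Han test: Python exposes no script property, so use the name."""
    name = unicodedata.name(ch, "")
    return name.startswith(("CJK UNIFIED IDEOGRAPH", "CJK COMPATIBILITY IDEOGRAPH"))


def part(ch: str) -> int:
    """Return the part number (1 to 5) of a single character."""
    category = unicodedata.category(ch)
    if category[0] in "ZPS":      # assumption: spaces, punctuation, symbols
        return 1
    if ch in CLDR_HAN_RANK:       # Han characters with a CLDR-defined order
        return 2
    if is_han(ch):                # all other Han characters
        return 3
    if category == "Cn":          # assumption: unassigned code points
        return 5
    return 4                      # other scripts, moved after Han


def sort_key(ch: str) -> tuple:
    """Toy key: part first, then CLDR rank (part 2) or code point (other parts)."""
    p = part(ch)
    rank = CLDR_HAN_RANK[ch] if p == 2 else ord(ch)
    return (p, rank)


if __name__ == "__main__":
    sample = ["a", "龘", "小", " ", "大"]
    print(sorted(sample, key=sort_key))
```

In this sketch the space sorts first, the two CLDR-ordered Han characters
follow in list order, the remaining Han character comes next, and the Latin
letter sorts after all Han characters.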

Both CLDR v29 and v30 are incomplete: they are missing some very common Han
characters (like “small”). We therefore use the zh.xml file from CLDR v33
to implement this collation.

Changed uca9-dump.cc so that uca9dump can generate the weight table files for
the Chinese and Japanese languages at build time.

Chinese collation regression test added.

Benchmark results compared with the Japanese collation:
BM_Chinese_AS_CS     18162 ns/iter    25.20 MB/sec
BM_Japanese_AS_CS    21975 ns/iter    14.06 MB/sec

Benchmark results showing the effect on other collations (before -> after):
BM_SimpleUTF8MB4            2199 ->  2157 ns/iter   [+ 1.95%]
BM_MixedUTF8MB4             1703 ->  1707 ns/iter   [- 0.23%]
BM_MixedUTF8MB4_AS_CI       3523 ->  3409 ns/iter   [+ 3.34%]
BM_MixedUTF8MB4_AS_CS       5065 ->  5049 ns/iter   [+ 0.32%]
BM_JapaneseUTF8MB4          3659 ->  3693 ns/iter   [- 0.92%]
BM_Hungarian_AS_CS         36518 -> 37603 ns/iter   [- 2.89%]
BM_Japanese_AS_CS          21684 -> 21880 ns/iter   [- 0.90%]
BM_Japanese_AS_CS_KS       29542 -> 29622 ns/iter   [- 0.27%]
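
The bracketed percentages are not defined in the message. The sketch below
assumes they are (before - after) / after, so that a positive value means the
benchmark became faster. That formula is inferred from the figures and is not
stated by the author. The names relative_change and RESULTS are hypothetical.

```python
def relative_change(before_ns: float, after_ns: float) -> float:
    """Percentage change relative to the new timing; positive means faster."""
    return (before_ns - after_ns) / after_ns * 100.0


# (before, after) timings in ns/iter, copied from the table above.
RESULTS = {
    "BM_SimpleUTF8MB4": (2199, 2157),
    "BM_MixedUTF8MB4": (1703, 1707),
    "BM_MixedUTF8MB4_AS_CI": (3523, 3409),
    "BM_MixedUTF8MB4_AS_CS": (5065, 5049),
    "BM_JapaneseUTF8MB4": (3659, 3693),
    "BM_Hungarian_AS_CS": (36518, 37603),
    "BM_Japanese_AS_CS": (21684, 21880),
    "BM_Japanese_AS_CS_KS": (29542, 29622),
}

if __name__ == "__main__":
    for name, (before, after) in RESULTS.items():
        print(f"{name:<24}{relative_change(before, after):+.2f}%")
```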

Change-Id: I70c3bd971c4d45ca255b8cd3406535e953e60d56
