Xing Zhang authored
We need to reorder characters to implement this Chinese collation because
of the CLDR rule [reorder Han]. We divide all Unicode characters into five
parts:

1. The core group (spaces and symbols). We don't change the weights of the
   characters in this group. They sort before all other characters, as in
   the DUCET.
2. 41336 Han characters whose sorting order has been defined by CLDR.
   These characters sort after the characters of part 1.
3. All other Han characters. These characters sort after the Han
   characters of part 2.
4. Character groups which are between the core group and the Han group in
   the DUCET. We need to give them bigger weights than all Han characters,
   so they sort after the characters of part 3.
5. All other characters.

Both CLDR v29 and v30 are incomplete and are missing some very common Han
characters (like “small”). Thus we use the zh.xml file from CLDR v33 to
implement this collation.

Changed uca9-dump.cc so that uca9dump can generate weight table files for
the Chinese and Japanese languages at build time.

Chinese collation regression test added.

Benchmark results compared with the Japanese collation:

BM_Chinese_AS_CS      18162 ns/iter   25.20 MB/sec
BM_Japanese_AS_CS     21975 ns/iter   14.06 MB/sec

Benchmark results showing the effect on other collations:

BM_SimpleUTF8MB4        2199 ->  2157 ns/iter  [+ 1.95%]
BM_MixedUTF8MB4         1703 ->  1707 ns/iter  [- 0.23%]
BM_MixedUTF8MB4_AS_CI   3523 ->  3409 ns/iter  [+ 3.34%]
BM_MixedUTF8MB4_AS_CS   5065 ->  5049 ns/iter  [+ 0.32%]
BM_JapaneseUTF8MB4      3659 ->  3693 ns/iter  [- 0.92%]
BM_Hungarian_AS_CS     36518 -> 37603 ns/iter  [- 2.89%]
BM_Japanese_AS_CS      21684 -> 21880 ns/iter  [- 0.90%]
BM_Japanese_AS_CS_KS   29542 -> 29622 ns/iter  [- 0.27%]

Change-Id: I70c3bd971c4d45ca255b8cd3406535e953e60d56
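
The five-part split described above can be illustrated with a minimal
sketch. This is not the code from this change: the names Part, Classify,
SortKey and SortsBefore are hypothetical, the three-entry rank table only
stands in for the 41336 entries taken from zh.xml, and the range checks
for "core", "Han" and "between" are rough simplifications rather than the
real DUCET groups.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>

// Hypothetical labels for the five parts listed in the message.
enum class Part : int {
  kCore = 1,       // spaces and symbols; weights left unchanged
  kHanInCldr = 2,  // Han characters whose order CLDR defines
  kHanOther = 3,   // all remaining Han characters
  kBetween = 4,    // groups between core and Han in the DUCET
  kOther = 5       // everything else
};

// Assumption: an illustrative stand-in for the CLDR ordering.
// Smaller rank sorts first; only the relative order matters here.
inline const std::map<char32_t, int> &CldrHanRank() {
  static const std::map<char32_t, int> ranks = {
      {0x963F, 0},  // 阿
      {0x516B, 1},  // 八
      {0x5C0F, 2},  // 小
  };
  return ranks;
}

// Simplification: only the main CJK ideograph blocks are checked.
inline bool IsHan(char32_t cp) {
  return (cp >= 0x3400 && cp <= 0x4DBF) || (cp >= 0x4E00 && cp <= 0x9FFF) ||
         (cp >= 0x20000 && cp <= 0x2FFFF);
}

inline bool IsAsciiLetter(char32_t cp) {
  return (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z');
}

inline Part Classify(char32_t cp) {
  // Assumption: ASCII non-letters stand in for the core group.
  if (cp < 0x80 && !IsAsciiLetter(cp)) return Part::kCore;
  if (IsHan(cp)) {
    return CldrHanRank().count(cp) ? Part::kHanInCldr : Part::kHanOther;
  }
  // Assumption: ASCII letters stand in for the scripts that lie
  // between the core group and Han in the DUCET.
  if (IsAsciiLetter(cp)) return Part::kBetween;
  return Part::kOther;
}

// Sort key: the part number first, then a rank within the part.
// Outside part 2 the code point stands in for the original weight.
inline std::tuple<int, uint32_t> SortKey(char32_t cp) {
  assert(cp <= 0x10FFFF);
  const Part part = Classify(cp);
  uint32_t rank = static_cast<uint32_t>(cp);
  if (part == Part::kHanInCldr) {
    rank = static_cast<uint32_t>(CldrHanRank().at(cp));
  }
  return std::make_tuple(static_cast<int>(part), rank);
}

inline bool SortsBefore(char32_t a, char32_t b) {
  return SortKey(a) < SortKey(b);
}
```

With this key, a Han character listed in the rank table sorts before any
unlisted Han character, and both sort before Latin letters, which is the
effect [reorder Han] asks for. The real implementation works on collation
weights rather than on a (part, rank) pair.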