    WL#11825: Add Chinese collation for utf8mb4 · 09f92de3
    Xing Zhang authored
    We need to reorder characters to implement this Chinese collation because
    of the CLDR rule [reorder Han]. We divide all Unicode characters into five
    parts:
    1. The core group (spaces and symbols). We don't change the weight of the
       characters in this group. They sort before all other characters as in
       the DUCET.
    2. 41336 Han characters whose sorting order has been defined by CLDR. These
       characters sort after the characters of part 1.
    3. All other Han characters. These characters sort after the Han characters
       of part 2.
    4. Character groups that are between the core group and the Han group in
       the DUCET. We need to give them greater weights than all Han characters,
       so they sort after the characters of part 3.
    5. All other characters.
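
The five-part ordering above can be pictured as a sort key whose first component is the part number. The following is only a minimal sketch under stated assumptions, not the code in this change: `CLDR_HAN_ORDER` is a made-up three-character sample standing in for the 41336 characters from zh.xml, `is_han` checks only two CJK blocks, and Unicode general categories stand in for the real DUCET groups.

```python
import unicodedata

# Hypothetical tiny sample standing in for the CLDR-defined Han order (part 2).
CLDR_HAN_ORDER = ["阿", "八", "小"]
CLDR_RANK = {ch: i for i, ch in enumerate(CLDR_HAN_ORDER)}


def is_han(ch):
    """Simplified Han test: CJK Unified Ideographs and Extension A only."""
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def sort_key(ch):
    """Return (part, rank) so that characters sort by part first."""
    if ch in CLDR_RANK:
        return (2, CLDR_RANK[ch])   # part 2: Han with a CLDR-defined order
    if is_han(ch):
        return (3, ord(ch))         # part 3: all other Han characters
    major = unicodedata.category(ch)[0]
    if major in "ZPSN":
        return (1, ord(ch))         # part 1: assumed stand-in for the core group
    if major == "L":
        return (4, ord(ch))         # part 4: assumed stand-in for scripts before Han
    return (5, ord(ch))             # part 5: everything else


if __name__ == "__main__":
    print(sorted("z龘小!阿a", key=sort_key))
```

Within parts 1, 3, 4 and 5 the sketch falls back to code point order purely for illustration; the real weights come from the DUCET and the generated weight table.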
    
    Both CLDR v29 and v30 are incomplete and are missing some very common Han
    characters (like “small”). Thus we will use the zh.xml file from CLDR v33
    to implement this collation.
    
    Changed uca9-dump.cc so that uca9dump can generate the weight table files
    for the Chinese and Japanese languages at build time.
    
    Chinese collation regression test added.
    
    Benchmark results compared to the Japanese collation:
    BM_Chinese_AS_CS     18162 ns/iter    25.20 MB/sec
    BM_Japanese_AS_CS    21975 ns/iter    14.06 MB/sec
    
    Benchmark results showing its effect on other collations:
    BM_SimpleUTF8MB4            2199 ->  2157 ns/iter   [+ 1.95%]
    BM_MixedUTF8MB4             1703 ->  1707 ns/iter   [- 0.23%]
    BM_MixedUTF8MB4_AS_CI       3523 ->  3409 ns/iter   [+ 3.34%]
    BM_MixedUTF8MB4_AS_CS       5065 ->  5049 ns/iter   [+ 0.32%]
    BM_JapaneseUTF8MB4          3659 ->  3693 ns/iter   [- 0.92%]
    BM_Hungarian_AS_CS         36518 -> 37603 ns/iter   [- 2.89%]
    BM_Japanese_AS_CS          21684 -> 21880 ns/iter   [- 0.90%]
    BM_Japanese_AS_CS_KS       29542 -> 29622 ns/iter   [- 0.27%]
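
The message does not say how the bracketed percentages are computed. The listed values appear consistent with (before - after) / after, so a positive number means the collation got faster. This formula is inferred from the numbers above rather than stated by the author, and the function name below is hypothetical.

```python
def relative_change_percent(before_ns, after_ns):
    """Speed change in percent, relative to the new timing (inferred formula)."""
    return (before_ns - after_ns) / after_ns * 100.0


if __name__ == "__main__":
    # Timings in ns/iter copied from the benchmark list above.
    for name, before, after in [
        ("BM_SimpleUTF8MB4", 2199, 2157),
        ("BM_Hungarian_AS_CS", 36518, 37603),
    ]:
        print(f"{name}: {relative_change_percent(before, after):+.2f}%")
```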
    
    Change-Id: I70c3bd971c4d45ca255b8cd3406535e953e60d56