    Bug#24672415: UTF-8 ENCODING ACCEPTS SURROGATE PAIRS AS VALID · b66573fd
    Xing Zhang authored
    Per RFC 3629, UTF-8 prohibits encoding code points in the range
    U+D800 to U+DFFF, which are reserved for UTF-16 surrogate pairs
    and do not directly represent characters. The validity-checking
    function my_valid_mbcharlen_utf8mb3 does not perform this check.
    
    Microbenchmarks (Skylake 3.4 GHz, opt, GCC 5.4):
        BM_SimpleUTF8             296 -> 298 [- 0.7%]
        BM_UTF8MB4StringLength    35  -> 35
        BM_SimpleUTF8MB4          202 -> 205 [- 1.5%]
        BM_MixedUTF8MB4           224 -> 209 [+ 6.7%]
        BM_MixedUTF8MB4_AS_CS     690 -> 696 [- 0.9%]
        BM_UTF8_Valid_Check       339 -> 316 [+ 6.8%]
    
    Solution:
    Return MY_CS_ILSEQ if the character is in [U+D800, U+DFFF].
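
The check described above can be sketched as follows. This is a minimal illustration, not the patch itself: the function name `sketch_valid_mbcharlen_utf8mb3`, its signature, and the `SKETCH_ILSEQ` constant are hypothetical stand-ins for the real `my_valid_mbcharlen_utf8mb3` and `MY_CS_ILSEQ`, whose exact definitions are not shown here. It relies on the fact that code points U+D800 to U+DFFF, if encoded as three-byte UTF-8, always have lead byte 0xED and a second byte in 0xA0..0xBF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical placeholder standing in for MySQL's MY_CS_ILSEQ. */
#define SKETCH_ILSEQ 0

/*
 * Returns the byte length (1..3) of the valid utf8mb3 character starting
 * at s, or SKETCH_ILSEQ if the bytes are not a valid character or the
 * buffer [s, end) is too short. Four-byte sequences are rejected because
 * utf8mb3 only covers the Basic Multilingual Plane.
 */
int sketch_valid_mbcharlen_utf8mb3(const uint8_t *s, const uint8_t *end)
{
  if (s >= end) return SKETCH_ILSEQ;

  uint8_t c = s[0];

  if (c < 0x80) return 1;            /* ASCII */
  if (c < 0xC2) return SKETCH_ILSEQ; /* stray continuation or overlong lead */

  if (c < 0xE0) {                    /* two-byte sequence */
    if (end - s < 2 || (s[1] & 0xC0) != 0x80) return SKETCH_ILSEQ;
    return 2;
  }

  if (c < 0xF0) {                    /* three-byte sequence */
    if (end - s < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
      return SKETCH_ILSEQ;
    /* Overlong encoding of a code point below U+0800. */
    if (c == 0xE0 && s[1] < 0xA0) return SKETCH_ILSEQ;
    /* Surrogate range U+D800..U+DFFF: ED A0..BF xx. */
    if (c == 0xED && s[1] >= 0xA0) return SKETCH_ILSEQ;
    return 3;
  }

  return SKETCH_ILSEQ;               /* beyond utf8mb3 */
}
```

With this sketch, the bytes ED A0 80 (U+D800) and ED BF BF (U+DFFF) are rejected, while the neighbouring code points ED 9F BF (U+D7FF) and EE 80 80 (U+E000) are still accepted as three-byte characters.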