Xing Zhang authored
Per RFC 3629, UTF-8 should prohibit characters between U+D800 and U+DFFF, which are reserved for surrogate pairs and do not directly represent characters, but the validity checking function my_valid_mbcharlen_utf8mb3 doesn't do the check.

Solution: Return MY_CS_ILSEQ if the character is in [U+D800, U+DFFF].

Microbenchmarks (Skylake 3.4 GHz, opt, GCC 5.4):

BM_SimpleUTF8            296 -> 298  [- 0.7%]
BM_UTF8MB4StringLength    35 ->  35
BM_SimpleUTF8MB4         202 -> 205  [- 1.5%]
BM_MixedUTF8MB4          224 -> 209  [+ 6.7%]
BM_MixedUTF8MB4_AS_CS    690 -> 696  [- 0.9%]
BM_UTF8_Valid_Check      339 -> 316  [+ 6.8%]
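For illustration only, here is a minimal sketch of what such a check can look like. It is not the actual body of my_valid_mbcharlen_utf8mb3: the function name valid_utf8mb3_len and the ILLEGAL_SEQUENCE macro (a stand-in for MY_CS_ILSEQ) are hypothetical. The sketch relies on the fact that U+D800..U+DFFF encode as ED A0 80 through ED BF BF, so a lead byte of 0xED followed by a second byte of 0xA0 or higher identifies a surrogate.

```c
#include <stddef.h>

/* Hypothetical stand-in for MY_CS_ILSEQ; not taken from the MySQL headers. */
#define ILLEGAL_SEQUENCE 0

/*
 * Sketch: return the byte length (1 to 3) of the valid UTF-8 sequence that
 * starts at s, or ILLEGAL_SEQUENCE if the bytes are invalid, truncated, or
 * need four bytes (outside the 3-byte repertoire).
 */
static int valid_utf8mb3_len(const unsigned char *s, const unsigned char *end)
{
  if (s >= end) return ILLEGAL_SEQUENCE;

  unsigned char c = s[0];
  if (c < 0x80) return 1;                 /* ASCII */
  if (c < 0xC2) return ILLEGAL_SEQUENCE;  /* stray continuation or overlong lead */

  if (c < 0xE0) {                         /* 2-byte sequence */
    if (end - s < 2 || (s[1] & 0xC0) != 0x80) return ILLEGAL_SEQUENCE;
    return 2;
  }

  if (c < 0xF0) {                         /* 3-byte sequence */
    if (end - s < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80)
      return ILLEGAL_SEQUENCE;
    /* Overlong encoding of a code point below U+0800. */
    if (c == 0xE0 && s[1] < 0xA0) return ILLEGAL_SEQUENCE;
    /* Surrogate range U+D800..U+DFFF: ED A0 80 .. ED BF BF. */
    if (c == 0xED && s[1] >= 0xA0) return ILLEGAL_SEQUENCE;
    return 3;
  }

  return ILLEGAL_SEQUENCE;                /* 4-byte leads are not 3-byte UTF-8 */
}
```

The neighbours of the surrogate range, U+D7FF (ED 9F BF) and U+E000 (EE 80 80), remain valid under this check, so only the reserved block is rejected.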