
There is absolutely no situation in which UTF-16 wins over UTF-8, because of the surrogate pairs required. That makes both encodings variable length.

UTF-32 is probably what you're thinking of.



I know that both encodings are variable-length. That is the issue I am trying to address.

My point is that in UTF-16 it's too easy to ignore surrogate pairs. Lots of UTF-16 software fails to handle them because the characters that need them are so rare. But in UTF-8 you can't ignore multi-byte characters without obvious bugs. These bugs are noticed and fixed more quickly than UTF-16 surrogate-pair bugs. This makes UTF-8 more reliable in practice.
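
Here's a rough Python sketch of what I mean (the helper names are made up, purely illustrative): a naive "one code unit = one character" assumption breaks visibly on everyday accented text in UTF-8, but in UTF-16 it only breaks once an astral character shows up.

    def naive_utf8_length(s):
        # Wrong on purpose: counts bytes, not characters.
        return len(s.encode("utf-8"))

    def naive_utf16_length(s):
        # Wrong on purpose: counts 16-bit code units, not characters.
        return len(s.encode("utf-16-le")) // 2

    text = "café"                                    # everyday non-ASCII text
    print(naive_utf8_length(text), len(text))        # 5 vs 4 -- the bug is obvious right away
    print(naive_utf16_length(text), len(text))       # 4 vs 4 -- the bug stays hidden...

    astral = "I \U0001F499 Unicode"                  # contains a character outside the BMP
    print(naive_utf16_length(astral), len(astral))   # 12 vs 11 -- ...until a surrogate pair appears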

I am not sure why you think I am advocating UTF-16. I said almost nothing good about it.


Bugs in UTF-8 handling of multibyte sequences need not be obvious. Google "CAPEC-80."
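
CAPEC-80 is the overlong-encoding trick: a non-shortest-form byte sequence slips past a byte-level filter and then decodes to exactly the character the filter was looking for. A quick sketch (Python's strict decoder happens to reject it; a lenient decoder wouldn't):

    overlong_slash = b"\xc0\xaf"            # non-shortest-form encoding of U+002F '/'

    print(b"/" in overlong_slash)           # False -- a naive byte-level filter sees no '/'

    try:
        overlong_slash.decode("utf-8")      # a strict decoder refuses overlong forms
    except UnicodeDecodeError as e:
        print("rejected:", e)               # a lenient decoder would have produced '/'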

UTF-16 has an advantage in that there are fewer failure modes and fewer ways for a string to be invalid.

edit: As for surrogate pairs, this is an issue, but I think it's overstated. A naïve program may accidentally split a UTF-16 surrogate pair, but that same program is just as liable to accidentally split a decomposed character sequence in UTF-8. You have to deal with those issues regardless of encoding.
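
For instance (a throwaway Python sketch), naive truncation mangles a decomposed character in exactly the same way, whatever the encoding:

    s = "cafe\u0301"                 # "café" in decomposed (NFD) form: 5 code points
    print(s[:4])                     # "cafe" -- the combining accent is silently dropped
    print(len(s), len("caf\u00e9"))  # 5 vs 4 -- counts disagree even before any encoding is involved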


> A naïve program may accidentally split a UTF-16 surrogate pair, but that same program is just as liable to accidentally split a decomposed character sequence in UTF-8. You have to deal with those issues regardless of encoding.

The point is that using UTF-8 makes these issues more obvious. Most programmers these days think to test with non-ASCII characters. Fewer think to test with astral (non-BMP) characters.


Anything in the range U+0800 to U+FFFF takes three bytes per character in UTF-8 and two in UTF-16 (http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings...):

"Therefore if there are more characters in the range U+0000 to U+007F than there are in the range U+0800 to U+FFFF then UTF-8 is more efficient, while if there are fewer then UTF-16 is more efficient. "

That same page also states: "A surprising result is that real-world documents written in languages that use characters only in the high range are still often shorter in UTF-8, due to the extensive use of spaces, digits, newlines, html markup, and embedded English words", but I think the "[citation needed]" tag is rightly placed there (it may be close in many texts, though).
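
If anyone wants to eyeball it, here's a quick throwaway measurement (the sample strings are made up, not real documents):

    samples = {
        "ASCII-heavy": "hello world " * 10,
        "high-BMP":    "統一碼" * 10,                        # U+0800..U+FFFF: 3 bytes in UTF-8, 2 in UTF-16
        "with markup": "<p>統一碼 example 123</p>" * 10,
    }
    for name, text in samples.items():
        print(name, len(text.encode("utf-8")), len(text.encode("utf-16-le")))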


UTF-8 is variable-length in that a code point can take anywhere from 1 to 4 bytes, while in UTF-16 it is either 2 or 4. That makes a UTF-16 decoder/encoder roughly half as complex as a UTF-8 one.
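
For comparison, roughly what the per-code-point decode step looks like in each (a Python sketch that assumes well-formed input; all validation omitted):

    def decode_utf16(units, i):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:                    # high surrogate: the only 2-unit case
            return 0x10000 + ((u - 0xD800) << 10) + (units[i+1] - 0xDC00), i + 2
        return u, i + 1                              # everything else: 1 unit

    def decode_utf8(b, i):
        c = b[i]
        if c < 0x80:                                 # 1-byte form
            return c, i + 1
        if c < 0xE0:                                 # 2-byte form
            return ((c & 0x1F) << 6) | (b[i+1] & 0x3F), i + 2
        if c < 0xF0:                                 # 3-byte form
            return ((c & 0x0F) << 12) | ((b[i+1] & 0x3F) << 6) | (b[i+2] & 0x3F), i + 3
        return (((c & 0x07) << 18) | ((b[i+1] & 0x3F) << 12)    # 4-byte form
                | ((b[i+2] & 0x3F) << 6) | (b[i+3] & 0x3F)), i + 4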


Surrogate pairs are way more complex than anything in UTF-8.



