Broad backwards compatibility with ASCII is a strong reason to prefer UTF-8 in m...

othermaciej · on Feb 8, 2011

> "UTF-16 has no practical advantages over UTF-8" - UTF-16 is better - i.e., quite a bit more compact, since, e.g., Chinese characters always take 3 characters in UTF-8, for asian languages.

In some contexts that may matter. In other cases, you can expect enough ASCII mixed in to outweigh that effect. For example, on the Web though, if you're sending HTML, the savings from using 8 bits for the ASCII tags will nearly outweigh the cost of extra bytes for content text. Gzip shrinks that further to essentially no difference.

fnl · on Feb 8, 2011

As I posted there, too, he also missed one very important plus in UTF-16: character counting. If you ensure there are no Surrogate Chars (which is a simple byte-mask comparison using every second byte (only!) as in: "0xD800 & byte == 0xD800" [using the entire range to avoid having to think about byte order]) in your array, you can be 100% sure about the character length just by dividing the byte-length in two. With UTF-8, you have to look at each and every byte one by one and count the number of characters to establish the actual char-length. Finally, given how rare the Supplementary Range characters are, the Surrogate Range check can be safely skipped in many scenarios, making it significantly faster to establish char lengths in UTF-16 than UTF-8.

EDIT: oh, and before any "semantic-nerd" comes along: I am fully aware that 0xXXXX are two bytes, so, if you want, read "two-byte" for every time I mention "byte" above... (doh ;))

thristian · on Feb 8, 2011

If you can guarantee there are no surrogate bytes, then you're working with UCS-2 rather than UTF-16.

If you can guarantee there are no bytes with the high bit set (a simple byte mask comparison which can be done 32 bits at a time for speed), then UTF-8 devolves to ASCII and you can calculate character-length without any division at all!

Counting the number of characters in a string is a misleading benchmark, because it's not a thing that people need to do terribly often - if you're rendering text to a display, "character count" is not enough, you need to know things like font metrics and which characters are combining, zero-width, or double-width. Concatenating strings is easy, and only cares about byte length, not character length. And if you really, really find character-length calculations on your hot-path, you can always just store length-prefixed strings instead of terminated strings.

fnl · on Feb 8, 2011

Quite on the contrary; I work in data mining and need to count characters on many different occasions. Probably for web development you don't need to care about that stuff, right (but read the second paragraph first!). So if character counting is not important in your app, you can naturally use whatever you want. If it is, however, UTF-16 is far more efficient, while not having to waste space with UTF-32 if you can be quite sure that the Supp. Range can be safely ignored or by adding a trivial safeguard.

And the "ASCII argument" is quite antiquated, almost all (scientific) char data I have contains non-ASCII characters, especially greek letters and such. Same goes even for web development or practically any user-oriented job where you need to accomodate for an international bunch of users. If you have pure ASCII data, lucky you, but why bother with UTF-anything then??

fauigerzigerk · on Feb 8, 2011

I do a lot of data mining and text mining as well. My biggest concern is always memory usage and UTF-8 is hugely more memory efficient than UTF-16 in almost all cases, even for Asian text. Any difference in character counting performance is negligible compared to the benefit of avoiding disk access.

Also, the text I process is never UTF-16 at the source. So even if I used UTF-16, I would have to convert the text first and that would be the one and only time characters are counted. There would be no additional counting overhead at all.

roel_v · on Feb 8, 2011

Re: char counting, as usual it depends on the application, but as the previous poster remarked you can keep a separate counter for O(1) length access, at least in situations where strings are relatively constant. I agree the author of the article omitted this point and that I personally am not as black and white in the 'UTF-8 is better' camp, but just string length is imo, usually, not an unsolvable bottleneck.

fnl · on Feb 8, 2011

Well, I am neither black-and-white with the topic. Both have valid reasons for use; UTF-8 in general is more efficient in space for any use where ASCII characters are predominant. UTF-16 is more efficient if non-Latin-1 characters are predominant. UTF-8 has backwards compatibility to ASCII if that is important. UTF-16 makes it easier and faster to count characters, take slices of characters, and scan for characters (and, so far, I insist no valid argument why this is the case has ever been produced - all UTF-8 solutions are hacky and/or harder to maintain). UTF-8 nicely works even with the native C char type, UTF-16 doesn't. In the end, all these considerations need to be weighed and your solution should accomodate what is best for your problem. In my case, all I want to say is that UTF-16 often turns out to be the better approach for text processing, while UTF-8 is useful for text storage and transmission.

wladimir · on Feb 8, 2011

Right, if you ensure that only ASCII characters are used in UTF-8, which you can check using 0x80 & byte == 0x00, counting the number of characters is easy. But to check that this condition holds, you need to iterate over the whole string anyway.

I don't see how this gives any advantage to either of the encodings. Both for UTF-8 and UTF-16 you have to implement some decoding to reliably count the number of characters.

fnl · on Feb 8, 2011

Oh, and another thing is your argument only holds for ASCII - I hardly ever encounter pure ASCII data nowadays when I work with text. On the other hand, I never have encountered a Surrogate Character except in my test libraries, either. Next, if you are working with pure ASCII, who cares about UTF-8 or -16? Last, you still need to scan every byte in UTF-8 for that, but only every second "two-byte" in UTF-16, which is half.

roel_v · on Feb 8, 2011

When you say 'ASCII' I guess you mean 'strings where all bytes have a decimal value in the range [0-127]', right? If so I agree that it's rare to encounter that, but the common use (however wrong) of ASCII is 'chars are in [0-255]', i.e. all chars are one byte; and that data is very common.

Thinking about it, I don't know what codepage the UTF-8 128-255 code points map too, if any, though; could you explain? If you treat UTF-8 as ASCII data (as one byte, one character, basically), does it generally work with chars in the [127-255] range.

fnl · on Feb 8, 2011

256 char ASCII is called "8-bit", "high", or "extended" ASCII. So, pure (7-bit) ASCII is the only thing you can hold in a "one-byte UTF-8 array". The 8th bit (or the first, depending on how you see it) is used to mark multibyte characters, i.e. any character that is not ASCII. So you only can represent 128 possible symbols into a UTF-8 character with length of one byte. In particular, UTF-8 maps these to ASCII (or, US-ASCII) characters, and the byte starts with a 0 bit (In other words, you cannot encode high ASCII into one UTF-8 byte.) For ALL other characters (no matter), the first bit in any (multi-)byte is set to 1. That's why it is easy to scan for length of ASCII chars in UTF-8, but not for any others.

roel_v · on Feb 8, 2011

Oh I see, thank you, it seems I was misinformed.

fnl · on Feb 8, 2011

The important fact lies in the last sentence; in all of my apps, I could so far safely and soundly ignore the fact that Surrogates are two characters long to establish character lengths. So by using UTF-16, all those apps are significantly faster than one that would have been based on UTF-8. And, to add to this, by the same nature, it is far easier to take char slices of a UTF-16 encoded byte array than of a UTF-8 one.

querulous · on Feb 9, 2011

how do you ensure you are not slicing a character in two? for example, how do you prevent slicing 2⁵ in half?

beagle3 · on Feb 8, 2011

I'm interested to know in what application you need character counting in which this makes a difference: How about diacritics and accents? Or do you care only about chars and not about semantics or graphics?

Also, if that mattered to me, I'd store the char count in addition to the string length.

querulous · on Feb 9, 2011

if your utf-16 is canonicalized, it's code point count and character count likely diverge anyways, particularly when dealing with scientific data.

udoprog · on Feb 8, 2011

"...Unicode has been standardising CJK character sets" - CJK characters sets are complex, and will always face opposition, besides, this is less related to how the characters are encoded, and more how to choose and composite your code points.

"I think the author has confused the fixed-length UCS-2 and the variable length UTF-16." - No, in the same sentence you are quoting he mentions that utf-16 is limited to 0x110000 codepoints, contra utf-8 that is specified to expand up to 6 bytes.

"Not true. NUL is a 16-bit character in UTF-16. Use a C compiler that supports 16-bit char width." - I won't be converting my source code into utf-16 any time soon. Besides, C is not a good example where this is an actual problem, the "unicode strings" will still just be represented as binary blobs - different depending on which encoding you choose. It's more of a problem in e.g. Python, where the native string is an array of characters, not just a null terminated blob. I wouldn't encode my source code in utf-8, however, if my editor my default happened to support utf-8, this wouldn't be an issue unless I enter an non-ascii compatible code point. I see this as a big plus.

chalst · on Feb 8, 2011

"confused ... UCS-2 and ... UTF-16" - You are quite right, he is clear on the difference. It's presumably possible to apply the same trick to UTF-16 as was applied to UCS-2, to expand further, but this is ad-hoc and not built in.

"supports 16-bit char width" - I mean the compiler has CHAR_BIT == 16; this is independent of the character encoding of the source.

lysium · on Feb 8, 2011

You don't mention issues and your advantages are only marginal for specific use cases such as Asian languages.

"UTF-16 is better": but 'only' for Asian languages; most code is still written in ASCII

"need to commit to 16-bit char width": the OP's point was that many many tools don't operate on 16-bit char width and with UTF-8 you can continue to use them.

The other two points are no advantages for UTF-16.

yuhong · on Feb 8, 2011

>"We lose backwards compatibility with code treating NUL as a string terminator." - Not true. NUL is a 16-bit character in UTF-16. Use a C compiler that supports 16-bit char width.

Yea, it really should be "We lose backwards compatibility with code treating a zero 8-bit byte as a string terminator."

>"UTF-16 has no practical advantages over UTF-8" - UTF-16 is better - i.e., quite a bit more compact, since, e.g., Chinese characters always take 3 characters in UTF-8, for asian languages.

For pure CJK text files, yes.