Broad backwards compatibility with ASCII is a strong reason to prefer UTF-8 in most applications; however, I find the issues with UTF-16 overstated and its advantages ignored.
"UTF-16 has no practical advantages over UTF-8" - UTF-16 is better - i.e., quite a bit more compact, since, e.g., Chinese characters always take 3 characters in UTF-8, for asian languages.
"A related drawback is the loss of self-synchronization at the byte level." - Maybe this is a problem, maybe not. Maybe the failure of UTF-8 to be self-synchronising at the 4-bit level is a problem is some circumstances. I don't mean to be flippant, but the wider point is that with UTF-16, you really need to commit to 16-bit char width.
"The encoding is inflexible" - I think the author has confused the fixed-width UCS-2 and the variable-width UTF-16.
"We lose backwards compatibility with code treating NUL as a string terminator." - Not true. NUL is a 16-bit character in UTF-16. Use a C compiler that supports 16-bit char width.
> "UTF-16 has no practical advantages over UTF-8" - UTF-16 is better - i.e., quite a bit more compact, since, e.g., Chinese characters always take 3 characters in UTF-8, for asian languages.
In some contexts that may matter. In others, you can expect enough ASCII mixed in to outweigh that effect. On the Web, for example, if you're sending HTML, the savings from using 8 bits for the ASCII tags will largely offset the cost of the extra bytes for content text. Gzip shrinks that further to essentially no difference.
As I posted there, too, he also missed one very important plus of UTF-16: character counting. If you ensure there are no surrogate code units in your array (a simple mask comparison that only needs to examine every second byte, as in "(0xF800 & unit) == 0xD800", which covers the entire surrogate range in one check), you can be 100% sure about the character length just by dividing the byte length by two. With UTF-8, you have to look at each and every byte, one by one, and count the characters to establish the actual character length. Finally, given how rare the Supplementary Range characters are, the surrogate-range check can be safely skipped in many scenarios, making it significantly faster to establish character lengths in UTF-16 than in UTF-8.
EDIT: oh, and before any "semantic nerd" comes along: I am fully aware that 0xXXXX is two bytes, so, if you want, read "two-byte unit" every time I mention "byte" above... (doh ;))
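To make that concrete, here is a minimal Python sketch of the idea (utf16_char_count is just my name for it, and I'm assuming big-endian UTF-16 bytes with no BOM):

    def utf16_char_count(data):
        # data: UTF-16-BE encoded bytes; bytearray gives plain ints to mask.
        units = bytearray(data)
        # A high byte in the 0xD8-0xDF range marks a surrogate code unit.
        if any((units[i] & 0xF8) == 0xD8 for i in range(0, len(units), 2)):
            raise ValueError("surrogates present; length is not just len(data) // 2")
        return len(data) // 2

    s = u"caf\u00e9 stra\u00dfe"   # BMP-only text, no surrogate pairs
    assert utf16_char_count(s.encode("utf-16-be")) == len(s)

The check only ever looks at every second byte, and once it passes, the division is the whole job.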
If you can guarantee there are no surrogate bytes, then you're working with UCS-2 rather than UTF-16.
If you can guarantee there are no bytes with the high bit set (a simple byte mask comparison which can be done 32 bits at a time for speed), then UTF-8 devolves to ASCII and you can calculate character-length without any division at all!
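A rough sketch of that check (Python 3 here, so the word-at-a-time masking is only an illustration of what the C version would do; is_ascii is my own name):

    def is_ascii(data):
        # Check 8 bytes per step by masking the high bit of every byte position.
        mask = 0x8080808080808080
        full = len(data) - len(data) % 8
        for i in range(0, full, 8):
            if int.from_bytes(data[i:i + 8], "big") & mask:
                return False
        # Handle the few leftover bytes one at a time.
        return all(b < 0x80 for b in data[full:])

    assert is_ascii(b"plain old ASCII")
    assert not is_ascii(u"caf\u00e9".encode("utf-8"))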
Counting the number of characters in a string is a misleading benchmark, because it's not a thing that people need to do terribly often - if you're rendering text to a display, "character count" is not enough, you need to know things like font metrics and which characters are combining, zero-width, or double-width. Concatenating strings is easy, and only cares about byte length, not character length. And if you really, really find character-length calculations on your hot-path, you can always just store length-prefixed strings instead of terminated strings.
Quite the contrary; I work in data mining and need to count characters on many different occasions. Probably for web development you don't need to care about that stuff, right? (But read the second paragraph first!) So if character counting is not important in your app, you can naturally use whatever you want. If it is, however, UTF-16 is far more efficient, without having to waste space on UTF-32, provided you can be quite sure the Supplementary Range can be safely ignored or you add a trivial safeguard.
And the "ASCII argument" is quite antiquated, almost all (scientific) char data I have contains non-ASCII characters, especially greek letters and such. Same goes even for web development or practically any user-oriented job where you need to accomodate for an international bunch of users. If you have pure ASCII data, lucky you, but why bother with UTF-anything then??
I do a lot of data mining and text mining as well. My biggest concern is always memory usage and UTF-8 is hugely more memory efficient than UTF-16 in almost all cases, even for Asian text. Any difference in character counting performance is negligible compared to the benefit of avoiding disk access.
Also, the text I process is never UTF-16 at the source. So even if I used UTF-16, I would have to convert the text first and that would be the one and only time characters are counted. There would be no additional counting overhead at all.
Re: char counting, as usual it depends on the application, but as the previous poster remarked, you can keep a separate counter for O(1) length access, at least in situations where strings are relatively constant. I agree the author of the article omitted this point, and I personally am not as black and white in the 'UTF-8 is better' camp, but string length alone is, imo, usually not an unsolvable bottleneck.
Well, I am neither black-and-white on the topic. Both have valid reasons for use: UTF-8 in general is more efficient in space for any use where ASCII characters are predominant. UTF-16 is more efficient if non-Latin-1 characters are predominant. UTF-8 has backwards compatibility with ASCII if that is important. UTF-16 makes it easier and faster to count characters, take slices of characters, and scan for characters (and, so far, I insist, no valid argument to the contrary has ever been produced - all UTF-8 solutions are hacky and/or harder to maintain). UTF-8 nicely works even with the native C char type, UTF-16 doesn't. In the end, all these considerations need to be weighed, and your solution should accommodate what is best for your problem. In my case, all I want to say is that UTF-16 often turns out to be the better approach for text processing, while UTF-8 is useful for text storage and transmission.
Right, if you ensure that only ASCII characters are used in UTF-8, which you can check using 0x80 & byte == 0x00, counting the number of characters is easy. But to check that this condition holds, you need to iterate over the whole string anyway.
I don't see how this gives any advantage to either of the encodings. Both for UTF-8 and UTF-16 you have to implement some decoding to reliably count the number of characters.
Oh, and another thing: your argument only holds for ASCII - I hardly ever encounter pure ASCII data nowadays when I work with text. On the other hand, I have never encountered a Surrogate Character except in my test libraries, either. Next, if you are working with pure ASCII, who cares about UTF-8 or -16? Last, you still need to scan every byte in UTF-8 for that, but only one byte of every 16-bit unit in UTF-16, which is half as many.
When you say 'ASCII' I guess you mean 'strings where all bytes have a decimal value in the range [0-127]', right? If so I agree that it's rare to encounter that, but the common use (however wrong) of ASCII is 'chars are in [0-255]', i.e. all chars are one byte; and that data is very common.
Thinking about it, though, I don't know what codepage the UTF-8 byte values 128-255 map to, if any; could you explain? If you treat UTF-8 as ASCII data (one byte, one character, basically), does it generally work with chars in the [128-255] range?
256 char ASCII is called "8-bit", "high", or "extended" ASCII.
So, pure (7-bit) ASCII is the only thing you can hold in a "one-byte UTF-8 array". The 8th bit (or the first, depending on how you see it) is used to mark multibyte characters, i.e. any character that is not ASCII. So you can only represent 128 possible symbols in a UTF-8 character that is one byte long. In particular, UTF-8 maps these to the ASCII (or US-ASCII) characters, and the byte starts with a 0 bit. (In other words, you cannot encode high ASCII into one UTF-8 byte.) For ALL other characters, no matter which, the first bit of every byte in the (multi-byte) sequence is set to 1. That's why it is easy to scan for the length of ASCII chars in UTF-8, but not for any others.
The important fact lies in that last sentence; in all of my apps, I could so far safely and soundly ignore the fact that surrogate pairs make some characters two code units long when establishing character lengths. So by using UTF-16, all those apps are significantly faster than they would have been based on UTF-8. And, by the same nature, it is far easier to take char slices of a UTF-16 encoded byte array than of a UTF-8 one.
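For instance, a sketch of what slicing looks like once you know the data is UCS-2-compatible (utf16_slice is my own name; big-endian bytes assumed):

    def utf16_slice(data, start, stop):
        # data: UTF-16-BE bytes known to contain no surrogate pairs,
        # so character i always occupies bytes 2*i and 2*i + 1.
        return data[2 * start:2 * stop]

    s = u"\u4e2d\u6587 text"          # BMP-only: two CJK chars, a space, "text"
    raw = s.encode("utf-16-be")
    assert utf16_slice(raw, 0, 2).decode("utf-16-be") == u"\u4e2d\u6587"

With UTF-8 you would first have to walk the bytes just to find where characters 0 and 2 begin.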
I'm interested to know in what application you need character counting where this makes a difference: how about diacritics and accents? Or do you care only about chars and not about semantics or graphics?
Also, if that mattered to me, I'd store the char count in addition to the string length.
"...Unicode has been standardising CJK character sets" - CJK characters sets are complex, and will always face opposition, besides, this is less related to how the characters are encoded, and more how to choose and composite your code points.
"I think the author has confused the fixed-length UCS-2 and the variable length UTF-16." - No, in the same sentence you are quoting he mentions that utf-16 is limited to 0x110000 codepoints, contra utf-8 that is specified to expand up to 6 bytes.
"Not true. NUL is a 16-bit character in UTF-16. Use a C compiler that supports 16-bit char width." - I won't be converting my source code into utf-16 any time soon. Besides, C is not a good example where this is an actual problem, the "unicode strings" will still just be represented as binary blobs - different depending on which encoding you choose. It's more of a problem in e.g. Python, where the native string is an array of characters, not just a null terminated blob.
I wouldn't encode my source code in UTF-8; however, if my editor by default happened to support UTF-8, this wouldn't be an issue unless I entered a non-ASCII-compatible code point. I see this as a big plus.
"confused ... UCS-2 and ... UTF-16" - You are quite right, he is clear on the difference. It's presumably possible to apply the same trick to UTF-16 as was applied to UCS-2, to expand further, but this is ad-hoc and not built in.
"supports 16-bit char width" - I mean the compiler has CHAR_BIT == 16; this is independent of the character encoding of the source.
You don't mention any issues, and the advantages you list are marginal and specific to use cases such as Asian languages.
"UTF-16 is better": but 'only' for Asian languages; most code is still written in ASCII
"need to commit to 16-bit char width": the OP's point was that many many tools don't operate on 16-bit char width and with UTF-8 you can continue to use them.
The other two points are not advantages for UTF-16.
>"We lose backwards compatibility with code treating NUL as a string terminator." - Not true. NUL is a 16-bit character in UTF-16. Use a C compiler that supports 16-bit char width.
Yea, it really should be "We lose backwards compatibility with code treating a zero 8-bit byte as a string terminator."
>"UTF-16 has no practical advantages over UTF-8" - UTF-16 is better - i.e., quite a bit more compact, since, e.g., Chinese characters always take 3 characters in UTF-8, for asian languages.
Too few programmers know that UTF-8 is a variable length encoding: I've heard plenty assert that in UTF-8, every character takes two bytes, while claiming simultaneously they could encode every possible character in it.
A bit broader: too few programmers understand the difference between a character set and a character encoding.
That article starts out OK and then suddenly tries to argue that you can use the terms interchangeably. You cannot, and you will drown in confusion if you try to. Just imagine that tomorrow the Chinese introduce their own character set next to Unicode, but pair it with a UTF-8-style variable-length encoding to minimize the number of bytes it takes to represent their language (which makes sense, because the frequency of characters drops off pretty fast and some characters are much more common than others, so you'd like to represent those with one byte).
The fact that the HTTP RFC speaks of 'charset=utf-8' is explained by this part of the spec:
Note: This use of the term "character set" is more commonly
referred to as a "character encoding." However, since HTTP and
MIME share the same registry, it is important that the terminology also be shared.
Why does MIME use the 'wrong' terminology? Perhaps because the registry is old and the difference between set and encoding was less obvious and relevant back then. Perhaps it was simply a mistake; a detail meant to be corrected. Perhaps the person who drew it up was inept. Who knows. It doesn't matter; it is still wrong. And don't get me started on the use of 'character set' in MySQL...
Unicode is a character set, and the only character set really worth speaking of. The Unicode character set includes almost every character in every writing system on Earth. A string is a piece of text, i.e. an ordered sequence of characters all taken from the same character set.
A character encoding is a mapping/function/algorithm/set of rules which can be used to convert a string into a sequence of bytes and back again.
A character set may have multiple encodings. UTF-8 and UTF-16 are two possible encodings of the Unicode character set.
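A quick Python illustration of the distinction: one Unicode string, two different encodings of it into bytes:

    s = u"stra\u00dfe"                  # one string, drawn from the Unicode character set
    assert s.encode("utf-8") == b"stra\xc3\x9fe"                           # one encoding of it
    assert s.encode("utf-16-be") == b"\x00s\x00t\x00r\x00a\x00\xdf\x00e"   # another encoding of it

Same characters, different byte sequences; 'charset=utf-8' really names the encoding.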
Actually, it's just stored internally as UCS-2. I'm not sure why that really matters, though. You shouldn't care how your Unicode code points are stored (i.e., the size of the integer) as long as they encode to a UTF encoding correctly.
Edit: Ah, guys, it's actually UTF-16; the configure flag is just named ucs2. False alarm.
I doubt that they are encoded in UCS-2, as that character set isn't able to encode every Unicode code point (or even just the majority of them).
You are right, though (and this is why I upvoted you back to 1), that you shouldn't care. In fact, you not knowing the internal encoding is the proof of that. In Python (I'm talking Python 3 here, which has done this right), you don't care how a string is stored internally.
The only place where you care about this is when your strings interact with the outside world (i/o). Then your strings need to be converted into bytes and thus the internal representation must be encoded using some kind of encoding.
This is what the .decode and .encode methods are used for.
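A minimal Python 3 sketch of that boundary (the file or socket is imagined; only the decode/encode calls matter):

    raw = b"caf\xc3\xa9"              # bytes arriving from a file or socket
    text = raw.decode("utf-8")        # inside the program: an abstract string
    assert len(text) == 4             # counted in code points, not bytes
    out = text.encode("utf-16-le")    # re-encode however the other side wants it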
In Python 2.x, unicode strings are encoded in UCS-2, not UTF-16, at least by default (I'm not sure about Python 3.x; I assume it's the same, though). If you want to support every single possible Unicode codepoint, you can tell Python to do so at compile time (via a ./configure flag).
In practice, the characters that aren't in UCS-2 tend to be characters that don't exist in modern languages, e.g. the character sets for Linear B, domino tiles, and cuneiform, so they're not supported since they're not of practical use to most people. There's a fairly good list at http://en.wikipedia.org/wiki/Plane_(Unicode) . In terms of that list, Python by default doesn't support things outside the BMP.
Things outside of the BMP aren't just dead languages anymore. You have to be able to support characters outside the BMP if you want to sell your software in China:
UTF-16 behaves to UCS-2 as UTF-8 does to ASCII. Meaning: on the characters they both cover, they agree byte for byte. UTF-16 extends UCS-2 by using some reserved code units (the surrogates) to indicate that what follows should be interpreted according to UTF-16 rules. So, just like UTF-8.
Meaning: every UCS-2 document is also a UTF-16 document, but not the reverse (just like every ASCII document is also a UTF-8 document).
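To illustrate (Python; the byte values are just the standard UTF-16-BE ones):

    # A BMP character is one 16-bit unit; U+10000 becomes a surrogate pair.
    assert u"\u4e2d".encode("utf-16-be") == b"\x4e\x2d"
    assert u"\U00010000".encode("utf-16-be") == b"\xd8\x00\xdc\x00"

The 0xD800/0xDC00 units come from exactly the reserved range mentioned above, which is why a plain UCS-2 document never contains them.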
But as I said below: it doesn't matter, and it could even be a totally proprietary character set, as long as Python's string operations work on that character set and as long as there's a way to decode input data into that set and encode output data from that set.
You should very much care about that, because if your tool stores text as UCS-2, it means that it can't represent all of Unicode; UCS-2 stopped being a valid encoding a long time ago.
You are completely right, I'm sorry about my previous comment.
The strange thing is that I couldn't find any reference to surrogate pairs in the Python documentation, so I was assuming that the elements of a unicode string were complete codepoints. It turns out this is not the case:
>>> list(u'\U00010000')
[u'\ud800', u'\udc00']
If I had Python compiled with the UTF-32 option, this would return a single element, so Python is leaking an implementation detail that can change across builds. That's really, really bad...
No, that's the correct behavior. list only incidentally returns one element per character for ASCII strings -- it's not required to. You shouldn't be using list on raw unicode strings.
u'\U00010000'.encode('utf-8')
should produce the same result on every Python version.
> You shouldn't be using list on raw unicode strings.
Why? I am using list only to show what are the values of s[0] and s[1].
What I am saying is that it returns the list of characters of the underlying representation, so a list of 16-bit chars (possibly surrogates) if compiled with UTF-16, or a list of 32-bit characters if compiled with UTF-32.
Are you suggesting that all the string processing (including iteration) should be done on a str encoded in UTF8 instead of using the native unicode type?
If you want to deal with characters with high code point numbers, you should know this code point stuff. For example, String.length() returns the number of two-byte chars, not the number of real four-byte characters, which may confuse someone.
Exactly. A Java char is not synonymous with a Unicode code point. But the majority of the time they do coincide, older documentation claimed that they were the same, and this is the meme that many Java programmers (in my experience) have.
That's actually my point. Python supports Unicode code points and the UTF encodings. If you encode the output as UTF-8, it will actually be variable-length chars. What's important is your encoded output, not the internal code point representation.
It leaks through in some places. For example, len(u'\U0001D310') (from the Tai Xuan Jing Symbols) returns 1 on wide (32-bit) Python builds, and returns 2 on the default narrow (16-bit) builds.
The Windows NT development team made the decision to standardise on 16-bit wide characters. Every release of Windows since the original NT uses them internally for all its "wide character" API calls (e.g., wcslen() and FindWindowW()); originally that meant UCS-2, and since Windows 2000 it has meant UTF-16.
I don't know about "preferring", but anyone manipulating strings in JavaScript is effectively using UTF-16 (or more precisely is using arrays of 16-bit integers which a web browser will interpret as UTF-16-encoded Unicode if you tell it that the array contains text).
As a consequence at least Gecko and Webkit both use UTF-16 for their string classes, though there has been talk of trying to switch Gecko over to UTF-8. The problem then would be implementing the JS string APIs on top of UTF-8 strings efficiently.
Strings are big-endian UTF-16 by default even in Cocoa (stored in an array of unsigned shorts). Worst of all, GCC defines wchar_t as a 4-byte int unless you specify -fshort-wchar.
As far as I know, wchar_t is meant to be an internal-only representation, so it's good that it is 32 bits--that way you are in one-codepoint-per-word territory. It's a mistake to think you can just overlay some Unicode binary data with a wchar_t pointer--you need to convert into and out of wchar_t from UTF-8/UTF-16/whatever. Otherwise you aren't handling codepoints above 16 bits correctly.
This is a common misconception about UTF-8 vs. UTF-16. You're missing two important facts.
1. Most UTF-8 string operations can operate on a byte at a time. You just have to realize that functions like strlen will be telling you byte length instead of character length, and this usually doesn't even matter. (It's still important to know; when you do need a character count, see the sketch after the next point.)
2. UTF-16 is still a variable-width encoding. It was originally intended to be fixed-width, but then the Unicode character set grew too large to be represented in 16 bits.
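And for the cases where the character count really is needed, it's a short scan over the bytes (a Python sketch; utf8_char_count is my own name, and the same (b & 0xC0) != 0x80 test works in C):

    def utf8_char_count(data):
        # Count only the bytes that start a character, i.e. skip the
        # continuation bytes of the form 10xxxxxx.
        return sum(1 for b in bytearray(data) if (b & 0xC0) != 0x80)

    assert utf8_char_count(u"caf\u00e9".encode("utf-8")) == 4   # 5 bytes, 4 characters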
If I have no Surrogate Range code points in a string, it is far easier to work with UTF-16 than UTF-8 at the byte level, because all chars are a constant size. For UTF-8 that only applies to ASCII. And Surrogate Range characters are extraordinarily rare, while non-ASCII chars are extremely common. So my programs ensure at the entry points that the string is UCS-2-compatible, and then all subsequent string manipulations are far less complex to handle than with UTF-8.
UTF-16 was never intended to be a fixed-width encoding; it was created in order to support characters outside the BMP, which aren't covered by UCS-2.
The environments that jumped on Unicode early, before it was realized that 2 bytes wouldn't be enough, all chose to use UCS-2, for obvious reasons. In particular, that includes Windows and Java.
Probably because they figured they could just ignore endianness issues and that ASCII compatibility would be Somebody Else's Problem.
There were always problems with UCS-2. UTF-8 would have had a number of advantages over it even if Unicode had never grown beyond the BMP (Basic Multilingual Plane, the first and lowest-numbered 16-bit code space).
> ASCII compatibility would be Somebody Else's Problem
for many of those outside "A" in ASCII (euphemism for America :) there were already a ton of problems, so endianness was the least (i personally never hit this problem)
// disclaimer: i'm not that serious about predominance of Latin script, this is sorta irony
Depending on the level of abstraction you're living at - and that depends on the overall goal, performance constraints, environmental integration, OS / machine heterogeneity etc. - it may or may not be a problem.
It's easy to dismiss if you have all the time in the world and a deep stack of abstractions.
If you're doing deep packet analysis on UTF-16 text in a router, things may be different.
thanks, my question was exactly about the issues met by people living at other levels of abstraction.
i'm not a native english speaker and a newb to HN, so sorry that i phrased my sincere question so that it looked like an arrogant statement: 'there are no issues, what are you talking about, i don't even know what LE and BE mean'.
> for many of those outside "A" in ASCII (euphemism for America :)
Abbreviation for 'American', in fact. No euphemisms needed.
(ASCII = American Standard Code for Information Interchange)
> there were already a ton of problems, so endianness was the least
I can appreciate this. However, UTF-8 also has desirable properties like 'dropping a single byte only means you lose one character, as opposed to potentially losing the whole file', and 'you can often tell if a multi-byte UTF-8 sequence has been corrupted without doing complex analysis'.
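For instance, after a lost or mangled byte you can find the next character boundary with a trivial loop (a Python 3 sketch; resync is my own name):

    def resync(data):
        # Skip continuation bytes (10xxxxxx); the first byte that isn't one
        # starts the next complete character.
        i = 0
        while i < len(data) and (data[i] & 0xC0) == 0x80:
            i += 1
        return data[i:]

    stream = u"caf\u00e9!".encode("utf-8")
    damaged = stream[4:]                  # we picked up the stream mid-character
    assert resync(damaged).decode("utf-8") == u"!"

With UTF-16, a single lost byte shifts every following code unit out of alignment.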
> i'm not that serious about predominance of Latin script, this is sorta irony
Heh. ASCII can't even encode the entirety of the Latin script: ask a Frenchman how he spells 'café', or a German how he spells 'straße', and notice that important characters are missing from ASCII.
I keep hoping a string API will catch on in which combining marks are mostly treated as indivisible. Handling text one codepoint at a time is as bad an idea as handling US-ASCII one bit at a time--almost everything it lets you do is an elaborate way to misinterpret or corrupt your data.
It's not so simple: it depends on what you're doing with the text. If you're not trying to do analysis with it, encoded text is more or less a program written in a DSL that, when interpreted by a font renderer, draws symbols in some graphical context. Depending on the analysis you want to do, you need varying amounts of knowledge. Perhaps you only need to know about word boundaries; perhaps you're trying to look things up in a normalized dictionary; maybe even decompose a word into phonemes to try and pronounce it. These require different levels of analysis, and one size won't fit all.
Update: for particular purposes, consider using the Collator class; it makes collation keys (byte arrays) out of strings, applying locale, case sensitivity, and Unicode decomposition. (At least so says the doc: http://download.oracle.com/javase/6/docs/api/java/text/Colla... )
The article, at the end, claims that "ASCII was developed from telegraph codes by a committee." It turns out the story is much, much more complicated and interesting than that: http://www.wps.com/projects/codes/
UCS-2, even though it is an outdated predecessor of UTF-16, has some unique qualities that make it useful for things like databases or other storage media where you are not mixing in a lot of low-code-point characters (as you do with XML and HTML markup).
One being that it's fair to all languages with respect to size, say when you are storing your standard Chinese, Korean, or Japanese characters.
When UTF-16 made UCS-2 variable-length, a few of those nice things were lost, but when dealing mostly with higher-code-point characters, UTF-16 may still save you space.
sorry for the question as a reply; i would also ask for some info about the unicode issues in jython, as i lack experience with all the python stuff. does it have problems, or is everything transparent?
Correct me if I'm wrong, but I think Excel still outputs UTF-16 in some cases. I remember parsing generated .txt/.csv files, and there were issues with it and its endianness.
You seem to be confusing "fair" with "equal". Treating everybody the same is not necessarily fair. It seems fair to me to have the most common characters be shortest. I don't have any evidence, but I would guess that Latin characters[1] are used most commonly.
[1] "Latin characters" is the proper term, not "American characters"
thanks! but i was just trying to kid around, which i can't control sometimes.
anyway, i think that even if Latin chars weren't the most used in the world, it would be fair to keep them as the primary charset for programming and markup languages, since computers started to be massively developed in America - just as no one now complains that the international language of medicine is Latin and not, say, Chinese :)
"UTF-16 has no practical advantages over UTF-8" - UTF-16 is better - i.e., quite a bit more compact, since, e.g., Chinese characters always take 3 characters in UTF-8, for asian languages.
"A related drawback is the loss of self-synchronization at the byte level." - Maybe this is a problem, maybe not. Maybe the failure of UTF-8 to be self-synchronising at the 4-bit level is a problem is some circumstances. I don't mean to be flippant, but the wider point is that with UTF-16, you really need to commit to 16-bit char width.
"The encoding is inflexible" - I think the author has confused the fixed-width UCS-2 and the variable-width UTF-16.
"We lose backwards compatibility with code treating NUL as a string terminator." - Not true. NUL is a 16-bit character in UTF-16. Use a C compiler that supports 16-bit char width.