Broad backwards compatibility with ASCII is a strong reason to prefer UTF-8 in most applications; however, I find the issues with UTF-16 overstated and its advantages ignored.
"UTF-16 has no practical advantages over UTF-8" - UTF-16 is better - i.e., quite a bit more compact, since, e.g., Chinese characters always take 3 characters in UTF-8, for asian languages.
"A related drawback is the loss of self-synchronization at the byte level." - Maybe this is a problem, maybe not. Maybe the failure of UTF-8 to be self-synchronising at the 4-bit level is a problem is some circumstances. I don't mean to be flippant, but the wider point is that with UTF-16, you really need to commit to 16-bit char width.
"The encoding is inflexible" - I think the author has confused the fixed-width UCS-2 and the variable-width UTF-16.
"We lose backwards compatibility with code treating NUL as a string terminator." - Not true. NUL is a 16-bit character in UTF-16. Use a C compiler that supports 16-bit char width.
> "UTF-16 has no practical advantages over UTF-8" - UTF-16 is better - i.e., quite a bit more compact, since, e.g., Chinese characters always take 3 characters in UTF-8, for asian languages.
In some contexts that may matter. In others, you can expect enough ASCII mixed in to outweigh that effect. On the Web, for example, if you're sending HTML, the savings from using 8 bits for the ASCII tags will largely offset the cost of the extra bytes for content text. Gzip shrinks that further to essentially no difference.
As I posted there, too, he also missed one very important plus of UTF-16: character counting. If you ensure there are no surrogate code units in your array (a simple mask comparison that only needs to examine every second byte, as in "(0xF800 & unit) == 0xD800", which covers the entire surrogate range in one check), you can be 100% sure about the character length just by dividing the byte length by two. With UTF-8, you have to look at each and every byte, one by one, and count the characters to establish the actual character length. Finally, given how rare the Supplementary Range characters are, the surrogate-range check can be safely skipped in many scenarios, making it significantly faster to establish character lengths in UTF-16 than in UTF-8.
EDIT: oh, and before any "semantic nerd" comes along: I am fully aware that 0xXXXX is two bytes, so, if you want, read "two-byte unit" every time I mention "byte" above... (doh ;))
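To make that concrete, here is a minimal Python sketch of the idea (utf16_char_count is just my name for it, and I'm assuming big-endian UTF-16 bytes with no BOM):

    def utf16_char_count(data):
        # data: UTF-16-BE encoded bytes; bytearray gives plain ints to mask.
        units = bytearray(data)
        # A high byte in the 0xD8-0xDF range marks a surrogate code unit.
        if any((units[i] & 0xF8) == 0xD8 for i in range(0, len(units), 2)):
            raise ValueError("surrogates present; length is not just len(data) // 2")
        return len(data) // 2

    s = u"caf\u00e9 stra\u00dfe"   # BMP-only text, no surrogate pairs
    assert utf16_char_count(s.encode("utf-16-be")) == len(s)

The check only ever looks at every second byte, and once it passes, the division is the whole job.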
If you can guarantee there are no surrogate bytes, then you're working with UCS-2 rather than UTF-16.
If you can guarantee there are no bytes with the high bit set (a simple byte mask comparison which can be done 32 bits at a time for speed), then UTF-8 devolves to ASCII and you can calculate character-length without any division at all!
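A rough sketch of that check (Python 3 here, so the word-at-a-time masking is only an illustration of what the C version would do; is_ascii is my own name):

    def is_ascii(data):
        # Check 8 bytes per step by masking the high bit of every byte position.
        mask = 0x8080808080808080
        full = len(data) - len(data) % 8
        for i in range(0, full, 8):
            if int.from_bytes(data[i:i + 8], "big") & mask:
                return False
        # Handle the few leftover bytes one at a time.
        return all(b < 0x80 for b in data[full:])

    assert is_ascii(b"plain old ASCII")
    assert not is_ascii(u"caf\u00e9".encode("utf-8"))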
Counting the number of characters in a string is a misleading benchmark, because it's not a thing that people need to do terribly often - if you're rendering text to a display, "character count" is not enough, you need to know things like font metrics and which characters are combining, zero-width, or double-width. Concatenating strings is easy, and only cares about byte length, not character length. And if you really, really find character-length calculations on your hot-path, you can always just store length-prefixed strings instead of terminated strings.
Quite the contrary; I work in data mining and need to count characters on many different occasions. Probably for web development you don't need to care about that stuff, right? (But read the second paragraph first!) So if character counting is not important in your app, you can naturally use whatever you want. If it is, however, UTF-16 is far more efficient, without having to waste space on UTF-32, provided you can be quite sure the Supplementary Range can be safely ignored or you add a trivial safeguard.
And the "ASCII argument" is quite antiquated, almost all (scientific) char data I have contains non-ASCII characters, especially greek letters and such. Same goes even for web development or practically any user-oriented job where you need to accomodate for an international bunch of users. If you have pure ASCII data, lucky you, but why bother with UTF-anything then??
I do a lot of data mining and text mining as well. My biggest concern is always memory usage and UTF-8 is hugely more memory efficient than UTF-16 in almost all cases, even for Asian text. Any difference in character counting performance is negligible compared to the benefit of avoiding disk access.
Also, the text I process is never UTF-16 at the source. So even if I used UTF-16, I would have to convert the text first and that would be the one and only time characters are counted. There would be no additional counting overhead at all.
Re: char counting, as usual it depends on the application, but as the previous poster remarked, you can keep a separate counter for O(1) length access, at least in situations where strings are relatively constant. I agree the author of the article omitted this point, and I personally am not as black and white in the 'UTF-8 is better' camp, but string length alone is, imo, usually not an unsolvable bottleneck.
Well, I am neither black-and-white on the topic. Both have valid reasons for use: UTF-8 in general is more efficient in space for any use where ASCII characters are predominant. UTF-16 is more efficient if non-Latin-1 characters are predominant. UTF-8 has backwards compatibility with ASCII if that is important. UTF-16 makes it easier and faster to count characters, take slices of characters, and scan for characters (and, so far, I insist, no valid argument to the contrary has ever been produced - all UTF-8 solutions are hacky and/or harder to maintain). UTF-8 nicely works even with the native C char type, UTF-16 doesn't. In the end, all these considerations need to be weighed, and your solution should accommodate what is best for your problem. In my case, all I want to say is that UTF-16 often turns out to be the better approach for text processing, while UTF-8 is useful for text storage and transmission.
Right, if you ensure that only ASCII characters are used in UTF-8, which you can check using 0x80 & byte == 0x00, counting the number of characters is easy. But to check that this condition holds, you need to iterate over the whole string anyway.
I don't see how this gives any advantage to either of the encodings. Both for UTF-8 and UTF-16 you have to implement some decoding to reliably count the number of characters.
Oh, and another thing: your argument only holds for ASCII - I hardly ever encounter pure ASCII data nowadays when I work with text. On the other hand, I have never encountered a Surrogate Character except in my test libraries, either. Next, if you are working with pure ASCII, who cares about UTF-8 or -16? Last, you still need to scan every byte in UTF-8 for that, but only one byte of every 16-bit unit in UTF-16, which is half as many.
When you say 'ASCII' I guess you mean 'strings where all bytes have a decimal value in the range [0-127]', right? If so I agree that it's rare to encounter that, but the common use (however wrong) of ASCII is 'chars are in [0-255]', i.e. all chars are one byte; and that data is very common.
Thinking about it, though, I don't know what codepage the UTF-8 byte values 128-255 map to, if any; could you explain? If you treat UTF-8 as ASCII data (one byte, one character, basically), does it generally work with chars in the [128-255] range?
256 char ASCII is called "8-bit", "high", or "extended" ASCII.
So, pure (7-bit) ASCII is the only thing you can hold in a "one-byte UTF-8 array". The 8th bit (or the first, depending on how you see it) is used to mark multibyte characters, i.e. any character that is not ASCII. So you can only represent 128 possible symbols in a UTF-8 character that is one byte long. In particular, UTF-8 maps these to the ASCII (or US-ASCII) characters, and the byte starts with a 0 bit. (In other words, you cannot encode high ASCII into one UTF-8 byte.) For ALL other characters, no matter which, the first bit of every byte in the (multi-byte) sequence is set to 1. That's why it is easy to scan for the length of ASCII chars in UTF-8, but not for any others.
The important fact lies in that last sentence; in all of my apps, I could so far safely and soundly ignore the fact that surrogate pairs make some characters two code units long when establishing character lengths. So by using UTF-16, all those apps are significantly faster than they would have been based on UTF-8. And, by the same nature, it is far easier to take char slices of a UTF-16 encoded byte array than of a UTF-8 one.
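For instance, a sketch of what slicing looks like once you know the data is UCS-2-compatible (utf16_slice is my own name; big-endian bytes assumed):

    def utf16_slice(data, start, stop):
        # data: UTF-16-BE bytes known to contain no surrogate pairs,
        # so character i always occupies bytes 2*i and 2*i + 1.
        return data[2 * start:2 * stop]

    s = u"\u4e2d\u6587 text"          # BMP-only: two CJK chars, a space, "text"
    raw = s.encode("utf-16-be")
    assert utf16_slice(raw, 0, 2).decode("utf-16-be") == u"\u4e2d\u6587"

With UTF-8 you would first have to walk the bytes just to find where characters 0 and 2 begin.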
I'm interested to know in what application you need character counting where this makes a difference: how about diacritics and accents? Or do you care only about chars and not about semantics or graphics?
Also, if that mattered to me, I'd store the char count in addition to the string length.
"...Unicode has been standardising CJK character sets" - CJK characters sets are complex, and will always face opposition, besides, this is less related to how the characters are encoded, and more how to choose and composite your code points.
"I think the author has confused the fixed-length UCS-2 and the variable length UTF-16." - No, in the same sentence you are quoting he mentions that utf-16 is limited to 0x110000 codepoints, contra utf-8 that is specified to expand up to 6 bytes.
"Not true. NUL is a 16-bit character in UTF-16. Use a C compiler that supports 16-bit char width." - I won't be converting my source code into utf-16 any time soon. Besides, C is not a good example where this is an actual problem, the "unicode strings" will still just be represented as binary blobs - different depending on which encoding you choose. It's more of a problem in e.g. Python, where the native string is an array of characters, not just a null terminated blob.
I wouldn't encode my source code in UTF-8; however, if my editor by default happened to support UTF-8, this wouldn't be an issue unless I entered a non-ASCII-compatible code point. I see this as a big plus.
"confused ... UCS-2 and ... UTF-16" - You are quite right, he is clear on the difference. It's presumably possible to apply the same trick to UTF-16 as was applied to UCS-2, to expand further, but this is ad-hoc and not built in.
"supports 16-bit char width" - I mean the compiler has CHAR_BIT == 16; this is independent of the character encoding of the source.
You don't mention any issues, and the advantages you list are marginal and specific to use cases such as Asian languages.
"UTF-16 is better": but 'only' for Asian languages; most code is still written in ASCII
"need to commit to 16-bit char width": the OP's point was that many many tools don't operate on 16-bit char width and with UTF-8 you can continue to use them.
The other two points are not advantages for UTF-16.
>"We lose backwards compatibility with code treating NUL as a string terminator." - Not true. NUL is a 16-bit character in UTF-16. Use a C compiler that supports 16-bit char width.
Yea, it really should be "We lose backwards compatibility with code treating a zero 8-bit byte as a string terminator."
>"UTF-16 has no practical advantages over UTF-8" - UTF-16 is better - i.e., quite a bit more compact, since, e.g., Chinese characters always take 3 characters in UTF-8, for asian languages.
Too few programmers know that UTF-8 is a variable length encoding: I've heard plenty assert that in UTF-8, every character takes two bytes, while claiming simultaneously they could encode every possible character in it.
A bit broader: too few programmers understand the difference between a character set and a character encoding.
That article starts out OK and then suddenly tries to argue that you can use the terms interchangeably. You cannot, and you will drown in confusion if you try to. Just imagine that tomorrow the Chinese introduce their own character set next to Unicode, but pair it with a UTF-8-style variable-length encoding to minimize the number of bytes it takes to represent their language (which makes sense, because the frequency of characters drops off pretty fast and some characters are much more common than others, so you'd like to represent those with one byte).
The fact that the HTTP RFC speaks of 'charset=utf-8' is explained by this part of the spec:
Note: This use of the term "character set" is more commonly
referred to as a "character encoding." However, since HTTP and
MIME share the same registry, it is important that the terminology also be shared.
Why does MIME use the 'wrong' terminology? Perhaps because the registry is old and the difference between set and encoding was less obvious and relevant back then. Perhaps it was simply a mistake; a detail meant to be corrected. Perhaps the person who drew it up was inept. Who knows. It doesn't matter; it is still wrong. And don't get me started on the use of 'character set' in MySQL...
Unicode is a character set, and the only character set really worth speaking of. The Unicode character set includes almost every character in every writing system on Earth. A string is a piece of text, i.e. an ordered sequence of characters all taken from the same character set.
A character encoding is a mapping/function/algorithm/set of rules which can be used to convert a string into a sequence of bytes and back again.
A character set may have multiple encodings. UTF-8 and UTF-16 are two possible encodings of the Unicode character set.
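A quick Python illustration of the distinction: one Unicode string, two different encodings of it into bytes:

    s = u"stra\u00dfe"                  # one string, drawn from the Unicode character set
    assert s.encode("utf-8") == b"stra\xc3\x9fe"                           # one encoding of it
    assert s.encode("utf-16-be") == b"\x00s\x00t\x00r\x00a\x00\xdf\x00e"   # another encoding of it

Same characters, different byte sequences; 'charset=utf-8' really names the encoding.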
Actually, it's just stored internally as UCS-2. I'm not sure why that really matters, though. You shouldn't care how your Unicode code points are stored (i.e., the size of the integer) as long as they encode to a UTF encoding correctly.
Edit: Ah, guys, it's actually UTF-16; the configure flag is just named ucs2. False alarm.
I doubt that they are encoded in UCS-2, as that character set isn't able to encode every Unicode code point (or even just the majority of them).
You are right, though (and this is why I upvoted you back to 1), that you shouldn't care. In fact, you not knowing the internal encoding is the proof of that. In Python (I'm talking Python 3 here, which has done this right), you don't care how a string is stored internally.
The only place where you care about this is when your strings interact with the outside world (i/o). Then your strings need to be converted into bytes and thus the internal representation must be encoded using some kind of encoding.
This is what the .decode and .encode methods are used for.
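A minimal Python 3 sketch of that boundary (the file or socket is imagined; only the decode/encode calls matter):

    raw = b"caf\xc3\xa9"              # bytes arriving from a file or socket
    text = raw.decode("utf-8")        # inside the program: an abstract string
    assert len(text) == 4             # counted in code points, not bytes
    out = text.encode("utf-16-le")    # re-encode however the other side wants it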
In Python 2.x, unicode strings are encoded in UCS-2, not UTF-16, at least by default (I'm not sure about Python 3.x; I assume it's the same, though). If you want to support every single possible Unicode codepoint, you can tell Python to do so at compile time (via a ./configure flag).
In practice, the characters that aren't in UCS-2 tend to be characters that don't exist in modern languages, e.g. the character sets for Linear B, domino tiles, and cuneiform, so they're not supported since they're not of practical use to most people. There's a fairly good list at http://en.wikipedia.org/wiki/Plane_(Unicode) . In terms of that list, Python by default doesn't support things outside the BMP.
Things outside of the BMP aren't just dead languages anymore. You have to be able to support characters outside the BMP if you want to sell your software in China:
UTF-16 behaves to UCS-2 as UTF-8 does to ASCII. Meaning: on the characters they both cover, they agree byte for byte. UTF-16 extends UCS-2 by using some reserved code units (the surrogates) to indicate that what follows should be interpreted according to UTF-16 rules. So, just like UTF-8.
Meaning: every UCS-2 document is also a UTF-16 document, but not the reverse (just like every ASCII document is also a UTF-8 document).
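To illustrate (Python; the byte values are just the standard UTF-16-BE ones):

    # A BMP character is one 16-bit unit; U+10000 becomes a surrogate pair.
    assert u"\u4e2d".encode("utf-16-be") == b"\x4e\x2d"
    assert u"\U00010000".encode("utf-16-be") == b"\xd8\x00\xdc\x00"

The 0xD800/0xDC00 units come from exactly the reserved range mentioned above, which is why a plain UCS-2 document never contains them.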
But as I said below: it doesn't matter, and it could even be a totally proprietary character set, as long as Python's string operations work on that character set and as long as there's a way to decode input data into that set and encode output data from that set.
You should very much care about that, because if your tool stores text as UCS-2, it means that it can't represent all of Unicode; UCS-2 stopped being a valid encoding a long time ago.
You are completely right, I'm sorry about my previous comment.
The strange thing is that I couldn't find any reference to surrogate pairs in the Python documentation, so I was assuming that the elements of a unicode string were complete codepoints. It turns out this is not the case:
>>> list(u'\U00010000')
[u'\ud800', u'\udc00']
If I had Python compiled with the UTF-32 option, this would return a single element, so Python is leaking an implementation detail that can change across builds. That's really, really bad...
No, that's the correct behavior. list only incidentally returns one element per character for ASCII strings -- it's not required to. You shouldn't be using list on raw unicode strings.
u'\U00010000'.encode('utf-8')
should produce the same result on every Python version.
> You shouldn't be using list on raw unicode strings.
Why? I am using list only to show what are the values of s[0] and s[1].
What I am saying is that it returns the list of characters of the underlying representation, so a list of 16-bit chars (possibly surrogates) if compiled with UTF-16, or a list of 32-bit characters if compiled with UTF-32.
Are you suggesting that all the string processing (including iteration) should be done on a str encoded in UTF8 instead of using the native unicode type?
If you want to deal with characters with high code point numbers, you should know this code point stuff. For example, String.length() returns the number of two-byte chars, not the number of real four-byte characters, which may confuse someone.
Exactly. A Java char is not synonymous with a Unicode code point. But the majority of the time they do coincide, older documentation claimed that they were the same, and this is the meme that many Java programmers (in my experience) have.
That's actually my point. Python supports Unicode code points and the UTF encodings. If you encode the output as UTF-8, it will actually be variable-length chars. What's important is your encoded output, not the internal code point representation.
It leaks through in some places. For example, len(u'\U0001D310') (from the Tai Xuan Jing Symbols) returns 1 on wide (32-bit) Python builds, and returns 2 on the default narrow (16-bit) builds.
The Windows NT development team made the decision to standardise on 16-bit wide characters. Every release of Windows since the original NT uses them internally for all its "wide character" API calls (e.g., wcslen() and FindWindowW()); originally that meant UCS-2, and since Windows 2000 it has meant UTF-16.
I don't know about "preferring", but anyone manipulating strings in JavaScript is effectively using UTF-16 (or more precisely is using arrays of 16-bit integers which a web browser will interpret as UTF-16-encoded Unicode if you tell it that the array contains text).
As a consequence at least Gecko and Webkit both use UTF-16 for their string classes, though there has been talk of trying to switch Gecko over to UTF-8. The problem then would be implementing the JS string APIs on top of UTF-8 strings efficiently.
Strings are big-endian UTF-16 by default even in Cocoa (stored in an array of unsigned shorts). Worst of all, GCC defines wchar_t as a 4-byte int unless you specify -fshort-wchar.
As far as I know, wchar_t is meant to be an internal-only representation, so it's good that it is 32 bits--that way you are in one-codepoint-per-word territory. It's a mistake to think you can just overlay some Unicode binary data with a wchar_t pointer--you need to convert into and out of wchar_t from UTF-8/UTF-16/whatever. Otherwise you aren't handling codepoints above 16 bits correctly.
This is a common misconception about UTF-8 vs. UTF-16. You're missing two important facts.
1. Most UTF-8 string operations can operate on a byte at a time. You just have to realize that functions like strlen will be telling you byte length instead of character length, and this usually doesn't even matter. (It's still important to know; when you do need a character count, see the sketch after the next point.)
2. UTF-16 is still a variable-width encoding. It was originally intended to be fixed-width, but then the Unicode character set grew too large to be represented in 16 bits.
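And for the cases where the character count really is needed, it's a short scan over the bytes (a Python sketch; utf8_char_count is my own name, and the same (b & 0xC0) != 0x80 test works in C):

    def utf8_char_count(data):
        # Count only the bytes that start a character, i.e. skip the
        # continuation bytes of the form 10xxxxxx.
        return sum(1 for b in bytearray(data) if (b & 0xC0) != 0x80)

    assert utf8_char_count(u"caf\u00e9".encode("utf-8")) == 4   # 5 bytes, 4 characters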
If I have no Surrogate Range code points in a string, it is far easier to work with UTF-16 than UTF-8 at the byte level, because all chars are a constant size. For UTF-8 that only applies to ASCII. And Surrogate Range characters are extraordinarily rare, while non-ASCII chars are extremely common. So my programs ensure at the entry points that the string is UCS-2-compatible, and then all subsequent string manipulations are far less complex to handle than with UTF-8.
UTF-16 was never intended to be a fixed-width encoding; it was created in order to support characters outside the BMP, which aren't covered by UCS-2.
The environments that jumped on Unicode early, before it was realized that 2 bytes wouldn't be enough, all chose to use UCS-2, for obvious reasons. In particular, that includes Windows and Java.
Probably because they figured they could just ignore endianness issues and that ASCII compatibility would be Somebody Else's Problem.
There were always problems with UCS-2. UTF-8 would have had a number of advantages over it even if Unicode had never grown beyond the BMP (Basic Multilingual Plane, the first and lowest-numbered 16-bit code space).
> ASCII compatibility would be Somebody Else's Problem
for many of those outside "A" in ASCII (euphemism for America :) there were already a ton of problems, so endianness was the least (i personally never hit this problem)
// disclaimer: i'm not that serious about predominance of Latin script, this is sorta irony
Depending on the level of abstraction you're living at - and that depends on the overall goal, performance constraints, environmental integration, OS / machine heterogeneity etc. - it may or may not be a problem.
It's easy to dismiss if you have all the time in the world and a deep stack of abstractions.
If you're doing deep packet analysis on UTF-16 text in a router, things may be different.
thanks, my question was exactly about the issues met by people living at other levels of abstraction.
i'm not a native english speaker and a newb to HN, so sorry that i phrased my sincere question so that it looked like an arrogant statement: 'there are no issues, what are you talking about, i don't even know what LE and BE mean'.
> for many of those outside "A" in ASCII (euphemism for America :)
Abbreviation for 'American', in fact. No euphemisms needed.
(ASCII = American Standard Code for Information Interchange)
> there were already a ton of problems, so endianness was the least
I can appreciate this. However, UTF-8 also has desirable properties like 'dropping a single byte only means you lose one character, as opposed to potentially losing the whole file', and 'you can often tell if a multi-byte UTF-8 sequence has been corrupted without doing complex analysis'.
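For instance, after a lost or mangled byte you can find the next character boundary with a trivial loop (a Python 3 sketch; resync is my own name):

    def resync(data):
        # Skip continuation bytes (10xxxxxx); the first byte that isn't one
        # starts the next complete character.
        i = 0
        while i < len(data) and (data[i] & 0xC0) == 0x80:
            i += 1
        return data[i:]

    stream = u"caf\u00e9!".encode("utf-8")
    damaged = stream[4:]                  # we picked up the stream mid-character
    assert resync(damaged).decode("utf-8") == u"!"

With UTF-16, a single lost byte shifts every following code unit out of alignment.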
> i'm not that serious about predominance of Latin script, this is sorta irony
Heh. ASCII can't even encode the entirety of the Latin script: ask a Frenchman how he spells 'café', or a German how he spells 'straße', and notice that important characters are missing from ASCII.
I keep hoping a string API will catch on in which combining marks are mostly treated as indivisible. Handling text one codepoint at a time is as bad an idea as handling US-ASCII one bit at a time--almost everything it lets you do is an elaborate way to misinterpret or corrupt your data.
It's not so simple: it depends on what you're doing with the text. If you're not trying to do analysis with it, encoded text is more or less a program written in a DSL that, when interpreted by a font renderer, draws symbols in some graphical context. Depending on the analysis you want to do, you need varying amounts of knowledge. Perhaps you only need to know about word boundaries; perhaps you're trying to look things up in a normalized dictionary; maybe even decompose a word into phonemes to try and pronounce it. These require different levels of analysis, and one size won't fit all.
Update: for particular purposes, consider using the Collator class; it makes collation keys (byte arrays) out of strings, applying locale, case sensitivity, and Unicode decomposition. (At least so says the doc: http://download.oracle.com/javase/6/docs/api/java/text/Colla... )
The article, at the end, claims that "ASCII was developed from telegraph codes by a committee." It turns out the story is much, much more complicated and interesting than that: http://www.wps.com/projects/codes/
UCS-2, even though it is an outdated predecessor of UTF-16, has some unique qualities that make it useful for things like databases or other storage media where you are not mixing in a lot of low-code-point characters (as you do with XML and HTML markup).
One being that it's fair to all languages with respect to size, say when you are storing your standard Chinese, Korean, or Japanese characters.
When UTF-16 made UCS-2 variable-length, a few of those nice things were lost, but when dealing mostly with higher-code-point characters, UTF-16 may still save you space.
sorry for the question as a reply; i would also ask for some info about the unicode issues in jython, as i lack experience with all the python stuff. does it have problems, or is everything transparent?
Correct me if I'm wrong, but I think Excel still outputs UTF-16 in some cases. I remember parsing generated .txt/.csv files, and there were issues with it and its endianness.
You seem to be confusing "fair" with "equal". Treating everybody the same is not necessarily fair. It seems fair to me to have the most common characters be shortest. I don't have any evidence, but I would guess that Latin characters[1] are used most commonly.
[1] "Latin characters" is the proper term, not "American characters"
thanks! but i was just trying to kid around, which i can't control sometimes.
anyway, i think that even if Latin chars weren't the most used in the world, it would be fair to keep them as the primary charset for programming and markup languages, since computers started to be massively developed in America - just as no one now complains that the international language of medicine is Latin and not, say, Chinese :)
"UTF-16 has no practical advantages over UTF-8" - UTF-16 is better - i.e., quite a bit more compact, since, e.g., Chinese characters always take 3 characters in UTF-8, for asian languages.
"A related drawback is the loss of self-synchronization at the byte level." - Maybe this is a problem, maybe not. Maybe the failure of UTF-8 to be self-synchronising at the 4-bit level is a problem is some circumstances. I don't mean to be flippant, but the wider point is that with UTF-16, you really need to commit to 16-bit char width.
"The encoding is inflexible" - I think the author has confused the fixed-width UCS-2 and the variable-width UTF-16.
"We lose backwards compatibility with code treating NUL as a string terminator." - Not true. NUL is a 16-bit character in UTF-16. Use a C compiler that supports 16-bit char width.