
Except that Java strings aren't in Unicode. They're in UTF-16, which is the worst-of-all-worlds encoding. (It's big and heavy, and it still has multi-unit sequences: surrogate pairs. They're just rare enough that you're likely to forget about them during testing.)
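
A minimal sketch of what that looks like in practice; any code point above U+FFFF behaves this way, the emoji below is just an arbitrary example:

    public class SurrogateDemo {
        public static void main(String[] args) {
            // U+1F600 is outside the Basic Multilingual Plane, so Java stores it
            // as a surrogate pair of two char values.
            String s = "\uD83D\uDE00";
            System.out.println(s.length());                        // 2 (UTF-16 code units)
            System.out.println(s.codePointCount(0, s.length()));   // 1 (Unicode code point)
        }
    }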


UTF-16 is most certainly Unicode, even if it isn't the preferred flavor.


Wrong. UTF-16 is an encoding; Unicode is the abstract representation. You can encode Unicode strings into UTF-16, but that doesn't make UTF-16 Unicode. Python's Unicode strings really are just Unicode - that's why you can't write them to a file directly; you need to encode them first (which defaults to UTF-8).
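
The same separation shows up in Java the moment text leaves memory: you have to pick a concrete encoding (explicitly or by default) to get bytes. A minimal sketch, where the file name and sample text are just placeholders:

    import java.io.FileOutputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;

    public class WriteDemo {
        public static void main(String[] args) throws Exception {
            String s = "abstract text";  // in memory: a sequence of code units, no bytes yet
            try (Writer w = new OutputStreamWriter(
                    new FileOutputStream("out.txt"), StandardCharsets.UTF_8)) {
                w.write(s);  // the characters are encoded to UTF-8 bytes at this point
            }
        }
    }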


Unfortunately, it only defaults to UTF-8 in Python 3.


In Java, you end up seeing the guts of UTF-16 far more often than you should. Most notably, the String APIs often index strings by UTF-16 code units, not characters, so string "lengths" don't always correspond to the number of Unicode characters, and you can end up cutting surrogate pairs in half if you aren't careful.
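
A minimal sketch of both failure modes, using an arbitrary sample string with one supplementary character:

    public class CodeUnitDemo {
        public static void main(String[] args) {
            String s = "a\uD83D\uDE00b";  // 'a', U+1F600 (a surrogate pair), 'b'
            System.out.println(s.length());                        // 4 code units
            System.out.println(s.codePointCount(0, s.length()));   // 3 code points
            // Indexing by code units can split the pair: this substring keeps
            // only the high surrogate, producing a malformed string.
            System.out.println(s.substring(0, 2));
            // Code-point-aware traversal avoids the problem (Java 8+).
            s.codePoints().forEach(cp -> System.out.printf("U+%X%n", cp));
        }
    }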


"UCS-2 (2-byte Universal Character Set) is a character encoding that was superseded by UTF-16 in version 2.0 of the Unicode standard in July 1996". Java adopted UCS-2 which was later supplemented with UTF-16 support.

There are at least a few languages in this world which do not use Roman letters and are better represented as multibyte sequences :). That is why Java added support for supplementary characters (via UTF-16 surrogate pairs) over and above UCS-2. That said, UTF-8 would have been optimal for Western languages, but suboptimal for several other languages.
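
A rough sketch of that size trade-off, comparing encoded lengths for an ASCII string and a BMP-only Japanese string (both samples are arbitrary):

    import java.nio.charset.StandardCharsets;

    public class SizeDemo {
        public static void main(String[] args) {
            String latin = "hello world";   // 11 ASCII characters
            String cjk   = "こんにちは世界";  // 7 characters, all in the BMP
            report(latin);  // UTF-8 = 11 bytes, UTF-16 = 22 bytes
            report(cjk);    // UTF-8 = 21 bytes, UTF-16 = 14 bytes
        }

        static void report(String s) {
            System.out.printf("%s: UTF-8 = %d bytes, UTF-16 = %d bytes%n",
                    s,
                    s.getBytes(StandardCharsets.UTF_8).length,
                    s.getBytes(StandardCharsets.UTF_16LE).length);
        }
    }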



