
Except that Java strings aren't in Unicode. They're in UTF-16, which is the worst-of-all-worlds encoding. (It's big and heavy, and it still has multi-unit sequences: surrogate pairs. They're just rare enough that you're likely to forget about them during testing.)
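
A minimal sketch of what that looks like in practice; any code point above U+FFFF behaves this way, the emoji below is just an arbitrary example:

    public class SurrogateDemo {
        public static void main(String[] args) {
            // U+1F600 is outside the Basic Multilingual Plane, so Java stores it
            // as a surrogate pair of two char values.
            String s = "\uD83D\uDE00";
            System.out.println(s.length());                        // 2 (UTF-16 code units)
            System.out.println(s.codePointCount(0, s.length()));   // 1 (Unicode code point)
        }
    }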


UTF-16 is most certainly Unicode, even if it isn't the preferred flavor.


Wrong. UTF-16 is an encoding; Unicode is the abstract representation. You can encode Unicode strings into UTF-16, but that doesn't make UTF-16 Unicode. Python's Unicode strings really are just Unicode - that's why you can't write them to a file directly; you need to encode them first (which defaults to UTF-8).
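
The same separation shows up in Java the moment text leaves memory: you have to pick a concrete encoding (explicitly or by default) to get bytes. A minimal sketch, where the file name and sample text are just placeholders:

    import java.io.FileOutputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;

    public class WriteDemo {
        public static void main(String[] args) throws Exception {
            String s = "abstract text";  // in memory: a sequence of code units, no bytes yet
            try (Writer w = new OutputStreamWriter(
                    new FileOutputStream("out.txt"), StandardCharsets.UTF_8)) {
                w.write(s);  // the characters are encoded to UTF-8 bytes at this point
            }
        }
    }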


Unfortunately, it only defaults to UTF-8 in Python 3.


In Java, you end up seeing the guts of UTF-16 far more often than you should. Most notably, the String APIs often index strings by UTF-16 code units, not characters, so string "lengths" don't always correspond to the number of Unicode characters, and you can end up cutting surrogate pairs in half if you aren't careful.
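
A minimal sketch of both failure modes, using an arbitrary sample string with one supplementary character:

    public class CodeUnitDemo {
        public static void main(String[] args) {
            String s = "a\uD83D\uDE00b";  // 'a', U+1F600 (a surrogate pair), 'b'
            System.out.println(s.length());                        // 4 code units
            System.out.println(s.codePointCount(0, s.length()));   // 3 code points
            // Indexing by code units can split the pair: this substring keeps
            // only the high surrogate, producing a malformed string.
            System.out.println(s.substring(0, 2));
            // Code-point-aware traversal avoids the problem (Java 8+).
            s.codePoints().forEach(cp -> System.out.printf("U+%X%n", cp));
        }
    }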


"UCS-2 (2-byte Universal Character Set) is a character encoding that was superseded by UTF-16 in version 2.0 of the Unicode standard in July 1996". Java adopted UCS-2 which was later supplemented with UTF-16 support.

There are at least a few languages in this world which do not use Roman letters and are better represented as multibyte sequences :). That is why Java added support for supplementary characters (via UTF-16 surrogate pairs) over and above UCS-2. That said, UTF-8 would have been optimal for Western languages, but suboptimal for several other languages.
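
A rough sketch of that size trade-off, comparing encoded lengths for an ASCII string and a BMP-only Japanese string (both samples are arbitrary):

    import java.nio.charset.StandardCharsets;

    public class SizeDemo {
        public static void main(String[] args) {
            String latin = "hello world";   // 11 ASCII characters
            String cjk   = "こんにちは世界";  // 7 characters, all in the BMP
            report(latin);  // UTF-8 = 11 bytes, UTF-16 = 22 bytes
            report(cjk);    // UTF-8 = 21 bytes, UTF-16 = 14 bytes
        }

        static void report(String s) {
            System.out.printf("%s: UTF-8 = %d bytes, UTF-16 = %d bytes%n",
                    s,
                    s.getBytes(StandardCharsets.UTF_8).length,
                    s.getBytes(StandardCharsets.UTF_16LE).length);
        }
    }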



