Hacker News

Java stores all strings in UTF-16. It probably seemed like a good idea at the time, but it has caused problems since: there is now a huge body of broken code that assumes each 16-bit char value represents a complete, separate Unicode character.
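A minimal sketch of the failure mode described above, using an emoji (U+1F600, which lies outside the Basic Multilingual Plane and is therefore stored as a surrogate pair) as the example character:

```java
public class CharVsCodePoint {
    public static void main(String[] args) {
        // U+1F600 is encoded in UTF-16 as the surrogate pair D83D DE00.
        String s = "a\uD83D\uDE00b"; // "a<emoji>b" — three user-visible characters

        // length() counts 16-bit char units, not characters.
        System.out.println(s.length());                       // prints 4
        // codePointCount counts actual Unicode code points.
        System.out.println(s.codePointCount(0, s.length()));  // prints 3

        // charAt(1) returns only the high surrogate, not a usable character.
        System.out.println(Character.isSurrogate(s.charAt(1))); // prints true
    }
}
```

Code that indexes, slices, or reverses strings by char position will corrupt any such pair.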


Um. Wasn't Java's original choice UCS-2 (same as NTFS originally)? UTF-16 has surrogate pairs, so a character takes one or two 16-bit code units depending on whether it lies outside the Basic Multilingual Plane. UCS-2 was a fixed-width 16-bit encoding, long since deemed a mistake.


You're right. Java originally used UCS-2 and later switched to UTF-16. http://java.sun.com/docs/books/jls/first_edition/html/3.doc.... The problem is that dealing with surrogate pairs is so awkward that most developers don't even bother.
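For what it's worth, the surrogate-aware way to walk a string isn't that bad once you know the idiom — step by code point rather than by char index. A small sketch (the string content here is just an illustration):

```java
public class SurrogateSafe {
    public static void main(String[] args) {
        // First character is U+1D541 (double-struck J), a surrogate pair D835 DD41.
        String s = "\uD835\uDD41ava";

        // Advance by Character.charCount(cp): 2 for supplementary code points, 1 otherwise.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp);
        }
    }
}
```

A plain `for (int i = 0; i < s.length(); i++)` with `charAt(i)` would visit each half of the pair separately, which is exactly the bug most code has.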



