
Whatever environments jumped on Unicode early, before it was realized that 2 bytes wouldn't be enough, all chose to use UCS-2 for obvious reasons. In particular, that includes Windows and Java.


Well, the Plan 9 guys famously got this right...


> all chose to use UCS-2 for obvious reasons

Probably because they figured they could just ignore endianness issues and that ASCII compatibility would be Somebody Else's Problem.

There were always problems with UCS-2. UTF-8 would have had a number of advantages over it even if Unicode had never grown beyond the BMP (Basic Multilingual Plane, the first and lowest-numbered 16-bit code space).
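A small Python sketch of the BMP limit and the ASCII point, for anyone who wants to see the bytes. The sample characters are arbitrary picks for illustration, not anything from the comments above.

```python
# A code point above the BMP (greater than U+FFFF); chosen arbitrarily.
ch = "\U0001F600"

utf16 = ch.encode("utf-16-be")
utf8 = ch.encode("utf-8")

# UTF-16 needs a surrogate pair (two 16-bit units); plain UCS-2 has no way to express this.
print(utf16.hex(" "))  # d8 3d de 00
print(utf8.hex(" "))   # f0 9f 98 80

# Inside the ASCII range, UTF-8 is byte-for-byte ASCII, while 16-bit encodings add a zero byte.
a_utf16 = "A".encode("utf-16-be")
a_utf8 = "A".encode("utf-8")
print(a_utf16.hex(" "))  # 00 41
print(a_utf8.hex(" "))   # 41
```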


Note however that UTF-8 did not exist in the early days and UTF-1 sucked.


> endianness issues

what are the issues?

> ASCII compatibility would be Somebody Else's Problem

for many of those outside "A" in ASCII (euphemism for America :) there were already a ton of problems, so endianness was the least (i personally never hit this problem)

// disclaimer: i'm not that serious about predominance of Latin script, this is sorta irony


Endian: there's little-endian UTF-16LE and big-endian UTF-16BE, mirror images of one another.
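To make the mirror-image point concrete, here is a minimal Python sketch. The string is an arbitrary example; the interesting part is that decoding with the wrong byte order does not raise an error here, it just produces different characters.

```python
text = "hi"

le = text.encode("utf-16-le")
be = text.encode("utf-16-be")

print(le.hex(" "))  # 68 00 69 00
print(be.hex(" "))  # 00 68 00 69

# Read little-endian bytes as if they were big-endian: no exception, just wrong text.
wrong = le.decode("utf-16-be")
print([hex(ord(c)) for c in wrong])  # ['0x6800', '0x6900']
```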


i thought the word 'issue' meant 'a problem'...


Depending on the level of abstraction you're living at - and that depends on the overall goal, performance constraints, environmental integration, OS / machine heterogeneity etc. - it may or may not be a problem.

It's easy to dismiss if you have all the time in the world and a deep stack of abstractions.

If you're doing deep packet analysis on UTF-16 text in a router, things may be different.


thanks, my question was exactly about the issues met by people living at other levels of abstraction.

i'm not a native english speaker and a newb to HN, so sorry that i phrased my sincere question so that it looked like an arrogant statement: 'there are no issues, what are you talking about, i don't even know what LE and BE mean'.

will learn.


If you're not sending UTF-16 text across the wire in network byte order, you are already in a world of pain.
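Network byte order is big-endian. One possible way to do that in Python, sketched under the assumption that the sender also prepends a byte order mark (the comment above does not say whether a BOM is used):

```python
import codecs

text = "héllo"  # arbitrary sample text

# Big-endian payload with an explicit BOM (FE FF) in front. The BOM is an assumption of this sketch.
wire = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
print(wire[:2].hex(" "))  # fe ff

# The generic "utf-16" codec reads the BOM, picks the byte order, and strips the BOM.
received = wire.decode("utf-16")
print(received == text)  # True
```

Without a BOM the receiver has to know the byte order out of band, since the generic codec would otherwise fall back to a platform default.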


> for many of those outside "A" in ASCII (euphemism for America :)

Abbreviation for 'American', in fact. No euphemisms needed.

(ASCII = American Standard Code for Information Interchange)

> there were already a ton of problems, so endianness was the least

I can appreciate this. However, UTF-8 also has desirable properties like 'dropping a single byte only means you lose one character, as opposed to potentially losing the whole file', and 'you can often tell if a multi-byte UTF-8 sequence has been corrupted without doing complex analysis'.

> i'm not that serious about predominance of Latin script, this is sorta irony

Heh. ASCII can't even encode the entirety of the Latin script: Ask a Frenchman how he spells 'café', or a German how he spells 'straße', and notice how important characters are missing from ASCII.
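Easy to check in Python. The helper name `fits_ascii` is made up for this sketch:

```python
def fits_ascii(word: str) -> bool:
    """Return True if every character of word can be encoded as ASCII."""
    try:
        word.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

for word in ("cafe", "café", "straße"):
    print(word, fits_ascii(word), word.encode("utf-8"))
# cafe True b'cafe'
# café False b'caf\xc3\xa9'
# straße False b'stra\xc3\x9fe'
```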



