
What you're saying is true (although I'm not sure about 3, all the Japanese people I've talked to are annoyed by that), but it really sucks that we're dealing with problems that were supposed to be fixed by Unicode.

Han unification was a huge mistake made to save a few thousand characters, and now we keep piling on more and more stupid emoji.



  > Han unification was a huge mistake made to save a few thousand
  > characters, and now we keep piling on more and more stupid emoji.
This is my exact problem with Unicode. I'm very grateful for the efforts that they have made in the past, but the change from "spare valuable codepoints at the expense of causing ambiguity in text" to "assign a new codepoint to every cartoon permutation of intangible nouns" is infuriating.


Han unification and emoji come from different eras. We're talking about a decision made about 30 years ago, in the early 90s; Unicode was 16 bits (65k total codepoints) at the time.

The original CJK Unified Ideographs block from 1992 consists of 21k codepoints. It was impossible to encode that separately four times, since 4×21k is more than 65k, and space was needed for other languages as well. Why not make Unicode larger? Well, size was a real concern back then (it still is, to some degree, but less so).

Since then Unicode has expanded, and we now have about 1.1 million codepoints. The Han character blocks have grown to about 93k codepoints today, and 4 times ~93k codepoints is actually feasible. But now you run into compatibility issues: you can't remove all the old Han unification stuff (it would break existing text, a big no-no), so you would need to re-define it all anew. Is that better? How about mixing "old" Han-unified codepoints with new Japanese or Chinese ones? Will it really improve things or just cause endless confusion (see: combining characters)?
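The "combining characters" confusion mentioned above can be sketched in a few lines of Python: the same visible text can be encoded as different codepoint sequences, and naive comparison breaks unless you normalize first.

```python
import unicodedata

# Two encodings of the same visible character "é":
precomposed = "\u00E9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"     # U+0065 'e' + U+0301 COMBINING ACUTE ACCENT

print(precomposed == combining)   # False: different codepoint sequences

# Normalizing to NFC composes the pair into the single codepoint:
print(unicodedata.normalize("NFC", combining) == precomposed)   # True
```

Any scheme that re-encoded Han characters alongside the old unified codepoints would invite this same class of "two spellings for one character" problem, only across tens of thousands of ideographs.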

For scale, all of Unicode currently defines about 145k codepoints; so even with Han unification we're talking about two thirds being taken up by just these three languages.

In comparison, there are currently about 3,000 emoji, although the number of codepoints is much lower since many are re-used (e.g. "firefighter" is "person + fire engine", flags are built from country codes, etc.). A quick check suggests about 1,000 to 1,500 codepoints are reserved for emoji. Next to the Han blocks, this is nothing.
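The codepoint reuse described above can be sketched in Python; these are standard Unicode emoji mechanics (ZWJ sequences and regional indicator symbols), not something specific to this thread.

```python
# "Firefighter" is not its own codepoint: it is a ZWJ sequence.
person = "\U0001F9D1"       # 🧑 PERSON
zwj = "\u200D"              # ZERO WIDTH JOINER
fire_engine = "\U0001F692"  # 🚒 FIRE ENGINE

firefighter = person + zwj + fire_engine   # one glyph on capable fonts
print(len(firefighter))                    # 3 codepoints, one visible emoji

def flag(country_code):
    """Build a flag emoji from a two-letter country code using the 26
    regional indicator symbols (U+1F1E6 = regional indicator 'A')."""
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in country_code.upper())

print(flag("NL") == "\U0001F1F3\U0001F1F1")   # True: 🇳🇱 is just two codepoints
```

So 250 country flags cost only 26 codepoints, and thousands of profession/skin-tone/gender variants are spelled out of a small shared pool.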

What I'm trying to say is that the (comparatively) very low number of emoji has no bearing on this, and going off on a tangent about it is misplaced.


I have no problem with 2/3 of the codepoints being taken up by 3 languages. Right now we (rightly) bend over backwards to accommodate handicapped users, often tripling or quadrupling our QA effort. CJK users are much more common than handicapped users, so the benefit-vs-cost ratio is even greater for CJK users than for handicapped users.


> I have no problem with 2/3 of the codepoints being taken up by 3 languages.

I have no problem with this either, at least not in principle. But historically it was literally impossible. Someone thought of a clever hack that seemed like a good idea at the time, but it turns out it doesn't work all that well after all (at least according to some – opinions differ and I can't really judge myself), and now we're stuck with it, and fixing it isn't so easy. I don't know if anyone has made concrete proposals to fix this, but if it were easy it probably would have been done already. Sometimes sticking with a suboptimal "legacy" solution is better than replacing it with a new, better one, due to the friction and issues involved.


25 years ago computers and networks were different. Today text is <0.1% of traffic compared with video, and you have billions of bytes of RAM in every pocket computer. So yes, you can pile on more emoji and no one will be bothered.

Unicode might never have been adopted at all if it had had an even larger set for CJK and made all CJK texts 1.5–2x larger than in the Han-unified version, due to longer encodings. Also: UTF-8 did not exist yet, and most systems treated text as arrays of fixed-length characters.
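The size concern above can be made concrete with Python's built-in codecs. The sample phrase is an arbitrary Japanese string chosen for illustration (an assumption, not from the thread); the 1.5x ratio is what UTF-8 actually costs for BMP CJK text versus the old fixed 16-bit model.

```python
# 漢字の統合 ("unification of kanji") — five BMP codepoints.
text = "\u6F22\u5B57\u306E\u7D71\u5408"

utf16 = text.encode("utf-16-be")  # 2 bytes/codepoint: the original 16-bit model
utf8 = text.encode("utf-8")       # 3 bytes/codepoint for CJK in the BMP

print(len(utf16), len(utf8))      # 10 15 -> UTF-8 is already 1.5x for CJK
```

Had CJK been pushed beyond the 16-bit space back then, every character would have needed even longer encodings, which is exactly the adoption risk the comment describes.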


UTF-8 existed 25 years ago, but only in Plan 9.



