So, if I have to display user-entered text (usernames, posts, comments, messages, form data, etc), and I want to do The Right Thing™:
- I cannot rely on user locale, because it might be set to something generic like English, or the user may be bi-lingual.
- I cannot rely on location, because the user may be traveling to a different CJK region, or somewhere else altogether.
- I cannot set a single lang: attribute for the whole page because it'll be wrong for the other two languages.
- The string alone is not sufficient to identify the language because you can write valid sentences in different CJK languages with the same codepoints.
- I cannot have a per-user language setting, because users may be bi-lingual.
What does that leave me? A dropdown list "C/J/K/Other" beside every single text field?
I'm chucking this on my pile of examples of software development being hopelessly broken by design, along with "unix time is non-monotonic and discontinuous at random" (hint: what's the unix time exactly 1e8 seconds, ~3 years, from now? Answer: it's up to the astronomers[1]!).
Edit: actually, even the dropdown list is insufficient because it only allows one language per string! How is a Japanese user asking for help learning Chinese supposed to write?
Trying to do something smart is usually the wrong approach and leads to users tangling themselves in invisible state that they do not understand and can't change. The best thing would have been for Unicode to not do Han unification. The second best would be to provide alternate glyphs now. The third best is to display the characters either in the language they're written in, when you know that for sure (usually for text that you wrote yourself), or in the user's most likely locale, when you don't. For locale I would go down this list of traits and pick the first that matches:
1) The language setting on your website if you have one and have translated it to C/J/K. You may use different TLDs for the different languages and discern that way, too.
2) The list of preferred languages from the browser. This is usually unreliable, but if someone has gone to the trouble of inputting "english=1;japanese=.9;chinese=.8", then it's a fair bet they want Japanese Kanji usually and will be understanding if you use them in place of Chinese Han characters.
3) The country to which the user's IP belongs. The least ideal option, but if you're in Korea and reading a random string of Hanzi, you probably expect them to look like Hanzi.
You will show the wrong characters to some users, but the behaviour is understandable. "Oh, the site is showing me Korean characters because I'm in Korea." is a lot easier to grasp than "The site is showing me Chinese characters because I clicked a dropdown one time that I forgot about and now I have no idea why my name is written wrong!"
You can argue about point 2) that some users might set their language preferences and forget about it, but so far I have never observed a user who doesn't know about them messing with the setting.
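That fallback chain can be sketched in code. This is just my own illustration of the list above — every function and parameter name is hypothetical, and the Accept-Language handling is simplified to listed order rather than proper q-value sorting:

```typescript
// Sketch: pick a BCP 47 tag for the page's lang attribute.
// 1) explicit site language setting, 2) browser Accept-Language,
// 3) country from IP geolocation, else a neutral default.
function pickCjkLocale(
  siteLang: string | null,
  acceptLanguage: string | null,
  geoCountry: string | null,
): string {
  const cjk = ["zh", "ja", "ko"];
  if (siteLang && cjk.includes(siteLang.slice(0, 2))) return siteLang;
  if (acceptLanguage) {
    // e.g. "en;q=1,ja;q=0.9,zh;q=0.8" -> first CJK entry in listed order
    for (const part of acceptLanguage.split(",")) {
      const tag = part.trim().split(";")[0];
      if (cjk.includes(tag.slice(0, 2))) return tag;
    }
  }
  const byCountry: Record<string, string> = {
    CN: "zh-CN", TW: "zh-TW", HK: "zh-HK", JP: "ja", KR: "ko",
  };
  if (geoCountry && byCountry[geoCountry]) return byCountry[geoCountry];
  return "en"; // no CJK signal at all
}
```

The point of keeping it this dumb is exactly the argument above: each step is a trait the user can see and reason about, so the wrong answer is at least an explainable wrong answer.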
Since many websites do not convey the language properly, some try to fix that with heuristics, but especially for short texts (like messages) those can easily fail hard, leading to completely wrong pronunciation.
Mail, for example, tried to fix that by allowing you to annotate the language of UTF-8 text embedded in mail headers and content. Ironically, while that mechanism works for display names, it doesn't work for email addresses themselves. And I'm not sure any mail program uses it.
My thought whilst reading the article was that Han unification would actually help screen readers. IIUC the meaning of the glyph is the same across all the languages, so the screen reader will get the correct meaning and can present it according to local settings. The problem with the European languages is that the different characters (letter variations, accent variations) can change the meaning of the word they're part of.
The same Han character can have wildly different pronunciations even in a single language (being a logogram, they represent a word and not a sound). KS X 1001, the primary Korean character set, even duplicated same characters according to their readings so that they can be almost [1] correctly converted back to Hangul. In practice they didn't work well though, and Unicode assigns all but one duplicate characters into the compatibility region.
[1] These readings didn't take account for systematic variations like the initial sound law (두음법칙, for example 이 vs. 리 at the beginning of words).
You can't split CJK sentences into individual characters, inspect them one by one, and decipher their exact meaning. If you present Chinese writing to a Japanese speaker, they'd only see complete gibberish made of characters they may or may not recognize. It works the other way around too.
On top of that, kanji aren't the only characters Japanese folks use. They also use hiragana and katakana, which are phonetic symbols and totally unrecognizable to non-Japanese speakers.
For what it's worth, mainstream OSes also handle these cases poorly, which eliminates the trickiest ones through sheer inconvenience.
As far as I know text fields only have one font applied, so entering both languages in a single field won't be optimal. And if you're not doing anything fancy with your fields, they will all use the same font as well.
So even at the input level, the user switching languages will already be mildly screwed, and the best solution would probably be to change pages for each language.
> I cannot set a single lang: attribute for the whole page because it'll be wrong for the other two languages.
> What does that leave me? A dropdown list "C/J/K/Other" besides every single text field?
If you need to control the language display at the granularity of a single text field (rather than a user or a page or a website), then yes, you need a tool that operates on a single text field. This shouldn't be too surprising.
Surely you can get away with one of your other solutions, though. In particular, you can guess the language and be right most of the time.
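For what it's worth, even a crude guess based on script ranges gets you surprisingly far, because kana can only be Japanese and hangul can only be Korean; only Han-only strings stay genuinely ambiguous. A sketch — the function name and the "default Han-only text to Chinese" policy are my own assumptions, not a real detector:

```typescript
// Guess a CJK language from the scripts present in a string.
function guessCjkLanguage(text: string): "ja" | "ko" | "zh" | "unknown" {
  let sawHan = false;
  for (const ch of text) {
    const cp = ch.codePointAt(0)!;
    if (cp >= 0x3040 && cp <= 0x30ff) return "ja"; // hiragana/katakana
    if (cp >= 0xac00 && cp <= 0xd7a3) return "ko"; // hangul syllables
    if (cp >= 0x4e00 && cp <= 0x9fff) sawHan = true; // CJK Unified Ideographs
  }
  return sawHan ? "zh" : "unknown"; // Han-only: ambiguous by construction
}
```

The Han-only fallback is exactly the case the whole thread is about: no amount of inspection of the codepoints themselves can resolve it.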
It is surprising to me, because I can mix Czech, English and German just fine on a single screen, even in a single text field. From that perspective having to say "this whole text is in language X" seems backwards.
Suppose you're blind and use a screen reader. How does German sound when pronounced by an English text-to-speech engine, or vice versa? Which should be used to read out that "single text field" that contains a mix of languages?
When reading mixed-language text we can't know if the "Tee" in the middle of an English sentence is German for "tea" or English slang for "t-shirt". We have to use imperfect context clues, so screen readers will have to do the same; it should be doable with today's technology. And if the software doesn't get it right every time, that's fine, because neither do humans.
> And if the software doesn't get it right every time, that's fine
It really is not: a screen reader using the wrong language is way, way worse than a human with the wrong pronunciation. The first time VoiceOver decided to switch to Russian in the middle of English text, I thought the OS had crashed; the mangling is quite extreme.
OK, yeah. If Unicode hadn't gone with Han unification (couldn't they know we'd have space to put zillions of characters in a font just a few years later?) you'd have the same flexibility in mixing C/J/K.
> (couldn't they know we'd have space to put zillions of characters in a font just a few years later?)
Representation matters. There should have been plenty of experts who understood that Han unification was problematic. It seems they were not in a position to do anything about it.
Part of the problem back then is they really really wanted to fit it all in 16 bits, and given there are more hanzi than that to start with this was a bit of an issue (back in the 80s the largest dictionary listed >54000 characters, now it’s >100000, and Japanese and Korean total about 50000 each).
A few years later they increased the character space by 5 bits and it wasn’t an issue anymore, but the original legacy of han unification remains.
Yes, and the people who prioritized the 16-bit size were OK with Japanese looking ridiculous in order to achieve their goal. From the article:
> if the equivalent symptom was happening with English text, ιҭ wѳuld bє lѳѳκιng sѳmєҭЋιng lικє ҭЋιs.
From a page linked by the article, "I Can Text You A Pile of Poo, But I Can’t Write My Name":
> To help English readers understand the absurdity of this premise, consider that the Latin alphabet (used by English) and the Cyrillic alphabet (used by Russian) are both derived from Greek. No native English speaker would ever think to try “Greco Unification” and consolidate the English, Russian, German, Swedish, Greek, and other European languages’ alphabets into a single alphabet.
If there had been a proposal to sacrifice English in order to cram Unicode into a certain code space size, is there any question that the people on the panel whose first language was English would have quashed it?
But people who might have spoken out for Japanese, or for Indian languages, etc. were seemingly not in a position to do anything.
Yes, of course — but the point is that native English speakers would have blocked such an absurd proposal no matter what.
Han Unification is just as absurd and just as unacceptable to a native Japanese speaker as Greco Unification would be to a native English speaker. But Han Unification went through because native Japanese speakers were not in a position to block it. Representation matters.
Were they? Or were they worried that their characters would be the ones given 4 bytes, and unifying means not squabbling with China and Korea over who goes first? It's not like Japan was not a technical center at the time. Shift-JIS is still around.
(EDIT: Considering the historical antipathy between Japan and China it's preposterous that the Japanese would unilaterally sacrifice their culture in such a way.)
It makes sense given constraints of the time to try to fit the character set into 16 bits, but Unicode has variation selectors, why not use those for the ambiguous characters? They could have easily done something like HAN_CHARACTER_X + VARIANT_KANJI. It would take up an extra 16 bits, but given the density of CJK text relative to Latin text that may not be a big issue.
So such a variation selector would have to be used in front of every individual character? Or could it be applied across a substring, similar to how the text direction is flipped?
Either way this seems to be the most painless "fix" for this given the current situation. Setting a language for the whole text fails as soon as the languages are mixed.
> unix time is non-monotonic and discontinuous at random
Well, depends on your definition of unix time; if you use time() as the definition then it is actually monotonic because the integral part only repeats on leap seconds?
>How is a Japanese user asking for help learning Chinese supposed to write?
They write with Japanese kanji. I imagine minuscule differences in kanji forms are negligible compared to general unfamiliarity with the foreign language.
I'm a native CJK user myself and well aware of this phenomenon, but honestly it's not really that bad in most cases, for the following reasons:
1. Websites that are in Japanese are likely tagged with lang=ja already, so they will display fine. Unfortunately, this practice seems to be less followed by Chinese sites. I checked a few top sites: qq.com does have lang=zh-cn, while baidu.com and sina.com.cn don't.
2. The majority of UI elements in an OS will prioritize the display language you set when choosing variants. This means that if users are reading content in Japanese while also using a Japanese UI, the glyphs will be correct. Of course, this causes problems if a Japanese user is reading Chinese or vice versa, but such scenarios are in the minority.
Another scenario, which I think is more common, is when someone is using a Latin-language UI. For example, lots of my (Chinese/Japanese) friends are using English UI while reading Chinese/Japanese a lot. The OS in this case will default to one variant (I believe Apple by default would choose Japanese) and therefore display another language's glyphs wrong (side note: for web pages, desktop browsers often have their own font/glyph fallback logic above the OS one).
3. Most people are just not sensitive to such things. I've pointed it out to lots of people (when, due to their settings, some glyphs were displayed wrong, like 门), and they couldn't care less.
Also, there is no simple "fix" if you have multi-language content. Without manually assigning a <lang> tag to every single string, you can't display both Japanese and Chinese correctly at the same time, and it isn't worth the hassle for just a few phrases in text. A good example is Wikipedia: they have templates for all kinds of languages so you can display them correctly even if it's just one Japanese word on, say, English Wikipedia. And Wiki editors do use them all the time!
What you're saying is true (although I'm not sure about 3, all the Japanese people I've talked to are annoyed by that), but it really sucks that we're dealing with problems that were supposed to be fixed by Unicode.
Han unification has been a huge mistake, made to save a few thousand characters, and now we keep piling on more and more stupid emoji.
> Han unification has been a huge mistake, made to save a few thousand
> characters, and now we keep piling on more and more stupid emoji.
This is my exact problem with Unicode. I'm very grateful for the efforts that they have made in the past, but the change from "spare valuable codepoints at the expense of causing ambiguity in text" to "assign a new codepoint to every cartoon permutation of intangible nouns" is infuriating.
Han unification and Emojis are from a different time. We're talking about a decision made about 30 years ago in the early 90s; Unicode was 16 bits (65k total codepoints) at the time.
The original CJK Unified Ideographs block from 1992 consists of 21k codepoints. It was impossible to do that four times since 4×21k is more than 65k, and we're going to need space for some other languages as well. Why not make Unicode larger? Well, size was a real concern back then (still is, to some degree, but less so).
Since then Unicode has expanded and now we have slightly under 1 million codepoints. Han character blocks have extended to about 93k codepoints today, and 4 times ~93k codepoints is actually feasible. But now you run into compatibility issues: you can't remove all the old Han unification stuff (it will break text, big no-no), so you need to re-define it all anew. Is that better? How about mixing "old" Han unified codepoints with new Japanese or Chinese stuff? Will it really improve things or just cause endless confusion (see: combining characters)?
For scale, all of Unicode currently defines about 145k codepoints; so even with Han unification we're talking about two thirds being taken up by just these three languages.
In comparison there are currently about 3,000 emojis, although the number of codepoints is much less since many codepoints are re-used (e.g. "firefighter" is "person + firetruck", flags use the country code, etc.). In a quick check it looks like there are about 1,000 to 1,500 codepoints reserved for emojis. In comparison, this is nothing.
What I'm trying to say is that the (comparatively) very low number of emojis has absolutely no bearing on this and that going off on a tangent about it is very misplaced.
I have no problem with 2/3 of the codepoints being taken up by 3 languages. Right now we (rightly) bend over backwards to accommodate handicapped users, often tripling or quadrupling our QA. CJK users are much more common than handicapped users, so the benefit-vs-cost ratio is even greater for CJK users than for handicapped users.
> I have no problem with 2/3 of the codepoints being taken up by 3 languages.
I have no problem with this either, at least not in principle. But historically it was literally impossible. Someone thought of a clever hack that seemed like a good idea at the time, but it turns out it doesn't work all that great after all (at least according to some – opinions seem to differ and I can't really judge myself), and now you're stuck with it, and fixing it isn't so easy. I don't know if people have made concrete proposals for fixing this, but if it were easy it probably would have been done already. Sometimes sticking with a suboptimal "legacy" solution is better than replacing it with a new, better solution, due to the friction and issues involved.
25 years ago computers and networks were different. Today text is a <0.1% of traffic compared with video and you have billions of bytes of RAM in every pocket computer. So yes, you can pile on more emojis and no one would be bothered.
Unicode may never have been adopted at all if it had an even larger set for CJK and made all CJK texts 1.5-2x larger than in the Han-unified version due to longer encodings. Also: UTF-8 did not exist yet, and most systems treated text as arrays of fixed-length characters.
Is there a reason unicode doesn't have such a builtin lang tag? Similar to right-to-left and left-to-right it could help in displaying differently otherwise identical text.
It could be stored in-band with the text with little changes to existing systems. The only change would be on the presentation layer, and if the tag were to be a non printable character, it would be backward compatible. An input device could implicitly tag input texts depending on the default lang.
You need some form of sanitization, but you need it for right-to-left and left-to-right already.
The reasoning is that anything related to styling is out-of-scope for Unicode (except where needed for round-trip compatibility with other character sets), or else people will also want tags for bold, italic, monospace, or (expressed semantically) for emphasis, code, etc. That’s what markup languages like HTML are for.
Actually, this exists already – there's U+E0001 (LANGUAGE TAG) and U+E007F (CANCEL TAG) and you can put a language code between those, e.g. "\u{E0001}ja-JP\u{E007F}".
Its use is also deprecated and discouraged. According to [1] it's often not needed, and [2] states that it puts a lot of burden on implementations and is best done at a higher level such as HTTP, HTML, etc.
I have no opinion on [1] as I don't speak these languages, but I do know I really hate working with these "invisible characters" in Unicode both as a user and developer. Copy an extra invisible LTR thingy or display variant codepoint and stuff can look and behave different, and it may not at all be obvious what the hell is going on (especially for those without a technical background).
> there's U+E0001 (LANGUAGE TAG) and U+E007F (CANCEL TAG) and you can put a language code between those, e.g. "\u{E0001}ja-JP\u{E007F}".
"ja-JP" part is also written in tag characters, so it's actually E0001 E006A E0061 E002D E004A E0050 E007F and doesn't render even in unsupported environments.
I agree that for CJK-natives it's not such a big deal probably (unless they live their life in more than one of those languages). For people like me who primarily use their computer in English but also do some stuff in Japanese every now and then it's very frustrating. Of course at this point I know what's going on when 直 or whatever looks wrong, but it's still frustrating.
OSes and browsers having their own logic for it actually makes things _worse_ in some cases. Windows is especially bad (different types of UI elements care about different settings or don't care at all, so good luck having apps render correctly if you don't change your entire OS locale), and Chrome is pretty bad too, again especially on Windows. Overall macOS/iOS and Safari do the best job by far.
The failed attempt at Han Unification[1] is the worst decision the Unicode people have ever made.
> The failed attempt at Han Unification[1] is the worst decision the Unicode people have ever made.
At first I nodded my head in agreement, but then I decided I still think the failure to include separate code points for "lower case Turkish dotted I" and "upper case Turkish dotless I" is worse.
You can't have 'ı' ≡ lc( uc 'ı' ) unless you already know you are processing Turkish ... completely unnecessary complication.
Turkish I unification, at least, wasn't a decision the Unicode people made, they inherited the mistake from earlier encodings. Given that those already existed, the alternative to having broken casefolding was, essentially, break all mixed Turkish documents transcoded from cp857 containing both "I" and "i" in non-Turkish functional directives, i.e. you'd necessarily break things like HTML documents without consistent tag casing.
I am having a hard time seeing how having the option of distinct codepoints would break anything.
Consider İ/i, where it is _possible_ to do lossless case conversion:
> lc( 'İ' ) becomes i followed by COMBINING DOT ABOVE, which means uc(lc 'İ') becomes LATIN CAPITAL LETTER I WITH DOT ABOVE as a by-product of the fact that perl6 deals in graphemes[1]
say 'İ' eq 'İ'.lc.uc.lc.uc;
True
If an extra codepoints existed for Turkish dotted I, such contortions would not be necessary and this would have had no implications for existing working code at the time (nothing says those codepoints must be used, they just give smart software options).
Now, there is nothing one can do with I/ı that will make 'ı' eq 'ı'.uc.lc.uc.lc true without extra information. If codepoints existed, then such special casing and carrying around extra information would not have been necessary.
Also note:
# The letter Ö is not considered to be a variant of the letter O,
# and is a separate letter in the Swedish alphabet. The former
# character is, however, the accepted alternative in contexts where
# Ö cannot be used. Earlier practice substituted OE, which is no
# longer recommended but will still be encountered.
#
U+00F6 # LATIN SMALL LETTER O WITH DIAERESIS
It should not have been too hard to say "The letter İ is not considered to be a variant of the letter I" and vice versa for the lower case versions.
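The same asymmetry the Perl 6 snippets above show can be demonstrated in any runtime with locale-aware case mapping; here is a sketch in TypeScript, assuming Node with full ICU so the "tr" locale data is available:

```typescript
// Default (root-locale) case mapping loses the Turkish dotless ı:
// uppercasing gives plain I, and lowercasing that gives dotted i.
const dotless = "\u0131"; // ı LATIN SMALL LETTER DOTLESS I
const lost = dotless.toUpperCase().toLowerCase(); // "i", not "ı"

// Locale-aware mapping keeps the Turkish pairings I <-> ı and İ <-> i,
// but only because we supplied the extra out-of-band information "tr".
const kept = dotless.toLocaleUpperCase("tr").toLocaleLowerCase("tr"); // "ı"

// And İ under default rules becomes "i" + COMBINING DOT ABOVE — the
// two-code-point by-product described above.
const decomposed = "\u0130".toLowerCase(); // two code points long
```

This is exactly the complaint: the round trip only works once you already know the text is Turkish.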
IMO the most jarring issue that comes from Han unification without a proper language setting isn't the glyph variants, since you actually encounter these variants and more in everyday life (albeit more rarely than your national standard ones). The more jarring issue is when your software selects a font meant for the wrong language, and the font for the correct language as a fallback. Then you may encounter serious style issues where your text is pockmarked by glyphs for another language; you have a similar phenomenon with European languages using the Latin alphabet with unconventional accents. But note that it takes way more effort to ask a CJK foundry to cover all codepoints for all languages than to ask a Latin font designer to cover all languages, so you would be hard-pressed to find fonts that actually do that.
Without Han unification, this wouldn't really be a problem, but Han unification is to a large extent the same philosophy pursued with the unification of Latin scripts (and other scripts).
> 3. Most of people are just not sensitive to such thing. I pointed it out to lots of people (when due to their setting, some glyphs are displayed wrong, like 门), and they can't care less.
I think it's because most people don't deal with it in large amounts. I heard a lot more complaints from people using Android phones that didn't have JP fonts by default. By the third or fourth page they started to care, and once they noticed it the frustration just stacked (it's just a matter of adding fonts, so not a big deal).
Otherwise writings are flexible enough for small variants to not be triggering (I mean, people can already read calligraphy...)
If you buy a Xiaomi phone, for instance, anywhere outside Japan and open a Japanese website, it will be displayed with Chinese glyphs. It depends on the phone, but there will be setup needed: sometimes just switching the whole phone language will do the trick, sometimes you need to add the right fonts yourself.
My original motive to write this page came most from issues in video games, where text is often displayed using custom routines and built-in rules in the OS & browser can't be of help. The issue crops up most often in indie games, but they can be seen even in high-profile high-budget games like Half-Life: Alyx or Resident Evil 4 VR.
My experience with web crawling is that the use of the lang-tag seems inconsistent at best. To make matters worse, sometimes content is straight up mislabeled, although Japanese sites often helpfully declare that they are using the Shift_JIS charset rather than UTF-8, which is at least somewhat helpful in figuring out that it is Japanese.
Out of curiosity, what do you do when you put a quote in Chinese from a Chinese author inline in Japanese text? Are you expected to write it using the Chinese forms of the characters, or do you write them using the Japanese forms?
Edit: I mean what is the expected (grammatically correct) way to do it if you were writing with pen on paper.
Unfortunately this (and linked) article only represents Japanese issues. If you blindly apply these suggestions Chinese or Korean users may have issues. I'll list Korean issues below primarily because I'm Korean, but you may want to interview actual CJK users (one of each, not a single user) for testing.
> Line breaking rules
This should link to W3C Requirements for CJK Text Layout [1]. The Wikipedia article alone doesn't fully describe the complexity of CJK typography.
CJK languages have in common that they all have classes of punctuation that can't be separated by a newline. But there is one more thing to consider for Korean: both word-based breaking and character-based breaking are possible depending on the context. The general rule is to use word-based breaking for larger texts and character-based breaking for smaller texts, but there is no clear threshold, so you really want to consult Korean users for testing.
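On the web, this distinction maps onto the CSS `word-break` property; a minimal sketch (the class names are illustrative, and the actual threshold between the two should still come from testing with Korean users):

```css
/* Korean word-based breaking: break lines only at spaces.
   Typically used for larger body text. */
.ko-body  { word-break: keep-all; }

/* Korean character-based breaking: allow breaks between syllable
   blocks, useful for narrow containers like table cells. */
.ko-small { word-break: normal; }
```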
> Messaging Apps: Do not directly hook to the Enter key to submit messages
This advice is also problematic. In pretty much all Japanese and most Chinese IMEs, input goes through a candidate window, so pressing Enter should not submit messages; but in some Chinese and virtually all Korean IMEs there is no automatic candidate window, and pressing Enter should submit messages.
In the ideal world detecting a newline as suggested by the article should have solved this issue, but that got complicated by clueless pan-CJK IME implementations. They generally assume candidate windows even for Korean, so they do not commit texts on Enter and that's very inconvenient for Korean users. Therefore it is rather recommended to detect a newline by default, but also have an option to submit messages on Enter.
Was notified from someone else about the isComposing attribute -- https://developer.mozilla.org/en-US/docs/Web/API/KeyboardEve...
At least for web stuff, do you think checking for this before treating the Enter key as Submit would work in both IMEs with and without input buffers?
The problem is that those clueless IMEs do intercept the Enter key contrary to user expectation, so I think you can't distinguish those IMEs from Chinese and Japanese IMEs that should intercept the Enter key as expected.
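For well-behaved IMEs, the `isComposing` check (plus the legacy `keyCode === 229` value some browsers report while an IME holds the key) does work; as the comment above notes, it cannot detect the misbehaving pan-CJK IMEs, which is why a user-visible submit-on-Enter option is still needed. A sketch of the decision logic, with a hypothetical helper name:

```typescript
// Decide whether an Enter keydown should submit the message, deferring
// to the IME while text is being composed. The structural type lets the
// logic be tested without a real DOM KeyboardEvent.
function shouldSubmitOnEnter(e: {
  key: string;
  isComposing: boolean;
  keyCode?: number;
}): boolean {
  if (e.key !== "Enter") return false;
  if (e.isComposing || e.keyCode === 229) return false; // IME is consuming it
  return true;
}

// In a browser you would wire it up roughly like this:
// textarea.addEventListener("keydown", (e) => {
//   if (shouldSubmitOnEnter(e)) { e.preventDefault(); submit(); }
// });
```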
Wasn't the whole point of Unicode to have a single encoding that could represent all languages unambiguously, so that you don't need any meta-information to display a string? Is there a reason why they chose to represent characters that are obviously different with the same code point? Everyone would find it outrageous if they decided to have a single character for the Russian м and the English m just because they have the same Greek origin...
The reason why is that Unicode was originally 16-bit, and there was no way they could fit everything into 16 bits without CJK unification. Of course later it turned out there was no way they could fit everything into 16 bits anyway, and so they were forced to expand it, and so we now both have a larger Unicode (with all the messes that's caused) but also still have CJK unification...
In principle there is a designated "disunification" procedure for when it's desirable. More accurately speaking, each CJK character is thought to represent not a single glyph or a few glyphs listed in the code chart but rather a glyphic subset, and disunification splits that set into partitions. But this is generally applied to a few selected characters and only when it's safe to do so. Massive disunification was to my knowledge never suggested or proposed, and it would surely prompt large-scale disruption for CJK users (say, what about existing texts?).
So it's possible to modify the skin tone of emoji but impossible to disunify CJK characters? There are RTL modifiers for Arabic languages, it's impossible for CJK?
It shouldn't be harder than existing unicode handling.
Everything boils down to the interoperability and compatibility.
Emoji were added because Apple and Google had to deal with (then-)Japanese emails, and skin tones were not specified. It was implementations that imposed certain skin tones (which do not even match the original Japanese emoji), and as a result Unicode had to introduce a mechanism to change skin tones and mandate that the default emoji without that mechanism be neutral.
RTL "modifiers" are actually formatting characters closely tied with the Unicode Bidirectional Algorithm [1]. Until then texts with both RTL and LTR fragments were handled incoherently, for example legacy character sets were still struggling with logical vs. visual order issues. So they are indeed Unicode inventions, but necessary ones that do not alter existing texts.
For CJK characters Unicode now provides ideographic variation selectors that select the exact glyph (or more accurately, a restricted glyphic subset of the base character). They do not disunify characters but they do provide a strong hint to display those characters in a specified way. In this way they do not cause an additional issue to existing Unicode systems (as they should already do normalization and collation in the Unicode way). The disunification by comparison would almost instantly break existing texts.
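For illustration, an ideographic variation sequence is just the base character followed by a variation selector from the supplementary block (U+E0100..U+E01EF). The pair below uses 葛 (U+845B), a commonly cited character with registered Adobe-Japan1 sequences; whether the requested glyph actually appears depends entirely on font support:

```typescript
// Base character plus Variation Selector 17 (U+E0100).
const base = "\u{845B}";        // 葛
const ivs = base + "\u{E0100}"; // same character, specific glyph requested

// The sequence is two code points, but a reader (and, ideally, search
// and collation) still treats it as one character.
const codePoints = [...ivs].map((c) => c.codePointAt(0)!);
```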
The default seems to be yellow though, like original Lego minifigs (which I assume was one of the original plastic brick colors, and a nod to the bright yellow makeup that was popular at the time in Denmark.)
It's a really complicated issue, because whether the character is different or not is debatable. In English writing, we consider a serif A to be the same character as a sans-serif A, even though the glyph is obviously different, and neither do we distinguish between a "French" A and a "German" A.
So what do we do with 国 and 國? The first of those is always used in simplified Chinese and usually in Japanese, while the second is used in traditional Chinese and sometimes in Japanese (eg. names). Is this one, two or three characters?
It’s worth noting this kind of distinction did used to exist in the Latin alphabet as well. For much of the 19th and 20th centuries, German letters were different, as part of a debate about whether German text should be written in blackletter or Latin/Antiqua/what-other-Europeans-use script.
We still have these issues in e.g. Hebrew and Yiddish, Arabic and Persian, and tons of adapted Cyrillic scripts from ex-Soviet states. Not to mention Northern-European accented vowels, and cedilla letters such as Ç.
I'm personally of the belief that the accented and cedilla characters should be exclusively stored as combining character pairs, even if modern keyboard mappings require only a single keypress. My own language stores every character as two bytes (at a minimum), so the storage aspect is a solved problem.
> Everyone would find it outrageous if they decided to have a single character for the russian м and the english m just because they have the same greek origin...
Did you know that some languages distinguish the dotless Iı and the dotted İi? English mixes them (Ii), and Unicode needs to know the language before it can upper/lowercase correctly, because it can't tell an English I apart from a Turkish dotless I.
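The Turkish casing problem is easy to demonstrate in a few lines of Python; the `TR_UPPER` table below is my own illustration of one possible workaround, not a standard API:

```python
# Python's str.upper() applies the locale-independent Unicode case mapping,
# so "i" always becomes "I" -- wrong for Turkish, where dotted i uppercases
# to İ (U+0130) and dotless ı uppercases to I.
assert "i".upper() == "I"

# A Turkish-aware uppercase needs explicit handling, e.g. a translation
# table applied before the generic mapping (illustrative, not a stdlib API):
TR_UPPER = str.maketrans({"i": "İ", "ı": "I"})

def turkish_upper(s):
    return s.translate(TR_UPPER).upper()

print(turkish_upper("istanbul"))  # → İSTANBUL
```

The point is that the correct result depends on information (the language) that the string itself does not carry, which is the same shape of problem as the Han-unification glyph issue.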
Even in Latin script languages you have issues if you don't specify a lang tag. e.g. a font's 'fi' ligature may omit the tittle on the 'i', but it is necessary in Turkish. Or you are using a font without coverage for French and your browser renders œ in another font.
The CJK variant issues you get without a lang tag are indeed present in Latin script too, but to a smaller extent.
This is the reason why Adobe PDF doesn't rely on Unicode. Adobe products have had a huge presence in Japan since the 90s, and they had to appeal to the printing industry, which is very particular about this kind of issue. So they ended up using a separate encoding for every language. Today, CJK text in PDF is encoded in Adobe-GB1 (mainland China), Adobe-CNS1 (Hong Kong), Adobe-Japan1, and Adobe-Korea1 respectively. Not the cleanest way, but it gets the job done.
Thanks for the pointer, that's pretty interesting.
Looking at their doc [0], it seems they used Adobe-Japan1 to wrap a much wider set of characters than any single encoding standard, including ligatures, vintage encodings, etc.
It seems like a pretty big piece of work, and it kinda fits the image of PDF handling being such a monumental beast.
Adobe gets a lot of stick for its subscription model and malware-like Creative Cloud. But they do spend a huge amount of resources on CJK fonts, layout, and encoding.
As a Japanese learner, this has been a massive disappointment in Unicode for me, and a pain in my ass. It has sort of turned into a challenge for me, trying to get the characters to display consistently on all of my devices. Believe it or not, even with Pango configured to always show the Japanese variants, and fontconfig set to always prefer the JP font, some applications like Firefox find a way to mess it up.
Can't blame them much, though; Han unification is a huge mess, designed by someone I can only posit to be entirely brainless. There aren't that many characters affected, so you aren't even saving any considerable number of codepoints. It's just West-centrism and a lack of knowledge of the subject.
The Han unification was done because, at the time, they hoped that Unicode characters would be limited to 16 bits.
Separate sets of Han characters cannot be encoded in a 16-bit space, but they could have been easily encoded in the current, much larger code space.
Nevertheless, I have never found this to be a problem in practice, because I have always taken care to have good separate typefaces for Japanese, Traditional Chinese and Simplified Chinese.
In documents that I create or modify, I apply styles with the appropriate typeface.
The only possible problems are with Web pages, but the good browsers allow you to configure typefaces for each language and I always configure the correct typefaces.
If the Web page does not specify the language correctly, it might be displayed wrongly, but this is only one of the many stupid things a Web page designer can do that make a page look ugly when rendered on other computers.
I agree that it is annoying that I must not forget to configure typefaces per language whenever I install a new browser, and that in Chrome you must also install the "Advanced Font Settings" extension before it even becomes possible to choose e.g. a Japanese font.
Avoiding such configuration work, when you prefer better-looking typefaces over the standard system defaults, would require a standardized way to tell applications which typefaces are associated with which languages, e.g. via environment variables or standard per-language locations for font files.
What makes this worse is that some Cyrillic characters like the Cyrillic "а" have a different code point from the Latin "a" despite looking _exactly_ identical. So unicode isn't even consistent with their unification logic.
I believe Cyrillic а and Latin a are different because there already existed legacy encodings in which they were considered different characters, so Unicode kept the distinction for backward compatibility.
Meanwhile, there were no existing legacy encodings that allowed writing Chinese and Japanese at the same time, so there was nothing to keep compatibility with.
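The two homoglyphs really are distinct code points despite being visually identical, which is easy to verify:

```python
import unicodedata

latin = "a"       # U+0061
cyrillic = "а"    # U+0430
print(unicodedata.name(latin))     # LATIN SMALL LETTER A
print(unicodedata.name(cyrillic))  # CYRILLIC SMALL LETTER A
assert latin != cyrillic           # look the same, compare unequal
```

This is also why homoglyph spoofing in domain names works: the strings are distinct to every string-comparison function but indistinguishable to the eye.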
> Japanese text written in incorrect glyph sets will stand out similarly to any native speaker of Japanese, and will give off a connotation that whoever developed this app does not care about this (often large) subset of the global user population.
More likely they'll think the content was written by a non-native Japanese speaker, judge whether that makes you trustworthy or not (based on personal experience or stereotypes or prejudice, probably a bit of all three (we're all human)) and then not buy from you. A good example would be Amazon listings in Japanese that Japanese people can tell were almost certainly written by someone Chinese, and then decide not to buy.
If you want the cash, get a proper translation. Ironically, Japan is filled to the brim with incredibly poor English and abounds with stories of native English speakers' translations and corrections being disregarded because "it doesn't sound right"… to someone who can't string a legible English sentence together.
Am I right in assuming fixing this in player names etc makes Chinese look wrong? It’s an easier problem if you know the whole page is Japanese, but how about things like game lobbies, where every username is in a different language?
Either store the locale used when a user enters their name and then use it to mark up the text whenever you display the username, or simply use the system default, so Japanese users will see Chinese names with Japanese glyphs and Chinese users will see Japanese names with Chinese glyphs. Other users randomly get whatever.
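A minimal sketch of the first approach, assuming you captured a language tag (e.g. "ja", "zh-Hans") at signup; the function name and shape are my own illustration, not a known API:

```python
import html

def render_username(name, lang=None):
    """Emit a username with a lang attribute so the browser can pick
    the right CJK glyphs. Falls back to the page default when the
    stored language tag is unknown."""
    escaped = html.escape(name)
    if lang:  # e.g. "ja", "zh-Hans", "zh-Hant", "ko"
        return f'<span lang="{html.escape(lang)}">{escaped}</span>'
    return f"<span>{escaped}</span>"

print(render_username("直", "ja"))  # → <span lang="ja">直</span>
```

The key design point is that the language tag has to be stored next to the string forever; the codepoints alone can never recover it.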
The locale might be set to something completely different. A lot of programmers run their machines in English and not in their native language. One could use location to detect which variant to use, but that too wouldn't work for, say, Chinese speakers in Japan. In an ideal world, we would use the locale of the input source (if the user sets their keyboard to Traditional Chinese, we should use it for that fragment of text). However, operating systems and browsers don't provide an input-source locale API.
Good point. Would it be feasible to check the locale used at profile creation time, then store that locale alongside the username if it contains at least one CJK glyph?
What if a player with a German locale used their favorite anime character’s name? How do you know whether to use Chinese or Japanese characters on somebody else’s PC? Even a Chinese player should see Japanese characters there. But you would first of all basically need to ask the German player with a drop down menu which language their name is in, which will never happen, so we just assume Chinese. It’s just broken.
I think we clearly need an in-band solution: some character that switches the Asian glyph variant, or separate characters altogether. The former would be annoying for fixed-width Unicode encodings because you'd lose the ability to do random access into a large corpus of text: you'd need to scan the entire text to find out the current Asian glyph variant mode... sigh
Ultimately there are situations that aren't really fixable. People often mix languages in chat messages.
There's some similar issues with right to left, and left to right text. You can give people a good default, and try to be smart, but some cases will always be ugly.
I wonder how well HN and my phone handle this. There are supposed to be Unicode code points that indicate which locale a character should be displayed in. If things are well thought out, my phone should add them automatically, HN should keep them, and your browser should render them correctly.
On an iPhone using,
Chinese simplified keyboard: 刃
Japanese keyboard: 刃
So that didn’t go very well. When choosing the character on my Chinese keyboard it is displayed with correct Chinese strokes but turns into the Japanese version in the text box. I’m guessing for most of you reading, both will appear Chinese.
EDIT: Someone better than me at wrangling unicode can maybe try out the variation selectors, and print the correct variations in a comment. I think it would have been neat if my keyboard ime did it for me :)
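For anyone who wants to experiment: an ideographic variation sequence is just the base character followed by a variation selector from the supplementary range U+E0100..U+E01EF. A Python sketch follows; whether this particular combination is registered in the IVD, and whether your font honors it, is a separate question:

```python
# Compose an ideographic variation sequence (IVS): base CJK character
# plus a variation selector from U+E0100..U+E01EF.
base = "\u5203"      # 刃 (U+5203)
vs17 = "\U000E0100"  # VARIATION SELECTOR-17, first of the ideographic range
sequence = base + vs17

print(sequence)
print([f"U+{ord(c):04X}" for c in sequence])  # → ['U+5203', 'U+E0100']
```

A renderer without IVD support should simply ignore the selector and show the base glyph, which is what makes this mechanism backward compatible.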
> If the glyphs don’t exactly look like the Japanese result sample below, your code is displaying Japanese wrong.
Maybe "exactly" isn't the right term here; it doesn't need to be pixel-perfect. There are still different typefaces, just like with Western languages: for example, one that's supposed to make the characters look more natural or handwritten, one for print, etc.
Also, afaict Han unification was a mistake, but if you thought you'd only ever have 65,535 code points available, it might have been tempting.
How do you do it correctly in a bilingual app? Say your app is in English but you want to display Asian-language file names. Is there any way to tell if a string is Chinese or Japanese? I think CJK variation selectors embedded in the string are not widely used. And it would be a bit overkill to include a language detection heuristic (which would likely fail for short phrases). So should you let the user decide? Default to Japanese on a Japanese PC, otherwise leave it undefined?
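For what it's worth, a crude script-range heuristic can settle some cases: kana implies Japanese and hangul implies Korean, but Han-only strings stay ambiguous, which is exactly the problem under discussion. A sketch, with the ranges deliberately simplified:

```python
def guess_cjk_lang(text):
    """Crude sketch: kana implies Japanese, hangul implies Korean.
    Han-only text remains ambiguous -- which is exactly the problem."""
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:   # hiragana / katakana
            return "ja"
        if 0xAC00 <= cp <= 0xD7A3:   # hangul syllables
            return "ko"
    return None  # no kana/hangul: could be zh, kanji-only ja, or neither

print(guess_cjk_lang("日本語です"))  # → ja (です is kana)
print(guess_cjk_lang("刃"))          # → None: undecidable from codepoints
```

This is why the heuristic fails precisely on short phrases and names, the strings you most often need to label.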
I didn't know about this, but I can't help thinking this sounds like a bug in Unicode to me. If these characters are different, then why does Unicode assign one codepoint to them? Wasn't the promise of Unicode exactly not to do this kind of thing?
Can this be fixed? New character codes for ambiguous characters could be assigned. Of course, this would require manual conversion (with knowledge of the variant) for existing data, but at least it would make this issue go away moving forward (and unconverted legacy data would be "just as bad" as it used to be, so no loss).
This issue with Han Unification was a big reason for the stalled adoption of Unicode/UTF-8 in Ruby for years. UTF-8 by default came to Ruby after 1.9, which added thorough support for a variety of encodings, so that UTF-8 is not the only option.
Han Unification started in the 90s, when computers were big, memory was small, UTF-8 did not exist, and people were trying to fit all characters into a reasonable number of codepoints. Today, with the variable-length encoding of UTF-8 and video streams over 5G, supporting all variants as distinct codepoints and patching text search and sorting with more "normalization" algorithms would not be a problem at all.
But no, there is no particular reason to introduce a longer encoding than modern UTF-8 (which is actually shortened from the original one-to-six-byte encoding). The current set of 1,114,112 Unicode code points is sufficient for at least the foreseeable future, because any new assignment requires demonstrable historic or current use. (Emojis are slightly different, but they still require that the underlying concept is widespread and does not significantly overlap with existing emojis. See [1].) Han characters are the largest source of new assignments to date, and they have yet to fill two of the 17 planes (which would equate to 131K characters).
The other approach is to assign distinct codepoints to each language, but I guess the current approach is better for backward compatibility with pre-Unicode encodings and for less redundancy in Latin-script documents.
The initial three examples don't have a corresponding 'correct render' image next to them, so it's impossible for me to tell since they all render as the same character (which is incorrect given the lack of context).
Checking the source, the page _is_ specifying language tags in the span, which I guess is supposed to help. My system just must not have fonts for those languages so I obviously can't even test them.
There may be issues displaying it in Firefox. Chrome and Safari seem to have displayed it correctly on my end. I'll find time to replace them with images so they appear correct regardless of environment.
For Chinese, the font can change the writing slightly. For example, 刃 (blade) can be any of these (Japanese, Simplified Chinese, and Traditional Chinese). Actually, I would consider the Japanese version in the article to be the Traditional Chinese version:
https://duckduckgo.com/?q=%E5%88%83+%E4%B9%A6%E6%B3%95&iax=i...
At least for Chinese, the difference is the font; they are all valid ways of writing the character. Different writing styles can cause these minor differences as well: http://qiyuan.chaziwang.com/pic/ziyuanimg/E58883.png
If you look at the right side of the above image, you can see how the same character is written in different writing styles.
It is less of a problem for Chinese. Our brains are trained to read them; I would recognize the Japanese version, Simplified Chinese version, and Traditional Chinese version without noticing the difference. But I can imagine it being a problem for Japanese readers, and for other people who do not read Simplified Chinese. Having the locale explicitly set to a country and loading the correct font makes a lot of sense here.
Does this also explain why alphabetic text in Japanese apps and websites often looks so horrible? Like very wide characters with way too much space in between them?
No, the wide characters ("full-width" is their name) are separate code points. I guess they exist because someone wanted to be able to use one Latin letter in place of a Japanese character while keeping the same width. The "normal" letters are called "half-width" here.
Full-width characters are relics of multiple legacy character sets. For example, JIS X 0208, the primary Japanese two-byte character set, has a set of alphanumeric characters in row 0x23, but their widths are not specified, and it is entirely possible to map them to half-width characters when no other character sets are in use. However, it is most commonly paired with JIS X 0201, a single-byte character set with its own alphanumeric characters, so anything from JIS X 0201 is made half-width and anything from JIS X 0208 is made full-width to simplify implementations. This practice stuck and was subsequently carried over into Unicode. The same goes for other languages.
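The full-width forms carried into Unicode are compatibility characters, so NFKC normalization folds them back to their ordinary ASCII equivalents, which is handy when cleaning up text from these legacy sets:

```python
import unicodedata

# Full-width forms (U+FF01..U+FF5E) inherited from the legacy JIS pairing:
fullwidth = "ＡＢＣ１２３"
print(unicodedata.normalize("NFKC", fullwidth))  # → ABC123
```

Note that NFKC is lossy by design: it also folds other compatibility characters (ligatures, superscripts, etc.), so it should be applied deliberately, not as a blanket preprocessing step.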
Alphabetic text in Japanese fonts is primarily designed for documents mainly in Japanese with the occasional Latin-script jargon. There's a variant (full-width), sometimes used, that indeed is very wide, made to match the width of Japanese kanji, but even the proportional forms are pretty light and widely spaced (which results in better typography in mainly-Japanese documents).
Newspaper websites may also have years-old internal typesetting rules, carried over from paper, that mandate that alphabetic text appear in full width (double wide). It looks ugly even to native Japanese readers, and some newspapers have gradually been learning to break out of it.
The latter. In HTML you can specify a specific DOM element as being in a specific language so the browser can render it properly, but if the place you want to quote text isn't as allowing (eg. comment sections with no HTML allowed), there may be no way to ensure correct glyphs.
Ideographic variation selectors plus a very large pan-CJK font may solve this issue in the future, but CJK fonts have already reached the OpenType limit of 65,535 glyphs so we are already running into technical issues.
It's a web-browser-only issue. In other cases, such as a text-processor document or a local app, a font is explicitly set for every run of text, so there is no problem. This issue is the web being what it is: most of the time there is no font explicitly specified for text, and the browser uses a Chinese-looking font to display any Chinese characters.
It’s not just a web browser issue. For example I’m transferring data in multiple Asian languages through some network API. I always need to specify the locale of the text data in a separate data field so that some UI program at the end can display the text correctly. And even then that’s not perfect, because that’s just the system locale instead of the IME locale.
I think this page does a poor job of explaining that all three blade characters in the first example share the same code point in Unicode but are to be displayed/rendered differently depending on which language they are shown as part of.
It is there in the text, but it’s almost hidden between the lines.
If I was a developer with no knowledge of Han characters or Han unification I would have to read two thirds of this article thinking I’m doing it right, so why am I reading this, e.g.: “but I am using the correct code point. It’s the character that the user entered!” or “I copy pasted it from a Japanese text, what do you mean I’m using the wrong character?” before reaching the “how to fix it” and even then I might not realize the root cause.
With that in mind, I might not even make it to the part about how to fix the problem and learn that I am using the right character/code point, but it is still displayed wrong.
I agree the page is somewhat roundabout in its current state since I went from the background to the symptom to the fix. Open to suggestions on rearranging the article so that more devs can implement fixes.
I would put the blade chart a little earlier, make it very obvious that each character shares a code point (maybe add a code point column), and say in the text that yes, this is weird: three visually distinct characters share a single Unicode code point.
For me this was very hard to wrap my head around the first time I encountered the problem. Maybe other people find it hard to understand in different ways.
I believe Unicode even claims that distinct-looking characters are to have their own code points, while similar-looking characters should share one (e.g., there is no French a and English a, even though they are pronounced differently; and Danish ø and Swedish ö have pretty much the same pronunciation but are written differently, so they don't share a code point).
It's more of a problem in pure-text apps than on the Web: for example, in editors (not the rich-text ones), the console, and interface elements. But yes, it is a problem for people who know (or are learning) and use multiple languages at once, e.g. English, Chinese, and Japanese.
I am surprised how many comments here have never heard of Han Unification. The problem is not new, and some of us have been ranting about it for more than a decade: from the UTF-8 Everywhere manifesto in 2012 on HN [1], and a search [2] on HN dates back to 2010.
I am also surprised at the support this problem now has, at least on this thread. Generally speaking, the Han Unification problem doesn't get much, if any, support on HN. Not even empathy. In the name of making Unicode king, they would much rather sacrifice the CJK languages.
The answer or replies were always: it is a "glyph" problem, not a "code" problem, so stop asking Unicode to solve it.
patio11, aka Patrick McKenzie from Stripe, has been one of the most vocal critics of Han Unification. He sums it up far better than I could, quote [3]:
>Reason the Han unification debate in Unicode got so acrimonious, and why lots of Japanese people carry a chip on their shoulder about it to this day.
>"Sorry, grandma, I know you've been sort of attached to your name for the last 80 years, but the white folks find it inconvenient for their computer systems. Don't worry, they promise they'll make something close for you."
>Many of the clients of my ex-day job are married to legacy encodings like Shift-JIS precisely because they do think that their customers and students have a "right" to having their names written correctly.
As mentioned in my other reply, Adobe gets a lot of stick for its subscription model and malware-like Creative Cloud. But they do [4] spend a huge amount of resources on CJK fonts, layout, and encoding (they have their own separate encoding for each CJK language instead of using Unicode). Part of the reason why I like PDF.
>"Sorry, grandma, I know you've been sort of attached to your name for the last 80 years, but the white folks find it inconvenient for their computer systems. Don't worry, they promise they'll make something close for you."
Is there a resource to read more about this? I don't get that vibe from things like:
This can also be a problem in chat apps. I used en-US on Windows (which defaulted to the zh variant) and someone else used ja-JP, and I was wondering why the character was different. It took a while to notice that we were seeing two different things on our screens.
We also have a website about a Japanese game using a Japanese font, except for 0x9bd6. The font's 0x9bd6 is the CN variant, and its 0xe001 is the JP variant of 0x9bd6. Fun times.
Like others said, on the web, you pretty much have to manually assign lang to every single thing. We just added support for CN/TW/KR text. I should come back and check 0x9bd6 in the other versions ...
>> However, this issue is much more than the difference between, say, the lowercase A with the overhang (a) or without (α).
Yes, but actually "α" is the Greek character alpha, whereas "a" is the Latin character "a". So if you displayed "α" as "a" to a Greek person, that too would look αλλ ωρονγ.
Can't this be solved somewhat by adding a "cjk mode" zero-width character, like we have right-to-left/left-to-right embedding characters? Yes, yes, it's yet another standard, but there doesn't seem to be any way to indicate in the text stream itself what characters to use otherwise.
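For comparison, the existing bidi controls are ordinary code points embedded right in the text stream; a hypothetical "CJK variant" control could work the same way. A sketch using the real direction controls:

```python
# Unicode's in-band direction controls, as a model for what a hypothetical
# CJK-variant switch could look like. (RLE/PDF are real, though modern
# guidance prefers the newer isolate controls RLI/PDI.)
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

def rtl_embed(s):
    """Wrap a run of text so renderers lay it out right-to-left."""
    return f"{RLE}{s}{PDF}"

wrapped = rtl_embed("שלום")
print(wrapped.startswith(RLE), wrapped.endswith(PDF))
```

The cost is the same one bidi already pays: the characters introduce invisible state, so naive substring operations can strand text in the wrong mode.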
My own programs are specifically designed to not use Unicode. I think that Unicode is really messy and I dislike it. If you want to display Japanese text, EUC-JP can be used.
Not that I like Unicode much either; amongst other things, the idiotic arrangement of codepoints makes it basically impossible to do remotely efficient text processing. E.g., here's a graph of the automaton RE2 uses to check if something is an uppercase character:
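The scatter behind that automaton is visible even in a small slice of the code space; uppercase letters stop being contiguous almost immediately past ASCII:

```python
import unicodedata

# Code points categorized as uppercase letters (Lu) below U+0500:
uppers = {cp for cp in range(0x500)
          if unicodedata.category(chr(cp)) == "Lu"}

assert 0x41 in uppers and 0x5A in uppers        # ASCII A..Z: one tidy range
assert 0x100 in uppers and 0x101 not in uppers  # Latin Extended-A: Ā/ā alternate
# That per-codepoint alternation is what blows up byte-level automata
# trying to match \p{Lu}.
```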
In practice it does, if it gets rendered at all. Elsewhere on this very page you can find people suggesting storing the display locale alongside the Unicode string, which is really the only way to solve this problem in the general case. But then you might as well store pairs of byte sequence and encoding; there's not much difference between that and a Unicode string plus a locale.
> In practice it does, if it will be rendered at all.
Seems like you're right, at least as far as Firefox is concerned. Testing the data links below, it appears to guess the default language based on the encoding used. Neat!
Aren't the modern text display APIs of the most popular OSes all Unicode-based now? It seems likely that they convert to Unicode when told to display a string in a different codepage, replacing the locale info with the default Unicode behavior (of basing it on the user locale).
So, if I have to display user-entered text (usernames, posts, comments, messages, form data, etc), and I want to do The Right Thing™:
- I cannot rely on user locale, because it might be set to something generic like English, or the user may be bi-lingual.
- I cannot rely on location, because the user may be traveling to a different CJK region, or somewhere else altogether.
- I cannot set a single lang: attribute for the whole page because it'll be wrong for the other two languages.
- The string alone is not sufficient to identify the language because you can write valid sentences in different CJK languages with the same codepoints.
- I cannot have a per-user language setting, because users may be bi-lingual.
What does that leave me? A dropdown list "C/J/K/Other" beside every single text field?
I'm chucking this on my pile of examples of software development being hopelessly broken by design, along with "unix time is non-monotonic and discontinuous at random" (hint: what's the unix time exactly 1e8 seconds, ~3 years, from now? Answer: it's up to the astronomers[1]!).
[1]: https://en.wikipedia.org/wiki/Unix_time#Leap_seconds
Edit: actually, even the dropdown list is insufficient because it only allows one language per string! How is a Japanese user asking for help learning Chinese supposed to write?