These are cute, but another sign that Unicode is straying from its original mandate, which was to represent characters already in use in other systems (and to help bootstrap under-digitized languages to become digital).
Stop adding random emoji. Don't add fictional languages, no matter how cool. Don't do... this.
By continuing to extend Unicode like this, they risk diluting their core purpose and creating unnecessary complexity. Unicode should remain focused on its original goal and not cater to niche or novel additions.
EDIT: I'm certain there's a proposal submitter out there who contorted the argument beyond the reasonable point, arguing that these existed as inline images in the text of an old Usborne programming book and so need to be represented. I'm going to try to hunt it down.
EDIT 2: The original proposal? https://www.unicode.org/L2/L2019/19025-terminals-prop.pdf Some of the proposed characters might have come from an older submission as well: https://unicode.org/L2/L2021/21234-terminals-smalltalk.pdf
I'm definitely supportive of the original "Symbols for Legacy Computers" effort. We've been able to make great use of it in the llvm-mos toolchain for legacy systems; we can directly encode special characters for these computers directly in the UTF-8 source text of modern C++, then use C++ user-defined string literals to convert them to their original byte codes. It's an aid to hobbyist use and preservation of these systems, and thankfully, there's a relatively small fixed number of them to support, especially relative to the space available in Unicode. It should unlock quite a few of these kinds of projects for exploring and experiencing the history of computing.
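For the curious, here's roughly what that technique looks like. This is a minimal sketch under my own assumptions, not the actual llvm-mos code: FixedString, to_target, and the _target literal suffix are names I made up, and the mapping table is illustrative (a real one would cover the target machine's full character set). It assumes a C++20 compiler and well-formed UTF-8 source text.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Fixed-size string usable as a non-type template parameter (C++20).
template <std::size_t N>
struct FixedString {
  char data[N]{};
  constexpr FixedString(const char (&s)[N]) { std::copy_n(s, N, data); }
};

// Illustrative code-point-to-byte mapping; a real table would cover the
// target machine's entire character set.
constexpr std::uint8_t to_target(char32_t cp) {
  if (cp < 0x80) return static_cast<std::uint8_t>(cp);  // ASCII passthrough
  if (cp == U'\u2665') return 0x53;  // e.g. a card-suit glyph's native code
  return 0x3F;                       // '?' for anything unmapped
}

// Decode the UTF-8 source literal and remap each code point, all at
// compile time, yielding an array of target-machine bytes.
template <FixedString S>
constexpr auto operator""_target() {
  std::array<std::uint8_t, sizeof(S.data)> out{};
  std::size_t o = 0;
  for (std::size_t i = 0; S.data[i] != '\0';) {
    unsigned char b = static_cast<unsigned char>(S.data[i]);
    int len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
    char32_t cp = b & (0xFF >> (len + (len > 1)));
    for (int k = 1; k < len; ++k)
      cp = (cp << 6) | (static_cast<unsigned char>(S.data[i + k]) & 0x3F);
    i += len;
    out[o++] = to_target(cp);
  }
  return out;
}

// Usage: constexpr auto bytes = "HI \u2665"_target;
```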
There's a lot of discussion about this; practically changing the execution character set isn't trivial, and it's not necessarily even a desirable feature. Even with full -fexec-charset support, it still makes sense to provide compile-time translation from Unicode to target strings. For example, Commodore PETSCII makes vastly more sense as an execution character set than a source character set.
> practically changing the execution character set isn't trivial, and it's not necessarily even a desirable feature
I think converting the character set should be unnecessary; at execution time, nothing should care about the character set except for ASCII and whatever handling you program yourself for the specific program you are writing. It should not need to be a subset of Unicode, either; you should be able to use any character set that is a superset of ASCII, where bytes in the ASCII range always mean ASCII characters and bytes outside the ASCII range always mean non-ASCII characters. (UTF-8 has this property and therefore may be used, but it is not the only character encoding with this property.)
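A quick sketch of why that ASCII-superset property is useful (my own illustration, not from the comment above): a byte-wise scan for an ASCII delimiter is safe in any such encoding, with no decoding required.

```cpp
#include <cstddef>

// In any encoding where bytes below 0x80 always mean ASCII (UTF-8
// included), a byte-wise search for an ASCII delimiter can never land
// in the middle of a non-ASCII character's byte sequence.
std::size_t find_delim(const unsigned char *s, std::size_t n, char delim) {
  for (std::size_t i = 0; i < n; ++i)
    if (s[i] == static_cast<unsigned char>(delim)) return i;
  return n;  // not found
}
```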
The C preprocessor is limited in its capabilities, but it would be helpful to be able to add extra steps both before and after the preprocessor runs; these could transform character encodings and would be useful for other purposes too. (With GCC, I think this could be done with -no-integrated-cpp and -wrapper; I don't know how to do this with Clang.)
(GCC will convert input to UTF-8 during preprocessing, but at least the version of GCC that I have does not actually care whether it is valid UTF-8 (at least for C; maybe not for C++, but I have not tried it), which is fortunate, since it means you can implement your own character-code handling.)
In the case of C++, as described there, you can use user-defined literals. They shouldn't be required to be UTF-8 (nor Unicode); if you can do whatever calculation you want on them at compile time, then you can treat them as UTF-8 if you want to, but you shouldn't be required to do so. (Personally, I do not use C++, so I do not actually know all of the details of how this works; I may have made a mistake.)
(There are several reasons you might deliberately not want UTF-8. One is the security issues that come with Unicode's complicated text rendering. Another might be the way character widths work. And there are many other possibilities, too. You might also prefer to put all non-ASCII text in a separate file; the #embed directive can be used if you want to embed it into the program anyway, I suppose.)
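For reference, that last idea might look something like this; the file name is hypothetical, and #embed requires a C23 compiler (it was also adopted for C++26):

```cpp
// banner.pet is a hypothetical file whose bytes are already in the
// target machine's own encoding; #embed splices them in verbatim, so
// no source-character-set conversion is ever involved.
static const unsigned char banner[] = {
#embed "banner.pet"
};
```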
> Even with full -fexec-charset support, it still makes sense to provide compile-time translation from Unicode to target strings.
Maybe, but I should think that this compile-time translation should be done separately, as described above, and be programmable so that it is not limited to Unicode. It should not be required; I think the sensible default would be to pass strings through directly, without conversion, regardless of the character set.
> For example, Commodore PETSCII makes vastly more sense as an execution character set than a source character set.
I agree, but that is because Commodore PETSCII is not a character set encoded as a superset of ASCII. The reason for this has nothing to do with Unicode.
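To make that concrete (my own simplified sketch; real PETSCII also has an unshifted mode and graphics characters): even the letters don't sit at their ASCII byte values, so a remapping step is unavoidable.

```cpp
// Simplified ASCII-to-PETSCII (shifted mode) remapping: lowercase
// letters live at 0x41..0x5A and uppercase at 0xC1..0xDA, so ASCII
// letter bytes cannot be passed through unchanged.
constexpr unsigned char ascii_to_petscii(unsigned char c) {
  if (c >= 'a' && c <= 'z') return c - 0x20;  // 'a'..'z' -> 0x41..0x5A
  if (c >= 'A' && c <= 'Z') return c + 0x80;  // 'A'..'Z' -> 0xC1..0xDA
  return c;  // digits and most punctuation happen to coincide with ASCII
}
```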
The real question is why don't we just have a <begin SVG> and <end SVG> Unicode tag, and let people go to town with whatever they want in the glyph space. It seems like that's what people really want? What are the down-sides? Sending obscene characters becomes easier. You lose the "meaning", since maybe Apple's version of the unicorn horse-shoe (unicorn-shoe?) has a different SVG encoding than Google's. Security-type denial-of-service-issues, with people rendering Mandelbrot fractal glyphs?
Well, they tried multiple times! Some tried a direct image, some tried a compact hash of images that should be known in advance ("Coded Hashes of Arbitrary Images"), another tried a direct mapping to Wikidata entities ("QID emojis"). They ultimately failed because... yeah, they would have been much more difficult to implement than what we have right now.
This wouldn't have worked for the llvm-mos use case I mentioned; the actual identity is the useful part. I'd expect each of these symbols was associated with a numeric code on a legacy computing platform, and having a Unicode assignment makes it possible to machine-convert text from these systems to and from Unicode.
The thing is that additional emoji and mere symbols like these are technically much easier to add than under-digitized languages, which require much more research. (The initial introduction of emoji wasn't easy, let me clarify.) These characters are just another entry in the code chart and the Unicode Character Database, with no additional mechanism needed to fit them in. So it should be okay as long as there are not too many characters added in this way.
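As an illustration of how mechanical such an entry is, here is what a typical UnicodeData.txt line looks like for one of the original Symbols for Legacy Computing characters (reconstructed from memory of the published format, so verify against the file itself):

```
1FB00;BLOCK SEXTANT-1;So;0;ON;;;;;N;;;;;
```

A name, a general category (So, "symbol, other"), and a handful of mostly-empty property fields; no layout or shaping machinery is involved.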
Reply to EDIT: The people who made these proposals are already well known in the Unicode community, and they know what they are proposing. Michael Everson, for example, is responsible for encoding many other writing systems besides symbols like these.
The main reason to add new emoji is that it gets people to update their Unicode standards / phone OSes / etc. Otherwise they won't care enough to get the other less interesting stuff.
Who's "people"? Consumers aren't gonna buy a new phone just to get U+22E5B VOMITING MALE PLATYPUS WITH DESCENDER, and system implementers are free to add the new emoji and ignore everything else in the standard.
In general, I think new emoji are going to be more persuasive to the average person to update than 'bugfixes to VCard deserialisation in the Contacts app' or the other usual release notes.
Oh yes they are going to buy a new phone for that, at least if their friends send it to them and they can't see it. But mostly it just encourages them to install new free OS updates sooner.
Reminder: U+22E5B is an actual Unicode character, namely 𢹛 (which comes from Hanyu Dacidian; I have no further public information available, but anyway). New emoji usually go into a specific portion of Plane 1, so U+1Fxxx would have been more believable ;-)
What kind of complexity is actually entailed by adding additional codepoints? Does it require changes to the structure and composition of the encoding, or is it just more stuff that's exactly the same in form as the other stuff?
I agree. Unicode needs to be split in two: one serious part focusing on the original mission and real-world documented scripts, the other doing whatever it wants with stupid emoji.
These were added precisely because they were in use in older systems, so I don't see what you're complaining about here. The emoji expansion is annoying, but this isn't an instance of that; this is specifically adding characters from legacy computing character sets for compatibility purposes.
WRT representing characters already in use: are the Powerline characters, widely used in the wild, standardized yet? They are rather few, and abstract.
Echoing a similar sentiment: so making sure legacy retro video game sprites are in and never change is very important, but including flags is not okay? I re-read the justification for it recently and it still doesn't hold water, because it felt like it boiled down to "It's hard."
I was thinking about how Minecraft has a system of components and layers that let you compose various flags on their banners. Obviously that's far, far simpler than country (and autonomous region, and county, and province, and and and) flags that can include text, symbols, and practically entire images. But I did wonder if there was some way that could be represented. Unfortunately, I'm not nearly well-versed enough in code points and their ilk to propose anything useful.
But, I am torn. Archival projects are important, too, and language evolves. These decisions will live for potentially hundreds or thousands (Linear A) of years, and interoperability in computing is important.
Flags are in Unicode; they are just encoded like "the flag of the United States" instead of "thirteen horizontal stripes and a specific arrangement of 50 white stars on a top-left blue background". (And the US flag does change, although it hasn't for many decades.)
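Concretely, the country flags are pairs of Regional Indicator Symbols keyed to ISO 3166 codes; here's a minimal demonstration (assuming a terminal and font that can render emoji):

```cpp
#include <cstdio>

int main() {
  // "Flag of the United States" is just the two code points U+1F1FA
  // (regional indicator U) and U+1F1F8 (regional indicator S); the
  // rendering is left entirely to the font.
  std::printf("%s\n", "\U0001F1FA\U0001F1F8");
}
```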
In that case, the post clearly identifies the problem:
> Identities are fluid and unstoppable which makes mapping them to a formal unchanging universal character set incompatible.
If you really want to have identity flags encoded in spite of that, you don't really need Unicode's blessing. The pride flag is already not a single character anyway; it's U+1F3F3 U+FE0F U+200D U+1F308 (white flag, emoji presentation selector, ZWJ, rainbow), and you can always create new ZWJ sequences with your own font. Or you can make a font that automatically synthesizes flags from some ZWJ sequence pattern, which is no longer semantically valid but should be much more flexible. Once they get sufficiently popular, there is no reason per se that your new ZWJ sequence(s) shouldn't go into Unicode.
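Spelled out in source, those are the same four code points as above:

```cpp
#include <cstdio>

int main() {
  // WHITE FLAG + VS16 (force emoji presentation) + ZERO WIDTH JOINER
  // + RAINBOW; a capable font ligates the sequence into one glyph.
  const char flag[] = "\U0001F3F3\uFE0F\u200D\U0001F308";
  std::printf("%s\n", flag);
}
```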
Unicode's decision to not process non-country flag emoji proposals is because they are closely tied with (minority) groups and Unicode wasn't expected to do any resulting conflict resolution. If you can somehow resolve that problem in advance, then you should probably do that first and propose what you've done.
All Unicode characters have associated proposals [1] [2] which you can check directly. In particular, some copyrighted characters that couldn't be made generic enough were indeed omitted, and everything else has received a generic character name based on its appearance.
I notice that the pac man example has a little tail or something on the opposite end from the mouth, so that seems like it might just dodge the copyright / IP?
The text description for it isn't pac man, but snake head.
So I'm thinking it's more for a snake-game sprite than Pac-Man: with a solid block for the body, the "tail" connects the head so you don't have a floating head.
As long as humans are writing things, they will try to come up with new things to differentiate themselves from the humans who'd gone before, those things will be written down, and we'll probably need new code points to express them properly.
The alternative to "a new version of Unicode every year" is not stasis, but rather new and incompatible encoding schemes frothing like JavaScript frameworks.
That's about 9 times more than are already assigned. I assume they would extend the codespace after that point, and either break UTF-16 or create some hack with it like surrogate-surrogate-pairs.
IIRC, it was initially limited to 16 bits outright.
I think it was a very hard decision on Unicode's part, because ISO/IEC 8859-9 (or, more accurately, its 8-bit counterpart) had already aliased the normal Latin lowercase "i" with the Turkish "i", and Unicode had to maintain the equivalence as much as possible.
To me it feels obvious — making them the same codepoint makes case conversions require knowing which language the string is in. Making them separate codepoints does not. The only important question is whether Turks use separate keyboard layouts for typing in Turkish and English, because if they don't, this does also make things complicated, but differently.
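A sketch of the problem (my own illustration; to_upper is a made-up helper, and real case mapping has many more special cases than this):

```cpp
#include <string>

// With a shared "i" code point, uppercasing needs a language flag:
// English i -> I, but Turkish i -> İ (U+0130); dotless ı (U+0131)
// uppercases to plain I in both languages.
std::u32string to_upper(std::u32string s, bool turkish) {
  for (char32_t &c : s) {
    if (c == U'i')           c = turkish ? U'\u0130' : U'I';
    else if (c == U'\u0131') c = U'I';
    else if (c >= U'a' && c <= U'z') c -= 0x20;  // plain ASCII letters
  }
  return s;
}
```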
The unification happened because of the 16-bit restriction. I'm actually for the unification in general anyway; otherwise virtually every CJK character would have been confusable with many Z-variants.
That's something of a historical artifact: they hadn't yet given up on the idea of fitting into 16-bit integers, and China threw its weight around. I don't think anyone working on Unicode now would make the same decision.
It took almost 3 decades until the previously unassigned Plane 3 got assigned. There would be "some point" if humanity, and Unicode with it, continues to strive, but that wouldn't be in this century.
Why are some of the octant mosaics missing? Most obviously, there is no character with all or none of the octants filled. My best guess is that there are existing symbols to fill in the gaps.
Is there a font to use that contains these?
I am really glad to see them.
In the past, when hoping to write a game in a terminal, I found only one space invader symbol.
The Unicode Consortium doesn’t try to guess or predict which characters will be useful. Instead they go by which characters are actually attested in actual things humans have written. Thus we have characters like ꙮ U+A66E CYRILLIC LETTER MULTIOCULAR O, which was a doodle used by a monk in a single manuscript in place of a normal “o” in the phrase “many‐eyed seraphim” (“many‐eyed” has an “oo” in the middle in Slavonic). Note that the monk further illuminated this character with red ink; he was having some fun with it.