These are cute, but another sign that Unicode is straying from its original mandate, which was to represent characters already in use in other systems (and to help bootstrap under-digitized languages to become digital).
Stop adding random emoji. Don't add fictional languages, no matter how cool. Don't do... this.
By continuing to extend Unicode like this, they risk diluting their core purpose and creating unnecessary complexity. Unicode should remain focused on its original goal and not cater to niche or novel additions.
EDIT: I'm certain there's a proposal submitter out there who contorted the argument beyond the reasonable point, arguing that these existed as inline images in the text of an old Usborne programming book and so need to be represented. I'm going to try to hunt it down.
EDIT 2: The original proposal? https://www.unicode.org/L2/L2019/19025-terminals-prop.pdf Some of the proposed characters might have come from an older submission as well: https://unicode.org/L2/L2021/21234-terminals-smalltalk.pdf
I'm definitely supportive of the original "Symbols for Legacy Computers" effort. We've been able to make great use of it in the llvm-mos toolchain for legacy systems; we can directly encode special characters for these computers directly in the UTF-8 source text of modern C++, then use C++ user-defined string literals to convert them to their original byte codes. It's an aid to hobbyist use and preservation of these systems, and thankfully, there's a relatively small fixed number of them to support, especially relative to the space available in Unicode. It should unlock quite a few of these kinds of projects for exploring and experiencing the history of computing.
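For the curious, here's roughly what that technique looks like. This is a minimal sketch under my own assumptions, not the actual llvm-mos code: FixedString, to_target, and the _target literal suffix are names I made up, and the mapping table is illustrative (a real one would cover the target machine's full character set). It assumes a C++20 compiler and well-formed UTF-8 source text.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Fixed-size string usable as a non-type template parameter (C++20).
template <std::size_t N>
struct FixedString {
  char data[N]{};
  constexpr FixedString(const char (&s)[N]) { std::copy_n(s, N, data); }
};

// Illustrative code-point-to-byte mapping; a real table would cover the
// target machine's entire character set.
constexpr std::uint8_t to_target(char32_t cp) {
  if (cp < 0x80) return static_cast<std::uint8_t>(cp);  // ASCII passthrough
  if (cp == U'\u2665') return 0x53;  // e.g. a card-suit glyph's native code
  return 0x3F;                       // '?' for anything unmapped
}

// Decode the UTF-8 source literal and remap each code point, all at
// compile time, yielding an array of target-machine bytes.
template <FixedString S>
constexpr auto operator""_target() {
  std::array<std::uint8_t, sizeof(S.data)> out{};
  std::size_t o = 0;
  for (std::size_t i = 0; S.data[i] != '\0';) {
    unsigned char b = static_cast<unsigned char>(S.data[i]);
    int len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
    char32_t cp = b & (0xFF >> (len + (len > 1)));
    for (int k = 1; k < len; ++k)
      cp = (cp << 6) | (static_cast<unsigned char>(S.data[i + k]) & 0x3F);
    i += len;
    out[o++] = to_target(cp);
  }
  return out;
}

// Usage: constexpr auto bytes = "HI \u2665"_target;
```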
There's a lot of discussion about this; practically changing the execution character set isn't trivial, and it's not necessarily even a desirable feature. Even with full -fexec-charset support, it still makes sense to provide compile-time translation from Unicode to target strings. For example, Commodore PETSCII makes vastly more sense as an execution character set than a source character set.
> practically changing the execution character set isn't trivial, and it's not necessarily even a desirable feature
I think converting the character set should be unnecessary; at execution time, nothing should care about the character set except for ASCII and whatever handling you program yourself for the specific program you are writing. It should not need to be a subset of Unicode, either; you should be able to use any character set that is a superset of ASCII, where bytes in the ASCII range always mean ASCII characters and bytes outside the ASCII range always mean non-ASCII characters. (UTF-8 has this property and therefore may be used, but it is not the only character encoding with this property.)
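A quick sketch of why that ASCII-superset property is useful (my own illustration, not from the comment above): a byte-wise scan for an ASCII delimiter is safe in any such encoding, with no decoding required.

```cpp
#include <cstddef>

// In any encoding where bytes below 0x80 always mean ASCII (UTF-8
// included), a byte-wise search for an ASCII delimiter can never land
// in the middle of a non-ASCII character's byte sequence.
std::size_t find_delim(const unsigned char *s, std::size_t n, char delim) {
  for (std::size_t i = 0; i < n; ++i)
    if (s[i] == static_cast<unsigned char>(delim)) return i;
  return n;  // not found
}
```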
The C preprocessor is limited in its capabilities, but it would be helpful to be able to add extra steps both before and after the preprocessor runs; these could transform character encodings and would be useful for other purposes too. (With GCC, I think this could be done with -no-integrated-cpp and -wrapper; I don't know how to do this with Clang.)
(GCC will convert input to UTF-8 during preprocessing, but at least the version of GCC that I have does not actually care whether it is valid UTF-8 (at least for C; maybe not for C++, but I have not tried it), which is fortunate, since it means you can implement your own character-code handling.)
In the case of C++, as described there, you can use user-defined literals. They shouldn't be required to be UTF-8 (nor Unicode); if you can do whatever calculation you want on them at compile time, then you can treat them as UTF-8 if you want to, but you shouldn't be required to do so. (Personally, I do not use C++, so I do not actually know all of the details of how this works; I may have made a mistake.)
(There are several reasons you might deliberately not want UTF-8. One is the security issues that come with Unicode's complicated text rendering. Another might be the way character widths work. And there are many other possibilities, too. You might also prefer to put all non-ASCII text in a separate file; the #embed directive can be used if you want to embed it into the program anyway, I suppose.)
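For reference, that last idea might look something like this; the file name is hypothetical, and #embed requires a C23 compiler (it was also adopted for C++26):

```cpp
// banner.pet is a hypothetical file whose bytes are already in the
// target machine's own encoding; #embed splices them in verbatim, so
// no source-character-set conversion is ever involved.
static const unsigned char banner[] = {
#embed "banner.pet"
};
```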
> Even with full -fexec-charset support, it still makes sense to provide compile-time translation from Unicode to target strings.
Maybe, but I should think that this compile-time translation should be done separately, as described above, and be programmable so that it is not limited to Unicode. It should not be required; I think the sensible default would be to pass strings through directly, without conversion, regardless of the character set.
> For example, Commodore PETSCII makes vastly more sense as an execution character set than a source character set.
I agree, but that is because Commodore PETSCII is not a character set encoded as a superset of ASCII. The reason for this has nothing to do with Unicode.
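To make that concrete (my own simplified sketch; real PETSCII also has an unshifted mode and graphics characters): even the letters don't sit at their ASCII byte values, so a remapping step is unavoidable.

```cpp
// Simplified ASCII-to-PETSCII (shifted mode) remapping: lowercase
// letters live at 0x41..0x5A and uppercase at 0xC1..0xDA, so ASCII
// letter bytes cannot be passed through unchanged.
constexpr unsigned char ascii_to_petscii(unsigned char c) {
  if (c >= 'a' && c <= 'z') return c - 0x20;  // 'a'..'z' -> 0x41..0x5A
  if (c >= 'A' && c <= 'Z') return c + 0x80;  // 'A'..'Z' -> 0xC1..0xDA
  return c;  // digits and most punctuation happen to coincide with ASCII
}
```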
The real question is why don't we just have a <begin SVG> and <end SVG> Unicode tag, and let people go to town with whatever they want in the glyph space. It seems like that's what people really want? What are the down-sides? Sending obscene characters becomes easier. You lose the "meaning", since maybe Apple's version of the unicorn horse-shoe (unicorn-shoe?) has a different SVG encoding than Google's. Security-type denial-of-service-issues, with people rendering Mandelbrot fractal glyphs?
Well, they tried multiple times! Some tried a direct image, some tried a compact hash of images that should be known in advance ("Coded Hashes of Arbitrary Images"), another tried a direct mapping to Wikidata entities ("QID emojis"). They ultimately failed because... yeah, they would have been much more difficult to implement than what we have right now.
This wouldn't have worked for the llvm-mos use case I mentioned; the actual identity is the useful part. I'd expect each of these symbols was associated with a numeric code on a legacy computing platform, and having a Unicode assignment makes it possible to machine-convert text from these systems to and from Unicode.
The thing is that additional emoji and mere symbols like these are technically much easier to add than under-digitized languages, which require much more research. (The initial introduction of emoji wasn't easy, let me clarify.) These characters are just another entry in the code chart and the Unicode Character Database, with no additional mechanism needed to fit them in. So it should be okay as long as there are not too many characters added in this way.
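As an illustration of how mechanical such an entry is, here is what a typical UnicodeData.txt line looks like for one of the original Symbols for Legacy Computing characters (reconstructed from memory of the published format, so verify against the file itself):

```
1FB00;BLOCK SEXTANT-1;So;0;ON;;;;;N;;;;;
```

A name, a general category (So, "symbol, other"), and a handful of mostly-empty property fields; no layout or shaping machinery is involved.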
Reply to EDIT: The people who made these proposals are already well known in the Unicode community, and they know what they are proposing. Michael Everson, for example, is responsible for encoding many other writing systems besides symbols like these.
The main reason to add new emoji is that it gets people to update their Unicode standards / phone OSes / etc. Otherwise they won't care enough to get the other less interesting stuff.
Who's "people"? Consumers aren't gonna buy a new phone just to get U+22E5B VOMITING MALE PLATYPUS WITH DESCENDER, and system implementers are free to add the new emoji and ignore everything else in the standard.
In general, I think new emoji are going to be more persuasive to the average person to update than 'bugfixes to VCard deserialisation in the Contacts app' or the other usual release notes.
Oh yes they are going to buy a new phone for that, at least if their friends send it to them and they can't see it. But mostly it just encourages them to install new free OS updates sooner.
Reminder: U+22E5B is an actual Unicode character, namely 𢹛 (which comes from Hanyu Dacidian; I have no further public information available, but anyway). New emoji usually go into a specific portion of Plane 1, so U+1Fxxx would have been more believable ;-)
What kind of complexity is actually entailed by adding additional codepoints? Does it require changes to the structure and composition of the encoding, or is it just more stuff that's exactly the same in form as the other stuff?
I agree. Unicode needs to be split in two: one serious part focusing on the original mission and real-world documented scripts, the other doing whatever it wants with stupid emoji.
These were added precisely because they were in use in older systems, so I don't see what you're complaining about here. The emoji expansion is annoying, but this isn't an instance of that; this is specifically adding characters from legacy computing character sets for compatibility purposes.
WRT representing characters already in use: are the Powerline characters, widely used in the wild, standardized yet? They are rather few, and abstract.
Echoing a similar sentiment: so making sure legacy retro video game sprites are in and never change is very important, but including flags is not okay? I re-read the justification for it recently and it still doesn't hold water, because it felt like it boiled down to "It's hard."
I was thinking about how Minecraft has a system of components and layers that let you compose various flags on their banners. Obviously that's far, far simpler than country (and autonomous region, and county, and province, and and and) flags that can include text, symbols, and practically entire images. But I did wonder if there was some way that could be represented. Unfortunately, I'm not nearly well-versed enough in code points and their ilk to propose anything useful.
But, I am torn. Archival projects are important, too, and language evolves. These decisions will live for potentially hundreds or thousands (Linear A) of years, and interoperability in computing is important.
Flags are in Unicode; they are just encoded like "the flag of the United States" instead of "thirteen horizontal stripes and a specific arrangement of 50 white stars on a top-left blue background". (And the US flag does change, although it hasn't for many decades.)
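Concretely, the country flags are pairs of Regional Indicator Symbols keyed to ISO 3166 codes; here's a minimal demonstration (assuming a terminal and font that can render emoji):

```cpp
#include <cstdio>

int main() {
  // "Flag of the United States" is just the two code points U+1F1FA
  // (regional indicator U) and U+1F1F8 (regional indicator S); the
  // rendering is left entirely to the font.
  std::printf("%s\n", "\U0001F1FA\U0001F1F8");
}
```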
In that case, the post clearly identifies the problem:
> Identities are fluid and unstoppable which makes mapping them to a formal unchanging universal character set incompatible.
If you really want to have identity flags encoded in spite of that, you don't really need Unicode's blessing. The pride flag is already not a single character anyway; it's U+1F3F3 U+FE0F U+200D U+1F308 (white flag, emoji presentation selector, ZWJ, rainbow), and you can always create new ZWJ sequences with your own font. Or you can make a font that automatically synthesizes flags from some ZWJ sequence pattern, which is no longer semantically valid but should be much more flexible. Once they get sufficiently popular, there is no reason per se that your new ZWJ sequence(s) shouldn't go into Unicode.
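Spelled out in source, those are the same four code points as above:

```cpp
#include <cstdio>

int main() {
  // WHITE FLAG + VS16 (force emoji presentation) + ZERO WIDTH JOINER
  // + RAINBOW; a capable font ligates the sequence into one glyph.
  const char flag[] = "\U0001F3F3\uFE0F\u200D\U0001F308";
  std::printf("%s\n", flag);
}
```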
Unicode's decision to not process non-country flag emoji proposals is because they are closely tied with (minority) groups and Unicode wasn't expected to do any resulting conflict resolution. If you can somehow resolve that problem in advance, then you should probably do that first and propose what you've done.
All Unicode characters have associated proposals [1] [2] which you can check directly. In particular, some copyrighted characters that couldn't be made generic enough were indeed omitted, and everything else has received a generic character name based on its appearance.
I notice that the pac man example has a little tail or something on the opposite end from the mouth, so that seems like it might just dodge the copyright / IP?
The text description for it isn't pac man, but snake head.
So I'm thinking it's more for a snake-game sprite than Pac-Man: with a solid block for the body, the "tail" connects the head so you don't have a floating head.
As long as humans are writing things, they will try to come up with new things to differentiate themselves from the humans who'd gone before, those things will be written down, and we'll probably need new code points to express them properly.
The alternative to "a new version of Unicode every year" is not stasis, but rather new and incompatible encoding schemes frothing like JavaScript frameworks.
That's about 9 times more than are already assigned. I assume they would extend the codespace after that point, and either break UTF-16 or create some hack with it like surrogate-surrogate-pairs.
IIRC, it was initially limited to 16 bits outright.
I think it was a very hard decision on Unicode's part, because ISO/IEC 8859-9 (or, more accurately, its 8-bit counterpart) had already aliased the normal Latin lowercase "i" with the Turkish "i", and Unicode had to maintain the equivalence as much as possible.
To me it feels obvious — making them the same codepoint makes case conversions require knowing which language the string is in. Making them separate codepoints does not. The only important question is whether Turks use separate keyboard layouts for typing in Turkish and English, because if they don't, this does also make things complicated, but differently.
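A sketch of the problem (my own illustration; to_upper is a made-up helper, and real case mapping has many more special cases than this):

```cpp
#include <string>

// With a shared "i" code point, uppercasing needs a language flag:
// English i -> I, but Turkish i -> İ (U+0130); dotless ı (U+0131)
// uppercases to plain I in both languages.
std::u32string to_upper(std::u32string s, bool turkish) {
  for (char32_t &c : s) {
    if (c == U'i')           c = turkish ? U'\u0130' : U'I';
    else if (c == U'\u0131') c = U'I';
    else if (c >= U'a' && c <= U'z') c -= 0x20;  // plain ASCII letters
  }
  return s;
}
```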
The unification happened because of the 16-bit restriction. I'm actually for the unification in general anyway; otherwise virtually every CJK character would have been confusable with many Z-variants.
That's something of a historical artifact: they hadn't yet given up on the idea of fitting into 16-bit integers, and China threw its weight around. I don't think anyone working on Unicode now would make the same decision.
It took almost 3 decades until the previously unassigned Plane 3 got assigned. There would be "some point" if humanity, and Unicode with it, continues to strive, but that wouldn't be in this century.
Why are some of the octant mosaics missing? Most obviously, there is no character with all or none of the octants filled. My best guess is that there are existing symbols to fill in the gaps.
Is there a font to use that contains these?
I am really glad to see them.
In the past, when hoping to write a game in a terminal, I found only one space invader symbol.
The Unicode Consortium doesn’t try to guess or predict which characters will be useful. Instead they go by which characters are actually attested in actual things humans have written. Thus we have characters like ꙮ U+A66E CYRILLIC LETTER MULTIOCULAR O, which was a doodle used by a monk in a single manuscript in place of a normal “o” in the phrase “many‐eyed seraphim” (“many‐eyed” has an “oo” in the middle in Slavonic). Note that the monk further illuminated this character with red ink; he was having some fun with it.