You forgot `- 2^11` for the surrogate pairs. Gee, why isn't Unicode 2^21 code po...

jcranmer · 2025-09-02T16:00:35 1756828835

If you're going to count the surrogate pairs as not-a-Unicode-codepoint, you should also count the other noncharacters: the last two codepoints on each of the 17 planes and the range U+FDD0-U+FDEF.
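For anyone who wants the actual numbers, here is a quick Python sketch of that arithmetic. The constant names are mine, purely for illustration; the ranges are the ones mentioned above.

```python
# Counting code points, surrogates, and noncharacters.
# All names here are illustrative, not from any library.
PLANES = 17
CODE_POINTS = PLANES * 0x10000            # U+0000..U+10FFFF

SURROGATES = 0xDFFF - 0xD800 + 1          # U+D800..U+DFFF, i.e. 2^11

# Last two code points of each plane (U+xxFFFE, U+xxFFFF)
# plus the contiguous range U+FDD0..U+FDEF.
NONCHARACTERS = PLANES * 2 + (0xFDEF - 0xFDD0 + 1)

SCALAR_VALUES = CODE_POINTS - SURROGATES
WITHOUT_NONCHARACTERS = SCALAR_VALUES - NONCHARACTERS

print(CODE_POINTS)            # 1114112
print(SURROGATES)             # 2048
print(NONCHARACTERS)          # 66
print(SCALAR_VALUES)          # 1112064
print(WITHOUT_NONCHARACTERS)  # 1111998
```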

The expansion of Unicode beyond the BMP was designed to facilitate an upgrade compatibility path from UCS-2 systems, but it is extremely incorrect to somehow equate Unicode with UTF-16.
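To make the UCS-2 compatibility point concrete, here is a rough sketch of how UTF-16 reaches beyond the BMP with a pair of surrogate code units. `to_surrogate_pair` is a hypothetical helper name I made up for this; the bit layout (10 bits per surrogate, so 2^20 supplementary code points, i.e. 16 extra planes) is the standard UTF-16 scheme.

```python
def to_surrogate_pair(cp):
    """Split a supplementary code point into UTF-16 high/low surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000              # fits in 20 bits
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00

# Cross-check against Python's own UTF-16 encoder.
encoded = chr(0x1F600).encode("utf-16-be")
print(encoded.hex())  # d83dde00
```

The two 1024-value surrogate ranges are carved out of the BMP itself, which is why those 2048 values can never stand for characters of their own.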

kbolino · 2025-09-02T16:42:00 1756831320

FWIW there is an official term for "code points excluding surrogates": "Unicode scalar value".
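In code, that definition is a one-line range check. A small sketch, where `is_scalar_value` is just a name I picked for illustration. As a side note, Python's UTF-8 encoder draws the same line: it refuses to encode a lone surrogate.

```python
def is_scalar_value(cp):
    """True if cp is a Unicode scalar value: any code point
    in U+0000..U+10FFFF except the surrogates U+D800..U+DFFF.

    Illustrative helper, not a library function.
    """
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


print(is_scalar_value(0x41))      # True
print(is_scalar_value(0xD800))    # False (surrogate)
print(is_scalar_value(0xFFFE))    # True (noncharacter, but still a scalar value)
print(is_scalar_value(0x110000))  # False (out of range)

# A lone surrogate is a code point but not a scalar value,
# so strict UTF-8 encoding rejects it.
try:
    chr(0xD800).encode("utf-8")
    surrogate_rejected = False
except UnicodeEncodeError:
    surrogate_rejected = True
print(surrogate_rejected)  # True
```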

jeberle · 2025-09-02T20:34:59 1756845299

OK, I'm lost here. Why is there a 1:1 correspondence between the two?