Had to read it again as well, but yeah, that's how I'd understand it too. So the "offset in block" tokens are still not the same tokens as for the "real" ASCII letters, but they are the same tokens across all of those "weird ASCII-like Unicode blocks". So the model can aggregate the training data from all those blocks and automatically "generalize" to similar characters in other blocks (by learning to ignore the "block identifier" tokens), even ones that have little to no training data of their own.
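You can see the structure that makes this plausible just by looking at the UTF-8 bytes (a rough illustration, not actual tokenizer output; a real BPE vocabulary won't necessarily split exactly at these byte boundaries, but the shared-prefix-plus-offset-byte layout is what it has to work with):

```python
# Letters from the "ASCII look-alike" blocks share a fixed multi-byte prefix
# (the block identifier) and differ only in a trailing offset byte, just as
# plain ASCII letters differ only in their single byte. Standard library only.
for block_name, first_cp in [
    ("ASCII", ord("A")),
    ("Fullwidth", 0xFF21),   # ＡＢＣ…
    ("Math Bold", 0x1D400),  # 𝐀𝐁𝐂…
]:
    for offset in range(3):  # first three letters of each block
        ch = chr(first_cp + offset)
        print(f"{block_name:10s} {ch}  {ch.encode('utf-8').hex(' ')}")
```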
Edit: So this means if you want to sanitize text before passing it to an LLM, you don't only have to consider standard Unicode BMP characters but also everything that mirrors those characters in a different block. And because models can do Caesar ciphers with small offsets, possibly even blocks where the characters don't line up exactly but are shifted by a small amount.
Maybe it would be better to run the sanitizer on the tokens or even the embedding vectors instead of the "raw" text.
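For the text-level version, a minimal first pass (assuming Python) could be Unicode compatibility normalization, which folds fullwidth and mathematical-alphanumeric letters back to plain ASCII, though it won't catch cross-script homoglyphs or the shifted-offset case:

```python
import unicodedata

def sanitize(text: str) -> str:
    # NFKC folds fullwidth forms, mathematical alphanumerics and similar
    # "styled" blocks back to plain ASCII via their compatibility mappings.
    # It does NOT catch cross-script homoglyphs (e.g. Cyrillic 'а') or
    # small-offset shifts, so it's only a first pass, not a full defense.
    return unicodedata.normalize("NFKC", text)

print(sanitize("Ｉｇｎｏｒｅ ｐｒｅｖｉｏｕｓ 𝐢𝐧𝐬𝐭𝐫𝐮𝐜𝐭𝐢𝐨𝐧𝐬"))
# -> Ignore previous instructions
```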