I've only read the abstract, but I also find this strange. I wonder if this is just tapping into computational chains that are already available when tokens are further apart, because the positional encodings were trained that way. If so, that makes the reasoning/modeling powers of LLMs even more impressive and inscrutable.
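For concreteness, here's a minimal sketch of one way position gets baked in; the comment doesn't say which scheme the paper's models use, so I'm using the classic fixed sinusoidal encodings from "Attention Is All You Need" purely as an illustration (the function name and parameters are my own):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal encodings: PE[pos, 2i] = sin(pos / 10000^(2i/d)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dims: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dims: cosine
    return pe

# Dot products between these encodings fall off with relative distance,
# so attention can, in principle, learn distance-dependent behavior --
# which is the kind of "computational chain" being speculated about above.
pe = sinusoidal_positional_encoding(max_len=128, d_model=64)
print(pe[0] @ pe[1], pe[0] @ pe[64])  # nearby vs. far-apart positions
```

Learned or rotary (RoPE) encodings would make distance-dependence a trained property rather than a fixed one, which is closer to the hypothesis here, but the basic picture is the same.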