One must now ask whether research results are analyzing pure LLMs (e.g. the GPT series) or LLM synthesis engines (e.g. the o-series, r-series). In this case, the headline summarizes a paper originally published in 2023 and does not necessarily have bearing on the newer synthesis engines. In fact, evidence strongly suggests the opposite, given o3's significant performance on ARC-AGI-1, which requires on-the-fly composition capability.
It's Quanta being misleading. They mention several papers but end up with this [1], which talks about decoder-only transformers, not LLMs in general, chatbots, or "LLM synthesis engines," whatever that means. The paper also proves that CoT-like planning lets you squeeze more computation out of a transformer, which is... obvious, but now formally proven. Models trained to do CoT don't have some magical on-the-fly compositional ability; they just invest more computation (possibly dozens of millions of tokens in the case of o3 solving tasks from that benchmark).
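To make the "just more computation" point concrete, here's a back-of-envelope sketch. All numbers below are hypothetical and chosen purely for scale (the model size and token counts are my assumptions, not figures from the paper or from OpenAI); it uses the standard rough estimate of ~2 FLOPs per parameter per generated token and ignores the quadratic attention term:

```python
def forward_pass_flops(n_params: int) -> int:
    # Rough rule of thumb: ~2 FLOPs per parameter per token generated.
    return 2 * n_params

def generation_flops(n_params: int, n_tokens: int) -> int:
    # Autoregressive generation runs one forward pass per emitted token,
    # so total compute grows linearly with the length of the CoT trace.
    return forward_pass_flops(n_params) * n_tokens

# Hypothetical 100B-parameter model: a short direct answer vs. a
# million-token reasoning trace.
direct = generation_flops(100_000_000_000, 100)
long_cot = generation_flops(100_000_000_000, 1_000_000)
print(f"direct answer:  {direct:.2e} FLOPs")
print(f"long CoT trace: {long_cot:.2e} FLOPs")
print(f"ratio: {long_cot // direct}x")  # 10000x more compute spent
```

The point of the arithmetic: a long CoT trace buys the model four orders of magnitude more serial computation without any new mechanism, which is exactly the "squeeze more computation out of a transformer" result, not evidence of a distinct compositional ability.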