Because we are told that they can solve IMO problems. Yet they fail at basic math problems, not only at factorization but also when probed with relatively basic symbolic math that would not require invoking an external program.
Also, when they fail they could say so instead of giving a hallucinated answer. First the models lie and say that factoring a 20-digit number takes vast amounts of computing. Then, if pointed to a factorization program, they pretend to execute it and lie about the output.
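For context on why the "vast amounts of computing" excuse is wrong: a 20-digit number is easily factored on a laptop with standard textbook algorithms. Here is a minimal sketch using trial division plus Pollard's rho and a deterministic Miller-Rabin primality test (standard algorithms, not anything from the thread; the example number is my own choice):

```python
import math
import random

def is_prime(n: int) -> bool:
    """Deterministic Miller-Rabin; this witness set is valid for all n < 3.3e24."""
    if n < 2:
        return False
    small = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    for p in small:
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in small:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True

def pollard_rho(n: int) -> int:
    """Return a nontrivial factor of an odd composite n."""
    while True:
        c = random.randrange(1, n)
        f = lambda x: (x * x + c) % n
        x = y = random.randrange(2, n)
        d = 1
        while d == 1:
            x = f(x)        # tortoise: one step
            y = f(f(y))     # hare: two steps
            d = math.gcd(abs(x - y), n)
        if d != n:          # d == n means this c failed; retry with a new one
            return d

def factorize(n: int) -> list[int]:
    """Full prime factorization, smallest factors first."""
    factors = []
    for p in (2, 3, 5, 7, 11, 13):  # strip tiny primes by trial division
        while n % p == 0:
            factors.append(p)
            n //= p
    stack = [n] if n > 1 else []
    while stack:
        m = stack.pop()
        if is_prime(m):
            factors.append(m)
        else:
            d = pollard_rho(m)
            stack += [d, m // d]
    return sorted(factors)

# A 20-digit number factors in milliseconds on an ordinary laptop:
print(factorize(12345678901234567890))
# → [2, 3, 3, 5, 101, 3541, 3607, 3803, 27961]
```

Pollard's rho finds a factor in roughly O(p^(1/4)) steps for the smallest prime factor p, so even a hard 20-digit semiprime is well within reach; only numbers hundreds of digits long genuinely require serious computing.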
There is no intelligence or flexibility apart from stealing other people's open source code.
That's why the IMO results were so notable: that was one of those moments where new models demonstrably did something they had previously been unable to do.
I can't fathom why more people aren't talking about the IMO story. Apparently the model they used is not just an LLM but involves some RL too. If a model wins gold at the IMO, is it still merely a "statistical parrot"?
The same thing has also been achieved by a Google DeepMind team and at least one group of independent researchers using publicly available models and careful prompting tricks.