For what? It's really hard to say which model is "generally" better than another, as they're all better or worse at specific things.
My own benchmark has a bunch of different tasks I use various local models for, and I run it whenever I want to see if a new model is better than the existing ones I use. The output is basically a markdown table describing which model is best for which task.
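For illustration, here's a minimal sketch of that kind of personal benchmark runner. It assumes a local OpenAI-compatible endpoint (e.g. Ollama at http://localhost:11434/v1); the model names, task prompts, and `score` function are all placeholders you'd swap for your own.

```python
# Minimal personal-benchmark sketch: run each local model over a task list
# and print a markdown table of the best model per task.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

MODELS = ["llama3.1:8b", "qwen2.5-coder:7b"]   # whatever you actually run locally
TASKS = {
    "sql-refactor": "Rewrite this query to avoid the N+1 pattern: ...",
    "regex-extract": "Write a regex that captures ISO 8601 dates in: ...",
}

def score(task_name: str, answer: str) -> float:
    """Placeholder scoring: substitute your own checks (unit tests, string match, rubric)."""
    return float(len(answer) > 0)

rows = []
for model in MODELS:
    for name, prompt in TASKS.items():
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        rows.append((name, model, score(name, resp.choices[0].message.content)))

# Emit the markdown table: one row per task, showing the best-scoring model.
print("| task | best model | score |")
print("|------|------------|-------|")
for name in TASKS:
    best = max((r for r in rows if r[0] == name), key=lambda r: r[2])
    print(f"| {name} | {best[1]} | {best[2]:.2f} |")
```

The whole point is that the scoring is tailored to your tasks, so the table answers "which model do I use for what" rather than a generic leaderboard question.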
They're being sold as general-purpose things that are uniformly better or worse than each other, but reality doesn't reflect this: each one has specific tasks it's better or worse at, and the only way to find that out is by having a private benchmark you run yourself.
They may be, but there are lots of languages, lots of approaches, lots of methodologies and just a ton of different ways to "code"; coding isn't one homogeneous activity that one model beats all the other models at.
> what specific tasks is one performing better than the other?
That's exactly why you create your own benchmark: so you can figure that out across a whole list of models, instead of testing each one individually and going by "feels better".
You probably can't replace a seasoned COBOL programmer with a seasoned Haskell programmer. Does that mean that either person is bad at programming as a whole?
You don't need to use the same model/system for every task. "AI" isn't a monolith; there's a spectrum of solutions for a spectrum of problems, and figuring out what's applicable to your problem today is one of the larger problems of deployment.
It may not be code-only, but it was trained extensively for coding:
> Our base model undergoes several training stages. During pre-training, the model is first trained on 15T tokens of a general pre-training corpus, followed by 7T tokens of a code & reasoning corpus. After pre-training, we introduce additional stages to further enhance the model's performance on key downstream domains.
I don't see how the training process for GLM-4.5 is materially different from that used for Qwen3-235B-A22B-Instruct-2507 - they both did a ton of extra reinforcement learning training related to code.
I think the primary thing you're missing is that Qwen3-235B-A22B-Instruct-2507 != Qwen3-Coder-480B-A35B-Instruct. The difference is that while both get tons of code RL, the Coder variant's post-training focuses entirely on code pipelines, doesn't monitor performance on anything else for forgetting/regression, and isn't meant for other tasks.
I haven't tried them (released yesterday, I think?). The benchmarks look good (similar, I'd say), but that's not saying much these days. The best test you can do is take a couple of cases that match your needs and run them yourself with the harness you're using (aider, cline, roo, any of the CLI tools, etc.). OpenRouter usually has new models up soon after launch, and you can run a quick test really cheaply (and only deal with one provider for billing and such).
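If you'd rather skip the coding harness for a first pass, a quick spot-check through OpenRouter's OpenAI-compatible API is just a few lines. This is a rough sketch: the base URL is OpenRouter's documented endpoint, but the model slugs below are assumptions; check openrouter.ai for the exact IDs before running.

```python
# Quick spot-check of a prompt against a couple of newly released models via OpenRouter.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

prompt = "Refactor this function to be iterative instead of recursive: ..."

for model in ["z-ai/glm-4.5", "qwen/qwen3-coder"]:  # assumed slugs, verify on openrouter.ai
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {model} ---")
    print(resp.choices[0].message.content)
```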