For what? It's really hard to say which model is "generally" better than another, as they're all better or worse at specific things.
My own benchmark has a bunch of different tasks I use various local models for, and I run it whenever I want to see if a new model is better than the existing ones I use. The output is basically a markdown table describing which model is best for which task.
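For illustration, here's a minimal sketch of that kind of personal benchmark runner. It assumes a local OpenAI-compatible endpoint (e.g. Ollama at http://localhost:11434/v1); the model names, task prompts, and `score` function are all placeholders you'd swap for your own.

```python
# Minimal personal-benchmark sketch: run each local model over a task list
# and print a markdown table of the best model per task.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

MODELS = ["llama3.1:8b", "qwen2.5-coder:7b"]   # whatever you actually run locally
TASKS = {
    "sql-refactor": "Rewrite this query to avoid the N+1 pattern: ...",
    "regex-extract": "Write a regex that captures ISO 8601 dates in: ...",
}

def score(task_name: str, answer: str) -> float:
    """Placeholder scoring: substitute your own checks (unit tests, string match, rubric)."""
    return float(len(answer) > 0)

rows = []
for model in MODELS:
    for name, prompt in TASKS.items():
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        rows.append((name, model, score(name, resp.choices[0].message.content)))

# Emit the markdown table: one row per task, showing the best-scoring model.
print("| task | best model | score |")
print("|------|------------|-------|")
for name in TASKS:
    best = max((r for r in rows if r[0] == name), key=lambda r: r[2])
    print(f"| {name} | {best[1]} | {best[2]:.2f} |")
```

The whole point is that the scoring is tailored to your tasks, so the table answers "which model do I use for what" rather than a generic leaderboard question.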
They're being sold as general-purpose things that are uniformly better or worse than each other, but reality doesn't reflect this: each one has specific tasks it's better or worse at, and the only way to find that out is by having a private benchmark you run yourself.
They may be, but there are lots of languages, lots of approaches, lots of methodologies and just a ton of different ways to "code"; coding isn't one homogeneous activity that one model beats all the other models at.
> what specific tasks is one performing better than the other?
That's exactly why you create your own benchmark: so you can figure that out across a whole list of models, instead of testing each one individually and going by "feels better".
You probably can't replace a seasoned COBOL programmer with a seasoned Haskell programmer. Does that mean that either person is bad at programming as a whole?
You don't need to use the same model/system for every task. "AI" isn't a monolith; there's a spectrum of solutions for a spectrum of problems, and figuring out what's applicable to your problem today is one of the larger problems of deployment.
It may not be code-only, but it was trained extensively for coding:
> Our base model undergoes several training stages. During pre-training, the model is first trained on 15T tokens of a general pre-training corpus, followed by 7T tokens of a code & reasoning corpus. After pre-training, we introduce additional stages to further enhance the model's performance on key downstream domains.
I don't see how the training process for GLM-4.5 is materially different from that used for Qwen3-235B-A22B-Instruct-2507 - they both did a ton of extra reinforcement learning training related to code.
I think the primary thing you're missing is that Qwen3-235B-A22B-Instruct-2507 != Qwen3-Coder-480B-A35B-Instruct. The difference is that while both get tons of code RL, the Coder variant's post-training focuses entirely on code pipelines, doesn't monitor performance on anything else for forgetting/regression, and isn't meant for other tasks.
I haven't tried them (released yesterday, I think?). The benchmarks look good (similar, I'd say), but that's not saying much these days. The best test you can do is take a couple of cases that match your needs and run them yourself with the harness you're using (aider, cline, roo, any of the CLI tools, etc.). OpenRouter usually has new models up soon after launch, and you can run a quick test really cheaply (and only deal with one provider for billing and such).
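If you'd rather skip the coding harness for a first pass, a quick spot-check through OpenRouter's OpenAI-compatible API is just a few lines. This is a rough sketch: the base URL is OpenRouter's documented endpoint, but the model slugs below are assumptions; check openrouter.ai for the exact IDs before running.

```python
# Quick spot-check of a prompt against a couple of newly released models via OpenRouter.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

prompt = "Refactor this function to be iterative instead of recursive: ..."

for model in ["z-ai/glm-4.5", "qwen/qwen3-coder"]:  # assumed slugs, verify on openrouter.ai
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {model} ---")
    print(resp.choices[0].message.content)
```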