This not only affects a potential critic model; the entire concept of a "reasoning" model is based on the same flawed idea: that the model can generate intermediate context to improve its final output. If that self-generated context contains hallucinations, baseless assumptions, or doubt, the final output can only be an amalgamation of those flaws. I've seen the "thinking" output arrive at a correct solution in the first few steps, then talk itself out of it later, or go into logical loops without ever arriving at anything.
The reason "reasoning" models tend to perform better is simply larger scale and better training data. There's nothing inherently better about them. There's nothing intelligent about them either, but that's a separate discussion.
Reasoning models are trained from non-reasoning models of the same scale, and the training data is the output of the same model, filtered through a verifier. Generating intermediate context to improve the final output isn't an idea that reasoning models are based on; it's an outcome of the training process: empirically, the model produces answers that pass the verifier more often when it generates the intermediate steps first.
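A minimal sketch of that data-generation step; every name here (`sample_with_cot`, `verifier_accepts`, the prompt suffix) is made up for illustration, not any lab's actual pipeline:

```python
# Rejection-sampling sketch: keep only the model's own outputs that pass a verifier,
# then fine-tune the base model on those kept traces.

def sample_with_cot(model, prompt, n=8):
    """Sample n candidates, each generating intermediate steps before the answer."""
    return [model.generate(prompt + "\nThink step by step before answering.") for _ in range(n)]

def verifier_accepts(prompt, candidate):
    """Placeholder: for code, run the unit tests; for math, exact-match the final answer."""
    raise NotImplementedError

def build_training_set(model, prompts):
    kept = []
    for prompt in prompts:
        for candidate in sample_with_cot(model, prompt):
            if verifier_accepts(prompt, candidate):
                # The full trace (steps + answer) becomes a fine-tuning example.
                kept.append({"prompt": prompt, "completion": candidate})
    return kept
```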
That the model still makes mistakes doesn't mean it's not an improvement: the non-reasoning base model makes even more mistakes when it tries to skip straight to the answer.
Thanks. I trust that you're more familiar with the internals than myself, so I stand corrected.
I'm only speaking from personal usage experience, and don't trust benchmarks since they are often gamed, but if this process produces objectively better results that aren't achieved by scaling up alone, then that's a good thing.
> The reason why "reasoning" models tend to perform better is simply due to larger scale and better training data.
Except that we can try the exact same pre-trained model with reasoning enabled vs. disabled and empirically observe that reasoning produces better, more accurate results.
Research/benchmarks aside, try giving a somewhat hard programming task to Opus 4 with reasoning off vs. on. Similarly, try the same with o3 vs. o3-pro (o3-pro reasons for much longer).
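If you want to script the comparison rather than flip a switch in the app, the Anthropic API takes extended thinking as a plain request parameter. A minimal sketch; the model ID and token budgets are just examples:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
task = "..."  # your somewhat hard programming task

# Reasoning off: a plain request.
plain = client.messages.create(
    model="claude-opus-4-20250514",  # example ID; use whatever you have access to
    max_tokens=4096,
    messages=[{"role": "user", "content": task}],
)

# Reasoning on: same model, same prompt, extended thinking enabled.
thinking = client.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": task}],
)

print(plain.content[-1].text)
print(thinking.content[-1].text)  # thinking blocks come first, the final answer last
```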
I'm not going to dig through my history for specific examples, but I do these kinds of comparisons occasionally when coding, and it's not unusual to have e.g. a bug that o3 can't figure out, but o3-pro can. I think this is widely accepted by engineers using LLMs to help them code; it's not controversial.
Huh, I wasn't aware that reasoning could be toggled. I use the OpenRouter API, and just saw that this is supported both via their web UI and API. I'm used to Sonnet 3.5 and 4 without reasoning, and their performance is roughly the same IME.
I wouldn't trust comparing two different models, even from the same provider and family, since there could be many reasons for the performance to be different. Their system prompts, training data, context size, or runtime parameters could be different. Even the same model with the same prompt could have varying performance. So it's difficult to get a clear indication that the reasoning steps are the only changing variable.
But toggling it on the same model would be a more reliable way to test this, so I'll try that, thanks.
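Something like this is what I plan to try (untested; the shape of OpenRouter's `reasoning` parameter is my reading of their docs, so treat the field names as an assumption):

```python
import os
import requests

URL = "https://openrouter.ai/api/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}

def ask(prompt, reasoning):
    body = {
        "model": "anthropic/claude-sonnet-4",  # example slug
        "messages": [{"role": "user", "content": prompt}],
        "reasoning": reasoning,  # assumed unified toggle, per OpenRouter's docs
    }
    r = requests.post(URL, headers=HEADERS, json=body, timeout=300)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

prompt = "..."  # the same coding problem for both runs
off = ask(prompt, {"enabled": False})
on = ask(prompt, {"effort": "high"})
```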
It depends on your problem domain and the way you prompt things. Basically, reasoning helps in the cases where having the same model critique itself over multiple turns would also help.
With code, for example, a single shot without reasoning might hallucinate a package or not conform to the rest of the project's style. Then you ask the LLM to check its output, then ask it to revise itself to fix the issue. If the base model can do that, then turning on reasoning basically lets it self-check for those self-correctable mistakes.
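Roughly this loop, done by hand; `chat` is just a stand-in for whatever chat-completion call you use:

```python
def critique_and_revise(chat, task):
    """Manual multi-turn version of what reasoning folds into one pass.
    `chat(messages)` is a stand-in for any chat-completion client."""
    history = [{"role": "user", "content": task}]
    draft = chat(history)
    history.append({"role": "assistant", "content": draft})

    # Ask the same model to check its own output.
    history.append({"role": "user", "content":
        "Check the code above: do all imported packages actually exist, and does it "
        "match the rest of the project's style? List any problems."})
    critique = chat(history)
    history.append({"role": "assistant", "content": critique})

    # Ask it to revise based on its own critique.
    history.append({"role": "user", "content":
        "Revise your original answer to fix the problems you listed."})
    return chat(history)
```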
When generating content, you can ask it to produce intermediate deliverables, like summaries of the input documents, which it then synthesizes into the whole. With reasoning on, it can do those intermediate steps itself and then build on them.
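Same pattern made explicit, again with a stand-in `chat` helper:

```python
def summarize_then_synthesize(chat, documents, brief):
    """Produce intermediate deliverables (per-document summaries), then synthesize.
    With reasoning on, the model can do something like this in a single turn."""
    summaries = [
        chat([{"role": "user", "content": f"Summarize the key points of:\n\n{doc}"}])
        for doc in documents
    ]
    joined = "\n\n".join(f"Summary {i + 1}:\n{s}" for i, s in enumerate(summaries))
    return chat([{"role": "user",
                  "content": f"{brief}\n\nWork from these summaries:\n\n{joined}"}])
```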
The main advantage is that the system autonomously figures out a bunch of intermediate steps and works through them. Again, probably no better than what it could do with some guidance over multiple interactions, but that by itself is a big productivity benefit. The second-gen (or really 1.5-gen) reasoning models also seem to have been trained on enough reasoning traces that they're starting to know about additional factors to consider, so the reasoning loop is tighter.
Reasoning cannot actually be toggled. LLM companies serve completely different models based on whether you have reasoning enabled or disabled for "Opus 4".