This not only affects a potential critic model; the entire concept of a "reasoning" model is based on the same flawed idea: that the model can generate intermediate context to improve its final output. If that self-generated context contains hallucinations, baseless assumptions, or doubt, the final output can only be an amalgamation of those flaws. I've seen the "thinking" output arrive at a correct solution in the first few steps, then talk itself out of it later, or go into logical loops without ever arriving at anything.
The reason "reasoning" models tend to perform better is simply larger scale and better training data. There's nothing inherently better about them. There's nothing intelligent about them either, but that's a separate discussion.
Reasoning models are trained from non-reasoning models of the same scale, and the training data is the output of the same model, filtered through a verifier. Generating intermediate context to improve the final output isn't an idea that reasoning models are based on; it's an outcome of the training process: empirically, the model produces answers that pass the verifier more often when it generates the intermediate steps first.
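A minimal sketch of that data-generation step; every name here (`sample_with_cot`, `verifier_accepts`, the prompt suffix) is made up for illustration, not any lab's actual pipeline:

```python
# Rejection-sampling sketch: keep only the model's own outputs that pass a verifier,
# then fine-tune the base model on those kept traces.

def sample_with_cot(model, prompt, n=8):
    """Sample n candidates, each generating intermediate steps before the answer."""
    return [model.generate(prompt + "\nThink step by step before answering.") for _ in range(n)]

def verifier_accepts(prompt, candidate):
    """Placeholder: for code, run the unit tests; for math, exact-match the final answer."""
    raise NotImplementedError

def build_training_set(model, prompts):
    kept = []
    for prompt in prompts:
        for candidate in sample_with_cot(model, prompt):
            if verifier_accepts(prompt, candidate):
                # The full trace (steps + answer) becomes a fine-tuning example.
                kept.append({"prompt": prompt, "completion": candidate})
    return kept
```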
That the model still makes mistakes doesn't mean it's not an improvement: the non-reasoning base model makes even more mistakes when it tries to skip straight to the answer.
Thanks. I trust that you're more familiar with the internals than myself, so I stand corrected.
I'm only speaking from personal usage experience, and don't trust benchmarks since they are often gamed, but if this process produces objectively better results that aren't achieved by scaling up alone, then that's a good thing.
> The reason why "reasoning" models tend to perform better is simply due to larger scale and better training data.
Except that we can try the exact same pre-trained model with reasoning enabled vs. disabled and empirically observe that reasoning produces better, more accurate results.
Research/benchmarks aside, try giving a somewhat hard programming task to Opus 4 with reasoning off vs. on. Similarly, try the same with o3 vs. o3-pro (o3-pro reasons for much longer).
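If you want to script the comparison rather than flip a switch in the app, the Anthropic API takes extended thinking as a plain request parameter. A minimal sketch; the model ID and token budgets are just examples:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
task = "..."  # your somewhat hard programming task

# Reasoning off: a plain request.
plain = client.messages.create(
    model="claude-opus-4-20250514",  # example ID; use whatever you have access to
    max_tokens=4096,
    messages=[{"role": "user", "content": task}],
)

# Reasoning on: same model, same prompt, extended thinking enabled.
thinking = client.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": task}],
)

print(plain.content[-1].text)
print(thinking.content[-1].text)  # thinking blocks come first, the final answer last
```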
I'm not going to dig through my history for specific examples, but I do these kinds of comparisons occasionally when coding, and it's not unusual to have e.g. a bug that o3 can't figure out, but o3-pro can. I think this is widely accepted by engineers using LLMs to help them code; it's not controversial.
Huh, I wasn't aware that reasoning could be toggled. I use the OpenRouter API, and just saw that this is supported both via their web UI and API. I'm used to Sonnet 3.5 and 4 without reasoning, and their performance is roughly the same IME.
I wouldn't trust comparing two different models, even from the same provider and family, since there could be many reasons for the performance to be different. Their system prompts, training data, context size, or runtime parameters could be different. Even the same model with the same prompt could have varying performance. So it's difficult to get a clear indication that the reasoning steps are the only changing variable.
But toggling it on the same model would be a more reliable way to test this, so I'll try that, thanks.
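Something like this is what I plan to try (untested; the shape of OpenRouter's `reasoning` parameter is my reading of their docs, so treat the field names as an assumption):

```python
import os
import requests

URL = "https://openrouter.ai/api/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}

def ask(prompt, reasoning):
    body = {
        "model": "anthropic/claude-sonnet-4",  # example slug
        "messages": [{"role": "user", "content": prompt}],
        "reasoning": reasoning,  # assumed unified toggle, per OpenRouter's docs
    }
    r = requests.post(URL, headers=HEADERS, json=body, timeout=300)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

prompt = "..."  # the same coding problem for both runs
off = ask(prompt, {"enabled": False})
on = ask(prompt, {"effort": "high"})
```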
It depends on your problem domain and the way you prompt things. Basically, reasoning helps in the cases where having the same model critique itself over multiple turns would also help.
With code, for example, a single shot without reasoning might hallucinate a package or not conform to the rest of the project's style. Then you ask the LLM to check its output, then ask it to revise itself to fix the issue. If the base model can do that, then turning on reasoning basically lets it self-check for those self-correctable mistakes.
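Roughly this loop, done by hand; `chat` is just a stand-in for whatever chat-completion call you use:

```python
def critique_and_revise(chat, task):
    """Manual multi-turn version of what reasoning folds into one pass.
    `chat(messages)` is a stand-in for any chat-completion client."""
    history = [{"role": "user", "content": task}]
    draft = chat(history)
    history.append({"role": "assistant", "content": draft})

    # Ask the same model to check its own output.
    history.append({"role": "user", "content":
        "Check the code above: do all imported packages actually exist, and does it "
        "match the rest of the project's style? List any problems."})
    critique = chat(history)
    history.append({"role": "assistant", "content": critique})

    # Ask it to revise based on its own critique.
    history.append({"role": "user", "content":
        "Revise your original answer to fix the problems you listed."})
    return chat(history)
```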
When generating content, you can ask it to produce intermediate deliverables, like summaries of the input documents, which it then synthesizes into the whole. With reasoning on, it can do those intermediate steps itself and then build on them.
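Same pattern made explicit, again with a stand-in `chat` helper:

```python
def summarize_then_synthesize(chat, documents, brief):
    """Produce intermediate deliverables (per-document summaries), then synthesize.
    With reasoning on, the model can do something like this in a single turn."""
    summaries = [
        chat([{"role": "user", "content": f"Summarize the key points of:\n\n{doc}"}])
        for doc in documents
    ]
    joined = "\n\n".join(f"Summary {i + 1}:\n{s}" for i, s in enumerate(summaries))
    return chat([{"role": "user",
                  "content": f"{brief}\n\nWork from these summaries:\n\n{joined}"}])
```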
The main advantage is that the system autonomously figures out a bunch of intermediate steps and works through them. Again, probably no better than what it could do with some guidance over multiple interactions, but that by itself is a big productivity benefit. The second-gen (or really 1.5-gen) reasoning models also seem to have been trained on enough reasoning traces that they're starting to know about additional factors to consider, so the reasoning loop is tighter.
Reasoning cannot actually be toggled. LLM companies serve completely different models based on whether you have reasoning enabled or disabled for "Opus 4".