
I wonder if the authors can explain the apparent inconsistency between what we now know about R1 and their statement “They don’t engage in logical reasoning” from the first lesson. My simple-minded view of logical reasoning by LLMs is this: a hard question (say, a math puzzle) has an answer that is hard to produce but easy to verify, yet within the realm of knowledge of humans or of the LLM itself, so the “thought” stream lets the LLM increase its confidence, by a self-discovered process that resembles human reasoning, before it starts writing the answer stream. Much of the thought process these LLMs use looks like conventional reasoning and logic, or more generally like higher-level algorithms for gaining confidence in an answer; other parts are not possible for humans to understand (yet?) despite the best efforts by DeepSeek. When combined with tools for the boring parts, these “reasoning” approaches can start to resemble human research processes, as in Deep Research by OpenAI.


I think part of this is that you can't trust the "thinking" output of an LLM to accurately convey what is going on inside the model. The "thought" stream is just more statistically derived tokens based on the corpus. If you take the question "Is A a member of the set {A, B}?", the LLM doesn't internally build a discrete representation of "A" as an object belonging to a two-object set and then arrive at a distinct and absolute answer. The generated token "yes" is just the statistically most likely next token that follows those tokens given its corpus. And logical reasoning is definitionally not a process of "gaining confidence", which is all an LLM can really do so far.


As an example, I have asked tools like DeepSeek to solve fairly simple Sudoku puzzles, and while they output a bunch of stuff that looks like logical reasoning, no system has yet produced a correct answer.

When solving combinatorics puzzles, DeepSeek will again produce stuff that looks convincing, but it often makes incorrect logical steps and ends up with wrong answers.


Then one has to ask: is it producing a facsimile of reasoning with no logic behind it, or is it just reasoning poorly?


Teaching an LLM to solve a full-sized Sudoku is not a goal right now. As an RLHF’er, I’d estimate it would take 10-20 hours to guide a model to the right answer for a single board.

Then you’d need thousands of these for the model (or the next model) to ingest. And each RLHF’er’s work needs checking, which at least doubles the hours per task.

It can’t do it because RLHF’ers haven’t taught models on large enough boards en masse yet.

And there are thousands of pen-and-paper games, each one needing thousands of RLHF’ers to train models on. Each game would start at the smallest non-trivial board size and take a year for a modest jump in board size. Doing this is not in any AI company’s budget.


If it were actually reasoning generally, though, it wouldn't need to be trained on each game. It could be told the rules and figure things out from there.


Even worse, the LLM is supposed to already "know" Sudoku rules. Either that, or it doesn't "know" anything that was scraped from the web...


Here is o3-mini on a simple Sudoku. In general the puzzle can be hard to explore combinatorially even with modern SAT solvers, so I picked one marked as “easy”. It looks to me like it solved it, but I didn't confirm beyond a quick visual inspection.

https://chatgpt.com/share/67aa1bcc-eb44-8007-807f-0a49900ad6...


And thus we have the AI problem in a nutshell. You think it can reason because it can describe the process in well-written language. Anyone who can state the reasoning below clearly "understands" the problem:

> For example, in the top‐left 3×3 block (rows 1–3, columns 1–3) the givens are 7, 5, 9, 3, and 4 so the missing digits {1,2,6,8} must appear in the three blank cells. (Later, other intersections force, say, one cell to be 1 or 6, etc.)

It's good logic. Clearly it "knows" if it can break the problem down like this.

Of course, if we stretch ourselves slightly and actually check beyond a quick visual inspection, we quickly see that it put a second 4 in that first box despite "knowing" it shouldn't. In fact, several of the boxes have duplicate numbers, despite the clear reasoning above.

Does the reasoning just not get used in the solving part? Or maybe a machine built to regurgitate plausible text can also regurgitate plausible reasoning?
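
Checking this mechanically takes only a few lines of Python. A minimal sketch, assuming the model's answer has been transcribed into a 9x9 list of ints (the helper name and the transcription step are assumptions, not anything from the shared chat):

    def violations(grid):
        """Return (unit_name, duplicated_digits) for every row, column and
        3x3 box of a 9x9 grid that contains a repeated digit (0 = blank)."""
        units = []
        for i in range(9):
            units.append((f"row {i + 1}", [grid[i][j] for j in range(9)]))
            units.append((f"column {i + 1}", [grid[j][i] for j in range(9)]))
        for br in range(3):
            for bc in range(3):
                cells = [grid[3 * br + r][3 * bc + c]
                         for r in range(3) for c in range(3)]
                units.append((f"box ({br + 1},{bc + 1})", cells))
        bad = []
        for name, cells in units:
            dups = sorted({v for v in cells if v and cells.count(v) > 1})
            if dups:
                bad.append((name, dups))
        return bad

Running violations(answer) on the transcribed grid would immediately list every unit with duplicates, e.g. the second 4 in the top-left box.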


Thanks for spotting this. The solution is indeed wrong, and I agree that the machine can regurgitate plausible reasoning in principle. If it ran in a loop, I would bet it could probably figure this particular problem out eventually, but I'm not sure that matters much in the end. The only plausible way to do some of these Sudoku puzzles is a SAT solver, and I'm sure that, given the right environment, an LLM could just code and execute one and get the answer. Does that mean it can't "reason" because it couldn't solve this Sudoku puzzle, or know that it made a mistake? I'm not sure I'd go that far, but I agree that my example didn't match my claim.

The model didn't do a careful job and didn't quadruple-check its work as I would have expected from an advanced AI, but remember that this is o3-mini, not something that is supposed to be full-blown AI yet. If you had asked GPT-3.5 for something similar, the answer would have been amusingly simplistic; now it is at least starting to get close.

I now wonder if I made a typo when I copied this puzzle from an image to my phone app, rendering it unsolvable. The model should still have spotted such an error anyway, but of course it is not tuned to perfection.
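
For what it's worth, "code and execute one" doesn't even require a full SAT encoding for a single 9x9 board; a plain backtracking search is enough. A minimal sketch (my own, not anything the model produced; it assumes the givens themselves don't already clash, which the duplicate check above would catch), and it also reports when a transcribed puzzle has no solution:

    def solve(grid):
        """Backtracking Sudoku solver. grid is a 9x9 list of ints, 0 = blank.
        Fills the grid in place and returns True, or returns False if the
        blanks cannot be filled consistently (e.g. a transcription typo)."""
        def candidates(r, c):
            used = set(grid[r]) | {grid[i][c] for i in range(9)}
            br, bc = 3 * (r // 3), 3 * (c // 3)
            used |= {grid[br + i][bc + j] for i in range(3) for j in range(3)}
            return [v for v in range(1, 10) if v not in used]

        empties = [(r, c) for r in range(9) for c in range(9) if grid[r][c] == 0]
        if not empties:
            return True
        # Try the most constrained cell first; this keeps easy puzzles fast
        # without any SAT machinery.
        r, c = min(empties, key=lambda rc: len(candidates(*rc)))
        for v in candidates(r, c):
            grid[r][c] = v
            if solve(grid):
                return True
            grid[r][c] = 0
        return False

If solve(...) returns False on the transcribed givens, that would confirm the typo theory.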


Yeah, I think this was the wrong puzzle to try, according to:

https://sudoku.com/sudoku-solver

A bummer.



