> Why does, 'mat' follow from 'the cat sat on the ...'

You're confidently incorrect: you've oversimplified all LLMs down to a base model completing a trivial five-word context.

This is tantamount to a straw man. Not only do few people use untuned base models; the framing also ignores in-context learning, which lets the model build complex semantic structures from the relationships learnt from its training data.

Unlike base-model pretraining, instruct and chat fine-tuning teaches models to 'reason' (or rather, to perform semantic calculations in abstract latent spaces) with their "conditional probability structure", as you call it, to varying extents. The model must learn to use its 'facts', understand semantics, and perform abstractions in order to follow arbitrary instructions.

You're also conflating the training metric of "predicting tokens" with the mechanisms required to satisfy this metric for complex instructions. It's like saying "animals are just performing survival of the fittest": technically correct, yet enormously complex behaviours evolve to satisfy that 'survival' metric.

You could argue they're "just stitching together phrases", but then you would be varying degrees of wrong:

For one, this assumes phrases are compressed into semantically addressable units, which is already a form of abstraction that allows reasoning beyond 'stochastic parroting'.

For two, it's well known that the first layers perform basic structural analysis such as grammar, and later layers perform increasing levels of abstract processing.

For three, it shows a lack of understanding of how transformers perform semantic computation in-context from the relationships learnt by the feed-forward layers. If you're genuinely interested in the computation model of transformers and how attention can perform semantic computation, take a look here: https://srush.github.io/raspy/ (a toy sketch follows below).
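
To make that concrete, here's a minimal numpy sketch (not actual RASPy code, just a hand-wired toy with no training) of the kind of primitive a single attention head can implement: a content-based lookup, where the query token retrieves the value stored at whichever earlier position carries the matching key.

    # Toy sketch: scaled dot-product attention used as an associative lookup.
    # Keys/queries are one-hot here; in a real model the Q/K projections are learnt.
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    keys = np.array([
        [1.0, 0.0, 0.0],   # position 0 stores key 0
        [0.0, 1.0, 0.0],   # position 1 stores key 1
        [0.0, 0.0, 1.0],   # position 2 stores key 2
        [0.0, 0.0, 0.0],   # position 3 is the query token (no key of its own)
    ])
    values = np.array([[10.0], [20.0], [30.0], [0.0]])
    queries = np.array([
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 1.0],   # the query token asks: which value goes with key 2?
    ])

    # The sharpness factor stands in for what trained projections would learn.
    scores = (queries @ keys.T) * 10.0
    attn = softmax(scores, axis=-1)
    out = attn @ values

    print(out[3])  # ~[30.], the value associated with key 2

The point isn't that this toy is impressive; it's that attention is already a programmable routing/lookup mechanism, and stacks of such heads composed with feed-forward layers can implement far richer in-context computation.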

For a practical example of 'understanding' (to use the term loosely), give an instruct/chat-tuned model the text of an article and ask it something like "What questions should this article answer, but doesn't?" This requires not just extracting phrases from a source, but understanding the context of the article on several levels, then reasoning about what the article is not asserting. Even comparatively simple 4x7B MoE models can do this effectively.
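
If you want to try that experiment yourself, here's a rough sketch assuming an OpenAI-compatible endpoint (e.g. a local llama.cpp or vLLM server); the base_url, api_key, model name, and file path are placeholders, not real values.

    # Sketch: ask a chat-tuned model what an article fails to answer.
    from openai import OpenAI

    # Placeholder endpoint/credentials for a local OpenAI-compatible server.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    article = open("article.txt").read()  # any article text you want to probe

    resp = client.chat.completions.create(
        model="local-moe-instruct",  # placeholder for whatever model the server exposes
        messages=[
            {"role": "system", "content": "You are a careful editorial reviewer."},
            {"role": "user", "content": (
                "Here is an article:\n\n" + article +
                "\n\nWhat questions should this article answer, but doesn't?"
            )},
        ],
        temperature=0.3,
    )

    print(resp.choices[0].message.content)

A model that were merely stitching phrases together would tend to echo the article back; answering this prompt well requires representing what the article claims and then identifying what is absent.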


