There is an important piece missing from the explanation of the autocomplete analogy.
The combination of encoding/tokenizing meanings, ideas, and related concepts, and mapping those relationships in vector space, makes LLMs not so much glorified text-prediction engines as browsers/oracles of the sum total of cultural-linguistic knowledge captured in the training corpus.
Understanding how the implicit and explicit linguistic, memetic, and cultural context is integrated into the idea/concept/text prediction engine helps show how LLMs produce such convincing output and why they can often bring useful information to the table.
More importantly, understanding this holistically can help people predict where the output an LLM generates will -not- be particularly useful, or may even be wildly misleading.
What they capture is not knowledge; it's word relationships.
And that can indeed be powerful, useful, and valuable. They're a tool I'm grateful to have in my armoury: a torch I can use to shine light into areas of human knowledge that would otherwise be prohibitively difficult to access.
But they're information retrieval machines, not knowledge engines.
I’d argue that they extract knowledge from the training corpus in the same way that knowledge can be encapsulated in a book… it’s just words, after all.
Tokenization goes well beyond words and punctuation. Knowledge and relationships between concepts, reactions, emotions, values, attitudes, and actions all get included in the vector space.
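To make the "relationships in vector space" idea concrete, here's a toy Python sketch. The four-dimensional vectors are numbers I made up purely for illustration (real models learn hundreds or thousands of dimensions from text), but the mechanism, comparing directions with cosine similarity, is the real one:

    import numpy as np

    # Toy 4-dimensional "embeddings" -- invented for illustration only.
    # Real models learn far higher-dimensional vectors from the training corpus.
    vectors = {
        "grief":   np.array([0.9, 0.8, 0.1, 0.0]),
        "sadness": np.array([0.8, 0.9, 0.2, 0.1]),
        "joy":     np.array([0.7, -0.6, 0.3, 0.1]),
        "invoice": np.array([0.0, 0.1, 0.9, 0.8]),
    }

    def cosine(a, b):
        """Cosine similarity: near 1.0 means the vectors point the same way."""
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    for word in ("sadness", "joy", "invoice"):
        print(f"grief vs {word}: {cosine(vectors['grief'], vectors[word]):.2f}")
    # 'grief' lands nearer 'sadness' than 'invoice'.

Note what this does and doesn't show: "grief" sits closer to "sadness" than to "invoice", which is a relationship between words, not an experience of grief.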
But it can also come to wrong conclusions, of course.
Ultimately they are information extraction engines that are controlled by semantic search.
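As a simplified picture of what "controlled by semantic search" looks like in practice, here's a minimal retrieval sketch. It assumes the sentence-transformers Python package and the all-MiniLM-L6-v2 model, which are just common choices I picked for illustration, nothing specific to what's being discussed:

    # Minimal "retrieve by meaning" sketch. Assumes sentence-transformers is
    # installed; the model name is a common default, nothing special about it.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    corpus = [
        "The symphony premiered in Vienna in 1824.",
        "Cosine similarity measures the angle between two vectors.",
        "The recipe calls for two eggs and a cup of flour.",
    ]
    query = "How do you compare embeddings?"

    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = model.encode(corpus)        # one vector per document
    query_vec = model.encode([query])[0]   # one vector for the query

    # Rank documents by cosine similarity to the query vector.
    scores = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
    print(corpus[int(np.argmax(scores))])  # expect the cosine-similarity sentence

That's retrieval, not understanding: the ranking only says which stored text points in roughly the same direction as the question.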
They aren’t smart.
But it turns out that in the same way that an infinitely large and detailed choose-your-own-adventure book, read at 120 pages per second, could be indistinguishable from a simulation of reality, the free traversal of the entire wealth of human culture and knowledge is similarly difficult to distinguish from intelligence.
In the end it may boil down to the simulation vs reality argument.
They extract information in much the same way that an educated but naive reader can extract information from a book (thousands of times quicker, of course).
But there's a lot more than that going on, both when a book is written, and when it's read by a reader with life experience. A book is an encoding and transmission medium for knowledge - and a very good one - but it isn't the knowledge itself.
Like a musical score for an orchestral symphony isn't the symphony itself. (Granted, reading a score and synthesizing an orchestra is well within the grasp of the models we have now).
Poetry is perhaps the ultimate expression of this, but it holds even at a more factual level - I could read a dozen books on a given religion, and although I might come away with more in the way of historical fact or even theological argument, I'd still know less about it than somebody who was raised in that religion. The same goes for any profession, hobby, or craft.
Encoding the relationships between the words we use for different emotions in a vector space doesn't mean a model knows the least thing about those emotions, even though it can do an excellent job of convincing us that it does in a Turing-test scenario.
It’s also why they can produce such hard-to-identify bullshit and harmful output. I’ve had some really convincing, yet fundamentally flawed, code output that I might have just used if I hadn’t done about a million code reviews before.
And been totally screwed later.
Near as I can tell, the fact that the bullshit is so much more convincing with them is a huge detriment, one that society really won’t learn to appreciate until it’s gotten really bad. As I noted in another thread, it allows people to get much further into the ‘fake it until you make it’ hole than they otherwise would.
The fact that it’s fine 90% of the time is what actually makes it all worse.