If you ask an LLM what color the sky is, it might say purple, but if you give it a paragraph describing the atmosphere and then ask the same question, it will almost always answer correctly. I don't think hallucinations are as big a problem as people make them out to be.
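To be concrete about what "give it a paragraph first" means, here's a rough sketch of the grounding pattern; `ask` is just a stand-in for whatever completion call you actually make, and the reference text is illustrative:

    # Sketch of grounding a question with reference text before asking it.
    # `ask(prompt)` is a stand-in for whatever completion/chat API call you use.

    REFERENCE = (
        "Earth's atmosphere scatters shorter (blue) wavelengths of sunlight "
        "more strongly than longer ones, so the daytime sky looks blue."
    )

    def grounded_ask(ask, question):
        prompt = (
            "Answer the question using only the reference text below.\n\n"
            f"Reference: {REFERENCE}\n\n"
            f"Question: {question}"
        )
        return ask(prompt)

    # grounded_ask(ask, "What color is the sky?")  # reliably "blue" in my experience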
Are you just writing negative posts without even seeing the product? The system queries the internet, aggregates that information, and writes an answer based on your query.
The problem with LLMs is not a data problem. LLMs are stupid even on data they just generated.
One recent catastrophic failure I found: Ask an LLM to generate 10 pieces of data. Then in a second input, ask it to select (say) only numbers 1, 3, and 5 from the list. The LLM will probably return results numbered 1, 3, and 5, but chances are at least one of them will actually copy the data from a different number.
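If anyone wants to reproduce it, the test is just two turns in the same conversation. A rough sketch, with a hypothetical `ask(history, prompt)` helper wrapping whatever chat API you use:

    # Two-turn repro of the selection failure described above.
    # `ask(history, prompt)` is a hypothetical helper: it sends the running
    # conversation plus the new prompt and returns the model's reply text.

    def run_selection_test(ask):
        history = []

        prompt1 = "Generate a numbered list of 10 short, distinct facts."
        reply1 = ask(history, prompt1)
        history += [("user", prompt1), ("assistant", reply1)]

        prompt2 = "Now return only items 1, 3, and 5 from your list, copied verbatim."
        reply2 = ask(history, prompt2)

        # Compare reply2 against reply1 by hand (or with a diff): in my runs,
        # at least one "selected" item is often the text of a different number.
        return reply1, reply2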
I'm absolutely not bullish on LLMs, but I think this is kinda judging a fish on its ability to climb a tree.
LLMs work from typical constructions of text, not an understanding of what the text means. If you ask one what color the sky is, it'll find what text usually follows a sentence like that and try to construct a response from it.
If you ask it the answer to a math question, the only way it could reliably figure it out is if it has an exact copy of that math question in its database. Asking it to choose things from a list is kinda like that, but one could imagine the designers supplementing that manually with a technique other than a pure LLM.
Any ideas why that misnumbering happens? It sounds like a very basic thing to get wrong. And as a fallback, it could be brute-force kludged with an extra pass that appends the output list to the prompt.
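Something like this is what I had in mind for the kludge (sketch only; `ask(prompt)` is a placeholder for a single-turn call to whatever model you're using):

    # Sketch of the extra verification pass: feed the original list and the
    # model's selection back in together, and ask it to fix any mismatched copies.
    # `ask(prompt)` is a placeholder, not a real API.

    def verify_selection(ask, original_list, selection, wanted=(1, 3, 5)):
        wanted_str = ", ".join(str(n) for n in wanted)
        prompt = (
            "Original numbered list:\n"
            f"{original_list}\n\n"
            f"Selection that should contain items {wanted_str}, copied verbatim:\n"
            f"{selection}\n\n"
            "For each selected item, check that its text matches the item with the "
            "same number in the original list. Return a corrected selection."
        )
        return ask(prompt)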
> You might be surprised to learn that I actually think LLMs have the potential to be not only fun but genuinely useful. “Show me some bullshit that would be typical in this context” can be a genuinely helpful question to have answered, in code and in natural language — for brainstorming, for seeing common conventions in an unfamiliar context, for having something crappy to react to.
> Alas, that does not remotely resemble how people are pitching this technology.
Why does this get downvoted so heavily? It's my experience running LLMs in production. At scale, hallucinations are not a huge problem when you have reference material.