biofox's comments | Hacker News

Thank you. You just unlocked a repressed trauma.

I ask for confidence scores in my custom instructions / prompts, and LLMs do surprisingly well at estimating their own knowledge most of the time.
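For anyone curious, the instruction itself is nothing fancy; mine is a couple of plain-language lines along these lines (wording is illustrative, not a magic formula):

  After any factual claim, append a confidence estimate (0-100%) reflecting how
  likely you think the claim is to be correct. If your confidence is below ~50%,
  say "I'm not sure" instead of guessing, and state what you would need to check.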


I’m with the people pushing back on the “confidence scores” framing, but I think the deeper issue is that we’re still stuck in the wrong mental model.

It’s tempting to think of a language model as a shallow search engine that happens to output text, but that metaphor doesn’t actually match what’s happening under the hood. A model doesn’t “know” facts or measure uncertainty in a Bayesian sense. All it really does is traverse a high‑dimensional statistical manifold of language usage, trying to produce the most plausible continuation.

That’s why a confidence number that looks sensible can still be as made up as the underlying output, because both are just sequences of tokens tied to trained patterns, not anchored truth values. If you want truth, you want something that couples probability distributions to real world evidence sources and flags when it doesn’t have enough grounding to answer, ideally with explicit uncertainty, not hand‑waviness.

People talk about hallucination like it’s a bug that can be patched at the surface level. I think it’s actually a feature of the architecture we’re using: generating plausible continuations by design. You have to change the shape of the model or augment it with tooling that directly references verified knowledge sources before you get reliability that matters.


Solid agree. Hallucination for me IS the LLM use case. What I am looking for are ideas that may or may not be true that I have not considered and then I go try to find out which I can use and why.


In essence it is a thing that is actually promoting your own brain… seems counter intuitive but that’s how I believe this technology should be used.


This technology (which I had a small part in inventing) was not based on intelligently navigating the information space; it's fundamentally based on forecasting your own thoughts by weighting your pre-linguistic vectors and feeding them back to you. Attention layers, in conjunction with what came later, allowed that to be grouped at a higher order and to scan a wider beam space, rewarding higher-complexity answers.

When trained on chatting (a reflection system on your own thoughts), it mostly just uses a false mental model to pretend to be a separate intelligence.

Thus the term stochastic parrot (which for many of us is actually pretty useful).


Thanks for your input - great to hear from someone involved that this is the direction of travel.

I remain highly skeptical of the idea that it will replace anyone - the biggest danger I see is people falling for the illusion that the thing is intrinsically smart when it's not. It can be highly useful in the hands of disciplined people who know a particular area well and can augment their productivity, no doubt. But the way we humans come up with ideas and so on is highly complex. Personally, my ideas come out of nowhere and are mostly derived from intuition that can only be expressed in logical statements ex post.


Is intuition really that different from an LLM having little knowledge about something? It's just responding with the most likely sequence of tokens using the information most adjacent to the topic... just like your intuition.


With all due respect, I'm not even going to give a proper response to this… intuition that yields great ideas is based on deep understanding. LLMs exhibit no such thing.

These comparisons are becoming really annoying to read.


I think you need to first understand what the word intuition means, before writing such a condescending reply.

Meant to say prompting*


>A model doesn’t “know” facts or measure uncertainty in a Bayesian sense. All it really does is traverse a high‑dimensional statistical manifold of language usage, trying to produce the most plausible continuation.

And is that so different from what we do behind the scenes? Is there a difference between an actual fact and some false information stored in our brain? Or do both have the same representation in some kind of high‑dimensional statistical manifold in our brains, and do we also "try to produce the most plausible continuation" using them?

There might be one major difference, though, at a different level: what we're fed (read, see, hear, etc.) we also evaluate before storing. Does LLM training do that, beyond some kind of manually assigned crude "confidence tiers" applied to input material during training (e.g. trust Wikipedia more than Reddit threads)?


I would say it's very different to what we do. Go to a friend and ask them a very niche question. Rather than lie to you, they'll tell you "I don't know the answer to that". Even if a human absorbed every single bit of information a language model has, their brain probably could not store and process it all. Unless they were a liar, they'd tell you they don't know the answer either! So I personally reject the framing that it's just like how a human behaves, because most of the people I know don't lie when they lack information.


>Go to a friend and ask them a very niche question. Rather than lie to you, they'll tell you "I don't know the answer to that"

Don't know about that, bullshitting is a thing. Especially online, where everybody pretends to be an expert on everything, and many even believe it.

But even if so, is that because of some fundamental difference between how a human and an LLM store/encode/retrieve information, or more because it has been instilled into a human through negative reinforcement (other people calling them out, shame of correction, even punishment, etc) not to make things up?


I see you haven’t met my brother-in-law.

Hallucinations are a feature of reality that LLMs have inherited.

It’s amazing that experts like yourself who have a good grasp of the manifold MoE configuration don’t get that.

LLMs, much like humans, weight high-dimensional features across the entire model's manifold and then string together the best-weighted attentive answer.

Just like your doctor occasionally gives you wrong advice too quickly, this sometimes gets confused, either by lighting up too much of the manifold or by having insufficient expertise.


I asked Gemini the other day to research and summarise the pinout configuration for CANbus outputs on a list of hardware products, and to provide references for each. It came back with a table summarising pin outs for each of the eight products, and a URL reference for each.

Of the 8, 3 were wrong, and the references contained no information about pin outs whatsoever.

That kind of hallucination is, to me, entirely different than what a human researcher would ever do. They would say “for these three I couldn’t find pinouts” or perhaps misread a document and mix up pinouts from one model for another.. they wouldn’t make up pinouts and reference a document that had no such information in it.

Of course humans also imagine things, misremember etc, but what the LLMs are doing is something entirely different, is it not?


Humans are also not rewarded for making pronouncements all the time. Experts actually have a reputation to maintain and are likely more reluctant to give opinions that they are not reasonably sure of. LLMs trained on typical written narratives found in books, articles, etc. can be forgiven for thinking that they should have an opinion on anything and everything. The point being that while you may be able to tune it to behave some other way, you may find the new behavior less helpful.


Newer models can run a search and summarize the pages. They're becoming just a faster way of doing research, but they're still not as good as humans.


> Hallucinations are a feature of reality that LLMs have inherited.

Huh? Are you arguing that we still live in a pre-scientific era where there’s no way to measure truth?

As a simple example, I asked Google about houseplant biology recently. The answer was very confidently wrong telling me that spider plants have a particular metabolic pathway because it confused them with jade plants and the two are often mentioned together. Humans wouldn’t make this mistake because they’d either know the answer or say that they don’t. LLMs do that constantly because they lack understanding and metacognitive abilities.


>Huh? Are you arguing that we still live in a pre-scientific era where there’s no way to measure truth?

No. A strange way to interpret their statement! Almost as if you ...hallucinated their intent!

They are arguing that humans also hallucinate: "LLMs much like humans" (...) "Just like your doctor occasionally giving you wrong advice too quickly".

As an aside, there was never a "pre-scientific era where there [was] no way to measure truth". Prior to the rise of modern science fields, there have still always been objective ways to judge truth in all kinds of domains.


Yes, that’s basically the point: what are termed hallucinations in LLMs are different from what we see in humans – even the confabulations that people with severe mental disorders exhibit tend to have some kind of underlying order or structure to them. People detect inconsistencies in their own behavior and that of others, which is why even that rushed doctor in the original comment won’t suggest something wildly off the way LLMs routinely do. They might make a mistake or have incomplete information, but they will suggest things which fit a theory based on their reasoning and understanding, which yields errors at a lower rate and of a different class.


> Hallucinations are a feature of reality that LLMs have inherited.

Really? When I search for cases on LexisNexis, it does not return made-up cases which do not actually exist.


When you ask humans, however, there are all kinds of made-up "facts" they will tell you. Which is the point the parent makes (in the context of comparing to an LLM), not whether some legal database has wrong cases.

Since your example comes from the legal field, you'll probably know very well that even well-intentioned witnesses who aren't actively trying to lie can still hallucinate all kinds of bullshit, and even be certain of it. Even for eyewitnesses, you can ask 5 people and get several different, incompatible descriptions of a scene or an attacker.


>When you ask humans, however, there are all kinds of made-up "facts" they will tell you. Which is the point the parent makes (in the context of comparing to an LLM), not whether some legal database has wrong cases.

Context matters. This is the context in which LLMs are being commercially pushed to me. Legal databases also inherit from reality, as they consist entirely of things from the real world.


It's not even a manifold https://arxiv.org/abs/2504.01002


A different way to look at it is that language models do know things, but the contents of their own knowledge are not one of those things.


You have a subtle sleight of hand.

You use the word “plausible” instead of “correct.”


That’s deliberate. “Correct” implies anchoring to a truth function the model doesn’t have. “Plausible” is what it’s actually optimising for, and the disconnect between the two is where most of the surprises (and pitfalls) show up.

As someone else put it well: what an LLM does is confabulate stories. Some of them just happen to be true.


It absolutely has a correctness function.

That’s like saying linear regression produces plausible results. Which is true but derogatory.


Do you have a better word that describes "things that look correct without definitely being so"? I think "plausible" is the perfect word for that. It's not a sleight of hand to use a word that is exactly defined as the intention.


I mean... That is exactly how our memory works. So in a sense, the factually incorrect information coming from LLM is as reliable as someone telling you things from memory.


But not really? If you ask me a question about Thai grammar or how to build a jet turbine, I'm going to tell you that I don't have a clue. I have more of a meta-cognitive map of my own manifold of knowledge than an LLM does.


Try it out. Ask "Do you know who Emplabert Kloopermberg is?" and ChatGPT/Gemini literally responded with "I don't know".

You, on the other hand, truly have never encountered any information about Thai grammar or (surprisingly) how to build a jet turbine. (I can explain in general terms how to build one just from watching the Discovery Channel.)

The difference is that the models actually have some information on those topics.


How do you know the confidence scores are not hallucinated as well?


They are; the model has no inherent knowledge about its confidence levels, it just adds plausible-sounding numbers. Obviously they _can_ be plausible, but trusting these is just another level up from trusting the original output.

I read a comment here a few weeks back that LLMs always hallucinate, but we sometimes get lucky when the hallucinations match up with reality. I've been thinking about that a lot lately.


> the model has no inherent knowledge about its confidence levels

Kind of. See e.g. https://openreview.net/forum?id=mbu8EEnp3a, but I think it was established already a year ago that LLMs tend to have identifiable internal confidence signal; the challenge around the time of DeepSeek-R1 release was to, through training, connect that signal to tool use activation, so it does a search if it "feels unsure".
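For anyone who wants to poke at this, here's a minimal sketch of the crudest externally visible proxy: per-token output probabilities via HuggingFace transformers. To be clear, this is not the internal probe studied in the linked paper, and raw token probabilities are known to be poorly calibrated; it's just an easy way to see that the model does carry some uncertainty signal beyond the text it prints.

  # Crude proxy only: per-token probabilities of the generated answer,
  # not the internal probe direction from the paper linked above.
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  name = "gpt2"  # any local causal LM checkpoint works for the illustration
  tok = AutoTokenizer.from_pretrained(name)
  model = AutoModelForCausalLM.from_pretrained(name)

  inputs = tok("The capital of Australia is", return_tensors="pt")
  out = model.generate(**inputs, max_new_tokens=5, do_sample=False,
                       return_dict_in_generate=True, output_scores=True)

  # Probability the model assigned to each token it actually emitted.
  new_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
  for step_logits, tok_id in zip(out.scores, new_tokens):
      p = torch.softmax(step_logits[0], dim=-1)[tok_id].item()
      print(f"{tok.decode(int(tok_id))!r}  p={p:.3f}")
  # Low probabilities hint at uncertainty, but this is not a calibrated
  # confidence score; hence the interest in trained probes / tool-use triggers.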


Wow, that's a really interesting paper. That's the kind of thing that makes me feel there's a lot more research to be done "around" LLMs and how they work, and that there's still a fair bit of improvement to be found.


In science, before LLMs, there's this saying: all models are wrong, some are useful. We model, say, gravity as 9.8m/s² on Earth, knowing full well that it doesn't hold true across the universe, and we're able to build things on top of that foundation. Whether that foundation is made of bricks, or is made of sand, for LLMs, is for us to decide.


It doesn't hold true across the universe? I thought this was one of the more universal things like the speed of light.


G, the gravitational constant is (as far as we know) universal. I don't think this is what they meant, but the use of "across the universe" in the parent comment is confusing.

g, the net acceleration from gravity and the Earth's rotation is what is 9.8m/s² at the surface, on average. It varies slightly with location and altitude (less than 1% for anywhere on the surface IIRC), so "it's 9.8 everywhere" is the model that's wrong but good enough a lot of the time.
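To put rough numbers on it, here's a back-of-the-envelope check using g = G·M/r² with standard values for G, Earth's mass and mean radius (ignoring rotation and local density variation):

  # Back-of-the-envelope: g = G*M / r^2, ignoring Earth's rotation and
  # local density variation.
  G = 6.674e-11        # gravitational constant, m^3 kg^-1 s^-2
  M_EARTH = 5.972e24   # kg
  R_EARTH = 6.371e6    # mean radius, m

  def g_at(altitude_m):
      r = R_EARTH + altitude_m
      return G * M_EARTH / r ** 2

  print(f"sea level:      {g_at(0):.3f} m/s^2")        # ~9.82
  print(f"Everest summit: {g_at(8_849):.3f} m/s^2")    # ~9.79, well under 1% lower
  print(f"ISS altitude:   {g_at(400_000):.3f} m/s^2")  # ~8.69, clearly not 9.8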


It doesn't even hold true on Earth! Nevermind other planets being of different sizes making that number change, that equation doesn't account for the atmosphere and air resistance from that. If we drop a feather that isn't crumpled up, it'll float down gently at anything but 9.8m/s². In sports, air resistance of different balls is enough that how fast something drops is also not exactly 9.8m/s², which is why peak athlete skills often don't transfer between sports. So, as a model, when we ignore air resistance it's good enough, a lot of the time, but sometimes it's not a good model because we do need to care about air resistance.


Gravity isn't 9.8m/s/s across the universe. If you're at higher or lower elevations (or outside the Earth's gravitational pull entirely), the acceleration will be different.

Their point was the 9.8 model is good enough for most things on Earth, the model doesn't need to be perfect across the universe to be useful.


g (lower case) is literally the gravitational acceleration at Earth's surface. It's universally true, as there's only one Earth in this universe.

G is the gravitational constant, which is also universally true (erm... to the best of our knowledge); g is calculated using the gravitational constant.


They 100% are, unless you provide a RUBRIC / basically make it ordinal.

"Return a score of 0.0 if ...., Return a score of 0.5 if .... , Return a score of 1.0 if ..."


LLMs fail at causal accuracy. It's a fundamental problem with how they work.


Asking an LLM to give itself a «confidence score» is like asking a teenager to grade his own exam. LLMs don’t «feel» uncertainty and confidence like we do.

There has certainly been an overreaction, and it continues to be the case even after efforts have been walked back.

I have yet to hear a good justification for why people who are not interested in programming should be encouraged to become interested purely in the name of equality, yet my institution is still spending huge amounts of public money on trying to achieve exactly that.


Because "not being interested in programming" might not be the only cause for the lack of representation.


Flaky when under pressure? Irritating results? Sites look and feel better without it?

Sounds appropriate to me.


R and Matlab workflows have been fairly stable for the past decade. Why is the Python ecosystem so... unstable? It puts me off investing any time in it.


The R ecosystem has had a similar evolution with the tidyverse; it was just a bit longer ago. As for Matlab, I initially learned statistical programming with it a long time ago, but I’m not sure I’ve ever seen it in the wild. I don’t know what’s going on there.

I’m actually quite partial to R myself, and I used to use it extensively back when quick analysis was more valuable to my career. Things have probably progressed, but I dropped it in favor of python because python can integrate into production systems whereas R was (and maybe still is) geared towards writing reports. One of the best things to happen recently in data science is the plotnine library, bringing the grammar of graphics to python imho.

The fact is that today, if you want career opportunities as a data scientist, you need to be fluent in python.


Mostly what's going on with Matlab in the wild is that it costs at least $10k a seat as soon as you are no longer at an academic institution.

Yes, there is Octave but often the toolboxes aren't available or compatible so you're rewriting everything anyway. And when you start rewriting things for Octave you learn/remember what trash Matlab actually is as a language or how big a pain doing anything that isn't what Mathworks expects actually is.

To be fair: Octave has extended Matlab's syntax with amazing improvements (many inspired by numpy and R). It really makes me angry that Mathworks hasn't stolen Octave's innovations, and whenever I have to touch actual Matlab I hate every minute of not being able to broadcast and of having to manually create temp variables because you can't chain indexing. So, to be clear, Octave is somewhat pleasant and, for pure numerical syntax, superior to numpy.

But the siren call of Python is significant. Python is not the perfect language (for anything really) but it is a better-than-good language for almost everything and it's old enough and used by so many people that someone has usually scratched what's itching already. Matlab's toolboxes can't compete with that.


I love R, but how can you make that claim when R uses three distinct object-oriented systems all at the same time? R might seem stable only because it carries along with it 50 years of programming language history (part of its charm: where else can you see the generic-function approach to OOP in a language that's still evolving?)

Finally, as someone who wrote a lot of R pre-tidyverse, I've seen the entire ecosystem radically change over my career.


The pandas workflows have also been stable for the last decade. That there is a new kid on the block (polars) does not make the existing stuff any less stable. And one can just continue writing pandas for the next decade too.


Outside bioconductor or the tidyverse, packages in R can be just as unstable due to CRAN's package requirements.


The red queen effect. We'll end up all driving flood-lit bulldozers.


The most egregious example is old CGA games that were written to work on composite monitors. Without the composite display, they appear monochrome or cyan and magenta.

https://user-images.githubusercontent.com/7229541/215890834-...

It blew my mind when I finally learnt this, as I spent years of my childhood playing games that looked like the examples on the left, not realising the colours were due to the RGB monitor I had.


Oh you’re blowing my mind right now, played lots of CGA games with neon colours as a kid. What did they look like on a composite monitor?

Also, are you able to tell me the name of the game in the second row in that screenshot?


I hate that languages have become fads. The concepts have not changed, but there is a constant churn of languages.

I don't have to relearn natural language every 5-10 years, but for some reason I'm expected to when it comes to programming.


>Never underestimate the value of play!

I don't. It's my boss who doesn't see the value :(


For those wondering, here's the excerpt:

"There's been a big debate about the very use of the word Anglo-Saxon, so much so that people in our field of scholarship are not using the word [...] because of the connotations that Anglo-Saxon now has with the far right"

Soon we won't be able to use the word "right" to indicate direction. It'll be left and not-left.


The point made is a perfectly sensible one if you include a bit more context:

> Amid all the battles and conquest, however, Æthelstan brought a cosmopolitan flair to his new kingdom. Today there is a tendency – particularly among the far right – to depict early England as being cut off from the rest of Europe, and homogeneous in its cultural makeup. In truth, the newly formed kingdom of England was an outward-looking society.

> "There's been a big debate about the very use of the word Anglo-Saxon, so much so that people in our field of scholarship are not using the word anymore, and are going towards Early Medieval instead because of the connotations that Anglo-Saxon now has with the far right," says Woodman. "When [the term Anglo-Saxon] is invoked by the far right, they're thinking of it as very one-dimensional – people from one background in England in the 10th Century."

> In fact, that's a big misconception of what the period was like. "It was actually a very diverse place in early 10th Century. I always think about Æthelstan's Royal Assemblies, and there were people there from lots of different kingdoms within England, Britain more widely, from Europe. They were speaking a multiplicity of languages, Old Welsh, Old Norse, Old English, Latin. I just feel [the term Anglo-Saxon] is used without thinking, and without factual detail about the early 10th Century."

> Downham agrees. "There was a lot of cultural variety in the area we call England today. There wasn't this English monolith that started in 500AD."


Same as it ever was... The first Homo sapiens found Homo neanderthalis in Europe, and said, "Ooh, nice art! Wanna fuck?".

There never was an age of a "pure race", however you try to define it.


We're a mutt species, inbreeding frequently, all descendants of Australopithecus, hooking up with second and third cousins.

It's incest all the way down.


No doubt even back then there were homos on both sides of the equation clutching their pearls and wailing about racial purity.

(While not being so principled that they would put any effort into avoiding the benefits offered by such blending.)


Afaict, a good part of racism comes from the fact that the human brain automatically labels people different from you as an enemy. I assume this was a very useful strategy for millennia; if you saw a person not belonging to your tribe/nation, that probably meant armed conflict a decent chunk of the time.

This applies to any kind of difference: clothing, culture, neurodivergence, etc. I live in a country with zero racism against black people (all the racism is reserved for immigrants, who only differ by culture, not color), yet you can tell black people are still a bit left out in many social settings.

Due to black population being very low, people rarely ever see a black person here. When they see one, for the brain it's an unknown/unfamiliar, and that means potential danger at the subconscious / first impression level. To put it simply, people get weirded out. Ofc, once you get to know them that goes away.

(This is not to condone conscious racism: goyim, plantation workers and all.)


From what I've been told by friends working within several universities, the only people who avoid "Anglo-Saxon" are fanatics of another kind. It's all very childish.

