Hacker News | janalsncm's comments

Blacklisting someone for sending a polite cold email to the CEO is bananas. No company worth working for will do this. Worst case is they will ignore you.

In that case it seems there is good upside and minimal downside. The upside is a high chance of getting an interview. The downside is reducing your already very low baseline probability closer to zero.

This is a lot of words to say: you have nothing to lose by doing this.


Transformers are just a special kind of binary that is run by inference code. Where the rubber meets the road is whether the inference setup is deterministic. There’s some literature on this: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
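
For a toy illustration of why bitwise determinism is harder than it sounds (my own sketch, not from the post): floating point addition isn’t associative, so the same numbers summed in a different reduction order (say, under a different batch size) can come out slightly different.

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.standard_normal(1_000_000).astype(np.float32)

  # Same numbers, two different summation orders.
  s1 = x.sum()
  s2 = x.reshape(1000, 1000).sum(axis=0).sum()

  # Usually prints False and a tiny nonzero difference.
  print(s1 == s2, abs(float(s1) - float(s2)))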

I don’t think the issue is determinism per se but chaotic predictions that are difficult to rely on.


I agree they could be chaotic, but I think that’s an important distinction.

It’s possible that they could be using fallback models during peak load times (West Coast midday). I assume your traffic would be routed to an East Coast data center, though. But secretly routing traffic to a worse model is a bit shady, so I’d want some concrete numbers quantifying the worse performance.

To be clear, the company has very directly denied doing this.

They did, yes, but should we trust them?

I clearly remember this problem happening in the past, despite their claims. I initially thought it was an elaborate hoax, but it turned out to be factually true in my case.


I tend to think it would be very hard and very risky for large, successful companies to systematically lie about these things without getting caught, and the people who would be doing the lying in this case are not professional liars, they’re engineers who generally seem trustworthy. So yes, if there is a degradation, I think bugs are much more likely than systematic lying.

The TPU implementation used approximate top-k instead of the exact top-k used on Nvidia. That alone wouldn't have mattered too much, and there was a bug with it besides, but not using exact top-k from the beginning was still a cost-saving choice, because exact top-k wasn't efficient on the TPUs they were routing to under load. So there was a bit of a model difference under load, even aside from the bug.
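
For anyone curious, the two code paths are visible in JAX's public API (purely illustrative, not Google's serving code; the sizes and recall target below are made up):

  import jax

  logits = jax.random.normal(jax.random.PRNGKey(0), (32_000,))

  # Exact top-k: precise, but comparatively expensive on TPU.
  exact_vals, exact_idx = jax.lax.top_k(logits, 50)

  # Approximate top-k: much cheaper on TPU, but only guarantees a recall
  # target, so the candidate set can differ slightly from the exact one.
  approx_vals, approx_idx = jax.lax.approx_max_k(logits, 50, recall_target=0.95)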

To the extent this is an accurate characterization (somewhat, I think), they considered the quality difference a bug and fixed it!

In general, I think when we are evaluating fuzzy things like this we should come up with specifications for what we would like to see before performing the eval. Not saying it happened here, but very often I see people impressed with “answer-shaped” answers rather than objectively assessing the actual quality. The latter is harder and requires specific expertise.

It is probably a good lesson on how far confidence can get you in life. People are often highly biased by the presentation of the thing.


Excited to see more hardware competition at this level. Models that can run in this amount of RAM are right in the sweet spot: small enough to train on consumer-grade GPUs (e.g. a 4090) but big enough to do something interesting (simple audio/video/text processing).

The price point is still a little high for most tasks but I’m sure that will come down.


This is kind of just a measurement of how well represented a language is in the tokenizer’s training distribution. You could have a single token equal to “public static void main”.

If you look at the list, you'll see that you're incorrect, as C and JavaScript are not at the top.

Seeing all the C languages and JavaScript at the bottom like this makes me wonder if it's not just that curly brackets take a lot of tokens.


I imagine that having to write

  for (int index = 0; index < size; ++index)
instead of

  for index in 0...size
eats up a lot of tokens, especially in C where you also need this construct for iterating over arrays.
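
If anyone wants to measure rather than guess, it's a one-liner with whatever tokenizer you care about (tiktoken below is just an example; counts vary by vocabulary):

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")
  c_loop = "for (int index = 0; index < size; ++index)"
  terse_loop = "for index in 0...size"

  # Compare how many tokens each loop header costs.
  print(len(enc.encode(c_loop)), len(enc.encode(terse_loop)))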

Well, yes. Looking beyond token efficiency, I find that the more constrained the language (stronger and richer static typing), the better and faster the LLM deals with it: fewer rounds of editing and debugging, ergo fewer tokens. C is a nightmare.

The most efficient languages are pretty unpopular, so does this argument make them even more efficient in reality?

You could, but you wouldn't, since those keywords can each vary in otherwise equivalent contexts.

The BPE or WordPiece tokenization algorithm is greedy: it takes the longest valid token matching the start of the text. So if your text starts with “public static void main”, it will look for the longest token matching that prefix. Even if “public” is a token on its own, it will prefer to tokenize “public static” together if that’s in the vocabulary.
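
You can check how a particular tokenizer actually splits it (tiktoken here purely as an example; other models’ vocabularies will split differently):

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")
  for text in ["public", "public static", "public static void main"]:
      ids = enc.encode(text)
      # Decode each id individually to see where the splits landed.
      print(text, "->", [enc.decode([i]) for i in ids])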

Yes, but then you have both alternatives as tokens, which nullifies GP's argument.

What do you mean?

`public` might be a token by itself, even though you can have `pub` occurring in other contexts, too.


I meant that it wouldn't be efficient to agglomerate tokens in that way, and that's why the system won't do it.

If I had to steelman Dell, they probably made a bet a while ago that the software side would have something for the NPU, and if so they wanted to have a device to cash in on it. The turnaround time for new hardware was probably on the order of years (I could be wrong about this).

It turned out to be an incorrect gamble but maybe it wasn’t a crazy one to make at the time.

There is also a chicken and egg problem of software being dependent on hardware, and hardware only being useful if there is software to take advantage of its features.

That said I haven’t used Windows in 10 years so I don’t have a horse in this race.


> There is also a chicken and egg problem of software being dependent on hardware, and hardware only being useful if there is software to take advantage of its features.

In the 90s, as a developer you couldn't count on a user's computer having a 3D accelerator (or 3D graphics card). So 3D video games shipped multiple renderers: software rendering and hardware-accelerated rendering, sometimes with different backends like Glide, OpenGL, and Direct3D.

Couldn't you simply write some "killer application" for local AI that everybody "wants", but which might be slow (even with a highly optimized CPU or GPU backend) if you don't have an NPU? Since it is a "killer application", many people will still want to run it, even if the experience is slow.

Then, as a hardware vendor, you can show off how much better the experience is with an NPU (an "AI PC"), and people will immediately want one.

Exactly the same story as for 3D accelerators and 3D graphics cards, where Quake and Quake II were such killer applications.


They are still including the NPU though; they just realised that consumers aren't making laptop purchases based on having "AI" or being branded with Copilot.

The NPU will just become a mundane internal component that isn't marketed.


To be fair, Ollama does have a GUI.


Maybe there are also engineers at Google who saw the thread yesterday and wanted to help out? I agree that companies are self-serving, but (for now) they’re made of people who are not.
