
What if I told you that one can model bidirectional attention just by recurring over causal attention, and it’s still fast enough? Hint: It’s called chain of thought.

I strongly believe it's time to discontinue diffusion models, solely on the grounds that iterated auto-regression is faster, more parallelizable, and just as potent with proper prompting techniques (unless, of course, you consider CoT a form of diffusion, which it essentially is).


I'd respectfully suggest that it's perhaps not time to "discontinue diffusion models". Minsky and Papert set AI back by decades by suggesting neural networks were a dead end because perceptrons couldn't learn XOR. There's not a chance of an HN comment having the same effect, of course, but my point is that it's easy to dismiss things prematurely.


Can you explain how CoT is a form of diffusion or models bidirectional attn?


Chain of thought is not a form of diffusion. Diffusion models clearly have characteristics that are useful and worthy of further research, and they should not be "discontinued".


Wait, so frameset is basically htmx?


I've been doing web development on and off for 30 years and in no other programming field have I seen this perpetual habit of rediscovering the wheel.


Hey, we may be spinning in circles, but at least the websites consistently get slower over the years


I’m having the same realization regarding Turbo.

Always thought it was a creatively simple solution, but had no idea the pattern had existed (and was used) for so long.


always has been (hx-target was taken directly from the target attribute on anchors, and innerHTML as the default swap was taken from how iframes behave)


This is just BS.

If you follow their paper, their "inductive" argument can also prove that the result is optimal (the sqrt(2) constant is pulled out of thin air and could be replaced by 1 without changing the proof).

My guess is that they tested some graphs, "empirically" saw a bound of sqrt(2), and then came up with a proof to self-fulfill the prophecy. They even mention that "for very small graphs, the approximation ratio may briefly exceed sqrt(2)", which is just nuts, given that they have a formal proof that it should never happen.

In case anyone reads the paper and can't figure out the mistake in the proof: they assume that OPT_T + OPT_R = OPT, which is obviously false.


It's from the same author who recently claimed "a breakthrough" triangle counting algorithm and "strong evidence for P=NP".


They built a dumber clone of Stockfish, and they call it ‘zero’ for some reason. What is the meaning behind ‘zero’ anyways? It used to refer to zero-shot, but now it seems like it’s just a marketing term.


I assume you’re referring to AlphaZero and Leela Chess Zero.

AlphaZero was the successor to AlphaGo. AZ was notable because unlike AG, it used zero human games to learn to play: it just played games against itself. Typically in “supervised” machine learning you take human data and train a model to imitate it. AZ used zero human data to learn.

Leela Chess Zero started out as an open source copy of AZ but it’s probably better than AZ now.


Zero doesn't mean zero-shot learning. It was coined by DeepMind for AlphaGo Zero, which used zero human input in its training data. It was trained entirely by playing against itself.


And they gave Sir Demis a Nobel Prize!


I assume it's 'zero' turns lookahead/search, i.e. only look at the current board state.


I don’t even know where to begin.

From phrases like 'deep learning is what makes AI possible' to a 'practice guide' that references Python, PyTorch, and Transformers, amongst other things, and apparently a 4-hour contest format, this can only be taken as a joke.

I think a more fitting name for the contest would be IOAPI, as it's probably all that a student could do in 4 hours.

Or the ‘International Olympiad of asking your parents to give you money for compute’, depending on the format, whichever you like best.

I rest my case.


One of my favourite data structures is a plain lexicographically sorted array of strings. It allows prefix search via binary search, it is memory efficient like nothing else, and it is faster than a trie in practice 100% of the time. Also, who wants to limit their API to searching a dictionary only by prefix, anyway?
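
A quick sketch of what I mean, in Python (toy word list; the chr(0x10FFFF) sentinel is just one way of getting an exclusive upper bound for the prefix):

    import bisect

    # A sorted list of strings is the whole data structure.
    words = sorted(["car", "card", "care", "cat", "dog", "door"])

    def prefix_range(words, prefix):
        # First string that is >= prefix.
        lo = bisect.bisect_left(words, prefix)
        # First position past every string starting with prefix;
        # chr(0x10FFFF) is the largest Unicode code point.
        hi = bisect.bisect_left(words, prefix + chr(0x10FFFF))
        return words[lo:hi]

    print(prefix_range(words, "car"))  # ['car', 'card', 'care']

Two binary searches, zero pointers, and the strings themselves are the only memory you pay for.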


I find it a bit odd that this tree representation as a parent array (which, I think, is the most basic representation taught in any CS course) got so much traction on HN. I think this goes to show how far a good presentation can drive a trivial idea. On top of that, it casually presents suboptimal procedures for a lot of essential operations, without diving into much detail about the impact of the suboptimality. Good PR, I guess…
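
For anyone who hasn't seen it, the representation in question is literally just this (toy Python, not taken from the article):

    # parent[i] is the index of node i's parent; -1 marks the root.
    #
    #        0
    #       / \
    #      1   2
    #     /
    #    3
    parent = [-1, 0, 0, 1]

    def path_to_root(parent, v):
        # Walk parent pointers from v up to the root.
        path = [v]
        while parent[v] != -1:
            v = parent[v]
            path.append(v)
        return path

    print(path_to_root(parent, 3))  # [3, 1, 0]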


As a competitive programmer, I’ve seen similar ‘magic’ tricks (including this one) here: https://github.com/kth-competitive-programming/kactl/blob/ma... (page 23, 10.5.1)


The story is much worse than what is presented in the article, especially when talking about floating point errors that add (or rather multiply) up.

More often than not, the error is relative to the largest magnitudes in the intermediate calculations. In essence, if you subtract two floating point numbers you're kinda screwed, because you can never handle with good precision cases like A - B where A and B are large and close to each other. Not to mention more complicated operations like trig functions.
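
A small Python illustration of that (the numbers are arbitrary, double precision assumed):

    x = 1.0 / 3.0    # relative error ~1e-16, as good as doubles get
    a = 1.0e8 + x    # A and B: large and close to each other
    b = 1.0e8

    diff = a - b     # the subtraction itself happens to be exact here...
    print(diff)      # ...but prints ~0.33333332836, not 0.33333333333
    print(diff - x)  # ~ -5e-9: the tiny rounding error made when forming
                     # `a` is now huge relative to the difference itself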

In my opinion, one should avoid floating point as much as possible. And not only when testing for equality (all comparisons suffer from this).

Or, of course, ignore FPEs and proceed at your own risk.


This is not the issue. Floating point numbers have problems when A and B differ greatly in magnitude: then A - B might easily be equal to A, and so ((A - B) - A) * LARGENUMBER can still be equal to zero, even if B was, e.g., 1.
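
Concretely (arbitrary numbers, double precision assumed):

    A = 1.0e17   # adjacent doubles are 16.0 apart at this magnitude
    B = 1.0
    print(A - B == A)            # True: B is absorbed entirely
    print(((A - B) - A) * 1e9)   # 0.0, even though B was 1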

This is not a problem with floats. This is an inherent result of the design constraints. You cannot fix it, and there are no alternatives without this flaw that offer any of the same benefits as floats.

The result of any operation between two floats is the floating point number closest to the result calculated over the reals; that is the magic of floats.

Floating point numbers are a near-magical type which, for many applications, is not only a good choice but the only one that makes any sense. It is near impossible to imagine modern engineering without them.


Indeed. The only way to accommodate numbers whose difference in order of magnitude exceeds the type's significand length is to not combine them: keep them separate, operate on them separately, and combine them only when sampling a value.

One good common example is numerical integration. In any sufficiently fine-grained a posteriori simulation, even with modest limits on position, the delta velocity can be too small to preserve when adding it to the position, i.e. when the delta is more than 2^52 times smaller than the position. Keeping a separate accumulator is the only way to handle this without arbitrarily increasing precision with software FP.
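
A rough sketch of the separate-accumulator idea in Python (made-up numbers; the base/offset split is mine):

    # The per-step delta (1e-9) is far below the spacing between adjacent
    # doubles near 1e8 (~1.5e-8), so adding it directly is lost every step.
    steps = 1_000_000
    delta = 1e-9

    naive = 1.0e8                # single accumulator
    base, offset = 1.0e8, 0.0    # large part kept separate from small deltas
    for _ in range(steps):
        naive += delta           # delta is absorbed; naive never changes
        offset += delta          # accumulates at full precision

    print(naive - 1.0e8)            # 0.0
    print((base + offset) - 1.0e8)  # ~0.001: combined only for the sample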


Isn’t this just a trie with only two levels? It seems to me that this technique isn’t particularly unknown, as it’s taught in most elementary CS courses.


I think it would be fair to say that it's a kind of funny trie-hash-table hybrid. What's neat though is that it manages to achieve better space bounds than either a trie or a hash table would on their own.

I didn't invent the algorithmic idea --- it's basically a simplification of a data structure that was published in ICALP 2003 [1]. As far as I know, the idea of combining hash tables and tries in this kind of way (despite being incredibly simple!) doesn't appear in either the theoretical or experimental literature prior to that.

I'd be interested in any other earlier sources that people might have.

[1] Raman, Rao. Succinct Dynamic Dictionaries and Trees. https://link.springer.com/chapter/10.1007/3-540-45061-0_30. ICALP, 2003.


Not quite, as the 2nd level is a hash table, which is very un-trie-like. It does share the idea that you don't store the implicit bits already known from the address of the first level, like a trie.
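
To make the "implicit bits" point concrete, a toy Python sketch (parameters and names are mine, not from the paper):

    # Two-level structure for W-bit integer keys: the top T bits pick a
    # first-level slot (like one fat trie level), and only the remaining
    # W - T low bits go into that slot's set; the top bits are implied by
    # the slot's address, so they are never stored.
    W, T = 32, 8
    slots = [set() for _ in range(1 << T)]

    def insert(key):
        top, rest = key >> (W - T), key & ((1 << (W - T)) - 1)
        slots[top].add(rest)     # rest is only W - T bits wide

    def contains(key):
        top, rest = key >> (W - T), key & ((1 << (W - T)) - 1)
        return rest in slots[top]

    insert(0xDEADBEEF)
    print(contains(0xDEADBEEF), contains(0xDEADBEE0))  # True False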

