> Section 2.6 gives the hidden state size per token, which, on first read, is strictly larger than the hidden state in normal attention
This is where you’ve gone off track. The “hidden state” in their model is a fixed-size thing, like an RNN’s, not per-token. For a transformer, the analogous “hidden state” is the KV cache, which grows with sequence length. This is why their method is linear rather than quadratic.
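To make the fixed-size-state point concrete, here’s a generic linear-attention recurrence (a sketch of the general family, not necessarily this paper’s exact formulation; the ReLU feature map and dimensions are illustrative assumptions):

```python
import numpy as np

d = 8                  # head dimension (illustrative)
rng = np.random.default_rng(0)

S = np.zeros((d, d))   # fixed-size state, like an RNN hidden state
z = np.zeros(d)        # running normalizer

for _ in range(100):   # stream 100 tokens
    q, k, v = rng.normal(size=(3, d))
    phi_k = np.maximum(k, 0)     # illustrative positive feature map
    S += np.outer(phi_k, v)      # O(d^2) update, independent of sequence length
    z += phi_k
    phi_q = np.maximum(q, 0)
    out = (phi_q @ S) / (phi_q @ z + 1e-9)   # O(d^2) readout per token

print(S.shape)   # the state never grows with sequence length
```

Contrast with softmax attention, where the KV cache stores every past k and v vector, so per-step cost and memory grow with sequence length.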
The Taylor series they derive isn’t just for softmax (after all, real implementations of softmax will likely already use a Taylor-style approximation for exp); it’s for the entire tensor-level softmax(QKᵀ) computation.
You can find papers discussing "cubic" attention, i.e. each token gets to interact with each pair of other tokens, but always in very theoretical settings with single-layer transformers on contrived synthetic tasks.
Keep in mind that LLMs have many many layers, so they have plenty of opportunity to model higher-order interactions without needing to brute force every possible combination of 10 previous tokens, of which the vast majority will be useless. Empirically, even full "quadratic" attention is not always necessary, as evidenced by the existence of linear/sparse attention variants that perform almost as well.
If you remove the terms "self", "agency", and "trivially reducible", it seems to me that a classical robot/game AI planning algorithm, which no one thinks is conscious, matches these criteria.
How do you define these terms without begging the question?
If anything has, minimally, a robust spatiotemporal sense of itself, and can project that sense forward to evaluate future outcomes, then it has a robust "self."
What this requires is a persistent internal model of: (A) what counts as its own body/actuators/sensors (a maintained self–world boundary), (B) what counts as its history in time (a sense of temporal continuity), and (C) what actions it can take (degrees of freedom, i.e. the future branch space), all of which are continuously used to regulate behavior under genuine epistemic uncertainty. When (C) is robust, abstraction and generalization fall out naturally. This is, in essence, sapience.
By "not trivially reducible," I don't mean "not representable in principle." I mean that, at the system's own operative state/action abstraction, its behavior is not equivalent to executing a fixed policy or static lookup table. It must actually perform predictive modeling and counterfactual evaluation; collapsing it to a reflex table would destroy the very capacities above. (It's true that with an astronomically large table you can "look up" anything -- but that move makes the notion of explanation vacuous.)
Many robots and AIs implement pieces of this pipeline (state estimation, planning, world models), but current deployed systems generally lack a robust, continuously updated self-model with temporally deep, globally integrated counterfactual control in this sense.
If you want to simplify it a bit, you could just say that you need a robust and bounded spatiotemporal sense, coupled to the ability to generalize from that sense.
I don’t know the post you’re referring to but I highly recommend How the Immune System Works by Lauren Sompayrac. It explains the interesting parts without getting bogged down in the details of every signalling pathway, but without dumbing things down too much.
Is the drone of a fan harmonic? I would’ve thought it’s more like a repetition pitch so its overtones would not be harmonic and would not exhibit a missing fundamental.
Agree with the broader point, just curious if there’s some interesting physics that creates a harmonic sound.
Overtones are about timbre, not harmony. The fan isn't playing a chord (well, probably not). But the tone the fan plays isn't a pure sine wave either. It will have overtones that are integer multiples of the fundamental that give it its characteristic sound.
It's the same reason that a flute and saxophone can play the same note but sound different. The fundamental is the same, but the amplitudes of the overtones are different.
> It will have overtones that are integer multiples of the fundamental that give it its characteristic sound.
What I’m wondering is why the overtones would go in integer multiples (i.e. be harmonic) for a fan? A flute and a saxophone have harmonic(ish) overtones because of the physics of a vibrating column of air.
This is just math, not physics. Suppose you have a thing vibrating in a periodic manner, so that the sound pressure is some periodic function of time. Fourier transform that function to get a spectrum, and it will have discrete peaks at the fundamental frequency (1 / period) and at integer multiples of that frequency.

You don’t even need to decompose into sine and cosine functions for this to work — all that’s really going on is that you have f(t) = f(t + period), and you’re turning f into the sum of a bunch of other functions g_1, g_2, etc., all of which have the same property that g_i(t) = g_i(t + period). If a function g satisfies g(t) = g(t + period/n) for some integer n, then you can iterate that property n times and you’ll also have g(t) = g(t + period). These functions with the fundamental period, half the fundamental period, one third the fundamental period, etc., are the fundamental tone and its overtones. You could decompose into square waves or just about anything else and you would get the same result.
(In any discussion of Fourier transforms complete with equations, you’ll usually see a bunch of factors of 2π because the frequencies are angular frequencies. This is done for mathematical convenience and has no effect on any of this.)
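You can check this numerically: take any periodic but non-sinusoidal waveform and look at its spectrum; the peaks land only on integer multiples of 1/period. A quick sketch with a clipped sine (the sample rate, fundamental, and peak threshold are all arbitrary choices):

```python
import numpy as np

fs = 1000                     # sample rate (Hz)
f0 = 50                       # fundamental (Hz), so period = 1/50 s
t = np.arange(0, 1, 1 / fs)   # one second of signal

# Periodic but not a pure sine: a sine clipped at +/- 0.5.
x = np.clip(np.sin(2 * np.pi * f0 * t), -0.5, 0.5)

spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), 1 / fs)

# Keep only peaks well above the numerical noise floor.
peaks = freqs[spectrum > 0.01 * spectrum.max()]
print(peaks)  # every peak is an integer multiple of 50 Hz
```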
Your question shows enough familiarity with the phenomenon of strictly integer harmonics that you can basically disregard the replies explaining how they could arise. What you’re actually asking is: given that integer harmonics shouldn’t arise inevitably, why would they be present in the case of a fan?
Some of the replies pontificate and simply assume sounds are periodic, and hence that their partials must be perfectly integral, which is of course totally bonkers.
Yes, some instruments are harmonic (i.e. integral harmonics down to ~ ppm frequency ratio errors) like violins, but only because those are bowed strings, resulting in phase locking.
Plucked strings are much further from integral harmonics, due to dispersion: yes, standing waves with a frequency-independent wave speed c on the string would give perfectly harmonic partials, but real strings show dispersion (a frequency-dependent wave speed), resulting in inharmonic partials.
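For a stiff string, the standard model shifts the nth partial sharp of n·f1 by a factor sqrt(1 + B·n²), where B is the inharmonicity coefficient. A quick sketch (the B value here is just an illustrative number in the piano-string ballpark):

```python
import math

f1 = 220.0   # fundamental (Hz)
B = 4e-4     # illustrative inharmonicity coefficient (B = 0 for an ideal string)

for n in range(1, 6):
    # Stiff-string partial: sharp of the exact harmonic n * f1.
    fn = n * f1 * math.sqrt(1 + B * n * n)
    cents = 1200 * math.log2(fn / (n * f1))  # sharpness vs. exact harmonic
    print(f"partial {n}: {fn:7.2f} Hz  (+{cents:.1f} cents)")
```

The deviation grows with n, which is why high piano partials drift noticeably sharp while bowed strings stay phase-locked to integer ratios.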
Nothing indicates fan noise should be strongly harmonic. Its composite sound may have structured, repeatable (in)harmonic components arising in many ways; harmonics would just be the easiest to explain. The part that sounds "white" would presumably be hard to characterize and cancel.
If the fan has any recognizable pitch at all, it's because something periodic is happening. If it's loud enough to be annoying, there's probably some resonance going on to amplify it.
For example, maybe the motor spins at 120 Hz, and its slight asymmetry shakes the chassis of the fan. That shaking will send waves across the body of the fan. Waves that don't fit the body's resonant modes will bounce around and end up destructively cancelling out. But frequencies at or close to integer multiples of the body's resonant frequency will reinforce themselves as they bounce back and forth across the chassis and get amplified.
If you do an image search for "string overtones", you can get a picture of what I mean. Random physical objects aren't all strings, but many of them have at least a little elasticity and rigidity such that they can vibrate and resonate. When they do, the result will be harmonics at the object's fundamental frequency and its integer multiples.
Other frequencies occur too. If you strike a bell, for example, that impulse will produce waves at basically all frequencies. It's just that the ones that don't resonate with the bell's fundamental will cancel themselves out and fade out nearly instantly (that's the clanky part of the very beginning of a bell sound). The multiples of the resonance frequency will ring out (the bell-like peal that decays slowly).
So are you saying the only difference between a woodwind instrument and a fan is the column?
What about a stringed instrument then?
Every sound found in nature contains multiple frequency components. When these align as integer multiples of the fundamental, they are harmonics; when they do not, they are inharmonic partials. Only a pure sine wave lacks them, and such signals don’t occur naturally.
A string fixed at both ends produces harmonic sounds because of its particular structure. In order to have a non-integer overtone the ends would have to move up and down, which by construction they can't. Similarly for wind instruments: the air stops at either end and is reflected back, and a non-integer overtone would require changing the length of the tube (or sticking holes in it to allow the pressure to go to zero at the hole, effectively creating an artificial "end" of the tube).
By contrast, a freely vibrating bar (not fixed at the ends) does not have harmonic overtones. To make the bars of a xylophone, marimba, or vibraphone sound nice, you have to cut out a little "scoop" shape from the bottom of the bar to force it to vibrate such that its overtones match up with integer multiples of the fundamental frequency of the bar.
As you say, most sounds in nature do not have a harmonic spectrum, so if a fan did I would find that surprising and interesting.
> why would the overtones go in integer multiples (I.e. be harmonic) for a fan?
The fan noise is from its own vibrations -- presumably driven by the motor. These vibrations will correspond to natural vibrating modes of the body of the vibrating object -- which could be the motor, or the chassis, or even possibly the fan blades. Whatever the shape, the natural modes will be naturally quantized into "harmonics". Those vibrating modes could have more nuanced spatial forms (e.g. Bessel functions) but their temporal pattern would likely be sinusoidal.
Please downvote this comment. But I had to say thanks for this. It's one of the little glistening ornaments on the perennial HN xmas tree. Good thread (and post) altogether.
> This approach works by randomly polling participating devices for whether they’ve seen a particular fragment, and devices respond anonymously with a noisy signal. By noisy, we mean that devices may provide the true signal of whether a fragment was seen or a randomly selected signal for an alternative fragment or no matches at all. By calibrating how often devices send randomly selected responses, we ensure that hundreds of people using the same term are needed before the word can be discoverable. As a result, Apple only sees commonly used prompts, cannot see the signal associated with any particular device, and does not recover any unique prompts. Furthermore, the signal Apple receives from the device is not associated with an IP address or any ID that could be linked to an Apple Account. This prevents Apple from being able to associate the signal to any particular device.
The way I read this, there's no discovery mechanism here, so Apple has to guess a priori which prompts will be popular. How do they know what queries to send?
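For what it's worth, the polling mechanism quoted above is essentially classic randomized response. A toy simulation (the probabilities and rates below are made up, not Apple's actual calibration) shows why aggregate frequencies survive while any single response stays deniable:

```python
import random

random.seed(0)
p = 0.25                 # probability a device answers truthfully (assumption)
n_devices = 200_000
true_rate = 0.10         # fraction of devices that actually saw the fragment

reports = 0
for _ in range(n_devices):
    saw_it = random.random() < true_rate
    if random.random() < p:
        reports += saw_it                   # truthful response
    else:
        reports += random.random() < 0.5    # random (noise) response

# De-bias the aggregate: E[reports/n] = p * true_rate + (1 - p) * 0.5
estimate = (reports / n_devices - (1 - p) * 0.5) / p
print(round(estimate, 3))  # close to the true 0.10
```

No individual "yes" is meaningful (it's random 75% of the time here), but the de-biased aggregate converges on the true rate once enough devices report, which matches the "hundreds of people using the same term" threshold described.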
Later in the article, for a different (but similar) feature:
> To curate a representative set of synthetic emails, we start by creating a large set of synthetic messages on a variety of topics... We then derive a representation, called an embedding, of each synthetic message that captures some of the key dimensions of the message like language, topic, and length. These embeddings are then sent to a small number of user devices that have opted in to Device Analytics.
It's crazy to think Apple is constantly asking my iPhone if I ever write emails similar to emails about tennis lessons (their example). This feels like the least efficient way to understand users in this context. Especially considering they host an email server!
yeah, the linked paper [1] has more detail--basically they seem to start with a seed set of "class labels" and subcategories (e.g. "restaurant review" + "steak house"). They ask an LLM to generate lots of random texts incorporating those labels. They make a differentially private histogram of embedding similarities between those texts and the private data, then use that histogram to resample the texts, which become the seeds for the next iteration, sort of like a particle filter.
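A rough sketch of one iteration of that loop, with made-up shapes and noise scales (not the paper's actual algorithm or parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

n_candidates = 100
d = 16
candidates = rng.normal(size=(n_candidates, d))   # embeddings of synthetic texts
private = rng.normal(size=(500, d))               # stand-in for on-device data

# On-device step: each private embedding "votes" for its closest candidate.
sims = private @ candidates.T
votes = np.bincount(sims.argmax(axis=1), minlength=n_candidates).astype(float)

# Differential privacy: add calibrated noise to the histogram before it leaves.
noisy_votes = np.maximum(votes + rng.laplace(scale=2.0, size=n_candidates), 0)

# Resample: well-matched candidates seed the next generation round.
probs = noisy_votes / noisy_votes.sum()
next_seeds = rng.choice(n_candidates, size=n_candidates, p=probs)
print(len(set(next_seeds)))  # popular candidates duplicated, unpopular dropped
```

Iterating this drifts the synthetic corpus toward whatever the private data actually looks like, without any individual text leaving a device.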
I'm still unclear on how you create that initial set of class labels used to generate the random seed texts, and how sensitive the method is to that initial corpus.
You could brute-force it by querying about all 500k English words. With 1.3+ billion iPhone users, that means about 2,600 users would see any given word, which may be enough to observe trends.
No, I think it's fairly well guaranteed that devices are encrypting and then submitting prompts. Homomorphic encryption allows them to do honest-to-god work without decrypting the data. The "fragments" the polled devices are sent are probably some sub-sequence of the encrypted prompt.
I think the main advantage is that you can compute the extra parameters (the PRNG seeds) from the network weights alone, whereas most other quantization methods require simulating the quantization procedure at training time (Quantization-Aware Training) or setting them from a calibration dataset (Post-Training Quantization).
> What makes this technique particular to LLM weights
This is my understanding as a non-expert.
LLM activations tend to be relatively sparse with large outliers. With linear quantization, this means you either have to clip off the outliers or you have to stretch your range to include the outliers, which wastes precious bits. Neither of these works well, so essentially all LLM quantization research is using various heuristics to get around these outliers. For example, you can do linear quantization but split the activations up into smaller blocks to make it less likely that any given block contains an outlier.
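A quick numerical illustration of why outliers hurt linear quantization and why blocking helps (the block size of 64 is an arbitrary choice here):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4096).astype(np.float32)
x[0] = 50.0   # a single outlier

def quantize_int8(v):
    # Symmetric linear quantization to 8 bits: one scale per input.
    scale = np.abs(v).max() / 127.0
    return np.round(v / scale).clip(-127, 127) * scale

# One scale for the whole tensor: the outlier stretches the range,
# so the "normal" values get almost no resolution.
err_global = np.abs(quantize_int8(x) - x).mean()

# One scale per block of 64: only the outlier's own block suffers.
blocks = x.reshape(-1, 64)
err_block = np.abs(
    np.stack([quantize_int8(b) for b in blocks]).ravel() - x
).mean()

print(err_global, err_block)  # blockwise error is far smaller
```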
Another trick people have discovered (predates LLMs) is applying a random rotation/projection to the embeddings. This has the effect of making sure no one dimension in the vector dominates the others (which again hurts quantization). This works because in order for a single dimension to dominate, all the others have to "conspire" to be near zero. When you have 10,000+ dimensions, that's very unlikely.
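A small demonstration of the rotation trick, using a QR-based random orthogonal matrix (one standard way to generate them; the dimension and outlier value are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024
x = rng.normal(size=d)
x[0] = 100.0   # one coordinate dominates

# Random orthogonal matrix via QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
y = Q @ x      # rotation: preserves the norm, mixes all coordinates

print(np.abs(x).max())   # 100.0: one huge coordinate
print(np.abs(y).max())   # much smaller: the outlier is smeared across dims
```

The rotated vector has the same norm but a far flatter value distribution, which is exactly what a fixed-range quantizer wants.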
This paper applies the latter trick. Instead of pre-generating the random projection matrices, they generate them on the fly on the accelerator from a seed that is fixed for each block. The seed is chosen from an offline brute-force search that needs only the weights of the network. This separates it from a lot of other quantization methods that either require calibration data or have to be simulated at training time so the network learns the quantization parameters itself.
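The seed trick itself is easy to illustrate: a PRNG is deterministic, so only the integer seed needs to be stored or shipped, and the matrix can be regenerated bit-for-bit on the accelerator (the shapes here are made up, not the paper's):

```python
import numpy as np

seed = 1234            # found offline, e.g. by brute-force search over seeds
block_shape = (64, 64)

def projection_from_seed(s):
    # Deterministically regenerate the projection matrix from its seed.
    return np.random.default_rng(s).normal(size=block_shape)

P1 = projection_from_seed(seed)   # generated at quantization time
P2 = projection_from_seed(seed)   # regenerated at inference time
print(np.array_equal(P1, P2))     # True: only the integer seed is stored
```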
You might think this is wasteful/might hurt performance, but it turns out that LLM inference is heavily memory-bound as it involves streaming a very large neural network into the accelerator (GPU/TPU/NPU/whatever) to operate on a relatively small amount of data, so there are lots of "free cycles" to generate these random numbers. Of course, if you care about power usage that might not be a great idea...
This doesn’t answer your question, but one thing to keep in mind is that past the very first layer, every “token” position is a weighted average of every previous position, so adjacency isn’t necessarily related to adjacent input tokens.
A borderline tautological answer might be “because the network learns that putting related things next to each other increases the usefulness of the convolutions”