
Dithering is very, very freaking cool.

You can do it with any discretely-binned parameter/value/thingie-ma-bobber that must represent a continuous value.

This includes machine learning parameters!

This is something that I've been trying to get the word out about. A rule of thumb that's worked really well for me is "Almost never use a fully discrete approximation of a continuous process if you can get as close to the continuous process as possible."

One very pertinent case is virtually-continuous batch sizes. When growing the batch size during LLM training (which happens often), you can trivially dither between the two nearest integer microbatch (or full minibatch) sizes using a simple Bernoulli draw (i.e. a 0-1 weighted coinflip). This averages out temporally (which you'd see if you took, say, a running exponentially-weighted mean of the value over a long run), and it seems to be strongly superior to staying hard-locked at the nearest quantized bin (which sorta makes sense to me).
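Here's a minimal sketch of what I mean, assuming some continuous batch-size schedule computed elsewhere (all names here are mine, purely illustrative):

    import random

    def dithered_batch_size(target: float) -> int:
        """Round `target` down or up with probability equal to its
        fractional part, so E[batch size] == target exactly."""
        lo = int(target)  # floor, since target is positive
        frac = target - lo
        return lo + (1 if random.random() < frac else 0)

    # e.g. a linear batch-size warmup from 32 to 512 over 10k steps:
    for step in range(10_000):
        target = 32 + (512 - 32) * step / 10_000
        batch_size = dithered_batch_size(target)
        # ... assemble a batch of `batch_size` samples and train on it ...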

If you look at it from an information-theoretic perspective, you're communicating more information about the underlying continuous variable via the discretely-emitted values, so the dithered process should have a higher inherent performance ceiling.
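A toy check of that argument (again my own construction): two different continuous targets become indistinguishable after hard rounding, but their dithered emission streams stay distinguishable, so the fractional part is still recoverable downstream:

    import random

    def dithered_batch_size(target: float) -> int:
        lo = int(target)
        return lo + (1 if random.random() < (target - lo) else 0)

    for target in (37.2, 37.4):
        hard = [round(target) for _ in range(10_000)]
        soft = [dithered_batch_size(target) for _ in range(10_000)]
        print(target,
              sum(hard) / len(hard),   # 37.0 for both: the fraction is lost
              sum(soft) / len(soft))   # ~37.2 vs ~37.4: it's recoverable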

I use this in one of the projects I've worked on that's out in public, but I really need to tighten it up: dynamic batch-size growing is a subdiscipline still very much in its infancy and, IMO, strongly overlooked by a number of folx. Take a look into this method if you're interested, and please ping me if you ever have any questions!

Happy to answer any questions and talk in more detail; this is an interesting topic to me, and I'm hoping to get more people to use dithering in more places (not just in machine learning, I feel/hope!). The article just sort of reminded me of it.

I'm also very interested in the implications of structured dithering for discrete approximations of inherently continuous parameters in an ML setting. Because the setting is autoregressive, I have this fear that randomness is really perhaps the only clean way to avoid some sort of stacking "echo effects", where (high-dimensional, I'd assume, in this particular case) oscillations happen in a very unintentional kind of way. That happens surprisingly often when noise or truly random sampling is not used appropriately....
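To make the oscillation worry concrete, here's a toy illustration (my own construction, not from a real training run): deterministically rounding a slowly drifting continuous schedule produces long, structured runs of correlated quantization error, while Bernoulli dithering whitens the error sequence:

    import random

    def dithered(x: float) -> int:
        lo = int(x)
        return lo + (1 if random.random() < (x - lo) else 0)

    schedule = [32 + 0.01 * t for t in range(1000)]  # slow continuous drift

    hard_err   = [round(x) - x    for x in schedule]
    dither_err = [dithered(x) - x for x in schedule]

    def lag1_autocorr(e):
        m = sum(e) / len(e)
        num = sum((a - m) * (b - m) for a, b in zip(e, e[1:]))
        den = sum((a - m) ** 2 for a in e)
        return num / den

    print(lag1_autocorr(hard_err))    # near 1: errors are highly structured
    print(lag1_autocorr(dither_err))  # near 0: errors are decorrelated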

In any case, I'm curious to hear people's thoughts.



Sometimes pictures of the universe look like dithering to me; I see a similar arising of patterns when looking at globular clusters vs. dithering noise patterns.

In fact, I say this because I was looking at these today:

- https://en.wikipedia.org/wiki/Omega_Centauri

- https://en.wikipedia.org/wiki/Alpha_Centauri

Very similar shapes to me!



