Ha, made me chuckle. For those seriously wondering about this: it's not a viable optimization, because weights are not readily compressible via JPEG/DCT, and there are only a limited number of these units on the chip, which bottlenecks throughput. The resulting speed is dwarfed by simply reading uncompressed weights from HBM.
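A rough way to see the compressibility point, as a sketch only: the DCT pays off when most of a signal's energy lands in a few coefficients. The snippet below compares a smooth, image-like row against Gaussian noise. Using noise as a stand-in for weights is my assumption (real weights have some structure, but far less smoothness than images), and the helper names are made up for illustration.

```python
import math
import random

def dct2(x):
    """Orthonormal 1-D DCT-II (the transform family JPEG applies to 8x8 blocks)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def top_k_energy_fraction(x, k):
    """Fraction of total energy held by the k largest-magnitude DCT coefficients."""
    energies = sorted((c * c for c in dct2(x)), reverse=True)
    return sum(energies[:k]) / sum(energies)

rng = random.Random(0)  # fixed seed so the comparison is repeatable
n = 64

# Smooth, image-like row: one sine period plus a gentle ramp.
smooth = [math.sin(2 * math.pi * i / n) + 0.5 * i / n for i in range(n)]

# ASSUMPTION: Gaussian noise as a crude proxy for a row of model weights.
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]

smooth_frac = top_k_energy_fraction(smooth, 8)
noise_frac = top_k_energy_fraction(noise, 8)
print(f"energy in top 8 of {n} coefficients: smooth={smooth_frac:.3f}, noise-like={noise_frac:.3f}")
```

The smooth row packs nearly all of its energy into a handful of coefficients, while the noise-like row spreads it out, so dropping or quantizing coefficients costs far more accuracy there.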
Good fun. Now I wish RT cores were programmable with some form of PTX, but for now it's OptiX or die. I've managed to do fun stuff with it, but it's like pulling teeth.
I won a GPU hackathon back in 2019 doing something very similar to this, although the other way around: I was compressing weights using hardware modules.