You might be able to understand what a convolutional network "is" without calculus, but you'll be woefully unequipped to ask even obvious questions like "what if we put Fourier transforms around the convolutional layers" (a cursory search suggests it provides the expected speedup but is for some reason not a standard thing to do?). As someone outside of the industry, I'd also imagine any effort to explain what NNs are actually "learning" (or I suppose dually, how to design network architectures) is going to have a lot of fruitful overlap with signal processing theory, which is heavy on calculus, linear algebra, probability, etc.
> a cursory search suggests it provides the expected speedup
What do you mean? Processing a CNN layer takes an amount of time that does not depend on the input data, only on the input and output sizes. A Fourier transform is just a change of basis. Why should anything speed up?
Because convolution is an O(N^2) operation, but a Fourier transform (and its inverse) can be computed in O(N log N), and it turns convolution into O(N) pointwise multiplication. So if you do FFT, multiply, inverse FFT, you get convolution in O(N log N). I would guess that you don't even need to do the inverse FFT and can just learn in frequency space instead, but maybe there's some reason why that doesn't work out.
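The FFT–multiply–inverse-FFT route above can be sketched in plain Python. This is a minimal 1D illustration, not anything from an ML framework: a textbook radix-2 Cooley–Tukey FFT, zero-padding so circular convolution matches linear convolution, checked against the O(N*K) direct sum.

```python
import cmath

def fft(a):
    # Recursive radix-2 Cooley-Tukey FFT; len(a) must be a power of two.
    n = len(a)
    if n == 1:
        return [complex(a[0])]
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(a):
    # Inverse FFT via the conjugation trick.
    n = len(a)
    return [x.conjugate() / n for x in fft([x.conjugate() for x in a])]

def fft_convolve(x, h):
    # Zero-pad to a power of two >= the full linear-convolution length,
    # so circular convolution in frequency space equals linear convolution.
    full = len(x) + len(h) - 1
    n = 1
    while n < full:
        n *= 2
    X = fft(list(x) + [0.0] * (n - len(x)))
    H = fft(list(h) + [0.0] * (n - len(h)))
    y = ifft([a * b for a, b in zip(X, H)])  # O(N) pointwise product
    return [v.real for v in y[:full]]

def direct_convolve(x, h):
    # O(N*K) schoolbook convolution, for comparison.
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

x = [1.0, 2.0, 3.0, 4.0]
h = [0.5, -1.0, 0.25]
a = direct_convolve(x, h)
b = fft_convolve(x, h)
assert all(abs(p - q) < 1e-9 for p, q in zip(a, b))
```

The two routes agree to rounding error; the asymptotic win only comes from the FFT's O(N log N) scaling once N and K are large enough.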
Computing convolutions using FFTs is efficient for large kernels (or filters). Most convolutions in popular ML models have small kernels, a regime where it is typically more efficient to reformulate the convolution as a matrix multiplication.
I think your complexity argument is correct for N = pixels = kernel size. But typically, pixels >> kernel size.
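A rough multiply count makes the regimes concrete. This is a back-of-the-envelope sketch, not a benchmark: the image size, the FFT constant `c`, and the cost model (single channel, one kernel) are all assumptions.

```python
import math

# Rough multiply counts for convolving an H x W single-channel image:
#   direct: H*W * k*k                 (slide a k x k kernel over every pixel)
#   FFT:    c * H*W * log2(H*W) + H*W (forward FFTs + pointwise product +
#                                      inverse FFT; c = 3 is an assumed constant)
H = W = 224
N = H * W

def direct_cost(k):
    return N * k * k

def fft_cost(c=3):
    return c * N * math.log2(N) + N

# Small kernel (3x3), the common CNN case: direct needs fewer multiplies.
assert direct_cost(3) < fft_cost()
# Large kernel (64x64): the FFT route wins by a wide margin.
assert direct_cost(64) > fft_cost()
```

With a typical 3x3 kernel the direct (or im2col-as-matmul) route comes out ahead; the FFT only pays off once the kernel grows toward the image size.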
Disclosure: I work at Arm optimising open source ML frameworks. Opinions are my own.
I see, the wording confused me. You don't really "put Fourier transforms around the convolutional layers" because you have to completely replace the convolutions.
This seems to be done in some cases. I guess it isn't done more widely because the "standard" convolution kernels are very small and the performance would actually be worse?