I did some benchmarking on this once upon a time[1], and it turns out it's really easy to write something that is only ~5x slower than OpenBLAS[2] (until you get to matrices that don't fit in RAM).
Obviously OpenBLAS is easy enough to package that it's not really worth avoiding, but it was very eye-opening to see just how easy it is to get within an order of magnitude (easier, in fact, than landing in the 10x-20x range).
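For a sense of the kind of code involved, here's a minimal sketch of a blocked (tiled) matrix multiply. It's written in plain Python purely to show the loop structure; an actual competitive version would be in a compiled language (plain Python itself is of course far more than 5x off):

    # Sketch of loop tiling for C = A @ B on n x n row-major float lists.
    # C must be zero-initialised; `tile` is chosen so a few tiles fit in cache.
    def matmul_tiled(A, B, C, n, tile=64):
        for ii in range(0, n, tile):              # tile over rows of A / C
            for kk in range(0, n, tile):          # tile over the shared inner dimension
                for jj in range(0, n, tile):      # tile over columns of B / C
                    for i in range(ii, min(ii + tile, n)):
                        for k in range(kk, min(kk + tile, n)):
                            a = A[i * n + k]
                            for j in range(jj, min(jj + tile, n)):
                                C[i * n + j] += a * B[k * n + j]
        return C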
My guess is that it's happening mostly due to cache conflicts. With a dimension of 1024, for a simplified 32 KB L1 you can fit exactly 8 rows of the inner dimension in the cache, which means that (0, 8, 0) would map to the same cache location as (0, 0, 0), which is bad for tiling.
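Back-of-the-envelope version of that, assuming float32 elements and a simplified direct-mapped 32 KB L1 (both assumptions, not measurements):

    elem_size = 4                      # bytes per float32 (assumption)
    n = 1024                           # inner dimension
    cache_size = 32 * 1024             # simplified 32 KB L1

    row_bytes = n * elem_size          # 4096 bytes per row of the inner dimension
    print(cache_size // row_bytes)     # 8 -> only 8 rows fit in the cache

    # (0, 8, 0) is 8 rows past (0, 0, 0) in the same column: 8 * 4096 = 32768 bytes,
    # exactly the cache size, so in a direct-mapped cache it aliases to the same set.
    print((8 * row_bytes) % cache_size == 0)   # True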
You can write custom CUDA kernels, and I've written a few to support operations over our ragged format. Actually, Thinc makes it pretty easy to optimise a specific bit of code with a custom op... cupy's fuse decorator can also work well in some situations.
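For anyone who hasn't used it, cupy's fuse decorator compiles a chain of elementwise operations into a single kernel the first time the function is called. A generic example (nothing to do with Thinc's actual ops):

    import cupy

    @cupy.fuse()
    def clipped_relu(x):
        # A chain of elementwise ops fuses into one GPU kernel, avoiding
        # a separate launch (and temporary array) per intermediate.
        return cupy.minimum(cupy.maximum(x, 0.0), 6.0)

    x = cupy.random.randn(1024, 1024).astype(cupy.float32)
    y = clipped_relu(x)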
What you don't get is the compile-time auto-optimisation. Like, there's no asynchronous dispatch like you would get from PyTorch.
If you write a chunk of operations as just cupy maths, and then write the same thing in PyTorch and use the PyTorch wrapper, you can expect the PyTorch one to perform better. You would also need to write the backprop callback for the cupy maths yourself. Sometimes you might find optimisations PyTorch doesn't, though, especially around speed vs. memory trade-offs.
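To make the backprop-callback point concrete, here's the shape of it for an invented example (a hand-written swish activation in raw cupy; illustrative only, not Thinc source): the forward pass is plain array maths, and you return a closure that maps the gradient of the output back to the gradient of the input.

    import cupy

    def swish_forward(X):
        sig = 1.0 / (1.0 + cupy.exp(-X))
        Y = X * sig

        def backprop(dY):
            # d/dx [x * sigmoid(x)] = sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))
            return dY * (sig + X * sig * (1.0 - sig))

        return Y, backprop

    X = cupy.random.randn(8, 128).astype(cupy.float32)
    Y, backprop = swish_forward(X)
    dX = backprop(cupy.ones_like(Y))   # PyTorch would derive this step for you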
Part of the philosophy, and a difference from other frameworks, is that we do not do any compilation or trickery of any sort: what you write is what gets executed. Obviously this is slower a lot of the time, but it means we can play well with others --- we're not wrestling for control of more of the graph so we can make more optimisations, and we're not limiting the hand optimisations you can do for custom situations.
Here's an example of how I've wrapped CUDA kernels, using cupy's RawKernel feature. Most people do these as strings within the Python source, but I find that super ugly. I like to keep the CUDA source in .cu files, and then read in the file to compile it.
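A cut-down illustration of that pattern (the file name, kernel, and wrapper here are made up for the example; they're not the actual Thinc kernels):

    import cupy
    import numpy
    from pathlib import Path

    # scale.cu would contain something like:
    #   extern "C" __global__
    #   void scale(float* y, const float* x, float alpha, int n) {
    #       int i = blockIdx.x * blockDim.x + threadIdx.x;
    #       if (i < n) y[i] = alpha * x[i];
    #   }
    _scale_kernel = cupy.RawKernel(Path("scale.cu").read_text(), "scale")

    def scale(x, alpha):
        # x: contiguous float32 cupy array
        y = cupy.empty_like(x)
        n = x.size
        threads = 256
        blocks = (n + threads - 1) // threads
        _scale_kernel((blocks,), (threads,),
                      (y, x, numpy.float32(alpha), numpy.int32(n)))
        return y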
* The wrappers are called by the CupyOps object: https://github.com/explosion/thinc/blob/master/thinc/backend... . This object has the same API across backends, with some functions redefined with backend-specific implementations. In the NumpyOps object, I instead call into custom Cython code.
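The arrangement is roughly this shape (a simplified illustration of the pattern, not the actual Ops classes):

    import numpy

    class Ops:
        xp = numpy  # the array module: numpy here, cupy in the GPU subclass

        def relu(self, X):
            # Generic version written against self.xp runs on either backend.
            return self.xp.maximum(X, 0)

    class NumpyOps(Ops):
        def relu(self, X):
            # In the real thing this would call into hand-written Cython;
            # the point is just that the override lives here.
            return numpy.maximum(X, 0)

    class CupyOps(Ops):
        def __init__(self):
            import cupy
            self.xp = cupy

        def relu(self, X):
            # Backend-specific override: this is where a custom RawKernel
            # wrapper (like the one above) would get dispatched.
            return self.xp.maximum(X, 0)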
My CUDA skills aren't great, so I'm sure there are improvements that could be made. I'd welcome suggestions if anyone has them.