
I believe the ML community will strongly disagree. CUDA is everything


Because the academic ML community does not care about shipping products to end users who aren't equipped with NVIDIA hardware.


Except SYCL also works on AMD and Intel, and also has a CUDA backend, but apparently you missed that part.

As for commercial uses of CUDA, Hollywood doesn't seem to have any problem with it, and neither do the car manufacturers with Jetson.


SYCL might be an option, but it doesn't seem to have much in the way of adoption (the forum is dead), which is concerning.

It does look like Intel is supporting it at least, so maybe in the future it will be a good option.


The most surprising aspect of this is that it's a solo paper accepted for a major conference!


There's another case of reverting the codebase to an earlier version just to run it


As a side note: loop unrolling by itself has almost no effect; it mainly enables scalar replacement for latency-bound instructions and a few other follow-on optimisations.
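
To illustrate, here's a toy Python sketch of the kind of thing unrolling enables (independent scalar accumulators that break the add latency chain). In plain CPython the interpreter cost hides the effect entirely; the point is just the shape of the transformation a compiler performs:

  def dot_rolled(a, b):
      # One accumulator: every add waits on the previous one, so the
      # loop is bound by the add latency chain.
      acc = 0.0
      for i in range(len(a)):
          acc += a[i] * b[i]
      return acc

  def dot_unrolled4(a, b):
      # Unrolled by 4 with four independent scalar accumulators.  The
      # unrolling itself buys little; the separate scalars break the
      # dependency chain so the compiler/CPU can overlap the multiply-adds.
      n = len(a) - len(a) % 4
      acc0 = acc1 = acc2 = acc3 = 0.0
      for i in range(0, n, 4):
          acc0 += a[i] * b[i]
          acc1 += a[i + 1] * b[i + 1]
          acc2 += a[i + 2] * b[i + 2]
          acc3 += a[i + 3] * b[i + 3]
      for i in range(n, len(a)):  # remainder
          acc0 += a[i] * b[i]
      return acc0 + acc1 + acc2 + acc3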


But it can be 20x


I actually did some benchmarking on this once upon a time[1] and it turns out that it's really easy to write something that is only ~5x slower[2] (until you get to matrices that don't fit in RAM):

           (4, 4, 4) (5, 5, 5) (32, 32, 32) (33, 33, 33) (256, 256, 256) (257, 257, 257) (512, 512, 512) (513, 513, 513) (1024, 1024, 1024) (1025, 1025, 1025)
  –––––––– ––––––––– ––––––––– –––––––––––– –––––––––––– ––––––––––––––– ––––––––––––––– ––––––––––––––– ––––––––––––––– –––––––––––––––––– ––––––––––––––––––
  :naive         0.0       0.0       1.3e-5       2.0e-5          0.0114          0.0133          0.0942           0.106               3.25               2.39
  :tiled         0.0       0.0       2.7e-5       2.2e-5          0.0139          0.0121           0.154           0.101               1.25              0.888
  :fastf77       0.0       0.0       8.0e-6       8.5e-6         0.00543         0.00563          0.0426          0.0445              0.437              0.448
  :blas       4.5e-6    4.0e-6       1.9e-5       2.1e-5        0.000972         0.00109         0.00712         0.00744             0.0582             0.0607
(Units are seconds per multiplication.)

Obviously OpenBLAS is so easy to package that it's not really worth avoiding it, but it was very eye-opening to see just how easy it is to get within an order of magnitude (easier, in fact, than getting into the 10x-20x range).

[1]: https://gist.github.com/Sean1708/69c5694048e9a9ca7bd84fcbc9e...

[2]: 8-core 3.4GHz Haswell i7 with 32kB L1, 256kB L2, 8MB L3, and 8GB RAM.
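
For the curious, the non-BLAS kernels are nothing exotic. This isn't the code from [1], just a rough sketch of the naive and tiled ideas in Python with Numba (block size picked arbitrarily):

  import numpy as np
  from numba import njit

  @njit(cache=True)
  def matmul_naive(A, B, C):
      n, k, m = A.shape[0], A.shape[1], B.shape[1]
      for i in range(n):
          for j in range(m):
              acc = 0.0
              for p in range(k):
                  acc += A[i, p] * B[p, j]
              C[i, j] = acc

  @njit(cache=True)
  def matmul_tiled(A, B, C, bs=64):
      # Same triple loop, but walked in bs x bs blocks so the working set
      # of A, B and C stays in cache while it's being reused.
      n, k, m = A.shape[0], A.shape[1], B.shape[1]
      for i in range(n):
          for j in range(m):
              C[i, j] = 0.0
      for ii in range(0, n, bs):
          for pp in range(0, k, bs):
              for jj in range(0, m, bs):
                  for i in range(ii, min(ii + bs, n)):
                      for p in range(pp, min(pp + bs, k)):
                          a = A[i, p]
                          for j in range(jj, min(jj + bs, m)):
                              C[i, j] += a * B[p, j]

  A = np.random.rand(512, 512)
  B = np.random.rand(512, 512)
  C = np.empty((512, 512))
  matmul_tiled(A, B, C)
  assert np.allclose(C, A @ B)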


Why is (1025, 1025, 1025) so much faster than (1024, 1024, 1024)?


My guess is that it's happening mostly due to cache conflicts. With a dimension of 1024, for a simplified 32 KB L1 you can fit exactly 8 rows of the inner dimension in the cache, which means that (0, 8, 0) would map to the same cache location as (0, 0, 0), which is bad for tiling.
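
A quick back-of-the-envelope check of that, assuming 4-byte elements and a direct-mapped cache purely to keep the arithmetic simple:

  CACHE_BYTES = 32 * 1024
  ELEM_BYTES = 4  # assumed element size

  def cache_slot(row, col, n):
      # Simplified cache slot of element (row, col) in a row-major n x n array.
      addr = (row * n + col) * ELEM_BYTES
      return addr % CACHE_BYTES

  for n in (1024, 1025):
      print(n, cache_slot(0, 0, n) == cache_slot(8, 0, n))
  # 1024 True   -> rows 0 and 8 land on the same slot and evict each other
  # 1025 False  -> the odd dimension breaks the power-of-two stride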


So no custom CUDA kernels, just CuPy? Isn't that a performance issue? (Based on installation notes)


You can write custom CUDA kernels, and I've written a few to support operations over our ragged format. Actually, Thinc makes it pretty easy to optimise a specific bit of code with a custom op, and CuPy's fused decorator can also work well in some situations.
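
For anyone who hasn't used it, the fused decorator turns a chain of elementwise CuPy calls into a single kernel. A toy example (not from Thinc):

  import cupy

  @cupy.fuse()
  def squared_diff(x, y):
      # The subtract and multiply are fused into one kernel launch,
      # instead of one launch plus a temporary array per operation.
      return (x - y) * (x - y)

  a = cupy.arange(6, dtype=cupy.float32)
  b = cupy.ones(6, dtype=cupy.float32)
  out = squared_diff(a, b)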

What you don't get is the compile-time auto-optimisation. For instance, there's no asynchronous dispatch like you would get from PyTorch.

If you write a chunk of operations as just cupy maths, and then write the same thing in PyTorch and use the PyTorch wrapper, you can expect the PyTorch one to perform better. You would also need to write the backprop callback for the cupy maths you did. Sometimes you might find optimisations PyTorch doesn't though, especially around speed vs memory trade-offs.

Part of the philosophy and difference between this and other frameworks is that we do not do any compilation or trickery of any sort: what you write is what gets executed. Obviously this is slower a lot of the time, but it means we can play well with others --- we're not wrestling for control of more of the graph so we can make more optimisations, and we're not limiting the hand optimisations you can do for custom situations.

Here's an example of how I've wrapped CUDA kernels, using Cupy's RawKernel feature. Most people do these as strings within Python source, but I find that super ugly. I like to keep the cuda source in .cu files, and then read in the file to compile it.

* The CUDA kernels: https://github.com/explosion/thinc/blob/master/thinc/backend...

* The code that calls cupy.RawKernel and the wrapping functions: https://github.com/explosion/thinc/blob/master/thinc/backend...

* The wrappers are called by the CupyOps object: https://github.com/explosion/thinc/blob/master/thinc/backend... . This object has the same API across backends, with some functions redefined with backend-specific implementations. In the NumpyOps object, I instead call into custom Cython code.

My CUDA skills aren't great, so I'm sure there are improvements that could be made. I'd welcome suggestions if anyone has them.
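
In outline, the pattern is roughly this (a simplified sketch rather than the actual Thinc code; the file name, kernel name and kernel signature here are made up):

  import numpy
  import cupy
  from pathlib import Path

  # Read the CUDA source from a .cu file kept next to this module and compile
  # it once at import time.  "scale_rows" is a hypothetical kernel assumed to
  # have the signature (float* out, const float* X, const float* scales,
  # int n_rows, int n_cols).
  _src = (Path(__file__).parent / "_custom_kernels.cu").read_text(encoding="utf8")
  _scale_rows_kernel = cupy.RawKernel(_src, "scale_rows")

  def scale_rows(X, scales, threads_per_block=128):
      # Multiply each row of the float32 2d array X by the matching entry
      # of scales, writing into a fresh output array.
      out = cupy.empty_like(X)
      n_rows, n_cols = X.shape
      total = n_rows * n_cols
      n_blocks = (total + threads_per_block - 1) // threads_per_block
      # RawKernel launches take (grid, block, args); scalar arguments are
      # passed as explicitly typed NumPy scalars to match the C signature.
      _scale_rows_kernel(
          (n_blocks,),
          (threads_per_block,),
          (out, X, scales, numpy.int32(n_rows), numpy.int32(n_cols)),
      )
      return out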


People would probably find out what hardware they use for the benchmarks and optimize for that, leading to performance decreases for many others.


One could argue whether those people can be called "technically knowledgeable".


No one can be knowledgeable about all topics, which is precisely why trust is so important.


It's okay for knowledgeable people to not know things.


Was it a university like ETH Zürich, where anyone can get in but there is a limited number of spots from the second year onwards? In that case it really made sense to suggest that.


Same here. I have taken many Calculus and Algebra classes, as well as SAT Math Level 2, and I have no clue why these graphing calculators are necessary...

