when’s the last time you wrote a parallel array traversal in C?
also, consider reading the linked post about how assembly instructions are no longer a good approximation of how your computer works: https://queue.acm.org/detail.cfm?id=3212479. in general, writing in a language that is not close to the hardware allows the compiler to adapt when the hardware changes; for example futhark has the ability to execute using either SIMD or GPUs precisely because it’s not over-determined by the source language. C ties processors to the model of the PDP-11, which hasn’t been manufactured for 30 years.
It's not especially difficult to write a parallel array sum in CUDA, which is C++ with a couple of keywords bolted on. Haven't done that in a bit, but I wrote a SIMD hsum not long ago without much difficulty either.
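For anyone who hasn't seen one, here is a rough sketch of what a SIMD horizontal sum can look like in C with SSE intrinsics. It is not the code I actually wrote; the names `hsum_ps` and `array_sum` are made up for illustration, and the `__SSE__` check with a scalar fallback is an assumption so it still builds on non-x86 targets. Note that adding four lanes at a time changes the order of float additions, so results can differ slightly from a plain loop on non-trivial data.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>

/* Hypothetical helper: add the four float lanes of one SSE register. */
static float hsum_ps(__m128 v) {
    __m128 hi    = _mm_movehl_ps(v, v);      /* [v2, v3, v2, v3] */
    __m128 sum2  = _mm_add_ps(v, hi);        /* lane0 = v0+v2, lane1 = v1+v3 */
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 total = _mm_add_ss(sum2, lane1);  /* lane0 = v0+v1+v2+v3 */
    return _mm_cvtss_f32(total);
}
#endif

/* Hypothetical entry point: sum n floats, four lanes at a time where SSE
 * is available, then finish the remainder with a scalar loop. */
float array_sum(const float *data, size_t n) {
    assert(data != NULL || n == 0);
    size_t i = 0;
    float total = 0.0f;
#if defined(__SSE__)
    __m128 acc = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4) {
        /* Unaligned load, so the caller needs no special alignment. */
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));
    }
    total = hsum_ps(acc);
#endif
    /* Scalar tail (or the whole array when SSE is not available). */
    for (; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

The vertical adds in the loop do the bulk of the work; the horizontal step only runs once at the end, which is why it rarely matters how clever the shuffle sequence is.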
C was of course originally designed for the PDP-11, but neither the standard nor the implementations have assumed that at any time this century. It would be quite a stretch to say that thread-local storage, atomics, the weird restrictions on pointers to deal with segmented architectures, IEEE floats, and other "modern" additions have anything to do with PDP-11s. And obviously you can take C/C++ code and efficiently build it for a wildly different architecture, like you do every time you use a compiler (including NVCC).
I'm not even saying that C is the fastest possible language because it really shouldn't be. What I'm saying is that decades of HLL advocates saying that we just need a sufficiently smart compiler to beat C have failed to produce one. C-family languages remain the gold standard for performance, and there's not much that even reliably competes beyond Rust and Fortran. Fortran is also an interesting example of a "low level" language without many of the bad ideas of C that ends up not much faster these days.