Basically I just don't want to hear about "the state of SIMD in Rust" unless it is about a dramatic improvement in autovectorization in the Rust compiler.
80%-90% or so of real life vectorization can be achieved in C or C++ just by writing code in a way that it can be autovectorized. Intrinsics get you the rest of the way on harder code. Autovectorization is essentially a solved problem for the vast majority of floating point code.
Not so with Rust, because of a dogmatic approach to floating-point arithmetic that assumes bitwise reproducibility is the "right" answer for everyone (in fact, it's the right answer for almost nobody), to the point of not even letting a user opt into these optimizations with a flag. And once you're at the point of writing intrinsics, you have to handwrite code for every new architecture, when an autovectorizer could have gotten you 80%-90% of the way there from a single source, and often that is enough.
The usual counterargument is that if a user needs SIMD they can just use some SIMD API and make their intention explicit. That is essentially an argument that we should handwrite intrinsics. Well, guess what: I'm a programmer, and I use compilers because they _do this for me_, and they do it very easily in C or C++ once I tell them I'm OK with reordering operations and other "accuracy-impacting" optimizations.
The huge joke on us is that these optimizations generally have the effect of _improving_ accuracy, because they reduce the number of rounding steps, either by simply reducing the number of operations or by using fused multiply-adds, which round only once.
>80%-90% or so of real life vectorization can be achieved in C or C++ just by writing code in a way that it can be autovectorized.
Yep. I was pleasantly surprised by the autovectorization quality with recent clang at work a few days ago. If you write code that the compiler can infer operates on multiples of 4, 8, etc., it goes off and emits pretty decent NEON/AVX code. The rest, as you say, is handled quite well by intrinsics these days.
Autovectorization was definitely poorer 5-10 years ago on older compiler toolchains.
Yikes. Sounds like we need this in Rust ASAP. (I do a lot of parallelizable code; GPU-centric, but CPU SIMD is a good fallback for machines that don't have NVIDIA GPUs.) I find the manual SIMD packing/unpacking clumsy, especially when managing it in addition to non-SIMD CPU and GPU code.