Depends on what you're doing. The main issue here is reductions / accumulations.
That is, if you have a bunch of floats like:
float sum = 0.f;
for (int i = 0; i < N; i++) {
    sum += x[i];
}
and you vectorize it to something like (typing in this comment, errors are likely):
__m256 sum_8wide = _mm256_setzero_ps();
for (int i = 0; i < N/8; i++) {
    sum_8wide = _mm256_add_ps(sum_8wide, _mm256_loadu_ps(&x[8*i]));
}
// Now sum up the 8 values to get the final sum
float sum = _mm256_hadd_ps(...);
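For what it's worth, one way to spell out that final reduction is below (just a sketch; the exact shuffle sequence is a matter of taste, but whichever one you pick defines the order you then have to match on the other target):
// One possible horizontal reduction (needs <immintrin.h>): fold the high
// 128-bit half onto the low half, then reduce the remaining 4 lanes pairwise.
__m128 lo4  = _mm256_castps256_ps128(sum_8wide);        // lanes 0..3
__m128 hi4  = _mm256_extractf128_ps(sum_8wide, 1);      // lanes 4..7
__m128 sum4 = _mm_add_ps(lo4, hi4);                     // (0+4, 1+5, 2+6, 3+7)
sum4 = _mm_add_ps(sum4, _mm_movehl_ps(sum4, sum4));     // ((0+4)+(2+6), (1+5)+(3+7), ...)
sum4 = _mm_add_ss(sum4, _mm_shuffle_ps(sum4, sum4, 1)); // add lane 1 into lane 0
float sum = _mm_cvtss_f32(sum4);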
Doing the accumulation 8-wide like that gives a different result than doing it 4-wide and then reducing, because floating-point addition isn't associative. The usual solution is either to use the lowest common denominator (e.g., 4-wide SSE instead of 8-wide AVX), or, more performance oriented, to use the 4-wide SIMD units on ARM to "emulate" an 8-wide virtual vector (~15 years since I wrote NEON... and again, this is in a comment):
float32x4_t sum_lo = vdupq_n_f32(0.f);
float32x4_t sum_hi = vdupq_n_f32(0.f);
for (int i = 0; i < N/8; i++) {
    sum_lo = vaddq_f32(sum_lo, vld1q_f32(&x[8*i]));
    sum_hi = vaddq_f32(sum_hi, vld1q_f32(&x[8*i + 4]));
}
// Reduce the two halves in the same order as the 8-wide version
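The final step then mirrors whatever order the x86 reduction used, e.g. matching the sketch above (again, just one way to write it, needs <arm_neon.h>):
float32x4_t sum4 = vaddq_f32(sum_lo, sum_hi);                          // same pairing as the lo/hi fold: (0+4, 1+5, 2+6, 3+7)
float32x2_t sum2 = vadd_f32(vget_low_f32(sum4), vget_high_f32(sum4));  // ((0+4)+(2+6), (1+5)+(3+7))
float sum = vget_lane_f32(sum2, 0) + vget_lane_f32(sum2, 1);           // same final add order as the AVX version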
You would want to write a "virtual SIMD" wrapper library so you don't have to do this by hand in lots of places.
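A hypothetical sketch of what that wrapper could look like (all names here are invented for illustration): vfloat8 is one __m256 on x86 and a pair of float32x4_t on NEON, and the reduction order is pinned down in exactly one place:
#if defined(__AVX__)
#include <immintrin.h>
typedef struct { __m256 v; } vfloat8;
static inline vfloat8 vf8_zero(void)                { return (vfloat8){ _mm256_setzero_ps() }; }
static inline vfloat8 vf8_load(const float *p)      { return (vfloat8){ _mm256_loadu_ps(p) }; }
static inline vfloat8 vf8_add(vfloat8 a, vfloat8 b) { return (vfloat8){ _mm256_add_ps(a.v, b.v) }; }
static inline float vf8_reduce(vfloat8 a) {
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(a.v), _mm256_extractf128_ps(a.v, 1));
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
    return _mm_cvtss_f32(s);
}
#elif defined(__ARM_NEON)
#include <arm_neon.h>
typedef struct { float32x4_t lo, hi; } vfloat8;
static inline vfloat8 vf8_zero(void)                { return (vfloat8){ vdupq_n_f32(0.f), vdupq_n_f32(0.f) }; }
static inline vfloat8 vf8_load(const float *p)      { return (vfloat8){ vld1q_f32(p), vld1q_f32(p + 4) }; }
static inline vfloat8 vf8_add(vfloat8 a, vfloat8 b) { return (vfloat8){ vaddq_f32(a.lo, b.lo), vaddq_f32(a.hi, b.hi) }; }
static inline float vf8_reduce(vfloat8 a) {
    // Same pairing and same final add order as the AVX path above.
    float32x4_t s4 = vaddq_f32(a.lo, a.hi);
    float32x2_t s2 = vadd_f32(vget_low_f32(s4), vget_high_f32(s4));
    return vget_lane_f32(s2, 0) + vget_lane_f32(s2, 1);
}
#endif
// The original loop then looks the same on both targets:
//   vfloat8 acc = vf8_zero();
//   for (int i = 0; i < N/8; i++) acc = vf8_add(acc, vf8_load(&x[8*i]));
//   float sum = vf8_reduce(acc);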