Depends on what you're doing. The main issue here is reductions / accumulations.
That is, if you have a bunch of floats like:
float sum = 0.f;
for (int i = 0; i < N; i++) {
    sum += x[i];
}
and you vectorize it to something like (typing in this comment, errors are likely):
__m256 sum_8wide = _mm256_setzero_ps();
for (int i = 0; i < N/8; i++) {
    sum_8wide = _mm256_add_ps(sum_8wide, _mm256_loadu_ps(&x[8*i]));
}
// Now sum up the 8 values to get the final sum
float sum = _mm256_hadd_ps(...);
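For what it's worth, one way to spell out that final reduction is below (just a sketch; the exact shuffle sequence is a matter of taste, but whichever one you pick defines the order you then have to match on the other target):
// One possible horizontal reduction (needs <immintrin.h>): fold the high
// 128-bit half onto the low half, then reduce the remaining 4 lanes pairwise.
__m128 lo4  = _mm256_castps256_ps128(sum_8wide);        // lanes 0..3
__m128 hi4  = _mm256_extractf128_ps(sum_8wide, 1);      // lanes 4..7
__m128 sum4 = _mm_add_ps(lo4, hi4);                     // (0+4, 1+5, 2+6, 3+7)
sum4 = _mm_add_ps(sum4, _mm_movehl_ps(sum4, sum4));     // ((0+4)+(2+6), (1+5)+(3+7), ...)
sum4 = _mm_add_ss(sum4, _mm_shuffle_ps(sum4, sum4, 1)); // add lane 1 into lane 0
float sum = _mm_cvtss_f32(sum4);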
Doing the accumulation 8-wide like that gives a different result than doing it 4-wide and then reducing, because floating-point addition isn't associative. The usual solution is either to use the lowest common denominator (e.g., 4-wide SSE instead of 8-wide AVX), or, more performance oriented, to use the 4-wide SIMD units on ARM to "emulate" an 8-wide virtual vector (~15 years since I wrote NEON... and again, this is in a comment):
float32x4_t sum_lo = vdupq_n_f32(0.f);
float32x4_t sum_hi = vdupq_n_f32(0.f);
for (int i = 0; i < N/8; i++) {
    sum_lo = vaddq_f32(sum_lo, vld1q_f32(&x[8*i]));
    sum_hi = vaddq_f32(sum_hi, vld1q_f32(&x[8*i + 4]));
}
// Reduce the two halves in the same order as the 8-wide version
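The final step then mirrors whatever order the x86 reduction used, e.g. matching the sketch above (again, just one way to write it, needs <arm_neon.h>):
float32x4_t sum4 = vaddq_f32(sum_lo, sum_hi);                          // same pairing as the lo/hi fold: (0+4, 1+5, 2+6, 3+7)
float32x2_t sum2 = vadd_f32(vget_low_f32(sum4), vget_high_f32(sum4));  // ((0+4)+(2+6), (1+5)+(3+7))
float sum = vget_lane_f32(sum2, 0) + vget_lane_f32(sum2, 1);           // same final add order as the AVX version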
You would want to write a "virtual SIMD" wrapper library so you don't have to do this by hand in lots of places.
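A hypothetical sketch of what that wrapper could look like (all names here are invented for illustration): vfloat8 is one __m256 on x86 and a pair of float32x4_t on NEON, and the reduction order is pinned down in exactly one place:
#if defined(__AVX__)
#include <immintrin.h>
typedef struct { __m256 v; } vfloat8;
static inline vfloat8 vf8_zero(void)                { return (vfloat8){ _mm256_setzero_ps() }; }
static inline vfloat8 vf8_load(const float *p)      { return (vfloat8){ _mm256_loadu_ps(p) }; }
static inline vfloat8 vf8_add(vfloat8 a, vfloat8 b) { return (vfloat8){ _mm256_add_ps(a.v, b.v) }; }
static inline float vf8_reduce(vfloat8 a) {
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(a.v), _mm256_extractf128_ps(a.v, 1));
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
    return _mm_cvtss_f32(s);
}
#elif defined(__ARM_NEON)
#include <arm_neon.h>
typedef struct { float32x4_t lo, hi; } vfloat8;
static inline vfloat8 vf8_zero(void)                { return (vfloat8){ vdupq_n_f32(0.f), vdupq_n_f32(0.f) }; }
static inline vfloat8 vf8_load(const float *p)      { return (vfloat8){ vld1q_f32(p), vld1q_f32(p + 4) }; }
static inline vfloat8 vf8_add(vfloat8 a, vfloat8 b) { return (vfloat8){ vaddq_f32(a.lo, b.lo), vaddq_f32(a.hi, b.hi) }; }
static inline float vf8_reduce(vfloat8 a) {
    // Same pairing and same final add order as the AVX path above.
    float32x4_t s4 = vaddq_f32(a.lo, a.hi);
    float32x2_t s2 = vadd_f32(vget_low_f32(s4), vget_high_f32(s4));
    return vget_lane_f32(s2, 0) + vget_lane_f32(s2, 1);
}
#endif
// The original loop then looks the same on both targets:
//   vfloat8 acc = vf8_zero();
//   for (int i = 0; i < N/8; i++) acc = vf8_add(acc, vf8_load(&x[8*i]));
//   float sum = vf8_reduce(acc);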