Why is float tensor addition on CPU slower for AVX2 than the default ATen CPU capability?

With ATEN_CPU_CAPABILITY=avx2, adding two float tensors is slower than with ATEN_CPU_CAPABILITY=default when the thread count is high. For example, on a machine with 32 physical cores (64 logical cores), using 16 threads, ATEN_CPU_CAPABILITY=avx2 is slower than ATEN_CPU_CAPABILITY=default.

With just one thread, adding two float tensors is only around 10% faster with ATEN_CPU_CAPABILITY=avx2.

What’s the reason behind this? Both of them are memory-bound.
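A minimal timing sketch of the comparison above (assuming PyTorch is installed; note that ATEN_CPU_CAPABILITY is read once at startup, so it must be set in the environment before Python launches, e.g. `ATEN_CPU_CAPABILITY=avx2 python bench.py`):

```python
# Hedged sketch: time out-of-place float tensor addition at a fixed thread count.
# Run the same script twice with ATEN_CPU_CAPABILITY=default and =avx2 to compare.
import timeit

import torch

torch.set_num_threads(16)  # thread count used in the comparison above

a = torch.randn(10_000_000)  # ~40 MB each, large enough to be memory-bound
b = torch.randn(10_000_000)

for _ in range(3):  # warm-up to stabilize timings
    a + b

n = 50
t = timeit.timeit(lambda: a + b, number=n)
print(f"threads={torch.get_num_threads()}  per-add: {t / n * 1e3:.2f} ms")
```

Launching the whole process under each environment variable value is the key detail: changing the variable inside an already-running interpreter has no effect on the dispatched kernels.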

Could you create an issue on GitHub for this so that the code owners could have a look, please?

Resolved at pytorch/pytorch issue #60202: "On CPU, vectorized float tensor addition might be slower than unvectorized float tensor addition".

In short, the memory-allocation and zero-filling costs for the output tensor are higher on the AVX2 path, and they dominate the memory-bound addition itself.
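One practical way to amortize that cost (a sketch, assuming PyTorch; not the fix from the issue itself) is to write repeated additions into a preallocated output tensor via the `out=` argument, so each call skips the allocation:

```python
# Sketch: reuse a preallocated output tensor so repeated additions
# avoid the per-call output allocation identified as costly above.
import torch

a = torch.randn(1_000_000)
b = torch.randn(1_000_000)
out = torch.empty_like(a)  # allocated once, reused on every call

torch.add(a, b, out=out)   # writes the sum into `out`, no new allocation
assert torch.allclose(out, a + b)
```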
