With ATEN_CPU_CAPABILITY=avx2, adding two float tensors is slower than with ATEN_CPU_CAPABILITY=default, assuming a high number of threads.
E.g., on a machine with 32 physical cores and 64 logical cores, with 16 threads, ATEN_CPU_CAPABILITY=avx2 is slower than ATEN_CPU_CAPABILITY=default.
With just one thread, adding two float tensors is only around 10% faster with ATEN_CPU_CAPABILITY=avx2.
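For anyone who wants to reproduce the comparison, here is a rough sketch of a benchmark. It is not the exact script behind the numbers above. It runs each configuration in a fresh Python process, since ATEN_CPU_CAPABILITY has to be set before PyTorch selects its CPU kernels. The helper names (`make_env`, `time_add`, `main`), the tensor size, the warm-up count, and the iteration count are all assumptions chosen for illustration.

```python
import os
import subprocess
import sys

# Child script, run in a fresh interpreter so that ATEN_CPU_CAPABILITY is
# already in the environment when PyTorch is imported.
CHILD = r"""
import sys, time, torch
threads, numel, iters = (int(x) for x in sys.argv[1:4])
torch.set_num_threads(threads)
a = torch.randn(numel)
b = torch.randn(numel)
out = torch.empty_like(a)
for _ in range(10):                 # warm-up, not timed
    torch.add(a, b, out=out)
start = time.perf_counter()
for _ in range(iters):
    torch.add(a, b, out=out)
print((time.perf_counter() - start) / iters)
"""


def make_env(capability):
    """Copy the current environment and set ATEN_CPU_CAPABILITY in the copy."""
    env = dict(os.environ)
    env["ATEN_CPU_CAPABILITY"] = capability
    return env


def time_add(capability, threads, numel=1 << 26, iters=50):
    """Return the mean seconds per float32 add for one configuration.

    numel and iters are assumed values; 1 << 26 floats is 256 MiB per tensor,
    which should not fit in cache on most machines.
    """
    result = subprocess.run(
        [sys.executable, "-c", CHILD, str(threads), str(numel), str(iters)],
        env=make_env(capability),
        capture_output=True,
        text=True,
        check=True,
    )
    return float(result.stdout.strip().splitlines()[-1])


def main():
    # Thread counts taken from the post: 1 and 16.
    for threads in (1, 16):
        for capability in ("default", "avx2"):
            ms = time_add(capability, threads) * 1e3
            print(f"threads={threads:2d} {capability:8s} {ms:.3f} ms per add")
```

Calling `main()` prints one timing per (thread count, capability) pair. The timings are likely to vary between runs and with thread placement, so it may be worth repeating the measurement a few times before comparing.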
What’s the reason behind this? Both of them are memory-bound.
Thanks!