With ATEN_CPU_CAPABILITY=avx2, adding two float tensors is slower than with
ATEN_CPU_CAPABILITY=default when the number of threads is high.
E.g., on a machine with 32 physical cores and 64 logical cores, with 16 threads,
ATEN_CPU_CAPABILITY=avx2 is slower than ATEN_CPU_CAPABILITY=default.
For just one thread, adding two float tensors is only around 10% faster with
ATEN_CPU_CAPABILITY=avx2.
What’s the reason behind this? Both of them are memory-bound.
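
For anyone who wants to reproduce the comparison, here is a minimal sketch of how it could be timed. It is not the exact script behind the numbers above. The helper names (`child_env`, `time_add`, `run_comparison`), the tensor size, and the iteration counts are my own assumptions. Each measurement runs in a fresh subprocess because ATEN_CPU_CAPABILITY has to be set before torch is imported.

```python
import os
import subprocess
import sys

# Child script (hypothetical): times torch.add on two float32 tensors.
# It runs in its own process so ATEN_CPU_CAPABILITY is read at import time.
CHILD = r"""
import sys
import time
import torch

threads, n, iters = int(sys.argv[1]), int(sys.argv[2]), int(sys.argv[3])
torch.set_num_threads(threads)

a = torch.rand(n)          # float32 by default
b = torch.rand(n)
out = torch.empty(n)

for _ in range(5):         # warm-up iterations, not timed
    torch.add(a, b, out=out)

start = time.perf_counter()
for _ in range(iters):
    torch.add(a, b, out=out)
print((time.perf_counter() - start) / iters)
"""


def child_env(capability):
    """Return a copy of the current environment with ATEN_CPU_CAPABILITY set."""
    env = dict(os.environ)
    env["ATEN_CPU_CAPABILITY"] = capability
    return env


def time_add(capability, threads, n=1 << 25, iters=50):
    """Average seconds per torch.add call for one capability/thread setting."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, str(threads), str(n), str(iters)],
        env=child_env(capability),
        capture_output=True,
        text=True,
        check=True,
    )
    return float(result.stdout.strip())


def run_comparison(thread_counts=(1, 16), capabilities=("default", "avx2")):
    """Print the average time per add for every thread/capability pair."""
    for threads in thread_counts:
        for capability in capabilities:
            seconds = time_add(capability, threads)
            print(f"threads={threads:2d} {capability:8s} {seconds * 1e3:.3f} ms/add")
```

Calling `run_comparison()` prints one line per setting. The default size (2^25 floats, 128 MiB per tensor) is chosen to be far larger than typical caches so that the add stays memory-bound. Thread pinning and NUMA placement are not controlled here and may affect the results.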