Performance regression in advanced indexing

Hello people, thanks for the amazing library. I am working on a simulation package that uses PyTorch mostly as a NumPy replacement, so not an ML use case. We found performance regressions in advanced indexing when going from 1.10.1 to 2.0.1. The torch versions were pip-installed. We tested on x86_64 and ARM.

The Python benchmark is attached. Each test was run more than 10k times (after a warmup of 10k iterations), and the minimum from each run was selected. All tests are basically indexing operations into arrays, for reading or writing. For comparison we also ran the same operations with NumPy. The tensor size is (1000,), and a random boolean mask with 500 true entries was generated once and reused for all tests.
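The attached benchmark is not shown here, but the setup described above can be sketched roughly as follows (names and timing parameters are illustrative, not taken from the actual benchmark): a (1000,) array, a boolean mask with exactly 500 true entries, and a masked read timed in both torch and NumPy.

```python
import timeit

import numpy as np
import torch

# Sketch of the setup described above (the attached benchmark may differ):
# a (1000,) tensor and a random boolean mask with exactly 500 True entries.
rng = np.random.default_rng(0)
mask_np = np.zeros(1000, dtype=bool)
mask_np[rng.choice(1000, size=500, replace=False)] = True

x_np = rng.standard_normal(1000)
x_t = torch.from_numpy(x_np)
mask_t = torch.from_numpy(mask_np)

# Masked read (advanced indexing); take the minimum over several repeats,
# mirroring the min-based reporting described above.
t_torch = min(timeit.repeat(lambda: x_t[mask_t], number=1000, repeat=5))
t_numpy = min(timeit.repeat(lambda: x_np[mask_np], number=1000, repeat=5))
print(f"torch read: {t_torch:.6f}s  numpy read: {t_numpy:.6f}s")
```

A masked write (e.g. `x_t[mask_t] = 0.0`) can be timed the same way to cover the write path.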

pytest --randomly-seed=1 --benchmark-verbose --benchmark-only --benchmark-disable-gc --benchmark-warmup=on --benchmark-warmup-iterations=10000 -m torchperformance


How does the mean look after performing proper warmup iterations?

Using the average instead of the min, we got this result. The warmup was 10k iterations. Do you have any experience with how much warmup is sufficient?
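For reference, both statistics can be collected in one run by timing each iteration individually after a warmup phase. This is a minimal sketch (the function name, warmup count, and iteration count are illustrative, not from the attached benchmark):

```python
import time

import torch


def bench(fn, warmup=10_000, iters=10_000):
    """Run fn `warmup` times untimed, then time each of `iters` calls
    individually so both the min and the mean can be reported."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times), sum(times) / len(times)


x = torch.randn(1000)
mask = torch.rand(1000) < 0.5

t_min, t_mean = bench(lambda: x[mask], warmup=1000, iters=1000)
print(f"min={t_min * 1e6:.2f}us  mean={t_mean * 1e6:.2f}us")
```

The min filters out scheduler and allocator noise, while the mean is more sensitive to it, so the two can diverge noticeably for microsecond-scale operations like this one.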