Average tensor unbind times vary drastically depending on implementation of unrelated code flows

I am running into a strange issue in my code base. The code is complicated, but essentially I noticed that calling unbind(0) on tensors took longer in one commit than in another, and it was noticeably slowing down performance. The old commit has an average per-call runtime of 8.727e-5, while the new commit has an average per-call runtime of 2.36e-4.

I was playing around with unbinding and I noticed that something seems to be getting cached… subsequent calls to unbind(), even on different tensors, seem to be consistently faster than the first call, no matter which tensor comes first.
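
Roughly, this is the kind of naive host-side timing I was doing (the shapes here are made up for illustration, my real tensors are larger):

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(64, 128, 128, device=device)
y = torch.randn(64, 128, 128, device=device)

# time unbind(0) on each tensor in turn; the first call measured
# is consistently slower than the later ones, regardless of order
for t in (x, y):
    start = time.perf_counter()
    parts = t.unbind(0)
    print(time.perf_counter() - start)
```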

The new code could potentially be interfering with whatever cache exists and making the unbind call slower, but I am not sure how to work around this. Is it possible to preserve the cache somehow? Are there any devs who might have insight into what’s going on here?

I don’t know how you are profiling your code or whether you are using a GPU, but note that CUDA operations are executed asynchronously. If you are using host timers, you need to synchronize before starting and stopping them.
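
A minimal sketch of what I mean, assuming you are timing a CUDA workload with a host-side timer such as time.perf_counter:

```python
import time
import torch

device = "cuda"
x = torch.randn(1024, 1024, device=device)

# some asynchronous GPU work queued before the measurement
y = x @ x

torch.cuda.synchronize()  # wait for all previously queued kernels to finish
start = time.perf_counter()
parts = y.unbind(0)
torch.cuda.synchronize()  # make sure the timed work has actually completed
elapsed = time.perf_counter() - start
print(f"unbind took {elapsed:.3e} s")
```

Without the first synchronize, the timed region also pays for whatever GPU work was still pending, which can make unrelated code changes look like they slow down unbind.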

Try using the PyTorch profiler?
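
Something like this (untested sketch, adjust the activities and workload to your setup):

```python
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(64, 128, 128, device=device)

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(10):
        parts = x.unbind(0)

# shows where the time is actually spent instead of relying on host timers
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```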

Yep, it seems like the issue is that unbind forces a sync, and because my code in the newer commit is faster (so more asynchronous work is still queued), the forced sync ends up waiting longer. Thanks. I have tried avoiding this where possible (although it is worth noting that calling unbind(0) is still faster than creating a view for each index of the tensor’s first dimension, at least for my use case).
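
For reference, the comparison I mean is roughly this (shapes again only illustrative):

```python
import torch

x = torch.randn(64, 128, 128)

# single call, returns a tuple of views along dim 0
parts_unbind = x.unbind(0)

# one view per index of the first dimension
parts_index = [x[i] for i in range(x.shape[0])]

# both approaches produce the same slices
assert all(torch.equal(a, b) for a, b in zip(parts_unbind, parts_index))
```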