Why does indexing a tensor on GPU involve data transactions to CPU?

ptrblck · February 12, 2021, 11:26am

I agree with @googlebot that capturing the profiling information would be confusing and bad.

Note that your current code snippet indexes a with a BoolTensor, which yields a variable-sized output tensor (your example uses torch.ones_like, so every value is returned).
Mask indexing calls into nonzero, which needs to synchronize, as seen here: the number of selected elements, and therefore the output shape, is only known on the host once the GPU has computed it. Besides that, the origin of the to() op can be found with a profiler, as already explained.
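To make the variable-sized output concrete, here is a minimal sketch. The tensor values and variable names are made up for illustration and are not your original snippet. It falls back to the CPU if no GPU is available, in which case there is of course nothing to synchronize. The torch.where line is only one possible fixed-shape alternative and may not fit your use case.

```python
import torch

# Illustrative setup (not the original snippet); falls back to CPU without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

a = torch.arange(6, dtype=torch.float32, device=device)
mask = a > 2.0  # BoolTensor; how many entries are True depends on the data

# Boolean-mask indexing: the output length equals the number of True entries,
# so the result cannot be allocated before that count is known on the host.
out_mask = a[mask]

# Same selection written out explicitly via nonzero.
idx = mask.nonzero(as_tuple=True)
out_idx = a[idx]

# Fixed-shape alternative: the output always has the shape of a,
# with unselected entries replaced by zero.
out_where = torch.where(mask, a, torch.zeros_like(a))

print(out_mask.shape, out_idx.shape, out_where.shape)
```

If the downstream code can work with a masked tensor of fixed shape, the torch.where variant avoids the data-dependent output size altogether.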
If you are using Nsight Systems, you could have a look at this post to see how to enable backtraces.
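If you would rather stay inside PyTorch, the built-in torch.profiler can also record Python stacks. The following is only a rough sketch with a made-up tensor and mask, not the Nsight Systems workflow from the linked post. Which ops show up in the output depends on your PyTorch version and device.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Illustrative setup; falls back to CPU if no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

a = torch.randn(1024, device=device)
mask = a > 0

# with_stack=True records the Python source location of each op,
# which helps to trace an op back to the line that triggered it.
with profile(activities=activities, with_stack=True) as prof:
    out = a[mask]

# Names of the ops that were recorded while indexing.
op_names = {evt.key for evt in prof.key_averages()}
print(sorted(op_names))

# Group by the top stack frames to see where each op was called from.
table = prof.key_averages(group_by_stack_n=5).table(
    sort_by="self_cpu_time_total", row_limit=10
)
print(table)
```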