Implicit copy to/from GPU

Hi,

I am using topk on a CUDA tensor and it is quite slow, so I suspect that behind the scenes it copies the tensor to the CPU, performs the topk there, and then returns the result to the GPU.

How can I confirm whether this is what is happening?
More generally, is there any indication in the documentation as to which PyTorch functions move data back and forth between GPU and CPU, so that I can avoid using them?

Thanks,
Moshe

Hi,

No operation in PyTorch will implicitly move data between the CPU and the GPU.
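
You can verify this yourself with the autograd profiler: if topk were doing a round trip, you would see memcpy / CPU-side entries in the trace next to the topk call. A minimal sketch (the tensor shape and k are made up, not taken from your code):

```python
import torch

x = torch.randn(1024, 50000, device="cuda")  # hypothetical shape

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    values, indices = torch.topk(x, k=100, dim=1)

# A device-to-host copy would show up as a memcpy / "to" entry in this table;
# if only the topk kernel appears, there was no implicit CPU round trip.
print(prof.key_averages().table(sort_by="cuda_time_total"))
```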

Why do you think topk is slower than it should be?
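
Also note that CUDA kernels are launched asynchronously, so naive timing can attribute the cost of previously queued work to whatever op you happen to time next. A rough timing sketch with explicit synchronization (shapes and k are placeholders):

```python
import time
import torch

x = torch.randn(1024, 50000, device="cuda")  # hypothetical shape

torch.topk(x, k=100, dim=1)   # warm-up so one-time CUDA setup is not counted
torch.cuda.synchronize()

start = time.time()
for _ in range(100):
    torch.topk(x, k=100, dim=1)
torch.cuda.synchronize()       # wait for all queued kernels before reading the clock
print((time.time() - start) / 100)
```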

Hi,

I added 3 operations to my code (max/min, topk, le/gt). Besides topk, the other two are also used in other places in my code, but once I added the topk call, training became ~4x slower.

If topk does not implicitly move data to the CPU, how can I speed it up? Is there any way to check the implementation of topk in PyTorch to see how it distributes the work between threads?

Thanks,
Moshe

Sure, here is the entry point for the C call that launches the kernel. And here is the kernel implementation.

What is the size of the inputs and the exact parameters? Also, are you sure that returning k elements instead of 1 (as you would with max) doesn't change the amount of work done by the rest of your code?
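
To isolate topk itself, a side-by-side micro-benchmark of max and topk on the same tensor (with proper synchronization) would show whether the op or the surrounding code is responsible for the slowdown. A sketch with placeholder sizes, to be replaced with your real shape and k:

```python
import time
import torch

def bench(fn, iters=100):
    fn()                       # warm-up
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()   # flush the stream before stopping the clock
    return (time.time() - start) / iters

x = torch.randn(1024, 50000, device="cuda")  # hypothetical shape

print("max :", bench(lambda: torch.max(x, dim=1)))
print("topk:", bench(lambda: torch.topk(x, k=100, dim=1)))
```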