I’ve recently been studying a very simple case where I index a CUDA tensor on the GPU. As far as I know, the indexing operation is adapted for GPU execution, with potential speedups compared to the CPU.
In the small example below, I access elements of tensor a according to the mask tensor b. Since both the indexed tensor and the mask are on the GPU, I would expect no CPU involvement at all after the initial setup. I examine the behavior of the code with a profiler:
import torch
import torch.autograd.profiler as profiler

a = torch.rand(10000).to('cuda:0')
b = torch.ones_like(a).bool()

with profiler.profile() as prof:
    with profiler.record_function("CHAIN_FORWARD"):
        for i in range(10):
            a[b]

print(prof.key_averages().table(sort_by='self_cpu_time_total', row_limit=5))
Here is the output of the profiling:
As you can see, the to() operation, which transfers data between devices, has been called 60 times. How come? I am using PyTorch version 1.5.0.
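For reference, here is a minimal sketch I used to sanity-check what the mask indexing does. My assumption (not confirmed by the profiler table alone) is that a[b] internally goes through a nonzero()-style step, whose output size depends on the tensor's values, which is the kind of operation that can force host/device interaction. The device fallback is just so the snippet also runs on a CPU-only machine:

```python
import torch

# Assumption: boolean mask indexing is equivalent to gathering at the
# indices where the mask is True; nonzero() makes that step explicit.
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

a = torch.rand(10000, device=device)
b = torch.ones_like(a).bool()

# Explicit formulation of a[b]: the size of idx depends on the data,
# so it cannot be known without inspecting the mask's contents.
idx = b.nonzero(as_tuple=True)
out = a[idx]

print(out.shape)  # torch.Size([10000]), since every mask entry is True
```

Both formulations produce the same result, so if the explicit version also shows to() calls in the profiler, that would suggest the transfers come from this data-dependent step rather than from the indexing kernel itself.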