How do I know randperm is performed on GPU

Hi, I am wondering how PyTorch ensures that ops like arange or randperm, which create tensors from scalar arguments, run on the GPU. These factory ops have no context to infer a device from: for most ops the input tensors provide the device context, and for parametric ops the parameters do, but factory ops take neither.
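To illustrate the contrast (a minimal sketch, assuming a CUDA device is available): ops with tensor inputs inherit their device, while factory ops fall back to the CPU unless given an explicit device argument.

import torch

a = torch.ones(4, device="cuda")
b = a + 1                               # input tensor provides the device context: cuda
c = torch.randperm(4)                   # factory op, no context: defaults to cpu
d = torch.randperm(4, device=a.device)  # explicit device argument: cuda
print(b.device, c.device, d.device)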

I’ve run the following and it prints cpu; I assume that is because the output tensor is placed on the CPU by default. What I want to know is whether the generation process itself is conducted on the GPU. Ops like these also seem to be exported directly from C++.

import numpy as np
import torch

class _test_gpu_(torch.nn.Module):
    def __init__(self):
        super(_test_gpu_, self).__init__()
        self.f = torch.nn.Linear(4, 4)
        new_weight = torch.Tensor(np.ones([4, 4]))
        self.f.weight = torch.nn.Parameter(new_weight)

    def forward(self, x):
        ind = torch.randperm(4)  # no device argument given
        print(ind.device)        # prints "cpu" even though the module is on the GPU
        y = self.f(x)
        return y[ind]

t = _test_gpu_().cuda()
x = torch.Tensor(np.ones([1, 4])).cuda()
res = t(x)

You can specify the desired device in torch.randperm and check the used CUDA kernel, e.g. via:

nsys nvprof python -c "import torch; torch.randperm(10, device='cuda')"

Output:

CUDA Kernel Statistics:

 Time(%)  Total Time (ns)  Instances  Average  Minimum  Maximum                                                  Name                                                
 -------  ---------------  ---------  -------  -------  -------  ----------------------------------------------------------------------------------------------------
    42.1            7,425          1  7,425.0    7,425    7,425  void at::cuda::detail::cub::DeviceRadixSortSingleTileKernel<at::cuda::detail::cub::DeviceRadixSortP…
    16.3            2,880          1  2,880.0    2,880    2,880  void at::native::(anonymous namespace)::distribution_elementwise_grid_stride_kernel<unsigned int, 4…
    14.3            2,528          1  2,528.0    2,528    2,528  void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<long>, at::detail::Array<…
    13.8            2,432          1  2,432.0    2,432    2,432  void (anonymous namespace)::elementwise_kernel_with_index<int, at::native::arange_cuda_out(c10::Sca…
    13.4            2,368          1  2,368.0    2,368    2,368  void (anonymous namespace)::randperm_handle_duplicate_keys_kernel<int, at::native::(anonymous names…
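If Nsight Systems isn't available, torch.profiler can serve as a rough cross-check from within Python; a minimal sketch (kernel names and timings will vary by PyTorch version and GPU):

import torch
from torch.profiler import profile, ProfilerActivity

# Record CUDA activity while creating the permutation directly on the GPU.
with profile(activities=[ProfilerActivity.CUDA]) as prof:
    ind = torch.randperm(10, device="cuda")

# Any CUDA kernels listed here confirm the generation ran on the GPU.
print(prof.key_averages().table(sort_by="cuda_time_total"))

Inside a module's forward, passing device=x.device to torch.randperm keeps the permutation on the same device as the input.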