Hi, I am wondering how PyTorch ensures that ops like arange or randperm, which create tensors from scalar arguments, run on the GPU. Unlike ops that take input tensors, or parametric ops whose parameters carry a device, they have no tensor from which to infer a device context.
I ran the following and the printed device is CPU. I assume that is because the output tensor is transferred to the CPU by default. I want to know whether the generation itself is performed on the GPU. It also seems that ops like these are exported directly from C++.
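import numpy as np
import torch

class _test_gpu_(torch.nn.Module):
    def __init__(self):
        super(_test_gpu_, self).__init__()
        self.f = torch.nn.Linear(4, 4)
        # Replace the default weight with an all-ones 4x4 matrix
        new_weight = torch.Tensor(np.ones([4, 4]))
        self.f.weight = torch.nn.Parameter(new_weight)

    def forward(self, x):
        # No device argument is given here
        ind = torch.randperm(4)
        print(ind.device)
        y = self.f(x)
        # y has shape [1, 4], so permute along the feature dimension
        return y[:, ind]

t = _test_gpu_().cuda()
x = torch.Tensor(np.ones([1, 4])).cuda()
res = t(x)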
You can specify the desired device in torch.randperm via its device argument and check which CUDA kernels are used, e.g. via:
nsys nvprof python -c "import torch; torch.randperm(10, device='cuda')"
Output:
CUDA Kernel Statistics:

 Time(%)  Total Time (ns)  Instances  Average  Minimum  Maximum  Name
 -------  ---------------  ---------  -------  -------  -------  ----------------------------------------------------------------------------------------------------
    42.1            7,425          1  7,425.0    7,425    7,425  void at::cuda::detail::cub::DeviceRadixSortSingleTileKernel<at::cuda::detail::cub::DeviceRadixSortP…
    16.3            2,880          1  2,880.0    2,880    2,880  void at::native::(anonymous namespace)::distribution_elementwise_grid_stride_kernel<unsigned int, 4…
    14.3            2,528          1  2,528.0    2,528    2,528  void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<long>, at::detail::Array<…
    13.8            2,432          1  2,432.0    2,432    2,432  void (anonymous namespace)::elementwise_kernel_with_index<int, at::native::arange_cuda_out(c10::Sca…
    13.4            2,368          1  2,368.0    2,368    2,368  void (anonymous namespace)::randperm_handle_duplicate_keys_kernel<int, at::native::(anonymous names…
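Applied to a module like the one in the question, passing the device explicitly could look roughly like the sketch below. The class name ShuffledLinear is made up for illustration, and the CPU fallback is only there so the snippet also runs on a machine without a GPU. The device is taken from the input tensor, so the index tensor is created wherever x lives.

```python
import torch


class ShuffledLinear(torch.nn.Module):  # hypothetical name, for illustration only
    def __init__(self):
        super().__init__()
        self.f = torch.nn.Linear(4, 4)

    def forward(self, x):
        # Factory ops have no input tensor to infer a device from,
        # so pass the device explicitly; here it is copied from x.
        ind = torch.randperm(4, device=x.device)
        y = self.f(x)
        # Permute the feature dimension of the [1, 4] output
        return y[:, ind], ind


# Fall back to the CPU when no GPU is available (assumption for portability)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ShuffledLinear().to(device)
x = torch.ones(1, 4, device=device)
out, ind = model(x)

# arange accepts the same device argument
steps = torch.arange(4, device=device)
print(ind.device, steps.device, out.device)
```

With device=x.device the module keeps working after .cuda() or .to(...) without hard-coding 'cuda' anywhere.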