You can specify the desired device in torch.randperm and check which CUDA kernels are launched, e.g. via:
nsys nvprof python -c "import torch; torch.randperm(10, device='cuda')"
Output:
CUDA Kernel Statistics:
Time(%) Total Time (ns) Instances Average Minimum Maximum Name
------- --------------- --------- ------- ------- ------- ----------------------------------------------------------------------------------------------------
42.1 7,425 1 7,425.0 7,425 7,425 void at::cuda::detail::cub::DeviceRadixSortSingleTileKernel<at::cuda::detail::cub::DeviceRadixSortP…
16.3 2,880 1 2,880.0 2,880 2,880 void at::native::(anonymous namespace)::distribution_elementwise_grid_stride_kernel<unsigned int, 4…
14.3 2,528 1 2,528.0 2,528 2,528 void at::native::vectorized_elementwise_kernel<4, at::native::FillFunctor<long>, at::detail::Array<…
13.8 2,432 1 2,432.0 2,432 2,432 void (anonymous namespace)::elementwise_kernel_with_index<int, at::native::arange_cuda_out(c10::Sca…
13.4 2,368 1 2,368.0 2,368 2,368 void (anonymous namespace)::randperm_handle_duplicate_keys_kernel<int, at::native::(anonymous names…
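As a minimal sketch of the device argument itself (assuming a CUDA-capable PyTorch build; on a CPU-only build the device='cuda' call would raise an error):

import torch

# Generate the permutation directly on the desired device.
perm_cpu = torch.randperm(10)                  # computed on the CPU
perm_gpu = torch.randperm(10, device='cuda')   # launches the CUDA kernels shown above
print(perm_gpu.device)                         # e.g. cuda:0

Calling it with device='cuda' avoids a CPU computation followed by a host-to-device copy; the profile above shows the sort-based kernels doing the work on the GPU.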