Custom CUDA operator only works correctly on cuda:0

I’m learning how to write a custom CUDA operator by following your tutorial at GitHub - pytorch/extension-cpp: C++ extensions in PyTorch

However, I found that the operator only outputs correct results on “cuda:0” and outputs wrong results (an all-zeros tensor) on other devices like “cuda:1”.

Is there any way to make the CUDA operator work well on

That sounds surprising; could you share some more details about the setup, e.g., are cuda:0 and cuda:1 identical devices? What happens if cuda:1 is used as cuda:0, e.g., with CUDA_VISIBLE_DEVICES=1?

Interestingly, the two methods of setting the CUDA device lead to different results:
method 1): set CUDA_VISIBLE_DEVICES=1 before running the code x=torch.LongTensor([1,2,3]) ...
method 2): do not set environment variable and run code"cuda:1")

Method 1 runs perfectly, while method 2 fails.

It sounds as if your custom extension might be missing the device guard?

Thanks for your suggestion! Could you please provide more details about how to add the device guard? Since I did not find anything related to this method in the official tutorial, I think it would be very helpful to point this out in a new version of the tutorial :slight_smile:

In your custom extension you would have to add:

#include <c10/cuda/CUDAGuard.h>

const at::cuda::OptionalCUDAGuard device_guard(device_of(local_tensor));
// your code
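
To show where the guard might go in a full extension, here is a minimal sketch. The kernel `add_one_kernel`, the wrapper `add_one_cuda`, and the Python-facing name `add_one` are hypothetical names invented for illustration and are not part of the tutorial. A kernel launch goes to the current CUDA device, which defaults to device 0. Without the guard, a tensor living on cuda:1 would therefore have its kernel launched on cuda:0, which is a likely explanation for the wrong output. The guard makes the input tensor's device current for the rest of the scope and restores the previous device when the scope ends.

```cuda
#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>   // at::cuda::getCurrentCUDAStream
#include <c10/cuda/CUDAException.h>  // C10_CUDA_KERNEL_LAUNCH_CHECK
#include <c10/cuda/CUDAGuard.h>      // at::cuda::OptionalCUDAGuard

// Hypothetical example kernel: writes in[i] + 1 to out[i].
__global__ void add_one_kernel(const float* __restrict__ in,
                               float* __restrict__ out,
                               int64_t n) {
  const int64_t i = blockIdx.x * static_cast<int64_t>(blockDim.x) + threadIdx.x;
  if (i < n) {
    out[i] = in[i] + 1.0f;
  }
}

// Hypothetical wrapper that launches the kernel on the input's device.
torch::Tensor add_one_cuda(const torch::Tensor& input) {
  TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
  TORCH_CHECK(input.scalar_type() == torch::kFloat32, "input must be float32");

  // Make the input's device the current device for the rest of this scope.
  // The previous device is restored when device_guard is destroyed.
  const at::cuda::OptionalCUDAGuard device_guard(at::device_of(input));

  auto in = input.contiguous();
  auto out = torch::empty_like(in);
  const int64_t n = in.numel();
  if (n == 0) {
    return out;  // nothing to launch for an empty tensor
  }

  const int threads = 256;
  const int blocks = static_cast<int>((n + threads - 1) / threads);

  // Launch on the current stream of the (now current) device.
  add_one_kernel<<<blocks, threads, 0, at::cuda::getCurrentCUDAStream()>>>(
      in.data_ptr<float>(), out.data_ptr<float>(), n);
  C10_CUDA_KERNEL_LAUNCH_CHECK();
  return out;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  m.def("add_one", &add_one_cuda, "Add one to every element (CUDA)");
}
```

The guard is created before any allocation or kernel launch, so everything after it in the function runs against the input's device. Passing the current stream explicitly is a design choice in this sketch, so the kernel is ordered with other PyTorch work on that device.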

It works! Thank you so much!